Full Time Job

Site Reliability Engineer

Electronic Arts

Vancouver, BC 01-21-2021
  • Paid
  • Full Time
  • Mid (2-5 years) Experience
Job Description

We Are EA

QVS Proton Team

The Proton Team is part of the Quality, Verification, and Standards engineering group, responsible for developing and maintaining automation and tools that support Quality Assurance and development processes for EA's games.

We partner directly with the different game studios in EA to help them run reliable infrastructure for their QA platforms. By working with different teams, we're uniquely positioned to learn from each team's experiences and share their ideas and best practices throughout EA's Quality Engineering teams.

We are looking for an experienced DevOps / Site Reliability Engineer to help us implement and promote SRE values and patterns within our team and across the partners we work with.

Reporting to the Proton Engineering Manager, you will work with team members and partners worldwide, contribute to the implementation and maintenance of large-scale projects, troubleshoot and resolve issues, and improve reliability across QA's systems. To achieve this, we require candidates who will focus on communication, uptime, and predictability.

You're someone who can:
• Work with peers and partners across many locations to identify, implement, and support a shared set of goals and applications.
• Directly support and help engineer environments through all stages of development.
• Troubleshoot application, network, and server-level problems.
• Perform Windows and Linux configuration and management.
• Support datastore back-up and recovery.
• Support CI (Jenkins and GitLab CI) and CD operations.
• Perform cause analyses and ensure improvements are identified and prioritized.
• Maintain documentation of systems configurations and procedures, including runbooks.
• Implement monitoring and alerting that identifies service disruptions.
• Design and develop scripting and productivity enhancing tools for automation of system administration tasks.
• Evaluate and adopt technologies which improve the team's efficiency and capabilities.
• Make decisions based on data.
• Document your insights and best practices.

You also bring the following skills or experiences to our team:
• An understanding of what it takes to design and run services at scale, and to deliver those capabilities with infrastructure as code.
• 3+ years of experience in a technical role focused on development or operation of diverse and complex services or legacy systems.
• 4+ years of experience configuring and troubleshooting Windows or Linux environments.
• Working knowledge of the following scripting languages: PowerShell, Python, Shell, Groovy/Jenkinsfile.
• Experience with and working knowledge of CI/CD and pipelines (Jenkins or GitLab CI).
• An interest in the purpose and application of DevOps methodologies and technologies, such as:
  • Docker
  • Kubernetes
  • HashiCorp tools
  • Ansible/Chef
  • Distributed source control such as Git
  • Visualization and alerting systems (such as Grafana, Kibana)

Bonus points for experience with:
• Security
• Authorization/authentication
• Scaling and load testing
• Diverse enterprise IT environments

In a typical week, the Site Reliability Engineer could:
• Work with developers and IT to investigate and troubleshoot performance or functionality issues.
• Implement infrastructure-as-code solutions to satisfy product requirements.
• Contribute to common SRE tooling.
• Provide first-line operations support: monitor support channels, and troubleshoot and correct the issues raised.
• Identify and advocate for architecture and service changes to improve reliability and performance, using a data-driven approach.
• Mentor application developers to become more capable in the SRE space so they can improve the reliability and performance of their applications and infrastructure.
• Experiment with new technologies to solve current challenges.

You'll build relationships and work with:
• Developers and product managers on the teams you work with to promote SRE principles as part of the team's values.
• Other members of the SRE family to build reusable tools and patterns to improve reliability and reduce toil.

Company Profile
Electronic Arts

Electronic Arts Inc. is a global leader in digital interactive entertainment. EA develops and delivers games, content and online services for Internet-connected consoles, mobile devices and personal computers.