Senior Site Reliability Engineer
Sony Interactive Entertainment
The Challenge Ahead
We are a group of Site Reliability Engineers who collaborate with multiple teams to provide online services that enhance the game experience. We support a multi-billion-dollar video game ecosystem and various non-development business units within EA – our portfolio is wide. Our environments are continuously challenged by marketing promotions, game launches, and security threats. We are passionate about automation and ensuring high standards.
Who You Are
• A self-starter with a considerable breadth of technical knowledge and the ability to dig deep when necessary.
• Someone who communicates well with people across dozens of teams and practices.
• An engineer with a passion for excellence, a devotion to automation, and an eye for efficiency.
• A consummate problem solver.
• An engineer with development experience in at least one of these languages: Java, Go, C#, or Python, with strong skills in reading, understanding, and writing code in that language.
Who We Are
We are a multi-discipline team of engineers supporting our live services and the developers who create them. As Site Reliability Engineers, our role covers the entire life cycle of a product, from helping the developers with architecture and delivery to on-call incident response and triage. We focus heavily on automation and continuous integration/delivery, with an emphasis on solving operations issues using software and ensuring that everything we deliver is robust, efficient, and supportable. Our responsibilities include:
• Creating and maintaining monitoring, alerting and dashboarding solutions that improve the visibility into our applications' performance and business metrics.
• Hands-on design, analysis, development and troubleshooting of highly distributed, large-scale production systems spanning on-prem and cloud-based hosting
• Performing root cause analysis and post-mortems with an eye towards future prevention
• Being the escalation path for on-call incident response and triage
• Using automation technologies to ensure repeatability, eliminate toil, reduce mean time to detection and resolution (MTTD and MTTR), and repair services
• Using scale testing to measure, tune and optimize system performance
• Designing and implementing CI/CD pipelines for all that we build
• Preemptively creating stability, security, and performance improvements via metric/monitoring analysis
• Making sure every service has a complete high-availability and disaster recovery story
• Maintaining security standards across everything we support
• Producing documentation, runbooks, and support tooling for online support teams
The systems we support are incredibly diverse, produced by dozens of teams from around the world. Accordingly, the ideal candidate will have a diverse skill set and always be eager to expand it. More importantly, they will be able to apply their conceptual understanding to new technologies and tools rapidly. Being a self-starter and having a personal dedication to continuous learning are key. The following is a representative but non-exhaustive list of the skills we are looking for in a successful candidate.
• Cross-functional knowledge of systems, storage, networking, security, and databases.
• Experience in monitoring infrastructure and application uptime and availability to track SLIs and meet SLOs.
• Systems Administration: a strong understanding of *nix is mandatory. Familiarity with both RHEL and Debian family distros is preferred.
• Understanding of standard networking protocols and components such as HTTP, DNS, ECMP, TCP/IP, UDP, ICMP, the OSI model, subnetting, and load balancing strategies.
• Automation and orchestration skills: Chef, Puppet, Terraform, Packer, Jenkins.
• Experience in languages such as Python, Ruby, Bash, Java, Go, Perl, and C/C++, with strong skills in reading, understanding, and writing code in the same.
• A strong understanding of distributed systems is a must, including the CAP theorem, microservices, Twelve-Factor Apps, and techniques for high availability, service discovery, secret management, etc.
• Virtualization, containerization, and cloud computing: AWS (preferred), GCP, Azure, VMware ecosystem, Kubernetes (preferred), Docker, Vagrant, etc.
Jobcode: Reference SBJ-g4w0xy-52-23-219-12-42 in your application.
Electronic Arts Inc. is a global leader in digital interactive entertainment. EA develops and delivers games, content and online services for Internet-connected consoles, mobile devices and personal computers.