Full Time Job

Principal Site Reliability Engineer

Epic Games

Bellevue, WA 06-09-2021
  • Paid
  • Full Time
Job Description

What makes us Epic?

At the core of Epic's success are talented, passionate people. Epic prides itself on creating a collaborative, welcoming, and creative environment. Whether it's building award-winning games or crafting engine technology that enables others to make visually stunning interactive experiences, we're always innovating.

Epic Games is hiring a Principal Site Reliability Engineer for the Reliability Engineering team focusing on building services and tooling to improve reliability for our platforms, games and online services. This role will focus on helping development teams with operational excellence and service ownership.

As a Site Reliability Engineer, you will tackle problems that impact the reliability of our products as a whole. Part of this role is analyzing gaps or risk areas for our products and working with engineering to determine the best course of action. You will participate in post-incident reviews, readiness programs, and engineering and development efforts. This role generally favors breadth over depth, but it does require depth in building and running reliable systems.

At Epic we embrace a Service Owner (You build it, you run it) mentality. In this role we are stewards for operational excellence and we are service owners for tools, systems and services that we build.
Our team's mission is to keep our games and platform up and running.

Post Incident Review
There is always an interesting form of something not working as we expect. We focus on how we learn from these production surprises and improve our systems and processes to be more reliable over time. We work with a diverse set of development teams to help them understand incidents.

Production, Event and Launch Readiness
We run large scale production events and we work with many teams on readiness and operational excellence. We own the process and review for service and product launches and game events.

Development focused on Reliability
While we help with incidents and readiness, we also engineer tooling, services, and other systems and processes that improve the reliability of our systems.

We do this by...
• Building tooling to make service ownership easier
• Facilitating and following up with learnings from incidents
• Working across the organization to distribute learnings and help teams understand the entire ecosystem
• Deep diving into systems to understand risk and communicate this outward to teams or leadership
• Fixing things that are broken - our landscape is wide and vast
• Connecting the dots between groups for experience or knowledge sharing
• Tracking progress of focus areas over time
• Providing recommendations to teams while also getting our "hands dirty"

What you'll do…
• Write code and develop systems and services that help us with operational excellence; most of our tools will require web interfaces and APIs
• Contribute to services, tools and code across the organization that focus on our team goals
• Help develop best practices across our organization and tools that help us distribute those
• Work with development teams on understanding systems and helping them be successful with service ownership
• Work on cloud based services in AWS

Who you are…
• You have worked cross functionally or across a large number of teams in an organization
• You have experience working with and building reliable services on AWS
• You have a passion for the reliability engineering space

Jobcode: Reference SBJ-d2no82-3-236-98-69-42 in your application.

Company Profile
Epic Games

Founded in 1991, Epic Games is a leading interactive entertainment company and provider of 3D engine technology. Epic operates Fortnite, one of the world’s largest games with over 350 million accounts and 2.5 billion friend connections. Epic also develops Unreal Engine, which powers the world’s leading games and is also adopted across industries such as film and television, architecture, automotive, manufacturing, and simulation.