Job Description
ONLINE INFRASTRUCTURE
What We Do
We enable Epic's online services teams to build, deploy, and manage services that are used by more than half a billion players around the world. Our mission is to provide world-class tools and platforms that improve the experience of our developers and make it easier, faster, and safer to build, operate, and scale their applications. We operate at massive scale as one of the largest cloud computing users in the world.
What You'll Do
Our Observability team is looking for a Lead to manage and grow our small team of observability-focused SREs, who build and operate the infrastructure our teams rely on to keep our platforms, games, and online services running. The team works across all of Epic to implement industry best practices and develop new monitoring capabilities. You'll lead a small team of contractors and FTEs and manage the development of Epic's observability platform. This is a hands-on leadership role in which you'll be directly engaged in the rollout of observability tools. The team is responsible for company-wide metrics, logging, exception handling, and dashboarding solutions. We will also explore tracing solutions in the near future, since tracing is the main gap in our overall platform offering. The team will stay relatively small (4-5 people max) for the foreseeable future, so the right candidate is just as comfortable reviewing Terraform PRs as they are meeting with vendors to discuss long-term strategy.
In this role, you will
• Own the roadmap for observability, working closely with the team to prioritize the right strategic initiatives for Epic
• Help to modernize key portions of our observability infrastructure, building new data processing pipelines for telemetry data as well as writing software to automate processes and generate new insights
• Work with teams across Epic as an observability subject matter expert to provide guidance on observability best practices
• Foster a healthy and collaborative culture both within the team and in interacting with your peers
• Make educated build vs. buy decisions and serve as the main point of contact for observability vendors
What we're looking for
• Experience executing meaningful change in a fast-paced, interrupt-driven environment
• Self-starter who approaches challenges creatively and methodically, seeing them through to final resolution
• Experience managing vendor relationships
• Ability to adapt and be effective in new situations within a highly dynamic environment
• Experience working with large-scale systems in AWS, mostly deployed via Kubernetes
• Comfort in a very Terraform-heavy environment, both reviewing PRs and contributing yourself
• Familiarity with application and service monitoring strategies and technologies, such as OpenTelemetry, Prometheus, Grafana, Fluentd, New Relic, Datadog, Sentry, and Sumo Logic
Note to Recruitment Agencies: Epic does not accept any unsolicited resumes or approaches from any unauthorized third party (including recruitment or placement agencies) (i.e., a third party with whom we do not have a negotiated and validly executed agreement). We will not pay any fees to any unauthorized third party. Further details on these matters can be found here.
Jobcode: Reference SBJ-dyzmjm-35-171-45-182-42 in your application.