Full Time Job

Manager - Site Reliability Engineer, Core

Netflix

Los Gatos, CA 09-08-2020
 
  • Paid
  • Full Time
Job Description

The team's goals

At Netflix, we strive to bring joy to people across the world through amazing stories. As we grow internationally, we are continually enhancing our cloud-based infrastructure to improve our performance, scalability, and reliability.

The streaming SRE team's goal is to ensure customer joy by successfully managing risk and minimizing impact. We do this through cross-functional engagement with other engineering teams, handling issues when they happen, as well as promoting reliability and resilience practices throughout the organization. The team is adding a new manager to the leadership team to focus on the future reliability of the Netflix streaming service and the SRE team that supports it.

About the role

As the leader of the streaming SRE team, you'll balance the reactive work that's part of incident management with the proactive work of reducing future issues. You'll use the information gathered from that work to educate and inform the engineering organization about patterns and behaviors that lead to more reliable and available systems.

This role benefits from an experienced engineering leader with excellent cross-functional and communication skills. You'll lead, and be supported by, a world-class group of senior engineers with backgrounds in operations, software development, systems performance, and resilience engineering. Together you'll focus on quality of experience for our customers. It's a rewarding and ever-changing set of challenges for a leader who cares about the complexities of socio-technical systems that serve millions of members around the world.

What the team does
• Drive incidents to resolution by coordinating with multiple engineering teams
• Identify sources of instability in large-scale distributed systems and drive operational excellence
• Analyze complex systems from a reliability and resilience perspective
• Lead cross-organizational efforts with different teams to diagnose operational surprises and carry forward improvements
• Improve reliability and drive down the burden of toil with tooling and automation

You'll lead the team to
• Improve our incident management lifecycle to identify, mitigate, and learn from reliability risks
• Increase our reliability through establishing guidance and methods of improvement
• Form and maintain relationships with internal and external partners
• Develop deeper insights and analysis into the quality of experience for our customers

Your new responsibilities include
• Leading reliability efforts for the entire Netflix streaming service
• Maintaining a team of high-performing, leadership-minded engineers
• Collaborating with other teams and managers
• Communicating, discussing, and championing reliability efforts
• Understanding the near-, mid-, and long-term needs of the business and how the work of the team contributes

Your experience
• Leading reliability efforts in a technical organization
• Leadership of an engineering team
• Experience with modern SRE practices
• Hands-on experience with active incident management
• Cross-functional work with teams of different expertise
• Large-scale distributed systems management

We value
• Curiosity about how complex socio-technical systems successfully operate at scale when failure is inevitable
• People who see influence as their preferred tool for cultivating relationships
• Collaboration and continuous improvement
• A desire to learn and readiness to teach
• Iteration as the path forward

Things that show how we think
• Keeping Customers Streaming - The Centralized Site Reliability Practice at Netflix
• How Did Things Go Right? Learning More from Incidents
• Capacity Management Made Easy with Autoscaling
• Antics, Drift, and Chaos
• Day in the Life of a Netflix Engineer
• How we keep Netflix up and running

Jobcode: Reference SBJ-g359xr-18-223-125-219-42 in your application.