Manager - Site Reliability Engineer, Core


Los Gatos, CA 07-02-2020
  • Paid
  • Full Time
Job Description

At Netflix, we strive to bring joy to people across the world through amazing stories. As we grow internationally, we are continually enhancing our cloud-based infrastructure to improve our performance, scalability, and reliability. The streaming SRE team's goal is to ensure customer joy by successfully managing risk and minimizing impact across Netflix. We do this through cross-functional engagement with other engineering teams, handling issues when they happen, as well as promoting reliability and resilience practices throughout the organization. The team is adding a new manager to the leadership team to focus on the future reliability of the Netflix streaming service and the SRE team that supports it.
• Improve our incident management lifecycle to identify, mitigate, and learn from reliability risks
• Increase our reliability through establishing guidance and methods of improvement
• Form and maintain relationships with internal and external partners
• Develop deeper insights and analysis into the quality of experience for our customers

We Value
• Curiosity about how complex socio-technical systems successfully operate at scale when failure is inevitable
• People who see influence as their preferred tool for cultivating relationships
• Collaboration and continuous improvement
• A desire to learn and readiness to teach
• Iteration as the path forward

Our Work
• Drive incidents to resolution by coordinating with multiple engineering teams
• Identify sources of instability in large-scale distributed systems and drive operational excellence
• Analyze complex systems from a reliability and resilience perspective
• Engage with product teams to diagnose operational surprises and carry forward improvements
• Improve reliability and drive down the burden of toil with tooling and automation

Nice To Have
• Experience with global, continuous delivery methods
• Development with Python, Go, Java, or JavaScript/Node.js
• Involvement with incident management and response
• Knowledge of cloud platforms like AWS and microservices architecture
• Deep network analysis
• Linux systems engineering capability

Things That Show How We Think
• How Did Things Go Right? Learning More from Incidents
• Capacity Management Made Easy with Autoscaling
• Antics, Drift, and Chaos
• Day in the Life of a Netflix Engineer
• How we keep Netflix up and running

This role is rewarding for people who can collaborate in a complex environment with a wide variety of groups across Netflix. If any of these things sound interesting to you, please apply.