Applied Research Scientist - Content and Studio
Netflix
Los Gatos, CA
The Team's Goals
At Netflix, we strive to bring joy to people across the world through amazing stories. As we grow internationally, we are continually enhancing our cloud-based infrastructure to improve our performance, scalability, and reliability.
The streaming SRE team's goal is to ensure customer joy by successfully managing risk and minimizing impact. We do this by engaging cross-functionally with other engineering teams, handling issues when they happen, and promoting reliability and resilience practices throughout the organization. The team is adding a new manager to the leadership team to focus on the future reliability of the Netflix streaming service and the SRE team that supports it.
About the role
As the leader of the streaming SRE team, you'll balance the reactive work that's part of incident management with the proactive work of reducing future issues. You'll use the information gathered to educate and inform the engineering organization about patterns and behaviors that lead to more reliable and available systems.
This role benefits from an experienced engineering leader with excellent cross-functional and communication skills. You'll lead, and be supported by, a world-class group of senior engineers with backgrounds in operations, software development, systems performance, and resilience engineering. Together you'll focus on quality of experience for our customers. It's a rewarding and ever-changing set of challenges for a leader who cares about the complexities of socio-technical systems that serve millions of members around the world.
What the team does
• Drive incidents to resolution by coordinating with multiple engineering teams
• Identify sources of instability in large-scale distributed systems and drive operational excellence
• Analyze complex systems from a reliability and resilience perspective
• Lead cross-organizational efforts with different teams to diagnose operational surprises and carry forward improvements
• Improve reliability and drive down the burden of toil with tooling and automation
You'll lead the team to
• Improve our incident management lifecycle to identify, mitigate, and learn from reliability risks
• Increase our reliability by establishing guidance and methods of improvement
• Form and maintain relationships with internal and external partners
• Develop deeper insights and analysis into the quality of experience for our customers
Your new responsibilities include
• Leading reliability efforts for the entire Netflix streaming service
• Maintaining a team of high-performing, leadership-minded engineers
• Collaborating with other teams and managers
• Communicating, discussing, and championing reliability efforts
• Understanding the near-, mid-, and long-term needs of the business and how the work of the team contributes
Your experience
• Leading reliability efforts in a technical organization
• Leadership of an engineering team
• Experience with modern SRE practices
• Hands-on experience with active incident management
• Cross-functional work with teams of different expertise
• Large-scale distributed systems management
We Value
• Curiosity about how complex socio-technical systems successfully operate at scale when failure is inevitable
• People who see influence as their preferred tool for cultivating relationships
• Collaboration and continuous improvement
• A desire to learn and readiness to teach
• Iteration as the path forward
Things that show how we think
• Keeping Customers Streaming - The Centralized Site Reliability Practice at Netflix
• How Did Things Go Right? Learning More from Incidents
• Capacity Management Made Easy with Autoscaling
• Antics, Drift, and Chaos
• Day in the Life of a Netflix Engineer
• How we keep Netflix up and running
Jobcode: Reference SBJ-g359xr-18-223-125-219-42 in your application.