Design Program Manager
Los Gatos, CA
Running a highly reliable service like Netflix requires hundreds of services to work together. If that doesn't happen, we've failed at our jobs. The Demand Engineering team's focus is reliability. We achieve it by ensuring capacity needs are met across the Netflix ecosystem: shaping traffic, scaling systems, and recovering from failure.
Demand Engineering helps Netflix meet our customer reliability and overall efficiency goals by ensuring that the services that run Netflix have the compute resources they need, where and when they need them. We also run the infrastructure that mitigates incidents through regional evacuation, without our customers noticing.
Our team sits in the middle of the action at Netflix. To serve over 193 million members around the world, Netflix needs to have capacity available when and where it's needed. More importantly, we need to be able to shift that capacity at a moment's notice if there is a problem with the infrastructure. This means predicting what compute resources are needed, when they're needed, and where they're needed at any point in the day.
Our team creates the tools and techniques that make this possible, in addition to operating the infrastructure. Steering and scaling are also powerful tools for influencing the availability and latency of Netflix during normal operations.
We have a lot of fun problems to solve, a scale that makes them challenging, and a culture that gives us the freedom to pursue what is best for our members and the business.
Who you are
• You are intensely curious about how systems operate and fail at scale
• You reflect on design choices and trade-offs you have made
• You insightfully draw connections between minutiae of implementation details and emergent system behavior
• You think freely and independently, and are ready to share your view
• You are humble and eager to learn from mistakes, and you socialize the lessons learned
What you'll do
• Take ownership of cross-functional projects that impact the majority of the Netflix fleet
• Create new solutions and see them through, from conception to production
• Write code to support our existing solutions, most of which are in Python
• Respond to problems in production in real time while on call
• Perform regional evacuations
• Use data analysis to improve our capacity and failover predictions
• Shape the future of capacity management and efficiency by abstracting complexity for other engineering teams
What you bring
• You have built or contributed to a variety of systems, ideally in different technologies
• Experience with microservice architectures and/or the nitty-gritty of low-level concurrency concerns
• Some experience with large-scale data analytics and familiarity with data science tools
• Strong software design and development skills in modern, dynamic programming languages
• Willingness to be part of an on-call rotation for regional evacuation
Nice to have
• Experience with multi-site high availability
• Experience with dynamic scaling (AWS)
• Experience with internet-scale infrastructure
Interested in learning more?
• What is Demand Engineering?
• Project Nimble
• Evolving Regional Evacuation
• Our Culture
• How will our team interview you?