Job Description
The Role
About the team
The Online Data Stores organization offers high-leverage managed datastores and abstractions at scale to meet Netflix's operational data requirements across all lines of business. These datastores encompass various types, including Caching, Relational, Search, Key-Value, and Composite abstractions. We focus on developing and maintaining high-performance, reliable, and efficient datastores. Additionally, we enhance developer productivity by providing secure, intuitive, and opinionated access layers.
ODS is seeking a new Engineering Manager in *each* of the following teams:
Caching - We are a dream team of nine stellar engineers providing caching as a centralized managed service. This service ranges from in-memory caching for relatively small datasets to distributed caching for larger datasets. Our caching products currently include Hollow, Gutenberg, EVCache (memcached), and Redis. These products power critical Tier 0 services at Netflix, directly responsible for the uptime of the Netflix product. The team focuses on feature development (on both internal and open-source codebases), on improving the efficiency, reliability, and scalability of the products, and on enabling more use cases at Netflix. Additionally, the team is responsible for the uptime of the products, focusing on streamlining operations and keeping the operational burden in check.
Infrastructure & Composite Abstractions - We seek an Engineering Manager to lead two different pods (Infrastructure and Composite Abstractions). We are crucial in building and maintaining the reliability and scalability of our composite abstractions and the infrastructure supporting Netflix's online data ecosystem. Composite Abstractions offers data abstractions interfacing with multiple ODS offerings (KeyValue, EVCache, Elasticsearch, etc.), overseeing key abstractions like TimeSeries, Graph, Counter, Identifier, and Tree. These abstractions are essential for diverse and complex data needs across Netflix's systems. Infrastructure develops, deploys, and manages composable software fleets, ensuring seamless operations of our Online Data Stores and other managed database offerings. Key responsibilities include DataGateway Infrastructure, Control plane, Capacity Planner, and Stateful Compute Platform, focusing on minimizing operational overhead while maintaining high performance and reliability.
What you will do:
• Partner to deliver the vision, strategy, and adoption of current and future technologies in the space of online caching datastores.
• Form trusting cross-functional partnerships to align many engineering teams and ensure our solutions meet their needs, selflessly prioritizing work beyond the scope of your own domain.
• Build, scale, and grow a team of outstanding engineers, challenge them to bring their best selves to work every day, and deliver industry-leading results in stability, performance, and efficiency.
• Balance smart risks, investment in foundational technical work, paying off tech debt, and incremental improvements to deliver timely results across multiple critical strategies.
• Ensure the delivery of reliable, scalable, and intuitive data solutions that meet critical business needs. Oversee the development, deployment, and maintenance of Tier 0 systems, ensuring high availability, rapid response times, and minimal operational overhead.
• Leverage deep domain expertise in distributed systems and infrastructure to guide work across critical domains. Influence strategy and decision-making beyond immediate teams, driving organizational success.
What we are looking for:
• Experienced Engineering Leader - An experienced engineering leader who can drive a team of talented, geographically distributed engineers to achieve their best work while building and maintaining strong partnerships with peers and stakeholders.
• Technical Strategy and Vision - Possesses the ability to dive deep to facilitate technical strategy trade-offs and zoom out to understand the big picture, shaping the product effectively.
• Platform & Distributed Systems Expertise - Proven experience in building and running extremely reliable distributed platform systems that support high-scale and critical Tier 0 workloads. This includes managing a 24x7 on-call rotation and ensuring 99.95%+ availability, fast response times, and a short mean time to restore service.
• Influence and Collaboration - Able to positively influence immediate and adjacent teams through curiosity and informed opinions, fostering a collaborative environment.
• Delivering Results and Prioritization - Proven track record of helping teams deliver timely results on critical priorities amidst numerous demands. Partners with product and engineering management to influence the right areas for investment.
• Talent Development - Experience maintaining high talent density in software engineering teams through coaching and mentoring, with exposure to growth and critical conversations.
• Inclusive Leadership - Experience in building and leading software engineering teams with a focus on inclusion and diversity, ensuring the attraction and retention of top talent from diverse backgrounds and that all voices are heard to inform direction.
• Customer Focus & Execution Excellence - Demonstrated ability to align technical initiatives with business needs, delivering high-impact results in critical areas. Experience in managing and maintaining Tier 0 systems with a focus on reliability and performance.
• Domain Expertise & Influence - Combines deep domain expertise in Tier 0 products with the ability to influence thinking beyond immediate teams. Drives strategic direction across various critical domains.
• Strategic Decision-Making & Communication - Skilled in making strategic decisions that drive business success and communicating them concisely and assertively. Ensures strong partnerships and holds stakeholders accountable.
Blogs and Talks:
• https://netflixtechblog.com/announcing-evcache-distributed-in-memory-datastore-for-cloud-c26a698c27f7
• https://netflixtechblog.com/caching-for-a-global-netflix-7bcc457012f1
• https://netflixtechblog.medium.com/cache-warming-leveraging-ebs-for-moving-petabytes-of-data-adcf7a4a78c3
• AWS re:Invent 2023 - How Netflix uses AWS for multi-Region cache replication (NFX304)
• AWS re:Invent 2021 - How Netflix operates mission-critical data stores on AWS
• https://netflixtechblog.com/netflixoss-announcing-hollow-5f710eefca4b
• https://netflixtechblog.com/how-netflix-microservices-tackle-dataset-pub-sub-4a068adcc9a
• https://netflixtechblog.medium.com/data-gateway-a-platform-for-growing-and-protecting-the-data-tier-f1ed8db8f5c6
• https://www.infoq.com/articles/netflix-highly-reliable-stateful-systems/
• https://netflixtechblog.com/enhancing-netflix-reliability-with-service-level-prioritized-load-shedding-e735e6ce8f7d
• Towards Practical Self-Healing Distributed Databases (IEEE conference publication)
• https://netflixtechblog.medium.com/introducing-jvmquake-ec944c60ba70
A few more things about us:
As a team, we come from many different backgrounds and countries, and our fields of education range from the humanities to engineering to computer science. We strive to give people the opportunity to wear different hats, should they choose to. We strongly believe this diversity and agility have helped us build an inclusive