Senior Software Engineer - Open Connect Cloud Platforms
Los Gatos, CA
Would you like to manage our Spark compute infrastructure and optimize the ML Spark pipelines that power Netflix recommendations? We think of the Netflix service as hundreds of millions of different products, each serving a uniquely personalized experience to one of our 200+ million members.
One of the teams powering this effort is the ML Platform Data & Feature Infra team, which is responsible for building the scalable and efficient compute infrastructure used to train our personalization ML models.
In this role, you will manage the Spark compute infrastructure that is used to train the ML algorithms that power Netflix personalization. You will drive operational excellence through tooling and automation, and you will work closely with ML researchers and engineers to scale their ad hoc explorations and manage production ML pipelines. This role will allow you to gain intimate knowledge of Netflix personalization while working for a unique and pioneering company that is redefining how video content is consumed globally.
Here are some examples of the types of things you would work on:
• Manage a large-scale Spark cluster (several thousand EC2 instances) that powers the ML production pipelines fueling innovation for Recommendations research
• Collaborate with our Big Data Platform teams to build, deploy and upgrade our compute infrastructure using the latest and greatest open source libraries
• Optimize the ML Spark pipelines for both resource and latency efficiency and help with capacity planning for our compute infrastructure
• Build tools and automation to make the infrastructure more robust and to report cluster cost utilization and efficiency
• Increase research productivity by quickly troubleshooting Spark performance issues and any roadblocks in adoption of our compute infrastructure
To learn more, here are some talks/blog posts from the team:
• Multi-tenant Spark workflows in Auto Scalable Mesos clusters
• 2018 Spark Summit presentations
• Netflix ML Platform Research website
Here is what we are looking for:
• 4+ years of relevant experience managing large-scale distributed data systems
• Strong automation mindset and a passion for root cause analysis and strategies to mitigate issues
• Experience installing, configuring, and monitoring big data technologies like Spark, Mesos/YARN/Kubernetes, HDFS, or Elasticsearch
• Experience with performance tuning and debugging scalability issues of Spark applications
• Excellent communication and people engagement skills
• Expertise in scripting languages
• Experience with cloud computing platforms like Amazon Web Services (AWS)
• Exposure to functional languages like Scala
• Experience working with notebooks such as Jupyter or Zeppelin
• Experience working with container platforms such as Docker
Netflix is an equal opportunity employer and strives to build diverse teams from all walks of life. We offer a unique culture of freedom and responsibility with a clear long-term view. We recommend reading through these to understand what working at Netflix is like.
Jobcode: Reference SBJ-gm5e0x-3-238-173-209-42 in your application.