Technology Systems and Operations Specialist
Netflix
Remote / Virtual

Would you like to manage our Spark compute infrastructure and optimize the ML Spark pipelines that power Netflix recommendations? We think of the Netflix service as hundreds of millions of different products, each serving a uniquely personalized experience to one of our 200+ million members.
One of the teams powering this effort is the ML Platform Data & Feature Infra team, which is responsible for building the scalable and efficient compute infrastructure used to train our personalization ML models.
The Opportunity
In this role, you will manage the Spark compute infrastructure used to train the ML algorithms that power Netflix personalization. You will drive operational excellence through tooling and automation, and you will work closely with ML researchers and engineers to scale their ad hoc explorations and manage production ML pipelines. This role will allow you to gain intimate knowledge of Netflix personalization while working for a unique and pioneering company that is redefining how video content is consumed globally.
Here are some examples of the types of things you would work on:
• Optimize the ML Spark pipelines for both resource and latency efficiency, and help with capacity planning for our compute infrastructure
• Increase research productivity by quickly troubleshooting Spark performance issues and any roadblocks in adoption of our compute infrastructure
• Build tools and automation to make the infrastructure more robust and to report cluster cost utilization and efficiency
• Manage a large-scale Spark cluster (several thousand EC2 instances) that powers the ML production pipelines fueling innovation for Recommendations research
• Collaborate with our Big Data Platform teams to build, deploy, and upgrade our compute infrastructure using the latest and greatest open source libraries
To learn more, here are some talks/blog posts from the team:
• Multi-tenant Spark workflows in Auto Scalable Mesos clusters
• 2018 Spark Summit presentations
• Netflix ML Platform Research website
Minimum Qualifications
• 4+ years of relevant experience managing large-scale distributed data systems
• Strong automation mindset and a passion for root cause analysis and strategies to mitigate issues
• Experience with big data technologies like Spark, Mesos/YARN/Kubernetes, HDFS, or Elasticsearch
• Experience with performance tuning and debugging scalability issues of Spark applications
• Excellent communication and people engagement skills
• Expertise in scripting languages
• Experience with cloud computing platforms like Amazon Web Services (AWS)
Preferred Qualifications
• Exposure to functional languages like Scala
• Experience working on Notebooks such as Jupyter or Polynote
• Experience working on container (Docker) platforms
Netflix is an equal opportunity employer and strives to build diverse teams from all walks of life. We offer a unique culture of freedom and responsibility with a clear long-term view. We recommend reading through these to understand what working at Netflix is like.
Jobcode: Reference SBJ-gm5e0x-216-73-216-86-42 in your application.