… content understanding, personalization prompts, assistant use cases), integrating with batch/online inference (including vLLM-based backends) and experiment tracking to deliver reliable, reproducible metrics.
• Operationalize benchmark coverage alongside Netflix-specific task suites and user-journey-grounded prompts; automate result collection, statistical analysis, and drift detection. 
• Develop high-quality synthetic data and labeling pipelines to expand coverage, reduce bias, and continuously refresh eval corpora; codify data provenance and sampling policies. 
• Partner deeply with model developers and platform teams to co-design APIs for submitting eval jobs, adding new tasks/metrics, and defining SLO-like quality thresholds that unblock launches while preventing regressions. 
• Contribute beyond evaluation across the GenAI/FM stack when needed:
• Research workflows (orchestration, queueing/caching/failure isolation, artifact lineage, experiment management) that k…
Preferred Qualifications
• End-to-end foundation-model lifecycle exposure: pre-train checks, post-train regression, and pre-launch gates; understanding where and how evaluation fits.
• Built or contributed to an evaluation platform at scale (batch/online evals, multi-modal tasks, queueing, caching, failure isolation) with strong SLIs/SLOs.
• Experience building evaluation data pipelines (synthetic generation, labeling, sampling) with provenance and governance.
• Platform mindset: craft usable APIs/UX so modeling teams can submit tasks, compare runs, and gate launches with SLO-like thresholds.
• Bonus signals across the broader stack: experience with reinforcement learning, agent modeling, AI alignment, distributed training, vector search/feature stores, routing/safety middleware for serving, and cost/perf tuning.
What do we offer?
…
This posting will remain open for no less than 7 days and will be removed when the position is filled.