Full Time Job

Sr Site Reliability Engineer

NBCUniversal

Englewood Cliffs, NJ 8 days ago

Details

Paid
Full Time
Executive (10+ years) Experience

Job Description

Sr. Site Reliability Engineer (SRE)

NBCUniversal is one of the world's leading media and entertainment companies.
Our impact is rooted in improving the communities where our employees, customers, and audiences live and work. We have a rich tradition of giving back and ensuring our employees have the opportunity to serve their communities. We champion an inclusive culture and strive to attract and develop a talented workforce to create and deliver a wide range of content reflecting our world.

Job Description
As a Principal Site Reliability Engineer (SRE) overseeing our digital application portfolio, you will lead efforts to ensure the reliability, scalability, and performance of the platforms behind our web, mobile, and OTT experiences. You'll work across a diverse ecosystem of products and technologies, helping with architectural decisions, shaping reliability standards, and championing operational excellence at scale.
You will serve as a strategic partner to engineering, product, security, and infrastructure teams, guiding system design for high availability, leading incident response across critical services, and embedding SRE best practices across the software development lifecycle. Your role will include evolving observability frameworks, advancing infrastructure-as-code maturity, and automating tooling to accelerate delivery while maintaining stability.
Success in this role is defined by your ability to influence engineering culture, mentor teams, and drive systemic improvements that raise the bar for operational resilience. You'll take a proactive, data-driven approach to identifying and addressing risks before they impact users. Collaboration across teams, including video engineering, content delivery, data, and customer experience, is key to delivering digital products that are not only innovative but consistently reliable.
What We Value
Site Reliability Engineers are the champions of reliability and customer trust in production. We value engineers who are driven by a desire to deliver the best possible customer experience, ensuring that every interaction across our web, mobile, CTV, and video platforms is fast, seamless, and dependable. We look for systems thinkers who act with urgency, collaborate deeply, and apply a data-driven mindset to everything they do. Curiosity, clear communication, and continuous improvement are at the heart of our culture. As a Principal SRE, you'll lead by example: mentoring others, shaping best practices, and helping us build resilient systems that scale.

Responsibilities:
• Design and implement tools, processes, and frameworks to proactively monitor, measure, and improve the performance, availability, and reliability of production applications.
• Define and maintain key Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to uphold system reliability and user experience targets.
• Evaluate applications and services for production readiness, ensuring they meet operational, security, and customer experience requirements before launch.
• Establish comprehensive observability practices, including real-time monitoring, alerting, and telemetry, to ensure deep visibility into system health and user impact.
• Serve as a feedback loop to engineering teams by analyzing production behavior, identifying reliability gaps, and driving architectural and operational improvements.
• Collaborate with security and infrastructure teams to proactively address vulnerabilities and maintain compliance across production systems.
• Partner with product and platform teams to ensure operational insights inform development priorities and release strategies.
• Lead post-incident reviews and foster a culture of continuous learning, improvement, and resilience.
• Participate in a 24/7 on-call rotation to support critical services and ensure rapid incident response.
Qualifications
Must-Haves:
• Willingness to work onsite and participate in a 24/7 on-call rotation, including evenings, overnights, weekends, and holidays with minimal notice.
• Demonstrated experience supporting digital news and content platforms across web, mobile, CTV, and video-rich environments, with a strong focus on performance and user experience.
• 10+ years of experience managing and optimizing large-scale, high-traffic websites.
• 10+ years of hands-on experience with application deployment processes and CI/CD pipelines.
• 5+ years improving performance and reliability for OTT (Connected TV) and mobile applications.
• 5+ years supporting microservices and multi-tier distributed systems.
• 5+ years implementing software automation frameworks for reliability and operational efficiency.
• 5+ years of experience with cloud platforms, including AWS and Google Cloud Platform (GCP).
• 5+ years working with observability and APM tools such as Datadog, New Relic, AppDynamics, Sysdig, or Zabbix.
• 3+ years working with reverse proxies like Varnish and Content Delivery Networks (CDNs) such as Akamai.
• 5+ years scripting with languages such as Bash, Python, Perl, or Groovy.
• 5+ years using configuration management tools such as Ansible, SaltStack, Chef, or Puppet.
• 5+ years configuring and managing application servers (e.g., Tomcat, NGINX, Apache).
• 5+ years of experience with load and performance testing tools/frameworks such as JMeter, k6, or similar.
• Hands-on experience using tools like Charles Proxy or Fiddler to triage and debug issues with web apps, mobile apps, and OTT devices.
• High-level understanding of video streaming techniques and the ability to triage issues with mobile and OTT streaming applications.
• 3+ years using performance validation tools such as Selenium, TestNG, or equivalent to drive improvements in production.
Preferred Qualifications:
• 3+ years implementing and monitoring application/infrastructure security controls, including WAFs, site shields, and other perimeter protections.
• 3+ years applying code and infrastructure security practices, including vulnerability remediation and secure deployment pipelines.
• Relevant certifications in Performance Engineering or Site Reliability Engineering (SRE) are a plus.
Hybrid: This position has been designated as hybrid, generally contributing from the office a minimum of three days per week.
What we'll offer:
At CNBC Headquarters in Englewood Cliffs, NJ, you'll have access to great perks and amenities:
• Sweat it out -- Free onsite fitness center with state-of-the-art equipment, plus daily group classes
• Eat up -- Gourmet cafeteria with daily specials plus soup and salad bars
• Extras -- Dry cleaning, shoe shining and sneak peeks
Don't have a car? No problem! We offer free shuttle transportation to and from multiple locations in Manhattan, Brooklyn, Hoboken and Jersey City.
This position is eligible for company sponsored benefits, including medical, dental and vision insurance, 401(k), paid leave, tuition reimbursement, and a variety of other discounts and perks. Learn more about the benefits offered by NBCUniversal by visiting the Benefits page of the Careers website.
Salary Range: $155,000 - $175,000

Jobcode: Reference SBJ-vee8mw-216-73-216-0-42 in your application.

Salary Details

Salary Range: $155,000 to $175,000 Per Year ($ USD)
