
Full Time Job

Technical Manager

WarnerMedia

Atlanta, GA 08-19-2021
 
  • Paid
  • Full Time
  • Mid (2-5 years) Experience
Job Description
The Job

WarnerMedia seeks a Technical Manager to lead a team of highly skilled Site Reliability Engineers within our WarnerMedia Technology & Operations (WMTO) organization. The SRE team owns and manages the infrastructure stack for our unified video delivery platform, a core set of products and workflows that power video acquisition, encoding, delivery, and playback across our WarnerMedia brands.
As a hands-on SRE Manager, you will maintain and evolve our highly available, highly scalable video systems infrastructure using containers, cluster management, cloud services, and performance tools to keep our systems available in 24x7 environments. You will identify and pursue opportunities to increase efficiency, eliminate downtime, optimize runtime costs, and maintain performance at scale. You will collaborate closely with our product engineering teams to drive proactive product reliability during feature development, while guiding them on best practices for application performance, cloud infrastructure needs, and continuous integration/delivery (CI/CD). Our tech stack includes AWS, Kubernetes, Docker, Terraform, Postgres, Mongo, Jenkins, and Elasticsearch.
The SRE Manager should be a strong technical leader who champions strategic site reliability engineering principles, including solid systems architecture, iterative improvements to limit time spent on operations, monitoring tooling across the tech stack, and proactive incident management. You will drive automation of day-to-day functions such as release deployments, rollbacks, failover recovery, and infrastructure as code for provisioning. You will also interact directly with senior technical leaders across our larger WMTO organization on IT security compliance activities and cloud governance practices.

The Daily
• Lead a team of talented engineers to run and optimize the tech stack behind our Live Streaming and VOD services for reliability, scalability, latency, efficiency, and security
• Recruit and develop the SRE team, building a culture of excellence in site reliability, systems performance, capacity planning, and operational efficiency
• Manage large project initiatives for tech stack improvements, including AWS cloud needs, video encoder integrations, on-prem to cloud migrations, hardware deprecation, and software/security updates
• Identify strategic and tactical opportunities to improve service health, performance, reliability, and telemetry; drive these initiatives within the SRE team and across peer engineering teams
• Partner with multiple engineering teams to strategize and drive best-in-class SRE practices, participating early in solutions design to establish optimal infrastructure plans
• Track performance metrics (SLAs/SLOs/SLIs), application tracing, monitoring alerts, and logging across all services to gain insight into service availability, resource usage, and errors
• Innovate forward – Evaluate emerging technologies for potential adoption; lead discovery efforts and proofs of concept (POCs) to assess new tech stack components
• Lead by example – Roll up your sleeves to triage critical incidents, debug systems issues, and assist with on-call support as needed

The Essentials
• Bachelor's degree in Computer Science or equivalent work experience
• 3+ years as an SRE or DevOps manager for enterprise-level applications
• 5+ years of experience in cloud and container designs, architectures, and migrations
• 5+ years of experience in AWS cloud technologies, with broad exposure to the AWS suite of services, such as S3, EFS, RDS, ECS, EKS, ALB, and Route 53
• 7+ years of experience in the software development lifecycle and application modernization
• Expertise in container-based, serverless, physical server, and VM architecture designs
• Strong knowledge of sophisticated application, database, network, and service-level integrations across distributed large-scale architectures
• Strong problem-solving and troubleshooting skills to identify root causes, implement short-term remediation, and create solutions to prevent problem recurrence
• Expert knowledge of Unix/Linux system administration at scale
• Ability to code SRE tools using Node.js, Python, shell scripting, or other languages
• Strong experience with source control and CI/CD pipelines
• Ability to lead infrastructure solution design meetings and problem triage sessions
• Ability to work in a dynamic, fast-paced environment
• Clear and effective communicator with both technical and non-technical audiences
• Experience with the full digital video stack is a plus – video encoding (CMAF/DASH/HLS), adaptive bitrate packaging, CDN delivery, DRM solutions, AWS video cloud services (MediaLive, MediaConvert, MediaPackage), and video playback

Jobcode: Reference SBJ-gq582m-3-144-212-145-42 in your application.