Reporting to the Manager of Site Reliability Engineering, this position is critical to the mission of the Automation & Reliability Engineering team as part of the Global Technology & Operations Group. The Monitoring Engineer is a self-starter willing to take the initiative. The core purpose of the role is to ensure that our applications, platforms, and infrastructure are effectively monitored for availability, performance, and functionality, and that alerts driven by our monitoring systems are accurate and actionable. To succeed in this role, the postholder needs to be creative, clever, and passionate, and to love building new solutions.
Based in the greater Washington, DC area or Knoxville, TN, with remote opportunities, the postholder is part of a team supporting our broader Global Technology & Operations and Direct-to-Consumer Digital teams based around the world. As a function, these teams oversee all aspects of the core IT infrastructure, the 24/7 Command & Control Hub for our media distribution and IT support services, and our development teams building and supporting our direct-to-consumer products such as Discovery+. The position is key to driving organizational improvements, consistently improving and maintaining our availability and uptime, and establishing effective automation and monitoring solutions that surface both successes and areas of opportunity.
The Automation & Reliability Engineering team partners with engineering and workforce technology teams to advocate sensible, scalable systems design and to build the best tools to diagnose, resolve, and prevent issues. The postholder is a practitioner and advocate of good monitoring practices and configuration management within GT&O, and so should be a great communicator and an enthusiastic champion of Technology Operations.
In managing and advocating for effective and complete monitoring solutions, the Monitoring Engineer serves the needs of the broader GT&O organization, as well as internal customers across a wide range of technology and business teams. This person wants to see into the future, establishing predictive and proactive monitoring solutions that identify risks before services are impacted.
- Design, roadmap, and administer the tools used to discover and monitor Discovery's applications, services, platforms, and infrastructure
- Build monitoring systems that assist in infrastructure and application event detection and alert remediation
- Ensure all relevant infrastructure and services are properly covered within our monitoring and alerting systems in a manner consistent with our standards; collect the right metrics at the right frequency and ensure the data is readily available for effective alerting, reporting, and analysis
- Define business and operations success metrics; establish a departmental process model for benchmarking, standardization, and process improvements
- Collaborate with a cross-functional team of developers, operations staff, engineers, and architects to understand complex application architectures and implement an effective top-down monitoring strategy for holistic service visibility
- Participate in strategy and future implementation discussions for the redesign and implementation of monitoring environments, modernizing them in line with the latest technology trends
- Leverage performance counters to diagnose and troubleshoot infrastructure problems
- Create and maintain documentation for monitoring requirements, processes, and implementation
- Assist in the deployment, organization, and management of standard operating procedures
- Perform other duties as needed
- Bachelor's degree in Information Technology, Computer Science, or closely related degree or equivalent work experience
- Minimum of 5 years of experience in system administration and/or engineering in an enterprise production environment
- Experience in the use of network management protocols (e.g., SNMP, WMI, Syslog, ICMP, NetFlow)
- In-depth experience installing, configuring, and maintaining monitoring tools, including at least two of the following: Splunk, SolarWinds, New Relic, Datadog, and ServiceNow
- Experience managing a Splunk environment including forwarders, heavy forwarders, deployment servers, data ingestion, apps, indexes, clusters, and search queries
- Strong distributed systems and architecture knowledge and experience (networking, storage, operating systems)
- Hands-on scripting/automation experience with one or more of the following: Python, PowerShell, Bash
- Working knowledge of ITIL required. Foundation certification is preferred but not required. Must be able to communicate effectively with owners of ITIL disciplines (Incident, Problem, Change, Release, and Configuration) to provide effective IT support to end users.
- Excellent verbal, written, and interpersonal communication skills, along with strong customer service skills
- Strong organizational and conceptual skills combined with proven critical thinking, analytic, problem solving, and decision-making abilities
- Ability to multi-task within related functions
- Strong presentation and communication skills, and the ability to interface with internal and external groups
- Able to demonstrate a high degree of flexibility, including flexibility in working hours to support employees and customers across multiple time zones.
- Experience working for a media company or broadcaster is desirable but not essential
- Must have the legal right to work in the US
Jobcode: Reference SBJ-rjq0q1-34-231-243-21-42 in your application.
Discovery, Inc. is the global leader in real life entertainment. We serve passionate fans with content that inspires, informs, and entertains, providing leadership across deeply loved and trusted brands, such as Discovery Channel, TLC, Animal Planet, HGTV, Food Network, and Travel Channel.