Site Reliability Engineering (SRE) team members work with infrastructure and product teams to run operations and to improve development pipelines and infrastructure. An SRE combines strong engineering and operations skills and focuses on automating and improving operations. The job is to guarantee system reliability, performance, and supportability, with a strong engineering emphasis on building autonomous solutions that deliver value to end users early, often, and fast. SREs are central to the reputation and trustworthiness of our services and act as advocates for engineering best practices.
WM's Site Reliability Engineering members are responsible for the management and governance of all infrastructure. You will be involved from the initial design phase, figuring out the right network topology for our clients, through the final stages of deploying and setting up monitoring, alerting, patching, and logging. The role requires good problem-solving skills and the ability to work across several teams and partner groups. We're looking for candidates who love to learn and can adapt quickly.
Own the monitoring and alerting strategy for all production and client-facing infrastructure.
Partner with Operations leads to define and execute the automation strategy for proactive and reactive issue management and for performance and reliability data.
Ensure that WM security and methodology standards and procedures are adopted and implemented.
Design, write and deliver software and infrastructure to improve the reliability, scalability, latency, and efficiency of our services.
Bring new technology thinking and lead developers to embrace new tools and innovative solutions to solve business needs.
Guide developers to design, develop, and test for reliability, and to automate tasks for applications.
Solve problems relating to mission-critical services and create solutions to prevent problem recurrence, with the goal of automating the response to all non-exceptional service conditions.
Influence and create new designs, architectures, standards, and methods for large-scale distributed systems.
Engage in service capacity planning, service integration and geo-expansion, software performance analysis and system tuning.
Be solutions-oriented, using rigorous logic and methods to solve difficult problems effectively and probing all sources for answers.
Participate in on-call duties.
Help troubleshoot incidents to address failure patterns, automate remediation through runbooks, and document application optimization.
Enable the Cloud, DevOps, Continuous Integration, Test Automation, and Monitoring strategies.
Coordinate with the Applications team to satisfy all non-functional project requirements (security, performance, scalability, and resiliency).
Proven track record of building, supporting, and scaling a high-transaction 24x7 environment.
Collaborate with the Platform Engineering team to build tools to help automate deployments.
Coordinate with relevant teams to build useful tools to support network operations (internal and external).
Passionate about SRE, DevOps, Automation, and infrastructure platforms. Must excel with agile and lean development practices and manage multiple priorities and multiple roles.
BS Degree in Computer Science, Engineering or Mathematics or equivalent experience.
Experience with build automation and CI/CD.
Expertise with monitoring or log aggregation tools.
Proficient in Python, JSON, XML, PowerShell, Bash, web services (SOAP, REST).
Experience working with Azure and AWS cloud services and edge services for all layers of core infrastructure.
Experience with infrastructure automation tools such as Terraform, Ansible, or the vRealize suite.
Experience with containerization technologies such as Docker/Kubernetes/Tanzu.
Experience profiling and debugging enterprise infrastructure.
5+ years of experience and outstanding coding skills in at least one of the object-oriented computer languages: C#, C++/C, or Java.
1+ years of experience in software architecture and design.
Experience with Microsoft System Center Configuration Manager, Operations Manager, Management Packs, NPM/APM, networking, hardware, logistics and operations, or capacity planning.
Expertise in problem solving and analyzing global scale systems and critical production service environments.
Capable of technical deep dives into networking, service design, operating systems, and storage.
Verbally and cognitively agile enough to engage in strategy discussions with leadership team members.
A passion for building and participating in highly effective teams and development processes.
Strong debugging, testing/validation, and analytics/SQL (AO) skills; experience with IIS and Apache.
Experience working with Agile methodologies (Scrum) and cross-functional teams.
Nice to have:
Experience in media, news, and/or entertainment industry.
Active participant in industry events (publishing articles, presenting at conferences).
Certifications such as MCSE, RHCA, AWS, Azure, VMware, Network Essentials, or other relevant credentials.
Jobcode: Reference SBJ-d5pv4z-54-227-97-219-42 in your application.