Job Description
As a member of the Service Platform Operations SRE team, you will carry the responsibility of keeping core user experiences on the platform available, resilient and high-performing, while continually enabling our service teams to deliver new and exciting product and technical features. Our team strives to iteratively learn, improve and automate our processes every single day, which continually raises the bar for operational excellence within our organization. You will be empowered to drive and lead technical initiatives on our team, helping identify and proactively drive improvements in both the processes and the technology that support millions of users.
Responsibilities:
As a Senior SRE, your responsibilities will include hands-on application management and production support of services related to core experiences within an AWS cloud environment, ensuring availability, resiliency, scalability and high performance. You will work side by side with our service development teams to provision, automate and ensure the production readiness of all new services and features introduced.
Other responsibilities include:
• Integrate and automate the configuration and ongoing operations of AWS managed services.
• Identify areas for operational process improvement and automation. Drive the hands-on development of scripts and tools to automate these processes within our environment.
• Increase observability on our platform by implementing robust monitoring and alerting patterns across our services. Develop rich, informative dashboards and reports on our services that provide valuable insight and meaningful alerting to drive down the MTTD and MTTR of platform incidents.
• Collaborate and partner with other SRE teams that specialize in areas such as data services, data platform, and platform hosting to inspire changes and ensure optimal end-to-end system performance and resiliency across all back-end services within PlayStation.
• Iteratively drive performance and capacity validation analysis for our services. Utilize AWS patterns and technologies such as spot instances, dynamic auto-scaling and EKS to optimize resource usage and AWS spend.
• Review service flows and architecture to influence resiliency, availability and scalability for all services within our platform.
• Provide rotational on-call support, where you'll detect, respond to, triage and resolve production incidents.
• Conduct, document and present root cause analyses to share incident insights and findings with our broader engineering organization.
Qualifications:
• BS degree in Computer Science, Engineering, or related technical area.
• 5+ years of hands-on AWS experience integrating, developing and managing applications.
• 5+ years of relevant work experience in a large-scale and/or critical production software environment.
• 5+ years of hands-on experience in software engineering or in supporting and maintaining software systems (Java and/or C++ services).
• 3+ years of experience with building automation into daily operational processes through one or more programming languages (preferably Python or Go).
• Strong experience in configuring, tuning and automating operational responsibilities for AWS managed data services including RDS, DynamoDB and ElastiCache.
• Experience with application monitoring and log management tools (e.g., Datadog, CloudWatch, Splunk).
• Experience with container technologies and orchestration (e.g., Docker, Kubernetes, EKS, Fargate).
• Hands-on experience in triaging and tuning Java cloud applications with integration into AWS managed services.
• Proven understanding of AWS networking systems and protocols (e.g., ALB, R53, API Gateway, TCP/IP, HTTP/HTTPS, DNS).
• Experience with developing or supporting Continuous Integration and Continuous Delivery (CI/CD) pipelines.