Full Time Job

Staff Site Reliability Engineer

Sony Interactive Entertainment

San Diego, CA 12-23-2021
 
  • Paid
  • Full Time
  • Executive (10+ years) Experience
Job Description
Staff Site Reliability Engineer – San Mateo

As a member of the operations SRE team within the platform technology group, you will carry the responsibility of keeping key user experiences on the platform available, resilient and high performing, while continually enabling our service teams to deliver new and exciting products and technical features. Our team strives to iteratively learn, improve and automate our processes every single day, which continually sets the standard for operational excellence within our organization. You will be empowered to drive and lead technical initiatives, helping identify and proactively drive improvements in both process and technology supporting millions of users.

Responsibilities:
• Application operations and production support of internal and public facing services within an AWS cloud environment, ensuring availability, resiliency, scalability and performance.
• Provision, automate and ensure the production readiness of all new services and features introduced.
• Identify areas for operational process improvement and automation. Drive the hands-on development of scripts and tools to automate these processes within our environment.
• Increase observability on our platform by implementing robust monitoring and alerting patterns across our services. Develop rich, informative dashboards / reports on our services that provide valuable insight and meaningful alerting to drive down the MTTD and MTTR on platform incidents.
• Collaborate and partner with other SRE teams that specialize in areas such as data services, CI/CD, and platform hosting to inspire changes and ensure optimal end-to-end system performance and resiliency across all back-end services within PlayStation.
• Iteratively drive performance and capacity validation analysis for our services. Apply AWS patterns and technologies such as spot instances, dynamic auto-scaling and EKS to optimize resource usage and AWS spend.
• Conduct root cause analyses, then document and present the findings to share incident insights with our broader engineering organization.
• Provide rotational on-call support where you'll detect, triage, respond to and resolve production incidents.
• Coach and mentor other team members.
• Work closely with SRE teams and leadership to define critical metrics and processes and to drive continuous improvement.
• Influence the architecture and implementation of solutions across teams and organizations.

Key Qualifications:
• Equally adept at software development and systems engineering/operations
• PASSIONATE(!) desire to automate and improve everything including process improvements, standardizing tools and technologies
• Excellent troubleshooting skills that span user experience, system, infrastructure, and network (TCP/IP). Ability to zoom in from a user error to a JVM garbage collection problem to packet loss on the network.
• Drive platform-wide solutions and provide technical leadership during their implementation
• Represent the operational reliability, availability, scalability of solutions in the wider organization.
• Customer and peer relationship focused with strong interpersonal and communication skills

Required Skills:
• Fluency with running distributed services at scale
• In-depth understanding of Unix/Linux systems internals and networking
• Source code (GitHub) and configuration management tools (Ansible, Chef, etc.)
• Software development experience in one or more of the following: Python, Go or Java
• Building and deploying Infrastructure as Code: CloudFormation/Terraform
• Building continuous integration and continuous delivery (CI/CD) pipelines in Jenkins, Spinnaker, or similar
• Operating and running Java services/APIs in AWS cloud infrastructure
• AWS systems and network protocols (e.g., ALB, R53, API-Gateway, TCP/IP, HTTP/HTTPS, DNS)
• Configuring, tuning and automating AWS services including Lambda, RDS, DynamoDB and ElastiCache.
• Container technologies and orchestration (e.g., Docker, Kubernetes, EKS, Fargate)
• Application monitoring tools: DataDog, CloudWatch, Splunk, Grafana
• Data Reporting & Analytics: SQL, MySQL, Oracle, or Big Data
• Operating and supporting large scale and/or critical customer-facing production services or applications

Experience:
• BS degree in Computer Science, Software Engineering, or related technical area
• 10+ years professional experience
• 5+ years AWS Cloud - deploying, tuning and operating Java/API services.
• 5+ years operating and supporting services in a production environment at scale

Jobcode: Reference SBJ-gpnwp0-18-117-196-184-42 in your application.

Company Profile
Sony Interactive Entertainment

Recognized as a global leader in interactive and digital entertainment, Sony Interactive Entertainment (SIE) is responsible for the PlayStation® brand and family of products and services.