Job Description
About Coral:
This is a dangerous time to be a journalist on the internet. Online comments are often filled with rumors, insults and threats, pushing away readers and reducing community engagement. It doesn't have to be this way.
The Coral team at Vox Media believes that healthy online conversation can exist, given the right systems and tools – and that a strong democracy depends on it. The Coral community platform now supports journalists on more than 180 news sites, helping them engage with their communities, share knowledge, empower discussions, and reduce the impact of trolls. And it doesn't share or sell anyone's data while doing it.
Coral users include The Washington Post, The Wall Street Journal, the Financial Times, New York Magazine, and the LA Times.
About the role:
Under the general supervision of Coral's SRE Engineering Manager, the Senior Site Reliability Engineer is responsible for the scaling, performance, availability and security of Coral's hosted client platform, websites, applications and services. The Senior SRE is also responsible for managing the tools and infrastructure that support them, and will play a primary role in leading and executing infrastructure initiatives from conception to production.
Our stack:
• Kubernetes, GKE, Google Cloud, Terraform, Docker
• MongoDB Atlas, Redis
• Node.js, Go, Python, GraphQL
• Our open source codebase: https://github.com/coralproject/talk
What you'll bring:
• Familiarity with our stack:
    • Kubernetes, GKE, Google Cloud, Terraform, Docker
    • MongoDB Atlas, Redis
    • Node.js, Go, Python, GraphQL
• Ability to right-size and plan capacity for high-availability, high-traffic SaaS infrastructure
• Experience owning, managing, scaling, optimizing and monitoring high-volume Kubernetes and cloud SaaS infrastructure
• Experience managing and utilizing GitOps and DevOps workflows
• Experience managing MongoDB, including familiarity with data exports, imports and ETL
What you'll do:
• Monitor and improve the stability and performance of Coral's hosted platform, websites, applications and services
• Implement and automate tools and processes to improve the reliability and efficiency of Coral's hosted platform, websites, applications and services
• Participate in the on-call rotation, responding to service interruptions and to stability and performance alerts
• Develop custom tools or replace existing tools when necessary to facilitate or improve monitoring, automation, performance and stability
• Assist with the development and implementation of contingency and disaster recovery plans
• Assist in the development of capacity and budget planning and forecasts
• Build out customer-facing hosted infrastructure that meets technical requirements for reliability, availability, efficiency and cost-effectiveness
• Configure and operate Google Cloud, GKE, Kubernetes, and other cloud tools and services
• Utilize and develop GitOps workflows to update and maintain Kubernetes deployments in GKE
• Utilize Terraform to declare, provision, and maintain Google Cloud resources
• Enhance, monitor and troubleshoot storage and backup systems to ensure reliability, performance and durability of data
• Manage and assist customers through their integration process
• Troubleshoot customer issues and update or create documentation where necessary to correctly address questions and concerns
• Investigate and reproduce customer and internally reported bugs and issues
• Reproduce and document steps that lead to unexpected behavior, and recommend fixes to the dev team where appropriate
• Evaluate existing software, applications and systems on a regular basis to ensure that critical security and stability patches or upgrades are applied
Are you passionate about this opportunity, but worried that you don't have 100% of the experience we're looking for? We still want to hear from you!