Site Reliability Engineer

Research & Development ∙ Remote ∙ July 3, 2025

Our client is a leading WebOps platform that powers the open web by hosting high-performance websites in the cloud for global organizations such as Stitch Fix, Okta, Home Depot, Pernod Ricard, and The Barack Obama Foundation. Every day, thousands of developers and marketers use the platform to build, iterate, and scale websites that reach billions of users worldwide.

This SaaS-based solution helps web and digital teams of all sizes improve performance through powerful tools for site management, governance, security, and collaboration. With built-in support for developing, testing, deploying, and running websites — all with best-in-class speed, scalability, and uptime — the platform enables teams to succeed in a fast-paced digital environment.

With over 35% of the web powered by open-source technologies and a $200B+ total addressable market, the company is scaling rapidly and expanding its world-class team. Headquartered in San Francisco, it is the trusted solution for managing high-value WordPress and Drupal websites. Their greatest strength lies in the creativity, passion, and collaboration of their people.

Position overview:

We are looking for a Site Reliability Engineer (SRE) to join the client's engineering team. The SRE will help scale and support a platform that powers hundreds of thousands of websites, runs millions of containers, and serves billions of page views each month.

The role involves maintaining and evolving a Kubernetes-based, cloud-native infrastructure, custom CI/CD pipelines, distributed file systems, and internal tooling designed to manage containers at scale. The company is an active contributor to open-source communities such as WordPress, Drupal, Fedora, Chef, systemd, cURL, Kubernetes, Terraform, and Sensu.

Responsibilities

Architect and implement global-scale systems using cutting-edge tools on Google Cloud Platform;
Improve the reliability and scalability of the Pantheon platform using technologies such as Kubernetes, Prometheus, Go, and Terraform;
Collaborate with engineering teams to help define and achieve Service Level Objectives (SLOs);
Maintain and enhance infrastructure components, including observability, monitoring, and Kubernetes management;
Drive continuous improvements in engineering practices and standards for testing, deployment, and development workflows;
Participate in an on-call rotation to support platform stability and performance.

Technical / Hard Skills:

Site Reliability Engineering (SRE) practices;
Kubernetes – orchestration and infrastructure management;
Prometheus – monitoring and alerting;
Google Cloud Platform (GCP) – cloud infrastructure;
Go (Golang) – programming language;
Python, Ruby, Bash – scripting and automation;
Terraform – infrastructure as code (IaC);
CI/CD pipelines – design and maintenance (GitHub Actions, CircleCI);
Monitoring & Metrics Systems – design and implementation;
Observability Engineering – instrumentation, logging, tracing;
Distributed Systems – design and maintenance;
Multi-tenant Architecture – platform and resource isolation;
Containers & Container Management at Scale – Docker and orchestration;
Service Level Objectives (SLOs) – definition and enforcement;
Automation of Infrastructure Tasks;
Version Control Systems – Git;
Incident Management / On-call Operations;
Security & Governance in SaaS Environments;
Open-source Contributions (optional) – familiarity with communities like WordPress, Drupal, Chef, systemd, etc.;
Linux Systems – administration and troubleshooting.

Requirements

Proven experience working with high-traffic, large-scale platforms in production environments;
Deep interest in monitoring, metrics, and SRE principles such as SLOs and error budgets;
Strong preference for automation over manual processes ("toil");
Proficiency in one or more programming languages such as Go, Python, or Ruby (optional);
Excellent English communication skills, with the ability to convey complex ideas clearly and collaborate effectively across teams;
Team-oriented mindset and pride in contributing to shared success.

We offer excellent benefits, including but not limited to

People-oriented management without bureaucracy;
Competitive compensation;
Flexible schedule;
20 working days of annual paid vacation;
Paid sick leave;
Friendly and engaging professional team;
Opportunities for self-realization, career, and professional growth;
Accounting and legal support.
