Implementing SLIs, SLOs, and SLAs

Master the core principles of Site Reliability Engineering (SRE) by defining SLIs, setting SLOs, and understanding SLAs in Google Cloud.

Introduction to SRE Reliability Metrics

Site Reliability Engineering (SRE) is a discipline that applies software engineering mindsets to system operations. At the heart of SRE are three acronyms that define how we measure and promise reliability: SLI, SLO, and SLA.

For a Professional Cloud Architect, understanding these is critical for balancing the need for feature velocity (moving fast) with the need for stability (not breaking things).


Plain-Language Explanation: SLI, SLO, SLA

Analogy 1 — The Pizza Delivery Promise

  • SLI (Indicator): This is the stopwatch. It measures exactly how long it takes from the phone call to the pizza arriving at your door (e.g., 28 minutes).
  • SLO (Objective): This is the internal goal the manager sets for the staff: "95% of our pizzas must arrive in under 30 minutes." If we hit this, the manager is happy.
  • SLA (Agreement): This is the external promise to the customer: "If your pizza takes longer than 30 minutes, it's free." If we break this, we lose money.

Analogy 2 — The Exam Grade

  • SLI: Your actual score on a practice test (e.g., 85%).
  • SLO: The grade you want to get into a good university (e.g., "I need to average 90% over the semester").
  • SLA: The scholarship requirement. If your average drops below 80%, you lose your funding. That loss is the consequence.

Analogy 3 — The Car's Reliability

  • SLI: The percentage of time your car successfully starts when you turn the key.
  • SLO: Your personal expectation: "My car should start 99.9% of the time so I'm not late for work."
  • SLA: The manufacturer's warranty. If the car fails to start because of a defect, the dealer pays for the repair.

Key term: Error Budget. The amount of unreliability (1 - SLO) that a service is allowed to have. It represents the space for taking risks, such as deploying new features.


Defining the Three Pillars

1. SLI (Service Level Indicator)

A quantitative measure of some aspect of the level of service provided.

  • Common SLIs: Availability (Success/Total), Latency (Response Time), Throughput (Requests/sec), Durability (Data persistence).
  • Format: Usually a ratio: (Good Events / Total Events) * 100%.

2. SLO (Service Level Objective)

A target value or range of values for a service level that is measured by an SLI.

  • Purpose: To define a "happy state" for the system.
  • Example: "99.9% of requests must have a latency of less than 200ms over a rolling 30-day window."

3. SLA (Service Level Agreement)

An explicit or implicit contract with your users that includes consequences for missing SLOs.

  • Consequences: Usually financial credits, service extensions, or legal penalties.
  • Note: SREs usually focus on SLOs, while Legal/Business teams focus on SLAs. The SLO is almost always stricter than the SLA (e.g., SLO is 99.9%, SLA is 99.5%).

The Error Budget: Balancing Risk and Speed

The Error Budget is the most powerful concept in SRE.

  • Formula: 100% - SLO% = Error Budget.
  • Logic: If your SLO is 99.9%, you have a 0.1% "budget" for failure.
  • Action:
    • If the budget is full, you can deploy new features rapidly and take risks.
    • If the budget is exhausted, you must freeze new deployments and focus 100% on reliability until the budget recovers.
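
The formula and the two budget states above can be sketched in a few lines of Python. This is a minimal illustration, not a Cloud Monitoring API: the function names and the event counts in the usage note are hypothetical.

```python
def error_budget(slo_percent: float) -> float:
    """Error budget as a percentage: 100% - SLO%."""
    if not 0 < slo_percent < 100:
        raise ValueError("SLO must be strictly between 0 and 100")
    return 100.0 - slo_percent


def budget_remaining(slo_percent: float, good_events: int, valid_events: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    # Number of bad events the SLO tolerates for this volume of traffic.
    allowed_bad = valid_events * error_budget(slo_percent) / 100.0
    actual_bad = valid_events - good_events
    return 1.0 - actual_bad / allowed_bad
```

With a 99.9% SLO and 1,000,000 valid requests, about 1,000 failures are allowed. If 250 have occurred, budget_remaining(99.9, 999_750, 1_000_000) is roughly 0.75, so three quarters of the budget is still available for risky changes.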

Architect's Insight: On the PCA exam, if a question asks how to handle a conflict between a "Product Team wanting to ship features" and an "Ops Team wanting stability," the answer is to establish an Error Budget. This provides a data-driven way to decide when to move fast and when to slow down.


Selecting the Right SLI: The Five SLI Families

Not every metric makes a good SLI. The Google SRE workbook groups user-facing SLIs into five families, and PCA scenarios often ask you to pick the right one for a given workload.

Availability SLI

The fraction of requests that succeed: successful_requests / total_valid_requests. In Cloud Monitoring's SLO UI this maps to a request-based SLI built from loadbalancing.googleapis.com/https/request_count, counting as good every request whose response_code_class is not 500 (the label holds the numeric class 200, 300, 400, or 500 rather than the string "5xx"). Use it for synchronous HTTPS APIs, GKE Ingress, or App Engine front ends.

Latency SLI

The fraction of requests faster than a threshold: e.g., requests_with_latency_lt_300ms / total_requests. Cloud Monitoring exposes loadbalancing.googleapis.com/https/total_latencies as a distribution metric. Pick the threshold from the latency users actually tolerate, then set the SLO target to the share of requests (typically 95% or 99%) that must come in under it.
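
As a rough sketch of the ratio described here, the following computes a latency SLI from raw per-request latencies. The function name and the sample values are hypothetical. In practice Cloud Monitoring derives the same ratio from the distribution metric rather than from individual samples.

```python
def latency_sli(latencies_ms: list[float], threshold_ms: float = 300.0) -> float:
    """Fraction of requests that completed faster than the threshold."""
    if not latencies_ms:
        raise ValueError("need at least one request to compute an SLI")
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)
```

For the samples [120, 250, 310, 90] and a 300 ms threshold, three of the four requests are good, so the SLI is 0.75.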

Throughput SLI

For batch and streaming pipelines (Dataflow, Pub/Sub, Dataproc), measure throughput as the rate of work completed against a floor: e.g., "Dataflow job processes ≥ 50K events/minute". Pub/Sub subscriptions use subscription/oldest_unacked_message_age as a freshness proxy.

Quality SLI

For graceful-degradation systems (recommendation engines, search), measure the fraction of requests served by the full-fidelity path versus a cheaper fallback. Useful for personalization services running on Vertex AI endpoints with cached fallbacks.

Correctness / Freshness SLI

For data pipelines and BigQuery scheduled queries, measure (rows_passing_validation / total_rows) or data_age < freshness_target. Crucial for ML feature stores in Vertex AI Feature Store where stale features silently degrade model quality.
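
Both measurements in this family reduce to simple checks. The sketch below assumes you already have a validation count and a last-load timestamp for the table. The function names are hypothetical and nothing here calls BigQuery or Vertex AI.

```python
from datetime import datetime, timedelta


def correctness_sli(rows_passing_validation: int, total_rows: int) -> float:
    """Fraction of rows that passed validation."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return rows_passing_validation / total_rows


def is_fresh(last_loaded_at: datetime, now: datetime, freshness_target: timedelta) -> bool:
    """True when the data's age is strictly below the freshness target."""
    return (now - last_loaded_at) < freshness_target
```

A table loaded 30 minutes ago meets a one-hour freshness target, while one loaded yesterday does not, even though every query against it still returns successfully.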

PCA exam pattern: when a scenario says "users complain the search results are stale" or "the dashboard shows yesterday's numbers", the right SLI family is freshness / correctness, not availability. Availability would still report 100% while the system is broken from the user's perspective.


Good Events vs Bad Events: Defining the Ratio Precisely

Every SLI in Cloud Monitoring is ultimately good_events / valid_events. Getting this denominator wrong is the single most common SLO mistake.

What counts as "valid"

  • Exclude synthetic probes, internal health checks, and load tests. A 200 OK from an uptime check is not a real user event.
  • Exclude 4xx responses caused by client errors (400, 401, 403, 404) — these are not service failures. But include 429 (rate limit) if your rate limiter is mis-tuned, because that is your fault.
  • Include 5xx, timeouts, and unhandled exceptions surfaced via error_reporting.
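
The rules above can be expressed as two small predicates. This is a sketch under stated assumptions: the Request type is hypothetical, a status of 0 stands in for a timeout, and every 4xx other than 429 is treated as a client error.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int              # HTTP status code; 0 represents a timeout (assumption)
    synthetic: bool = False  # uptime checks, health checks, load tests


def is_valid(req: Request) -> bool:
    """Valid events form the denominator of the SLI."""
    if req.synthetic:
        return False
    # Client errors are excluded, except 429, which is counted against the service.
    if 400 <= req.status < 500 and req.status != 429:
        return False
    return True


def is_good(req: Request) -> bool:
    """Good events are valid events that returned a non-error response."""
    return is_valid(req) and 0 < req.status < 400


def availability_sli(requests: list[Request]) -> float:
    valid = [r for r in requests if is_valid(r)]
    if not valid:
        raise ValueError("no valid events in the window")
    return sum(1 for r in valid if is_good(r)) / len(valid)
```

For the statuses 200, 200, 404, 500, 429, a synthetic 200, and a timeout, five events are valid and two are good, giving an SLI of 0.4. Counting the 404 and the probe in the denominator would have produced a different and misleading figure.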

Window types

Cloud Monitoring supports two SLO compliance window types:

  • Calendar-based: aligns to weeks, months, or quarters. Useful for SLAs that match billing cycles.
  • Rolling: typically 28 or 30 days. Better for engineering decisions because it does not reset on the first of the month and hide a bad week.

Compound SLIs

A "session success" SLI may require login + search + checkout to all succeed. Implement this as a derived metric in Cloud Monitoring or a log-based metric over Cloud Logging, never as the arithmetic product of three independent SLOs (which double-counts correlated failures).
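
A small numeric sketch shows why the product is misleading when failures are correlated. The session records below are invented for illustration: three sessions succeed at every step and one session fails at all three.

```python
def session_success_sli(sessions: list[dict[str, bool]]) -> float:
    """Fraction of sessions in which every step succeeded."""
    return sum(1 for s in sessions if all(s.values())) / len(sessions)


def product_of_step_rates(sessions: list[dict[str, bool]]) -> float:
    """Naive estimate: multiply the per-step success rates together."""
    rate = 1.0
    for step in sessions[0]:
        rate *= sum(1 for s in sessions if s[step]) / len(sessions)
    return rate


ok = {"login": True, "search": True, "checkout": True}
broken = {"login": False, "search": False, "checkout": False}
sessions = [ok, ok, ok, broken]
```

Here session_success_sli(sessions) is 0.75, because only one user journey failed. product_of_step_rates(sessions) is 0.75 * 0.75 * 0.75 = 0.421875, because the same broken session is counted once per step.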


Choosing SLO Targets by Service Tier

A single global "99.9% for everything" target is an anti-pattern. Tier services by user impact and pick targets accordingly.

Tier 0 — Revenue-critical synchronous APIs

Examples: checkout, payment authorization, login. Target 99.95% availability and 99% of requests under 300 ms over a 28-day rolling window. The error budget is roughly 21 minutes of downtime per month, enough for one rolling-restart deployment per quarter but not weekly.

Tier 1 — User-facing read APIs

Examples: product catalog, search, dashboards backed by BigQuery BI Engine. Target 99.9% availability. Error budget is ~43 minutes/month, leaving room for canary failures and dependency hiccups.

Tier 2 — Async / batch / internal

Examples: Cloud Composer DAGs, Dataflow nightly aggregations, Cloud Run admin tools. Target 99.5% job-success-rate over 7 days, plus a freshness SLO ("data available by 06:00 UTC"). Do not waste engineering effort chasing 99.99% here.

Why not 99.99% everywhere

99.99% means ~4.3 minutes/month of unavailability. At this level, your service's reliability is dominated by dependencies you do not control: GCE live migration, regional networking, third-party DNS. Pushing a target higher than the service's dependencies allow burns engineering time without improving user experience.
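
The downtime figures quoted for each tier follow from one line of arithmetic. The helper below is a hypothetical sketch that assumes a 30-day month unless told otherwise.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full unavailability an availability SLO tolerates per window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


for target in (99.5, 99.9, 99.95, 99.99):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} min per 30 days")
```

Over 30 days this gives 216 minutes at 99.5%, 43.2 at 99.9%, 21.6 at 99.95%, and 4.32 at 99.99%. Each extra nine cuts the budget by a factor of ten, which is why the target should follow the cost of downtime rather than a blanket rule.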

Common PCA trap: A question describes a B2B internal admin tool used by 5 employees and asks for the SLO target. The "best practice" answer is not 99.99%. The right answer matches the cost of downtime — usually 99% or 99.5% is plenty, freeing error budget for the customer-facing tier.


The Error Budget Policy: Turning a Number into Action

An error budget without a written policy is just a metric. The policy converts budget state into team behavior, and PCA scenarios test whether you know what each state triggers.

Budget healthy (> 50% remaining)

  • Feature releases proceed on the normal cadence.
  • Risky migrations (region failover drills, schema changes) are scheduled here.
  • SREs spend ≥ 50% of time on project work, not toil.

Budget warning (10-50% remaining)

  • Slow the release train: canary for longer, require staged rollouts via Cloud Deploy.
  • Pause infrastructure experiments (e.g., new GKE node pool migrations).
  • Mandate a postmortem-action-item burndown sprint.

Budget exhausted (≤ 0% remaining)

  • Hard freeze on feature deploys. Only reliability fixes and security patches ship.
  • Page the SRE on-call and notify the product VP.
  • Postmortems become blocking — no new features until action items land.
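
One way to make the policy mechanical is to map the remaining budget to a state. The sketch below is illustrative only. It treats a budget of exactly zero as exhausted and, as an assumption, folds the band between 0% and 10% (which the thresholds above leave unspecified) into the warning state.

```python
def budget_state(remaining_fraction: float) -> str:
    """Map remaining error budget (1.0 = untouched, 0.0 = spent) to a policy state."""
    if remaining_fraction <= 0.0:
        return "exhausted"  # hard freeze: only reliability fixes and security patches
    if remaining_fraction <= 0.5:
        return "warning"    # slow the release train, pause experiments
    return "healthy"        # normal cadence, schedule risky migrations
```

A release pipeline could call budget_state before promoting a build and refuse to proceed on "exhausted" unless the change is tagged as a reliability or security fix.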

Who owns the policy

The policy is signed by Engineering, SRE, and Product before the SLO goes live. Cloud Architects on PCA scenarios should propose this tripartite agreement when a question asks "how do you stop the dev team from arguing with ops every time there's an outage?"


Burn Rate Alerting: Multi-Window Multi-Burn

Alerting on "SLO is below target" is too late — by then you have already broken the promise. SRE best practice is multi-window multi-burn-rate alerting, natively supported by Cloud Monitoring SLO alerts.

The burn rate formula

burn_rate = (bad_events_in_window / valid_events_in_window) / (1 - SLO), i.e. the observed error rate divided by the error rate the SLO allows. A burn rate of 1.0 means you are spending budget at exactly the rate that exhausts it over the SLO window. A burn rate of 14.4 means at this pace you will burn a 30-day budget in ~2 days (30 / 14.4 ≈ 2.1).
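
A short sketch of the calculation, reading the error rate as bad events divided by valid events in the window. The function names and the request counts are hypothetical.

```python
def burn_rate(bad_events: int, valid_events: int, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows (slo as 0.999)."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    return (bad_events / valid_events) / (1.0 - slo)


def days_to_exhaustion(rate: float, slo_window_days: int = 30) -> float:
    """Days until the whole budget is spent if the burn rate stays constant."""
    return slo_window_days / rate
```

With a 99.9% SLO, 144 failures out of 10,000 requests is an error rate of 1.44%, which is a burn rate of about 14.4. At that pace days_to_exhaustion returns roughly 2.08 days for a 30-day window.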

Severity | Burn rate | Short window | Long window | Budget consumed before alert
Page (fast) | 14.4 | 5 min | 1 hour | 2%
Page (slow) | 6 | 30 min | 6 hours | 5%
Ticket | 1 | 6 hours | 3 days | 10%

Why two windows

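Both windows must exceed the burn-rate threshold before the alert fires. The long window confirms that the burn is sustained and significant, suppressing false positives from a single bad minute. The short window confirms that the burn is still happening, so the alert clears quickly once a bad deploy is rolled back. Cloud Monitoring's SLO alerting UI configures both with one policy and routes to PagerDuty / Opsgenie via notification channels.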
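
The two-window condition and the "budget consumed" column can be checked with a few lines. This is a sketch, not the Cloud Monitoring implementation: the function names are hypothetical and a 30-day SLO window is assumed.

```python
def should_fire(short_window_burn: float, long_window_burn: float, threshold: float) -> bool:
    """Alert only when both the short and the long window exceed the threshold."""
    return short_window_burn > threshold and long_window_burn > threshold


def budget_consumed(threshold: float, long_window_hours: float, slo_window_days: int = 30) -> float:
    """Fraction of the budget spent if the threshold burn rate holds for the long window."""
    return threshold * long_window_hours / (slo_window_days * 24)
```

budget_consumed(14.4, 1) is 0.02, budget_consumed(6, 6) is 0.05, and budget_consumed(1, 72) is 0.10, matching the 2% / 5% / 10% figures. A one-minute spike that lifts the short-window burn rate to 20 while the one-hour rate stays at 3 does not page, because should_fire(20, 3, 14.4) is False.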

Cost discipline

Multi-burn alerting fires ~10x less often than naive threshold alerts, which is the entire point. On-call fatigue is itself a reliability risk.

Burn-rate cheat sheet for PCA: Page fast at burn-rate 14.4 over a 1-hour long window (consumes 2% of a 30-day budget). Page slow at burn-rate 6 over a 6-hour window (5%). File a ticket at burn-rate 1 over 3 days (10%). Memorise the trio 14.4 / 6 / 1.

Cloud Monitoring specifics: Define burn-rate alerts in the Services section under your SLO, not as standalone metric alerts. The SLO-aware alert automatically uses the same SLI definition and good/valid event filters, eliminating drift between what you measure and what pages you.


Service Monitoring on Google Cloud

The Services feature in Cloud Monitoring (formerly "Stackdriver Service Monitoring") is the PCA-blessed way to operationalize SLOs.

Auto-detected services

  • App Engine services and versions
  • Cloud Run services
  • GKE workloads with Istio / Anthos Service Mesh
  • Cloud Endpoints / API Gateway

For these, Cloud Monitoring proposes default SLIs based on Istio request metrics or HTTPS load balancer metrics — accept them as a starting point, then tune the threshold.

Custom services

For VMs, BigQuery scheduled queries, or Dataflow jobs, create a custom service and attach SLIs sourced from:

  • A standard metric (e.g., dataflow.googleapis.com/job/is_failed)
  • A log-based metric defined in Cloud Logging
  • A user-defined Monitoring metric written by OpenTelemetry exporters

SLO API and Terraform

Define SLOs as code via the google_monitoring_slo Terraform resource. Storing SLOs in version control is the only way to prevent silent drift when engineers tweak thresholds in the console to dodge an alert.
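
For teams that script against the Monitoring API instead of Terraform, the same definition can be kept in version control as a JSON request body. The sketch below only builds that body and makes no API call. The field names follow the ServiceLevelObjective REST resource as I understand it, the helper name is hypothetical, and the good-request filter is deliberately left as a placeholder to fill in from the Monitoring filter documentation.

```python
import json


def request_based_slo(display_name: str, goal: float, rolling_days: int,
                      good_filter: str, total_filter: str) -> dict:
    """Build a request-based SLO body with a good/total ratio SLI."""
    return {
        "displayName": display_name,
        "goal": goal,  # e.g. 0.999 for 99.9%
        "rollingPeriod": f"{rolling_days * 86400}s",  # duration expressed in seconds
        "serviceLevelIndicator": {
            "requestBased": {
                "goodTotalRatio": {
                    "goodServiceFilter": good_filter,
                    "totalServiceFilter": total_filter,
                }
            }
        },
    }


TOTAL = 'metric.type="loadbalancing.googleapis.com/https/request_count"'
GOOD = TOTAL + " <placeholder: label filter selecting non-5xx responses>"

slo = request_based_slo("Checkout availability", 0.999, 28, GOOD, TOTAL)
print(json.dumps(slo, indent=2))
```

Committing the generated JSON (or the Terraform equivalent) means any change to the goal or the filters shows up in code review instead of being made silently in the console.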

Error budget burn integration

The Services dashboard shows current SLO compliance, error budget remaining, and burn rate trend on a single pane. Link it from your runbook so on-call engineers see budget state before deciding to roll back.


Toil Reduction and the 50% Rule

The SRE book defines toil as manual, repetitive, automatable, tactical work that scales linearly with service growth. Examples on GCP: manually rotating GKE node pools, hand-running BigQuery backfills, ticking through Cloud Logging entries to ack alerts.

The 50% cap

SREs cap operational/toil work at 50% of their time. The remaining 50% is engineering: automation, capacity planning, SLO refinement. When toil exceeds 50%, the team hands services back to the dev team until automation catches up.

GCP automation levers

  • Cloud Workflows / Cloud Composer to orchestrate runbook steps that previously needed humans.
  • Cloud Functions triggered by Eventarc to auto-remediate (e.g., scale up a node pool on kube_pod_status_reason="OutOfcpu").
  • Terraform + Config Sync to eliminate "snowflake" cluster configs that generate toil during incidents.
  • Cloud Deploy with automated canary analysis instead of human-gated promotions.

Toil as an SLI input

Track toil hours per engineer per week as its own metric in BigQuery. When the trend rises, it predicts SLO regressions 4-6 weeks out, because tired humans push more bad changes.


Blameless Postmortems and the SLO Consumer Compact

When an SLO is missed, the response is a blameless postmortem, not a finger-pointing exercise. The PCA exam frames this as a process question: who attends, what's the output, who owns the follow-up?

Postmortem template (Google's open-source format)

  • Summary: 2-3 sentences, blameless tone.
  • Impact: requests affected, error budget consumed, revenue/users impacted.
  • Root causes: contributing factors, not "human error".
  • Trigger: the specific change or event that started the incident.
  • Resolution: what stopped the bleeding.
  • Action items: each with an owner, a Jira/Issue ID, and a due date.

The SLO consumer compact

When your service is a dependency of someone else's SLO, you sign a consumer compact: a written agreement that your SLO is stricter than theirs and that you will share burn rate dashboards. On GCP this often means publishing a Cloud Monitoring dashboard via IAM-shared workspace so downstream teams see your error budget in real time.

Observability budget

Logs, metrics, and traces cost money — Cloud Logging beyond 50 GiB/project/month is billable, and high-cardinality custom metrics in Cloud Monitoring add up fast. Set an observability budget (e.g., 5% of cloud spend) and track it like an error budget. Drop debug-level logs when over budget rather than letting Finance turn off log retention mid-incident.

Cap Cloud Logging cost by routing high-volume application logs through a Log Router sink to BigQuery (cheap) or Cloud Storage (cheapest) and excluding them from the _Default log bucket. Keep alerting-grade logs in Cloud Logging; archive everything else.


FAQ — SLIs, SLOs, and SLAs

Q1. Who should define the SLOs?

SLOs should be a collaborative effort between Product Owners (who understand user needs) and SREs/Architects (who understand system capabilities).

Q2. Is "100% Availability" a good SLO?

No. 100% is never a realistic or desirable goal. It makes the system too expensive and prevents any change (since every change carries risk). As the Google SRE book puts it, 100% is the wrong reliability target for basically everything.

Q3. How do I measure SLOs in Google Cloud?

Use Cloud Monitoring. It has a dedicated "Services" section where you can define SLIs (based on metrics or logs) and set SLO targets. It then automatically tracks your Error Budget burn rate.

Q4. What is a "User-Centric" SLI?

A user-centric SLI measures what the user actually feels. Instead of measuring "Server CPU," measure "Home Page Load Time" or "Successful Checkout Rate."

Q5. What happens when an SLO is missed?

Missing an SLO triggers a policy change (e.g., stopping releases), not necessarily a middle-of-the-night page (unless the budget is burning so fast that you will miss the SLA).
