Introduction to SRE (Site Reliability Engineering)
Site Reliability Engineering (SRE) is what happens when you ask a software engineer to design an operations function. For a Professional Cloud Architect, SRE is the framework used to balance the need for velocity (releasing features) with the need for stability (reliability).
The core tenet of SRE is that 100% reliability is the wrong target for basically everything.
Plain English Explanation
Analogy 1 — SLI/SLO/SLA as the Speedometer, Speed Limit, and Fine
Imagine driving on a highway. The SLI is your speedometer — it measures your actual speed (e.g., the percentage of HTTP 200 responses over the last 5 minutes). The SLO is the posted speed limit — the internal target you commit to (e.g., 99.9% success rate over 30 days). The SLA is the traffic fine — a contractual penalty (service credits) you pay customers if you exceed the limit. The speedometer never argues with you, the speed limit guides driving behavior, and only the fine has legal force. In Cloud Monitoring, you configure the SLI as a "Service Level Indicator" object, attach an SLO with a compliance period, and let the platform burn-rate-alert when you drive too fast.
Analogy 2 — Error Budget as a Monthly Data Plan
An error budget is like a mobile data plan that resets each month. If your SLO is 99.9% over a 30-day window, you receive ~43 minutes of "downtime data" to spend. Shipping a risky Cloud Run revision is like streaming 4K video — it burns budget fast. A stable, well-tested release is like reading text articles — barely a dent. When the plan is depleted, the carrier (the error budget policy) throttles you: no new feature launches until the next billing cycle. Cloud Monitoring SLO burn-rate alerts are the "you've used 80% of your data" warning text.
Analogy 3 — Toil Reduction as Replacing a Cashier with a Self-Checkout Kiosk
Toil is the human cashier scanning every item by hand — slow, repetitive, and scaling linearly with customer count. Engineering work is the self-checkout kiosk: high upfront cost (designing the UX, integrating payment), but afterwards one staffer supervises ten lanes. On GCP, the cashier is the on-call engineer running gcloud compute instances reset at 3 AM. The kiosk is a Cloud Function triggered by a Pub/Sub topic from a Cloud Monitoring alert, which auto-restarts the unhealthy MIG. Same outcome, no human required, and the engineer sleeps through the night.
The Pillars of SRE: SLI, SLO, and SLA
Error budget: the quantified amount of unreliability a service is permitted to incur over a measurement window, calculated as 100% - SLO. For a 99.9% SLO over 28 days, the error budget is ~40 minutes of unavailability. When the budget is exhausted, the Error Budget Policy mandates a feature freeze until the budget recovers, making reliability and feature velocity a shared, data-driven trade-off.
Understanding the difference between these three terms is critical for the PCA exam.
- SLI (Service Level Indicator): A quantitative measure of some aspect of the level of service that is provided.
  - Example: "The percentage of successful HTTP requests."
- SLO (Service Level Objective): A target value or range of values for a service level that is measured by an SLI.
  - Example: "99.9% of HTTP requests must be successful over a rolling 30-day window."
- SLA (Service Level Agreement): A legal contract with users that defines the consequences (usually financial) if the SLO is not met. SREs care about SLOs; lawyers care about SLAs.
Always set your internal SLO stricter than the external SLA, typically by a fraction of a nine. If you promise customers 99.9% (SLA) but target 99.95% (SLO) in Cloud Monitoring, you have a safety margin before incurring SLA credits. Google's own products (e.g., Cloud Storage Multi-Region) publish 99.95% SLAs while internally targeting higher SLOs.
SLI/SLO/SLA Hierarchy and Selection
The three-layer hierarchy isn't just terminology — it dictates which Google Cloud surface you use and who gets paged.
Choosing the right SLI category
The SRE Workbook recommends choosing SLIs from a small menu of user-centric categories. Four cover most services. Pick from these, not from system metrics like CPU:
- Availability: fraction of valid requests answered successfully (2xx and 3xx count as good; 4xx client errors are excluded from the valid set).
- Latency: fraction of requests served faster than a threshold (e.g., 95% of requests under 300 ms).
- Quality: fraction of responses served without degraded mode (e.g., full search vs. cached fallback).
- Freshness: fraction of data served that is newer than an age threshold (critical for ETL pipelines into BigQuery).
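The availability and latency categories above both reduce to a good-over-valid ratio. The sketch below shows that arithmetic on a handful of in-memory records. The `Request` record, the 300 ms threshold, and the choice to drop 4xx responses from the valid set are illustrative assumptions, not Cloud Monitoring's own definitions.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request


def availability_sli(requests):
    """Good / valid, where 4xx client errors are dropped from the valid set."""
    valid = [r for r in requests if not 400 <= r.status < 500]
    if not valid:
        return None  # no valid traffic, so the SLI is undefined
    good = sum(1 for r in valid if r.status < 400)
    return good / len(valid)


def latency_sli(requests, threshold_ms=300.0):
    """Fraction of requests served faster than the threshold."""
    if not requests:
        return None
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 450.0),
    Request(302, 80.0),
    Request(404, 30.0),   # client error: excluded from availability
    Request(503, 900.0),  # server error: counts against availability
]
print(availability_sli(sample), latency_sli(sample))
```

In production the same ratios would come from monitoring time series rather than in-memory lists; the point is only that an SLI is a fraction of good events, never a raw system metric.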
In Cloud Monitoring, each SLI is a ServiceLevelIndicator resource with either a request_based or windows_based definition. Use request-based for stateless RPC services; use windows-based for stateful systems like databases.
Mapping SLO to compliance period
Cloud Monitoring supports rolling windows (any whole number of days from 1 to 30) and calendar windows (day, week, fortnight, month). For a customer-facing API, a 28-day rolling window is standard because it smooths over weekly patterns and gives leadership a stable monthly review. For batch pipelines, use a calendar month because billing and reporting align to that boundary.
Linking SLA back to the SLO
The SLA is a business document, not a monitoring config. But your SLA credits should be triggered by Cloud Monitoring SLO breaches. Use a Cloud Function subscribed to the SLO alert policy that writes a record to a "credits owed" BigQuery table, which Finance then reconciles monthly.
The Error Budget: Balancing Innovation and Risk
The Error Budget is the most powerful concept in SRE. It is calculated as: 100% - SLO.
- If your SLO is 99.9%, your Error Budget is 0.1%.
- If you have budget remaining: You can release features quickly, even if they carry some risk.
- If you have exhausted your budget: Feature releases stop. The team must focus on reliability, automation, and fixing the root cause of the failures.
Architect's Insight: The Error Budget creates a shared incentive between Developers (who want to move fast) and SREs (who want stability). If the budget is spent, everyone is responsible for restoring reliability.
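The 100% - SLO formula is easier to reason about once it is converted to minutes. The helper names below are hypothetical; the sketch assumes the budget is spent only by full unavailability, which is a simplification for request-based SLOs.

```python
def error_budget_minutes(slo_percent, window_days):
    """Allowed minutes of full unavailability: (100% - SLO) of the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent, window_days, bad_minutes):
    """Fraction of the budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return 1.0 - bad_minutes / budget


for days in (28, 30):
    print(days, round(error_budget_minutes(99.9, days), 1))
```

For a 99.9% SLO this gives roughly 40 minutes over 28 days and roughly 43 minutes over 30 days.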
Error Budget Policy in Practice
A budget without a policy is just a metric. The Error Budget Policy is a signed document that codifies what happens when the budget burns.
A worked example for a Cloud Run API
SLO: 99.9% availability over 28 days (≈ 40 min budget)
Burn rates:
- 14.4× burn for 1 hour → page on-call (fast burn alert)
- 6× burn for 6 hours → ticket the dev team (slow burn alert)
- 1× burn for 3 days → review at weekly SLO meeting
Configure these in Cloud Monitoring as multi-window, multi-burn-rate alert policies, Google's recommended pattern. A 14.4× burn rate sustained for one hour consumes about 2% of the 28-day budget, and if left unchecked it would exhaust the entire budget in under two days, which justifies an immediate page.
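The burn-rate thresholds in the worked example can be sanity-checked with two small formulas. This is a sketch with hypothetical helper names: a burn rate of 1× is defined as spending exactly the whole budget over the full window.

```python
def budget_consumed(burn_rate, alert_hours, window_days):
    """Fraction of the window's error budget spent at this burn rate."""
    return burn_rate * alert_hours / (window_days * 24)


def hours_to_exhaustion(burn_rate, window_days):
    """Hours until the whole budget is gone if the burn rate is sustained."""
    return window_days * 24 / burn_rate


WINDOW_DAYS = 28
for rate, hours in [(14.4, 1), (6, 6), (1, 72)]:
    spent = budget_consumed(rate, hours, WINDOW_DAYS)
    left = hours_to_exhaustion(rate, WINDOW_DAYS)
    print(f"{rate}x for {hours}h: {spent:.1%} spent, exhausted in {left:.0f}h")
```

The fast-burn tier pages because the budget would be gone within about two days; the slow-burn tiers leave enough time for a ticket or a weekly review.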
Policy escalation tiers
- Budget healthy (>50% remaining): Normal feature velocity. Canary deployments at 5%/25%/100%.
- Budget squeezed (10–50% remaining): Reduce canary blast radius; require staff-engineer signoff for risky changes; freeze chaos engineering exercises.
- Budget exhausted (0% remaining): Feature freeze. All engineering capacity diverted to reliability work (post-mortem action items, removing toil, fixing flaky tests). Only security patches and explicit "fix-forward" reliability changes deploy.
A common failure mode: teams silently lower the SLO target when the budget is exhausted to "unfreeze" releases. This is gaming the system. The Error Budget Policy must require leadership (Director+) approval before any SLO change, and the change should be backed by user research showing customers tolerate a lower bar.
Reducing Toil
Toil is work that is:
- Manual
- Repetitive
- Automatable
- Tactical (no long-term value)
- Grows linearly with service size
SRE teams aim to limit toil to 50% of their time. The other 50% must be spent on Project Work (engineering) that improves the system.
Toil Reduction Quantification
You cannot eliminate what you cannot measure. SRE teams track toil quarterly and treat reduction as a first-class engineering goal.
Measuring toil
Each on-call engineer keeps a toil log — typically a Google Sheet or a Jira "Toil" label. Every interrupt is tagged with:
- Duration (minutes)
- Category (manual restart, customer ticket triage, capacity bump, certificate rotation, etc.)
- Whether it was paged or self-discovered
Roll this up into a Looker Studio dashboard sourced from BigQuery. The goal is a toil percentage per engineer per quarter. If any individual exceeds 50%, that's a staffing/automation signal — not an "ask them to work harder" signal.
Automation ROI calculation
For each candidate automation, compute:
ROI = (hours_saved_per_year × loaded_hourly_cost) - engineering_cost_to_build
Payback period (years) = engineering_cost_to_build / (hours_saved_per_year × loaded_hourly_cost)
A 40-hour automation that saves 2 hours/week pays back in ~5 months. Anything with a payback under 12 months is a no-brainer. Above 24 months requires a strategic justification (e.g., the toil is also burning SLO budget).
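The 40-hour example above can be reproduced in a few lines. This sketch assumes the build cost is paid at the same loaded hourly rate as the toil it removes, so the payback period can be computed in hours alone. The rate of 100 per hour is a made-up figure.

```python
def automation_roi(hours_saved_per_year, loaded_hourly_cost, build_hours):
    """First-year ROI in currency units: value of time saved minus build cost."""
    build_cost = build_hours * loaded_hourly_cost
    return hours_saved_per_year * loaded_hourly_cost - build_cost


def payback_months(build_hours, hours_saved_per_week):
    """Months until the hours saved equal the hours spent building."""
    weeks = build_hours / hours_saved_per_week
    return weeks * 12 / 52


# 40 hours to build, 2 hours/week saved (104 hours/year), hypothetical rate.
print(round(payback_months(40, 2), 1))
print(automation_roi(104, 100.0, 40))
```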
Top toil candidates on GCP
- Certificate rotation: replace with Google-managed SSL certs on the global load balancer.
- VM patching: move to Cloud Run / GKE Autopilot where Google manages the host.
- Manual scaling: use Managed Instance Group autoscalers or Cloud Run min/max instances.
- Log triage: replace with Cloud Logging log-based metrics + alert policies routing to the right team.
Incident Management and Blameless Post-mortems
When things go wrong, SREs follow a structured process.
1. Incident Response
- Identify: Detect the failure (usually via SLO-based alerting).
- Triage: Determine the severity and impact.
- Mitigate: Restore service as quickly as possible (e.g., rollback, restart, failover). DO NOT try to find the root cause during mitigation.
2. Blameless Post-mortems
A document written after an incident to understand why it happened and how to prevent it.
- Blameless: Focus on system failures, not human errors. If a human made a mistake, the system should have had a guardrail to prevent it.
- Actionable: Every post-mortem must result in "Action Items" to fix the underlying issues.
Runbook Automation via Cloud Functions
Static runbook wikis age poorly — engineers paste outdated commands at 3 AM. Programmable runbooks invert this: the runbook is code, executed by a Cloud Function.
Pattern: alert-triggered remediation
Cloud Monitoring alert policy
        │  (notification channel = Pub/Sub topic)
        ▼
Pub/Sub topic: "ops-remediation"
        │
        ▼
Cloud Function (2nd gen, Python)
        │
        ├── Verify alert is still firing (idempotency)
        ├── Take action (restart MIG, drain pod, flush cache)
        ├── Post to Slack #incidents with what was done
        └── Update incident in Firestore
The Cloud Function uses a service account with least-privilege IAM — only the specific roles needed (e.g., roles/compute.instanceAdmin.v1 scoped to one MIG). Every action emits an Audit Log entry, so post-mortems can reconstruct exactly what the bot did.
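A minimal sketch of the handler logic in this pattern, written as plain Python so it runs without any cloud dependencies. It assumes the Pub/Sub payload is base64-encoded JSON with an `incident` object carrying `incident_id`, `state`, and `policy_name`. The `restart_mig` and `notify` callables and the in-memory `seen_incidents` set are hypothetical stand-ins for the Compute API call, the Slack post, and the Firestore record. A real function would also re-query the alert state instead of trusting the payload alone.

```python
import base64
import json


def make_message(payload):
    """Build a Pub/Sub-style message dict (used here only for local testing)."""
    return {"data": base64.b64encode(json.dumps(payload).encode("utf-8"))}


def decode_alert(pubsub_message):
    """Decode the base64 JSON payload carried in a Pub/Sub message dict."""
    raw = base64.b64decode(pubsub_message["data"]).decode("utf-8")
    return json.loads(raw)


def handle_alert(pubsub_message, restart_mig, notify, seen_incidents):
    """Run the remediation at most once per open incident."""
    incident = decode_alert(pubsub_message).get("incident", {})
    incident_id = incident.get("incident_id")
    if incident.get("state") != "open":
        return "skipped: incident not open"
    if incident_id in seen_incidents:
        return "skipped: already handled"  # duplicate delivery, do nothing
    seen_incidents.add(incident_id)
    restart_mig()
    notify(f"Restarted MIG for incident {incident_id} "
           f"(policy: {incident.get('policy_name')})")
    return "remediated"
```

Passing the side effects in as callables keeps the decision logic testable offline, and the duplicate check matters because Pub/Sub may deliver the same message more than once.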
When NOT to automate
Auto-remediation is dangerous when:
- The failure mode is unknown (the bot might amplify it — e.g., a "restart" loop on a corrupt persistent disk).
- The action is irreversible (deleting data, terminating a Spanner instance).
- The blast radius is global (don't auto-failover a multi-region service without a human).
Start with suggest-mode runbooks — the Cloud Function posts the remediation command to Slack with a "click to execute" button instead of running it directly. After 3 months of data showing zero false positives, promote it to fully automated. This builds team trust and surfaces edge cases safely.
Capacity Planning and Load Testing
Reliability fails most often not from code bugs but from running out of capacity headroom. SRE capacity planning is a quarterly cycle, not a one-off.
The capacity model
For each service, maintain a spreadsheet (or Looker Studio dashboard) with:
- Demand forecast: projected QPS / GB / connections per region for the next 4 quarters, driven by product launches and seasonality.
- Per-unit capacity: how much load one GKE pod or one Cloud Run instance can sustain at SLO (measured via load test, not guessed).
- Headroom target: typically 50% — i.e., today you run at 50% utilization so you survive a region failover that doubles regional load.
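The three inputs above combine into a single sizing formula. The sketch below uses made-up numbers (12,000 QPS peak, 400 QPS per instance); the function name is hypothetical, and the per-instance figure is assumed to come from a load test.

```python
import math


def instances_needed(peak_qps, qps_per_instance, headroom=0.5):
    """Instances required so peak load uses only (1 - headroom) of capacity."""
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable_qps = qps_per_instance * (1 - headroom)
    return math.ceil(peak_qps / usable_qps)


print(instances_needed(peak_qps=12_000, qps_per_instance=400))
print(instances_needed(peak_qps=12_000, qps_per_instance=400, headroom=0.0))
```

With 50% headroom the fleet is twice the size that raw throughput alone would suggest, which is the price of surviving a failover that doubles regional load.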
Load testing on GCP
- Locust or k6 on GKE Autopilot for distributed load generation (cheap, scales to millions of RPS).
- Cloud Monitoring uptime checks for a lightweight synthetic-traffic baseline through the load balancer (they probe availability and latency; they are not a load generator).
- For BigQuery / Spanner: use the performance overview dashboards to identify hotspot keys before they melt down in prod.
Stress testing failover
At least once a quarter, run a regional failover drill: drain us-central1, verify us-east1 absorbs the load within SLO, then drain back. This validates both your capacity model and your runbook. Use Cloud Deploy or a Terraform workflow to make the drain reversible in minutes.
SRE Practices on Google Cloud
GCP provides tools to implement SRE:
- Cloud Monitoring: Define and track SLIs and SLOs.
- Cloud Logging: Analyze logs for root cause analysis.
- Cloud Trace / Profiler: Identify performance bottlenecks.
- Error Reporting: Automatically group and track application crashes.
Regional vs Global Service Tiers
Not every service needs to be global. Tier your services to align reliability investment with business value.
Tier 0 — Global, multi-region active-active
- Examples: payment processing, authentication, the marketing homepage.
- GCP building blocks: Spanner multi-region (nam-eur-asia1), Cloud Storage multi-region buckets, global external HTTPS load balancer with backends in 3+ regions.
- SLO target: 99.99%+ availability, p95 latency budget split across regions.
- Failover RTO: seconds (DNS or Anycast handles it).
Tier 1 — Regional with cross-region failover
- Examples: customer dashboard, internal admin tools.
- GCP building blocks: regional GKE clusters in two regions, Cloud SQL with cross-region read replica + promotion, Cloud Storage dual-region buckets.
- SLO target: 99.9% availability.
- Failover RTO: 5–15 minutes (manual promotion + DNS update).
Tier 2 — Single region, multi-zone
- Examples: internal reporting, batch jobs, dev/staging environments.
- GCP building blocks: regional MIG across 3 zones, regional Cloud SQL, multi-zonal GKE.
- SLO target: 99.5% — accept regional outage as a recoverable event, restored from backup within RTO.
Global LB + regional backends ≠ global service. A global HTTPS load balancer routes traffic to the nearest healthy backend, but if your only backend is in us-central1, a regional outage takes you down. True global service requires backends in at least three regions across two continents for active-active.
Graceful Degradation Patterns
When a dependency fails, the worst outcome is total failure. Graceful degradation keeps the core experience alive while shedding non-essential features.
Feature-flag controlled degradation
- Wrap every non-essential feature (recommendations, personalization, fancy charts) in a feature flag managed by Firebase Remote Config or LaunchDarkly.
- In an incident, the on-call flips the flag to "off". Cloud Run / GKE pods pick up the change within seconds.
- The user sees a simpler page instead of a 500.
Read-only mode
For e-commerce and SaaS, implement a read-only mode activated when the primary database is degraded:
- Disable writes (cart additions, settings changes) via an Envoy-level flag.
- Serve cached product catalog from Memorystore Redis with extended TTL.
- Display a clear banner: "We're experiencing issues — purchases will resume shortly."
Stale-while-revalidate caching
Configure Cloud CDN with stale-while-revalidate and stale-if-error directives. When the origin (Cloud Run) is down, the CDN serves the last-cached response for up to N seconds, buying you time to fix the origin without user impact.
Static fallback page
For total origin failure, route traffic to a Cloud Storage-hosted static page via a load balancer backend bucket. Users see "We'll be back soon" instead of a connection-refused error.
Circuit Breaker (Anthos Service Mesh)
A circuit breaker stops cascading failures by failing fast when a downstream dependency is unhealthy.
How Anthos Service Mesh implements it
Anthos Service Mesh (ASM) ships with Envoy as the sidecar proxy. Configure circuit breaking via a Kubernetes DestinationRule:
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: payments-api
spec:
  host: payments.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
        maxRequestsPerConnection: 10
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 60s
      maxEjectionPercent: 50
What this buys you
- maxConnections: caps concurrent TCP connections so one runaway client cannot exhaust pool capacity for everyone.
- outlierDetection: ejects a pod from the load balancing pool after 5 consecutive 5xx errors, for 60 seconds. This isolates a sick replica without taking down the service.
- maxEjectionPercent: 50: ensures you never eject more than half the pool, preventing the breaker itself from causing an outage.
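To make the interaction between the error streak and the ejection cap concrete, here is a toy model of outlier detection. It is a simplified illustration, not Envoy's implementation: it omits the time-based re-admission controlled by baseEjectionTime and the periodic sweep controlled by interval, and the class and method names are hypothetical.

```python
class OutlierDetector:
    """Toy model: eject hosts after an error streak, up to a percentage cap."""

    def __init__(self, hosts, consecutive_5xx=5, max_ejection_percent=50):
        self.errors = {h: 0 for h in hosts}
        self.ejected = set()
        self.threshold = consecutive_5xx
        self.max_ejected = len(hosts) * max_ejection_percent // 100

    def record(self, host, status):
        """Track a response; eject the host once it hits the error streak."""
        if status < 500:
            self.errors[host] = 0  # any non-5xx response resets the streak
            return
        self.errors[host] += 1
        if (self.errors[host] >= self.threshold
                and host not in self.ejected
                and len(self.ejected) < self.max_ejected):
            self.ejected.add(host)

    def healthy_hosts(self):
        return [h for h in self.errors if h not in self.ejected]


detector = OutlierDetector(["a", "b", "c", "d"])
for host in ("a", "b", "c"):
    for _ in range(5):
        detector.record(host, 503)
print(detector.healthy_hosts())
```

Even though three of the four hosts hit the error streak, only two are ejected, because the 50% cap keeps half the pool in rotation.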
Observability hook
ASM exports circuit breaker metrics (envoy_cluster_outlier_detection_ejections_active) to Cloud Monitoring. Build a dashboard showing ejection events alongside SLO burn rate to correlate breakers tripping with user-visible failures.
Retry, Timeout, and Backoff Configuration
Misconfigured retries are the #1 cause of cascading failure in microservices. Three rules, in order of importance:
Rule 1 — Always set a timeout
A request without a timeout will eventually block forever, exhausting your thread/connection pool. On GCP:
- Cloud Run: request timeout default 5 min, max 60 min. Set it to the minimum your endpoint actually needs (often 10–30 s for an API).
- gRPC clients: always set a deadline, e.g. context.WithTimeout(ctx, 2*time.Second). Never pass a bare context.Background() to an RPC.
- HTTPS LB → backend: configure the backend service timeout (default 30s).
Rule 2 — Retry with exponential backoff and jitter
attempt 1: immediate
attempt 2: wait 1s ± random(0–500ms)
attempt 3: wait 2s ± random(0–1s)
attempt 4: wait 4s ± random(0–2s)
max retries: 3 (4 attempts total, not 10!)
Jitter prevents the thundering herd when many clients retry simultaneously after a brief outage. Google's google.api_core.retry library implements this correctly out of the box.
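For code that cannot use a library retry helper, the schedule above can be sketched by hand. This is an illustrative implementation, not the library's: the function name and the injectable `sleep` and `rng` parameters are assumptions made for testability. Real code should catch only errors known to be retryable instead of every `Exception`.

```python
import random
import time


def retry_with_backoff(call, max_retries=3, base_delay=1.0,
                       sleep=time.sleep, rng=random.random):
    """Call `call()`; on failure wait base_delay * 2**n with jitter, then retry."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            delay = base_delay * (2 ** attempt)
            # Symmetric jitter: shift the wait by up to half the delay.
            jitter = (2 * rng() - 1) * delay / 2
            sleep(delay + jitter)
```

With the defaults this makes one immediate attempt and up to three retries, waiting roughly 1 s, 2 s, and 4 s between them.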
Rule 3 — Only retry idempotent operations
A retried POST /payment can charge the customer twice. Use idempotency keys (a UUID the client generates and the server deduplicates) for any write operation. Reads (GET) and explicitly idempotent writes (PUT) are safe to retry.
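The deduplication step can be illustrated with an in-memory stand-in for the payment server. The `PaymentService` class and its fields are hypothetical; a real service would persist keys in a durable store with an expiry and would handle concurrent requests carrying the same key.

```python
import uuid


class PaymentService:
    """In-memory stand-in for a server that deduplicates by idempotency key."""

    def __init__(self):
        self._results = {}  # idempotency key -> stored response
        self.charges = []   # side effects actually performed

    def charge(self, idempotency_key, amount_cents):
        if idempotency_key in self._results:
            return self._results[idempotency_key]  # replay, no second charge
        self.charges.append(amount_cents)
        response = {"status": "charged", "amount_cents": amount_cents}
        self._results[idempotency_key] = response
        return response


service = PaymentService()
key = str(uuid.uuid4())            # generated once by the client per payment
first = service.charge(key, 1999)
retry = service.charge(key, 1999)  # a retried POST reuses the same key
```

The retry returns the stored response and the customer is charged once, which is what makes the write safe to retry.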
Retry budgets are non-negotiable. Even with backoff and jitter, unbounded retries can amplify a small outage 5–10×. Configure a per-call retry budget (e.g., maxRetries: 3) AND a circuit-breaker-style global retry rate limit at the Envoy/ASM layer (e.g., max 10% of requests can be retries). Without the global cap, a downstream brownout triggers retry storms that prevent recovery.
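A global retry cap can be sketched as a simple ratio check. This is a minimal illustration with hypothetical names, not how Envoy or ASM implements it: the counters here are lifetime totals, whereas a production budget would track a sliding time window.

```python
class RetryBudget:
    """Allow retries only while they stay under a fraction of total requests."""

    def __init__(self, max_retry_ratio=0.10):
        self.max_retry_ratio = max_retry_ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        """Count one original (non-retry) request."""
        self.requests += 1

    def can_retry(self):
        """Grant a retry if doing so keeps retries within the budget."""
        if self.requests == 0:
            return False
        if (self.retries + 1) / self.requests > self.max_retry_ratio:
            return False
        self.retries += 1
        return True


budget = RetryBudget(max_retry_ratio=0.10)
for _ in range(100):
    budget.record_request()
granted = sum(1 for _ in range(50) if budget.can_retry())
print(granted)
```

During a brownout most retry attempts are refused, so the struggling dependency sees at most about 10% extra load instead of a multiple of normal traffic.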
On-Call Training and Rotation
Reliability is a team capability, not an individual heroics story. A mature on-call program protects both the service and the engineers.
Shadow on-call rotation
New engineers spend their first month as shadow on-call — paired with the primary, observing pages, but not authorized to take action. This builds familiarity with:
- Cloud Monitoring dashboards and alert syntax
- The team's runbook conventions
- How to declare an incident and run an incident channel
Wheel of Misfortune (tabletop drills)
Monthly, the team runs a 1-hour Wheel of Misfortune session. An experienced engineer picks a real past incident (or invents a plausible one) and the trainee narrates how they would respond — what dashboards to open, which gcloud command to run, who to escalate to. The "Game Master" reveals consequences. Cheap, high signal, zero production risk.
Disaster Recovery game days
Quarterly, run a real game day: inject a real failure (kill a GKE node pool, throttle a Cloud SQL replica via stress test) in a staging environment that mirrors prod. The on-call rotation responds as if it were prod. Capture timing data: how long to detect, mitigate, document.
On-call health metrics
Track and act on:
- Pages per shift: > 2 per night = unsustainable, automate or alert-tune.
- Sleep disruption rate: pages between 22:00–06:00.
- Voluntary retention: if engineers ask to leave the rotation, the rotation is broken — not the engineers.
Compensate fairly: many companies pay an on-call stipend (cash or equivalent comp time) for nights and weekends.
FAQ — Reliability Engineering Processes
Q1. Why not aim for 100% reliability?
Because the cost of going from 99.99% to 100% is astronomical, and your users won't notice the difference when their own ISP or device is only about 99% reliable. The effort that "extra" reliability would cost is better spent on feature velocity.
Q2. What is the difference between a "User-facing" SLI and a "System-level" SLI?
User-facing SLIs measure experience (e.g., request latency). System-level SLIs measure utilization (e.g., CPU usage). SREs prioritize User-facing SLIs for SLOs because users don't care about your CPU; they care if the page loads.
Q3. How do I start SRE in a traditional "Siloed" organization?
Start by defining SLOs for one service. Get buy-in on the concept of an Error Budget. Use the budget to justify spending time on automation rather than just "putting out fires."
Q4. Can I have an SLO without an SLI?
No. You cannot set an objective (SLO) if you have no way to measure (SLI) whether you are meeting it.
Q5. What are the "Golden Signals" in monitoring?
The four signals you should always monitor for any service: Latency, Traffic, Errors, and Saturation.
Final Architect Tip
On the PCA exam, if a question asks how to handle a team that is "Releasing bugs too often," the answer is "Implement an Error Budget and freeze releases when it is exhausted." If a question asks how to improve a manual process, the answer is "Reduce Toil through automation." Always remember: SRE is about Data-driven decisions regarding risk.