DevOps and Site Reliability Engineering (SRE)

What Are DevOps and SRE on Google Cloud?

DevOps is a culture, a set of practices, and a toolchain that brings software development (Dev) and IT operations (Ops) into a single, continuous loop. Site Reliability Engineering (SRE) is Google's discipline for running reliable services at scale — it is what happens when you apply software engineering to operations problems. Both ideas show up repeatedly on the Cloud Digital Leader (CDL) exam because Google Cloud productizes them through services like Cloud Build, Cloud Deploy, Artifact Registry, and the Google Cloud Operations Suite.

For the CDL exam you are not asked to write a Cloud Build pipeline YAML file or compose a Terraform module. Instead you are asked business-leader questions: "Why does an organization adopt DevOps?" "What is an error budget and how does it shape release decisions?" "Which Google Cloud service automates progressive rollouts to GKE and Cloud Run?" This study note builds the vocabulary, the mental models, and the Google Cloud product mapping you need to confidently answer those scenarios.

Modern enterprises adopt DevOps and SRE for a reason: research summarized in Google's annual DORA (DevOps Research and Assessment) State of DevOps report shows that elite performing teams ship code hundreds of times more often with lower change failure rates than low performers. Google Cloud bakes that research into its tooling so that any organization — from a Taipei startup to a multinational bank — can climb the DORA performance curve without reinventing the practices.

By the end of this chapter you will be able to describe the DevOps and SRE pillars, explain how Google Cloud's CI/CD products fit together, articulate SLI/SLO/SLA and error budgets to a non-technical executive, and reason about modernization trade-offs from a business-leader perspective.

DevOps Foundations: CI, CD, IaC, Automation, Culture

DevOps rests on five practical pillars: Continuous Integration (CI), Continuous Delivery (CD), Infrastructure as Code (IaC), automation, and a blameless, collaborative culture. None of these is a tool by itself; they are practices that tools support.

Continuous Integration (CI)

Continuous Integration means developers merge their work into a shared mainline branch many times per day. Every merge triggers an automated build and automated tests so that bugs are caught minutes after they are introduced, not weeks. The longer code lives on a separate branch, the more painful the merge — CI eliminates that pain by making integration constant and small.

On Google Cloud the managed CI service is Cloud Build. It listens for commits on GitHub, GitLab, Bitbucket, or Cloud Source Repositories, runs your build steps inside isolated containers, runs your test suite, and produces signed build artifacts. Cloud Build pricing is per build-minute with a generous free tier, and it integrates natively with Artifact Registry and Cloud Deploy.

Continuous Delivery and Continuous Deployment (CD)

Continuous Delivery keeps every successful build in a state where it could be released to production at the push of a button. Continuous Deployment goes one step further: every successful build that passes tests is automatically promoted to production. The difference is a human approval gate.
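
The approval gate can be pictured with a small Python sketch. The `Build` class, the `release_decision` function, and the `auto_deploy` flag are hypothetical names made up for this illustration; they are not part of any Google Cloud API.

```python
from dataclasses import dataclass


@dataclass
class Build:
    version: str
    tests_passed: bool


def release_decision(build: Build, auto_deploy: bool, approved: bool = False) -> str:
    """Return what happens to a build under each CD style.

    auto_deploy=False models Continuous Delivery (a human must approve).
    auto_deploy=True models Continuous Deployment (no human gate).
    """
    if not build.tests_passed:
        return "rejected"            # failed builds never become release candidates
    if auto_deploy or approved:
        return "deployed"            # promoted to production
    return "awaiting approval"       # releasable, but waiting for the button push


build = Build(version="1.4.2", tests_passed=True)
print(release_decision(build, auto_deploy=False))                 # Continuous Delivery
print(release_decision(build, auto_deploy=False, approved=True))  # after sign-off
print(release_decision(build, auto_deploy=True))                  # Continuous Deployment
```

The only difference between the two styles in this sketch is whether a passing build waits for `approved=True`.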

On Google Cloud the managed CD service is Cloud Deploy. It models a delivery pipeline as an ordered set of targets — typically dev → staging → prod — and orchestrates progressive rollouts to GKE, Cloud Run, and Anthos clusters. Cloud Deploy supports canary releases, blue-green deployments, rollback in one click, and built-in approval gates for regulated environments.

Infrastructure as Code (IaC)

Infrastructure as Code means describing your cloud resources — VPCs, GKE clusters, IAM bindings, Cloud SQL instances — as version-controlled text files that a tool can apply repeatably. The benefits are enormous: environments become reproducible, drift is auditable, and disaster recovery becomes "re-run the code" instead of "click 400 buttons."

On Google Cloud the de facto IaC standard is Terraform. Google maintains the official google and google-beta providers, and Cloud Build can run terraform plan and terraform apply as part of an automated pipeline. Google also offers Config Connector (Kubernetes-style declarative configuration of Google Cloud resources) and Infrastructure Manager (a managed Terraform execution service) for teams that want a fully managed IaC backend.
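
To make the "describe it as text, then apply it repeatably" idea concrete, here is a minimal Python sketch of the plan-and-apply loop. It is a toy model, not how Terraform is implemented: the `plan` and `apply_changes` functions and the resource names are assumptions invented for the example.

```python
def plan(desired: dict, actual: dict) -> dict:
    """Compare desired state with actual state and list the changes needed."""
    return {
        "create": sorted(name for name in desired if name not in actual),
        "update": sorted(
            name for name in desired
            if name in actual and desired[name] != actual[name]
        ),
        "delete": sorted(name for name in actual if name not in desired),
    }


def apply_changes(desired: dict, actual: dict) -> dict:
    """Make the actual state match the desired state (here: just copy it)."""
    return {name: dict(config) for name, config in desired.items()}


# Hypothetical resources; the names and fields are illustrative only.
desired = {
    "vpc-main": {"region": "asia-east1"},
    "gke-prod": {"nodes": 5},
}
actual = {
    "gke-prod": {"nodes": 3},                   # drifted from the declared size
    "vm-forgotten": {"zone": "us-central1-a"},  # created by hand, not in code
}

print(plan(desired, actual))
actual = apply_changes(desired, actual)
print(plan(desired, actual))  # planning again after apply finds nothing to change
```

Because the desired state lives in a file, the same comparison can be run against any environment, which is what makes drift visible and environments reproducible.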

Automation Everywhere

The unifying theme of DevOps is automation. Every manual step — provisioning servers, deploying code, running tests, paging on-call engineers — is a candidate for automation. Automation reduces toil (more on that under SRE), eliminates human error, and frees engineering time for higher-value work.

Blameless, Collaborative Culture

Tools alone do not create DevOps. The cultural shift is just as important: developers and operators share goals, share metrics, and share responsibility. When incidents occur, the team focuses on systemic root cause rather than personal blame. This creates a high-trust environment in which engineers admit mistakes early instead of hiding them.

DevOps is the union of practices (CI/CD, IaC, monitoring), culture (shared ownership, blameless learning), and tools (Cloud Build, Cloud Deploy, Artifact Registry, Terraform) that compresses the time between writing code and delivering value to users while maintaining reliability. SRE is Google's specific implementation pattern for DevOps that applies software engineering to operations problems. See https://cloud.google.com/architecture/devops for the official Google Cloud DevOps capability model.

Plain English Explanation

DevOps and SRE can feel abstract until you map them to physical-world systems you already understand. The four analogies below approach the same concepts from different angles. Pick the one that clicks for you and use it during the exam to reason about scenarios.

Analogy 1 — A Modern Factory Assembly Line (CI/CD Automation)

Imagine a Taiwanese semiconductor factory that produces millions of chips a day. Twenty years ago every step was manual: a worker placed the silicon wafer, another worker pressed a button, a third worker visually inspected the result. Throughput was low, defects were high, and shipping a single batch took weeks.

Today the same factory runs as a fully automated assembly line. Wafers move on conveyor belts. Robotic arms place each component with sub-millimeter precision. Inline inspection cameras test every chip the moment it is produced and reject failures before they reach the next stage. If a defect spikes, the line stops automatically and alerts the supervisor. Shipping a batch now takes hours.

This is exactly what CI/CD on Google Cloud does for software. Cloud Build is the conveyor belt that moves code from commit to artifact. The automated test suite is the inspection camera that rejects bad builds. Artifact Registry is the warehouse that stores finished, versioned products. Cloud Deploy is the dispatcher that moves each release through dev and staging and then ships it to production in waves, starting with a small pilot group (canary) before the full rollout. If a defect is detected in production, rollback is the equivalent of recalling the bad batch in one click. The factory analogy explains why DevOps teams ship hundreds of times faster than traditional teams: every step is automated, measured, and reversible.

Analogy 2 — A 24-Hour Hospital ICU (SLI/SLO/Error Budgets)

A hospital intensive care unit (ICU) does not aim for "perfect" patient outcomes. It aims for measurable, achievable targets: oxygen saturation above 92%, heart rate within a certain range, response time to a code-blue alarm under 30 seconds. These are the ICU's Service Level Indicators (SLIs): the specific numbers being measured. The targets like "92% oxygen saturation" are Service Level Objectives (SLOs): the goals the team commits to. The contract the hospital signs with insurance providers ("we will admit emergency cases within 15 minutes") is the Service Level Agreement (SLA): a legally binding promise with penalties for breach.

Crucially, no ICU promises 100% perfect outcomes. That would be both unachievable and counterproductive — chasing 100% would mean refusing risky surgeries and never trying new treatments. Instead the ICU accepts that a small amount of risk is allowed — the error budget. That budget is spent deliberately: on innovation, on training new doctors, on running clinical trials. When the budget is exhausted (too many bad outcomes recently), the ICU pauses risky procedures and focuses on stabilization.

This is exactly how Google SRE runs production services. Your SLI might be the percentage of HTTP requests that return success within 200 ms. Your SLO is the target for that number (99.9%). Your error budget is the remaining 0.1%, roughly 43 minutes per month of allowed unavailability. Your release team can deploy aggressively while the budget is healthy and must slow down when it is exhausted. The error budget aligns developers (who want velocity) and operators (who want stability) around a single shared number.

Analogy 3 — An Airline Maintenance Manual (Infrastructure as Code)

A Boeing 787 cannot be maintained by intuition. Every inspection, every part replacement, every fluid top-up is described in a maintenance manual — thousands of pages of step-by-step instructions that any qualified technician can follow to produce identical results. If you remove the manual and let each mechanic improvise, the airline becomes unsafe and uncertifiable. The manual is the source of truth: anything not in the manual is not done; anything in the manual is done the same way every time.

Infrastructure as Code brings the same discipline to cloud infrastructure. Instead of clicking around the Google Cloud Console to create a VPC, a GKE cluster, and an IAM binding, you describe them in a Terraform configuration file that anyone can read, review, version, and replay. When a new environment is needed for testing, terraform apply reproduces the entire setup in minutes. When an auditor asks "show me how production was configured on March 15," git log answers definitively. When disaster strikes and a whole region fails, the same Terraform code rebuilds the environment elsewhere. Without IaC, every environment becomes a snowflake — unique, undocumented, and impossible to reproduce. With IaC, environments become like aircraft: standardized, certifiable, and safe.

Analogy 4 — A Baseball Team's Coaching Statistics (DORA Metrics)

A baseball coach does not evaluate the team by gut feel. The coach measures batting average, on-base percentage, earned run average, fielding percentage — concrete numbers that reveal where the team is strong and where it needs work. Without those statistics, improvement is impossible because there is nothing to improve against.

DORA metrics are the four statistics that reveal a software team's performance: Deployment Frequency (how often you ship), Lead Time for Changes (how long from commit to production), Change Failure Rate (what percent of deployments cause incidents), and Mean Time to Recovery (MTTR) (how fast you recover when something breaks). Like batting averages, DORA metrics let an engineering leader benchmark their team against the industry: elite performers deploy multiple times per day with a change failure rate below 5%, while low performers deploy less than once per month with double-digit failure rates. Google publishes the annual State of DevOps Report so any organization can see where it stands and what to improve next.

Google Cloud's CI/CD Toolchain

Google Cloud offers a complete, integrated CI/CD toolchain. Each service does one job well and connects natively to the rest.

Cloud Build — Managed CI

Cloud Build is Google Cloud's fully managed CI service. It runs your build steps as containers on Google-managed infrastructure, so there is no build server to maintain. A typical pipeline:

Developer pushes a commit to GitHub.
A webhook triggers Cloud Build.
Cloud Build runs steps defined in cloudbuild.yaml — docker build, go test, npm run lint, security scans.
On success, Cloud Build pushes the image to Artifact Registry and creates a Cloud Deploy release.
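
The fail-fast behavior of such a pipeline can be sketched in a few lines of Python. This is only a model of the control flow, not Cloud Build itself: `run_pipeline` and the lambda steps are hypothetical stand-ins for real build steps.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_pipeline(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order; stop at the first failure.

    Returns (succeeded, names of the steps that actually ran).
    """
    ran = []
    for name, action in steps:
        ran.append(name)
        if not action():
            return False, ran      # later steps (such as publishing) are skipped
    return True, ran


# Hypothetical stand-ins for real build steps; each just reports success or failure.
steps = [
    ("build image", lambda: True),
    ("unit tests", lambda: True),
    ("lint", lambda: False),       # simulate a lint failure
    ("push to registry", lambda: True),
    ("create release", lambda: True),
]

ok, ran = run_pipeline(steps)
print(ok, ran)
```

The point of the sketch is the ordering: the image is pushed and the release created only if every earlier check passes, so a broken build never reaches the registry.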

Cloud Build also offers private pool workers for VPC-internal builds, works with Artifact Analysis for vulnerability scanning of the images it produces, and integrates with Binary Authorization for supply-chain security.

Cloud Deploy — Managed CD

Cloud Deploy is the managed continuous-delivery service for GKE, Cloud Run, and Anthos targets. You declare a delivery pipeline YAML that lists targets in order (dev, staging, prod) and Cloud Deploy handles:

Progressive rollouts with traffic splitting (e.g., 10% → 25% → 50% → 100%).
Canary and blue-green deployment strategies.
Approval gates for production promotions.
One-click rollbacks to any previous release.
Audit trails of who deployed what, when, and to where.
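
The progressive-rollout idea in the list above can be simulated with a short Python sketch. It assumes a hypothetical `is_healthy` check that is called after each traffic shift; the function names and the result dictionary are invented for illustration and do not mirror the Cloud Deploy API.

```python
from typing import Callable, List


def progressive_rollout(
    stages: List[int],
    is_healthy: Callable[[int], bool],
) -> dict:
    """Shift traffic to the new release stage by stage.

    stages: traffic percentages for the new release, in increasing order.
    is_healthy: hypothetical health check called after each traffic shift.
    """
    history = []
    for percent in stages:
        history.append(percent)
        if not is_healthy(percent):
            # Roll back: all traffic returns to the previous release.
            return {"status": "rolled back", "new_release_traffic": 0, "history": history}
    return {"status": "succeeded", "new_release_traffic": stages[-1], "history": history}


stages = [10, 25, 50, 100]

# Healthy at every stage: the rollout completes.
print(progressive_rollout(stages, lambda percent: True))

# Errors appear once half of the traffic reaches the new release.
print(progressive_rollout(stages, lambda percent: percent < 50))
```

In the second run the problem is caught while only part of the traffic is exposed, which is the business reason for rolling out in steps instead of all at once.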

Artifact Registry — Container and Package Storage

Artifact Registry is Google Cloud's unified storage for Docker images, Maven, npm, Python, Go, and OS packages. It replaces the older Container Registry and provides:

Regional and multi-regional repositories for low latency.
Vulnerability scanning through Artifact Analysis.
Fine-grained IAM for read/write/admin access.
Remote and virtual repositories that proxy public registries like Docker Hub and PyPI for security and reliability.

Cloud Source Repositories — Git Hosting

Cloud Source Repositories offers private Git hosting on Google Cloud. In practice most teams host code on GitHub or GitLab and connect those external repositories to Cloud Build via the Cloud Build GitHub App or webhooks. The CDL exam treats GitHub or GitLab connected to Cloud Build as the mainstream pattern, with Cloud Source Repositories as the legacy fallback for teams that want everything inside Google Cloud's perimeter.

Skaffold — Local Inner Loop

Skaffold is an open-source Google tool that automates the local development inner loop: watch source files, rebuild containers, redeploy to a local or remote Kubernetes cluster, and stream logs. It glues docker build, kubectl apply, and live reload into a single skaffold dev command. Skaffold is also the engine underneath Cloud Deploy for rendering Kubernetes manifests.

Terraform on Google Cloud — IaC

Terraform is the IaC standard on Google Cloud. The official google and google-beta providers cover essentially every Google Cloud resource. Best practice patterns:

Store Terraform state in a Cloud Storage bucket with versioning and locking.
Use separate state per environment (dev, staging, prod).
Run terraform plan and terraform apply from Cloud Build with a dedicated service account.
Use Workload Identity Federation so Terraform never holds long-lived keys.

For the CDL exam, memorize the Google Cloud CI/CD product mapping: managed CI = Cloud Build, managed CD with progressive rollouts = Cloud Deploy, container and package storage = Artifact Registry, local dev loop = Skaffold, IaC standard = Terraform. Scenarios that mention "automated build pipeline" or "build minutes" point to Cloud Build; scenarios that mention "canary deployment" or "approval gate before production" point to Cloud Deploy. See https://cloud.google.com/build/docs/overview and https://cloud.google.com/deploy/docs/overview for the official overviews.

SRE Pillars: SLI, SLO, SLA, and Error Budgets

Site Reliability Engineering was invented at Google in 2003 and codified in the famous SRE books published in 2016 and 2018. Its core philosophy: treat operations as a software engineering problem. The pillars below show up directly on the CDL exam.

The CDL exam's favorite SRE question pattern: "An e-commerce platform commits to 99.9% availability in their customer SLA. They have already used 80% of their monthly error budget by week 3 due to a release that triggered cascading failures. What should the SRE team do next?" The expected answer is to slow down release velocity and prioritize reliability work until the error budget recovers — not to push more features. The error budget concept is what turns "100% uptime" into a measurable trade-off between velocity (shipping features) and reliability (uptime). Always favor the answer that respects the error budget over the answer that chases zero downtime.

Service Level Indicator (SLI)

An SLI is a quantitative measure of a specific aspect of service quality — the thing you measure. Common SLIs:

Availability: the fraction of successful requests over total requests.
Latency: the 50th / 95th / 99th percentile response time.
Throughput: requests per second served.
Correctness: the fraction of responses with correct data.
Durability: the fraction of objects preserved over a time window.

Good SLIs are user-facing: they measure what customers actually experience, not internal CPU utilization.
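
As a rough illustration, the availability and latency SLIs listed above can be computed from raw request records like this. The sample data is made up, and the nearest-rank percentile method is one common choice assumed here, not a method the exam or Google prescribes.

```python
import math
from typing import List, Tuple

# Each request is (succeeded, latency_ms). The sample data is made up.
Request = Tuple[bool, float]


def availability(requests: List[Request]) -> float:
    """Fraction of successful requests over total requests."""
    good = sum(1 for ok, _ in requests if ok)
    return good / len(requests)


def latency_percentile(requests: List[Request], percentile: int) -> float:
    """Nearest-rank percentile of request latency in milliseconds."""
    latencies = sorted(latency for _, latency in requests)
    rank = math.ceil(percentile * len(latencies) / 100)  # 1-based rank
    return latencies[max(rank, 1) - 1]


requests = [(True, 120.0)] * 90 + [(True, 450.0)] * 8 + [(False, 900.0)] * 2

print(f"availability: {availability(requests):.2%}")
print(f"p50 latency: {latency_percentile(requests, 50)} ms")
print(f"p95 latency: {latency_percentile(requests, 95)} ms")
print(f"p99 latency: {latency_percentile(requests, 99)} ms")
```

Notice how the median looks healthy while the 95th and 99th percentiles expose the slow requests; this is why latency SLIs are usually stated as percentiles rather than averages.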

Service Level Objective (SLO)

An SLO is a target value for an SLI over a defined window — the goal you commit to internally. Example: "99.9% of HTTP requests over a rolling 28-day window return a 2xx response within 300 ms." The SLO is not the same as the maximum possible quality; it is the minimum your customers find acceptable.
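
The example SLO above can be checked mechanically. The following Python sketch assumes request events are available as simple tuples; the event data, the `slo_compliance` function, and its parameters are hypothetical and only mirror the wording of the example (2xx within 300 ms over a rolling 28-day window).

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Each event is (timestamp, http_status, latency_ms). Sample data is made up.
Event = Tuple[datetime, int, float]


def slo_compliance(
    events: List[Event],
    now: datetime,
    window_days: int = 28,
    max_latency_ms: float = 300.0,
) -> float:
    """Fraction of requests in the rolling window that count as 'good'.

    A request is good if it returned a 2xx status within max_latency_ms.
    """
    start = now - timedelta(days=window_days)
    in_window = [e for e in events if start <= e[0] <= now]
    good = sum(
        1 for _, status, latency in in_window
        if 200 <= status < 300 and latency <= max_latency_ms
    )
    return good / len(in_window)


now = datetime(2024, 3, 29)
events = (
    [(now - timedelta(days=1), 200, 120.0)] * 9985    # fast successes
    + [(now - timedelta(days=2), 200, 800.0)] * 10    # successful but too slow
    + [(now - timedelta(days=3), 503, 50.0)] * 5      # server errors
    + [(now - timedelta(days=40), 503, 50.0)] * 500   # outside the 28-day window
)

target = 0.999
measured = slo_compliance(events, now)
print(f"measured {measured:.4%} against target {target:.1%}:",
      "SLO met" if measured >= target else "SLO missed")
```

Two details are worth noticing: slow responses count against the SLO even though they succeeded, and the old outage no longer counts once it has aged out of the window.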

Service Level Agreement (SLA)

An SLA is the contractual commitment you make to customers about service levels, usually with financial penalties for breach. SLAs are typically looser than internal SLOs because you want a safety margin: if your internal SLO is 99.95% but your customer-facing SLA is 99.9%, you have headroom to detect and fix issues before customers can claim a refund.

Error Budget

The error budget is the gap between 100% and your SLO. If your SLO is 99.9%, your error budget is 0.1% — about 43 minutes of allowed downtime per month. This budget is a resource to be spent:

When the budget is healthy, the team can deploy aggressively, run chaos experiments, and ship risky new features.
When the budget is exhausted, the team must slow releases, focus on reliability fixes, and possibly freeze deployments until the SLO recovers.

The error budget creates a built-in trade-off between velocity and reliability that aligns developers and SREs around a single shared metric.
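
The arithmetic behind the 43-minute figure, and the "healthy versus exhausted" decision, can be sketched as follows. The 30-day month and the 25% caution threshold are assumptions chosen for the example; real teams set their own policy.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes


def release_policy(slo: float, downtime_minutes: float, window_days: int = 30) -> str:
    """Suggest a release posture from how much budget is left."""
    budget = error_budget_minutes(slo, window_days)
    remaining = budget - downtime_minutes
    if remaining <= 0:
        return "freeze releases and fix reliability"
    if remaining < 0.25 * budget:      # hypothetical caution threshold
        return "slow down releases"
    return "ship at normal pace"


budget = error_budget_minutes(0.999)
print(f"budget: {budget:.1f} minutes per 30 days")   # 0.1% of 43,200 minutes
for used in (5, 35, 50):
    print(f"{used} min of downtime -> {release_policy(0.999, used)}")
```

A 30-day window has 43,200 minutes, so a 99.9% SLO leaves 43.2 minutes of budget; the policy function simply turns the remaining budget into a release posture.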

SLI = what you measure (e.g., success rate). SLO = the target you commit to (e.g., 99.9%). SLA = the contract with customers and its financial penalty. Error budget = 100% − SLO = the allowed unreliability you can spend on release velocity. Internal SLOs should always be stricter than external SLAs. See https://cloud.google.com/sre for the canonical SRE definitions.

100% uptime is NOT a realistic SLO. A common trap on CDL questions is to pick "we should aim for 100% availability." Google's SRE doctrine teaches the opposite: chasing 100% halts innovation because every change becomes a risk to that perfect record, and each extra "nine" is far harder to achieve than the last (going from 99.9% to 99.99% shrinks the allowed downtime from about 43 minutes to about 4 minutes per month, which usually demands much more redundancy). The right answer is: define an SLO that customers accept (often 99.9% or 99.95%), then spend the error budget on velocity. See https://cloud.google.com/sre/sre-fundamentals.

Toil Reduction and Automation

Toil is Google SRE's term for manual, repetitive, automatable, tactical, devoid-of-enduring-value operational work. Resetting passwords by hand, restarting servers by SSH, running the same database export script every night — all toil. SRE teams target less than 50% of an engineer's time spent on toil, with the rest spent on engineering work that reduces future toil.

Toil reduction strategies:

Automate repetitive runbooks with scripts, Cloud Workflows, or Cloud Functions triggered by Pub/Sub.
Self-service common requests through internal platforms instead of ticket queues.
Eliminate broken classes of work entirely by fixing root causes.
Measure how much time the team spends on toil so the trend is visible.
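
Measuring toil can start very simply. The sketch below assumes engineers log hours per task and that the team has agreed which tasks count as toil; the task names, hours, and the `toil_report` function are all hypothetical.

```python
from typing import Dict, Set

TOIL_CAP = 0.50  # the 50% ceiling on toil described above


def toil_report(hours_by_task: Dict[str, float], toil_tasks: Set[str]) -> dict:
    """Summarize what share of logged hours went to toil."""
    total = sum(hours_by_task.values())
    toil = sum(h for task, h in hours_by_task.items() if task in toil_tasks)
    fraction = toil / total
    # The biggest toil item is the first automation candidate.
    top = max((t for t in hours_by_task if t in toil_tasks), key=hours_by_task.get)
    return {"toil_fraction": fraction, "over_cap": fraction >= TOIL_CAP, "automate_first": top}


# Hypothetical week of logged hours for one engineer.
week = {
    "manual server restarts": 9.0,
    "password resets": 6.0,
    "nightly export by hand": 7.0,
    "building self-service portal": 12.0,
    "writing alerting rules": 6.0,
}
toil_tasks = {"manual server restarts", "password resets", "nightly export by hand"}

print(toil_report(week, toil_tasks))
```

Tracking this number over several quarters is what makes the trend visible and tells the team which runbook to automate next.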

Google Cloud services that directly reduce toil include Cloud Operations Suite (auto-collected metrics, logs, and traces — no agent toil), Cloud Run (no servers to patch), and GKE Autopilot (no nodes to manage). The general principle: the more managed the service, the less toil.

Blameless Postmortems and Incident Response

When an incident occurs, the SRE response is blameless: the postmortem focuses on systemic factors — missing alerts, ambiguous runbooks, single-points-of-failure — rather than personal mistakes. The premise is that humans make errors under pressure; the system must be designed to absorb those errors safely.

A blameless postmortem typically includes:

Timeline of what happened with timestamps from Cloud Logging.
Root cause analysis (using techniques like 5-Whys).
What went well (early detection, fast rollback).
What went wrong (gaps in alerting, slow paging chain).
Action items with owners and due dates.

The output is published widely so the entire engineering organization learns. This is one of the cultural practices that distinguishes elite-performing teams in the DORA research.

For CDL scenarios that mention "after an outage, the team holds a meeting to discuss what happened without assigning blame to individuals," the answer is blameless postmortem — a core SRE practice. Pair this with the error budget concept: a blameless postmortem typically results in changes that protect the error budget going forward (better tests, better alerts, better rollback automation). See https://cloud.google.com/architecture/framework/reliability/postmortems for Google Cloud's postmortem guidance.

DORA Metrics: Measuring DevOps Performance

The DORA (DevOps Research and Assessment) team, now part of Google Cloud, runs one of the longest-running research programs on DevOps practices, published yearly as the State of DevOps Report. The research distilled four key metrics that characterize software delivery performance:

Deployment Frequency

How often does the team push code to production? Elite teams deploy multiple times per day; low performers deploy less than once per month. High deployment frequency requires automation, small batch sizes, and trunk-based development.

Lead Time for Changes

How long does it take from a commit on main to that code running in production? Elite teams measure this in hours; low performers measure it in weeks. Short lead times require automated tests, automated deployments, and tight feedback loops.

Change Failure Rate

What percent of deployments cause an incident, rollback, or hotfix? Elite teams keep this under 5%; low performers see double-digit rates. Low change failure rate requires comprehensive automated testing and progressive rollout strategies.

Mean Time to Recovery (MTTR)

How fast does the team restore service when an incident occurs? Elite teams recover in under an hour; low performers can take days. Fast recovery requires rollback automation, blameless culture, and good observability via the Cloud Operations Suite.

The four DORA metrics together give a holistic view: speed (deployment frequency + lead time) and stability (change failure rate + MTTR). Improving one without the other is a warning sign.
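
For readers who want to see how the four numbers fall out of raw data, here is a minimal Python sketch. It assumes each production deployment is recorded with a commit time, a deploy time, and, for failures, a restore time; the `Deployment` record and the sample data are hypothetical, and real teams differ in exactly how they define each metric.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List, Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None   # set only for failed deployments


def dora_metrics(deployments: List[Deployment], period_days: int) -> dict:
    """Compute the four DORA metrics from a list of production deployments."""
    failures = [d for d in deployments if d.failed]
    lead_hours = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deployments
    ]
    recovery_hours = [
        (d.restored_at - d.deployed_at).total_seconds() / 3600
        for d in failures if d.restored_at is not None
    ]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "mean_lead_time_hours": mean(lead_hours),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_recovery_hours": mean(recovery_hours) if recovery_hours else None,
    }


# Hypothetical data: 20 deployments over 5 days, one of which failed.
start = datetime(2024, 1, 1, 9, 0)
deployments = [
    Deployment(start + timedelta(hours=6 * i), start + timedelta(hours=6 * i + 2))
    for i in range(19)
]
bad_deploy = start + timedelta(hours=116)
deployments.append(
    Deployment(bad_deploy - timedelta(hours=2), bad_deploy, failed=True,
               restored_at=bad_deploy + timedelta(minutes=30))
)

for name, value in dora_metrics(deployments, period_days=5).items():
    print(f"{name}: {value}")
```

The first two values describe speed and the last two describe stability, so a dashboard built on this kind of data shows both halves of the picture side by side.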

Modernization Journey: From Manual Ops to Full SRE

Most organizations don't jump straight from "tickets and bash scripts" to "elite DORA performance." The realistic modernization journey:

1. Adopt source control and CI: every change goes through Git and is built and tested automatically by Cloud Build.
2. Containerize and standardize artifacts: package apps as Docker images stored in Artifact Registry.
3. Automate deployments: replace manual kubectl apply with Cloud Deploy pipelines that include approval gates.
4. Adopt IaC: describe infrastructure in Terraform, store state in Cloud Storage, apply via Cloud Build.
5. Define SLOs and dashboards: set up SLIs in Cloud Monitoring, publish SLO burn-rate alerts.
6. Run blameless postmortems: every incident produces shareable learning.
7. Reduce toil: measure operational time and automate the top time-sinks each quarter.

Each step compounds the next. For a deeper look at the observability layer that powers steps 5–7 see /en/certs/gcp/cdl/topics/google-cloud-operations-suite. For how compute choices affect operational burden see /en/certs/gcp/cdl/topics/google-cloud-compute-options and /en/certs/gcp/cdl/topics/containers-vs-vms.

Security in the DevOps Pipeline (DevSecOps)

Modern DevOps embeds security into the pipeline rather than bolting it on at the end. Google Cloud supports DevSecOps with:

Artifact Analysis for vulnerability scanning on every container image.
Binary Authorization to enforce "only signed images from trusted sources can deploy."
Cloud Build private pools so build workers can reach VPC-internal dependencies without exposing them publicly.
Workload Identity Federation so pipelines authenticate to Google Cloud without long-lived service-account keys.
Software Delivery Shield as the umbrella security posture for the supply chain.

A typical CDL scenario: "How does an organization ensure that only approved container images can run in production?" The answer is Binary Authorization with Artifact Registry as the trusted source.
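
The logic of "only approved images may run" can be pictured with a toy admission check in Python. This is a conceptual sketch only: it does not call or imitate the Binary Authorization API, and the repository path, digests, and `admit` function are hypothetical.

```python
from typing import Set

# Hypothetical values; real policies are configured in Binary Authorization itself.
TRUSTED_PREFIX = "asia-east1-docker.pkg.dev/example-project/prod-images/"
SIGNED_DIGESTS: Set[str] = {"sha256:1111", "sha256:2222"}


def admit(image: str) -> bool:
    """Allow an image only if it comes from the trusted repository and is attested.

    image is expected in the form '<repository path>/<name>@<digest>'.
    """
    if not image.startswith(TRUSTED_PREFIX):
        return False                       # unknown source
    _, _, digest = image.partition("@")
    return digest in SIGNED_DIGESTS        # a tag alone is not enough


candidates = [
    TRUSTED_PREFIX + "checkout@sha256:1111",      # trusted and attested
    TRUSTED_PREFIX + "checkout@sha256:9999",      # trusted source, no attestation
    "docker.io/someone/checkout@sha256:1111",     # attested digest, untrusted source
    TRUSTED_PREFIX + "checkout:latest",           # no digest at all
]
for image in candidates:
    print("ALLOW" if admit(image) else "DENY ", image)
```

Both conditions must hold at deploy time, which is why the scenario answer pairs a trusted registry with a signature check rather than relying on either one alone.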

Cost and Business Value of DevOps and SRE

For a Cloud Digital Leader, the business value of DevOps and SRE is the part that matters most:

Faster time to market — features ship in days instead of quarters.
Lower change failure rate — fewer outages mean lower customer churn and lower incident-response cost.
Higher engineering satisfaction — DORA research shows elite teams have lower burnout because automation removes toil.
Audit and compliance readiness — every deployment is logged in Cloud Audit Logs, every infrastructure change is in Git.
Predictable infrastructure cost — IaC plus rightsizing recommendations from Recommender prevent surprise bills.

The investment cost — Cloud Build minutes, Cloud Deploy targets, Artifact Registry storage, engineer training — is small compared with the cost of one major outage or one missed release window. This ROI argument is what a Cloud Digital Leader uses when proposing DevOps adoption to executives.

Frequently Asked Questions

Q: Which Google Cloud service is the managed CI service?

A: Cloud Build. It runs your build and test steps inside isolated container workers on Google-managed infrastructure, integrates with GitHub/GitLab/Bitbucket/Cloud Source Repositories, and pushes built artifacts to Artifact Registry. Pricing is per build-minute and there is a free tier. Cloud Build is the answer whenever a CDL scenario mentions "automated builds," "build pipelines," or "CI on Google Cloud."

Q: How does Cloud Deploy differ from Cloud Build?

A: Cloud Build is CI — it turns source code into tested artifacts. Cloud Deploy is CD — it takes those artifacts and progressively rolls them out to GKE, Cloud Run, or Anthos through a delivery pipeline with approval gates, canary strategies, and one-click rollback. A complete pipeline almost always uses both: Cloud Build emits an image into Artifact Registry, then triggers a Cloud Deploy release.

Q: What is an error budget and why does Google SRE recommend one?

A: An error budget is the difference between 100% and your SLO. If your SLO is 99.9% availability, your error budget is 0.1%, roughly 43 minutes per month of allowed unavailability. The error budget is meant to be spent on release velocity: when the budget is healthy, the team ships fast; when it is exhausted, the team slows down and prioritizes reliability. It prevents the false trade-off between "shipping features" and "keeping the site up" by making the trade-off explicit and measurable.

Q: What's the difference between SLO and SLA?

A: An SLO (Service Level Objective) is the internal target the team commits to (e.g., 99.95% availability). An SLA (Service Level Agreement) is the external contract with customers, usually with financial penalties for breach (e.g., 99.9% availability with a 10% credit if missed). Best practice is to keep the internal SLO stricter than the external SLA so the team has headroom to detect and fix issues before the customer is entitled to a refund.

Q: What are the four DORA metrics?

A: Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Mean Time to Recovery (MTTR). Together they measure both speed (deployment frequency, lead time) and stability (change failure rate, MTTR). Elite-performing teams in Google's annual State of DevOps Report deploy multiple times per day with under 5% change failure rate and recover from incidents in under an hour. The DORA metrics are the standard way to benchmark DevOps performance.

Q: Is Infrastructure as Code required for the CDL exam?

A: You will not be asked to write Terraform HCL syntax, but you must recognize what IaC is, why organizations adopt it (reproducible environments, auditable change history, faster disaster recovery), and which tools implement it on Google Cloud (Terraform with the official google provider, plus Config Connector and Infrastructure Manager). Scenarios that mention "version-controlled infrastructure" or "reproducible environments" point to IaC.

Summary

DevOps combines practices, culture, and tools to compress the time between writing code and delivering value, while SRE is Google's specific engineering discipline for running reliable services at scale. Google Cloud productizes both through an integrated toolchain:

Cloud Build — managed CI that turns commits into tested artifacts.
Cloud Deploy — managed CD with progressive rollouts, canaries, and one-click rollback.
Artifact Registry — unified storage for container images and language packages.
Skaffold — local development inner loop.
Terraform on Google Cloud — the IaC standard with managed execution via Infrastructure Manager.
Cloud Operations Suite — the observability backbone for SLIs, SLOs, and alerts.

The SRE pillars — SLI, SLO, SLA, error budget, toil reduction, blameless postmortems — give organizations a measurable, sustainable way to balance velocity and reliability. The DORA metrics — deployment frequency, lead time, change failure rate, MTTR — let leaders benchmark their teams against the industry. As a Cloud Digital Leader, your job is to translate these concepts into business outcomes (faster releases, lower churn, predictable cost) and to map each scenario to the right Google Cloud service. Master this framework and any DevOps or SRE question on the CDL exam becomes straightforward.
