Blue-Green and Canary Deployments — GCP PCA Study Notes

Introduction to Release Management Strategies

Releasing software is one of the highest-risk activities in the software development life cycle (SDLC). To minimize downtime and the impact of bugs, architects use various deployment strategies. In Google Cloud, tools like Cloud Deploy, GKE, and Cloud Run provide built-in support for these patterns.

For the GCP Professional Cloud Architect, the goal is to choose the strategy that best balances risk, cost, and complexity.

Plain-Language Explanation: Deployment Strategies

Analogy 1 — The Bridge Construction (Blue-Green)

Imagine you have an old bridge (Blue). Instead of repairing it while cars are driving on it, you build a brand-new bridge (Green) right next to it. Once the new bridge is tested and ready, you simply move the road signs (Shift Traffic) to point to the new bridge. If the new bridge starts to wobble, you move the signs back to the old bridge (Rollback).

Analogy 2 — The Coal Mine Canary (Canary)

In the old days, miners took a canary into a coal mine. If the canary stopped singing (died from gas), the miners knew it wasn't safe for humans to enter. A Canary Release is like sending 5% of your users (The canaries) to the new version. If they don't experience errors, you slowly let the other 95% of users in.

Analogy 3 — The Revolving Door (Rolling Update)

A Rolling Update is like a large building with a revolving door. You don't empty the whole building to let new people in. You replace the people one by one, or group by group. At any given time, the building is mostly full, but the "version" of the people inside is slowly changing until everyone is "Version 2.0."

Traffic Splitting: The process of redirecting a portion of network traffic from one version of an application to another, usually managed by a Load Balancer or a Service Mesh (like Istio/Anthos Service Mesh).

Comparing Deployment Strategies

1. Blue-Green Deployment

How it works: Two identical environments (Blue = Current, Green = New). Switch traffic 100% at once.
Pros: Instant rollback, zero downtime.
Cons: Expensive (requires 2x resources), complex database synchronization.

2. Canary Deployment

How it works: Deploy the new version to a small subset of instances. Gradually increase traffic (e.g., 5% -> 20% -> 50% -> 100%).
Pros: Minimal blast radius, allows for real-world testing with real users.
Cons: Requires advanced monitoring to detect issues in the small canary group.

3. Rolling Update

How it works: Incrementally replace old instances with new ones. This is the default in GKE and Managed Instance Groups (MIGs).
Pros: Cost-effective (little or no extra capacity needed), simple.
Cons: No instant rollback of all traffic; two versions run side by side until the rollout completes.

Release Management on GCP

Cloud Deploy: A managed continuous delivery service for GKE, GKE Enterprise, and Cloud Run. Its built-in deployment strategies are standard and canary.
Global Load Balancer: Use weighted backend services to shift traffic between "Blue" and "Green" instance groups or services.
Feature Flags: Use tools like ConfigCat or custom logic to enable/disable features without redeploying code (Dark Launching).

Architect's Insight: On the PCA exam, if a scenario emphasizes "Zero downtime" and "Instant rollback" above all else, Blue-Green is the answer. If it emphasizes "Testing in production" and "Minimizing risk for new features," Canary is the answer.

Cloud Deploy + Cloud Build Pipeline

Cloud Build handles the build/test/package phase, while Cloud Deploy handles the promote/release/verify/rollback phase. Together they form the canonical GCP CI/CD pipeline for GKE, GKE Enterprise, and Cloud Run.

Pipeline anatomy

Cloud Build trigger fires on git push to main (or tag): it uses cloudbuild.yaml to run pnpm test, docker build, and docker push into Artifact Registry.
Cloud Build then issues gcloud deploy releases create rel-${SHORT_SHA} --delivery-pipeline=web-pipeline --region=us-central1 --images=app=us-central1-docker.pkg.dev/$PROJECT_ID/web/app:$SHORT_SHA.
Cloud Deploy picks up the release and progresses through the targets defined in clouddeploy.yaml (dev → staging → prod), applying the strategy declared in the strategy field of each pipeline stage (standard or canary).

Canary strategy declaration

strategy:
  canary:
    runtimeConfig:
      kubernetes:
        serviceNetworking:
          service: web-svc
          deployment: web-deploy
    canaryDeployment:
      percentages: [10, 25, 50]
      verify: true

With verify: true, Cloud Deploy runs a verify job (the verify stanza of skaffold.yaml, executed as a container) after the deploy job of each phase. If verify fails, the rollout halts; your recovery paths are gcloud deploy rollouts retry-job (re-run the failed job) or gcloud deploy targets rollback (redeploy the last good release).
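
The phase-by-phase behaviour can be modelled in a few lines. This is a minimal sketch, not Cloud Deploy code: it assumes that percentages [10, 25, 50] expand into phases named canary-10, canary-25, canary-50 plus a final stable phase, and the verify callback is a hypothetical stand-in for the real verify job.

```python
from typing import Callable, List, Tuple


def build_phases(percentages: List[int]) -> List[Tuple[str, int]]:
    """Expand canary percentages into (phase_name, percent) pairs plus a final stable phase."""
    if any(not 0 < p < 100 for p in percentages):
        raise ValueError("each canary percentage must be between 1 and 99")
    if percentages != sorted(set(percentages)):
        raise ValueError("percentages must be strictly increasing")
    phases = [(f"canary-{p}", p) for p in percentages]
    phases.append(("stable", 100))
    return phases


def run_rollout(percentages: List[int], verify: Callable[[str], bool]) -> Tuple[str, List[str]]:
    """Advance phase by phase; halt at the first phase whose verify step fails."""
    completed: List[str] = []
    for name, _percent in build_phases(percentages):
        if not verify(name):
            return "FAILED", completed
        completed.append(name)
    return "SUCCEEDED", completed


if __name__ == "__main__":
    # Hypothetical verify: every phase passes except the 50% phase.
    print(run_rollout([10, 25, 50], lambda phase: phase != "canary-50"))
```

In this model a failure at 50% leaves the rollout halted after canary-25; nothing is undone automatically, which is why a separate rollback step is still needed.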

Why split Build from Deploy?

Cloud Build is stateless and ephemeral — every step starts fresh, which is ideal for reproducible builds but terrible for tracking which artifact is in which environment. Cloud Deploy maintains the delivery pipeline state (which release is in which target, rollout history, approval gates), so you get a Git-for-deployments model. For PCA, expect questions where "build runs on every commit but only certain commits get promoted to prod" — the answer is Cloud Build → Cloud Deploy with manual approval on the prod target.
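
To make the "delivery pipeline state" idea concrete, here is a toy model of what a deployment service has to remember and a stateless build system does not. It is only an illustration under assumed names (DeliveryPipeline, deploy, approval_required are hypothetical and not the Cloud Deploy API).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class DeliveryPipeline:
    """Toy model of delivery state: which release sits in which target."""
    targets: List[str]
    approval_required: Set[str] = field(default_factory=set)
    current: Dict[str, Optional[str]] = field(default_factory=dict)
    history: List[Tuple[str, str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        for target in self.targets:
            self.current.setdefault(target, None)

    def deploy(self, release: str, target: str, approved: bool = False) -> None:
        """Record a rollout of `release` to `target`, enforcing order and approval."""
        index = self.targets.index(target)
        if index > 0 and self.current[self.targets[index - 1]] != release:
            raise RuntimeError(f"{release} has not reached {self.targets[index - 1]} yet")
        if target in self.approval_required and not approved:
            raise PermissionError(f"rollout to {target} needs manual approval")
        self.current[target] = release
        self.history.append((release, target))


if __name__ == "__main__":
    pipeline = DeliveryPipeline(["dev", "staging", "prod"], approval_required={"prod"})
    pipeline.deploy("rel-a1b2c3", "dev")
    pipeline.deploy("rel-a1b2c3", "staging")
    pipeline.deploy("rel-a1b2c3", "prod", approved=True)
    print(pipeline.current)
```

Every commit can produce a release that lands in dev, while only releases that someone approves move on to prod, which is the exam scenario described above.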

Traffic Splitting in Cloud Run

Cloud Run revisions are immutable, and traffic between revisions is controlled via the --to-revisions flag on gcloud run services update-traffic. This is the simplest canary surface in GCP — no Load Balancer, no Service Mesh, no MIG required.

Tag-based canary pattern

# Deploy new revision but send 0% traffic
gcloud run deploy web --image=... --no-traffic --tag=canary --region=us-central1

# Test the canary directly via the tagged URL
curl https://canary---web-abc123-uc.a.run.app/healthz

# Send 5% of production traffic
gcloud run services update-traffic web \
  --to-revisions=web-00042-abc=5,web-00041-def=95 \
  --region=us-central1

Key behaviors

Revisions are versioned (web-00042-abc); traffic percentages are assigned to revisions, not to images.
Tags create a stable subdomain (canary---web-...) so internal QA can hit a specific revision without affecting the LATEST traffic split.
--no-traffic deploys a revision without routing any production traffic — ideal for "deploy at 3 AM, ramp at business hours" workflows.
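gcloud run services update-traffic --to-revisions=web-00041-def=100 is the panic button: it shifts 100% back to the previous stable revision, usually within seconds. Note that --to-latest does the opposite: it sends all traffic to the newest ready revision.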
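
Because percentages are assigned per revision and must add up to 100, scripts that ramp a canary usually compute the flag value rather than hard-code it. A small sketch of that arithmetic follows; the helper name to_revisions_flag is hypothetical, and the revision names are the ones used in the example above.

```python
def to_revisions_flag(canary: str, stable: str, canary_percent: int) -> str:
    """Build the value for --to-revisions so the two percentages always sum to 100."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    split = {canary: canary_percent, stable: 100 - canary_percent}
    # Leave out revisions that would receive no traffic.
    return ",".join(f"{rev}={pct}" for rev, pct in split.items() if pct > 0)


if __name__ == "__main__":
    for percent in (5, 25, 100, 0):
        print(percent, "->", to_revisions_flag("web-00042-abc", "web-00041-def", percent))
```

Setting the canary share to 0 yields a value that pins all traffic to the stable revision, which is the rollback case.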

Cloud Deploy's canary strategy also supports Cloud Run: it deploys the new revision, runs verify at each percentage, and shifts traffic for you, so you avoid hand-rolling traffic math in shell scripts. If you do script it yourself with a tagged revision, promote it with --to-tags=canary=100 (the --to-revisions flag expects revision names, not tags).

GKE Rolling vs Blue-Green via Anthos Service Mesh

GKE's native Deployment resource defaults to RollingUpdate (maxSurge=25%, maxUnavailable=25%). This is cheap but means mixed versions answer requests for several minutes — unacceptable for stateful or session-pinned workloads.

Rolling update (native)

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0

Pros: at most one extra pod at a time (maxSurge: 1) and no drop in serving capacity (maxUnavailable: 0). Cons: no instant rollback; kubectl rollout undo triggers another rolling update back to the previous image, which takes minutes.
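
The effect of maxSurge and maxUnavailable is easiest to see by stepping through the pod counts. The simulation below is a deliberate simplification: it assumes new pods become ready instantly and takes absolute pod counts rather than percentages, so it shows the shape of the rollout rather than the real controller's timing.

```python
from typing import List, Tuple


def simulate_rolling_update(replicas: int, max_surge: int, max_unavailable: int) -> List[Tuple[int, int]]:
    """Return (old_pods, new_pods) after each step of a simplified rolling update."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    old, new = replicas, 0
    steps = [(old, new)]
    while old > 0 or new < replicas:
        # Scale up new pods within the surge ceiling (replicas + max_surge in total).
        room = replicas + max_surge - (old + new)
        new += min(room, replicas - new)
        # Scale down old pods while keeping at least replicas - max_unavailable running.
        removable = (old + new) - (replicas - max_unavailable)
        old -= min(old, max(removable, 0))
        steps.append((old, new))
    return steps


if __name__ == "__main__":
    for old, new in simulate_rolling_update(replicas=4, max_surge=1, max_unavailable=0):
        print(f"old={old} new={new}")
```

With four replicas, maxSurge 1 and maxUnavailable 0, three of the five recorded states serve both versions at once, which is the mixed-version window the text warns about.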

Blue-Green via Anthos Service Mesh (ASM)

Run two Deployment objects (web-blue, web-green) behind one Service, define the blue and green subsets in a DestinationRule (keyed on a pod label such as version), and use an ASM VirtualService with a weight field to flip traffic:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: web
spec:
  hosts:
  - web
  http:
  - route:
    - destination: { host: web, subset: blue }
      weight: 0
    - destination: { host: web, subset: green }
      weight: 100

A single kubectl apply flips traffic as soon as the control plane pushes the new route to the sidecars (typically within seconds). Rollback is the inverse kubectl apply.

ASM-only superpowers

Header-based canary: route only requests with x-canary: true to green, leaving the public unaffected — perfect for internal dogfooding.
Mirror traffic: send 100% to blue but also shadow-copy requests to green for load testing without user impact (mirror: { host: web, subset: green }).
Outlier detection: with outlierDetection configured on the DestinationRule, ASM ejects green pods that keep returning 5xx errors from the load-balancing pool, providing circuit-breaking on top of blue-green.
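
As an illustration of the header-based canary, the sketch below builds a VirtualService manifest as a Python dictionary and prints it as JSON, which kubectl can apply just like YAML. It assumes a host named web, blue and green subsets already defined in a DestinationRule, and the x-canary header from the text; the function name is hypothetical.

```python
import json


def header_canary_virtual_service(host: str, header: str, value: str) -> dict:
    """Build a VirtualService that sends matching requests to green and the rest to blue."""
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": host},
        "spec": {
            "hosts": [host],
            "http": [
                {
                    # Rules are evaluated in order, so the header match comes first.
                    "match": [{"headers": {header: {"exact": value}}}],
                    "route": [{"destination": {"host": host, "subset": "green"}}],
                },
                {
                    # Default route: everyone else stays on blue.
                    "route": [{"destination": {"host": host, "subset": "blue"}}],
                },
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(header_canary_virtual_service("web", "x-canary", "true"), indent=2))
```

The output could be piped to kubectl apply -f - in a cluster where the mesh and the DestinationRule already exist.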

For PCA: if the question mentions "header-based routing" or "mirror production traffic to a new version," the answer is Anthos Service Mesh, not native GKE.

GKE's native Deployment resource has no blue-green strategy; it only supports RollingUpdate and Recreate. To get a near-atomic traffic flip on GKE, you run two Deployments side by side and switch traffic above them, most flexibly with Anthos Service Mesh (Istio VirtualService). Note that Cloud Deploy's built-in strategies are standard and canary; it has no dedicated blueGreen strategy.

A/B Testing with Weighted Backend Services on Global LB

When workloads run on MIGs, Cloud Run, or GKE behind a Global External Application Load Balancer, you don't need a service mesh — the LB itself supports weighted traffic splitting at the URL map level.

URL map weighted routing

defaultRouteAction:
  weightedBackendServices:
  - backendService: projects/$P/global/backendServices/web-blue
    weight: 90
  - backendService: projects/$P/global/backendServices/web-green
    weight: 10

Weights are integers (typically 0-1000); the LB normalizes to percentages.
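Updates are not instantaneous: allow up to a few minutes for a URL map change to propagate across Google's global load-balancing fleet.
Cost: there is no additional per-request fee for the split; you pay the standard load balancer forwarding rule and data processing charges.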
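
The normalization is simple proportional arithmetic: each backend's share is its weight divided by the sum of all weights. A short sketch, using the backend names from the URL map above and a hypothetical helper name:

```python
from typing import Dict


def weights_to_percentages(weights: Dict[str, int]) -> Dict[str, float]:
    """Convert integer backend weights into the share of traffic each backend receives."""
    if any(w < 0 or w > 1000 for w in weights.values()):
        raise ValueError("weights must be between 0 and 1000")
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one backend needs a non-zero weight")
    return {name: 100 * w / total for name, w in weights.items()}


if __name__ == "__main__":
    print(weights_to_percentages({"web-blue": 90, "web-green": 10}))
    print(weights_to_percentages({"web-blue": 900, "web-green": 100}))
```

Both calls describe the same 90/10 split, which is why only the ratio between weights matters, not their absolute values.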

Use case: geographic A/B

Combine weightedBackendServices with a pathMatcher whose routeRules use headerMatches on a region header (for example X-Client-Geo, which must already be present on the incoming request) to send 50% of EU traffic to green and 100% of US traffic to blue. This is how you test a new EU data residency backend without affecting US customers.

vs Anthos Service Mesh

Global LB weighted backends: layer 7 at the edge, works across regions and backend types (MIG + Cloud Run + GKE simultaneously).
ASM VirtualService: inside the cluster, finer-grained (per-namespace, per-header), but only for GKE workloads.

For PCA questions about "split traffic between two Cloud Run regions" or "weighted routing across hybrid backends," the answer is Global LB weighted backend services.

Automated Rollback via Cloud Deploy Verify Phase

Cloud Deploy's verify phase is the GCP-native answer to "automate canary rollback." When verify: true is set on a canary or standard strategy, Cloud Deploy runs the verify stanza from skaffold.yaml as a verify job after the deploy job in each rollout phase.

Verify phase mechanics

# skaffold.yaml
apiVersion: skaffold/v4beta7
kind: Config
verify:
- name: smoke-test
  container:
    name: smoke-test
    image: gcr.io/$PROJECT/smoke-test:latest
    command: ["./run-smoke.sh"]
- name: slo-check
  container:
    name: slo-check
    image: gcr.io/$PROJECT/slo-checker:latest
    command: ["python", "check_slo.py", "--window=5m", "--error-budget=0.99"]

Each verify container runs in a Cloud Build worker and has access to the target cluster/service via the rollout's execution identity.
Exit code 0 = pass, advance to next phase. Non-zero = fail, halt rollout, mark rollout as FAILED.

Hooking into Cloud Monitoring

The slo-check container typically calls Cloud Monitoring API (projects.timeSeries.query) to read the error rate of the canary revision over the last 5 minutes. If the canary's error rate exceeds the baseline by more than the SLO budget, the script exits 1 and Cloud Deploy halts.
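
The decision logic inside such a script can be very small. The sketch below covers only the comparison and the exit-code contract; fetching the request counts from Cloud Monitoring is left out, and the function names, the argument order, and the max_delta threshold are assumptions for illustration rather than the contents of any real check_slo.py.

```python
from typing import List


def error_rate(errors: int, total: int) -> float:
    """Fraction of failed requests; a window with no traffic counts as zero errors."""
    return errors / total if total else 0.0


def canary_within_budget(canary: float, baseline: float, max_delta: float) -> bool:
    """Pass when the canary error rate is at most max_delta above the baseline."""
    return canary - baseline <= max_delta


def main(argv: List[str]) -> int:
    # Arguments: canary_errors canary_total baseline_errors baseline_total max_delta
    c_err, c_tot, b_err, b_tot = (int(x) for x in argv[:4])
    max_delta = float(argv[4])
    ok = canary_within_budget(error_rate(c_err, c_tot), error_rate(b_err, b_tot), max_delta)
    print("PASS" if ok else "FAIL")
    return 0 if ok else 1  # the verify job treats a non-zero exit code as failure


if __name__ == "__main__":
    # Demo values: 1.2% canary errors vs 0.4% baseline, 1 percentage point allowed.
    print("exit code:", main(["12", "1000", "4", "1000", "0.01"]))
```

In a real verify container the return value of main would be passed to sys.exit so that the job's exit status carries the verdict.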

Automatic rollback

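Cloud Deploy does not roll back on its own by default: when a verify job fails, the rollout is marked FAILED and halts. To automate recovery, attach a Cloud Deploy Automation with a repair rollout rule to the delivery pipeline, which can retry the failed job and then roll back to the last successfully deployed release. The manual equivalent is gcloud deploy targets rollback. Either way, the rollback is a new rollout of the previous release, so recovery time depends on how long that deployment takes.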

The verify phase is not optional for production canaries on the PCA exam. Any answer that involves "manually monitoring Cloud Monitoring dashboards and clicking rollback" is wrong — the right answer always involves verify: true + automated SLO checks + Cloud Deploy rollback.

Spinnaker on GCP

Spinnaker (originally from Netflix, now hosted by the Continuous Delivery Foundation) was the de facto multi-cloud CD tool before Cloud Deploy existed. It can still be run self-managed on GKE, though Cloud Deploy is now the preferred managed service.

When Spinnaker still wins

Multi-cloud pipelines: deploy the same artifact to GKE and EKS and on-prem in a single pipeline.
Advanced deployment strategies: built-in Kayenta for automated canary analysis using statistical comparison of metrics between baseline and canary.
Complex approval workflows: manual judgments with custom UI, ServiceNow integration, JIRA-driven approvals.

When Cloud Deploy wins (most cases on PCA exam)

Fully managed: no Spinnaker cluster to operate (Spinnaker itself runs ~10 microservices that need GKE + Redis + Cloud SQL).
Native GCP integration: IAM, Cloud Build, Artifact Registry, Cloud Monitoring all work out of the box.
Lower TCO: Cloud Deploy bills a service fee plus the underlying Cloud Build and storage usage (check the current pricing page); Spinnaker requires you to pay for the cluster running it 24/7.

Migration path

If a customer is running Spinnaker on GKE today and asks for a simpler option:

Replace Spinnaker pipelines with Cloud Deploy DeliveryPipeline YAML.
Replace Kayenta canary analysis with Cloud Deploy verify + Cloud Monitoring SLO checks.
Decommission the Spinnaker cluster (typically saves $300-800/month in GKE node costs).

For PCA: "We use Spinnaker and want to reduce operational overhead" → migrate to Cloud Deploy.

Feature Flags with LaunchDarkly

Feature flags decouple deployment (binary in production) from release (feature visible to users). This is the foundation of dark launching and progressive feature rollout independent of any infrastructure-level canary.

LaunchDarkly on GCP integration pattern

LaunchDarkly SDK initialized at app startup; flags evaluated locally with a streaming connection back to LaunchDarkly's edge for updates (sub-200ms propagation).
Targeting rules: serve true to user.segment == "internal", false to everyone else. Then ramp to user.country == "JP", then 5% of all users, etc.
Audit log of every flag change pipes to Cloud Logging via LaunchDarkly's webhook → Cloud Functions integration for compliance.
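
The targeting rules above boil down to "explicit rules first, then a stable percentage bucket". The sketch below shows that evaluation order with a hash-based bucket; it is not the LaunchDarkly SDK or its bucketing algorithm, and the function names and user attributes are assumptions chosen to mirror the example rules.

```python
import hashlib
from typing import Dict


def bucket(flag_key: str, user_key: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag_key}:{user_key}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100


def is_enabled(flag_key: str, user: Dict[str, str], rollout_percent: int) -> bool:
    """Evaluate targeting rules first, then fall back to a percentage rollout."""
    if user.get("segment") == "internal":
        return True
    if user.get("country") == "JP":
        return True
    return bucket(flag_key, user["key"]) < rollout_percent


if __name__ == "__main__":
    users = [
        {"key": "u-1", "segment": "internal"},
        {"key": "u-2", "country": "JP"},
        {"key": "u-3", "country": "US"},
    ]
    for user in users:
        print(user["key"], is_enabled("enableNewCheckout", user, rollout_percent=5))
```

Hashing the flag key together with the user key keeps each user's result stable across requests while giving different flags independent rollout populations.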

Alternative: in-house flags on Firestore

For simpler use cases, store flag state in Firestore and read it via the Firestore client SDK with real-time listeners. The flag value updates in clients within ~1 second of a Firestore document write. Cheap, but lacks LaunchDarkly's percentage rollouts, segment targeting, and audit features.

Feature flags vs canary deployments

Canary: infrastructure-level — which version of the binary serves the request.
Feature flag: code-level — which code paths execute inside one binary.
They compose: deploy v2 via blue-green (infrastructure risk is contained because both versions of the infra are tested), then turn on the enableNewCheckout flag for 1% of users (feature risk is limited to that 1%).

Candidates often pick "Cloud Deploy canary" when the scenario actually describes a feature flag problem (e.g., "we want to enable the feature for specific customer accounts without redeploying"). Cloud Deploy operates on revisions/images; it cannot target individual users. The correct answer for per-user/per-account toggling is feature flags (LaunchDarkly, Firestore-backed, or app-level config).

Database Schema Migration During Canary

Schema changes are the #1 reason canary deployments fail in production, because the database is shared between old and new app versions during the rollout window. The discipline is called expand-and-contract (or parallel change).

The expand-and-contract pattern

Expand (before canary): make a backward-compatible, additive schema change. ALTER TABLE orders ADD COLUMN new_status VARCHAR(32) DEFAULT NULL; old code ignores new_status, new code reads/writes it. Deploy this migration on its own, with no app change.
Migrate app (the canary): deploy new app version that dual-writes to both status and new_status. Run canary 10% → 25% → 100% over hours/days. Both old and new app versions coexist; both work because both columns exist.
Backfill: run a one-shot job to copy status → new_status for historical rows.
Contract (after 100% rollout + soak time): deploy a third version that reads only new_status. After this is stable, drop the old status column in a final migration.
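
The first three steps can be tried end to end with an in-memory database. This sketch uses SQLite from the Python standard library purely as a stand-in for Cloud SQL; the two insert functions are hypothetical representations of the old and new app versions, and the contract step is left out because it only happens after the old code is retired.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")


def old_app_insert(order_id: int, status: str) -> None:
    """Old app version: only knows about the status column."""
    conn.execute("INSERT INTO orders (id, status) VALUES (?, ?)", (order_id, status))


def new_app_insert(order_id: int, status: str) -> None:
    """New app version: dual-writes to both columns during the canary."""
    conn.execute(
        "INSERT INTO orders (id, status, new_status) VALUES (?, ?, ?)",
        (order_id, status, status),
    )


old_app_insert(1, "paid")  # row written before the migration

# Expand: additive change only, applied before any new app code is deployed.
conn.execute("ALTER TABLE orders ADD COLUMN new_status VARCHAR(32) DEFAULT NULL")

old_app_insert(2, "shipped")  # old code still works; new_status stays NULL
new_app_insert(3, "paid")     # canary pods populate both columns

# Backfill: copy status into new_status for rows written by the old code.
conn.execute("UPDATE orders SET new_status = status WHERE new_status IS NULL")

rows = conn.execute("SELECT id, status, new_status FROM orders ORDER BY id").fetchall()
print(rows)
```

The old insert keeps working after the ALTER TABLE, which is the property that lets both app versions share the database during the rollout window.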

Tools on GCP

Cloud SQL: use Liquibase or Flyway running as a Cloud Build step before the Cloud Deploy promote step. Migrations are versioned in the same Git repo as the app.
Spanner: schema changes are online and non-blocking for additive operations (ADD COLUMN, CREATE INDEX) — Spanner is the friendliest GCP DB for canary deployments because of this property.
AlloyDB: behaves like PostgreSQL; same Liquibase/Flyway story as Cloud SQL.

Anti-patterns

DROP COLUMN during canary: old pods crash on writes. Never do destructive schema changes until 100% of traffic is on new code and you've soaked for at least one release cycle.
ALTER TABLE on a hot table without LOCK=NONE (Cloud SQL MySQL): blocks writes for minutes. Use pt-online-schema-change or gh-ost.
Renaming a column: this is a destructive change in disguise. Treat as ADD new_name → backfill → DROP old_name, never RENAME.

Expand-and-contract rule: every canary-safe schema change is purely additive in the expand phase (ADD COLUMN, ADD INDEX, new table). Destructive changes (DROP, RENAME, ALTER TYPE) only happen in the contract phase, after the old code is fully retired. Forgetting this rule turns every canary into a partial outage.

FAQ — Release Management Strategies

Q1. How do we handle database schema changes during a Blue-Green deployment?

This is the hardest part. You must ensure the database is backward compatible. Usually, you apply schema changes (like adding a column) before the deployment, so both Blue and Green can work with the same database.

Q2. What is a "Dark Launch"?

A dark launch is when you deploy code to production but keep it hidden from users (using feature flags). This allows you to test the performance and stability of the code under real load without affecting the user interface.

Q3. When should I use a Rolling Update instead of Blue-Green?

Use a Rolling Update for internal tools or non-critical services where a short period of "mixed versions" is acceptable and you want to save on infrastructure costs.

Q4. How do I automate a Canary rollback?

Integrate your deployment pipeline (e.g., Cloud Deploy) with Cloud Monitoring. If the error rate for the canary version exceeds a threshold, the pipeline should automatically roll back, for example with gcloud deploy targets rollback or a Cloud Deploy automation rule.

Q5. What is the role of a Service Mesh (Istio) in deployments?

A Service Mesh like Anthos Service Mesh provides fine-grained traffic control. You can shift traffic based on HTTP headers (e.g., "Only send internal employees to the Canary version") or geography, which is much more powerful than simple percentage-based weighting.
