Introduction to Disaster Recovery (DR)
For a Professional Cloud Architect, Disaster Recovery is about planning for the "unthinkable"—a regional outage, a massive data corruption event, or a ransomware attack. DR is not the same as High Availability (HA). While HA focuses on surviving the failure of a single component (like a disk or a zone), DR focuses on surviving a complete catastrophic failure of the primary infrastructure.
Plain English Explanation
Analogy 1 — The Restaurant With a Sister Branch (Cross-region Replication)
Imagine a popular restaurant in Taipei that also runs an identical sister branch in Taichung. Every order taken in Taipei is photographed and faxed to Taichung within seconds; that's a Cloud SQL cross-region replica, whose replication lag is typically a few seconds. If a typhoon floods Taipei (the regional outage), Taichung promotes its menu, recipes, and reservation list to "main branch" status (replica promotion), and customers redirected by Cloud DNS failover can keep ordering with at most a few seconds of missed orders (the RPO).
Analogy 2 — The Airport Control Tower With a Mirror Site (Multi-region Spanner)
A major airport runs two control towers in different cities, both processing flight data simultaneously through synchronous voting. This is multi-region Spanner: writes are committed only when a quorum across regions agrees (Paxos). If one tower is destroyed, the other already has every byte of data—RPO is zero, RTO is essentially zero. The cost? Two fully staffed towers running 24/7, which is why Spanner multi-region costs ~3x regional.
Analogy 3 — The Insurance Policy You Test Every Quarter (DR Drill Cadence)
A factory buys earthquake insurance but also runs evacuation drills every quarter, complete with stopwatches and post-mortems. That's the difference between having a DR plan and knowing it works. Companies that skip drills discover during a real outage that their Backup and DR Service snapshots were 47 days stale, the Terraform state was lost, and nobody remembered the on-call rotation. Quarterly tabletop exercises + annual full failover drills are the equivalent of fire alarms in your runbook.
Defining DR Objectives: RTO and RPO
The entire DR strategy is built around two numbers:
- Recovery Time Objective (RTO): The maximum tolerable downtime. "How long can the business be offline before we lose too much money?" (e.g., 4 hours).
- Recovery Point Objective (RPO): The maximum tolerable data loss. "If we restore from a backup, how much work are we okay with losing?" (e.g., 15 minutes of transactions).
RTO = Recovery Time Objective = wall-clock minutes between disaster start and service restored. RPO = Recovery Point Objective = data window between last good replica/backup and the disaster moment. RTO measures downtime tolerance; RPO measures data-loss tolerance.
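As a minimal sketch of how these two numbers fall out of an incident timeline, the snippet below computes them from three timestamps. The function name `measure_rto_rpo` and the example timeline are hypothetical, chosen to reproduce the 4-hour and 15-minute figures used above.

```python
from datetime import datetime, timedelta


def measure_rto_rpo(
    last_good_copy: datetime,
    disaster_start: datetime,
    service_restored: datetime,
) -> tuple[timedelta, timedelta]:
    """Return (rto, rpo) actually achieved in one incident or drill."""
    if not (last_good_copy <= disaster_start <= service_restored):
        raise ValueError("expected last_good_copy <= disaster_start <= service_restored")
    rto = service_restored - disaster_start  # how long the service was down
    rpo = disaster_start - last_good_copy    # how much recent data was lost
    return rto, rpo


# Hypothetical timeline: last replica sync 02:15, outage 02:30, restored 06:30.
rto, rpo = measure_rto_rpo(
    last_good_copy=datetime(2024, 1, 10, 2, 15),
    disaster_start=datetime(2024, 1, 10, 2, 30),
    service_restored=datetime(2024, 1, 10, 6, 30),
)
print(f"RTO achieved: {rto}, RPO achieved: {rpo}")
```

Comparing these measured values against the business targets is what tells you whether a given architecture is adequate.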
RTO and RPO drive every architecture choice. A bank trading platform with RPO = 0 forces multi-region Spanner (synchronous replication). A blog with an RPO of 24 hours is fine on nightly backups to GCS at a small fraction of the cost. Always extract RTO/RPO from the question stem before picking services on the PCA exam.
DR Tier Definitions: Cold, Warm, and Hot
GCP's DR planning guide maps three tiers to RTO/RPO bands:
Cold DR (Backup and Restore)
- RTO: Hours to days. RPO: Hours (depending on backup frequency).
- Architecture: Data lives in Cloud Storage (Nearline/Coldline/Archive). No compute is running.
- Recovery: Provision infrastructure via Terraform or Deployment Manager, then restore data.
- Cost driver: Storage only. Compute = $0 until failover.
Warm DR (Pilot Light / Warm Standby)
- RTO: Minutes to ~1 hour. RPO: Seconds to minutes.
- Architecture: A minimal copy of the stack runs in the DR region: a Cloud SQL cross-region read replica, a small GKE node pool at minimum size, and a MIG with min_replicas=1. Data replicates continuously.
- Recovery: Scale up compute, promote replica to primary, repoint Cloud DNS or Global Load Balancer backends.
Hot DR (Active-Active / Multi-region)
- RTO: Seconds (or zero). RPO: Zero (with Spanner) or near-zero.
- Architecture: Traffic served from both regions simultaneously via Global External HTTP(S) LB. Data in multi-region Spanner, multi-region BigQuery, or multi-region GCS.
- Cost driver: Full duplicate stack always running. Often 2-3x single-region cost.
On the PCA exam, map keywords to tiers: "lowest cost" → Cold; "minimal downtime, moderate cost" → Warm; "mission-critical, zero data loss" → Hot. Don't pick Hot DR for a blog—it's overengineering and the wrong answer.
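The tier choice can be sketched as a small decision function: pick the cheapest tier whose RTO/RPO band still fits the business tolerance. The cut-off constants below are assumptions loosely derived from the bands listed above ("hours", "minutes", "seconds"), not official thresholds, and `pick_dr_tier` is a hypothetical helper.

```python
HOUR = 3600

# Assumed cut-offs, loosely derived from the tier bands above; tune per business.
COLD_MIN_RTO_S = 1 * HOUR  # cold DR needs hours to rebuild infrastructure
COLD_MIN_RPO_S = 1 * HOUR  # cold DR can lose up to one backup interval
WARM_MIN_RTO_S = 5 * 60    # warm DR needs minutes to scale up and promote
WARM_MIN_RPO_S = 5         # warm DR can lose up to the replication lag


def pick_dr_tier(rto_s: float, rpo_s: float) -> str:
    """Return the cheapest tier whose assumed capability fits the tolerances."""
    if rto_s < 0 or rpo_s < 0:
        raise ValueError("tolerances must be non-negative")
    if rto_s >= COLD_MIN_RTO_S and rpo_s >= COLD_MIN_RPO_S:
        return "cold"
    if rto_s >= WARM_MIN_RTO_S and rpo_s >= WARM_MIN_RPO_S:
        return "warm"
    return "hot"


print(pick_dr_tier(24 * HOUR, 24 * HOUR))  # blog with nightly backups
print(pick_dr_tier(4 * HOUR, 15 * 60))     # 4 h downtime, 15 min data loss
print(pick_dr_tier(1 * HOUR, 0))           # zero data loss tolerated
```

Checking the cheapest tier first mirrors the exam heuristic: never pay for a hotter tier than the stated RTO/RPO requires.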
Backup and DR Service Architecture
Google Cloud Backup and DR Service (formerly Actifio) is the managed, application-consistent backup product for GCE VMs, Cloud SQL, VMware Engine, SAP HANA, and on-prem workloads. Key concepts:
Components
- Management Console: Regional deployment that orchestrates policies, schedules, and recovery.
- Backup/Recovery Appliance: A VM that runs in your VPC and handles snapshot ingestion. Sized by capacity tier (4 TiB to 200 TiB).
- Backup Vault: Immutable, indelible GCS-backed storage for backups. Once written with a retention lock, even project owners cannot delete it before expiry—critical for ransomware defense.
Backup Plans
- Policy templates define frequency (e.g., hourly snapshots), retention (e.g., 30 days), and target (vault or appliance).
- Application consistency uses VSS on Windows and pre/post scripts on Linux to flush DB buffers before snapshot.
- Cross-region copy replicates vault contents to a second region for geographic redundancy.
gcloud Commands
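# Create a backup plan: back up every 6 hours, keep each backup 30 days
# (PROJECT_ID and VAULT_NAME are placeholders; confirm flags with --help)
gcloud backup-dr backup-plans create critical-vms-plan \
--location=us-central1 \
--resource-type=compute.googleapis.com/Instance \
--backup-vault=projects/PROJECT_ID/locations/us-central1/backupVaults/VAULT_NAME \
--backup-rule=rule-id=hourly,recurrence=HOURLY,hourly-frequency=6,retention-days=30,backup-window-start=0,backup-window-end=24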
Backup Vault immutability is the ransomware kill switch. Even compromised IAM credentials cannot delete vault contents before the retention period expires. This is why Backup and DR Service beats plain snapshot schedules for regulated workloads.
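To make the retention-lock behavior concrete, here is a toy model of the rule described above. It is not the Backup and DR Service API: `can_delete_backup` and its parameters are hypothetical, and the caller's role is accepted only to show that it plays no part in the decision.

```python
from datetime import datetime, timedelta


def can_delete_backup(
    created_at: datetime,
    enforced_retention: timedelta,
    now: datetime,
    caller_role: str = "roles/owner",
) -> bool:
    """Toy model of a retention lock: only elapsed time unlocks deletion."""
    # caller_role is deliberately ignored: under a retention lock, even an
    # owner (or an attacker holding owner credentials) must wait for expiry.
    return now >= created_at + enforced_retention


created = datetime(2024, 1, 1)
retention = timedelta(days=30)
print(can_delete_backup(created, retention, now=datetime(2024, 1, 15)))  # inside lock
print(can_delete_backup(created, retention, now=datetime(2024, 2, 1)))   # after expiry
```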
Multi-region Spanner Failover
Cloud Spanner multi-region configurations (e.g., nam-eur-asia1, nam3) provide synchronous replication across multiple regions using Paxos consensus.
Topology
- Read-write regions (2): hold full replicas, vote in Paxos quorum, serve writes.
- Witness region (1): holds metadata only, votes in quorum, no data replicas.
- Read-only regions (optional): serve stale reads, do not vote.
Failover Behavior
- Automatic: If one read-write region becomes unavailable, the remaining quorum continues serving writes. No human action required.
- RPO = 0: Writes are acknowledged only after Paxos commits to a majority, so a committed write survives the loss of any single region.
- RTO ≈ seconds: Clients reconnect to the surviving leader via the global Spanner endpoint.
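The failover behavior above follows from simple majority arithmetic. The sketch below assumes two voting replicas in each read-write region and one in the witness region; those vote counts and the region labels are illustrative assumptions, not values stated in this guide.

```python
# Assumed voting replicas per region (illustrative, see lead-in).
VOTES = {"read-write-a": 2, "read-write-b": 2, "witness": 1}


def has_write_quorum(votes: dict[str, int], unavailable: set[str]) -> bool:
    """True if the surviving regions still hold a strict majority of votes."""
    total = sum(votes.values())
    alive = sum(n for region, n in votes.items() if region not in unavailable)
    return alive > total // 2


print(has_write_quorum(VOTES, {"read-write-a"}))             # one region lost
print(has_write_quorum(VOTES, {"read-write-a", "witness"}))  # two regions lost
```

Under these assumptions, any single-region loss leaves at least 3 of 5 votes, so writes continue; losing two regions at once does not.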
Cost
Multi-region Spanner runs ~3x the price of regional Spanner per node-hour. For non-critical workloads, regional Spanner + a daily backup is dramatically cheaper.
Don't confuse Spanner regional with Spanner multi-region. Regional Spanner survives zone failure but NOT region failure—if us-central1 goes down, regional Spanner there is unavailable until the region recovers. PCA exam questions often slip this in.
Cloud SQL Cross-Region Replica Promotion
For MySQL, PostgreSQL, and SQL Server workloads needing warm-standby DR:
Setup
- Enable automated backups on the primary (plus binary logging for MySQL).
- Create a cross-region read replica in the DR region.
- Monitor replication lag via the replica_lag metric in Cloud Monitoring.
Failover Procedure (Manual)
# 1. Stop application writes to primary (or accept they're lost beyond replica lag)
# 2. Promote the replica
gcloud sql instances promote-replica dr-replica
# 3. Repoint application connection strings (or update Cloud DNS)
# 4. (After primary recovers) set up reverse replication or rebuild
Trade-offs
- RPO: Equal to replication lag, typically seconds but can spike under heavy write load.
- RTO: ~5-15 minutes for promotion + DNS/config changes.
- Promotion is one-way: The promoted replica becomes a standalone primary. You must rebuild replication afterward.
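Because the effective RPO equals the replication lag at the moment of failure, it helps to check lag history against the target rather than trusting the typical value. A minimal sketch follows; the sample values and the 30-second target are made up, and in practice the samples would come from the replica lag metric.

```python
def rpo_breaches(lag_samples_s: list[float], rpo_target_s: float) -> list[int]:
    """Return the indices of lag samples that exceeded the RPO target."""
    return [i for i, lag in enumerate(lag_samples_s) if lag > rpo_target_s]


# Hypothetical lag samples in seconds; one spike under heavy write load.
samples = [0.8, 1.2, 2.0, 45.0, 3.1]
worst = max(samples)
breaches = rpo_breaches(samples, rpo_target_s=30.0)
print(f"worst-case lag: {worst}s, samples over target: {breaches}")
```

A promotion during the spike at index 3 would have lost up to 45 seconds of writes, even though the typical lag is around one or two seconds.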
GCS Turbo Replication
By default, dual-region and multi-region GCS buckets replicate objects across regions asynchronously, usually quickly but without a 15-minute guarantee. Turbo replication is a paid add-on for dual-region buckets that provides:
- RPO of 15 minutes backed by an SLA: the design target is 100% of newly written objects available in both regions within 15 minutes of write.
- Per-GB cost premium on top of standard dual-region storage.
- Use case: Disaster recovery for critical data lakes, regulated archives, ML training datasets where stale reads after failover are unacceptable.
# Create a dual-region bucket with turbo replication
gcloud storage buckets create gs://critical-dr-bucket \
--location=NAM4 \
--rpo=ASYNC_TURBO
Turbo replication only works on dual-region buckets (e.g., NAM4 = us-central1 + us-east1), not multi-region buckets like US. Multi-region buckets don't expose the underlying replica regions, so SLA guarantees can't be made on per-object replication time.
Cloud DNS Failover Routing
Cloud DNS supports several routing policies for DR scenarios:
Failover Policy
- Primary and backup targets with health checks.
- If primary fails health check, DNS automatically serves the backup IP.
- TTL controls how fast clients switch; typically 30-60s for DR.
Geo Policy
- Routes users to the nearest healthy region based on source IP geolocation.
- Combined with health checks, supports active-active topologies.
Weighted Round-Robin (WRR)
- Distributes traffic by weight (e.g., 90% primary / 10% canary DR).
- Useful for blue-green DR testing.
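# Failover policy via gcloud
# (forwarding rule names are placeholders for the primary and DR load balancers)
gcloud dns record-sets create api.example.com. \
--zone=prod-zone --type=A --ttl=60 \
--routing-policy-type=FAILOVER \
--routing-policy-primary-data=primary-forwarding-rule \
--routing-policy-backup-data-type=GEO \
--routing-policy-backup-data="us-east1=dr-forwarding-rule" \
--enable-health-checking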
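The failover policy and the role of TTL can be illustrated with a small simulation. This is a simplified model, not Cloud DNS behavior in detail: it assumes detection takes a fixed number of consecutive failed probes and that clients cache an answer for at most one TTL. The addresses, probe interval, and threshold are hypothetical.

```python
def resolve(primary_ip: str, backup_ip: str, primary_healthy: bool) -> str:
    """Answer a query under a failover policy: primary unless it is unhealthy."""
    return primary_ip if primary_healthy else backup_ip


def worst_case_switchover_s(
    check_interval_s: int, unhealthy_threshold: int, ttl_s: int
) -> int:
    """Rough upper bound on the time until clients reach the backup.

    Detection needs `unhealthy_threshold` consecutive failed probes, and a
    client that cached the old answer just before the flip keeps it for a TTL.
    """
    return check_interval_s * unhealthy_threshold + ttl_s


print(resolve("10.0.0.1", "10.1.0.1", primary_healthy=False))
print(worst_case_switchover_s(check_interval_s=10, unhealthy_threshold=3, ttl_s=60))
```

With these numbers the DNS layer alone contributes up to about 90 seconds to RTO, which is why DR records use short TTLs.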
DR Runbook Templates
A runbook turns tribal knowledge into repeatable execution under pressure. Minimum sections:
- Detection & Declaration: Who declares a disaster? What signal triggers it (e.g., Cloud Monitoring alert, Service Health Dashboard RED)?
- Roles: Incident Commander, Comms Lead, Tech Lead, Scribe. On-call rotation in PagerDuty / Opsgenie.
- Pre-flight Checks: Verify backup freshness, DR region capacity, IAM access in DR project.
- Failover Steps: Numbered, copy-pasteable commands. Include the gcloud sql instances promote-replica and DNS update commands above.
- Validation: Smoke tests, health endpoints, data integrity checks.
- Failback Procedure: How to return to the original region once recovered.
- Post-mortem Template: Timeline, root cause, action items.
Store runbooks in version-controlled markdown in the same repo as IaC. Avoid Confluence-only runbooks—during an outage, third-party SaaS may also be down or have stale auth.
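The "verify backup freshness" pre-flight check is easy to automate. The sketch below flags any workload whose latest backup is older than an allowed age; the workload names, timestamps, and the 24-hour limit are hypothetical, and a real runbook would load the timestamps from its backup tooling.

```python
from datetime import datetime, timedelta


def stale_backups(
    latest_backup_at: dict[str, datetime],
    max_age: timedelta,
    now: datetime,
) -> dict[str, timedelta]:
    """Return {workload: age} for every backup older than max_age."""
    return {
        name: now - taken_at
        for name, taken_at in latest_backup_at.items()
        if now - taken_at > max_age
    }


now = datetime(2024, 3, 1, 12, 0)
latest = {
    "orders-db": now - timedelta(hours=2),  # fresh
    "legacy-vm": now - timedelta(days=47),  # stale
}
problems = stale_backups(latest, max_age=timedelta(hours=24), now=now)
for name, age in problems.items():
    print(f"STALE: {name} last backed up {age.days} days ago")
```

Running a check like this on a schedule, not only during an incident, is what catches a stale snapshot before the outage does.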
DR Drill Cadence (Quarterly)
A plan untested is a wish. Recommended cadence:
| Drill Type | Frequency | Scope |
|---|---|---|
| Tabletop exercise | Quarterly | Walk through scenario verbally; no real failover. |
| Game day / partial failover | Semi-annually | Failover one tier (e.g., DB only) in staging. |
| Full DR drill | Annually | Full production failover to DR region. |
| Backup restore test | Monthly | Restore latest snapshot to scratch project, verify integrity. |
Each drill produces a drill report with measured RTO/RPO vs target, gaps identified, and remediation tickets filed.
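A drill report can be as simple as a record of target versus measured values plus a derived list of gaps. The structure below is one possible shape, not a prescribed template; the `DrillResult` class, its field names, and the sample numbers are illustrative.

```python
from dataclasses import dataclass


@dataclass
class DrillResult:
    name: str
    target_rto_min: float
    measured_rto_min: float
    target_rpo_min: float
    measured_rpo_min: float

    def gaps(self) -> list[str]:
        """List every objective the drill failed to meet."""
        found = []
        if self.measured_rto_min > self.target_rto_min:
            over = self.measured_rto_min - self.target_rto_min
            found.append(f"RTO missed by {over:g} min")
        if self.measured_rpo_min > self.target_rpo_min:
            over = self.measured_rpo_min - self.target_rpo_min
            found.append(f"RPO missed by {over:g} min")
        return found


# Hypothetical quarterly drill: failover took 95 minutes against a 60-minute target.
drill = DrillResult("Q1 DB failover", target_rto_min=60, measured_rto_min=95,
                    target_rpo_min=5, measured_rpo_min=2)
print(drill.gaps())
```

Each entry returned by `gaps()` would map to a remediation ticket.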
Tabletop Exercise: Step-by-Step
A tabletop is a structured discussion, typically 90-120 minutes:
- Scenario brief (10 min): Facilitator presents a scenario: "At 02:30 UTC, us-central1 becomes unreachable. Customers report 5xx errors."
- Round 1 – Detection (20 min): Team narrates: "I'd check Cloud Monitoring uptime checks first. Then Service Health Dashboard. Then page the IC."
- Round 2 – Decision (30 min): "Do we failover? Who decides? What's the SLO breach threshold?"
- Round 3 – Execution (30 min): Walk through runbook commands verbally. Flag missing steps, wrong owners, stale credentials.
- Hotwash (15 min): Capture gaps, assign owners, set deadlines.
Tabletops surface organizational gaps (who has the authority to failover at 3 AM?) that technical drills miss. Mix engineers, SRE, product, legal, and comms in the same room.
DR Patterns on Google Cloud
1. Cold DR (Backup and Restore)
- Method: Data is backed up to Cloud Storage. Compute resources are not running.
- Cost: Lowest.
- Recovery: Use Terraform to recreate the infrastructure and restore data from GCS.
- Use Case: Non-critical apps where 24-hour downtime is acceptable.
2. Warm DR (Warm Standby)
- Method: A "Pilot Light" version of your app is running. Database is replicated to a secondary region (e.g., Cloud SQL Cross-Region Replica). A small GKE cluster or a few VMs are on standby.
- Cost: Medium.
- Recovery: Scale up the standby compute resources and promote the replica database to primary.
3. Hot DR (Active-Active / Multi-Region)
- Method: Traffic is split between two regions using a Global External HTTP(S) Load Balancer. Data is stored in a multi-region database such as Spanner, or in Bigtable with multi-cluster replication (which is eventually consistent across clusters).
- Cost: Highest.
- Recovery: Near-zero RTO. If Region A goes down, the load balancer sends all traffic to Region B.
Vendor Lock-in Considerations
DR architectures often deepen cloud lock-in. Trade-offs to weigh:
Lock-in by Service Tier
- Low lock-in: GCE VMs + GCS snapshots. Disks and images can be exported to OVA and rehydrated on AWS/Azure or on-prem.
- Medium lock-in: Cloud SQL (PostgreSQL/MySQL/SQL Server are portable engines, but GCP-managed control plane is not). Migration via logical dump + restore.
- High lock-in: Spanner, BigQuery, Bigtable. Proprietary APIs with no drop-in replacement. Cross-cloud DR requires dual-writes through an abstraction layer or accepting recovery via export-only.
Mitigation Strategies
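- Multi-cloud DR: Replicate backups to AWS S3 or Azure Blob for ultimate provider-failure protection. Note that Storage Transfer Service is built mainly for moving data into Cloud Storage, so outbound copies generally need a separate sync tool or export job. Expensive in egress.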
- Open-source equivalents: Use Cloud SQL for PostgreSQL instead of Spanner where possible; use GKE instead of Cloud Run for portability.
- IaC discipline: Keep Terraform modules cloud-agnostic where feasible. Encapsulate GCP-specific services behind interfaces in application code.
"Multi-cloud DR" sounds resilient but doubles operational complexity, increases attack surface, and rarely pays off. A single-cloud, multi-region DR strategy is sufficient for 99% of workloads. PCA exam answers favor multi-region within GCP, not multi-cloud.
Key GCP Services for DR
| Service | DR Feature |
|---|---|
| Cloud Storage | Versioning and Bucket Lock (protects against ransomware/deletion). Multi-region buckets for high durability. |
| Cloud SQL | Cross-region read replicas. A replica can be promoted to a standalone primary if the primary region fails. |
| Cloud Spanner | Multi-region configurations provide synchronous replication and 99.999% availability with zero RPO. |
| Compute Engine | Machine Images allow you to capture the entire state of a VM for restoration in another region. |
| Persistent Disk | Async Replication (PD-Async) allows you to replicate disks across regions with low RPO. |
| Backup and DR Service | Application-consistent backups with immutable vaults; ransomware-resistant. |
Testing the DR Plan
A DR plan that isn't tested is not a plan—it's a wish.
- Chaos Engineering: Use tools to simulate region failures.
- Drills: Perform a "Failover Drill" at least once a year.
- Automation: Use Infrastructure as Code (IaC) to ensure that the recovery environment is identical to the production environment.
Architect's Warning: Beware of "Split-Brain" scenarios in Active-Active setups, where both regions think they are the "Master." Use a reliable global consensus database like Spanner to prevent data corruption.
FAQ — Disaster Recovery Planning
Q1. Is High Availability (HA) the same as Disaster Recovery (DR)?
No. HA protects against the failure of a single zone or component within a region. DR protects against the failure of the entire region or the entire provider.
Q2. How does "Snapshot Schedule" help in DR?
Snapshot schedules for Persistent Disks create backups automatically. Choosing a regional storage location keeps the snapshots available if the disk's zone fails; choosing a multi-regional location also keeps them available if an entire region fails.
Q3. What is the cheapest DR strategy?
Cold DR (Backup and Restore). You only pay for storage of the backups. You don't pay for any compute (VMs/GKE) until the disaster actually happens and you need to rebuild.
Q4. Can I use VPC Peering for DR?
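VPC networks are global, so subnets in different regions of the same VPC already reach each other over internal IP addresses on Google's backbone, with no peering required. VPC Peering is useful when the DR environment lives in a separate VPC network (for example, a dedicated DR project). To connect to on-premises sites, use Cloud VPN or Cloud Interconnect.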
Q5. What is "Cloud DNS Failover"?
Cloud DNS can use Health Checks to automatically change where a domain points. If the IP in Region A stops responding, DNS can automatically point users to the IP in Region B.
Final Architect Tip
On the PCA exam, pay close attention to the RTO and RPO requirements. If the requirement is "Lowest cost," choose Cold DR. If it's "Minimal downtime," choose Warm Standby. If it's "Mission critical / Zero downtime," choose Active-Active Multi-region. Also, remember that Cloud Storage is the "Swiss Army Knife" of DR—it's where your backups, logs, and images should live.