Introduction to Disaster Recovery for Data Platforms
Disaster Recovery for Data Platforms is the discipline of keeping analytical and transactional data systems alive when an entire region, zone, or service control plane goes dark. On Google Cloud, the building blocks range from a nightly export to Cloud Storage (GCS) all the way to multi-region Spanner, where a regional failure is absorbed within seconds and without a manual failover step. The Professional Data Engineer (PDE) exam expects you to pick the right tier for the right workload and to defend that choice with concrete RTO and RPO numbers.
This note walks through the mental model, the service-by-service mechanics, the runbook patterns that real teams use, and the drill cadence that turns a paper plan into something you can trust at 3 AM.
Plain English Explanation
Before diving into Disaster Recovery for Data Platforms on GCP, here are three concrete pictures that make the trade-offs click.
Think of DR Tiers Like Insurance Policies
You can buy renter's insurance, basic homeowner coverage, or a premium policy with a guaranteed hotel suite the night your house floods. Each tier costs more and pays out faster. Disaster Recovery for Data Platforms works the same way. A cold backup in a second region is the renter's policy: cheap, but you spend a weekend rebuilding. A warm standby is basic homeowner: a few hours to recover and you are back. A multi-region active-active Spanner instance is the premium policy: the moment your primary region fails, traffic is already being served from somewhere else and you barely notice. Nobody buys premium insurance for a $200 bicycle, and nobody puts a marketing analytics sandbox on multi-region Spanner. Match the policy to the asset.
Think of RPO and RTO Like a Photo Album and a Moving Truck
RPO answers "how recent is my last photo of the data?" If you take a photo every 24 hours and the building burns down, you lose at most 24 hours of memories. That is your RPO. RTO answers "once the truck arrives with the new house, how long until I can sleep in it?" That is the time from disaster to fully usable platform. The two numbers are independent. You might have a 5-minute RPO (Datastream CDC streaming changes constantly) but a 4-hour RTO (because someone has to manually promote the replica and rerun migrations). Both numbers cost real money to shrink.
Think of a DR Runbook Like a Fire Drill at School
Schools do not write a fire safety manual and shove it in a drawer. They run drills every term, time how long it takes for the last student to reach the assembly point, and revise the procedure when a stairwell turns out to be a bottleneck. Disaster Recovery for Data Platforms needs the same muscle memory. The runbook tells you which scripts to run, which dashboards to check, and which on-call person flips the DNS record. The drill tells you whether the runbook still works after six months of Terraform drift. Teams that skip drills tend to discover that their failover is broken on the day they actually need it.
Core Concepts of Disaster Recovery for Data Platforms
Disaster Recovery for Data Platforms rests on a small set of vocabulary that the exam expects you to use precisely.
RTO and RPO
Recovery Time Objective is the maximum tolerable wall-clock time between a disaster and a working system. Recovery Point Objective is the maximum tolerable amount of data loss measured in time. A bank's funds-transfer ledger might require RPO = 0 and RTO < 1 minute. A monthly reporting cube might tolerate RPO = 24 hours and RTO = 12 hours. The PDE exam loves to give you a scenario, state both numbers, and ask which GCP product mix satisfies them at lowest cost.
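To make the two definitions concrete, here is a minimal sketch that derives the achieved RPO and RTO of one incident from three timestamps. The function name and the sample timestamps are hypothetical; nothing here calls a GCP API.

```python
from datetime import datetime, timedelta

def achieved_rpo_rto(last_recoverable_write: datetime,
                     disaster_at: datetime,
                     service_restored_at: datetime) -> tuple[timedelta, timedelta]:
    """Return (achieved RPO, achieved RTO) for a single incident."""
    # Anything written after the last recoverable write and before the disaster is lost.
    rpo = disaster_at - last_recoverable_write
    # Downtime runs from the disaster until the platform is usable again.
    rto = service_restored_at - disaster_at
    return rpo, rto

# Hypothetical incident: the replica was 5 minutes behind and recovery took 4 hours.
rpo, rto = achieved_rpo_rto(
    datetime(2024, 1, 10, 2, 55),
    datetime(2024, 1, 10, 3, 0),
    datetime(2024, 1, 10, 7, 0),
)
print("achieved RPO:", rpo, "achieved RTO:", rto)
```

Comparing these measured values with the stated objectives is what tells you whether a drill or a real incident met the targets.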
DR Tiers
The industry-standard tiers map cleanly onto GCP services:
- Cold (backup and restore): data is exported to a separate region; compute is rebuilt on demand. Cheapest. RTO measured in hours to days, RPO measured in hours.
- Warm (pilot light): a minimal version of the platform runs in the secondary region with replicated data. RTO measured in tens of minutes, RPO measured in minutes.
- Hot (warm standby): a fully scaled secondary that lags the primary slightly. RTO under ten minutes, RPO under one minute.
- Multi-region active-active: both regions serve traffic simultaneously through a globally consistent layer. RTO is effectively zero, RPO is zero.
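The tier list above can be read as a lookup from targets to the cheapest tier that fits. The sketch below assumes illustrative cutoffs for each tier (for example, that cold recovery cannot beat one hour); those numbers are one reading of the list, not official thresholds, and the function name is hypothetical.

```python
from datetime import timedelta

# (tier, best RTO it is assumed to reach, best RPO it is assumed to reach),
# ordered cheapest first. The cutoffs are illustrative assumptions.
TIERS = [
    ("cold", timedelta(hours=1), timedelta(hours=1)),
    ("warm", timedelta(minutes=10), timedelta(minutes=1)),
    ("hot", timedelta(minutes=1), timedelta(seconds=1)),
    ("active-active", timedelta(0), timedelta(0)),
]

def cheapest_tier(rto_target: timedelta, rpo_target: timedelta) -> str:
    """Return the cheapest tier whose assumed floor fits both targets."""
    for name, best_rto, best_rpo in TIERS:
        if best_rto <= rto_target and best_rpo <= rpo_target:
            return name
    # Only reached for nonsensical (negative) targets.
    return "active-active"

# Monthly reporting cube: RTO 12 hours, RPO 24 hours.
print(cheapest_tier(timedelta(hours=12), timedelta(hours=24)))
# Funds-transfer ledger: RTO 1 minute, RPO zero.
print(cheapest_tier(timedelta(minutes=1), timedelta(0)))
```

Walking the tiers from cheapest to most expensive encodes the habit the exam rewards: stop at the first tier that satisfies both numbers.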
Failure Domains
A disaster on GCP is not always a smoking data centre. The failure domains you plan for include single-zone outages, regional outages, control-plane outages (rare but documented), accidental deletion by a human, encryption-key revocation, and ransomware-style data corruption that replicates faster than you can stop it. Disaster Recovery for Data Platforms must cover the corruption case explicitly, because replication alone propagates the bad write to your standby in seconds.
Backup vs Replication vs Snapshot
These words get used loosely in conversation but the exam treats them as distinct. A backup is an immutable point-in-time copy stored separately from the source, suitable for restore weeks later. Replication is a continuous stream that keeps a secondary in near-real-time sync; it does not protect against logical corruption. A snapshot is a metadata-light capture of state at an instant, often used as the basis of a clone or a restore.
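A toy model helps show why replication does not protect against logical corruption. In the sketch below, plain dictionaries stand in for a primary, a replica, and a backup; the table names and the helper function are hypothetical.

```python
import copy

primary = {"orders": [1, 2, 3], "customers": ["a", "b"]}
replica = copy.deepcopy(primary)   # kept in sync by replication
backup = copy.deepcopy(primary)    # point-in-time copy, never touched again

def apply_and_replicate(op: str, table: str) -> None:
    """Apply an operation to the primary and mirror it to the replica."""
    for store in (primary, replica):
        if op == "drop":
            store.pop(table, None)

# An accidental DROP TABLE reaches the replica as faithfully as any other write.
apply_and_replicate("drop", "orders")

print("primary has orders:", "orders" in primary)
print("replica has orders:", "orders" in replica)
print("backup has orders:", "orders" in backup)
```

Only the copy that sits outside the replication stream still holds the table, which is the property a backup is meant to provide.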
Architecture and Design Patterns
The shape of your Disaster Recovery for Data Platforms deployment falls into a handful of recurring patterns. Picking the right one is mostly a function of latency tolerance, budget, and the consistency model the application demands.
Pattern 1: Backup and Restore
The simplest pattern. Source data lives in region A. A scheduled job copies it to a dual-region GCS bucket or a separate region. On disaster, you provision compute in region B, point it at the backup, and replay anything that landed after the last backup. Suitable for analytics workloads where a 12-24 hour data gap is acceptable.
Pattern 2: Pilot Light
A skeleton of the production system already exists in region B: minimal Dataflow capacity, an empty BigQuery dataset structure, IAM bindings, network configuration. Data flows continuously from A to B through Datastream, BigQuery cross-region dataset replication, or scheduled transfers. On disaster, you scale the skeleton up and divert traffic. Cuts RTO from hours to minutes for moderate cost.
Pattern 3: Warm Standby
Region B runs at perhaps 30 percent of region A's capacity, kept current by streaming replication. A failover flips the load balancer, scales B to full size, and resumes processing. Common for streaming pipelines that cannot tolerate more than a few minutes of lag.
Pattern 4: Multi-Region Active-Active
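Both regions serve production traffic. Spanner multi-region instances and Pub/Sub's global topics make this possible without application-level coordination. BigQuery needs more care: a multi-region dataset location does not by itself give you cross-region failover, so the warehouse layer still relies on cross-region dataset replication or managed disaster recovery. You pay for double the capacity, but a regional outage is a non-event for the layers that are truly active-active.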
Pattern 5: Asymmetric Recovery
A subtle pattern often missed by beginners. The DR site does not have to be identical to production. You might run a smaller, slower BigQuery slot reservation in the secondary region and accept that month-end reports take longer during a failover, in exchange for a much smaller bill the other 364 days of the year. The exam rewards candidates who recognise that "DR equal to prod" is a budget choice, not a requirement.
GCP Service Deep Dive
Each managed data service on GCP exposes its own primitives for Disaster Recovery for Data Platforms. Knowing which knob does what is the difference between a passing PDE answer and a failing one.
BigQuery Cross-Region Replication
BigQuery offers two distinct cross-region capabilities. Cross-region dataset replication, generally available, asynchronously copies a dataset from a source region to one or more secondary regions. Replication lag is typically minutes. You query the secondary read-only until you promote it. The newer managed disaster recovery feature provides a single failover reservation that handles both compute and data movement and gives you a documented RPO and RTO from Google.
For ad-hoc DR you can also schedule BigQuery Data Transfer Service jobs that copy datasets cross-region on a cron, which is essentially a backup pattern with BigQuery as both source and target.
BigQuery cross-region replication is asynchronous. If the primary region fails mid-load, the most recent rows may not be in the secondary yet. Plan for an RPO of at least a few minutes and reconcile from the source-of-truth (Pub/Sub, GCS, Datastream) during recovery. Reference: https://cloud.google.com/bigquery/docs/data-replication
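The reconciliation step can be sketched as a filter over the source of truth: replay everything newer than the last row the secondary is known to hold, with a small look-back. The function name, the one-minute overlap, and the sample events are assumptions for illustration; the sketch does not call BigQuery.

```python
from datetime import datetime, timedelta

def rows_to_replay(source_events, secondary_high_water_mark,
                   overlap=timedelta(minutes=1)):
    """Select source events that may be missing from the promoted secondary.

    source_events: iterable of (event_time, payload) from the source of truth.
    overlap: extra look-back to cover rows in flight when the primary failed.
    """
    cutoff = secondary_high_water_mark - overlap
    return [(ts, payload) for ts, payload in source_events if ts > cutoff]

events = [
    (datetime(2024, 1, 10, 2, 50, 0), "row-1"),
    (datetime(2024, 1, 10, 2, 57, 30), "row-2"),
    (datetime(2024, 1, 10, 2, 59, 0), "row-3"),
    (datetime(2024, 1, 10, 3, 0, 0), "row-4"),
]
# The secondary's newest row is from 02:58, so replay starts at 02:57.
replay = rows_to_replay(events, datetime(2024, 1, 10, 2, 58, 0))
print([payload for _, payload in replay])
```

Because the overlap deliberately re-reads a short window, the load into the secondary has to be idempotent (for example, a merge on a key) so that re-sent rows do not create duplicates.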
Spanner Multi-Region Failover
For relational workloads, Spanner is the easy mode of disaster recovery. A multi-region instance configuration (for example nam-eur-asia1 or nam3) replicates synchronously across regions using Paxos. There is no separate failover step: if a region fails, the remaining replicas continue serving with strong consistency. RPO is zero by construction, and RTO is essentially the leader re-election time, typically a few seconds.
The trade-off is write latency. Cross-continental configurations add tens of milliseconds to every commit. If your write path cannot tolerate that, a regional configuration with a separate read-only replica region is the alternative, but you give up RPO = 0.
Cloud Storage Turbo Replication
Dual-region GCS buckets replicate objects asynchronously between two regions, with a default RPO target measured in hours. Turbo replication is an opt-in upgrade that targets RPO of 15 minutes for 100 percent of newly written objects. It costs more per GB-month and per operation, but it is the right answer when you have a data lake whose freshness directly drives revenue.
Use turbo replication selectively. A typical pattern is dual-region buckets with turbo replication for the "hot" landing zone where ingestion lands, and standard dual-region (or even nearline regional) for archived partitions where 12-hour lag is fine. This can cut DR storage costs in half without compromising the workloads that matter. Reference: https://cloud.google.com/storage/docs/turbo-replication
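The selective policy described above amounts to a per-bucket decision based on how much replication lag the data can tolerate. The sketch below uses the 15-minute turbo target and the 12-hour archive tolerance from the text as cutoffs; the function name and return labels are hypothetical and are not Cloud Storage API values.

```python
from datetime import timedelta

TURBO_RPO = timedelta(minutes=15)   # turbo replication target from the text
ARCHIVE_LAG = timedelta(hours=12)   # lag the text treats as fine for archives

def replication_mode(tolerated_lag: timedelta) -> str:
    """Suggest a replication setting for a dual-region bucket."""
    if tolerated_lag >= ARCHIVE_LAG:
        return "default"
    if tolerated_lag >= TURBO_RPO:
        return "turbo"
    # Asynchronous bucket replication cannot promise less than the turbo target.
    return "needs-synchronous-design"

print(replication_mode(timedelta(minutes=30)))  # hot landing zone
print(replication_mode(timedelta(hours=24)))    # archived partitions
```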
Dataflow Snapshot and Replay
Streaming Dataflow jobs accumulate state: windowed aggregations, side inputs, timers. A snapshot captures that state plus the source watermark, lets you redeploy the job in another region, and resumes from where the snapshot was taken. Combined with a Pub/Sub source that has a long retention window, this gives you a clean recovery path: take periodic snapshots, on disaster restart in the secondary region from the most recent snapshot, replay any Pub/Sub messages newer than the snapshot.
Pub/Sub Message Retention
Pub/Sub retention works at two levels. A topic can be configured to retain published messages for up to 31 days regardless of subscription state; this topic-level retention is opt-in rather than on by default. A subscription retains unacknowledged messages for 7 days by default (configurable up to 31 days). For DR purposes, this retention is your safety net: if a subscriber fails for any reason, including a regional outage that takes the subscriber down, messages are still recoverable when you bring up the replacement.
Pub/Sub itself is a global service. Topics and subscriptions live in Google's global control plane, and message storage automatically spans multiple zones within a region (or multiple regions for global endpoints). You generally do not need to design a DR plan for Pub/Sub itself, only for its consumers.
Datastream for Change Data Capture
Datastream captures row-level changes from Oracle, MySQL, PostgreSQL, and SQL Server in near-real-time and writes them to BigQuery or Cloud Storage; other targets such as Cloud SQL or Spanner are typically reached by running a Dataflow template on top of the Datastream output. As a DR tool, Datastream is what enables warm-standby patterns for heterogeneous source systems: you mirror an on-prem Oracle into BigQuery in two regions, and on disaster the secondary BigQuery is already minutes behind the source.
Backup and DR Service
The Backup and DR Service (formerly Actifio) is GCP's managed backup product for Compute Engine, VMware on GCP, Oracle, SAP HANA, and several SQL flavours. It provides application-consistent backups, instant mount of historical backups for testing, and cross-region replication of the backup vault. For PDE candidates, the relevant scenario is "we have legacy databases on Compute Engine VMs and need a managed backup story" — that is when Backup and DR Service is the right answer rather than rolling your own scripts.
An application-consistent backup is one taken with the source application quiesced or with a database-aware agent, so that the restored copy is internally consistent and recoverable without log replay. Crash-consistent backups, by contrast, capture whatever was on disk at the moment and may require recovery procedures on restore. Reference: https://cloud.google.com/backup-disaster-recovery/docs/concepts/backup-and-dr-overview
Bigtable Multi-Cluster Routing
Bigtable instances can have multiple clusters in different regions, with replication between them. An app profile with multi-cluster routing automatically directs requests to the nearest healthy cluster and fails over transparently when one becomes unavailable. Eventually consistent across clusters, so the trade-off is brief read-your-writes anomalies during failover, which most analytical workloads accept.
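The routing behaviour can be illustrated with a small simulation: pick the closest healthy cluster, and fall through to the next one when the closest is down. This is a conceptual sketch only; the cluster IDs, latency figures, and function are hypothetical and do not use the Bigtable client library.

```python
def route_request(clusters, client_region):
    """Pick the closest healthy cluster, mimicking multi-cluster routing.

    clusters: list of dicts with 'id', 'healthy', and 'latency_ms'
              (a mapping from client region to round-trip latency).
    """
    healthy = [c for c in clusters if c["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy cluster available")
    nearest = min(healthy, key=lambda c: c["latency_ms"][client_region])
    return nearest["id"]

clusters = [
    {"id": "bt-us-east1", "healthy": True,
     "latency_ms": {"us-east1": 2, "europe-west1": 90}},
    {"id": "bt-europe-west1", "healthy": True,
     "latency_ms": {"us-east1": 90, "europe-west1": 2}},
]

before = route_request(clusters, "us-east1")
clusters[0]["healthy"] = False   # simulate losing the us-east1 cluster
after = route_request(clusters, "us-east1")
print(before, "->", after)
```

The client keeps calling the same function before and after the outage, which is the sense in which failover is transparent; what changes is latency and, briefly, the freshness of the data it reads.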
Cloud SQL and AlloyDB Replication
Cloud SQL supports cross-region read replicas that can be promoted to primary on disaster. The promotion is manual, takes a few minutes, and is one-way (the old primary cannot be re-attached as a replica afterward without rebuilding). AlloyDB offers similar cross-region replica support with stronger performance characteristics for analytical queries.
Common Pitfalls and Trade-offs
Disaster Recovery for Data Platforms looks straightforward on a slide deck and reveals its sharp edges in production. Here are the ones the PDE exam returns to repeatedly.
The first is treating replication as a backup. If a developer issues DROP TABLE on the primary, BigQuery cross-region replication faithfully drops it on the secondary moments later. You need genuine point-in-time recovery (BigQuery time travel, Spanner point-in-time recovery, GCS object versioning) to protect against logical corruption.
Reference: https://cloud.google.com/bigquery/docs/time-travel
A second classic is forgetting about the metadata. The data may be replicated, but if your Composer DAGs, Dataflow templates, IAM policies, service accounts, KMS keys, and VPC firewall rules only exist in the primary region, you cannot bring up a working platform in the secondary even when the data is there. Treat infrastructure-as-code as a first-class DR artifact.
A third is underestimating the cost of egress during a real failover. Pulling a 100 TB BigQuery dataset across regions in a hurry generates an egress bill that surprises everyone. Pre-replicate, do not pull-on-demand.
A fourth, which the exam tests via subtle wording: confusing zonal HA with regional DR. Spanner regional configurations survive zonal failures because Paxos quorum is across zones. They do not survive a regional outage. If the question stem mentions "the entire region became unavailable" then regional Spanner is wrong; you need a multi-region configuration.
A fifth is assuming Pub/Sub retention covers any duration of outage. The default subscription retention is 7 days and many teams never raise it. A regional outage lasting 8 days (rare but not unprecedented) plus an unaware on-call team equals permanent message loss.
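The arithmetic behind this pitfall is short enough to write down. The sketch below computes how much of the oldest backlog expires before a consumer returns; the function name is hypothetical and the 7-day default comes from the text.

```python
from datetime import timedelta

def backlog_lost(outage: timedelta,
                 retention: timedelta = timedelta(days=7)) -> timedelta:
    """Return the span of oldest unacknowledged messages that expire
    before the consumer comes back."""
    return max(outage - retention, timedelta(0))

# An 8-day consumer outage against the default 7-day retention.
print(backlog_lost(timedelta(days=8)))
# The same outage after raising retention to 14 days.
print(backlog_lost(timedelta(days=8), retention=timedelta(days=14)))
```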
Disaster Recovery for Data Platforms is not just a technology problem. The most common cause of failed recovery is a missing or out-of-date runbook, not a missing replica. Audit your runbooks every quarter and after every architecture change. Reference: https://cloud.google.com/architecture/dr-scenarios-planning-guide
Best Practices
A short checklist that consistently separates teams that recover from teams that do not.
- Define RTO and RPO per dataset, not per platform. A revenue ledger and a clickstream sandbox should not share the same DR tier.
- Store backups in a different project with separate IAM, so a compromised production project cannot delete your last good copy.
- Encrypt backups with customer-managed keys (CMEK) held in a key ring outside the production region.
- Pre-create the secondary region's network, IAM, KMS, and service accounts via Terraform. Provisioning these during an incident wastes the RTO budget.
- Run a partial DR drill quarterly and a full failover drill annually. Track the wall-clock time and treat regressions as bugs.
- Automate failover where possible (Spanner, multi-cluster Bigtable, multi-region GCS) and document the manual steps where it is not.
- Monitor replication lag explicitly with a Cloud Monitoring alert. A silent replication failure is far worse than a loud one.
- Version your runbooks in git alongside the infrastructure code they reference.
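For the replication-lag item in the checklist, the important detail is that a missing measurement must alert just as loudly as a high one. The sketch below shows that evaluation logic in plain Python; the thresholds and function name are assumptions, and a real deployment would express the same rule as a Cloud Monitoring alerting policy.

```python
from datetime import datetime, timedelta

def lag_alert(samples, now,
              max_lag=timedelta(minutes=5),
              max_silence=timedelta(minutes=10)):
    """Return an alert string, or None when replication looks healthy.

    samples: list of (observed_at, lag) tuples, oldest first.
    """
    if not samples or now - samples[-1][0] > max_silence:
        # No recent measurement at all: the silent-failure case.
        return "STALE: no recent replication lag measurement"
    latest_lag = samples[-1][1]
    if latest_lag > max_lag:
        return f"LAGGING: replication lag {latest_lag} exceeds {max_lag}"
    return None

now = datetime(2024, 1, 10, 3, 0)
healthy = [(now - timedelta(minutes=1), timedelta(minutes=2))]
lagging = [(now - timedelta(minutes=1), timedelta(minutes=9))]
silent = [(now - timedelta(minutes=30), timedelta(minutes=2))]

for name, samples in [("healthy", healthy), ("lagging", lagging), ("silent", silent)]:
    print(name, "->", lag_alert(samples, now))
```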
For PDE exam answers, anchor on three numbers: the RPO the scenario states, the RTO it states, and the cost constraint it implies. Most "which DR strategy" questions are decided by which of those three the candidate solution comfortably satisfies. Reference: https://cloud.google.com/architecture/dr-scenarios-for-data
Real-World Use Case
Consider a fintech company with about 400 employees that processes roughly 12 million card transactions per day. Their data platform on GCP has three layers, each with different DR requirements.
The transactional ledger lives in Spanner, multi-region nam3 configuration. Regulators require zero data loss and sub-minute recovery for funds movement. Spanner's synchronous Paxos replication satisfies RPO = 0 and the documented RTO is well under a minute, so this layer needs no additional design beyond using the multi-region instance.
The streaming risk-scoring pipeline ingests transaction events through Pub/Sub, processes them in Dataflow, and writes scored events to BigQuery. Pub/Sub retention is bumped to 14 days. Dataflow snapshots are taken every 30 minutes. BigQuery uses managed disaster recovery with the secondary in us-east1. The scoring service runs warm-standby in the secondary at 25 percent capacity, autoscaling on failover. Target RPO is 5 minutes, target RTO is 15 minutes, and quarterly drills consistently land at 11-13 minutes.
The analytics warehouse, used by finance and product teams, holds three years of history in BigQuery and uses cross-region dataset replication to a secondary region. There is no warm compute; on disaster, slot reservations are created on demand in the secondary. Target RPO is 1 hour, target RTO is 4 hours. Cost of this tier is roughly 60 percent of what an active-active warehouse would cost, and the business has explicitly accepted the trade-off.
Backups of legacy Oracle systems migrated to Compute Engine VMs use Backup and DR Service with the backup vault replicated to a second region. CMEK keys for the backup vault sit in a third region, ensuring no single-region event can destroy both data and keys.
Total monthly DR overhead is approximately 18 percent of the platform's compute and storage spend. The CFO signed off because the regulatory fine for losing transaction data exceeds the entire platform budget for a year.
Exam Tips
The PDE exam treats Disaster Recovery for Data Platforms as a recurring theme rather than a single domain. A handful of patterns will help you pick the intended answer quickly.
When the scenario states "RPO of zero" or "no data loss is acceptable," your shortlist is Spanner multi-region or another synchronously replicated design. Asynchronous tools (BigQuery cross-region replication, Datastream, Cloud SQL replicas, and dual-region or multi-region GCS replication, including turbo) cannot guarantee zero data loss and are wrong for that requirement.
When the scenario emphasises cost, look for backup-and-restore patterns: scheduled exports to dual-region GCS, on-demand BigQuery slot reservations, no warm compute. The exam will often offer a multi-region active-active distractor that is technically correct but ten times the budget the scenario allows.
When the scenario mentions logical corruption, ransomware, or accidental deletion, replication is a trap answer. Look for time-travel, point-in-time recovery, object versioning, or backups in a separately governed project.
When the scenario mentions on-prem or hybrid sources, Datastream or Backup and DR Service tend to be the right tools. The exam likes to test whether you remember that Datastream does CDC for relational sources specifically.
When the scenario mentions a streaming pipeline that must survive a region outage, the answer almost always involves Pub/Sub retention plus Dataflow snapshots. Be ready to recognise this combination from a single sentence in the stem.
If two answer options are technically correct but one is dramatically more expensive without a clear justification in the scenario, pick the cheaper one. Google's exam philosophy strongly rewards "right-sized" architecture for the stated requirements rather than gold-plating. Reference: https://cloud.google.com/architecture/dr-scenarios-planning-guide
Building a DR Runbook
A runbook is the artifact that turns Disaster Recovery for Data Platforms from theory into a thing a sleep-deprived engineer can execute at 3 AM. The PDE exam does not test runbook syntax, but it does test whether you know what belongs in one.
A working runbook contains, at minimum: the trigger conditions that count as a disaster, the named decision-maker authorised to declare it, the order in which services are failed over (databases first, then stateless compute, then traffic), the exact gcloud or Terraform commands for each step, the validation queries that confirm each step succeeded, and the communications template for stakeholders.
Each step should have a time budget. If the runbook says "promote the BigQuery replica" and the actual operation reliably takes 8 minutes, that 8 minutes is part of your RTO and should be measured during drills.
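A per-step time budget is easy to check mechanically. The sketch below sums hypothetical step budgets and compares the total with an RTO target; only the 8-minute replica promotion comes from the text, and the other steps and durations are made up for illustration.

```python
from datetime import timedelta

# Ordered as the runbook prescribes: databases, stateless compute, then traffic.
runbook = [
    ("declare disaster", timedelta(minutes=2)),
    ("promote the BigQuery replica", timedelta(minutes=8)),
    ("scale up stateless compute", timedelta(minutes=3)),
    ("shift traffic and validate", timedelta(minutes=2)),
]

def rto_budget(steps, rto_target):
    """Return (total budgeted time, headroom left against the RTO target)."""
    total = sum((budget for _, budget in steps), timedelta(0))
    return total, rto_target - total

total, headroom = rto_budget(runbook, rto_target=timedelta(minutes=15))
print("budgeted:", total, "headroom:", headroom)
```

A negative headroom means the plan cannot meet its RTO even if every step goes perfectly, which is worth knowing before a drill rather than during one.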
The runbook should also have a rollback plan. Sometimes you fail over, the original region recovers, and you want to fail back. The fail-back path is often more dangerous than the fail-over path because both regions now hold writes that may have diverged.
DR Drills and Validation
A DR plan decays the moment you stop exercising it. A team that ran a flawless drill in March may discover in October that a Terraform refactor in June removed the secondary KMS key, that a new Composer DAG was never deployed to the standby, or that the on-call rotation list has three people who have left the company.
A pragmatic drill cadence looks like this:
- Monthly tabletop: 30-minute walkthrough with on-call engineers reading the runbook aloud against current architecture diagrams. Catches documentation drift cheaply.
- Quarterly partial drill: actually fail over one non-critical service to the secondary, measure the time, then fail back. Catches script and IAM regressions.
- Annual full drill: declare a simulated disaster, fail over the entire platform during a low-traffic window, run for a few hours, fail back. Catches everything else.
- Game-day exercises: inject specific failures (revoke a KMS key, delete a service account) without warning the on-call team. Catches the human side.
Track three metrics across drills: actual RTO, actual RPO measured against the stated target, and the count of runbook steps that needed manual intervention or correction. Trends in those metrics tell you whether your DR posture is improving or rotting.
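Tracking those three metrics can be as simple as comparing each drill with the previous one. The sketch below assumes a hypothetical `DrillResult` record and made-up drill numbers; it only flags metrics that got worse.

```python
from dataclasses import dataclass

@dataclass
class DrillResult:
    label: str
    rto_minutes: float
    rpo_minutes: float
    manual_fixes: int   # runbook steps needing manual intervention or correction

def regressions(previous: DrillResult, current: DrillResult) -> list[str]:
    """Name each tracked metric that got worse since the previous drill."""
    worse = []
    for metric in ("rto_minutes", "rpo_minutes", "manual_fixes"):
        if getattr(current, metric) > getattr(previous, metric):
            worse.append(metric)
    return worse

# Hypothetical results from two consecutive quarterly drills.
q1 = DrillResult("Q1", rto_minutes=11.0, rpo_minutes=4.0, manual_fixes=1)
q2 = DrillResult("Q2", rto_minutes=13.0, rpo_minutes=4.0, manual_fixes=3)
print(regressions(q1, q2))
```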
Frequently Asked Questions (FAQ)
What is the difference between high availability and disaster recovery on GCP?
High availability protects against component or zonal failure within a single region; disaster recovery protects against the loss of an entire region or against logical events like data corruption. Spanner regional configurations give you HA across zones. Spanner multi-region configurations give you DR across regions. Both matter, and most production data platforms need both layers explicitly.
Does BigQuery automatically replicate data across regions?
No. A standard BigQuery dataset lives in the region or multi-region you created it in, and Google replicates it across zones within that location for HA, not across regions for DR. To get cross-region durability you must opt in via cross-region dataset replication, BigQuery managed disaster recovery, or scheduled cross-region copies through Data Transfer Service.
What RPO does Cloud Storage turbo replication actually deliver?
Turbo replication on dual-region GCS targets a 15-minute RPO for 100 percent of newly written objects. Default dual-region replication is designed to replicate 99.9 percent of newly written objects within a target of one hour and 100 percent within a target of 12 hours, so individual objects may take hours to propagate. For analytics landing zones where freshness is part of the value proposition, turbo replication is usually worth the price.
Can Pub/Sub messages survive a regional outage?
Yes, with caveats. Pub/Sub itself is a global service whose storage is replicated across zones (and across regions for global topics), so the messaging layer survives most regional events. The risk is at the subscriber side: if your Dataflow or Cloud Run consumer in region A goes down, messages accumulate up to your subscription retention window (default 7 days, max 31 days). Raise the retention if your DR plan needs more headroom.
How do I test a DR plan without affecting production?
Several patterns work. You can fail over a non-critical service end-to-end during a maintenance window. You can spin up an isolated test project that mirrors production and run the runbook against it. You can use Backup and DR Service's instant mount feature to attach a historical backup to a test instance without disturbing the source. The least useful pattern is reading the runbook aloud and declaring it tested; that catches almost nothing.
Is multi-region active-active always better than warm standby?
No. Active-active drives RTO and RPO toward zero, but it doubles your steady-state cost, complicates application logic (you need to handle multi-master writes or accept eventual consistency in some paths), and increases blast radius for software bugs because they deploy to both regions simultaneously. For systems where 10 minutes of recovery is acceptable, warm standby is usually the better engineering choice.
What happens to in-flight Dataflow jobs during a regional failure?
A streaming Dataflow job pinned to a single region stops processing if that region becomes unavailable. If you took periodic snapshots and your source (typically Pub/Sub) retains messages, you can launch a new job in another region from the most recent snapshot and replay the gap. Without snapshots, you would replay from the start of the Pub/Sub retention window, which can be expensive and produce duplicates that downstream systems must dedupe.
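The deduplication that a replay forces on downstream systems can be sketched as a filter keyed on message ID. The function name and sample messages below are hypothetical; in practice the set of processed IDs would live in a durable store rather than in memory.

```python
def dedupe_replay(messages, already_processed_ids):
    """Yield replayed messages whose IDs have not been processed yet.

    messages: iterable of (message_id, payload) in replay order.
    """
    seen = set(already_processed_ids)
    for msg_id, payload in messages:
        if msg_id in seen:
            continue   # processed before the outage, or repeated in the replay
        seen.add(msg_id)
        yield msg_id, payload

replayed = [("m1", "a"), ("m2", "b"), ("m2", "b"), ("m3", "c")]
fresh = list(dedupe_replay(replayed, already_processed_ids={"m1"}))
print(fresh)
```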
Do I need DR for fully managed services like BigQuery and Spanner?
Yes for region failure scenarios, no for component failure scenarios. The managed services handle component and zonal failures transparently, but a region-level event still affects single-region BigQuery datasets and single-region Spanner instances. The PDE exam consistently expects candidates to design for regional failure even when using managed services.
Related Topics
- Cloud Storage Data Lake Design
- Cloud Spanner High Availability Design
- BigQuery Data Modeling and Clustering
- Cost Optimization Architectures
- Data Sovereignty and Compliance Design
Further Reading
- Disaster recovery planning guide — Google Cloud's canonical DR framework, including the tier definitions referenced throughout this note.
- Disaster recovery scenarios for data — Service-by-service guidance covering BigQuery, Spanner, Bigtable, Cloud SQL, and GCS.
- BigQuery managed disaster recovery — Reference for the failover reservation feature.
- Spanner instance configurations — Definitive list of regional and multi-region configurations and their characteristics.
- Backup and DR Service overview — Documentation for the managed backup product and its supported workloads.