examlab .net The most efficient path to the most valuable certifications.

Designing for HA and DR

5,400 words · ≈ 27 min read

Professional Cloud Architect deep dive into HA and DR design on Google Cloud: RTO, RPO, multi-regional strategies, failover patterns, and chaos engineering.


The Fundamentals of Resilience: HA vs. DR

In the world of cloud architecture, High Availability (HA) and Disaster Recovery (DR) are two sides of the same coin, but they serve different purposes.

  • High Availability (HA): Focuses on keeping the system running during "normal" local failures (e.g., a single VM crashing or a single zone having power issues). HA is measured in Uptime percentage (e.g., 99.99%).
  • Disaster Recovery (DR): Focuses on restoring the system after a catastrophic event (e.g., an entire region going offline due to a natural disaster). DR is measured in RTO and RPO.
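An uptime percentage is easier to reason about as a yearly downtime budget. The sketch below is plain arithmetic, assuming a 365-day year; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Yearly downtime budget implied by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} min/year")
```

At 99.99% the budget is about 52.6 minutes per year, and at 99.999% about 5.3 minutes, which is why each extra nine usually requires another layer of redundancy.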

For the GCP PCA exam, you must be able to design architectures that meet specific RTO and RPO targets while staying within budget. The "Optimal" solution often involves a trade-off between cost and the speed of recovery.

Recovery Time Objective (RTO): The maximum acceptable amount of time that a system can be offline after a disaster. "How long does it take to get back up?" Reference: https://cloud.google.com/architecture/disaster-recovery#rto_and_rpo

Recovery Point Objective (RPO): The maximum acceptable amount of data loss measured in time. "How much data can we afford to lose?" Reference: https://cloud.google.com/architecture/disaster-recovery#rto_and_rpo
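Both objectives can be measured after the fact from an incident timeline. The following is a minimal sketch with made-up timestamps and hypothetical targets (24-hour RPO, 4-hour RTO); it only shows which interval each objective refers to.

```python
from datetime import datetime, timedelta


def measure_recovery(last_good_copy: datetime, failure: datetime, restored: datetime):
    """Return (achieved_rpo, achieved_rto) for one incident."""
    achieved_rpo = failure - last_good_copy  # writes in this window are lost
    achieved_rto = restored - failure        # time the service was unavailable
    return achieved_rpo, achieved_rto


rpo, rto = measure_recovery(
    last_good_copy=datetime(2024, 1, 1, 2, 0),   # last backup or replicated write
    failure=datetime(2024, 1, 1, 9, 30),
    restored=datetime(2024, 1, 1, 13, 0),
)
meets_targets = rpo <= timedelta(hours=24) and rto <= timedelta(hours=4)
print(rpo, rto, meets_targets)
```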


Plain-Language Explanation: HA and DR Design

Designing for resilience is like preparing a city for different levels of emergencies.

Analogy 1 — The Hospital's Backup Power System (HA)

Think of HA as a hospital's internal backup power system. If a fuse blows in the surgery wing (a local failure), the hospital has immediate backups (Redundancy) to keep the lights on without the doctors even noticing. HA is about the small, expected failures that shouldn't stop the mission.

Analogy 2 — The Emergency Relocation Plan (DR)

Think of DR as a city's plan to move its entire operation to a different city in case of a massive flood. This is a rare, catastrophic event. You can't prevent the flood, but you have a plan: "If City A is underwater, we open our backup offices in City B." You might lose a few hours of work (RTO) and some recent paperwork (RPO) during the move, but the city continues to function.

Analogy 3 — The Acrobat's Safety Net and Harness

HA is like the safety harness an acrobat wears—it prevents them from falling in the first place. DR is the safety net at the bottom—it doesn't stop the fall, but it prevents the fall from being fatal. A professional architect uses both: a harness for every trick (HA) and a big net just in case the harness snaps (DR).

On the PCA exam, if the target is RTO < 1 minute, you almost certainly need an Active-Active or Hot Standby DR pattern. Cold backups will not meet this requirement. Reference: https://cloud.google.com/architecture/disaster-recovery-patterns


Defining DR Patterns: Cold, Warm, and Hot

The cost and complexity of a DR solution increase as the RTO/RPO requirements become stricter.

Cold Standby (Backup and Restore)

  • Mechanism: Periodically back up data to Cloud Storage. In a disaster, provision new infrastructure and restore the data.
  • RTO: Hours or days.
  • RPO: Hours (depending on backup frequency).
  • Cost: Lowest.
  • Use Case: Non-critical internal applications.

Warm Standby

  • Mechanism: A scaled-down version of the application is always running in a secondary region. Data is continuously replicated (e.g., Cloud SQL Cross-region replicas).
  • RTO: Minutes.
  • RPO: Minutes or seconds.
  • Cost: Medium.
  • Use Case: Core business applications that can tolerate a short outage.

Hot Site (Active-Active)

  • Mechanism: Full-scale infrastructure is running in two or more regions simultaneously. Traffic is split using Global Load Balancing.
  • RTO: Near zero.
  • RPO: Near zero (using globally consistent databases like Spanner).
  • Cost: Highest.
  • Use Case: Critical consumer-facing platforms (e.g., e-commerce checkout, banking).

Technical Implementation on GCP

Multi-Regional vs. Multi-Zonal Design

  • Multi-Zonal: Protects against a single zone failure. Use regional Managed Instance Groups (MIGs), which spread instances across multiple zones in a region (the default target distribution shape is EVEN).
  • Multi-Regional: Protects against a whole region failure. This is the requirement for true DR.

Global Load Balancing

The Global External Application Load Balancer is a critical DR tool. It uses a single Anycast IP. If a region fails, the load balancer automatically stops sending traffic to that region's backend and reroutes it to the next healthy region. This happens at the edge, providing near-instant failover for the user.
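The routing decision can be pictured as "nearest healthy region wins". The sketch below is a toy model, not the load balancer's actual algorithm (which also weighs backend capacity); the client locations and the proximity ordering are hypothetical.

```python
from typing import Optional, Set

# Hypothetical proximity ordering per client location (nearest region first).
PROXIMITY = {
    "paris": ["europe-west1", "us-east1", "asia-northeast1"],
    "tokyo": ["asia-northeast1", "us-east1", "europe-west1"],
}


def route(client: str, healthy: Set[str]) -> Optional[str]:
    """Return the nearest region whose backends pass health checks."""
    for region in PROXIMITY[client]:
        if region in healthy:
            return region
    return None  # no healthy backend anywhere


print(route("paris", {"europe-west1", "us-east1"}))  # normal operation
print(route("paris", {"us-east1"}))                  # europe-west1 is down
```

The client keeps using the same Anycast IP in both cases; only the backend selection changes.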

Data Replication and Synchronization

  • Cloud Spanner: The "Optimal" choice for global HA. Multi-region configurations carry a 99.999% availability SLA and provide strong consistency across regions.
  • Cloud SQL: Use Cross-region replicas. Note that failover to a replica is usually a manual or scripted process, not automatic.
  • Cloud Storage: Use Multi-regional buckets for automatic geographic redundancy.

RTO/RPO Tier Mapping to GCP Services

Translating business-defined recovery objectives into concrete GCP service choices is the most common PCA scenario question. Use this tier model as a decision framework.

Tier 1 — Hot Standby (RTO seconds, RPO ≈ 0)

Reserved for tier-0 systems: payment processing, identity, real-time inventory.

  • Compute: Regional MIGs in two or more regions, fronted by a Global External Application Load Balancer. Both regions serve live traffic.
  • Database: Cloud Spanner multi-region instance (e.g., nam-eur-asia1 or nam3) — synchronous Paxos quorum across regions, RPO of zero for committed writes.
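  • Storage: Dual-region GCS buckets with turbo replication enabled, which carries a 15-minute RPO across the two regions (turbo replication is not offered on multi-region buckets).
  • Cache: Memorystore for Redis Standard Tier deployed in each serving region. Standard Tier replicas fail over across zones within a region, not across regions, so treat each regional cache as rebuildable.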

Tier 2 — Warm Standby (RTO 5-30 min, RPO 1-15 min)

Common for core business apps that can tolerate brief degradation.

  • Compute: Production MIG in region A; scaled-down MIG (1-2 instances) in region B. Auto-scale on failover.
  • Database: Cloud SQL with cross-region read replicas. Promotion to primary is manual but completes in minutes.
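  • Storage: Standard multi-region buckets without turbo replication. Default replication is asynchronous with no RPO guarantee: most objects replicate within about an hour, and all within a 12-hour target.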

Tier 3 — Cold Standby (RTO hours, RPO 1-24 hours)

For internal tools, batch jobs, reporting workloads.

  • Compute: Terraform/Deployment Manager templates stored in Cloud Source Repositories. No running infrastructure in DR region.
  • Database: Backup and DR Service scheduled exports to a different region; restore on demand.
  • Storage: Nearline or Coldline buckets with cross-region replication on critical objects only.
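The tier model can be read as a lookup: pick the cheapest tier whose ceilings fit inside the business targets. The sketch below assumes the upper bound of each range above as the ceiling, reads "seconds" as 60 s and "hours" as 24 h, and uses a made-up cost ranking; none of these numbers are official.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    rto_s: int      # assumed worst-case recovery time for the tier
    rpo_s: int      # assumed worst-case data-loss window for the tier
    cost_rank: int  # 1 = cheapest (illustrative ordering only)


TIERS = [
    Tier("cold standby", rto_s=24 * 3600, rpo_s=24 * 3600, cost_rank=1),
    Tier("warm standby", rto_s=30 * 60, rpo_s=15 * 60, cost_rank=2),
    Tier("hot standby", rto_s=60, rpo_s=0, cost_rank=3),
]


def cheapest_tier(rto_target_s: int, rpo_target_s: int) -> Tier:
    """Return the lowest-cost tier that satisfies both targets."""
    fits = [t for t in TIERS if t.rto_s <= rto_target_s and t.rpo_s <= rpo_target_s]
    if not fits:
        raise ValueError("no tier meets these targets")
    return min(fits, key=lambda t: t.cost_rank)


print(cheapest_tier(rto_target_s=3600, rpo_target_s=3600).name)
```

With a one-hour RTO and RPO, cold standby is excluded and warm standby is the cheapest fit, which mirrors the "stay within budget" reasoning the exam expects.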

Memorize the RTO/RPO ceiling for each Cloud SQL replication topology: HA configuration (regional) = RPO 0 / RTO ~60 seconds within one region; cross-region read replica = RPO seconds (async) / RTO minutes (manual promotion); scheduled backups only = RPO up to 24h / RTO hours. Among managed relational databases, Spanner multi-region is the option offering true RPO 0 across regions. Reference: https://cloud.google.com/sql/docs/mysql/intro-to-cloud-sql-disaster-recovery


Cloud SQL Cross-Region Replica Failover Mechanics

Cloud SQL's cross-region read replica is the workhorse of warm-standby DR for MySQL, PostgreSQL, and SQL Server workloads. Understanding its mechanics is mandatory for PCA scenario questions.

Replication Topology

  • The primary instance lives in region A with HA enabled (synchronous replication to a standby in a second zone within region A — RPO 0 for zonal failures).
  • One or more cross-region read replicas live in region B, replicating asynchronously via binary log (MySQL), WAL streaming (PostgreSQL), or read-scale availability groups (SQL Server).
  • Replication lag is typically 1-5 seconds but can grow under heavy write load. Always monitor replica_lag in Cloud Monitoring.
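Because replication is asynchronous, the lag at the moment of failure is the data you lose. A rough way to size that exposure is to look at a high percentile of recent lag samples rather than the average. The sketch below assumes the samples have already been exported from Cloud Monitoring; the values and the nearest-rank percentile helper are illustrative.

```python
def rpo_exposure_seconds(lag_samples, percentile=0.99):
    """Nearest-rank percentile of replica lag, used as an RPO estimate."""
    if not lag_samples:
        raise ValueError("need at least one lag sample")
    ordered = sorted(lag_samples)
    index = min(len(ordered) - 1, int(percentile * len(ordered)))
    return ordered[index]


# Hypothetical replica_lag samples in seconds; one spike under write load.
samples = [1, 2, 2, 3, 5, 4, 45, 2, 3, 1]
print("median lag:", rpo_exposure_seconds(samples, 0.5))
print("p99 lag:", rpo_exposure_seconds(samples))
```

A median of a few seconds can hide a spike that is far above the RPO target, so alert on the tail, not the typical value.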

Promotion Procedure

When region A fails, you must manually promote the cross-region replica:

gcloud sql instances promote-replica my-replica-region-b

After promotion:

  1. The replica becomes a standalone primary — replication from the old primary is severed and cannot be re-established.
  2. Application connection strings must repoint (use Cloud SQL Auth Proxy or update the connection name in Secret Manager).
  3. Configure a new HA standby and new cross-region replica from the promoted instance — DR posture is degraded until this is done.
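The promotion steps can be tracked as a small runbook so that the degraded DR posture stays visible until every step is done. This is a sketch only: the step names are hypothetical, and the only real command it builds is the promote-replica call shown above (it prints the command rather than running it).

```python
import shlex

# Hypothetical step identifiers mirroring the three post-failure actions.
STEPS = ("promote", "repoint_clients", "rebuild_ha_and_replica")


def promote_command(replica_name: str) -> str:
    """Build the gcloud promotion command as a shell-safe string."""
    return shlex.join(["gcloud", "sql", "instances", "promote-replica", replica_name])


def remaining_steps(done):
    """Steps still open; DR posture is degraded while this list is non-empty."""
    return [step for step in STEPS if step not in done]


print(promote_command("my-replica-region-b"))
print(remaining_steps({"promote"}))
```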

Caveats

  • Promotion is irreversible. If region A recovers, you cannot "demote" the new primary back.
  • The replica does not inherit user-defined backups; configure backup schedules immediately after promotion.
  • For read-write traffic from both regions, Cloud SQL is the wrong choice — use Spanner.

A common exam distractor presents Cloud SQL cross-region replicas as offering automatic regional failover. They do not. Failover is manual (or scripted via custom automation). If the question requires hands-off cross-region failover with RPO 0, the correct answer is Cloud Spanner multi-region, not Cloud SQL. Reference: https://cloud.google.com/sql/docs/mysql/replication/cross-region-replicas


DNS, Load Balancer, and MIG-Level Failover

Failover is a layered concern. Different GCP primitives handle different blast radii — knowing which to apply where is core PCA material.

Cloud DNS Failover Routing Policy

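Cloud DNS supports a failover routing policy that returns the primary IP while its health check passes and switches to a backup record set otherwise. A sketch of the configuration follows; the zone name and the backup address are placeholders, and the health-check flags differ depending on whether the targets are load balancer forwarding rules or external IPs, so verify them against the current gcloud reference:

gcloud dns record-sets create api.example.com. \
  --zone=my-zone \
  --type=A --ttl=60 \
  --routing-policy-type=FAILOVER \
  --routing-policy-primary-data=34.120.0.10 \
  --routing-policy-backup-data-type=GEO \
  --routing-policy-backup-data="us-east1=34.120.0.20" \
  --enable-health-checking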

Use this when you have separate IPs per region (e.g., regional TCP proxy LBs) or for hybrid setups bridging on-prem and GCP. Caveat: DNS-based failover is bounded by TTL plus resolver caching — a 60-second TTL realistically yields 2-5 minute client-side failover.
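The 2-5 minute figure can be approximated by adding up the stages a client has to wait through. The sketch below is a back-of-the-envelope model; the health-check interval, the unhealthy threshold, and the resolver overrun are assumed values, not Cloud DNS defaults.

```python
def dns_failover_window_s(check_interval_s, unhealthy_threshold, ttl_s, resolver_overrun_s=0):
    """Worst-case seconds before a client resolves the backup address."""
    detection = check_interval_s * unhealthy_threshold  # time to mark primary unhealthy
    return detection + ttl_s + resolver_overrun_s       # plus cached answers expiring


# Assumed: 10 s probes, 3 failures to trip, 60 s TTL,
# and resolvers that hold records about 2 minutes past the TTL.
window = dns_failover_window_s(10, 3, 60, resolver_overrun_s=120)
print(f"{window} s (~{window / 60:.1f} min)")
```

Lowering the TTL shrinks only one term; detection time and misbehaving resolvers set a floor that DNS failover cannot go below.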

Global Application Load Balancer Cross-Region Backends

The Global External Application Load Balancer attaches multiple regional backend services to a single Anycast frontend IP. Health checks at the edge automatically drain traffic from a failing region with no DNS change required — typically within 30 seconds. This is the preferred mechanism for sub-minute RTO on HTTP(S) workloads.

Regional MIG with Auto-Healing

A regional Managed Instance Group spreads instances across three zones in a region by default. Auto-healing uses a health check to recreate unhealthy VMs:

  • Self-healing handles VM-level and zone-level failures within a single region.
  • It does not protect against region-wide outages — that requires multiple MIGs in different regions stitched together via the global LB.
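Surviving a zone failure at full load means the remaining zones must already hold enough instances. The sketch below assumes instances are spread evenly across zones and that the autoscaler cannot add capacity fast enough to matter during the outage; both are simplifications.

```python
import math


def instances_needed(peak_instances: int, zones: int = 3, zones_lost: int = 1) -> int:
    """Total MIG size so the surviving zones still cover peak load."""
    surviving = zones - zones_lost
    if surviving <= 0:
        raise ValueError("at least one zone must survive")
    per_zone = math.ceil(peak_instances / surviving)
    return per_zone * zones


print(instances_needed(12))  # 12 at peak across 3 zones, tolerate 1 zone lost
```

For a peak of 12 instances the group needs 18, that is, 150% of peak, which is the hidden cost of zone-level HA.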

For sub-minute regional failover on HTTP/HTTPS workloads, always pair a Global External Application Load Balancer with regional MIGs (or Cloud Run services) in two or more regions. Cloud DNS failover is a fallback for non-HTTP protocols or hybrid scenarios — it is slower and constrained by TTL. Reference: https://cloud.google.com/load-balancing/docs/https


Backup and DR Service for Multi-Region Recovery

The Google Cloud Backup and DR Service (formerly Actifio GO) is the managed backup plane for Compute Engine VMs, VMware Engine workloads, databases (Cloud SQL, SAP HANA, Oracle), and file systems.

Multi-Region Backup Vaults

  • Backup data is stored in a backup/recovery appliance plus a backup vault in Cloud Storage.
  • Configure the vault region independently from the source workload region. Cross-region vaulting is essential for protecting against regional outage — if both source and backup are in us-central1, a regional disaster destroys both.
  • Immutable backup vaults with a minimum enforced retention protect against ransomware that holds delete privileges.
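The co-location risk is easy to check mechanically. The sketch below assumes a backup inventory has been exported into plain dictionaries; the field names and workloads are hypothetical, not a Backup and DR Service API.

```python
def colocated_backups(plans):
    """Return workloads whose backup vault shares a region with the source."""
    return [p["workload"] for p in plans if p["source_region"] == p["vault_region"]]


plans = [
    {"workload": "billing-db", "source_region": "us-central1", "vault_region": "us-central1"},
    {"workload": "web-vm", "source_region": "us-central1", "vault_region": "us-east1"},
]
print(colocated_backups(plans))  # workloads a regional outage would leave unrecoverable
```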

Recovery Patterns

  • Mount-based recovery: Instantly mount a backup image as a new VM disk — useful for forensics or partial recovery (RTO minutes, no full restore needed).
  • Full restore to alternate region: Re-hydrate a VM or database in a DR region from the backup vault. This is the canonical cold-standby pattern.
  • Application-consistent backups: Pre/post scripts quiesce databases before snapshot — required for transactional integrity.

Replication Caveats for Multi-Region Buckets

Standard multi-region and dual-region GCS buckets replicate object data asynchronously across regions. Three consequences:

  • Reads and listings are strongly consistent, so a successful write is immediately visible to every client. The exposure is in the window before replication completes: if the region that accepted the write fails first, that object can be unavailable until the region recovers.
  • Default replication has no RPO guarantee. Turbo replication (available on dual-region buckets, e.g., nam4, eur5, asia1) guarantees a 15-minute RPO by SLA, with most writes replicated much sooner.
  • An object appearing in a listing (gs://bucket?prefix=) does not prove it has been replicated to the second region. Never treat a listing as a replication-complete signal.

For DR plans that depend on backup objects surviving a regional outage shortly after they are written, use turbo-replicated dual-region buckets, or add an explicit gsutil rsync step to the backup pipeline that copies the objects to a bucket in the recovery region.
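The replication window translates directly into a set of objects that may be missing from the recovery region at failure time. The sketch below assumes you have object write timestamps (for example from a backup manifest); the object names and times are made up, and the 15-minute window is the turbo replication RPO mentioned above.

```python
from datetime import datetime, timedelta

TURBO_RPO = timedelta(minutes=15)


def at_risk(objects, failure_time, window=TURBO_RPO):
    """Names of objects written inside the replication window before the failure."""
    cutoff = failure_time - window
    return sorted(name for name, written in objects.items()
                  if cutoff < written <= failure_time)


objects = {
    "backup-1140.tar": datetime(2024, 1, 1, 11, 40),
    "backup-1150.tar": datetime(2024, 1, 1, 11, 50),
    "backup-1158.tar": datetime(2024, 1, 1, 11, 58),
}
print(at_risk(objects, failure_time=datetime(2024, 1, 1, 12, 0)))
```

Passing a 12-hour window instead models a bucket on default replication, where every recent backup would have to be treated as potentially missing.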


Chaos Engineering and Resilience Testing

Designing for DR is useless if you don't test it.

  • Game Days: Scheduled events where teams simulate a disaster (e.g., "Deleting the production database" in a staging environment) to test the recovery playbook.
  • Fault Injection: Deliberately injecting failures (like shutting down a random zone) to see how the system reacts.
  • Validation: Always validate that restored data is accurate and that the application is fully functional after a failover.
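One simple way to validate restored data during a game day is to compare an order-independent digest of the source and the restored copy. The sketch below assumes rows can be loaded into memory as tuples of comparable values; for large tables you would digest per partition instead.

```python
import hashlib


def dataset_digest(rows):
    """SHA-256 over rows in sorted order, so row order does not matter."""
    digest = hashlib.sha256()
    for row in sorted(rows):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


source = [(1, "alice", 120), (2, "bob", 75)]
restored = [(2, "bob", 75), (1, "alice", 120)]   # same rows, different order
corrupted = [(1, "alice", 120), (2, "bob", 0)]   # one value lost in the restore

print(dataset_digest(source) == dataset_digest(restored))
print(dataset_digest(source) == dataset_digest(corrupted))
```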

For the PCA exam, remember that Disaster Recovery is a process, not just a technology. A complete answer includes technical redundancy, documented playbooks, and regular testing. Reference: https://cloud.google.com/architecture/framework/reliability


Summary of Optimal vs. Viable Decisions in HA/DR

Requirement | Viable Solution (Good) | Optimal Solution (Architect-level)
Regional Failover | Manual DNS update | Global Load Balancing (automatic failover)
Database HA | Nightly backups to GCS | Cloud Spanner or Cloud SQL HA (regional) configuration
RTO < 5 Minutes | Warm Standby (scaled down) | Active-Active (multi-regional)
Data Integrity | Eventual consistency | Strong consistency (Spanner)
DR Testing | Annual paper audit | Quarterly Game Days with Fault Injection

FAQ — High Availability and Disaster Recovery

Q1. Can I achieve 100% uptime?

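No. 99.999% (five nines, roughly five minutes of downtime per year) is the highest availability SLA that managed cloud services commonly offer, for example Spanner multi-region configurations. There is always a nonzero probability of failure.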
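The arithmetic behind this answer: components in series multiply their availabilities, while redundant copies multiply their failure probabilities. The sketch below assumes failures are independent, which real regions and services only approximate, and the input figures are illustrative rather than published SLAs.

```python
def serial(*availabilities):
    """Availability of a chain where every component must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def redundant(availability, copies):
    """Availability when any one of `copies` independent replicas suffices."""
    return 1 - (1 - availability) ** copies


two_regions = redundant(0.999, 2)          # two regional stacks at 99.9% each
end_to_end = serial(0.9999, two_regions)   # behind a 99.99% front end
print(f"{two_regions:.6f} {end_to_end:.6f}")
```

Redundancy pushes the number toward 1 but never reaches it, and every serial dependency pulls it back down.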

Q2. What is the biggest challenge in Multi-regional DR?

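Data consistency. Synchronizing data across thousands of miles is bounded by the speed of light, so every synchronous cross-region commit pays a latency cost. Cloud Spanner addresses this with TrueTime, a clock service backed by GPS receivers and atomic clocks, which lets it order transactions consistently across regions.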

Q3. Is HA VPN "High Availability" for the connection?

Yes. An HA VPN setup in GCP consists of two tunnels on two separate interfaces, ensuring a 99.99% SLA for the hybrid connection.

Q4. Does serverless (Cloud Run) have built-in DR?

Cloud Run is regional. To achieve multi-regional DR for Cloud Run, you must deploy the service to multiple regions and put a Global Load Balancer in front of them.

Q5. What is "Self-healing"?

Self-healing is a feature of MIGs where the infrastructure automatically recreates a VM instance if it fails a health check, without any human intervention.


Final Architect Tip

On the PCA exam, when you see a requirement for "Business Continuity" or "Zero Downtime," your brain should immediately go to Multi-regional Active-Active and Global Load Balancing. Always evaluate the RTO/RPO targets first—they will tell you which DR pattern is the "Optimal" choice.
