Cloud Spanner High Availability Design

Introduction to Cloud Spanner High Availability Design

Cloud Spanner High Availability Design sits at the intersection of two things relational databases historically refused to mix: strong consistency and global scale. Google built Spanner so a single SQL statement could touch rows on three continents and still return an answer that obeys ACID. For the GCP Professional Data Engineer exam, you need to know how that promise is actually delivered, what it costs in latency, and which knobs to turn when a region falls off the map.

This note walks through replica roles, instance configuration choices, TrueTime, failover behavior, and the schema patterns that keep availability from collapsing under hot keys. Every section ties back to decisions you make at design time, because Cloud Spanner High Availability Design is mostly a set of upfront choices that you cannot easily reverse later.

Plain English Explanation

Before the technical drilling, three analogies that map cleanly onto Spanner's replica model.

The Restaurant Kitchen with Three Branches

Picture a restaurant chain with branches in Taipei, Tokyo, and Singapore. Each branch keeps a copy of the master recipe book. When the head chef changes a recipe, every branch must agree on the new version before anyone serves it, otherwise customers get inconsistent dishes.

In a regional Cloud Spanner instance, three branches sit in the same city, separated by streets (zones). A multi-regional instance puts them in different cities. A read-only replica is a branch that can read the recipe book to take orders, but cannot edit it. A witness replica is a branch with no kitchen at all; it only votes on which version of the recipe is official, so the chain can keep running if one full kitchen burns down.

This is exactly how Cloud Spanner High Availability Design tolerates failure. Read-write replicas hold the data and serve writes, read-only replicas absorb local read traffic, and witnesses give you a quorum without paying for a third full copy.

The Stock Exchange Trading Floor

A stock exchange must guarantee that if Alice's sell order is recorded at 09:30:01 and Bob's buy order at 09:30:02, every trader on the floor agrees on that order. If two clocks disagree, the market collapses into arbitration disputes.

TrueTime is Google's solution. Every data center has GPS receivers and atomic clocks, and the TrueTime API never returns "the current time" as a single number. It returns an interval like "between 09:30:01.000 and 09:30:01.007", and Spanner waits out the uncertainty before committing. That waiting period is why external consistency works globally without a central coordinator. Cloud Spanner High Availability Design leans on this clock infrastructure, and you cannot get the same property by bolting a coordinator onto a regular SQL database.

The Library Card Catalog with Branch Holdings

A university library has a main branch and ten satellite libraries. The card catalog is master at the main library, but each satellite keeps an updated index so students can browse without a network round-trip to main. When a book gets re-cataloged, main updates first, then the change propagates.

In Spanner, the leader replica is main. It serializes writes. Read-only replicas are the satellite catalogs that serve local students fast. Cloud Spanner High Availability Design lets you place leaders close to your write traffic and read-only replicas close to your read traffic. If the main library closes for renovation, one of the read-write satellites gets promoted to leader within seconds, and students barely notice. That is the failover model.

Core Concepts of Cloud Spanner High Availability Design

A handful of terms recur across every Spanner availability discussion, and the exam will assume you know them cold.

Replica: A physical copy of the database stored in a single zone. Spanner replicates at the split level, not the row level, and every replica participates in Paxos for the splits it holds. See https://cloud.google.com/spanner/docs/replication

Leader replica: The replica that serializes writes for a given split. There is exactly one leader per split at any moment, and Paxos elects a new one if the current leader becomes unreachable. Leader placement drives write latency. Reference: https://cloud.google.com/spanner/docs/replication

Read-only replica: A replica that serves stale or strong reads but does not vote in write quorums. Useful for absorbing analytical reads close to users in distant regions. Reference: https://cloud.google.com/spanner/docs/replication

Witness replica: A replica that votes in Paxos but stores no data. It exists to break ties and form quorums without the cost of a third full data copy. See https://cloud.google.com/spanner/docs/instance-configurations

External consistency: A guarantee stronger than serializability: the order of transactions matches real wall-clock order observable from outside the system. Spanner is the only commercial database offering this property at global scale. See https://cloud.google.com/spanner/docs/true-time-external-consistency

The last one matters more than people realize. Linearizability says reads see the most recent write. External consistency says the system's transaction order matches what a human with a stopwatch would observe. Cloud Spanner High Availability Design uses TrueTime commit-wait to enforce this without a global lock.

Architecture & Design Patterns

A Spanner instance is a collection of nodes (or processing units, the smaller billing unit) deployed into an instance configuration. The instance configuration determines geography. Inside the instance you create databases, and each database is split automatically based on row volume and access patterns.

Splits are the unit of replication. A split is roughly a contiguous range of primary key values, and Spanner can split or merge ranges on its own as load shifts. Each split has its own Paxos group: a leader, some number of read-write voters, optional read-only replicas, and optional witnesses. Cloud Spanner High Availability Design therefore happens at the split granularity, not at the database or table level.

Three configuration archetypes cover most scenarios.

A regional configuration places three read-write replicas in three zones of one region. The SLA is 99.99 percent. Write latency is single-digit milliseconds because all voters are nearby. The instance survives a single zone failure with no data loss and minimal interruption. This is the default choice for OLTP workloads serving one geography.

A multi-region configuration with two read-write regions plus a witness (for example nam3) places read-write replicas in two regions and a witness replica in a third. The SLA rises to 99.999 percent, and the instance survives a full region outage. Write latency rises because the Paxos quorum now requires a cross-region acknowledgment. This topology is the sweet spot for finance and healthcare workloads that need regional disaster tolerance.

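A larger multi-region configuration that adds read-only regions on other continents (for example nam-eur-asia1, which keeps its read-write and witness replicas in North America and places read-only replicas in Europe and Asia) gives the maximum geographic spread. Read latency from any of those continents is low because a replica is nearby, especially for stale reads. Write latency is the highest for clients outside the leader's continent, because every write has to cross an ocean to reach the leader. The exam often tests whether you can match these trade-offs to a stated workload.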
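The gap between the two SLA figures is easier to judge as a downtime budget. The sketch below is plain arithmetic, not a Spanner API, and it assumes the SLA percentage can be read as an availability figure over a 365-day year, which is a simplification of how the SLA is actually measured. Under that assumption, 99.99 percent allows roughly 52.6 minutes of downtime per year and 99.999 percent allows roughly 5.3 minutes.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(sla_percent: float) -> float:
    """Minutes per year a service may be down while still meeting the SLA."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - sla_percent / 100)


for label, sla in (("regional", 99.99), ("multi-region", 99.999)):
    print(f"{label}: {downtime_budget_minutes(sla):.2f} minutes/year")
```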

Once an instance configuration is chosen, you cannot change it in place. Migrating a database between configurations requires creating a new instance and using Dataflow or backup/restore to copy data. Plan the configuration before going to production. See https://cloud.google.com/spanner/docs/instance-configurations

GCP Service Deep Dive: Spanner Replication Mechanics

Spanner uses Paxos rather than Raft, and the difference shows up in failover behavior. Each split has a fixed set of voting replicas. A write must be acknowledged by a majority of voters before commit. With three voters, two acknowledgments suffice. With five voters across regions, three.
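The majority rule above is simple enough to write down. This is a minimal sketch of the arithmetic only, not Spanner code, and the function names are made up for illustration.

```python
def quorum_size(voters: int) -> int:
    """Smallest number of acknowledgments that forms a majority of voters."""
    if voters < 1:
        raise ValueError("a Paxos group needs at least one voter")
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be unreachable while a majority still exists."""
    return voters - quorum_size(voters)


for n in (3, 5):
    print(f"{n} voters: quorum={quorum_size(n)}, tolerates={tolerated_failures(n)} down")
```

With three voters the group keeps committing after losing one; with five it keeps committing after losing two.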

The leader replica handles all writes. It receives the request, assigns a TrueTime-derived commit timestamp, replicates the log entry to other voters, waits for quorum, waits out the TrueTime uncertainty (commit-wait), and then acknowledges the client. Cloud Spanner High Availability Design pays for external consistency in that commit-wait window, which is typically a few milliseconds.
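Commit-wait can be illustrated with a toy clock. The sketch below is a simulation under stated assumptions, not Google's implementation: SimulatedTrueTime, TTInterval, and commit are hypothetical names, time is a plain number of milliseconds advanced by hand, and the uncertainty is a fixed epsilon. The commit timestamp is taken from the upper bound of the interval, and the commit is only acknowledged once the lower bound has moved past it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float  # lower bound on true time, in ms
    latest: float    # upper bound on true time, in ms


class SimulatedTrueTime:
    """Toy clock: true time is only known to within +/- epsilon_ms."""

    def __init__(self, epsilon_ms: float, start_ms: float = 0.0):
        self.epsilon_ms = epsilon_ms
        self.clock_ms = start_ms

    def now(self) -> TTInterval:
        return TTInterval(self.clock_ms - self.epsilon_ms,
                          self.clock_ms + self.epsilon_ms)

    def advance(self, ms: float) -> None:
        self.clock_ms += ms


def commit(tt: SimulatedTrueTime, step_ms: float = 0.5):
    """Pick a commit timestamp, then wait until it is provably in the past."""
    commit_ts = tt.now().latest
    waited_ms = 0.0
    # Commit-wait: do not acknowledge until even the earliest possible
    # "now" is later than the chosen timestamp.
    while tt.now().earliest <= commit_ts:
        tt.advance(step_ms)
        waited_ms += step_ms
    return commit_ts, waited_ms


tt = SimulatedTrueTime(epsilon_ms=3.5, start_ms=100.0)
first_ts, first_wait = commit(tt)
second_ts, _ = commit(tt)
print(first_ts, first_wait, second_ts)
```

In this model the wait is a little over twice epsilon, and any commit that starts after the first one is acknowledged necessarily receives a larger timestamp.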

Read-only replicas receive the replicated log asynchronously. They cannot vote, so they do not slow down writes. They serve two read modes: stale reads (with bounded staleness, very fast) and strong reads (which require checking with the leader for the latest timestamp, slightly slower but still local data fetch).

Witnesses receive log entries and vote, but they do not store the actual data. In a configuration such as nam3, the witness region can vote on a write whose leader sits in one of the two read-write regions without ever materializing the rows. This makes the third site dramatically cheaper while still providing the quorum needed for region failure tolerance.

For a multi-region instance like nam-eur-asia1, the topology has read-write replicas in two North American regions, a witness in a third, and read-only replicas in Europe and Asia. Each split's leader can live in either read-write region. You configure a default leader region, and Spanner places leaders there when possible. Reads from the leader region get the lowest latency; strong reads from other regions pay slightly higher latency because of the extra hop to the leader to verify timestamps.

For latency-sensitive read-heavy workloads, set the default leader region close to your write traffic and use bounded staleness reads (for example, 10 seconds stale) from other regions. This avoids the cross-region round-trip and still gives a fresh-enough view for most analytical queries. See https://cloud.google.com/spanner/docs/timestamp-bounds

PDE scenarios that pair a 99.999 percent SLA or "survive a region failure" requirement with read-your-writes flows (create-then-render, checkout confirmation, post-UPDATE audit lookups) expect a multi-region configuration (two read-write regions plus a witness such as nam3, or a larger one such as nam-eur-asia1 that adds read-only regions) plus strong reads or a read-write transaction for the read-your-writes path. Bounded-staleness reads on read-only replicas may miss the last few seconds of writes, and regional 99.99 percent configurations cannot survive a region outage even with extra read-only replicas attached. See https://cloud.google.com/spanner/docs/instance-configurations and https://cloud.google.com/spanner/docs/timestamp-bounds

Common Pitfalls & Trade-offs

Cloud Spanner High Availability Design has several traps that catch teams who treat Spanner like a regular relational database.

The first trap is choosing a multi-region configuration "to be safe" when your workload only serves one country. Multi-region writes pay cross-region latency on every commit, often 30 to 80 milliseconds versus 5 to 10 for regional. A web checkout that does ten sequential writes can feel sluggish even though the database is technically healthy. If your users and your application servers are in one geography, regional is the correct answer. See https://cloud.google.com/spanner/docs/instance-configurations
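The checkout example is worth working through with the latency ranges quoted above. This is back-of-the-envelope arithmetic that assumes the ten commits run strictly one after another and ignores everything except commit latency.

```python
def sequential_commit_latency_ms(writes: int, per_commit_ms: tuple) -> tuple:
    """Total (best, worst) latency when commits cannot overlap."""
    low, high = per_commit_ms
    return writes * low, writes * high


regional = sequential_commit_latency_ms(10, (5, 10))
multi_region = sequential_commit_latency_ms(10, (30, 80))
print("regional checkout:", regional, "ms")
print("multi-region checkout:", multi_region, "ms")
```

Ten sequential regional commits cost 50 to 100 milliseconds in total, while the same flow on a multi-region instance costs 300 to 800 milliseconds, which is well into the range users notice.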

The second trap is hot keys. If your primary key is a monotonically increasing integer or a current timestamp, every new row goes to the same split, which means the same leader, in the same zone. That single leader becomes the bottleneck for all writes, and worse, if that zone fails the new-leader election only protects existing data; the hot-spot pattern resumes against the new leader. Cloud Spanner High Availability Design assumes load is distributed across splits, and it cannot help when the schema funnels everything into one.

Use UUID v4 primary keys, hash a sequence into a bit-reversed value, or shard a timestamp by prepending a hash of the row. Spanner's docs call this "preventing hotspots" and the exam will test it under the schema design banner.
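The three key strategies can be sketched in client-side Python. This is an illustration under assumptions rather than a recommended library: the function names, the 16-shard default, and the choice of SHA-256 for the prefix are all hypothetical. Because Spanner's INT64 is signed, the sketch also shows how to reinterpret the bit-reversed value as a signed 64-bit integer.

```python
import hashlib
import uuid


def bit_reverse_64(n: int) -> int:
    """Reverse the bits of an unsigned 64-bit integer."""
    if not 0 <= n < 2**64:
        raise ValueError("n must fit in 64 unsigned bits")
    return int(format(n, "064b")[::-1], 2)


def to_signed_64(n: int) -> int:
    """Reinterpret an unsigned 64-bit value as two's-complement signed."""
    return n - 2**64 if n >= 2**63 else n


def shard_prefixed_key(natural_key: str, timestamp_iso: str,
                       num_shards: int = 16) -> tuple:
    """Prefix a timestamp key with a stable shard number derived from a hash."""
    digest = hashlib.sha256(natural_key.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % num_shards
    return shard, timestamp_iso, natural_key


def random_key() -> str:
    """UUID v4: random, so consecutive inserts land in unrelated key ranges."""
    return str(uuid.uuid4())


# Consecutive logical IDs end up far apart in key space after bit reversal.
for logical_id in (1, 2, 3):
    print(logical_id, to_signed_64(bit_reverse_64(logical_id)))
print(shard_prefixed_key("device-42", "2024-01-01T00:00:00Z"))
```

The shard prefix spreads writes across at most num_shards key ranges, at the cost of fanning a time-range query out across every shard.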

A third trap is forgetting that read-only replicas do not improve write availability. If you add a us-central1 read-only replica to a regional us-east1 instance, you have not bought any disaster tolerance against an east1 outage. You only added local reads in central1. To survive a region failure you need the data and the votes in another region, which means a multi-region or dual-region configuration.

A fourth trap is misunderstanding stale reads. Bounded staleness reads are served by the closest replica without contacting the leader. They are fast but they may miss the last few seconds of writes. Code that reads its own writes (for example, "create order then immediately render order summary") must use strong reads or a transaction, otherwise users see stale data right after their own action.
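One way to keep this decision explicit in application code is a small helper that picks the read bound per call site. The sketch below only builds keyword arguments. It assumes the google-cloud-spanner Python client, where a single-use snapshot accepts a max_staleness timedelta and a snapshot with no bound is a strong read; the helper name and the 10-second default are illustrative.

```python
import datetime


def snapshot_kwargs(read_your_writes: bool,
                    tolerable_staleness_s: float = 10.0) -> dict:
    """Choose the timestamp bound for a read.

    Strong reads (empty kwargs) always see the caller's own committed writes.
    Bounded-staleness reads can be served by a nearby replica but may miss
    the most recent writes.
    """
    if read_your_writes:
        return {}
    if tolerable_staleness_s <= 0:
        raise ValueError("staleness must be positive for a stale read")
    return {"max_staleness": datetime.timedelta(seconds=tolerable_staleness_s)}


# Assumed usage with a google.cloud.spanner Database object named `database`:
#   with database.snapshot(**snapshot_kwargs(read_your_writes=True)) as snap:
#       rows = snap.execute_sql("SELECT ...")
print(snapshot_kwargs(read_your_writes=True))
print(snapshot_kwargs(read_your_writes=False))
```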

Best Practices

A short, opinionated list. None of these are optional for production workloads.

Place the default leader region where your write traffic originates. Cross-region writes are the most expensive operation in Spanner, and putting the leader near the writer cuts commit latency by tens of milliseconds.
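The default leader is a database option set through DDL. The helper below only builds the statement string; the database ID and region are hypothetical, and the region has to be one of the read-write regions of the instance configuration in use.

```python
def default_leader_ddl(database_id: str, leader_region: str) -> str:
    """Build the DDL statement that sets a database's default leader region."""
    if not database_id or not leader_region:
        raise ValueError("database_id and leader_region are required")
    return (
        f"ALTER DATABASE `{database_id}` "
        f"SET OPTIONS (default_leader = '{leader_region}')"
    )


# Hypothetical names: a database called payments-db on a nam3 instance.
print(default_leader_ddl("payments-db", "us-east4"))
```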

Design primary keys for distribution from day one. Bit-reverse sequential IDs, use UUIDs, or prefix with a hash of a high-cardinality field. Schema migrations to fix hot keys later are painful because Spanner has no native ALTER for primary key.

Use interleaved tables to colocate parent and child rows in the same split. A Customer with its Orders interleaved means a "fetch customer and orders" query touches one split, one leader, one round-trip. This reduces tail latency dramatically and matters even more in multi-region topologies.

Set up Point-in-Time Recovery (PITR) with a retention window matching your RPO. PITR is independent of replication and protects against logical corruption (a bad UPDATE wiping rows), which replication alone does not.
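The PITR window is controlled by the version_retention_period database option. The sketch below builds the DDL string and checks the value against the range the author understands Spanner to accept (one hour to seven days); the function and constant names are hypothetical, and the limits should be confirmed against the current documentation.

```python
MIN_RETENTION_HOURS = 1        # assumed lower limit
MAX_RETENTION_HOURS = 7 * 24   # assumed upper limit of seven days


def retention_ddl(database_id: str, retention_hours: int) -> str:
    """Build the DDL statement that sets the PITR retention window."""
    if not MIN_RETENTION_HOURS <= retention_hours <= MAX_RETENTION_HOURS:
        raise ValueError(
            f"retention must be between {MIN_RETENTION_HOURS}h "
            f"and {MAX_RETENTION_HOURS}h"
        )
    return (
        f"ALTER DATABASE `{database_id}` "
        f"SET OPTIONS (version_retention_period = '{retention_hours}h')"
    )


# Hypothetical database name; 48 hours of recoverable history.
print(retention_ddl("payments-db", 48))
```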

Monitor the leader fraction across regions. If you set one read-write region as the default leader but most leaders end up in the other read-write region, your write latency assumptions break. The Spanner monitoring page shows leader distribution per region.

Test failover. Use Spanner's documented failover procedure in a non-production instance and measure your application's behavior. Many client libraries retry transparently, but some application code holds stale connections and needs explicit error handling.
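Explicit error handling for the brief window around a leader election usually means retrying with backoff. The sketch below is generic Python, not the Spanner client's built-in retry logic: TransientError is a stand-in for whatever retryable error your client surfaces, and the attempt count and delays are arbitrary illustrative values.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for a retryable failure, such as a request sent mid-election."""


def call_with_retry(operation, max_attempts: int = 5,
                    base_delay_s: float = 0.1, sleep=time.sleep):
    """Run operation(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and let the caller see the failure
            delay = base_delay_s * 2 ** (attempt - 1)
            # Jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

The operation passed in must be safe to run more than once, for example a whole transaction function rather than a single statement in the middle of one.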

Cap node count growth with monitoring. Spanner scales linearly, but cost scales with it. Set up alerts on CPU utilization above 65 percent and add nodes proactively rather than reactively.

Real-World Use Case

A mid-size payments platform with a Singapore engineering team and customers across Southeast Asia chose Spanner to replace a sharded MySQL deployment that had become unmanageable. They process roughly 4,000 transactions per second at peak, with strict regulatory requirements that no transaction may be lost and that audit reads must reflect external consistency.

They started with a regional asia-southeast1 configuration: three zones, three read-write replicas, default leader in asia-southeast1-a. Write latency averaged 7 milliseconds, p99 under 25 milliseconds. The 99.99 percent SLA covered their initial compliance requirement. Cost was roughly 60 percent of the equivalent MySQL fleet they replaced, factoring in operations savings.

After two years and expansion into Indonesia and the Philippines, they migrated to a multi-region asia1 configuration with read-write replicas in Tokyo and Osaka and a witness in Seoul. Write latency rose to about 22 milliseconds because the Tokyo-Osaka quorum is now in the critical path, but they gained the 99.999 percent SLA and the ability to survive a Tokyo outage without data loss. The application team had to update a few endpoints to use bounded-staleness reads where strong consistency was not strictly required, which kept user-facing latency stable.

For analytical workloads, they added a Dataflow job that streams change data via Spanner's change streams feature into BigQuery. This offloads heavy aggregation from the operational database and gives the BI team a query surface independent of OLTP load. Cloud Spanner High Availability Design held up through one zone outage in Tokyo during the second year, with automatic leader re-election completing in under 30 seconds and zero data loss.

Exam Tips

The PDE exam tests Cloud Spanner High Availability Design across several question patterns. Recognize them and the answers fall out cleanly.

Pattern one: choose between regional and multi-region given an SLA target. 99.99 percent maps to regional, 99.999 percent maps to multi-region. If the question mentions "survive a regional outage" or "regulatory disaster recovery", the answer involves multi-region or dual-region.
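Pattern one reduces to a two-input rule. The function below is a study aid that encodes only the mapping stated above; the name and return strings are made up, and real designs also weigh latency and cost.

```python
def choose_configuration(sla_percent: float,
                         must_survive_region_outage: bool) -> str:
    """Map an exam-style requirement to a configuration family."""
    if must_survive_region_outage or sla_percent > 99.99:
        return "multi-region"
    return "regional"


print(choose_configuration(99.99, must_survive_region_outage=False))
print(choose_configuration(99.999, must_survive_region_outage=False))
print(choose_configuration(99.9, must_survive_region_outage=True))
```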

Pattern two: explain why writes are slow in a multi-region instance. The answer is cross-region Paxos quorum plus TrueTime commit-wait. Watch for distractors that blame schema design when the topology is the real cause, and vice versa.

Pattern three: pick a primary key strategy. Monotonic integers and timestamps are wrong. UUIDs, bit-reversed sequences, or hash-prefixed keys are right. The exam loves to give you a high-throughput logging table with a timestamp PK and ask why writes hot-spot.

Pattern four: compare TrueTime to NTP or other clock systems. TrueTime gives bounded uncertainty and is what enables external consistency. Other clock services do not bound uncertainty, so they cannot serialize transactions globally without a coordinator.

Pattern five: choose between strong reads and stale reads. Read-your-writes scenarios need strong. Dashboards and analytics tolerate staleness measured in seconds. Stale reads are cheaper and often local-only.

Pattern six: failover behavior. Spanner failover is automatic at the split level via Paxos leader election. There is no manual failover for zone failures within a regional instance. For region failures in a multi-region instance, the surviving regions take over without operator action, though new leader election may briefly elevate latency.

Regional Spanner: 99.99 percent, single region, low write latency, survives zone failure. Multi-region Spanner: 99.999 percent, geographically spread, higher write latency, survives region failure. TrueTime + commit-wait = external consistency. Reference: https://cloud.google.com/spanner/docs/instance-configurations

Schema Design for High Availability

Schema choices interact with availability in ways that are not obvious until you measure. Cloud Spanner High Availability Design starts with the keys.

Avoid sequential keys. A serial counter as primary key forces every new row to the highest-key split, where one leader handles every insert. If that leader's zone fails, the new leader inherits the same hot pattern. Use UUID v4 or apply the bit-reversal trick: store the integer as its bit-reversed counterpart so consecutive logical IDs land in different splits.

Use interleaved tables for parent-child relationships. INTERLEAVE IN PARENT physically colocates a child row with its parent in the same split. A query that joins parent and child becomes a single-split lookup with no cross-split coordination, which improves both latency and availability (fewer splits in the critical path means fewer failure surfaces).
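For reference, this is roughly what the Customer and Orders example looks like as GoogleSQL DDL, held in a Python list the way a client would submit it. The column names and types are invented for illustration. The child table's primary key starts with the parent's key, which is what lets Spanner store each order next to its customer.

```python
# Hypothetical schema: column names and types are illustrative only.
INTERLEAVED_SCHEMA_DDL = [
    """CREATE TABLE Customers (
  CustomerId STRING(36) NOT NULL,
  Name STRING(MAX)
) PRIMARY KEY (CustomerId)""",
    """CREATE TABLE Orders (
  CustomerId STRING(36) NOT NULL,
  OrderId STRING(36) NOT NULL,
  Total NUMERIC
) PRIMARY KEY (CustomerId, OrderId),
  INTERLEAVE IN PARENT Customers ON DELETE CASCADE""",
]

# Assumed usage with the google-cloud-spanner client:
#   operation = database.update_ddl(INTERLEAVED_SCHEMA_DDL)
#   operation.result()
for statement in INTERLEAVED_SCHEMA_DDL:
    print(statement, end="\n\n")
```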

Limit interleaving depth. Spanner allows up to seven levels of interleaving but practical schemas rarely go past three. Deeper interleaving makes the schema harder to evolve and can produce oversized splits if one parent has millions of children.

Use secondary indexes carefully. Each index is itself a Paxos-replicated structure. Indexes on monotonic values create their own hot spots, even when the base table has a well-distributed primary key. STORING clauses on indexes let you serve read queries without touching the base table, which is great for read availability but adds write cost.

Plan for schema changes ahead of time. Spanner allows online schema changes (adding columns, indexes, even tables) without downtime, but some changes trigger validation or backfill work (adding NOT NULL to an existing column, creating an index on a large table), and the primary key cannot be altered at all: changing it means creating a new table and copying the data. Cloud Spanner High Availability Design assumes you can migrate live, so design the migration runbook before you need it.

For multi-region instances, consider where parent rows are placed. Spanner does not let you pin specific rows to specific regions, but you can use leader region settings to influence where writes land. If you have customer-level partitioning by geography, the customer ID prefix can correlate with the default leader region for that workload.

Frequently Asked Questions (FAQ)

What is the difference between regional and multi-regional Spanner instance configurations?

Regional configurations place three read-write replicas in three zones of a single region. They give a 99.99 percent SLA and the lowest write latency, but cannot survive a region-wide outage. Multi-regional configurations spread read-write replicas across two regions, add a witness in a third, and may add read-only replicas in further regions. They give a 99.999 percent SLA and survive region failure, at the cost of higher write latency due to cross-region quorum. Pick regional for single-geography workloads and multi-region for global or compliance-driven scenarios.

How does TrueTime enable external consistency?

TrueTime returns time as an interval rather than a single value, with bounded uncertainty (typically under 7 milliseconds). When Spanner commits a transaction, it picks a timestamp at the upper bound of the interval and waits until the current time provably exceeds that timestamp before acknowledging the client. This commit-wait guarantees that any later transaction will see a strictly higher timestamp, which is what external consistency requires. Without TrueTime, achieving the same property would require a global coordinator, which would not scale.

What is a witness replica and when should I use one?

A witness replica votes in Paxos but stores no data. It exists in multi-region configurations to provide a third voting site without the cost of a third full data copy. You do not choose witnesses individually; they come built into specific instance configurations like nam3 (two read-write regions plus a witness in a third). The witness lets the system maintain quorum if one of the two read-write regions fails, which is the heart of multi-region Cloud Spanner High Availability Design.

How does Spanner handle a zone failure in a regional instance?

Each split has three voting replicas across three zones. If one zone goes down, the remaining two voters still form a quorum, so writes continue with no data loss. If the failed zone hosted the leader for some splits, Paxos elects new leaders from the surviving voters within seconds. Clients may see brief errors during election (typically retried automatically by client libraries) but the database remains available. No operator action is required.

Can I change a Spanner instance configuration after creation?

You cannot modify the instance configuration in place. Migrating from regional to multi-region or vice versa requires creating a new instance with the desired configuration and copying data using Dataflow, backup/restore, or change streams. This is one reason Cloud Spanner High Availability Design must be planned upfront. Pick your configuration based on the SLA and disaster recovery requirements you expect to need within the next two years, not just the immediate need.

What primary key designs cause hot spots in Spanner?

Monotonically increasing keys (auto-increment integers, current timestamps, sequential UUIDs like UUID v1) all funnel new rows to the highest-key split, creating a write hot spot on a single leader. Solutions include UUID v4 (random), bit-reversed sequences (store the bit-reversed integer), or hash-prefixed keys (prepend a hash of another column). Hot spots hurt both performance and availability, because the bottleneck leader becomes a single point of contention even though the database is distributed.

Does adding read-only replicas improve write availability?

No. Read-only replicas serve reads but do not participate in write quorum. Adding read-only replicas in another region does not protect against a write region failure. To improve write availability across regions you need a multi-region instance configuration with read-write replicas in multiple regions, optionally plus a witness. Read-only replicas are useful for offloading read traffic and reducing read latency for distant clients, not for disaster recovery.
