Introduction to Cloud Spanner Query Optimization
Cloud Spanner Query Optimization is the discipline of shaping schemas, indexes, and SQL so a globally distributed database actually behaves like one. A query that looks innocent in a single-node Postgres can scatter across dozens of splits in Spanner and saturate the network. Cloud Spanner Query Optimization is what turns a 4-second tail latency into a 40 ms p99, and it sits squarely on the GCP Professional Data Engineer blueprint.
白話文解釋(Plain English Explanation)
Cloud Spanner Query Optimization Is Like Routing a Pizza Order Through a Multi-City Kitchen
Imagine a pizza chain with kitchens in Tokyo, Frankfurt, and São Paulo. A customer in Osaka orders pepperoni. If the dispatcher sends the ticket to São Paulo, the pizza arrives cold. Cloud Spanner Query Optimization is the dispatcher: it figures out which split owns the row you want and routes the request to the closest replica that can serve it. Get the routing wrong and every order pays a transcontinental tax. Get it right and the kitchen across the street handles it in seconds.
The EXPLAIN plan is the dispatcher's printed ticket. It tells you which kitchen took the order, which prep stations were involved, and where the bottleneck lived. Interleaved tables are like keeping the pepperoni topping in the same fridge as the pizza dough that uses it, so the cook never has to walk across the street.
Secondary Indexes Are Like the Card Catalog in a 1980s Library
Walk into a library and look for "books by Murakami". You do not start in row one of the stacks. You go to the card catalog, find the call numbers under M, then walk straight to the shelf. A Spanner secondary index is the card catalog. The base table is the stacks. Without an index, every "WHERE author = 'Murakami'" forces a full scan of every shelf.
A STORING clause adds extra fields to the index card itself, so you do not even need to walk to the stacks. If the card already shows the title and the publication year, the librarian answers from the catalog. That is a covering index, and Cloud Spanner Query Optimization lives or dies by whether your hot queries are covered.
NULL_FILTERED indexes are the catalog with the empty cards thrown out. If 90 percent of your books have no ISBN, why waste catalog drawers on them? Filter the nulls and the catalog shrinks.
Hot Keys Are Like Every Driver Taking the Same Highway Exit at 5 PM
A monotonically increasing primary key (a timestamp, an auto-increment id) is the rush-hour exit. Every new write hits the same split, which lives on one machine, which has one CPU. The other 99 percent of your cluster is idle while one node melts. Cloud Spanner Query Optimization is partly about not building that exit. UUIDv4 prefixes, hash-bucketed keys, and bit-reversed sequences are the on-ramps that spread the traffic across every lane.
A read timestamp is the difference between asking the traffic camera "what is happening right now" (strong read, expensive, blocks) versus "what was happening 15 seconds ago" (stale read, cheap, served from any replica). Most dashboards do not need real-time. They just think they do.
Core Concepts of Cloud Spanner Query Optimization
The Spanner query engine has its own dialect of relational thinking. Cloud Spanner Query Optimization rests on a handful of primitives.
Splits. Spanner shards a table into ranges of primary key called splits. Each split lives on a Paxos group of replicas. The optimizer's first job is to figure out which splits a query touches.
Distributed union. When a query crosses splits, the executor builds a distributed union: it dispatches sub-queries to each split's leader and merges the results. The cost of a query is roughly the cost of the most expensive sub-query plus the merge.
Locking reads versus snapshot reads. Read-write transactions take locks. Read-only transactions take a snapshot at a chosen timestamp and never block writers. Cloud Spanner Query Optimization usually means moving as much work as possible into read-only snapshots.
Query optimizer versions. Spanner pins a query_optimizer_version per database. New versions can change plans. Pin a version explicitly when you ship to production so a Google rollout does not silently regress your p99.
Index intersection and force_index. Spanner can intersect two indexes for an AND predicate, but the optimizer will not always pick the best path. The @{FORCE_INDEX=index_name} hint is the override.
Partitioned DML. A separate execution model for bulk UPDATE and DELETE that bypasses the 20,000-mutation limit per transaction by splitting the operation across partitions.
Architecture and Design Patterns
A Spanner database is a tree. The root is the instance. Below it sit databases, then tables, then splits, then rows. Every primary key is the address that locates a row inside this tree, and every query plan is a walk through it.
The most consequential pattern is co-location through interleaving. When you declare INTERLEAVE IN PARENT Customers, the rows of the child Orders table physically sit next to the parent customer row inside the same split. A join between Customers and Orders for a single customer becomes a local scan instead of a distributed lookup. This is the single most powerful schema move in Cloud Spanner Query Optimization, and it is also the most irreversible. You cannot un-interleave without rebuilding the table.
The opposite pattern is deliberate de-interleaving. If a child table grows unboundedly per parent (think audit logs per user), interleaving creates fat splits that Spanner cannot rebalance. In that case, keep the foreign key but drop the INTERLEAVE clause and let Spanner shard the child table independently.
A third pattern is the covering index sandwich. Build a secondary index on the filter columns and add the projected columns through STORING. Now the entire query plan is index-only: zero base-table fetches, predictable latency, no surprise hot rows.
For write-heavy workloads, the bit-reversed primary key pattern spreads inserts across the keyspace. Take a sequence number, reverse its bits, and use that as the leading key column. The reversed bits look random to Spanner's split logic, so writes fan out instead of stacking.
GCP Service Deep Dive
EXPLAIN and EXPLAIN ANALYZE
EXPLAIN returns the plan without running the query; EXPLAIN ANALYZE runs it and attaches actual row counts, latencies, and CPU. In the Cloud Console the plan renders as a tree where each node shows scan type, predicate pushdown, and rows processed. The leaf nodes you want to see are Index Scan (good) and Distributed Union over Index Scan (good if scoped). The leaf you do not want to see is Full Scan on a table with more than a few thousand rows.
A subtle gotcha: a node labeled Filter after a scan means Spanner pulled rows it then had to throw away. That is wasted I/O. Push the predicate into the index by adding the filter column to the index definition.
Interleaved Tables in Practice
Declaration looks like this:
CREATE TABLE Orders (
CustomerId INT64 NOT NULL,
OrderId INT64 NOT NULL,
OrderTs TIMESTAMP NOT NULL,
Total NUMERIC,
) PRIMARY KEY (CustomerId, OrderId),
INTERLEAVE IN PARENT Customers ON DELETE CASCADE;
The ON DELETE CASCADE is not just a foreign-key convenience; it tells Spanner that the child rows are physically owned by the parent and can be deleted in the same atomic transaction without crossing splits. Cloud Spanner Query Optimization heavily favours this clause for parent-child relationships that are tightly coupled.
Secondary Indexes: NULL_FILTERED and STORING
CREATE NULL_FILTERED INDEX OrdersByEmail
ON Orders (CustomerEmail)
STORING (Total, OrderTs);
NULL_FILTERED excludes rows where the indexed column is NULL. This shrinks the index dramatically when the column is sparsely populated. It also changes semantics: a query that expects to see NULLs must not use this index, so the optimizer will skip it for WHERE CustomerEmail IS NULL.
STORING makes the index covering. Without it, every match incurs a back-reference to the base table to fetch Total and OrderTs. With it, the index alone answers the query.
FORCE_INDEX Hint
SELECT OrderId, Total
FROM Orders@{FORCE_INDEX=OrdersByEmail}
WHERE CustomerEmail = '[email protected]';
The hint is a hammer. Use it when you have proven via EXPLAIN ANALYZE that the optimizer chose poorly, and document why. Spanner upgrades may eventually remove the need, but the hint stays harmless because it pins behaviour against version drift.
Hot Key Avoidance
Anti-patterns: INT64 auto-increment, CURRENT_TIMESTAMP() as the leading key, sequential UUIDs (UUIDv1, ULID without entropy first). Patterns that work: UUIDv4 leading bytes, FARM_FINGERPRINT of a natural key prefixed to a sequence, bit-reversed sequences via Spanner's BIT_REVERSED_POSITIVE sequence kind.
For time-series, the canonical fix is (ShardId, EventTs) where ShardId = HASH(...) MOD N. Reads for a time range fan out across N splits in parallel; writes spread instead of piling up.
Batch DML versus Partitioned DML
Batch DML packs multiple statements into a single round trip but still executes inside one read-write transaction, bound by the 80,000-mutation cell limit and the 100 MB transaction size limit. Use it for moderate batches.
Partitioned DML is a different beast. It splits the statement across partitions and commits each partition independently. There is no atomicity across partitions, but there is no mutation cap either. Use it for backfills like "set status = 'archived' where created_at < 2023-01-01" against billions of rows. The statement must be idempotent because Spanner may retry partitions.
-- Partitioned DML is invoked at the API layer, not in pure SQL.
-- Java example:
spannerClient.executePartitionedUpdate(
Statement.of("DELETE FROM Sessions WHERE LastActive < @cutoff")
);
Read Timestamps: Strong versus Stale
Strong reads return the latest committed data and may need to wait for in-flight transactions to drain. They cost a network round trip to the leader. Stale reads come in two flavours: exact_staleness (read at exactly N seconds ago) and bounded_staleness (read at any point within the last N seconds, optimizer's choice). Both can be served by any replica, including the closest one to the client.
A 15-second bounded_staleness is the workhorse for dashboards, analytics, and any read that does not require read-your-writes consistency. The latency drop versus a strong read is often 10x.
Commit Timestamp
spanner.commit_timestamp=true on a TIMESTAMP column lets Spanner write the actual commit timestamp into the row at commit time, monotonically and gap-free. It is the right way to build change-data-capture, audit trails, and ordering keys without the hot-key trap of CURRENT_TIMESTAMP() (because the commit timestamp column is usually not the leading primary key).
Table Options and Schema Design
OPTIONS (optimizer_version = 6) pins the query optimizer per statement. OPTIONS (read_lock_mode = 'PESSIMISTIC') controls locking behaviour for read-write transactions. Schema design choices that pay back forever: small primary keys (every secondary index repeats them), STRING(MAX) only when truly variable, and TIMESTAMP over INT64 for time columns so Spanner can use commit-timestamp semantics.
Cloud Spanner Query Optimization always starts with EXPLAIN ANALYZE on production-shaped data. A plan that looks fine on 10,000 rows will mutate at 10 million. Capture plans in CI for your top 20 queries and diff them on every schema change. See: https://cloud.google.com/spanner/docs/query-execution-plans
Four levers carry most PDE Spanner tuning answers: INTERLEAVE IN PARENT co-locates child rows for local joins, a STORING clause on a secondary index turns hot queries into index-only scans, @{FORCE_INDEX=...} pins the plan when the optimizer picks wrong, and partitioned DML bypasses the 80,000-mutation-per-transaction cap for backfills like DELETE FROM Sessions WHERE LastActive < @cutoff. Pin query_optimizer_version per database so Google's rollouts cannot quietly re-plan these queries.
See: https://cloud.google.com/spanner/docs/secondary-indexes
Using CURRENT_TIMESTAMP() or a monotonic INT64 as the leading primary key is the single most common Spanner anti-pattern. Every write lands on the last split, which lives on one node, and your throughput caps at a few thousand QPS regardless of how many nodes you provision. Bit-reversed sequences or hash-bucketed keys are the fix.
See: https://cloud.google.com/spanner/docs/schema-design
Pin optimizer_version per database in production. Otherwise a Spanner rollout can quietly change your plans and shift a 20 ms query into 200 ms territory between Tuesday and Wednesday.
See: https://cloud.google.com/spanner/docs/query-optimizer
A child table whose rows are physically co-located with the parent row inside the same split. Declared with INTERLEAVE IN PARENT. Joins between parent and child for a single parent key become local operations, not distributed unions.
See: https://cloud.google.com/spanner/docs/schema-and-data-model#parent-child_table_relationships
For the PDE exam: STORING makes an index covering, NULL_FILTERED shrinks an index when nulls dominate, FORCE_INDEX overrides the optimizer, INTERLEAVE co-locates, bit-reversed keys spread writes, partitioned DML bypasses the mutation cap. These six levers cover most Spanner question stems. See: https://cloud.google.com/spanner/docs/secondary-indexes
Common Pitfalls and Trade-offs
The first pitfall is over-interleaving. A child table with thousands of rows per parent makes the parent split fat, and Spanner cannot split inside a parent's row group. The result is a split that grows past the recommended 4 GB and causes hotspots. Reserve interleaving for tight one-to-few relationships.
The second is secondary index write amplification. Every secondary index is a separate write on every insert and update touching the indexed columns. Five indexes on a high-write table mean six writes per insert. Cloud Spanner Query Optimization is always a balance: indexes speed reads, slow writes, and consume storage.
The third is trusting the default optimizer version. New versions sometimes regress specific plans. Pin the version, then upgrade deliberately with EXPLAIN diffs in hand.
The fourth is mixing strong and stale reads carelessly. A read-modify-write loop that uses stale reads on the read leg can write stale data on top of newer commits. Strong reads belong inside read-write transactions; stale reads belong inside read-only transactions.
The fifth is abusing partitioned DML. It is not transactional across partitions. A statement that sets balance = balance - 100 cannot be partitioned because partial application would corrupt the data. Partitioned DML is for idempotent, partition-independent updates.
The sixth is ignoring lock contention. A long-running read-write transaction holds locks across every row it touched. If it conflicts with a hot row, every other writer queues. Move bulk work to partitioned DML or split it into many small transactions.
Best Practices
- Run
EXPLAIN ANALYZEagainst production-shaped data and capture plans in CI for your top 20 queries. - Pin
query_optimizer_versionand upgrade deliberately, never automatically. - Build covering indexes (
STORINGclause) for the predicates that drive your p99 latency. - Use
NULL_FILTEREDindexes when the indexed column is sparsely populated. - Reserve
FORCE_INDEXfor cases where EXPLAIN proves the optimizer chose poorly, and leave a comment explaining why. - Avoid monotonic primary keys; prefer UUIDv4 prefixes, hash-bucketed shards, or bit-reversed sequences.
- Keep read-write transactions under one second; push bulk work to partitioned DML.
- Default to
bounded_stalenessreads (10 to 15 seconds) for dashboards and analytics. - Use
spanner.commit_timestamp=truefor audit and CDC columns instead ofCURRENT_TIMESTAMP(). - Limit the number of secondary indexes on hot-write tables; every index is another write per row.
Real-World Use Case
A mid-sized fintech runs a global payments ledger on Cloud Spanner across us-central1, europe-west4, and asia-northeast1. Their ledger table started with (TransactionId INT64 NOT NULL) as a sequential primary key. By month three of production, write throughput hit a wall at 12,000 TPS even though they had provisioned 30 nodes. CPU on one node sat at 95 percent; the other 29 sat at 8 percent.
The on-call team ran EXPLAIN ANALYZE on the slowest queries and saw every insert landing on the same split. The fix was a schema migration: introduce a ShardId column derived from MOD(FARM_FINGERPRINT(AccountId), 64) and rebuild the primary key as (ShardId, AccountId, TransactionId). They added a covering secondary index OPTIONS (optimizer_version = 6) on (AccountId, TransactionTs DESC) STORING (Amount, Currency) to keep the per-account history query plan as an index-only scan.
For their nightly archival job, which previously timed out hitting the 80,000-mutation limit, they switched from batch DML to partitioned DML: DELETE FROM Sessions WHERE LastActive < @cutoff. The job dropped from 4 hours of retried batches to 22 minutes of parallel partition deletes.
For their fraud dashboard, they switched read consistency from strong to bounded_staleness(15s) and replica routing from leader to nearest. The p95 latency for the dashboard's main query fell from 380 ms to 41 ms, and the leader region's CPU dropped 30 percent because reads no longer all funnelled there.
The end state: 38,000 sustained TPS on the same 30 nodes, ledger queries at 18 ms p95, and no hot splits. Cloud Spanner Query Optimization was the lever; the schema and the read patterns were the pivots.
Exam Tips
The PDE exam likes Spanner questions that frame a symptom and ask for the cleanest fix. A few patterns to internalize for Cloud Spanner Query Optimization questions.
When the stem says "high write latency on inserts" and shows a sequential primary key, the answer is almost always change the key strategy (hash bucket, bit-reverse, UUID prefix). Adding nodes is a distractor; nodes do not help when one split is the bottleneck.
When the stem describes "a query that filters on column X and column Y is slow" and offers options like "create index on X", "create index on X STORING Y", "create index on Y", the right answer is the covering index. STORING is the keyword the exam wants you to recognize.
When the stem mentions "needs to update billions of rows in one operation" the answer is partitioned DML, not batch DML, not Dataflow, not a loop.
When the stem says "dashboard tolerates 30 seconds of staleness", the answer involves bounded staleness reads, not strong reads. The exam writers explicitly use the staleness language to point you here.
When the stem mentions "must record the exact commit time and the column will be queried in time order", the answer is commit timestamp column, not application-supplied timestamps.
When two indexes look equally plausible, prefer the one with NULL_FILTERED if the question mentions a sparsely populated column. If the question never mentions nulls, do not use NULL_FILTERED.
Watch for distractors that propose Bigtable, Firestore, or BigQuery as the fix. The exam will sometimes test whether you can defend the Spanner choice; if the workload demands strong consistency across regions plus relational SQL, Spanner is the right answer and the question is about tuning it, not replacing it.
Frequently Asked Questions (FAQ)
When should I choose Cloud Spanner Query Optimization techniques over migrating to BigQuery?
Spanner is the right tool when you need transactional reads and writes with sub-second latency, relational SQL, and global consistency. BigQuery is the right tool for analytics over large historical datasets. If your workload is "tens of thousands of point-lookup transactions per second with strong consistency", optimize Spanner. If your workload is "analyst runs a 200 GB scan twice a day", move it to BigQuery. The two are not interchangeable, and the PDE exam tests whether you can tell them apart.
How many secondary indexes is too many for one table?
There is no hard limit, but every secondary index is another row written on every insert and update touching the indexed columns. A table with five indexes pays a 6x write cost. For high-write tables (more than a few thousand QPS of inserts), keep secondary indexes to two or three and lean on covering indexes (STORING) to make those few count. For low-write reference tables, ten indexes are fine.
What is the difference between EXPLAIN and EXPLAIN ANALYZE?
EXPLAIN returns the plan the optimizer would use, without executing. It is cheap and useful for plan diffing. EXPLAIN ANALYZE actually runs the query and attaches real row counts, per-node latency, and CPU. Use ANALYZE during tuning sessions; use plain EXPLAIN inside CI to detect plan regressions without paying query cost.
When should I use bounded_staleness instead of strong reads?
Use bounded_staleness whenever the application can tolerate slightly old data and you do not need read-your-writes consistency. Dashboards, analytics, recommendation lookups, and most read-heavy paths qualify. The latency win is usually 5 to 10x because the read can be served from the nearest replica instead of the leader. Use strong reads only when the read must reflect every commit up to the moment, such as inside a read-modify-write loop or when displaying the user's own just-submitted action.
Can I un-interleave a table after creating it?
No. INTERLEAVE is a physical layout decision baked into the table's storage. Removing it requires creating a new table without INTERLEAVE, copying the data, swapping references, and dropping the old table. This is why interleaving is the most consequential schema decision in Cloud Spanner Query Optimization. Decide carefully up front; if in doubt about the parent-to-child cardinality, leave the relationship as a plain foreign key.
Why does my query plan change between Spanner versions?
Spanner pins a query optimizer version per database. Google rolls out new versions periodically and the default for new databases follows the latest stable. New versions can change plans, sometimes for the better and occasionally for the worse on specific shapes. Pin query_optimizer_version explicitly in production, run EXPLAIN diffs when you upgrade, and only flip the pin once you have validated the top queries.
Is partitioned DML transactional?
No, not across partitions. Each partition commits independently. A statement that sets a flag based on a fixed predicate (UPDATE Sessions SET archived = TRUE WHERE last_active < @cutoff) is safe because the result is the same whether it ran in one transaction or many. A statement that does arithmetic on existing values (UPDATE Accounts SET balance = balance - 100 WHERE user_id = @id) is unsafe because partial application would corrupt the data. Use partitioned DML only for idempotent, partition-independent updates.
Related Topics
- Cloud Spanner High Availability Design — how multi-region configurations and replica placement interact with the query optimization choices in this note.
- BigQuery Data Modeling and Clustering — the analytical counterpart; learn when to push workloads from Spanner to BigQuery.
- Bigtable Schema Design Best Practices — the wide-column alternative when you need petabyte scale without SQL.
Further Reading
- Spanner query execution plans (official docs) — the canonical reference for reading EXPLAIN ANALYZE output, including every node type and what its metrics mean.
- Spanner schema and data model (official docs) — the authoritative source on interleaving, primary keys, and the storage layout that Cloud Spanner Query Optimization ultimately exploits.
- Spanner secondary indexes (official docs) — full coverage of NULL_FILTERED, STORING, and the FORCE_INDEX hint syntax.
- Partitioned DML (official docs) — when and how to use partitioned DML for bulk operations safely.