Introduction to Cloud SQL and AlloyDB for Analytics
Cloud SQL and AlloyDB for analytics is the GCP story for running operational PostgreSQL, MySQL, or SQL Server workloads that also need to answer analytical questions without standing up a full warehouse for every report. Cloud SQL gives you the familiar managed OLTP engines, while AlloyDB layers a columnar accelerator and vector search on top of PostgreSQL. The PDE exam expects you to know when each shines, and how Datastream and BigQuery federated queries glue them into the wider data platform.
白話文解釋(Plain English Explanation)
Three short stories make Cloud SQL and AlloyDB for analytics easier to internalise than any architecture diagram. Each one isolates a different tension the exam likes to test: latency vs throughput, row vs column, and the cost of moving data versus the cost of querying in place.
The Diner Kitchen vs the Catering Warehouse
A diner kitchen handles one ticket at a time: cook the eggs, plate them, send them out in three minutes. That is Cloud SQL doing OLTP — small, fast, transactional reads and writes. A catering warehouse, by contrast, prepares ten thousand identical sandwiches for a corporate event; it batches ingredients on pallets and never touches a single order. That is BigQuery scanning a billion rows. AlloyDB is the gastropub in between: the same kitchen can plate à la carte for the bar but also pre-assemble a hundred party platters in the back, because it keeps both row storage and a columnar cache. When you read about Cloud SQL and AlloyDB for analytics, picture which kitchen the workload belongs in before you reach for the engine.
The Library Card Catalog and the Photocopier
Imagine a library where every book is shelved by call number — finding one specific title is fast, but counting how many books were published in 1987 means walking every aisle. Cloud SQL works like that: B-tree indexes are great for point lookups, miserable for full scans. AlloyDB's columnar engine is the photocopier in the basement that preprints summary sheets of every column. When the exam question describes mixed transactional and analytical traffic on the same database, the columnar cache in Cloud SQL and AlloyDB for analytics is the photocopier that the librarian quietly maintains overnight.
The Conveyor Belt Between Two Factories
Datastream is a conveyor belt connecting the operational factory (Cloud SQL or AlloyDB) to the analytics factory (BigQuery). It picks up every change — insert, update, delete — and drops it on the BigQuery floor within seconds. You do not have to shut down either factory to run it, and you do not have to carry boxes by hand with nightly batch jobs. Whenever a scenario in the PDE exam mentions change data capture from Postgres or MySQL into BigQuery, the conveyor belt is the mental model. Cloud SQL and AlloyDB for analytics depends on this belt to stay near real time.
Core Concepts of Cloud SQL and AlloyDB for Analytics
Cloud SQL is Google's managed offering for PostgreSQL, MySQL, and SQL Server. You pick a machine type, attach storage, enable High Availability if you need a synchronous standby, and the service handles patching, minor version upgrades, and base backups. The engines are stock open-source builds (with Microsoft's binaries for SQL Server), so application compatibility is essentially identical to running the same version on a VM.
AlloyDB is Google's PostgreSQL-compatible engine built on a disaggregated storage layer. It separates compute (a primary instance plus read pool nodes) from storage (a regional, log-structured store with automatic three-zone replication). Cloud SQL and AlloyDB for analytics treats AlloyDB as the upgrade path when PostgreSQL workloads outgrow Cloud SQL on either throughput or analytical concurrency.
The columnar engine is the differentiator. AlloyDB keeps a configurable in-memory column store that mirrors selected tables or partitions in a vectorised, compressed format. The query planner inspects each query and decides per-block whether the row store or the column store will be faster. You do not maintain a second copy by hand — the engine warms and refreshes the columnar cache automatically based on usage.
AlloyDB AI extensions add google_ml_integration and pgvector support, letting you call Vertex AI models (text embeddings, generative endpoints) from SQL functions and store the resulting embeddings in indexed vector columns. For Cloud SQL and AlloyDB for analytics scenarios involving semantic search or RAG over operational data, this is the path that avoids exporting rows to a separate vector database.
Read replicas exist on both engines but behave differently. Cloud SQL read replicas are independent Postgres or MySQL instances using physical streaming replication; you size each one and pay for it. AlloyDB read pools are clusters of nodes sharing the same regional storage layer, which means adding a node only adds compute, not another full copy of the data.
Architecture and Design Patterns
A common reference architecture for Cloud SQL and AlloyDB for analytics has four tiers. The application tier writes through pgBouncer or Cloud SQL Auth Proxy into a primary writer. A High Availability standby in a second zone takes synchronous writes for failover. Read replicas (or AlloyDB read pool nodes) absorb dashboards, BI tools, and ad hoc analyst queries. Datastream tails the write-ahead log and streams changes into a BigQuery dataset, where heavy aggregation, ML feature pipelines, and federated joins with other warehouses live.
A second pattern, the hybrid OLTP plus light analytics design, runs everything on AlloyDB and skips the warehouse for the first year of the product. The columnar engine handles dashboards directly off the primary or read pool, and a single Datastream stream lands raw history in BigQuery only for long-term retention. This pattern is cheaper for early-stage products but caps out when daily data volumes pass a few hundred gigabytes.
A third pattern uses BigQuery federated queries. From a BigQuery SQL session, an analyst can run SELECT * FROM EXTERNAL_QUERY('connection-id', 'SELECT ... FROM cloud_sql_table') without any pipeline. This is the right tool for occasional joins between a small dimension table in Cloud SQL and a massive fact table in BigQuery. It is the wrong tool for nightly ETL because each federated call hits the source database live.
Network design matters more than people expect. Both Cloud SQL and AlloyDB live on a Google-managed VPC peered to your VPC via Private Service Access (or Private Service Connect for newer setups). Datastream needs either a private connectivity profile through PSC or an allow-listed public IP. Forgetting to plan the VPC peering range is the single most common reason a Cloud SQL and AlloyDB for analytics project is delayed in production.
GCP Service Deep Dive
Cloud SQL for PostgreSQL, MySQL, and SQL Server
Cloud SQL supports three engine families. PostgreSQL versions 12 through 16 are available, MySQL 5.7 and 8.0, and SQL Server 2017, 2019, and 2022 in Express through Enterprise editions. Machine sizes scale from shared-core db-f1-micro (deprecated for production) up to 128-vCPU instances with multiple terabytes of RAM. Storage uses regional persistent disks with automatic increase enabled by default.
High Availability on Cloud SQL means a synchronous standby in another zone within the same region. Failover is automatic and typically takes 60 to 120 seconds. The standby is not readable — it exists purely for failover. If you want readable copies, you provision read replicas separately.
Read replicas use asynchronous physical replication. PostgreSQL replicas can be cross-region, which lets you serve analytical queries close to BI users in a different geography. MySQL replicas can be cascaded. Replicas can be promoted to standalone primaries, which is the standard pattern for major version upgrades or for splitting tenants.
Backups come in two flavours. Automated backups run daily and use incremental snapshots; you choose a retention window of one to 365 days. Point-in-time recovery (PITR) replays write-ahead logs from any moment within the retention window. For Cloud SQL and AlloyDB for analytics workloads with strict RPO targets, PITR with seven-day retention is the baseline.
AlloyDB Compute and Storage
AlloyDB clusters consist of a primary instance (one writer plus optional read pool nodes for that instance) and any number of secondary instances (cross-region read clusters). The storage layer is regional and uses three-zone replication automatically; you cannot turn it off, and you do not pay extra per zone.
The primary instance handles all writes. Read pool nodes serve read-only queries with stale-by-milliseconds consistency. Each node has its own buffer cache, but they share the same underlying storage. Adding a read pool node is a one-click operation that takes a few minutes and does not require copying data.
AlloyDB Columnar Engine
The columnar engine is enabled per cluster. You allocate a portion of memory (typically 30 to 50 percent of the buffer cache) to the column store. The engine then either auto-selects which columns to populate based on workload analysis, or you specify columns explicitly with google_columnar_engine.recommend_columns() and add_table().
When a query runs, the planner evaluates each scan node. If the predicate hits a column already loaded into the column store and the selectivity favours columnar (typically scans of more than a few thousand rows), the engine uses SIMD-vectorised execution against the compressed columnar blocks. Otherwise it falls back to the heap. There is no separate query language — the same SQL benefits transparently.
The columnar cache is rebuilt asynchronously when the underlying rows change. There is a small lag between an update and its reflection in the column store, which the planner accounts for by reading deltas from the heap. For Cloud SQL and AlloyDB for analytics use cases this lag is normally invisible because analytical queries tolerate seconds of staleness.
AlloyDB AI Extensions
The google_ml_integration extension lets a SQL function call a Vertex AI endpoint. A typical example: SELECT ml_predict_row('gemini-1.5-flash', json_build_object('prompt', review_text)) FROM customer_reviews. Embeddings work the same way with embedding('text-embedding-004', text_column).
Vector storage uses pgvector, with both IVFFlat and ScaNN-backed indexes. The ScaNN index, contributed by Google Research, is significantly faster for high-dimensional embeddings than stock pgvector indexes. Cloud SQL and AlloyDB for analytics workloads doing semantic search on operational data — product catalogues, support tickets, internal documents — usually land on AlloyDB because the embedding generation, storage, and query all happen in one engine.
Datastream for Change Data Capture
Datastream is a serverless CDC service that reads the write-ahead log of PostgreSQL, the binlog of MySQL, or the redo log of Oracle and SQL Server, then writes change events to either Cloud Storage or BigQuery directly. The BigQuery destination is the most common target for Cloud SQL and AlloyDB for analytics pipelines because it requires no Dataflow job to land data.
When you target BigQuery directly, Datastream creates one BigQuery table per source table and applies inserts, updates, and deletes through the BigQuery Storage Write API with MERGE semantics. Latency from source commit to BigQuery query availability is usually under one minute, often closer to fifteen seconds.
The Cloud SQL source needs logical decoding enabled (cloudsql.logical_decoding=on for Postgres, binlog set to ROW format for MySQL). AlloyDB enables logical replication by default. A replication slot consumes WAL on the source until Datastream catches up, so monitoring slot lag is essential.
BigQuery Federated Queries
BigQuery federated queries reach into Cloud SQL and AlloyDB through EXTERNAL_QUERY. You create a connection in BigQuery's BI Engine resource, point it at the Cloud SQL or AlloyDB instance, and grant the connection's service account access to the database. From there, any BigQuery SQL session can push a SQL fragment down to the source.
Federated queries are not free of limits. They run synchronously, time out at six hours, and cannot be parallelised across the source. They are a pull mechanism, so they hit the operational database every time. The right use is small lookup joins or freshness-critical reads of recent rows, not bulk extraction.
Common Pitfalls and Trade-offs
Cloud SQL read replicas are sometimes treated as a free analytics tier. In practice, large analytical scans on a Postgres replica still consume the same shared buffer cache as the primary's recent writes, and replication lag rises whenever a long-running analytical query holds a snapshot. Teams discover this when their read replica falls hours behind during month-end reporting and the main app starts seeing stale dashboards.
Sizing the AlloyDB columnar cache too small is a common Cloud SQL and AlloyDB for analytics mistake. If only 10 percent of buffer memory is allocated and the working set of analytical columns exceeds it, the engine evicts and reloads constantly, and queries silently fall back to the heap. The columnar engine surfaces hit ratios in google_columnar_engine.columns() — check it.
Datastream into BigQuery is not a backup. The destination tables reflect the current state of source rows after MERGE. If an application bug deletes ten thousand rows, those deletes propagate to BigQuery within seconds. PITR on the source is what protects you, not the BigQuery copy.
BigQuery federated queries timeout at six hours, but the more practical limit is the source database. A federated query that asks Cloud SQL to scan a hundred-million-row table will lock memory and CPU for as long as the scan runs, blocking the primary writer's operational queries. Pushdown optimisation helps but does not save you from a poorly written query.
Major version upgrades on Cloud SQL are in-place but require downtime — sometimes more than thirty minutes. The standard mitigation is to create a read replica on the new version, replicate, and promote, but this only works between specific version pairs. AlloyDB sidesteps the issue because it handles minor versions transparently and has no major-version concept yet.
Cross-region read replicas in Cloud SQL for PostgreSQL provide read-only copies but do not provide automatic failover. Disaster recovery to another region requires you to promote the replica manually, which loses any unreplicated transactions. For an automated regional failover story, use AlloyDB cross-region secondary clusters or design an application-level failover. See Cloud SQL replication overview.
Enabling logical decoding on a busy Cloud SQL Postgres primary without a Datastream consumer attached will cause the WAL to grow unbounded until the disk fills. Always create the Datastream stream and confirm it is consuming from the slot before increasing write traffic. See Configure a source PostgreSQL database.
Best Practices
A small set of habits keeps Cloud SQL and AlloyDB for analytics deployments healthy as they grow.
- Always enable Point-in-Time Recovery on production instances; the cost is minimal and the alternative — losing hours of data — is unacceptable.
- For PostgreSQL workloads heading toward analytical use, start on AlloyDB rather than migrating later; the columnar engine cannot be added to Cloud SQL.
- Size read replicas (Cloud SQL) or read pool nodes (AlloyDB) for peak analytical concurrency, not average load; spillover to the primary is a worse outcome than a slightly oversized replica.
- Use Private Service Connect rather than Private Service Access for new deployments; it scales better across multiple consumer VPCs and avoids IP range exhaustion.
- Pin Datastream's BigQuery destination dataset in the same region as the source database to avoid network egress charges and to keep latency predictable.
- Monitor
pg_stat_replicationlag on Cloud SQL andreplication_lag_secondson AlloyDB read pools as first-class SLOs, not afterthoughts. - For semantic search workloads, prefer the ScaNN index over the default IVFFlat for vectors above 256 dimensions; recall and latency are both better.
Use cloudsql_iam_authentication (Postgres) or IAM database authentication (MySQL) instead of password-based logins for service accounts that connect from Cloud Run or GKE. It removes the secret rotation problem and integrates cleanly with audit logging. See IAM database authentication for Cloud SQL.
Real-World Use Case
A mid-sized fintech serving 2 million active users runs its core ledger on AlloyDB in us-central1. The ledger schema has roughly 200 tables; the largest is a transaction-events table that grows by 80 GB per month. Analysts demand near-real-time dashboards for fraud monitoring, while the product team needs sub-50 ms latency on user-facing balance lookups.
The team configured AlloyDB with a primary instance (16 vCPU, 128 GB RAM) and a three-node read pool sized for analyst peak. The columnar engine takes 40 percent of the buffer cache and is populated with the high-cardinality columns of the transactions table — merchant_id, mcc_code, amount, created_at. Fraud dashboards using GROUP BY merchant_id, mcc_code complete in under two seconds against billions of rows because the columnar engine handles the scan with SIMD vectorisation.
For long-term history beyond 90 days and for joins with marketing data warehoused in BigQuery, a Datastream stream replicates every change from AlloyDB into a BigQuery dataset. Latency from a transaction commit to its appearance in BigQuery averages 22 seconds. The data engineering team uses BigQuery scheduled queries to maintain rolling aggregations and ML feature tables, which feed a fraud-scoring model trained in Vertex AI.
A small set of internal tools needs to join a customer dimension stored in a separate Cloud SQL Postgres instance with the BigQuery transaction history. Rather than build a pipeline, they use BigQuery federated queries against the Cloud SQL connection — about thirty queries per day, each touching at most a few thousand customer rows. Cloud SQL and AlloyDB for analytics in this architecture is not one product but three working together: AlloyDB for live OLTP plus operational analytics, Cloud SQL for a smaller customer dimension, and BigQuery for warehoused history.
Exam Tips
The PDE exam phrases Cloud SQL and AlloyDB for analytics questions around three recurring trade-offs. First, when the workload mixes OLTP with non-trivial analytics on the same data, the right answer is almost always AlloyDB with the columnar engine — not Cloud SQL with a beefy read replica, and not BigQuery alone. The columnar engine is the differentiator the exam wants you to remember.
Second, when the question asks how to land relational change data into BigQuery with the lowest operational overhead, choose Datastream's direct BigQuery destination rather than building a Dataflow template or a Cloud Functions pipeline. The exam favours managed-to-managed options.
Third, federated queries from BigQuery into Cloud SQL or AlloyDB are correct when the question emphasises ad hoc joins, small result sets, and no scheduled ETL. They are incorrect when the question describes nightly bulk movement of large tables — that pattern wants Datastream or scheduled BigQuery loads.
Watch for distractors mentioning Cloud SQL for analytics with read replicas. They are sometimes correct for moderate workloads, but if the question mentions vector search, columnar acceleration, or PostgreSQL beyond what Cloud SQL supports, AlloyDB is the answer. Memorising Cloud SQL and AlloyDB for analytics decision criteria pays off across at least three or four exam questions.
For the PDE exam: AlloyDB columnar engine is in-memory, automatic, and accelerates analytical scans on PostgreSQL-compatible workloads without requiring a separate warehouse for moderate volumes. Datastream lands CDC into BigQuery in under a minute. BigQuery federated queries via EXTERNAL_QUERY suit ad hoc joins, not scheduled ETL. See AlloyDB columnar engine.
An in-memory, columnar representation of selected table columns maintained automatically by AlloyDB alongside the standard row store. The query planner picks per-scan whether to use the column store (for analytical scans) or the heap (for point lookups), letting one engine serve both OLTP and operational analytics. See AlloyDB columnar engine overview.
When evaluating Cloud SQL vs AlloyDB vs BigQuery for a workload, anchor on three numbers: working-set size, query concurrency, and freshness target. Cloud SQL fits working sets under a few hundred GB with low analytical concurrency. AlloyDB stretches to multi-TB with mixed workloads. BigQuery is the answer above tens of TB or whenever scan cost dominates over latency. See Choosing a database service on GCP.
Frequently Asked Questions (FAQ)
When should I choose AlloyDB over Cloud SQL for PostgreSQL?
Pick AlloyDB when the workload is PostgreSQL-compatible and either needs more throughput than Cloud SQL's largest machines deliver, requires the columnar engine for analytical queries on the same database, or uses pgvector for semantic search. Cloud SQL for PostgreSQL stays the default for smaller workloads, simpler upgrade tooling, and the broadest extension support. Cloud SQL is also the only option if you need PostgreSQL features that AlloyDB has not yet implemented, such as certain logical replication subscribers.
How does Datastream's direct BigQuery destination differ from using Dataflow?
The direct BigQuery destination is a fully managed write path: Datastream uses the BigQuery Storage Write API with MERGE semantics to keep destination tables in sync with source rows, and you do not write any pipeline code. A Dataflow-based path gives you in-flight transformations (filtering, masking, joining), custom destination layouts, and the ability to fan out to other sinks. For Cloud SQL and AlloyDB for analytics scenarios where you want raw mirroring with the lowest operational burden, the direct destination is the right pick; for transformations, Dataflow is still the answer.
Can I run analytical queries directly against a Cloud SQL read replica?
Yes, but with caution. Read replicas are full Postgres or MySQL instances and accept any read-only query, so a BI tool can connect directly. The risk is that long-running analytical queries hold snapshots that prevent the replica from advancing replication, causing lag to grow. For sustained analytical load on Postgres, AlloyDB read pools are a better fit because adding nodes is cheaper and the columnar engine accelerates scans.
What latency should I expect from Datastream into BigQuery?
End-to-end latency from a source commit to query availability in BigQuery is normally between 15 and 60 seconds for steady-state streams. Initial backfills of historical data take longer and depend on source database size and network throughput. Slot lag on the source database is the leading indicator to watch — if it grows for more than a few minutes, the stream is falling behind.
Are BigQuery federated queries to Cloud SQL or AlloyDB billed differently?
Federated queries themselves are billed as standard BigQuery query jobs based on bytes processed by the BigQuery side, but the source database also pays its own CPU and memory cost for the underlying scan. There is no separate per-query fee, but a federated call against a large table can spike the source database's load profile. Plan capacity on both sides, not just on BigQuery.
Does the AlloyDB columnar engine require schema changes?
No. The columnar engine works on existing tables and is transparent to queries. You enable it at the cluster level, allocate memory, and either let the recommender choose columns or call google_columnar_engine.add_table() explicitly. Application code does not change, and queries that do not benefit from columnar execution continue to use the row store with no penalty.