Datastream CDC Ingestion Architecture

Introduction to Datastream CDC Ingestion Architecture

A Datastream CDC ingestion architecture turns a transactional database into a continuous event source for analytics. Instead of nightly extracts, you tail the database's own change log and replay each insert, update, and delete into BigQuery or Cloud Storage within seconds. For the PDE exam, this pattern shows up whenever a question mentions "low latency," "without impacting the source," or "operational data into the warehouse."

Datastream is serverless. There is no cluster to size, no Debezium connector to patch, and no JVM heap to tune. You define a source connection profile, a destination connection profile, and a stream that lists the schemas and tables to replicate. Google Cloud handles the log reader, the buffering, the retries, and the delivery. The trade-off is that you give up some transformation flexibility in exchange for a managed pipeline that survives source restarts, network blips, and schema evolution without operator intervention.

The rest of this note walks through the architecture piece by piece: the supported source connectors, how connection profiles separate credentials from streams, what the BigQuery and Cloud Storage destinations actually look like on disk, how schema drift is propagated, and how to combine Datastream with Dataflow templates when you need real transformations between the database and the warehouse.

白話文解釋（Plain English Explanation）

Think of Datastream as a court stenographer

A court stenographer does not interrupt the trial, redirect the witnesses, or argue with the judge. They sit quietly and transcribe every spoken word in real time. Anyone reading the transcript later sees exactly what happened, in order, with timestamps. Datastream plays the same role next to your operational database. The database keeps serving production traffic. Datastream reads the redo log, the binlog, or the WAL and writes a faithful, ordered transcript into BigQuery. If lawyers want to replay yesterday's testimony, the transcript is already there.

Think of CDC like a hotel housekeeping log

Hotels keep a housekeeping log: room 412 checked in at 14:02, room 207 reported a broken lamp at 15:18, room 412 ordered room service at 19:30. Management does not need to walk every floor to know the state of the building; they read the log. A Datastream CDC ingestion architecture treats your tables the same way. The "log" is the database's own write-ahead log, and BigQuery becomes the management dashboard. The room (the row) only changes when something actually happens, which is far cheaper than re-inventorying the entire hotel every night.

Think of the BigQuery destination as a self-cleaning whiteboard

Imagine a whiteboard where every change to a customer record is rewritten on top of the previous version, with the most recent timestamp winning. You never have to wipe it manually; the board reconciles itself in the background using a primary key. That is what the Datastream BigQuery destination does with its append-and-merge model: change events land in the table, and BigQuery's storage engine compacts them into the current row state without you authoring a MERGE statement.

Core Concepts of Datastream CDC Ingestion Architecture

A Datastream pipeline is built from four primary objects. Understanding how they compose is the fastest path to recognising the right answer on a scenario question.

Connection profile

A connection profile is a saved set of credentials and network settings for one endpoint, either a source or a destination. You create one per Oracle instance, one per BigQuery project, one per Cloud Storage bucket. Profiles are reusable: the same source profile can feed several streams, and rotating a password only requires editing one object.

Stream

A stream binds a source profile to a destination profile and declares the include and exclude lists for schemas, tables, and columns. The stream is also where you choose CDC-only, backfill-only, or backfill-then-CDC behaviour, and where you set the destination layout (dataset name, file format, partitioning hints).

Private connectivity configuration

For sources behind a firewall, you create a private connectivity configuration. This object provisions a VPC peering between Google's managed Datastream tenant project and your VPC, so that traffic never crosses the public internet. A single private connectivity configuration can be referenced by many connection profiles in the same region.

Route

Inside a private connectivity configuration, a route declares which CIDR range your source database lives in. Datastream installs the route into its tenant VPC so that egress traffic destined for your database is sent through the peering rather than the public IP path.

A pattern that derives a stream of row-level events (insert, update, delete) by reading a database's transaction log instead of querying tables. Because the log is what the database already writes for crash recovery, CDC adds negligible load on the source. https://cloud.google.com/datastream/docs/overview

Source Connectors in Detail

Datastream supports a fixed list of source databases, and each one reads a different log format. The mechanics are worth memorising because exam questions often hinge on a permission or a configuration flag specific to one engine.

Oracle

Datastream reads Oracle redo logs through LogMiner. You enable ARCHIVELOG mode, switch on supplemental logging at the database and table level, and grant the Datastream user SELECT on the LogMiner views (V$LOGMNR_CONTENTS, V$ARCHIVED_LOG, and friends). Datastream supports Oracle 11g, 12c, 18c, 19c, and 21c, including Amazon RDS for Oracle and on-premises deployments. The most common setup mistake is forgetting ALTER TABLE ... ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS on tables that lack a primary key, which leads to incomplete UPDATE events.

MySQL

For MySQL, Datastream consumes the binary log. You set binlog_format=ROW, binlog_row_image=FULL, and a retention long enough to survive a brief outage (usually 7 days). Cloud SQL for MySQL exposes these as flags in the instance configuration. Self-managed MySQL on Compute Engine works the same way. Datastream supports MySQL 5.6, 5.7, 8.0, and 8.4. AWS Aurora MySQL is also supported because Aurora speaks the standard binlog protocol.

PostgreSQL

PostgreSQL CDC uses logical replication. You set wal_level=logical, raise max_replication_slots and max_wal_senders, install the pglogical extension on supported versions, and create a publication plus a replication slot for Datastream. Cloud SQL for PostgreSQL and AlloyDB both expose these flags. The trap to watch for: a replication slot that no client consumes will pin WAL on disk forever, eventually filling the boot volume. Always drop the slot if you delete the stream.

SQL Server

SQL Server CDC uses the engine's built-in change tables, populated by a SQL Agent job. You enable CDC at the database level, then enable CDC per table, granting the Datastream user SELECT on the change tables. Datastream supports SQL Server 2012 through 2022, including Cloud SQL for SQL Server, Amazon RDS, and on-premises. Note that the SQL Server Express edition does not support CDC at all because it lacks SQL Agent.

AlloyDB for PostgreSQL

AlloyDB inherits the PostgreSQL logical replication path. The integration is tight enough that AlloyDB exposes a one-click "Enable replication for Datastream" option that flips wal_level, sets the slot count, and creates the replication user. Throughput is meaningfully higher than vanilla PostgreSQL because AlloyDB's columnar engine and storage layer absorb the read pressure differently.

Datastream never installs an agent on the source database. It connects as a normal database client and reads the log through native protocols. This is why the source-side requirement list is mostly about flags, permissions, and log retention, not software installation. https://cloud.google.com/datastream/docs/sources

Architecture and Data Flow

A typical Datastream CDC ingestion architecture has four moving parts arranged in a straight line: source database, private connectivity, Datastream service, destination. The shape rarely varies, but the components inside each box change with the source engine and the destination type.

When the stream starts, Datastream first runs the optional backfill. Backfill performs a parallel snapshot of each included table, partitioning the read so that wide tables do not serialise on a single connection. While the snapshot runs, Datastream is also tailing the log so that any changes that occur during the backfill are captured. When the snapshot finishes, Datastream merges the snapshot rows with the captured changes using row-level versioning, so the destination converges to a consistent picture of the source as of the snapshot end time.

After backfill, the stream enters steady-state CDC. Datastream maintains a position pointer in the source log: SCN for Oracle, GTID or file/position for MySQL, LSN for PostgreSQL and SQL Server. Events are read, transformed into a unified internal schema, batched, and pushed to the destination. The pointer only advances after the destination acknowledges the batch, which is what makes the pipeline at-least-once delivery with idempotent merging on the BigQuery side.

Throughput tiers and parallelism

Datastream does not expose explicit "instance sizes." Instead it scales the number of concurrent log readers and writer threads automatically based on backlog. You influence throughput indirectly by:

Selecting fewer, larger tables instead of many tiny ones (per-table parallelism has overhead).
Splitting a single very large stream into several streams that target different schemas.
Increasing source-side log retention so that a temporary slowdown does not force a backfill restart.
Running the destination in the same region as the stream and the source.

For BigQuery destinations, the per-stream sustained throughput sits in the hundreds of MB/s range with the standard configuration. Cloud Storage destinations can go higher because object writes parallelise more cleanly than streaming inserts.

Datastream is log-based CDC across four source families — Oracle (LogMiner on redo logs, 11g–21c), MySQL (binlog with binlog_format=ROW, 5.6–8.4), PostgreSQL (logical replication via wal_level=logical plus a replication slot), and SQL Server (built-in CDC change tables, 2012–2022, Express unsupported) — landing in exactly two destinations, BigQuery (managed upsert via max_staleness, default 15 minutes) or Cloud Storage (Avro or JSON files). Private connectivity is VPC peering with a /29 CIDR carved from your VPC, and per-stream throughput sits in the hundreds of MB/s for the BigQuery destination. https://cloud.google.com/datastream/docs/sources

If a single stream cannot keep up, split by schema or by table set rather than by raising any throughput knob. Two streams running in parallel scale linearly because Datastream allocates an independent reader pool to each. https://cloud.google.com/datastream/docs/best-practices

BigQuery Destination Deep Dive

The BigQuery destination is the path that most PDE scenarios assume by default. Datastream creates one BigQuery table per source table, in a dataset named after the source schema (or a name you specify). Each row carries the source primary key plus a set of change metadata columns: _metadata_timestamp, _metadata_change_type (INSERT, UPDATE-INSERT, UPDATE-DELETE, DELETE), and _metadata_source.

Internally, Datastream uses BigQuery's CDC feature, which means it writes an upsert-style change log into the table and lets BigQuery's storage engine merge the changes asynchronously. Querying the table always returns the current row state, not the change log. The merge runs continuously in the background and is billed under BigQuery storage, not as a separate compute job. This is what people mean when they call the integration "zero-ETL": there is no MERGE statement to author, no staging table to manage, and no scheduled job to run.

You configure a max_staleness per table. A staleness of 0 means every query forces a synchronous compaction (slower but always fresh); a staleness of 15 minutes lets BigQuery batch the merges and is much cheaper for high-write tables. The default is 15 minutes, which is appropriate for most analytics workloads.

The Datastream service account needs bigquery.dataEditor on the destination dataset. If you create the dataset ahead of time and forget the IAM grant, the stream will sit in the STARTING state without producing data, and the error message buries the IAM hint several layers deep in the logs. https://cloud.google.com/datastream/docs/destination-bigquery

Cloud Storage Destination Deep Dive

The Cloud Storage destination writes change events as files into a bucket, partitioned by source table and by event time. The two supported file formats are Avro and JSON; Avro is the default because it carries the schema in the file header and compresses better. You choose the partitioning interval (1 minute, 15 minutes, 1 hour) and the file rotation size.

Each file contains a sequence of change events with the same metadata columns as the BigQuery destination. Because the events are appended rather than merged, the consumer is responsible for reconstructing the current state of each row, typically by partitioning on primary key and keeping the row with the highest _metadata_timestamp. This is more work than the BigQuery destination, but it gives you total control over how the data lands.

The Cloud Storage path is the right choice when:

You need to apply non-trivial transformations before loading.
You want to fan out the same change stream to multiple consumers (BigQuery, Spanner, an ML feature store).
You need a long-term raw archive of every change for audit or replay.
You are building a data lake on Cloud Storage with Iceberg, Hudi, or Delta tables.

The companion piece for Cloud Storage destinations is one of the Dataflow templates that Google ships specifically for Datastream output, covered later in this note.

Schema Drift Handling

Schemas change. A new column appears on the source, a column gets renamed, a column changes type from VARCHAR(50) to VARCHAR(100). A robust Datastream CDC ingestion architecture has to deal with all three without a human intervening at 2 AM.

For BigQuery destinations, Datastream automatically issues ALTER TABLE ADD COLUMN for new columns. The new column appears as NULLABLE and is back-populated as new change events arrive (existing rows stay NULL). Dropped columns are tolerated: Datastream simply stops writing to them, but the BigQuery column remains in place so historical queries do not break.

Type widenings (e.g., INT to BIGINT, VARCHAR(50) to VARCHAR(100)) are handled because Datastream uses lenient destination types from the start. A VARCHAR(50) column maps to BigQuery STRING, which has no length limit, so widening on the source has no impact on BigQuery. Numeric widenings work the same way through NUMERIC and BIGNUMERIC.

What Datastream cannot handle automatically:

A column rename, because the log only records the new name; Datastream sees this as drop-old plus add-new.
A type change that loses information, like BIGINT to INT.
A primary key change, which forces a full backfill of the table.

For Cloud Storage destinations, schema drift surfaces in the Avro file headers. New files start including the new column; older files retain the old schema. Downstream consumers must handle schema-on-read, which Avro is specifically designed for.

Test schema migrations against a non-production stream first. The drift is automatic but not instantaneous; a renamed column will produce a temporary mismatch where the BigQuery table has both the old and the new column populated for the rows that straddle the change. https://cloud.google.com/datastream/docs/schema-changes

Private Connectivity Patterns

Production Datastream workloads almost always need private connectivity. Public IP allowlisting works for proofs of concept but is rarely acceptable for a production database. Datastream offers three private patterns; the choice depends on where the source actually lives.

VPC peering for sources in Google Cloud

When the source is Cloud SQL, AlloyDB, or a self-managed database on Compute Engine, you create a private connectivity configuration that peers the Datastream tenant VPC with your VPC. Traffic stays inside Google's network. The setup requires a /29 CIDR carved out of your VPC for the Datastream-side interface, and a route declaring the source's CIDR. This is the lowest-latency option available.

VPC peering plus Cloud Interconnect or VPN for on-premises sources

For on-premises Oracle or SQL Server, you bridge the VPC peering with a Cloud Interconnect or HA VPN tunnel. Datastream peers with your VPC; your VPC reaches on-premises through Interconnect or VPN. The packet path is Datastream tenant VPC → your VPC → Interconnect → data centre. Latency depends mostly on the Interconnect leg, which is why dedicated Interconnect (10 Gbps or 100 Gbps) is preferred for high-throughput streams.

Forward SSH tunnel through a bastion

The forward SSH tunnel is the lightest-weight private option. You run a bastion VM with a public IP in a DMZ subnet, expose port 22, and Datastream uses the bastion to reach the database. Authentication uses an SSH key managed in Secret Manager. This pattern is convenient for dev environments and small migrations, but the bastion becomes a throughput bottleneck above a few tens of MB/s and a single point of failure unless you run a load-balanced pair.

Pick VPC peering for any production workload that processes more than a gigabyte per hour. The forward SSH tunnel is fine for getting a proof of concept running in an afternoon, but it does not scale and it adds an extra hop you have to monitor. https://cloud.google.com/datastream/docs/private-connectivity

Common Pitfalls and Trade-offs

The Datastream documentation makes the happy path look simple, and most of the time it is. The rough edges show up in operations.

Log retention too short. If the stream pauses (because the destination is briefly unavailable, or a person paused it for maintenance) and the source rotates past the last position Datastream read, the stream cannot resume. The only recovery is a full backfill, which on a multi-terabyte table can take days. Set source log retention to at least 7 days for production.

Replication slots left behind. Deleting a PostgreSQL stream does not drop the replication slot on the source. The slot keeps pinning WAL until disk fills. Always drop the slot manually after deletion.

LogMiner load on Oracle. Oracle LogMiner runs inside the source database and consumes CPU. On a busy OLTP system, LogMiner can add 5 to 15 percent CPU overhead. Right-size the database before turning on Datastream, and prefer dedicated standby databases as the source when possible.

BigQuery streaming quotas. The Datastream BigQuery destination uses the Storage Write API, which has per-project and per-region quotas. A stream that suddenly bursts to thousands of writes per second on a brand-new project may hit the default quota and throttle. Request a quota increase before the production cutover.

Cost of max_staleness=0. Setting zero staleness on a high-write table forces a synchronous merge on every query, which can dominate query cost. Default to 15 minutes unless a specific query truly needs second-level freshness.

Initial backfill load on the source. A backfill is a SELECT-everything operation. On a 10 TB table, it will saturate the source's read I/O for hours. Schedule backfills off-peak, or use a read replica as the backfill source if the engine supports it.

Best Practices

Create one connection profile per logical endpoint and reuse it across streams. Rotating credentials in one place beats rotating them in five.
Always enable backfill plus CDC for new streams unless the destination already has a known-consistent snapshot.
Use BigQuery destinations with max_staleness=15m as the default; tune down only for tables that need it.
Provision private connectivity from day one. Migrating from public IP allowlisting to VPC peering later requires a stream restart.
Set source log retention to at least 7 days, and alert when retention drops below 24 hours of headroom.
Monitor the system_latency and freshness metrics in Cloud Monitoring; alert at 5x the normal baseline rather than at a fixed value.
For Cloud Storage destinations, prefer Avro over JSON unless a specific consumer requires JSON; the size and parse-cost difference is significant at scale.
When a stream falls behind, split it by schema or table set rather than waiting for a hypothetical Google-side scale-up.

Real-World Use Case

A mid-sized e-commerce company runs its order management system on a self-hosted Oracle 19c cluster in their own data centre. Analytics happens in BigQuery, fed for years by a nightly Sqoop job that copied the entire orders and order_items tables. The job took six hours and produced day-old dashboards.

The team replaced the Sqoop job with a Datastream CDC ingestion architecture. They provisioned an HA VPN tunnel between their data centre and a Google Cloud VPC, set up a Datastream private connectivity configuration peered with that VPC, and added a route for the Oracle subnet. They enabled supplemental logging on the relevant tables, granted the LogMiner permissions to a new Datastream user, and created a stream with backfill-then-CDC against the production database during a weekend window.

The backfill of the 4 TB transactional schema took eighteen hours, running at roughly 60 MB/s sustained. After backfill, steady-state CDC settled at a system latency of 3 to 7 seconds. The Looker dashboards that previously refreshed at 7 AM each morning now show the current state of orders within ten seconds of the customer placing them. The Sqoop infrastructure was decommissioned, freeing up a Hadoop cluster they no longer needed.

The follow-on project used the same Datastream stream to land a parallel copy in Cloud Storage as Avro, which a Dataflow job then enriches with shipping data and writes to a separate analytics dataset. The same change events power both the operational dashboards and the curated analytics tables, with a single source of truth in the Datastream output.

Integration with Dataflow Templates

Google ships Dataflow templates specifically built to consume Datastream output. The two most common are the Datastream to BigQuery template and the Datastream to Spanner template. These templates exist because some scenarios need the GCS landing path (for transformations or fan-out) but still want the final destination to be BigQuery or Spanner with managed merging.

The Datastream to BigQuery template reads Avro files from the Cloud Storage destination, applies optional UDF transformations (JavaScript), and merges the change events into BigQuery tables using a primary-key-based MERGE. It is the right answer when you need to mask a column, join in reference data, or split one source table into multiple destination tables before the data lands in BigQuery.

The Datastream to Spanner template performs the same job for Cloud Spanner targets, which is the standard pattern for heterogeneous database migrations: Oracle source, Datastream capturing changes to GCS, Dataflow template applying them to Spanner, with the Spanner schema potentially restructured for Spanner's interleaved tables and primary key model.

For custom transformations beyond what the templates handle, you write a Dataflow job that reads from the Cloud Storage destination using the Datastream IO connector. This is the escape hatch for anything genuinely complex: time-window aggregations, joins against streaming reference data, ML inference on every change event, or routing to multiple heterogeneous destinations.

Exam Tips

"Low-latency replication of an operational database into BigQuery, no agents on the source, no infrastructure to manage" almost always points to a Datastream BigQuery destination.
"Capture changes from MySQL/PostgreSQL/Oracle for downstream transformation in Dataflow" points to a Datastream Cloud Storage destination plus the Datastream to BigQuery (or Spanner) template.
"Migrate Oracle to Cloud SQL with continuous replication during cutover" points to Database Migration Service, which uses Datastream under the hood but exposes a migration-specific workflow.
"Source database is on-premises and we cannot expose it to the public internet" points to Datastream private connectivity over Cloud Interconnect or HA VPN.
"We need to replay every historical change for audit" points to a Cloud Storage destination with infinite retention; the BigQuery destination only stores the current state.
A stream that falls behind is almost never solved by "increase the throughput tier"; the standard answers are split the stream, increase source log retention, or check the destination IAM and quotas.
The Datastream service account needs source-side database permissions plus destination-side Google Cloud IAM. Permission errors on either side stall the stream silently.

Frequently Asked Questions (FAQ)

Does Datastream support SQL Server?

Yes. SQL Server has been a generally available source since 2023. Datastream uses the engine's built-in CDC change tables (populated by SQL Agent), which is why SQL Server Express is unsupported but Standard, Enterprise, and Cloud SQL for SQL Server all work. You enable CDC at the database level, then per table, and grant the Datastream user SELECT on the change tables.

How is Datastream pricing calculated?

You pay per gigabyte processed, with a higher per-GB rate for backfill traffic than for steady-state CDC. There is no charge for connection profiles, private connectivity configurations, or the stream object itself. Destination storage and any Dataflow processing are billed separately under their own services. For a typical OLTP workload generating 100 GB of changes per day, Datastream itself is usually the smaller line item; BigQuery storage and queries dominate the total bill.

Can I transform data inside Datastream?

Only minimally. You can include or exclude schemas, tables, and columns, and you can rename the destination dataset. Anything richer (column masking, joins, type casting beyond the defaults) requires a Cloud Storage destination plus a Dataflow job. This is by design: keeping the Datastream service narrow makes it more reliable and easier to operate.

What is realistic end-to-end latency?

For a healthy stream with private connectivity and a destination in the same region as the source, end-to-end latency from source commit to BigQuery query visibility is typically 3 to 15 seconds. The dominant factors are source log generation rate and max_staleness configuration on the BigQuery destination. Public-internet sources add network variance; cross-region destinations add hundreds of milliseconds.

Does Datastream guarantee exactly-once delivery?

Datastream provides at-least-once delivery to the destination. The BigQuery destination deduplicates using the primary key and the change event timestamp, so the practical effect on a BigQuery table is exactly-once. The Cloud Storage destination may write the same event to two different files in rare retry scenarios; consumers must deduplicate by primary key plus _metadata_timestamp.

What happens if the source database is unavailable for several hours?

The stream pauses and Datastream remembers its log position. When the source comes back, the stream resumes from where it left off, provided the source still has the relevant log entries on disk. This is why source log retention matters: a four-hour outage is fine if you have seven days of binlog or archive log retention, and a disaster if you have only one hour.

BigQuery Data Modeling and Clustering covers how to design the destination tables that Datastream populates, including partitioning strategies that keep CDC queries cheap.
Dataflow Architecture Selection covers when to reach for a Dataflow template versus a custom pipeline on top of the Datastream Cloud Storage destination.
Batch vs Streaming Design covers the higher-level decision of when CDC streaming is the right pattern compared with periodic batch extracts.

Introduction to Datastream CDC Ingestion Architecture

白話文解釋（Plain English Explanation）

Think of Datastream as a court stenographer

Think of CDC like a hotel housekeeping log

Think of the BigQuery destination as a self-cleaning whiteboard

Core Concepts of Datastream CDC Ingestion Architecture

Connection profile

Stream

Private connectivity configuration

Route

Source Connectors in Detail

Oracle

MySQL

PostgreSQL

SQL Server

AlloyDB for PostgreSQL

Architecture and Data Flow

Throughput tiers and parallelism

BigQuery Destination Deep Dive

Cloud Storage Destination Deep Dive

Schema Drift Handling

Private Connectivity Patterns

VPC peering for sources in Google Cloud

VPC peering plus Cloud Interconnect or VPN for on-premises sources

Forward SSH tunnel through a bastion

Common Pitfalls and Trade-offs

Best Practices

Real-World Use Case

Integration with Dataflow Templates

Exam Tips

Frequently Asked Questions (FAQ)

Does Datastream support SQL Server?

How is Datastream pricing calculated?

Can I transform data inside Datastream?

What is realistic end-to-end latency?

Does Datastream guarantee exactly-once delivery?

What happens if the source database is unavailable for several hours?

Related Topics

Further Reading

Official sources

More PDE topics