Data Migration and Validation — GCP PCA Study Notes

Q: Q1. When should I use gsutil vs. Storage Transfer Service?

Use gsutil (or gcloud storage ) for ad-hoc, smaller transfers from your local machine. Use STS for large-scale, scheduled, or cross-cloud transfers where you want Google to manage the scaling and retries.

Q: Q2. How do I handle data that changes during a long migration?

Use Change Data Capture (CDC) . The Database Migration Service (DMS) does this for SQL databases by reading the transaction logs and "replaying" them on the destination. For analytics destinations (BigQuery), use Datastream .

Q: Q5. Why is my migration slow even with a 10Gbps Interconnect?

Common bottlenecks include: Small Files: Moving millions of 1KB files is much slower than moving one 1TB file due to metadata overhead. Disk I/O: Your source storage system might not be able to read data fast enough. MTU Mismatch: Network packet fragmentation can severely degrade performance.

Q: Q6. Do I need DVT if I already have STS checksums?

Yes. STS checksums prove object integrity (the bytes arrived). DVT proves semantic integrity (the rows, sums, and schemas match). They answer different questions; both are required.

Introduction to Data Migration

Data migration is often the most challenging part of a cloud transition. For a Professional Cloud Architect, the goal is not just moving bits from A to B, but doing so securely, efficiently, and with verified integrity.

A successful data migration requires selecting the right tool based on data volume, available bandwidth, downtime tolerance, and source data shape (object store, RDBMS, data warehouse, or SaaS API).

白話文解釋（Plain English Explanation）

The Migration Decision Matrix

Architects must evaluate the following variables:

Data Volume: How many Terabytes or Petabytes?
Bandwidth: What is the upload speed (VPN, Interconnect, or Public Internet)?
Time Window: How much downtime is acceptable?
Data Type: Structured (SQL), Semi-structured (JSON), or Unstructured (Files/Images)?

Tool	Use Case	Offline/Online
`gsutil` / `gcloud storage`	Small to medium datasets (<10TB)	Online
Storage Transfer Service	Large datasets from other clouds or S3	Online
Transfer Appliance	Very large datasets (>20TB) with low bandwidth	Offline
BigQuery Data Transfer	SaaS data (Ads, YouTube) or other Data Warehouses	Online
Database Migration Service	Migrating SQL/PostgreSQL to Cloud SQL/AlloyDB	Online (CDC)
Datastream	CDC from Oracle/MySQL/PostgreSQL/SQL Server to BigQuery/GCS	Online (CDC)
BigQuery Migration Service	Translate Teradata/Redshift/Snowflake SQL to BigQuery SQL	Schema/SQL only

The right tool is rarely a single product. A typical Teradata-to-BigQuery migration uses BigQuery Migration Service for schema and SQL translation, Storage Transfer Service or Transfer Appliance for the bulk file dump, Datastream (or a custom Kafka pipeline) for ongoing CDC, and the Data Validation Tool (DVT) for reconciliation. Treat the migration as a pipeline of services, not a single tool selection.

Database Migration Service (DMS) for Cloud SQL and AlloyDB

Database Migration Service is the managed path for moving MySQL, PostgreSQL, and SQL Server to Cloud SQL or AlloyDB for PostgreSQL. It is the default answer on the PCA exam when the scenario says "minimal downtime database migration to a managed Google Cloud database."

Change Data Capture (CDC): A technique that captures every INSERT, UPDATE, and DELETE from a source database's transaction log (MySQL binlog, PostgreSQL WAL, Oracle redo log, SQL Server transaction log) and replays them on a destination system. CDC enables near-zero-downtime migrations because the destination stays caught up with the source until cutover. DMS uses CDC internally; Datastream exposes it as a standalone pipe.

How DMS works

Connection profile — define the source (on-prem, GCE, AWS RDS, Azure Database). Connectivity options include IP allowlist, reverse SSH tunnel through a GCE bastion, or VPC peering over Interconnect/VPN.
Migration job — choose one-time (dump + restore) or continuous (initial dump + CDC via binlog/WAL replay).
Initial load — DMS uses mysqldump / pg_dump semantics under the hood and streams to the destination.
CDC phase — DMS tails the source binlog (MySQL) or replication slot (PostgreSQL) and replays changes to the destination in near-real-time.
Promotion — when replication lag is near zero, you stop writes on the source, wait for the last events to drain, and promote the destination. Promotion makes the destination writable and breaks the replication link.

DMS gotchas

Source prerequisites: binlog_format=ROW for MySQL, wal_level=logical and a replication slot for PostgreSQL. Missing these is the #1 cause of failed migrations.
Heterogeneous migrations (e.g., Oracle to PostgreSQL on AlloyDB) require Database Migration Service for Oracle, which uses Datastream under the hood plus a conversion workspace.
Sequences and AUTO_INCREMENT must be reset on the destination after promotion or you will hit duplicate-key errors on the first write.
DMS does not migrate users, roles, grants, or DBA-level objects — you must script those separately.

A common exam trap is choosing DMS for a multi-TB migration over a slow VPN. DMS streams the initial dump over the wire; if the link is 100 Mbps and the dump is 5 TB, the initial load alone takes 5+ days, during which CDC lag grows. The correct architecture is Transfer Appliance for the bulk dump + DMS in CDC-only mode seeded from the appliance restore.

Datastream CDC for Ongoing Replication

Datastream is the serverless change data capture (CDC) and replication service. Where DMS is opinionated (it lands data in a managed Cloud SQL/AlloyDB instance), Datastream is a CDC pipe — it reads changes from Oracle, MySQL, PostgreSQL, or SQL Server and writes them to BigQuery, Cloud Storage (as Avro/JSON), or downstream Dataflow templates.

Typical Datastream patterns

Oracle to BigQuery near-real-time analytics: Datastream reads Oracle redo logs via LogMiner, writes change events to a staging GCS bucket, and a Dataflow template merges them into BigQuery tables. Replication lag is typically 1-15 minutes.
MySQL to BigQuery direct write: Datastream now supports direct writes to BigQuery, removing the Dataflow merge step for simple cases.
Hybrid CDC + bulk: Datastream does a backfill (snapshot) of historical data, then switches to CDC for ongoing changes — all in one stream definition.

Datastream vs DMS

Dimension	DMS	Datastream
Destination	Cloud SQL, AlloyDB	BigQuery, GCS, Dataflow
Use case	Database lift-and-shift	Analytics CDC, ETL feeding
Promotion	Yes (breaks replication)	No (continuous)
Heterogeneous source/target	Limited	Yes (Oracle to BigQuery, etc.)

On the PCA exam, the phrase "near-real-time analytics on operational data" points to Datastream → BigQuery. The phrase "migrate the database with minimal downtime" points to DMS → Cloud SQL/AlloyDB. The destination is the disambiguator: analytics warehouse vs operational database.

Storage Transfer Service for GCS

Storage Transfer Service (STS) is a fully managed service for moving object data into Cloud Storage. It has two flavours:

STS for cloud-to-cloud

Sources: AWS S3, Azure Blob Storage, HTTP/HTTPS lists, another GCS bucket.
Authentication: AWS access keys, Azure SAS tokens, or S3-compatible credentials. For S3, you can also use Federated Identity via AWS STS AssumeRole — no long-lived keys.
Scheduling: one-time, daily, or event-driven (S3 event notifications via SQS).
Validation: STS computes CRC32C or MD5 on every object and compares with the destination. Mismatched objects are retried.

STS for on-premises

You install STS agents (Docker containers) on machines inside your data centre. The agents read from NFS/POSIX file systems and push to GCS via the STS control plane.
Bandwidth is throttleable per agent and per job, so you can avoid saturating production network links.
Agents auto-resume if the network drops — no rsync babysitting.

STS gotchas

STS preserves object names but not POSIX metadata (uid/gid/mode) unless you enable preserve_metadata.
For billions of small files, STS becomes metadata-bound. Consider gcloud storage cp with parallel composite uploads or Transfer Appliance for the bulk dump.
STS does not transcode or transform — it is byte-for-byte copy. Schema-aware transforms require Dataflow.

Transfer Appliance for Offline Migration

Transfer Appliance is Google's "sneakernet" — a rack-mountable, encrypted storage appliance shipped to your data centre. You fill it, ship it back to Google, and the data is ingested into GCS.

When to use Transfer Appliance

Bandwidth math fails: A useful heuristic is the "T-shirt rule" — if uploading over your link takes more than 1 week, consider an appliance. For 100 TB over a 100 Mbps link, online transfer takes ~100 days; an appliance takes ~2-3 weeks door-to-door.
One-time bulk seed: Often paired with Datastream or DMS CDC, where the appliance carries the bulk historical data and CDC handles the delta.
Air-gapped or restricted networks: Sites that cannot egress to the public internet at scale.

Appliance variants

TA40 — 40 TB usable, desktop form factor, FedEx-able.
TA300 — 300 TB usable, rack-mounted, freight-shipped.

Security model

Data is encrypted on the appliance using AES-256 with a customer-supplied passphrase. Google does not hold the key.
On arrival at Google, the appliance is decrypted, data is ingested into your GCS bucket, then the appliance is securely wiped (NIST 800-88).
If the appliance is lost in transit, the data is unreadable without your passphrase.

Order the Transfer Appliance early. Lead time for TA300 is typically 10-15 business days for shipping, plus your loading time, plus 5-10 days for return shipping and Google-side ingestion. Architects who don't budget 4-6 weeks of total wall-clock time miss their cutover window.

BigQuery Data Transfer Service for SaaS Sources

BigQuery Data Transfer Service (BQ DTS) is a scheduled, managed pull-based loader for SaaS and external warehouse sources into BigQuery datasets.

Supported sources

Google SaaS: Google Ads, Campaign Manager, Display & Video 360, Search Ads 360, YouTube Channel/Content Owner, Google Merchant Center, Google Play.
External cloud storage: Amazon S3, Azure Blob Storage (via the GCS connector).
External warehouses: Teradata, Amazon Redshift, Snowflake (via the BigQuery Migration Service path).

How BQ DTS works

You authorise a transfer config with OAuth (for SaaS sources) or IAM/service-account credentials (for cloud storage).
DTS runs on a schedule (daily, hourly, or custom cron) and pulls the latest data.
Data lands in partitioned tables matching the source schema. Schema changes are auto-handled for most SaaS sources.
Failures send Cloud Monitoring alerts; backfill is a CLI flag.

Common architecture

A marketing analytics platform on BigQuery typically uses BQ DTS for Google Ads + YouTube + DV360, plus Storage Transfer Service for S3-hosted CSV exports from third-party DSPs, all unioned in a BigQuery dataset.

BigQuery Migration Service for SQL Translation

The BigQuery Migration Service is the umbrella product for schema and SQL translation from legacy data warehouses (Teradata, Redshift, Snowflake, Oracle, Hive, Vertica, SQL Server, Netezza) into BigQuery.

Components

Assessment — analyses your source warehouse logs to estimate BigQuery slot needs, identifies unsupported SQL patterns, and reports table-level read frequency so you can prioritise.
Schema and Data Transfer — generates BigQuery DDL from the source DDL, handling type mapping (e.g., Teradata DECIMAL(38,18) → BigQuery NUMERIC), and orchestrates a one-time data dump via STS or appliance.
SQL Translation (Batch and Interactive) — converts source SQL dialect to GoogleSQL. The interactive translator is exposed in the BigQuery console; the batch translator processes thousands of queries via a config file.

What it does not do

It does not migrate stored procedures or workflows verbatim — Teradata BTEQ and Redshift stored procedures often need manual rewrite or a tool like Cloud Workflows or Dataform.
It does not handle CDC. Pair with Datastream or a custom CDC pipeline for ongoing replication during the parallel-run period.

Translate first, validate second, cut over third. The BigQuery Migration Service handles translate, the Data Validation Tool handles validate, and your runbook handles cut over. Skipping validation between translate and cut over is the most common cause of production incidents post-migration.

Data Validation Tool (DVT)

The Data Validation Tool (DVT) is the open-source companion to BigQuery Migration Service, maintained by Google Professional Services on GitHub (GoogleCloudPlatform/professional-services-data-validator). It is the de facto standard for post-migration reconciliation.

What DVT does

DVT runs the same validation query against the source and destination and compares results. Supported source/target combinations include Teradata, Redshift, Snowflake, Oracle, SQL Server, PostgreSQL, MySQL, BigQuery, Spanner, AlloyDB.

Validation types

Column validation — COUNT, SUM, AVG, MIN, MAX over numeric columns. Cheap, runs in seconds.
Row validation — hash each row (concatenate columns, SHA-256) on both sides and compare. Expensive but catches per-row drift.
Schema validation — compare table DDL: column names, types, nullability.
Custom query validation — bring your own SQL (e.g., SELECT customer_id, SUM(amount) FROM orders GROUP BY 1) and DVT runs it on both sides.

How DVT is operated

Run as a Python CLI, in a Cloud Run job, or as a Cloud Composer (Airflow) task.
Results are written to a BigQuery results table (pso_data_validator.results), making it trivial to dashboard with Looker Studio.
Threshold-based: you can configure tolerances (e.g., ±0.01% on a SUM) to account for floating-point rounding between dialects.

Row-Count and Checksum Reconciliation Patterns

Even without DVT, every migration runbook must include reconciliation queries that an auditor can re-run.

The three reconciliation tiers

Tier 1 — Row count (SELECT COUNT(*) FROM source.t vs SELECT COUNT(*) FROM dest.t). Catches dropped rows. Runs in seconds. Always required.
Tier 2 — Aggregate checksums (SELECT SUM(amount), MIN(created_at), MAX(created_at), COUNT(DISTINCT customer_id) FROM ...). Catches silent type coercion (e.g., DECIMAL truncation) and date-shift bugs.
Tier 3 — Row-level hash (SELECT MD5(CONCAT(col1, col2, ...)) FROM ...). Catches per-row drift. Expensive — usually sampled (e.g., latest 7 days, or WHERE MOD(id, 1000) = 0).

Patterns that catch real bugs

NULL handling: COUNT(*) matches but COUNT(col) differs by 1,200 — the source had NULLs that got coerced to empty string during ETL.
Time zone drift: MAX(created_at) differs by 8 hours — the source was Asia/Taipei, destination assumed UTC.
Float precision: SUM(amount) differs by 0.03 across 10M rows — DOUBLE source to NUMERIC destination rounding. Tolerable; document it.

Reconciliation queries must be reproducible by an auditor 6 months later. Check them into Git, pin them to a SQL filename like recon_orders.sql, and run them via Cloud Composer or Cloud Run jobs. A reconciliation PDF emailed to a Slack channel is not auditable.

Cutover Runbook

The cutover is the moment of truth — the application switches from old to new. A good runbook is minute-by-minute and reviewed by the on-call team a week before.

Cutover phases (DMS to Cloud SQL example)

T-7 days: DMS replication running. Lag dashboard green for 7 consecutive days.
T-1 day: Dress rehearsal — run cutover on a staging clone, time every step, measure rollback time.
T-2 hours: Freeze schema changes on the source. Notify stakeholders.
T-0: Enter maintenance window. Set source to read-only (FLUSH TABLES WITH READ LOCK for MySQL, ALTER DATABASE ... ALLOW_CONNECTIONS false for PostgreSQL).
T+2 min: Wait for DMS lag to reach 0. Confirm with SHOW SLAVE STATUS on the destination.
T+5 min: Promote the destination in DMS. Run smoke-test queries (read + write on a canary table).
T+10 min: Flip application connection string (via Cloud DNS swap, HAProxy reconfig, or Secret Manager version). Restart app pods.
T+15 min: Run Tier 1 reconciliation (row counts on top 20 tables).
T+30 min: Run Tier 2 reconciliation (aggregate checksums) and DVT.
T+60 min: Sign-off or rollback decision.

Common cutover failures

Connection-string cache: Application pods cached the old DNS for 5 minutes; users hit the read-only source. Fix: pre-shorten TTL to 30 seconds 24 hours before cutover.
Sequence drift: Auto-increment counter on destination was lower than source max — duplicate key errors. Fix: SELECT setval('orders_id_seq', (SELECT MAX(id) FROM orders)) post-promotion.
App-server clock skew: Application server's NTP drifted, inserting future timestamps. Fix: validate clock skew across all writers in the dress rehearsal.

Backout Plan

A backout plan (also called rollback plan) is a documented procedure to return to the source system if cutover fails. No backout plan = no production cutover.

What a real backout plan contains

Decision criteria — at T+60 min, if any of the following are true, rollback: row-count delta > 0.01%, P99 latency > 2x baseline, error rate > 0.5%, business owner says no.
Mechanism — DMS does not have a reverse-promotion button. You must keep the source in read-only mode during the validation window, then flip the connection string back. If the source was already taken read-write again, you need a reverse CDC pipeline (Datastream from Cloud SQL back to on-prem) — set this up before cutover.
Time budget — most backout plans target < 30 minutes from decision to traffic flowing on the source again. Anything longer and the business will demand answers about the cutover window itself.
Communication plan — Slack channel, incident commander, customer-facing status page, regulator notification (for FSI/healthcare).

Backout window length

Keep the source available read-only for at least 7-14 days post-cutover. This is your insurance policy for slow-burn bugs (e.g., a monthly billing report runs on day 8 and finds the data is wrong).
For BigQuery migrations, keep the source warehouse + DTS schedules running in parallel for one full business cycle (typically 30 days) so you can A/B compare reports.

Many teams treat the backout plan as a paragraph in a Confluence doc. On the PCA exam and in real production, the backout plan must be executable in under 30 minutes without paging the original migration engineers. If your rollback requires the senior architect to "remember the steps," it does not exist.

Data Validation Strategies: Ensuring Integrity

Migration is a failure if the data is corrupted or incomplete. Architects must implement a Validation Framework.

1. Checksums (MD5/CRC32C)

Ensure that every bit sent is every bit received. GCS automatically performs MD5/CRC32C checks on upload. For large migrations, verify the checksums on the source and destination after the transfer.

2. Count Validation

The simplest check: SELECT COUNT(*) on the source vs. the destination. If the numbers don't match, rows were dropped.

3. Schema Validation

Ensure that data types haven't changed (e.g., a "Date" field didn't turn into a "String" field during an ETL process).

4. Business Logic / Spot Checks

Pick a few "Key Customers" or "Complex Records" and manually verify their data across both systems.

Migration Patterns: Lift & Shift vs. Refactor

Offline (Lift & Shift): Export data -> Ship disk -> Import. High downtime, low complexity.
Online (Replatform): Set up a replica -> Sync data -> Cutover. Low downtime, higher complexity.
Streaming (Refactor): Use Pub/Sub and Dataflow to pipe data into the new system in real-time.

FAQ — Data Migration and Validation

Q1. When should I use `gsutil` vs. Storage Transfer Service?

Use gsutil (or gcloud storage) for ad-hoc, smaller transfers from your local machine. Use STS for large-scale, scheduled, or cross-cloud transfers where you want Google to manage the scaling and retries.

Q2. How do I handle data that changes during a long migration?

Use Change Data Capture (CDC). The Database Migration Service (DMS) does this for SQL databases by reading the transaction logs and "replaying" them on the destination. For analytics destinations (BigQuery), use Datastream.

Q3. What is the "Cutover" and how do I minimize downtime?

The cutover is the moment you point your application from the old database to the new one. To minimize downtime, use a "Read-Only" mode on the source during the final sync, or use a Proxy layer to flip the traffic instantly. See the Cutover Runbook section above.

Q4. Does Transfer Appliance encrypt my data?

Yes. The data is encrypted at the source before it ever hits the appliance using AES-256 with a customer-supplied passphrase. Google does not have the keys; you manage the encryption keys.

Q5. Why is my migration slow even with a 10Gbps Interconnect?

Common bottlenecks include:

Small Files: Moving millions of 1KB files is much slower than moving one 1TB file due to metadata overhead.
Disk I/O: Your source storage system might not be able to read data fast enough.
MTU Mismatch: Network packet fragmentation can severely degrade performance.

Q6. Do I need DVT if I already have STS checksums?

Yes. STS checksums prove object integrity (the bytes arrived). DVT proves semantic integrity (the rows, sums, and schemas match). They answer different questions; both are required.

Final Architect Tip

On the PCA exam, if you see a scenario with "Limited Bandwidth" and "Large Data Volume", the answer is almost always Transfer Appliance. If the data is coming from AWS S3, the answer is Storage Transfer Service. If it's an operational database to Cloud SQL/AlloyDB, it's DMS. If it's CDC into BigQuery for analytics, it's Datastream. If it's a legacy data warehouse SQL conversion, it's BigQuery Migration Service plus DVT for validation. Always pair migration with a cutover runbook and a tested backout plan — checksums are your best friend for proving to stakeholders that the migration was successful.

Introduction to Data Migration

白話文解釋（Plain English Explanation）

The Migration Decision Matrix

Database Migration Service (DMS) for Cloud SQL and AlloyDB

How DMS works

DMS gotchas

Datastream CDC for Ongoing Replication

Typical Datastream patterns

Datastream vs DMS

Storage Transfer Service for GCS

STS for cloud-to-cloud

STS for on-premises

STS gotchas

Transfer Appliance for Offline Migration

When to use Transfer Appliance

Appliance variants

Security model

BigQuery Data Transfer Service for SaaS Sources

Supported sources

How BQ DTS works

Common architecture

BigQuery Migration Service for SQL Translation

Components

What it does not do

Data Validation Tool (DVT)

What DVT does

Validation types

How DVT is operated

Row-Count and Checksum Reconciliation Patterns

The three reconciliation tiers

Patterns that catch real bugs

Cutover Runbook

Cutover phases (DMS to Cloud SQL example)

Common cutover failures

Backout Plan

What a real backout plan contains

Backout window length

Data Validation Strategies: Ensuring Integrity

1. Checksums (MD5/CRC32C)

2. Count Validation

3. Schema Validation

4. Business Logic / Spot Checks

Migration Patterns: Lift & Shift vs. Refactor

FAQ — Data Migration and Validation

Q1. When should I use gsutil vs. Storage Transfer Service?

Q2. How do I handle data that changes during a long migration?

Q3. What is the "Cutover" and how do I minimize downtime?

Q4. Does Transfer Appliance encrypt my data?

Q5. Why is my migration slow even with a 10Gbps Interconnect?

Q6. Do I need DVT if I already have STS checksums?

Final Architect Tip

Official sources

More PCA topics

Q1. When should I use `gsutil` vs. Storage Transfer Service?