examlab .net The most efficient path to the most valuable certifications.
In this note ≈ 21 min

Handling Duplicate and Corrupt Data

4,180 words · ≈ 21 min read ·

Practical GCP PDE guide to handling duplicate and corrupt data: Pub/Sub EOD, Dataflow stateful dedup, BigQuery MERGE, Dataform assertions, Dataplex scans, DLQs.

Do 20 practice questions → Free · No signup · PDE

Introduction to Handling Duplicate and Corrupt Data

Handling duplicate and corrupt data sits at the centre of every real production data pipeline on Google Cloud. The PDE exam keeps coming back to this theme because messy input is the default state of the world, and engineers who pretend otherwise end up paying for it in BigQuery storage bills and customer-trust incidents.

This note walks through the full toolkit: Pub/Sub at-least-once and exactly-once delivery, Dataflow stateful deduplication, BigQuery MERGE patterns, Dataform assertions, Dataplex auto data quality scans, dead letter queues, schema evolution, malformed row routing during BigQuery loads, and idempotent transformations. By the end you should be able to draw the failure modes on a whiteboard and pick the right mitigation for each.

白話文解釋(Plain English Explanation)

Think of Pub/Sub Delivery Like a Postal System

Imagine you mail a wedding invitation by registered post. The postal service guarantees the envelope arrives, but it does not promise it arrives only once. If the postman is unsure whether you signed the receipt, they may come back tomorrow with the same envelope. That is at-least-once delivery. You will never miss the invitation, but you might get two of them.

Pub/Sub default behaviour works the same way. Every message is guaranteed to land in your subscriber, but a network blip during ack can cause a redelivery a few seconds later. Exactly-once delivery (EOD) is like paying extra for a single-attempt courier with a tamper-proof tracking system inside one region. The courier still tries hard to deliver, but the postal company internally deduplicates so you sign for the envelope only once. EOD is not magic, it is bookkeeping.

Think of Dataflow Stateful Dedup Like a Nightclub Bouncer

A bouncer at the door has a clipboard. Every time someone walks up, the bouncer writes down their ID. If the same ID appears again within the night, they get sent home. The clipboard is state, the night is the time window, and the ID is the dedup key.

Dataflow's Deduplicate transform and stateful DoFn work the same way. A worker keeps a per-key state cell (the clipboard) and a timer that wipes the cell after a configurable window (closing time). Within that window, repeats are dropped. After the window, memory is freed so the bouncer is not carrying around years of guest lists.

Think of Data Quality Scans Like Restaurant Health Inspections

A health inspector visits a restaurant once a quarter. They check whether the chicken is at the right temperature, whether the dishwasher reaches 80 degrees, and whether the staff washed their hands. If anything fails, they file a report and may shut the kitchen down. The chef does not run the inspection, an outside body does.

Dataplex auto data quality scans play the inspector role for your BigQuery tables and Cloud Storage buckets. They run on a schedule, check declarative rules (row_count > 0, null_percent(email) < 1), publish results to a dashboard, and can fire Cloud Logging alerts. The pipeline keeps cooking, but you find out fast when something is off.

Think of Dead Letter Queues Like a Hospital Triage Bay

When ambulances arrive at a busy emergency room, the worst-case patients go straight to triage rather than blocking the main ward. Once stable, they get reviewed by a senior doctor who decides whether to treat, transfer, or discharge.

A DLQ is the triage bay for messages your pipeline could not parse. They sit safely on a separate Pub/Sub topic or BigQuery error table, ready for a human or replay job to inspect. The main ward (your production pipeline) keeps moving instead of grinding to a halt over one malformed JSON.

Core Concepts of Handling Duplicate and Corrupt Data

A duplicate is a record that represents the same logical event as another record. The two records may have identical bytes, identical primary keys with newer timestamps, or different surface forms but the same business meaning. The classification matters because the dedup strategy depends on which type you face.

Corrupt data is a wider bucket. It includes malformed JSON or Avro, schema drift (new field, removed field, type change), out-of-range values, encoding issues (UTF-8 vs Latin-1), truncated payloads, and semantically wrong values that pass syntactic validation. Some of these can be detected early (parser fails); others only show up at the analytics layer (revenue is suddenly negative).

A transformation that produces the same output regardless of how many times it runs on the same input. Writing UPDATE users SET status='active' WHERE id=42 is idempotent. Writing UPDATE users SET balance = balance + 10 WHERE id=42 is not. Idempotence is the cheapest dedup tool you have. See https://cloud.google.com/architecture/exactly-once-processing-dataflow-pubsub

The exam asks you to reason about delivery semantics in three layers: producer to broker, broker to consumer, and consumer to sink. End-to-end exactly-once requires every layer to cooperate, which is why most production systems aim for at-least-once delivery plus idempotent sinks rather than chasing true exactly-once everywhere.

Architecture and Design Patterns

The standard end-to-end pattern for handling duplicate and corrupt data on GCP looks like this:

  1. Producers attach a stable event_id (UUID v4 or a deterministic hash of the payload) and an event_time timestamp.
  2. Pub/Sub transports messages with at-least-once delivery, optionally upgraded to EOD on the regional subscription.
  3. A Dataflow streaming job applies a windowed Deduplicate transform keyed on event_id.
  4. The job parses payloads inside a try/except wrapper. Successes flow to the main PCollection; failures get tagged and pushed to a DLQ Pub/Sub topic or a BigQuery error table.
  5. Validated records land in a BigQuery staging table.
  6. A scheduled BigQuery MERGE upserts staging into a curated table keyed on the natural primary key.
  7. Dataform assertions and Dataplex scans run after the merge to flag anomalies.
  8. A separate replay job pulls from the DLQ, applies fixes, and re-ingests through the same pipeline.

This pattern has three properties that make it survivable in production. Failures are isolated, replays do not cause double-counting, and humans get visibility through both metrics and a queryable DLQ table.

The most common architectural mistake is treating the DLQ as a fire-and-forget destination. A DLQ without a replay process is just a graveyard. Build the replay path on day one, even if it starts as a manual SQL script. See https://cloud.google.com/dataflow/docs/guides/common-errors

A second pattern, used for high-volume IoT or clickstream data, is the lambda-style split: a streaming job for low-latency dedup with relaxed accuracy, plus a nightly batch reconciliation job over the raw landing zone that re-derives the curated table with full dedup logic. The batch job is allowed to be expensive because it runs once a day; the streaming job is allowed to be approximate because batch will fix it.

GCP Service Deep Dive

Pub/Sub At-Least-Once vs Exactly-Once Delivery

Pub/Sub default delivery is at-least-once. The broker holds a message until the subscriber acknowledges within the ack deadline (default 10 seconds, configurable up to 600). If the ack does not arrive in time, Pub/Sub assumes the worker died and redelivers. Network jitter alone produces a baseline duplicate rate of roughly 0.01 to 0.1 percent in healthy systems, much higher during incidents.

Exactly-once delivery, enabled per subscription, changes two things. First, Pub/Sub guarantees no redelivery of a successfully acknowledged message within the ack deadline. Second, it provides a stronger ack contract: the modAckDeadline and ack calls return success only when the broker has durably recorded them. EOD is regional, requires pull subscriptions (not push), and adds latency. It does not protect against duplicates produced by the publisher itself, only against broker-side redeliveries.

EOD is often misunderstood as "no duplicates ever." It only eliminates broker-side redelivery within the ack deadline. If your publisher retries on a network timeout without an idempotency key, you still get duplicates. Always pair EOD with a stable event_id and a downstream dedup step. See https://cloud.google.com/pubsub/docs/exactly-once-delivery

EOD costs more per message than standard delivery and caps subscription throughput lower. For most pipelines the right answer is at-least-once Pub/Sub plus dedup in Dataflow, not EOD plus complacency.

PDE scenarios that mention producer retries, network timeouts, or end-to-end exactly-once expect the answer: stable application-level event_id plus a Deduplicate transform in Dataflow plus an idempotent sink (BigQuery MERGE or Storage Write API offsets). EOD alone is the wrong answer because it only covers the broker-to-subscriber leg. The messageId Pub/Sub assigns is per-publish, not per-event, so it cannot be used as the dedup key. See https://cloud.google.com/architecture/exactly-once-processing-dataflow-pubsub

Dataflow Deduplication Using Stateful Processing

Apache Beam ships a built-in Deduplicate transform. You hand it a key extractor and a duration. The transform keeps a per-key state cell with a timer; if the same key arrives twice within the duration, the second copy is dropped.

from apache_beam.transforms.deduplicate import Deduplicate
from apache_beam.utils.timestamp import Duration

(p
 | 'Read' >> beam.io.ReadFromPubSub(subscription=SUB)
 | 'Parse' >> beam.Map(parse_event)
 | 'Dedup' >> Deduplicate(
       processing_time_duration=Duration(seconds=600))
 | 'Write' >> beam.io.WriteToBigQuery(TABLE))

For more control, write a custom stateful DoFn that uses BagState or ValueState. This lets you track the latest version of an event and emit only the freshest copy when ties occur, useful for change-data-capture pipelines where the same primary key gets updated multiple times in flight.

State is stored in the worker's runner state backend (typically Spanner-backed for Streaming Engine). The dedup window directly trades memory for accuracy: a 1-hour window catches more duplicates but costs more state storage than a 10-minute window.

BigQuery MERGE for Upserts

BigQuery MERGE combines INSERT, UPDATE, and DELETE into a single atomic statement. It is the canonical way to dedupe a staging table into a curated table on a key.

MERGE `proj.curated.orders` T
USING (
  SELECT * EXCEPT(rn) FROM (
    SELECT *,
      ROW_NUMBER() OVER (
        PARTITION BY order_id
        ORDER BY event_time DESC
      ) AS rn
    FROM `proj.staging.orders_today`
  )
  WHERE rn = 1
) S
ON T.order_id = S.order_id
WHEN MATCHED AND S.event_time > T.event_time
  THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ROW

The inner ROW_NUMBER collapses duplicates within the staging set; the MERGE then upserts against the curated table. Because MERGE runs as a single transaction, there is no race window where queries see a half-merged state.

For large curated tables, partition on event_date and cluster on the merge key. Without partition pruning, every MERGE rewrites the entire table, which becomes prohibitively expensive past a few hundred GB. Add a WHERE T.event_date BETWEEN ... pruning predicate inside the ON clause. See https://cloud.google.com/bigquery/docs/best-practices-performance-patterns#optimize_merge_statements

Dataform Assertions

Dataform sits on top of BigQuery and lets you declare data quality assertions alongside your SQL transformations. An assertion is a SQL query that should return zero rows; if it returns any rows, the assertion fails and downstream models skip.

config {
  type: "assertion"
}

SELECT order_id, COUNT(*) AS dupes
FROM ${ref("orders_curated")}
GROUP BY order_id
HAVING COUNT(*) > 1

Assertions encode invariants like uniqueness, non-null, referential integrity, and accepted value lists. They run as part of the Dataform schedule, fail loudly in Cloud Logging, and prevent broken data from reaching dashboards. The mental model is unit tests for your warehouse.

Dataplex Data Quality Scans

Dataplex auto data quality (AutoDQ) is the managed scan service for BigQuery and Cloud Storage assets. You define a scan spec with rules, point it at a table, and Dataplex runs it on a schedule (or on demand). Rules cover null checks, value ranges, regex matches, statistical bounds (mean within tolerance), and SQL expressions for custom logic.

The scan publishes results to BigQuery for trend analysis, emits Cloud Logging entries, and can trigger Cloud Monitoring alerts. Unlike Dataform assertions, AutoDQ runs outside your transformation graph, so it does not block writes; it observes them. That separation is useful when the warehouse is shared across many teams and you want centralised quality monitoring without owning each team's pipeline.

Dead Letter Queues in Pub/Sub and Dataflow

Pub/Sub native DLQs are configured per subscription. You set a deadLetterTopic and a maxDeliveryAttempts. When a message exceeds the attempt threshold, Pub/Sub publishes it to the DLQ topic and acks the original. This rescues your subscriber from poison messages without manual intervention.

In Dataflow, the convention is to use tagged outputs. The parsing DoFn yields good records to the main output and bad records (with the raw bytes plus the exception message) to a side output. The side output is written to a BigQuery error table or a Pub/Sub DLQ topic.

class ParseEvent(beam.DoFn):
    def process(self, element):
        try:
            yield json.loads(element)
        except Exception as e:
            yield beam.pvalue.TaggedOutput(
                'errors',
                {'raw': element.decode('utf-8', 'replace'),
                 'error': str(e),
                 'received_at': datetime.utcnow().isoformat()})

The error table schema typically includes the raw payload, the exception class and message, the timestamp received, and a pipeline version tag. With those columns you can write replay queries that filter by error type, fix the bug, and re-ingest.

Schema Evolution

Schema drift is one of the leading causes of pipeline incidents. A producer team adds a new field without telling consumers. The schema check fails. The pipeline stalls. The PDE exam expects you to know how each storage layer handles this.

BigQuery supports schema relaxation on load: --schema_update_option=ALLOW_FIELD_ADDITION permits new optional columns to appear in the source file and be added to the table. ALLOW_FIELD_RELAXATION allows REQUIRED columns to be relaxed to NULLABLE. Type changes and column removals require manual migration.

Pub/Sub schemas (Avro or Protocol Buffers) enforce contracts at publish time. Schema revisions are versioned, and you mark each revision as backwards-compatible, forwards-compatible, or full. Subscribers tied to a specific revision continue to work even as new revisions roll out.

Backwards compatible: new readers can read old data. Forwards compatible: old readers can read new data. Full compatibility: both directions hold. For event-driven pipelines, aim for full compatibility so producers and consumers can deploy independently. See https://cloud.google.com/pubsub/docs/schemas

Malformed Row Handling in BigQuery Load Jobs

The bq load command (and the underlying Load Jobs API) accepts a --max_bad_records flag. If a row fails to parse, BigQuery skips it and increments a counter. The job succeeds as long as the counter stays below the threshold. The skipped rows are not stored anywhere by default.

For visibility, prefer Storage Write API with stream-level error handling, or do the parsing in Dataflow where you control the DLQ path. Setting max_bad_records=10000 and forgetting about it is the classic anti-pattern: data quietly disappears and you find out months later when a finance audit fails.

Avoid max_bad_records for production pipelines. Either set it to zero and fail fast, or move parsing into Dataflow where bad rows go to a queryable error table. Silent skips are how you lose 2% of your revenue events for six months. See https://cloud.google.com/bigquery/docs/loading-data

Idempotent Transformations and Insert IDs

The cheapest dedup primitive is making your write idempotent. BigQuery streaming inserts (legacy tabledata.insertAll) accept an insertId field. The service maintains a best-effort dedup window (about one minute) keyed on insertId. Two inserts with the same id within the window collapse into one row.

For the newer Storage Write API, exactly-once semantics are achieved with offsets on a committed stream. The client provides a monotonically increasing offset per row. The server rejects rows whose offset is out of order, giving you provable exactly-once behaviour even across client retries.

For batch writes, idempotence comes from designing the load job to be re-runnable. Always use WRITE_TRUNCATE over a date partition rather than WRITE_APPEND against the whole table. Re-running the same job overwrites the same partition without producing duplicates.

Common Pitfalls and Trade-offs

A frequent mistake is layering EOD on top of an at-least-once publisher and assuming the system is now exactly-once end to end. EOD only addresses the broker-to-subscriber leg. If your publisher retries with a fresh messageId on every attempt, the broker sees them as different messages and delivers all of them. The fix is to give every event a stable application-level event_id independent of the Pub/Sub messageId.

Another trap is undersizing the dedup window in Dataflow. If your network can buffer messages for up to 30 minutes during an incident, a 5-minute dedup window will let duplicates through. Sizing the window requires knowing your worst-case redelivery latency, not your typical case.

MERGE performance degrades sharply on unpartitioned tables. Engineers sometimes use MERGE against a 5 TB table without a WHERE predicate, then file a support ticket complaining about cost. The fix is partition pruning in the ON clause, or switching to a partition-overwrite strategy where you INSERT OVERWRITE a single partition per run.

Dataform assertions can become a maintenance burden if you write them as snapshots. An assertion saying "row count must be greater than 1 million" breaks every Monday because Sunday is a slow day. Better assertions check ratios and trends rather than absolutes.

Schema evolution often breaks because a backwards-compatible change is deployed without backfill. Adding a new REQUIRED field is technically a breaking change even if the field has a default; old consumers may not know how to populate it on replay. Treat schema migrations like database migrations: write the rollback plan first.

Best Practices

  • Attach a stable event_id at the producer. Never rely on broker-assigned ids for application-level dedup.
  • Default to at-least-once delivery plus idempotent sinks. Reach for EOD only when the latency and cost trade-off is justified.
  • Size your Dataflow dedup window for the worst case, not the median case. Document the assumption.
  • Always build the DLQ replay path on day one, even if it starts as a hand-written SQL script.
  • Partition curated tables on event date and add a pruning predicate to every MERGE.
  • Use Dataform assertions for write-time invariants and Dataplex AutoDQ for ongoing monitoring. They complement, they do not replace each other.
  • Avoid max_bad_records in load jobs. Push parsing to Dataflow so failures land in a queryable error table.
  • Version every schema change with full compatibility where possible. Run a shadow consumer on the new schema before cutting over.
  • Monitor the duplicate rate as a first-class SLO. A sudden jump usually points to a publisher bug, not a broker issue.
  • Keep raw landing zone data for 30 to 90 days so you can replay from source whenever the curated layer needs to be rebuilt.

Real-World Use Case

A mid-sized European fintech (around 400 engineers, 80 million users) processes payment authorisation events through GCP. Volume peaks at 50,000 events per second on Black Friday, with strict regulatory requirements that no transaction can be lost or double-counted in the ledger.

The architecture they settled on after two years of iteration:

Producers (the card-network adapters) generate a stable event_id from a hash of card_token + merchant_id + amount + nonce. They publish to Pub/Sub with at-least-once delivery; EOD was tested but rejected because the throughput cap was too low for peak season.

A Dataflow streaming job consumes the subscription. The first step applies a 30-minute Deduplicate window keyed on event_id. The 30 minutes covers their observed worst-case redelivery during a regional incident. Parse failures are tagged and written to a payment_errors BigQuery table partitioned by ingestion date.

Validated events stream into a payments_staging table via the Storage Write API with offsets, providing exactly-once semantics on the BigQuery side. Every five minutes, a scheduled query runs a MERGE from staging into payments_curated, partitioned by payment_date and clustered on card_token. The MERGE includes a WHERE T.payment_date BETWEEN CURRENT_DATE() - 1 AND CURRENT_DATE() predicate to keep partition scans bounded.

A nightly Dataform run executes a suite of assertions: per-country totals reconcile against the card-network's settlement file, no duplicate transaction_id exists in the last 24 hours, and the payment_errors table row count is below 0.01 percent of payments_curated. Failures page the on-call engineer.

Dataplex AutoDQ runs a separate weekly scan covering long-window checks: card-token format compliance, currency code in ISO 4217, merchant category code in the allowed list. Results feed a Looker dashboard reviewed by the data quality team.

The DLQ replay job is a daily Cloud Run service that pulls from payment_errors, applies a configurable transform (currently a regex-based JSON repair for the most common producer bug), and re-publishes to the main Pub/Sub topic. Replayed events flow through the same pipeline; the dedup window catches any that already made it through cleanly.

In two years of operation the team has measured a duplicate rate of 0.003 percent at the producer, 0.0001 percent after Dataflow dedup, and zero confirmed duplicates in the curated table. The error table sees roughly 200 to 500 rows per day, almost all from a single legacy producer that the team is migrating away from.

Exam Tips

The PDE exam loves scenario questions where a pipeline produces duplicates and asks you to pick the most appropriate fix. Watch for the layer the question is testing.

If the scenario describes producer retries on network timeouts, the answer is a stable event_id plus downstream dedup, not EOD. EOD does not protect against publisher-side duplicates.

If the scenario mentions broker-level redelivery within an ack deadline and demands no duplicates with low engineering effort, EOD on a regional pull subscription is the right answer, paired with the caveat about throughput and latency cost.

When the scenario asks about deduping in Dataflow, look for the keyword "stateful." If present, the expected answer is Deduplicate transform or a stateful DoFn. Avoid answers that suggest GroupByKey-based dedup; that works but is more expensive and less idiomatic.

For BigQuery upserts at scale, MERGE with partition pruning is the canonical answer. Watch for distractors suggesting DELETE followed by INSERT (not atomic, race condition) or scheduled queries that rebuild the entire table (correct but expensive).

For data quality, distinguish Dataform assertions (in-pipeline, blocking) from Dataplex AutoDQ (out-of-band, observational). Both questions appear; the right answer depends on whether the scenario emphasises preventing bad data downstream or monitoring quality across the warehouse.

DLQ questions usually test whether you reach for the managed Pub/Sub DLQ feature (subscription-level deadLetterTopic) or roll your own with tagged outputs. Pub/Sub DLQ is the answer when the failure is "too many delivery attempts." Tagged outputs are the answer when the failure is "parsing exception inside the worker."

Schema evolution questions typically pit ALLOW_FIELD_ADDITION against rebuilding the table. Almost always the right answer is the additive option; rebuilding is reserved for type changes or column removals.

Finally, idempotence is the unifying theme. When in doubt, pick the answer that makes the write idempotent rather than the answer that adds another layer of dedup checking.

Frequently Asked Questions (FAQ)

Does Pub/Sub Exactly Once Delivery guarantee no duplicates end to end?

No. EOD only eliminates duplicates introduced by Pub/Sub broker redelivery within the ack deadline. Publisher retries that produce new messageIds, and consumer-side bugs that re-process committed offsets, can still create duplicates. End-to-end exactly-once requires a stable application-level event_id and an idempotent sink in addition to EOD.

When should I use BigQuery MERGE versus INSERT OVERWRITE on a partition?

Use MERGE when you need fine-grained upserts where most rows in the target stay unchanged and you only update or insert a subset. Use INSERT OVERWRITE (via WRITE_TRUNCATE partition decorator) when you can fully recompute a partition from source. The overwrite pattern is simpler, idempotent by construction, and often cheaper for daily batch pipelines. MERGE wins when the source set is small relative to the partition.

How big should my Dataflow dedup window be?

Size the window for your observed worst-case redelivery latency, not the median. A common starting point is 10 minutes for healthy systems, 30 to 60 minutes for systems with known network variability, and up to a few hours for cross-region pipelines. Bigger windows catch more duplicates but consume more state storage. Measure your actual redelivery distribution from Pub/Sub metrics rather than guessing.

What goes in a DLQ error table schema?

At minimum: the raw payload as bytes or a UTF-8 string with replacement characters, the exception class name, the exception message, the timestamp received, the pipeline version or build sha, and a unique error id. Optional but useful: the source partition or topic, the worker id, and a structured error category (parse, validation, schema). With those columns you can group errors, count by category, and write replay queries that filter to one bug at a time.

Can Dataform assertions block downstream tables from running?

Yes. By default, when an assertion fails, Dataform marks downstream actions as failed and skips them in the same run. You can override this with disabled: true on the assertion or with explicit dependency configuration, but the default behaviour is what you usually want: stop the pipeline so bad data does not reach dashboards or feature stores.

How do I dedup an event stream where the same key gets legitimately updated multiple times?

This is change-data-capture, not duplicate elimination. The right pattern is to keep all updates in a staging table, then in BigQuery use ROW_NUMBER() OVER (PARTITION BY key ORDER BY event_time DESC) to pick the latest version per key during the MERGE into the curated table. In Dataflow, a stateful DoFn that tracks the latest seen event_time per key and emits only when a newer version arrives accomplishes the same in streaming.

What is the difference between Dataform assertions and Dataplex data quality scans?

Dataform assertions are SQL queries that run inline with your transformation graph. They block downstream models on failure and are versioned alongside the pipeline code. Dataplex AutoDQ runs as a managed scan outside the pipeline, on a schedule or on demand. It does not block writes; it observes them and publishes results to a centralised dashboard. Use Dataform for write-time invariants owned by the pipeline team. Use Dataplex for cross-warehouse monitoring owned by a central data quality function.

Further Reading

Official sources

More PDE topics