examlab .net The most efficient path to the most valuable certifications.
In this note ≈ 21 min

Apache Beam Programming Model

4,200 words · ≈ 21 min read ·

Master Apache Beam programming model fundamentals for the GCP PDE exam: PCollection, PTransform, ParDo, GroupByKey, Combine, side inputs, windowing, watermarks, triggers, runners, and Beam SQL.

Do 20 practice questions → Free · No signup · PDE

Introduction to the Apache Beam Programming Model

The Apache Beam programming model is the unified abstraction that powers Cloud Dataflow and most modern stream-batch pipelines on GCP. For the PDE exam, you need to internalise how a Beam pipeline becomes a directed acyclic graph, how PCollections flow through PTransforms, and how the same code can run as a one-shot batch job or a 24/7 streaming service. This note walks the model from the ground up: data containers, transforms, windows, triggers, runners, and the SQL DSL on top.

If you only remember one thing about the Apache Beam programming model, remember this: write the pipeline once, choose the runner at submit time. That single decision shapes most exam questions about Dataflow, portability, and migration off Hadoop.

白話文解釋(Plain English Explanation)

The Apache Beam programming model gets much easier once you stop staring at the API docs and picture something physical. Three analogies cover most of the model.

The Restaurant Kitchen Line

Think of a Beam pipeline as the line in a busy restaurant kitchen. Raw ingredients arrive at the prep station — that is your source, the file in Cloud Storage or the Pub/Sub topic. A PCollection is the bin sitting between two stations: it holds chopped vegetables, marinated chicken, or plated dishes, but you never reach into it to mutate what is already there. You only pass the bin forward.

Each station — chop, sauté, plate, garnish — is a PTransform. Some stations work on one ingredient at a time, like the chopper, which is exactly what ParDo does. Some stations group similar items, like the salad station that collects all the lettuce before tossing, which mirrors GroupByKey. The expediter who calls "all table 12 plates ready" is your trigger firing on a window.

The runner is the kitchen manager. The Direct Runner is one cook doing everything on a tiny home stove for testing. The Dataflow Runner is a 200-cook industrial kitchen with autoscaling. The Flink Runner is the kitchen next door using the same recipes on different equipment. The recipes do not change.

The Postal Sorting Facility

A streaming Beam pipeline behaves like an overnight postal sorting facility. Letters arrive continuously through the night. You cannot wait until "all the mail" arrives — there is always more coming. So you slice time into windows: every five minutes, sort whatever has landed.

The watermark is the supervisor walking the loading dock with a clipboard saying, "I am now reasonably sure no more letters timestamped before 3:05 AM will arrive." Once that mark passes 3:05, the 3:00–3:05 window closes and gets sorted. Late letters that show up at 3:10 are stragglers; the trigger policy decides whether to reopen the bin and re-sort, or drop them.

Side inputs are the zip code reference book sitting on every sorter's desk. It changes rarely, every sorter needs to read it, and it is broadcast to all of them rather than streamed letter-by-letter.

The Open-Book Exam Cheat Sheet

Combine and CoGroupByKey behave like organising your notes for an open-book exam. Combine is the highlighter pass: you reduce a stack of pages down to a single summary sheet — a sum, an average, a top-10 list. CoGroupByKey is the cross-referencing pass: you have notes from chapter 3 keyed by topic and notes from chapter 7 keyed by the same topic, and you staple them together so each topic has both sources side by side. Coders are the handwriting style: as long as both you and the grader agree on how to read the squiggles, the bytes survive the trip across the network.

Core Concepts of the Apache Beam Programming Model

The Apache Beam programming model rests on five primitives. Memorise them in this order — Pipeline, PCollection, PTransform, I/O, Runner — because every other feature is a refinement of one of those.

A Pipeline is the top-level container. It holds the DAG of operations and the pipeline options that tell the runner how to execute. You construct it once, attach transforms, and call run().

A PCollection is a distributed, immutable, potentially unbounded dataset. The "P" stands for parallel. You never modify a PCollection in place; every transform produces a new one. PCollections can be bounded (finite, like a CSV file) or unbounded (infinite, like a Pub/Sub subscription). The boundedness is what determines whether windowing is optional or mandatory.

A PTransform is an operation that takes one or more PCollections as input and produces zero or more PCollections as output. Beam ships built-in transforms: ParDo, GroupByKey, CoGroupByKey, Combine, Flatten, Partition, and the Window transform.

ParDo is the workhorse. It applies a DoFn to every element in the input PCollection. Inside the DoFn you can emit zero, one, or many output elements. ParDo handles per-element transformation, filtering, enrichment, and one-to-many fan-out.

GroupByKey takes a PCollection<KV<K, V>> and produces PCollection<KV<K, Iterable<V>>>. It is the shuffle step. On streaming pipelines, GroupByKey only fires per window, which is why windowing is required upstream of any grouping operation on unbounded data.

CoGroupByKey is GroupByKey across two or more PCollections that share a key type. It is how you join streams without writing custom shuffle logic.

Combine is an associative-commutative reduction. Combine.globally collapses the entire PCollection to a single value; Combine.perKey collapses per key. Beam can push Combine into the shuffle as a partial aggregation, which is why a Sum.integersPerKey() is dramatically cheaper than a GroupByKey followed by a manual sum DoFn.

A PCollection is Beam's immutable, distributed dataset abstraction. It can be bounded (batch) or unbounded (streaming), and every PTransform consumes and produces PCollections rather than mutating them in place. See Apache Beam Programming Guide — PCollections.

Architecture and Design Patterns

A Beam pipeline is not executed where you write it. The SDK builds a graph; the runner translates that graph into native execution units. On Dataflow, that means the graph is uploaded to the Dataflow service, optimised (fusion, combiner lifting, sink fusion), and dispatched to autoscaled VM workers. On Flink, the graph is converted to Flink operators and submitted to a Flink cluster. The portability framework — the Job Service and Fn API — is what makes this language-agnostic.

The fundamental architectural pattern is deferred execution. When you write p.apply(ParDo.of(new MyDoFn())), nothing runs. You are appending a node to the DAG. Execution begins at p.run(), and even then, on Dataflow, the local process exits while the service runs the job. This deferred model is why DoFns must be serialisable and why side effects in pipeline construction code (like opening a file) usually backfire.

Three composition patterns dominate real codebases:

Linear chains apply a sequence of ParDos: parse, validate, enrich, write. This is the default shape and it fuses well — Dataflow collapses adjacent ParDos into a single execution stage, avoiding shuffle overhead.

Fan-out / fan-in uses Partition or multiple outputs from a ParDo (via tagged outputs) to split a stream into branches, then Flatten to merge them. Useful for routing valid records to BigQuery while sending malformed records to a dead-letter Pub/Sub topic.

Composite transforms wrap a sub-DAG behind a single PTransform class. They give you reusable building blocks that show up as collapsed nodes in the Dataflow execution graph UI, which is invaluable for debugging.

Pipeline construction code runs once on the submitting machine; DoFn code runs on workers, possibly thousands of times in parallel, possibly serialised across JVM boundaries. Anything captured by your DoFn must be Serializable, and any per-worker resource (JDBC connection, ML model) belongs in @Setup, not the constructor. See Beam Programming Guide — DoFn lifecycle.

GCP Service Deep Dive: Dataflow as the Beam Runner

Dataflow is Google's fully managed Apache Beam runner and the assumed execution target for every Beam-related PDE question unless the prompt explicitly mentions Flink, Spark, or Direct.

The Dataflow service does several things no open-source runner does for free. It autoscales workers from one to hundreds based on backlog and CPU. It performs dynamic work rebalancing — when one worker finishes early, Dataflow splits remaining work from a slow worker and reassigns the slice. It applies fusion optimisation, collapsing chains of element-wise operations into a single stage so intermediate PCollections never materialise to disk. It applies combiner lifting, moving partial aggregations before the shuffle.

For streaming, Dataflow runs in Streaming Engine mode, which separates compute from state. Worker VMs become stateless and can be replaced or rescaled without re-shuffling all the windowed state. Dataflow Prime adds vertical autoscaling and right-fits memory automatically.

The contrast with other runners matters for the exam:

Runner Use case Trade-off
Direct Runner Local unit tests, IDE debugging Single-process, not for production scale
Dataflow Runner Production on GCP Managed, autoscaling, GCP-only
Flink Runner On-prem or hybrid streaming Self-managed cluster, no managed autoscaling
Spark Runner Existing Spark infrastructure Limited streaming feature support

A common exam scenario gives you a team migrating from on-prem Flink to GCP. The right answer is usually "keep the Beam code, change the runner from FlinkRunner to DataflowRunner via pipeline options" — that is the entire point of the Apache Beam programming model's portability promise.

For development, run pipelines with the Direct Runner inside your IDE first. It surfaces serialisation bugs and coder mismatches in seconds rather than after a 10-minute Dataflow startup. Once green locally, switch the runner via --runner=DataflowRunner. See Direct Runner docs.

Windowing, Watermarks, and Triggers

Streaming questions on the PDE exam almost always test the windowing model. Three concepts work together.

A window assigns each element to one or more time intervals. Beam ships several window functions: fixed (tumbling) windows of fixed size, sliding windows that overlap, session windows that close after a gap of inactivity, and the global window which is the implicit default for batch.

A watermark is the runner's estimate of event-time progress. It is a timestamp t such that the runner believes no more elements with event time earlier than t will arrive. Watermarks are heuristic, not exact. For Pub/Sub sources, Dataflow derives the watermark from message publish times and source backlog signals.

A trigger decides when to emit results for a window. The default trigger fires once when the watermark passes the end of the window — perfect for batch-correct output. Custom triggers allow early firing (speculative results), late firing (handle late data), and combinations via AfterWatermark.pastEndOfWindow().withEarlyFirings(...).withLateFirings(...).

The interaction with accumulation mode is where exam takers slip. accumulatingFiredPanes() re-emits the cumulative result on each firing — good for dashboards that overwrite. discardingFiredPanes() emits only the delta since the last firing — good for downstream sinks that aggregate further. Pick the wrong mode and your dashboard double-counts.

Windowing applies to the PCollection, not the source. Reading from Pub/Sub does not automatically window your data; you must explicitly apply Window.into(FixedWindows.of(Duration.standardMinutes(5))) before any GroupByKey or Combine.perKey, otherwise the pipeline will fail at submission with a windowing-incompatible error. See Beam windowing docs.

Side Inputs, Side Outputs, and Stateful Processing

Three advanced features round out the Apache Beam programming model and show up regularly in PDE scenarios.

Side inputs are additional read-only PCollections that a ParDo can access during element processing. They are broadcast to every worker. Use side inputs for small reference data — country code lookup tables, currency conversion rates, configuration. The trap is size: a 5 GB side input will exhaust worker memory because each worker materialises it. For large lookups, use BigQuery or Bigtable directly inside the DoFn instead.

In streaming, side inputs can be windowed — your enrichment ParDo reads the most recent window of the side input PCollection. This enables patterns like "join an event stream against a slowly changing lookup that updates every 10 minutes" without rebuilding the pipeline.

Side outputs (called tagged outputs) let a single ParDo emit to multiple PCollections. Define TupleTag<T> instances, declare them on the ParDo with withOutputTags, and call out.get(tag).output(element) inside the DoFn. The classic pattern: one tag for parsed records, one tag for parse failures, one tag for schema-violating records that you route to a dead-letter sink.

Stateful and timely processing is the escape hatch when windowing is not enough. Inside a ParDo on a keyed PCollection, you can declare @StateId fields (ValueState, BagState, MapState, CombiningState) and @TimerId fields. This gives you per-key, per-window mutable state and event-time or processing-time callbacks. Use it for sessionisation logic too custom for session windows, or for fraud detection state machines.

The Apache Beam programming model is portable because the SDK builds a language-neutral pipeline graph (via the Portability Framework's Fn API) and the runner translates it. The same Java or Python code runs on Dataflow, Flink, or Spark by changing one pipeline option: --runner. Memorise this for any "migrate pipeline" exam question. See Beam portability.

Beam SDKs: Java, Python, and Go

The Beam Java SDK is the reference implementation. It is the most feature-complete, has the lowest-overhead Dataflow runner, and is the only SDK with first-class support for every connector. Most production Dataflow shops use Java for high-throughput pipelines.

The Beam Python SDK trails Java by a release or two on new features but covers the entire core model. Python is the natural choice when the pipeline integrates with NumPy, pandas, scikit-learn, or TensorFlow inside DoFns. Performance is meaningfully lower than Java for pure throughput because Python DoFns run inside a sidecar SDK harness container that communicates with the Dataflow worker over the Fn API.

The Beam Go SDK is production-ready but has a smaller connector ecosystem. It shines for teams already invested in Go and for lightweight ETL where Java JVM memory overhead is unwelcome.

For the exam, the SDK choice rarely matters except in two scenarios. First, ML inference pipelines almost always pick Python because the model libraries live there. Second, any "lowest latency" or "highest throughput" framing usually points to Java. Beam SQL, covered below, is available primarily in Java with limited Python support.

Beam SQL

Beam SQL lets you define transforms using SQL syntax over PCollections that have a schema. It is implemented on top of Apache Calcite. You apply SqlTransform.query("SELECT user, COUNT(*) FROM PCOLLECTION GROUP BY user") to a PCollection and Beam compiles it into a sub-DAG of native transforms.

The win is accessibility: analysts who know SQL can author pipelines without learning the ParDo-Combine-Window vocabulary. The catch is feature coverage — Beam SQL supports most of standard SQL plus windowing extensions like TUMBLE, HOP, and SESSION, but advanced custom DoFn logic still needs to be written as a ParDo.

A common pattern mixes both: ingest and parse with ParDo, normalise schema, then run a Beam SQL block to do the analytical aggregation, then write to BigQuery. Keep in mind that BigQuery itself often does the same job better when the data is already at rest. Beam SQL is most valuable inside streaming pipelines where data is in flight.

Beam SQL only operates on PCollections with an attached schema (Row type or annotated POJOs). Apply setRowSchema() or use SchemaCoder before the SQL transform, otherwise the query parser fails with a cryptic error. See Beam SQL overview.

Common Pitfalls and Trade-offs

Real Beam pipelines fail in predictable ways. The PDE exam loves these.

Hot keys ruin GroupByKey. If 90% of your events have key "unknown", all those elements end up on one worker and the pipeline stalls. Mitigations: use Combine.perKey (which lifts before shuffle), add a salt prefix to keys (unknown-0, unknown-1, ...) and combine in two passes, or filter the unknown bucket out earlier.

Coder mismatches crash the pipeline at runtime, not at construction time. If a DoFn returns a custom POJO and Beam cannot infer a coder, you get Unable to infer Coder at submission. Register a coder explicitly with pipeline.getCoderRegistry().registerCoderForClass(MyPojo.class, AvroCoder.of(MyPojo.class)) or annotate with @DefaultCoder.

Wide fan-out from a single DoFn can choke fusion optimisation. If one input element produces 10,000 output elements and the next stage is a heavy ParDo, Dataflow will fuse them and run the heavy stage on the producing worker without parallelisation. Insert Reshuffle.viaRandomKey() between them to force a shuffle break and recover parallelism.

Side input size is a silent killer. A 200 MB side input is fine; a 5 GB side input causes OOM on every worker. The error often masks itself as "worker harness died" without obvious cause. Profile side input size before deploying.

Streaming pipelines without drain plans become migration disasters. When you need to update a streaming pipeline, you must either update in place (which requires a compatible job graph) or drain and restart (which requires the new pipeline to handle the gap). Plan for this on day one.

Reading the same source twice in one pipeline does not deduplicate. If you pipeline.apply(TextIO.read().from(...)) twice, you get two independent reads and twice the data. To branch a stream, read once into a PCollection and apply multiple downstream transforms to that single PCollection.

Best Practices

Production-grade Beam pipelines tend to converge on the same patterns regardless of the team writing them.

  • Keep DoFns small and single-purpose; compose with linear chains rather than building one giant DoFn.
  • Use Combine whenever possible instead of GroupByKey-then-reduce. The pre-shuffle combiner lifting can cut shuffle volume by orders of magnitude.
  • Always emit failures to a dead-letter sink via tagged outputs. Crashing the pipeline on bad data is rarely the right behaviour for streaming.
  • Set explicit coders on custom types. Inferred coders work until they do not.
  • For streaming, set --enableStreamingEngine on Dataflow. State management gets simpler and rescaling becomes free.
  • Test with the Direct Runner using TestStream for streaming pipelines — it lets you inject elements with controlled timestamps and watermark advances.
  • Use composite transforms for any logic that exceeds three chained operations. The execution graph UI becomes readable and the unit tests get cleaner.
  • Instrument with Metrics.counter() and Metrics.distribution() inside DoFns. These surface in Cloud Monitoring and are your only window into worker-level behaviour.

Real-World Use Case

A mid-sized ad-tech company processes 8 billion bid request events per day across two regions. They use the Apache Beam programming model to run a single Java codebase as both a backfill batch job (replaying S3 archives during model retraining) and a 24/7 streaming job (live bidding telemetry from Pub/Sub).

The pipeline reads from Pub/Sub, parses Avro into a BidRequest POJO with a registered AvroCoder, drops malformed records to a dead-letter Pub/Sub topic via tagged outputs, enriches each record with a windowed side input of advertiser configuration (refreshed every 5 minutes from BigQuery), applies fixed 1-minute windows with AfterWatermark.pastEndOfWindow().withEarlyFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(10))) for sub-minute speculative dashboards, runs Combine.perKey to compute bid-win-rate per advertiser, and writes both to BigQuery (for analytics) and Bigtable (for real-time serving).

When they backfill, the same code runs against Cloud Storage Avro files instead of Pub/Sub. The runner option flips from streaming to batch. Window semantics still apply because the Avro records carry event timestamps. Total engineering cost of supporting batch and streaming: zero extra code. That is the Apache Beam programming model's payoff.

Exam Tips

Expect 3–5 questions on the PDE exam that map directly to Beam concepts. Patterns to recognise:

  • Any question with "process the same logic for batch and streaming" points to Beam's unified model. Avoid answers that suggest separate Spark and Spark Streaming jobs.
  • Any question about late-arriving data needs a trigger answer with withLateFirings and an explicit withAllowedLateness. Without allowedLateness, late elements are dropped.
  • Any "join two streams" question wants CoGroupByKey, not nested ParDos with side inputs (side inputs are for small reference data, not full streams).
  • Any "small reference data refreshed periodically" question wants a windowed side input.
  • "Migrate from on-prem Spark/Flink" usually means write Beam code and pick the appropriate runner — do not propose rewriting in BigQuery SQL unless the question asks about leaving Beam entirely.
  • Hot key questions want Combine.perKey or salting; resist answers that suggest "increase worker count" because more workers do not split a single hot key.
  • "Pipeline cannot scale past N workers" often points to fusion. The answer is Reshuffle.viaRandomKey() to break fusion.
  • Direct Runner is for testing only. Any production answer that proposes Direct Runner is a distractor.

For the Apache Beam programming model questions specifically, internalise the order: Pipeline → PCollection → PTransform → Window → Trigger → Runner. Most exam scenarios trace some path through that sequence.

Frequently Asked Questions (FAQ)

What is the difference between ParDo and Map in Apache Beam?

ParDo is the general-purpose per-element transform; it can emit zero, one, or many output elements per input via out.output() calls inside the DoFn. MapElements is a convenience wrapper for the simpler one-in-one-out case and is implemented on top of ParDo. Use MapElements when you have a clean lambda; reach for ParDo when you need filtering, fan-out, side outputs, state, or timers.

When do I need windowing in Apache Beam?

Windowing is mandatory before any GroupByKey, CoGroupByKey, or Combine.perKey on an unbounded PCollection. Bounded PCollections use the implicit Global Window and do not require explicit windowing unless you specifically want sub-grouping. If you forget to window a streaming PCollection before a grouping operation, the pipeline fails at construction time.

How do watermarks differ from event time and processing time?

Event time is the timestamp embedded in the data — when the event actually happened. Processing time is the wall clock when the worker handles the element. The watermark is the runner's running estimate of "how far event-time progress has advanced," allowing it to decide when a window's results are ready to emit. The watermark always lags processing time and may lag event time too if data is late.

Can I use Apache Beam without Cloud Dataflow?

Yes. Beam is open source and runs on Apache Flink, Apache Spark, Apache Samza, Hazelcast Jet, the local Direct Runner, and others. Dataflow is the Google-managed runner with the deepest GCP integration and the only one with managed autoscaling. For the PDE exam, assume Dataflow unless told otherwise.

What is the difference between Combine.globally and Combine.perKey?

Combine.globally reduces an entire PCollection (or window thereof) to a single value — total event count, global average, single top-N list. Combine.perKey operates on a PCollection<KV<K, V>> and produces one reduced value per key. Both benefit from combiner lifting, where Beam runs partial aggregations on workers before the shuffle, dramatically reducing data movement.

When should I use a side input versus a CoGroupByKey?

Use a side input when one PCollection is small (under a few hundred MB), changes infrequently, and is broadcast to every worker for lookups. Use CoGroupByKey when both PCollections are large and you want a true distributed join keyed on a shared field. Mixing them up — using a side input on a 10 GB stream — is one of the fastest ways to crash a Dataflow job.

Does Beam SQL replace BigQuery?

No. Beam SQL operates on in-flight PCollections inside a streaming or batch pipeline. BigQuery operates on data at rest in its managed storage. They complement each other: Beam SQL is useful for in-flight aggregation, joins between streams, or pushing simple transforms close to the ingestion path. BigQuery is the right tool once data has landed and you want interactive analytics.

Further Reading

Official sources

More PDE topics