Dataflow Architecture Selection — GCP PDE Study Notes

Introduction to Dataflow Architecture Selection

Dataflow architecture selection is the process of matching an Apache Beam pipeline to the right Dataflow runtime, scheduler, and packaging model on Google Cloud. The decision is not one switch. You pick a service tier (Classic or Prime), a streaming backend (Streaming Engine yes or no), a shuffle backend, a scheduler (regular or FlexRS), a runner harness (Runner v2 or legacy), and a deployment artifact (raw job, classic template, or Flex template). Each choice shifts cost, latency, autoscaling behavior, and operational blast radius.

For the GCP PDE exam and for real production work, getting Dataflow architecture selection right is the difference between a pipeline that scales to millions of events per second on a sane bill, and one that quietly burns money or lags the SLA.

白話文解釋（Plain English Explanation）

Before the deep dive, three analogies that make Dataflow architecture selection click. The point is not to memorize feature names but to feel why each layer exists.

Dataflow as a restaurant kitchen

A regular Dataflow Classic job is like a restaurant where every cook brings their own knives, cutting board, walk-in fridge, and dishwasher. The kitchen works, but each cook uses bench space for storage instead of cooking. Streaming Engine is like moving the walk-in fridge and dishwasher into a shared back-of-house facility. The cooks now have full benches for prep, the kitchen needs fewer hands during slow hours, and storage scales independently of stove capacity.

Dataflow Prime takes this further. Prime is like a kitchen where each station is sized for its actual job. The pastry station gets a small oven, the grill station gets the heavy hood. In Classic, every cook gets the same bench because the worker shape is uniform. In Prime, each stage of the pipeline gets a right-sized worker through Vertical Autoscaling and Right Fitting.

Pipelines as a postal sorting facility

A batch pipeline is like the night shift at a sorting facility. A truck arrives at 2 AM with a known volume of mail, the crew sorts it all, and goes home. You can predict the staffing.

A streaming pipeline is the daytime counter. Mail trickles in unpredictably, and customers expect their package to be scanned within minutes. You need staff on standby, but you do not want a full night-shift crew during a quiet hour. Streaming Engine and Streaming Autoscaling are the staffing manager that sends people home when the queue is short and pulls them back when a surge hits.

FlexRS is the discount express service. The customer agrees the package will arrive within six hours instead of one, and the carrier saves money by routing it on whatever truck has spare capacity. Same destination, same package, lower price, looser SLA.

Templates as a vending machine

A raw Dataflow job submission is like cooking a meal from groceries. You need the kitchen, the recipe, and the time. A classic template is like a frozen dinner. The recipe is locked, the ingredients are pre-measured, and the operator who runs it does not need to know how to cook. The trade-off is that you cannot change the recipe at run time without rebuilding the frozen dinner.

A Flex template is the vending machine that lets the customer pick toppings. The base meal is pre-built into a container image, but the operator can pass dynamic parameters at launch. You get the operational simplicity of templates with most of the flexibility of raw submission.

Core Concepts of Dataflow Architecture Selection

Several moving parts define a Dataflow architecture, and the exam expects you to know what each one does and when it matters.

Service tier: Classic vs Prime

Dataflow Classic is the original service. You pick a machine type, set min and max workers, and Horizontal Autoscaling adjusts worker count based on backlog and CPU. All workers in a job share the same shape.

Dataflow Prime adds Vertical Autoscaling, Right Fitting, and per-stage resource hints. Prime decides not just how many workers you need but what shape each stage should have. A stage that is memory-bound on aggregation gets a memory-heavy worker; a stage that is CPU-bound on parsing gets a CPU-heavy worker. Prime is built on top of Streaming Engine for streaming jobs and Dataflow Shuffle for batch jobs, so you cannot disable those backends when running on Prime.

Streaming Engine

Streaming Engine moves the streaming pipeline state, windowing data, and timer storage off the worker VMs and into a Google-managed backend. Without Streaming Engine, every worker keeps its share of state on local disk. That means scaling out requires reshuffling state across workers, which is slow and error-prone.

With Streaming Engine, workers become close to stateless. Autoscaling reacts in seconds instead of minutes, persistent disks shrink dramatically, and worker upgrades happen without losing watermark progress. Streaming Engine is the default for new streaming jobs and is required for Prime.

Dataflow Shuffle

Dataflow Shuffle is the batch equivalent. The shuffle phase of a batch pipeline (the all-to-all data redistribution between stages) runs in a Google-managed service rather than on the worker VMs. Workers no longer need huge persistent disks to spill shuffle data. Jobs finish faster, autoscaling is more responsive, and worker failures are cheaper because shuffle data survives independently.

FlexRS (Flexible Resource Scheduling)

FlexRS schedules batch jobs against a mix of preemptible and regular VMs, and delays job start by up to six hours to wait for cheaper capacity. The trade-off is that the job is not guaranteed to start immediately. FlexRS suits non-urgent batch workloads, like nightly reprocessing or model training feature builds, where the cost saving (around 40 percent in many regions) is worth a few hours of scheduling slack.

Runner v2

Runner v2 is the unified runner harness based on the portability framework. It supports cross-language pipelines (a Python pipeline calling a Java I/O connector, for example), uses container-based SDK execution, and is required for many newer features including Dataflow Prime, multi-language transforms, and several Beam I/O connectors.

The legacy runner still exists for older Java pipelines, but new development should default to Runner v2. The exam may test that you know Runner v2 is the path forward and is the only way to use cross-language transforms.

The watermark is Dataflow's estimate of event-time progress. It tells the pipeline "we believe all events with timestamps before T have arrived." Windows fire when the watermark crosses their end. Streaming Engine maintains watermark state in the managed backend so it survives worker scale events. See https://cloud.google.com/dataflow/docs/concepts/streaming-pipelines

Architecture and Design Patterns

Dataflow architectures follow a few recurring shapes. Recognizing them on the exam saves you time.

The streaming ingestion pipeline

Pub/Sub feeds events into a Dataflow streaming job. The job parses, validates, applies windowing, and writes to BigQuery via the Storage Write API for analytics, plus Cloud Storage for archival. Streaming Engine is on. Autoscaling caps are tuned to the expected peak. A dead-letter Pub/Sub topic catches malformed messages.

This is the canonical PDE answer for "real-time analytics" questions. The Storage Write API beats the older legacy streaming inserts on cost, throughput, and exactly-once semantics, and pairs naturally with Dataflow.

The batch ETL pipeline

A scheduled trigger (Cloud Scheduler, Composer, or Workflows) launches a Dataflow batch job from a Flex template. The job reads from Cloud Storage (Avro or Parquet), joins reference data from BigQuery, transforms, and writes back to BigQuery partitioned tables. Dataflow Shuffle is on. For non-urgent runs, FlexRS shaves the bill.

The change data capture (CDC) pipeline

A Datastream job lands change events into Cloud Storage or Pub/Sub. A Dataflow streaming job consumes the changes, applies ordering and deduplication, and merges into BigQuery using the Storage Write API in upsert mode. This is a common modernization pattern for replicating operational databases into the analytics warehouse.

The ML feature pipeline

A Dataflow job reads raw events, computes features, and writes to a feature store (Vertex AI Feature Store) or directly to BigQuery for offline training. Dataflow ML extends this pattern with the RunInference transform, letting the same pipeline call models for online scoring without separate serving infrastructure.

For new streaming pipelines on GCP, default to: Streaming Engine on, Runner v2, Storage Write API for BigQuery sink, Pub/Sub source. Deviate only with a documented reason. See https://cloud.google.com/dataflow/docs/streaming-engine

GCP Service Deep Dive

This section unpacks the connectors, packaging models, and ML extensions that show up in PDE scenarios.

Source and sink connectors

Pub/Sub is the streaming source of choice on GCP. The Beam PubsubIO connector handles message acknowledgement, watermark tracking, and at-least-once delivery. For exactly-once processing, combine Pub/Sub with idempotent sinks or use the Storage Write API's offset-based semantics on the BigQuery side.

BigQuery as a sink offers two paths. The legacy streaming inserts API is older and more expensive. The Storage Write API is newer, cheaper per row, supports exactly-once via stream offsets, and integrates with Dataflow's BigQueryIO via the STORAGE_WRITE_API or STORAGE_API_AT_LEAST_ONCE methods. The PDE exam now expects Storage Write API as the default answer.

Cloud Storage is the universal source and sink. For batch, use file-based formats: Avro for row-oriented streaming, Parquet for analytical columnar layouts, and TextIO for JSON or CSV. Watch out for the small-files problem; thousands of tiny files cripple parallelism. Combine small files upstream or use FileIO's continuous match for streaming file ingestion.

Spanner, Bigtable, Cloud SQL, and Firestore all have Beam connectors. Bigtable in particular is a frequent pairing for time-series analytics built on Dataflow.

Dataflow templates: classic vs Flex

A classic template is a JSON serialization of the pipeline graph plus parameter metadata, stored in Cloud Storage. The pipeline structure is fixed at template build time. Operators launch the template with runtime parameters via the gcloud CLI, REST API, or Cloud Scheduler. Classic templates only support Java and Python via specific staging steps and have a limited set of runtime parameter types (ValueProvider).

A Flex template wraps the pipeline in a container image, stores image and metadata in Artifact Registry and Cloud Storage respectively, and supports any pipeline structure including conditional branches based on runtime parameters. The launch path goes through a launcher container that submits the job. Flex templates are the recommended modern choice; classic templates persist mainly for legacy reasons and for the Google-provided template gallery.

Use Google-provided templates (Pub/Sub to BigQuery, GCS to BigQuery, JDBC to BigQuery, etc.) as a starting point. They handle dead-lettering, schema evolution, and error metrics out of the box. Fork only when you need custom logic. See https://cloud.google.com/dataflow/docs/guides/templates/provided-templates

Dataflow ML

Dataflow ML is the umbrella for machine-learning patterns on Dataflow. The headline feature is the RunInference transform, which takes a Beam PCollection and a model handler (TensorFlow, PyTorch, scikit-learn, or remote endpoint) and returns predictions. RunInference handles model loading, batching, and worker affinity efficiently.

Dataflow ML also includes MLTransform for feature preprocessing (vocabulary computation, scaling, embedding generation) with consistent semantics between training and serving pipelines. Combined with Vertex AI for training and serving, Dataflow ML lets you keep feature engineering in one place.

Runner v2 and cross-language

Runner v2 enables multi-language pipelines. A Python pipeline can use a Java-only I/O connector via the expansion service, and vice versa. This matters because some connectors (especially newer ones) are Java-first or Python-first. Runner v2 is also where future Dataflow features land first.

A Dataflow Prime feature that lets you assign per-stage resource hints (memory, CPU, accelerator) so that each stage of the pipeline runs on a worker shape sized for its needs, rather than forcing a one-size-fits-all worker for the whole job. See https://cloud.google.com/dataflow/docs/guides/right-fitting

Common Pitfalls and Trade-offs

Dataflow gives you many knobs. Several combinations look reasonable but cause real pain in production.

Hot keys and skewed shuffles

A streaming job that groups by user_id will eventually meet a user that generates 1000x the traffic of the median. That hot key serializes processing through a single worker thread. Symptoms: one worker pinned at 100 percent CPU, watermark stuck, the rest of the fleet idle. Fixes: salt the key with a random suffix and re-aggregate, use Combine.perKey with a hot-key fanout, or pre-aggregate upstream.

Streaming Engine cost surprises

Streaming Engine charges per processed-data unit on top of worker compute. For high-throughput pipelines, this line item can exceed the worker cost. Always model the Streaming Engine charge against the worker savings before assuming it is a free win. For very high throughput, cost-optimized custom worker sizing on Classic without Streaming Engine sometimes wins, though you give up the autoscaling responsiveness.

Autoscaling oscillation

Aggressive max-worker caps with bursty inputs cause autoscaling to thrash: scale up, drain backlog, scale down, backlog returns, scale up again. Each cycle costs startup time. Tune by smoothing the input (Pub/Sub flow control), setting realistic min-worker floors, and letting the autoscaler see a stable backlog signal.

FlexRS misuse

Putting a latency-sensitive job on FlexRS to save money fails the SLA the first time the scheduler delays the start by four hours. FlexRS is for jobs where "starts within six hours, completes whenever" is acceptable. Anything user-facing or downstream-blocking does not qualify.

Do not enable Streaming Engine on a job that has not been tested with it. Streaming Engine changes state-handling semantics. Migrating an existing stateful job sometimes requires draining and restarting from a fresh state, not a simple in-place update. See https://cloud.google.com/dataflow/docs/streaming-engine

Template parameter rigidity

Classic templates lock pipeline structure at build time. A team that ships a classic template and then needs a conditional branch (skip enrichment for trusted sources) will rebuild the template. Flex templates avoid this pain at the cost of a slightly more complex build process.

Pub/Sub ack deadline mismatch

If a Dataflow stage downstream of PubsubIO is slow, Pub/Sub may redeliver messages before the pipeline acknowledges them, causing duplicates. Tune the subscription's ack deadline and the pipeline's processing time budget together, and rely on idempotent sinks for the safety net.

Best Practices

A short, opinionated list that maps to PDE-grade architectures.

Default to Streaming Engine, Runner v2, and the BigQuery Storage Write API for new streaming pipelines. The combination is faster, cheaper at scale, and matches the current Google guidance.
Use Flex templates rather than classic templates for new operational pipelines. The container model integrates with CI/CD and Artifact Registry, and parameter handling is more flexible.
Set realistic min and max worker counts. Min = 1 with no floor causes cold-start latency spikes. Max = unlimited burns money during a runaway backlog.
Measure shuffle and Streaming Engine line items separately on the bill. Cost surprises usually live there, not in worker compute.
Build dead-lettering from day one. A streaming pipeline without a dead-letter path will eventually choke on a malformed message and stall the watermark.
Use FlexRS for non-urgent batch only. Tag those jobs explicitly so operators understand the SLA.
For Dataflow Prime, start with a small pilot. Vertical Autoscaling reshapes workers across stages, and you want to validate the per-stage cost profile before broad rollout.
Keep pipelines idempotent at the sink. Combined with the Storage Write API's exactly-once semantics on BigQuery, this is the safest path to reliable analytics.

Real-World Use Case

A mid-sized fintech (about 600 employees, 20 million end users) needed to migrate fraud detection from a self-managed Spark-on-Kubernetes cluster to a managed GCP architecture. The team chose Dataflow after a two-week spike against Dataproc.

The architecture: Pub/Sub ingests transaction events at peak 80,000 events per second. A Dataflow streaming job on Streaming Engine + Runner v2 parses, enriches with account metadata from Bigtable, applies sliding windows for velocity features, and uses RunInference (Dataflow ML) to call a fraud model loaded into the worker container. Scored events go to BigQuery via the Storage Write API for analytics, and high-risk events route to a separate Pub/Sub topic that triggers a Cloud Run function for real-time customer notifications.

A nightly batch Dataflow job rebuilds account-level aggregates. That job runs on FlexRS to cut compute cost by about 40 percent, since the batch output is only consumed by morning dashboards.

Both pipelines deploy as Flex templates from a CI/CD pipeline. Dataflow Prime is enabled on the streaming job after a two-month pilot showed Vertical Autoscaling cut the bill another 18 percent by right-sizing the enrichment stage, which had been over-provisioned in Classic.

Operationally, the team replaced a four-engineer rotation maintaining the Spark cluster with a one-engineer on-call. The Dataflow architecture selection decisions (Prime, Streaming Engine, FlexRS for batch, Flex templates) were the load-bearing parts of the cost and ops story.

Decision Matrix: Dataflow vs Dataproc vs Cloud Functions

Picking Dataflow is half the work. Knowing when not to pick it matters just as much.

Choose Dataflow when

The workload is a true data pipeline (transform, aggregate, enrich, join), the team wants serverless and unified batch + streaming, the source/sink mix is GCP-native (Pub/Sub, BigQuery, GCS, Bigtable), and you want the operational profile of a managed service. Dataflow shines when throughput is high, latency targets are minutes-to-seconds, and the team values not running a cluster.

Choose Dataproc when

You are migrating an existing Hadoop or Spark workload and rewriting it as a Beam pipeline is not justified. You need Hive, HBase, Presto, or other Hadoop-ecosystem tools. You want to control cluster shape tightly (GPU-heavy clusters, specific Spark versions, custom JARs). Dataproc Serverless narrows the gap by offering serverless Spark, but Dataflow remains the better default for greenfield streaming.

Choose Cloud Functions (or Cloud Run) when

The work is event-driven and short-lived, fits in a single function invocation, and does not require cross-event state or windowing. Examples: resize an image when uploaded to GCS, write a Firestore record when Pub/Sub fires, call a webhook on a BigQuery job completion. The moment you need windowing, joining streams, or aggregations across events, you cross into Dataflow territory.

Choose BigQuery (alone) when

The transformation is expressible as SQL on data already landed in BigQuery. Scheduled queries, materialized views, and BigQuery's BI Engine cover a surprising amount of "ETL" without any Dataflow. The PDE exam often tests this: not every transformation needs a pipeline.

Choose Datastream + Dataflow when

You need CDC from operational databases (MySQL, PostgreSQL, Oracle, SQL Server) into BigQuery. Datastream handles the source connector and change capture; Dataflow (or BigQuery's built-in Datastream integration for simple cases) handles merging.

The exam loves trick questions where the right answer is "use BigQuery scheduled query" or "use Cloud Function," not Dataflow. Read the scenario for state, windowing, joins across streams, or sustained throughput. If none of those are present, Dataflow is probably overkill. See https://cloud.google.com/architecture/data-lifecycle-cloud-platform

Exam Tips

A focused checklist for PDE candidates.

Streaming questions almost always want Streaming Engine, Runner v2, Pub/Sub source, and BigQuery Storage Write API sink. Memorize this stack.

When the question mentions Hadoop, Spark, Hive, HBase, or "lift and shift," answer Dataproc, not Dataflow. When the question mentions cost-sensitive non-urgent batch, FlexRS is the keyword.

Dataflow Prime is the answer when the question stresses per-stage resource heterogeneity, vertical autoscaling, or right-sizing different stages of the same pipeline. Prime requires Streaming Engine for streaming and Shuffle for batch.

Templates: classic for the existing Google-provided gallery and simple parameterized launches; Flex for new operational pipelines, especially when CI/CD packaging matters.

Watermarks, windowing, and triggers are heavily tested. Know fixed, sliding, and session windows, and the difference between event time and processing time. Streaming Engine handles state for these constructs.

For exactly-once semantics on the BigQuery side, the answer is the Storage Write API in pending or buffered mode with stream offsets. The legacy streaming inserts API is at-least-once.

When the scenario describes a single event triggering a tiny computation, the answer is Cloud Functions or Cloud Run, not Dataflow. Watch for sustained throughput cues to switch to Dataflow.

Streaming Engine is required for Dataflow Prime on streaming jobs. Dataflow Shuffle is required for Prime on batch jobs. You cannot run Prime without these managed backends. See https://cloud.google.com/dataflow/docs/guides/enable-dataflow-prime

Frequently Asked Questions (FAQ)

Is Dataflow Prime always cheaper than Classic?

Not always. Prime adds vertical autoscaling and right fitting, which usually save money on heterogeneous pipelines where stages have different resource shapes. For uniform pipelines (one big map-heavy stage), the Classic flat shape can be cheaper. Always benchmark on a representative workload before flipping the switch fleet-wide.

When should I use the Storage Write API instead of legacy streaming inserts?

Almost always for new pipelines. The Storage Write API is cheaper per row, supports exactly-once via stream offsets, and integrates cleanly with Dataflow's BigQueryIO. Legacy streaming inserts remain for backward compatibility with older clients but are no longer the recommended choice.

Can I run Dataflow without Streaming Engine for streaming jobs?

Yes, but you give up most of the operational benefits: faster autoscaling, smaller persistent disks, smoother worker upgrades. The only common reason to disable Streaming Engine is extreme cost optimization on a very high-throughput, stable pipeline where the Streaming Engine line item dominates and you can size workers manually. Dataflow Prime requires Streaming Engine, so disabling it forces you onto Classic.

What is the difference between a classic template and a Flex template in practice?

A classic template freezes the pipeline graph at build time and exposes a fixed parameter set (ValueProvider). A Flex template ships the entire pipeline as a container image and runs a launcher container that builds and submits the job per launch, allowing dynamic pipeline structure based on parameters. For new operational pipelines, default to Flex templates.

How does FlexRS actually save money?

FlexRS schedules batch jobs against a mix of preemptible and regular VMs, and delays job start by up to six hours to wait for cheap capacity. Typical savings are around 40 percent versus an immediate-start regular VM batch job. The trade-off is the start delay and the slightly higher chance of preemption-driven retries during execution.

When should I pick Dataproc over Dataflow?

When you are migrating existing Spark or Hadoop code and rewriting as Beam is not worth the effort, when you need Hadoop-ecosystem tools (Hive, HBase, Presto), or when you need fine-grained cluster control (specific Spark versions, GPU shapes, custom Hadoop config). For greenfield streaming and unified batch + streaming with managed operations, Dataflow remains the default.

Does Runner v2 affect job cost or performance?

Runner v2 is the modern unified harness and is the default for new pipelines. Performance is comparable to the legacy runner for most workloads and superior for cross-language and container-based jobs. Some specific legacy Java pipelines run slightly faster on the legacy runner, but Runner v2 is required for many new features including Dataflow Prime and multi-language transforms.

Batch vs Streaming Design covers how to decide between batch and streaming pipelines, which sits upstream of Dataflow architecture selection.
Dataproc Modernization Strategies digs into when Dataproc beats Dataflow and how to migrate Hadoop/Spark workloads.
BigQuery Data Modeling and Clustering is the natural sink for most Dataflow pipelines and shapes how you write from Dataflow.

Introduction to Dataflow Architecture Selection

白話文解釋（Plain English Explanation）

Dataflow as a restaurant kitchen

Pipelines as a postal sorting facility

Templates as a vending machine

Core Concepts of Dataflow Architecture Selection

Service tier: Classic vs Prime

Streaming Engine

Dataflow Shuffle

FlexRS (Flexible Resource Scheduling)

Runner v2

Architecture and Design Patterns

The streaming ingestion pipeline

The batch ETL pipeline

The change data capture (CDC) pipeline

The ML feature pipeline

GCP Service Deep Dive

Source and sink connectors

Dataflow templates: classic vs Flex

Dataflow ML

Runner v2 and cross-language

Common Pitfalls and Trade-offs

Hot keys and skewed shuffles

Streaming Engine cost surprises

Autoscaling oscillation

FlexRS misuse

Template parameter rigidity

Pub/Sub ack deadline mismatch

Best Practices

Real-World Use Case

Decision Matrix: Dataflow vs Dataproc vs Cloud Functions

Choose Dataflow when

Choose Dataproc when

Choose Cloud Functions (or Cloud Run) when

Choose BigQuery (alone) when

Choose Datastream + Dataflow when

Exam Tips

Frequently Asked Questions (FAQ)

Is Dataflow Prime always cheaper than Classic?

When should I use the Storage Write API instead of legacy streaming inserts?

Can I run Dataflow without Streaming Engine for streaming jobs?

What is the difference between a classic template and a Flex template in practice?

How does FlexRS actually save money?

When should I pick Dataproc over Dataflow?

Does Runner v2 affect job cost or performance?

Related Topics

Further Reading

Official sources

More PDE topics