Introduction to Batch vs Streaming Pipeline Design
Choosing between batch and streaming is the first architectural fork most data engineers hit on a new project, and it is a recurring theme on the Google Cloud Professional Data Engineer exam. The decision is rarely about which mode is "better." It is about matching the shape of your data, the freshness your users need, and the budget you have to the right Google Cloud primitive. Get this wrong and you either burn money keeping idle workers warm for data that arrives once a day, or you make stakeholders wait twelve hours for a number they need every five minutes.
This note walks through bounded versus unbounded data, the Lambda and Kappa patterns, micro-batching, the latency-throughput trade space, exactly-once semantics, replay, the difference between Dataflow's two execution modes, the choice between Pub/Sub plus BigQuery streaming inserts and scheduled batch loads, and what changed when the Storage Write API replaced the legacy tabledata.insertAll. It closes with a decision framework you can apply during the exam and on real projects.
Plain English Explanation
Before stacking up Beam transforms and IAM policies, it helps to ground the vocabulary in scenes you already understand. The three analogies below come from outside the data world on purpose. They are designed to give you intuitive anchors you can pull on when an exam question wraps the same concept in cloud jargon.
The kitchen during dinner service
Picture a small bistro at 7pm. The prep cook spent the afternoon making stock, dicing onions, and portioning sauces. That is a batch job. The ingredients are bounded, the deadline is known, and the work happens in one big push so the kitchen can get on with the rest of the day. Then service begins. Tickets print one at a time, the line cook reads each one, fires the proteins, and plates dishes as the orders flow in. That is streaming. The ticket printer never tells the kitchen "this is the last order tonight," so the cooks have to keep their stations hot and their hands ready.
Notice that the bistro does not pick one mode. The afternoon prep makes the evening service possible, because cooking stock to order would take three hours and no diner waits that long. Real architectures work the same way: precomputed batch tables feed low-latency stream processors that join, filter, and decorate live events.
The post office and a courier
A national post office picks up letters from every blue mailbox once a day. The truck does one loop, the sorting facility processes everything in one overnight shift, and delivery happens the next morning. Cost per letter is tiny, but if you mail something on Saturday it might not move until Monday. That is batch.
A bike courier across the street works on demand. You hand them a package, they hop on the bike, and the package is across town in twenty minutes. Cost per delivery is much higher, but the latency is measured in minutes, not days. That is streaming. When you size your pipeline you are essentially asking "is the value of this data worth courier pricing, or can it wait for the truck?"
The power grid
Electricity utilities run two kinds of generation side by side. Baseload plants like nuclear and large gas turbines run continuously at a steady output, planned weeks ahead. Peaker plants spin up in seconds when demand spikes during a heatwave. The grid operator does not choose one over the other; it tunes the mix so that the cheap, predictable generation handles the bulk of the load and the expensive, fast generation covers the surges. A well-designed data platform makes the same trade. Cheap nightly batch jobs cover the bulk of analytics, and a smaller, more expensive streaming pipeline handles the events that need to act now.
Core Concepts of Batch vs Streaming Pipeline Design
The terminology around batch and streaming gets sloppy in marketing copy, so it is worth pinning the words down before going further.
Bounded versus unbounded data
A dataset is bounded when its size is finite and known when the job starts. A CSV file in Cloud Storage, a Hive partition for the previous day, or a snapshot of a Cloud SQL table are all bounded. The runner can plan resource usage, calculate progress, and signal completion when the last record is processed.
A dataset is unbounded when records keep arriving with no end in sight. Click events from a website, IoT telemetry from a fleet of trucks, or transactions from a payment switch are unbounded. There is no "end of file" sentinel, so the system has to invent its own notion of progress using event time and watermarks.
The distinction is more than academic. Beam and Dataflow let you write the same PCollection transforms against either kind of source, but the runner picks a different execution strategy underneath. A bounded pipeline can finish; an unbounded one runs until you cancel it.
A watermark is the runner's heuristic estimate of "we have probably seen all events with an event-time on or before this point." It powers window closing and late-data handling. See the official explanation at https://cloud.google.com/dataflow/docs/concepts/streaming-pipelines#watermarks.
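To make the watermark idea concrete, here is a minimal sketch in plain Python. It is a toy model, not Dataflow's actual algorithm: the watermark is assumed to be "newest event time seen minus a fixed skew," and the names WatermarkTracker, classify, and allowed_skew are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WatermarkTracker:
    """Toy heuristic: watermark = newest event time seen minus an allowed skew."""

    allowed_skew: int  # assumed tolerance for out-of-order events, in seconds
    max_event_time: Optional[int] = None

    def observe(self, event_time: int) -> None:
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time

    @property
    def watermark(self) -> Optional[int]:
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.allowed_skew


def classify(event_times, window_size, allowed_skew):
    """Split events into on-time and late, using fixed event-time windows."""
    tracker = WatermarkTracker(allowed_skew)
    on_time, late = [], []
    for t in event_times:
        window_start = t - t % window_size
        window_end = window_start + window_size
        wm = tracker.watermark
        # A window counts as closed once the watermark has reached its end.
        if wm is not None and wm >= window_end:
            late.append((window_start, t))
        else:
            on_time.append((window_start, t))
        tracker.observe(t)
    return on_time, late


on_time, late = classify([1, 4, 12, 3, 25, 8], window_size=10, allowed_skew=5)
print(late)  # the event at t=8 arrives after its window [0, 10) has closed
```

The event at t=3 arrives out of order but is still on time because the watermark has not yet passed the end of its window. The event at t=8 is late because by then the watermark has advanced to 20.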
Micro-batch is a third option, not a compromise
Some frameworks (notably Spark Structured Streaming and BigQuery's scheduled queries running every few minutes) implement streaming as a series of small batches. The runner accumulates events for a short interval, runs a bounded job over that slice, then immediately starts the next one. Throughput is excellent, code is simple, and operational tooling is the same as batch. The trade-off is that end-to-end latency floors at one micro-batch interval, typically tens of seconds to a few minutes. For dashboards refreshed every two minutes this is fine; for fraud detection that must react in under a second it is not.
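The latency floor can be shown with a few lines of arithmetic. The sketch below assumes a simplified scheduler in which batch k covers the interval [k * interval, (k + 1) * interval) and starts processing the moment the interval closes. Processing time itself is ignored, and the function name is hypothetical.

```python
def micro_batch_wait(arrival_times, interval):
    """Seconds each event waits before its micro-batch even starts processing."""
    waits = []
    for t in arrival_times:
        batch_close = (t // interval + 1) * interval
        waits.append(batch_close - t)
    return waits


# Events arriving at various points inside 60-second micro-batches.
waits = micro_batch_wait([0, 30, 59, 60, 119], interval=60)
worst_case = max(waits)
```

An event that lands just after a slice opens waits almost the full interval, which is why a sub-second requirement rules this design out no matter how fast each slice runs.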
Latency, throughput, and cost form a triangle
Engineers often talk about latency and throughput as a pair, but cost is the third corner of the triangle and it usually decides the design. Streaming pipelines on Dataflow keep workers warm 24/7 so that an event arriving at 03:14 is processed in seconds. A nightly batch job spins workers up, processes a billion rows in two hours, and releases them. If you take a workload that genuinely needs only those two hours a day and put it on a streaming runner, you pay for 22 hours of idle compute every day for no benefit.
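The idle-compute arithmetic is easy to check. This sketch assumes, purely for illustration, that both designs use the same hypothetical pool of ten workers; real streaming and batch jobs would rarely be sized identically.

```python
def daily_worker_hours(workers, hours_running):
    """Worker-hours billed per day for a pool that is up for hours_running."""
    return workers * hours_running


WORKERS = 10  # hypothetical pool size, assumed equal for both designs

batch_hours = daily_worker_hours(WORKERS, hours_running=2)       # nightly job
streaming_hours = daily_worker_hours(WORKERS, hours_running=24)  # always on

idle_hours = streaming_hours - batch_hours       # worker-hours doing nothing
idle_fraction = idle_hours / streaming_hours     # share of the bill that is idle
```

Under these assumptions, more than nine tenths of the always-on bill buys idle capacity.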
Before defaulting to streaming, ask "what does the business do with data that is one hour fresh versus data that is one minute fresh?" If the answer is "nothing different," batch is almost always the right call. https://cloud.google.com/architecture/data-lifecycle-cloud-platform
Architecture and Design Patterns
Two named patterns dominate the conversation about mixing batch and streaming. Both come from the pre-cloud era but they still describe how teams structure GCP pipelines today.
Lambda architecture
Lambda, coined by Nathan Marz, splits the platform into three layers. A batch layer reprocesses the entire historical dataset on a schedule and produces authoritative views. A speed layer processes the live stream and produces approximate, recent views. A serving layer merges the two so queries see fresh data backed by accurate history. The pattern is forgiving because mistakes in the speed layer get corrected by the next batch run. The cost is that you maintain two codebases, two deployment pipelines, and two monitoring stacks for what is conceptually the same logic.
On GCP, Lambda often shows up as Dataflow batch jobs writing to BigQuery partitioned tables, plus Dataflow streaming jobs writing to a separate "live" table or to Bigtable, with a view in BigQuery UNION ALL-ing the two. It works, but the duplication wears teams down.
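The serving-layer merge can be sketched in plain Python. This models the idea behind the combined view, not BigQuery itself. The batch_cutoff filter is an assumption added here so that rows covered by both layers are not counted twice; the function and field names are hypothetical.

```python
def serving_view(batch_view, speed_view, batch_cutoff):
    """Batch rows are authoritative up to batch_cutoff; speed rows cover the rest."""
    merged = [row for row in batch_view if row["event_time"] <= batch_cutoff]
    merged += [row for row in speed_view if row["event_time"] > batch_cutoff]
    return sorted(merged, key=lambda row: row["event_time"])


batch_view = [
    {"event_time": 1, "amount": 10},
    {"event_time": 2, "amount": 20},  # corrected value from the batch layer
]
speed_view = [
    {"event_time": 2, "amount": 19},  # approximate value, superseded by batch
    {"event_time": 3, "amount": 30},  # not yet seen by the batch layer
]

view = serving_view(batch_view, speed_view, batch_cutoff=2)
total = sum(row["amount"] for row in view)
```

The speed layer's approximate row for event time 2 drops out as soon as the batch layer has covered that period, which is the self-correcting behaviour the pattern is known for.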
Kappa architecture
Jay Kreps proposed Kappa as a simplification. There is one pipeline. It reads from a durable log (Kafka or Pub/Sub with a long retention window) and processes events as a stream. When the logic changes, you replay the log from the beginning into a new instance of the pipeline, switch consumers to the new output table, and decommission the old one. There is one codebase, one deployment, one monitoring story. The catch is that your log must hold enough history to reprocess everything you care about, which can mean weeks or months of retention.
Pub/Sub now supports message retention up to 31 days and Pub/Sub Lite supports much longer windows, which makes Kappa practical for many workloads on GCP. For workloads that need years of replay, teams often combine Pub/Sub for live ingestion with a Cloud Storage archive that the same Beam pipeline can read in batch mode for historical reprocessing.
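The replay-and-switch sequence can be sketched as follows. The list stands in for a durable log, the dictionary for a set of output tables, and every name here (run_pipeline, logic_v1, logic_v2, output_v1, output_v2) is hypothetical.

```python
def run_pipeline(log, transform):
    """Process the log from its first offset and return a fresh output table."""
    return [transform(event) for event in log]


log = [{"user": "a", "cents": 1250}, {"user": "b", "cents": 375}]


def logic_v1(event):
    # Original logic with a bug: integer division truncates the amount.
    return {"user": event["user"], "amount": event["cents"] // 100}


def logic_v2(event):
    # Fixed logic.
    return {"user": event["user"], "amount": event["cents"] / 100}


tables = {"output_v1": run_pipeline(log, logic_v1)}
active = "output_v1"

# Logic changed: replay the whole log into a new table, then switch consumers.
tables["output_v2"] = run_pipeline(log, logic_v2)
active = "output_v2"
del tables["output_v1"]  # decommission the old output
```

Nothing in the old table is patched in place. The new table is built entirely from the log, which is why the log's retention window bounds how much history you can correct.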
The unified model in Apache Beam
Beam was built specifically to dissolve the Lambda dichotomy. The same PCollection and PTransform code runs against bounded sources in batch mode and unbounded sources in streaming mode. Windowing, watermarks, and triggers are first-class concepts in batch too; a batch run just sees a single global window by default. This is why so many GCP shops standardise on Dataflow even when their workload is mostly batch: the day they need to add a streaming consumer, they do not have to rewrite anything.
When you write Beam transforms, avoid baking assumptions about boundedness into your DoFns. In the Java SDK, check the input PCollection's isBounded() value if you need to branch, but most logic should be agnostic so the same code can serve both modes. https://cloud.google.com/dataflow/docs/concepts/beam-programming-model
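As an analogy in plain Python (this is not the Beam API), a per-element generator makes no assumption about whether its input ever ends, so the same function serves a finite list and an endless stream. The names enrich and endless_events are hypothetical.

```python
import itertools


def enrich(events):
    """Per-element logic that never asks how many events there are."""
    for event in events:
        yield {**event, "is_large": event["amount"] >= 100}


# Bounded input: a finite list, as a batch source would provide.
bounded = [{"amount": 50}, {"amount": 150}]
batch_out = list(enrich(bounded))


# Unbounded input: a generator that never terminates on its own.
def endless_events():
    for i in itertools.count():
        yield {"amount": i * 60}


# Take the first three results; the stream itself would keep going.
stream_out = list(itertools.islice(enrich(endless_events()), 3))
```

Logic that calls len() on its input, sorts all of it, or waits for the last record would work on the list and hang on the generator, which is the kind of hidden boundedness assumption to keep out of transforms.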
GCP Service Deep Dive
The exam expects you to know which managed service does what, and to recognise when the question is steering you toward a specific product. Here is how the main pieces fit together for batch and streaming work.
Dataflow batch versus streaming mode
Dataflow is one product with two execution modes. You select the mode implicitly through the source: a TextIO.read() over a Cloud Storage glob produces a bounded PCollection and the runner picks batch mode; a PubsubIO.readMessages() produces an unbounded PCollection and the runner picks streaming mode. Batch mode uses FlexRS or regular workers, finishes when the input is exhausted, and bills only for the resources actually used. Streaming mode keeps a worker pool warm, uses Streaming Engine to offload state and shuffle from VMs to a managed service, and bills continuously until you drain or cancel the job.
Streaming Engine is worth understanding because the exam loves it. By moving shuffle and state out of the worker VMs, it reduces worker disk and memory needs, speeds up autoscaling reactions, and lets workers be replaced without losing in-flight work. Dataflow Prime extends this with vertical autoscaling and right-fitting, which the exam may mention as the modern default.
Pub/Sub as the universal ingestion bus
Pub/Sub is the default landing zone for events on GCP. It accepts billions of messages per day, durably stores them across multiple zones, and fans them out to many subscribers. Two subscription styles matter: pull subscriptions, where consumers ask for messages and acknowledge them at their own pace, and push subscriptions, where Pub/Sub delivers messages to an HTTPS endpoint or a Cloud Run service. Streaming Dataflow jobs almost always use pull, because pull plays well with backpressure and parallelism.
Pub/Sub also supports exactly-once delivery on pull subscriptions within a region, which materially changes how you write downstream consumers. With it enabled, you can rely on Pub/Sub redelivering only on real failures, instead of designing every consumer to be idempotent against arbitrary duplicates. It does not eliminate the need for idempotency at the sink, but it reduces the duplicate rate by orders of magnitude.
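A minimal sketch of an idempotent sink, assuming every message carries a stable message_id. The IdempotentSink class is hypothetical and keeps its state in memory; a real sink would rely on a keyed store or a bounded deduplication window.

```python
class IdempotentSink:
    """Stores one row per message ID, so redeliveries do not create duplicates."""

    def __init__(self):
        self.rows = {}

    def write(self, message_id, payload):
        """Return True if the row was stored, False if it was a duplicate."""
        if message_id in self.rows:
            return False
        self.rows[message_id] = payload
        return True


sink = IdempotentSink()
deliveries = [("m1", 10), ("m2", 20), ("m1", 10)]  # m1 is redelivered
results = [sink.write(message_id, payload) for message_id, payload in deliveries]
```

With exactly-once delivery enabled upstream, the duplicate branch should fire rarely, but keeping it means a redelivery after a real failure still cannot double-count.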
BigQuery streaming inserts versus batch loads
Loading data into BigQuery has historically meant choosing between two paths. Batch loads via bq load or the jobs.insert API are free of per-row charges, support every file format BigQuery understands, and commit data in atomic chunks. They are perfect for hourly or nightly drops from Cloud Storage. The trade is latency: a load job is a job, and jobs queue, schedule, and complete on their own clock.
Streaming inserts via the legacy tabledata.insertAll method push rows into a streaming buffer that is queryable within seconds. They are billed per gigabyte ingested, have per-table quotas, and historically suffered from a buffer-to-storage transition that complicated DML against recent rows. For real-time dashboards they were the default for years.
The Storage Write API changes the calculus
The Storage Write API is the modern, unified ingestion path. It replaces tabledata.insertAll and, for many workloads, batch load jobs. Alongside a default stream with at-least-once semantics, it exposes three application-created stream types: committed, pending, and buffered. Committed writes appear immediately and are billed per gigabyte. Pending writes stage rows in a stream and only become visible when you finalise and commit the stream, giving you exactly-once batch semantics with streaming-API ergonomics. Buffered writes hold rows back until you flush the stream up to a chosen offset, which gives row-level commit control that is useful for complex CDC patterns.
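The visibility difference between committed and pending writes can be modelled with a toy table. This is not the BigQuery client library: ToyTable and its methods are hypothetical stand-ins, and the real API separates finalising a stream from committing it.

```python
class ToyTable:
    """Toy model of row visibility under committed and pending writes."""

    def __init__(self):
        self.visible = []   # rows a query would see
        self._staged = {}   # pending stream name -> rows not yet visible

    def append_committed(self, rows):
        # Committed-style write: rows are readable as soon as they land.
        self.visible.extend(rows)

    def append_pending(self, stream, rows):
        # Pending-style write: rows are staged and stay invisible for now.
        self._staged.setdefault(stream, []).extend(rows)

    def finalize_and_commit(self, stream):
        # All staged rows of the stream become visible in one step.
        self.visible.extend(self._staged.pop(stream))


table = ToyTable()
table.append_committed([{"id": 1}])
table.append_pending("backfill", [{"id": 2}, {"id": 3}])
visible_before = len(table.visible)

table.finalize_and_commit("backfill")
visible_after = len(table.visible)
```

Because the staged rows appear all at once or not at all, a failed backfill can be abandoned without leaving partial data behind.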
The Storage Write API is cheaper than legacy streaming inserts, supports much higher per-table throughput, and integrates with Dataflow's built-in BigQueryIO so you usually do not write client code against it directly. For new pipelines on the exam, the Storage Write API is the right answer when the question contrasts it with insertAll.
Storage Write API has three modes worth memorising: committed (rows visible immediately, billed per GB, used for streaming dashboards), pending (rows staged until the stream is finalised and committed, giving atomic exactly-once batch loads), and buffered (rows held until flushed, giving row-level commit control for CDC). Pub/Sub message retention tops out at 31 days, and Pub/Sub exactly-once delivery is a per-region, pull-subscription feature, not a global guarantee. https://cloud.google.com/bigquery/docs/write-api
The Storage Write API offers exactly-once semantics when used with stream offsets, which is the feature Dataflow uses under the hood to give you end-to-end exactly-once writes from Pub/Sub through Beam into BigQuery. https://cloud.google.com/bigquery/docs/write-api
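Why offsets give exactly-once appends can be shown with a toy stream that accepts a batch only at the next expected offset. The OffsetStream class is hypothetical and the status strings are illustrative labels, not a reproduction of the client library's responses.

```python
class OffsetStream:
    """Toy write stream that accepts appends only at the next expected offset."""

    def __init__(self):
        self.rows = []

    def append(self, offset, batch):
        expected = len(self.rows)
        if offset < expected:
            # A retry of rows that were already stored: safe to ignore.
            return "ALREADY_EXISTS"
        if offset > expected:
            # A gap: something earlier was never written.
            return "OUT_OF_RANGE"
        self.rows.extend(batch)
        return "OK"


stream = OffsetStream()
first = stream.append(0, ["a", "b"])
retry = stream.append(0, ["a", "b"])   # client timed out and retried
second = stream.append(2, ["c"])
gap = stream.append(5, ["x"])          # skips offsets 3 and 4
```

The writer can retry blindly after a timeout because the stream, not the client, decides whether the rows are new.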
Dataproc and the Spark side of the house
Dataproc runs Hadoop, Spark, Flink, and Presto on GCP. For batch, Dataproc Serverless lets you submit a Spark job and pay only for the runtime, much like Dataflow batch. For streaming, Spark Structured Streaming on Dataproc is a valid choice when your team has existing Spark expertise, but Dataflow is usually the lower-operations option because there is no cluster to manage.
Common Pitfalls and Trade-offs
Some failure modes only show up after a pipeline has been in production for a while. Knowing them ahead of time saves a postmortem.
A team builds a Dataflow streaming job that reads from Pub/Sub and writes to BigQuery using legacy streaming inserts. Months later they want to backfill six months of historical data. They point the same job at a Cloud Storage archive of old messages and the streaming buffer overflows, costs spike, and rows duplicate. The fix: run a separate batch Dataflow job that reads the archive and uses the Storage Write API in pending mode, so the backfill is atomic and avoids the streaming-insert path entirely. https://cloud.google.com/bigquery/docs/write-api
Hot keys are another trap. If 80% of your events have the same user_id because one customer dominates your traffic, then any keyed transform (group-by, session windowing) bottlenecks on a single worker. You see this in the Dataflow UI as one stage stuck at high CPU while the rest idle. The cure is usually salting the key for the heavy aggregation step and unsalting at the end.
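Salting and unsalting can be sketched as a two-stage count. The function name and key names are hypothetical, and the salt is assigned round-robin here only to keep the example deterministic; a production pipeline would more typically pick it at random or from a hash.

```python
from collections import Counter


def salted_count(keys, num_salts):
    """Two-stage count: aggregate by (key, salt), then combine per key."""
    # Stage 1: the heavy aggregation runs on salted keys, so a hot key
    # is spread across num_salts groups instead of one.
    partial = Counter()
    for i, key in enumerate(keys):
        salt = i % num_salts  # round-robin salt, for determinism only
        partial[(key, salt)] += 1

    # Stage 2: strip the salt and combine the much smaller partial results.
    final = Counter()
    for (key, _salt), count in partial.items():
        final[key] += count
    return partial, final


events = ["big_customer"] * 80 + ["small_customer"] * 20
partial, final = salted_count(events, num_salts=4)
largest_group = max(partial.values())
```

With four salts, the largest group any single worker would have to handle shrinks from 80 events to 20, while the final totals are unchanged.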
Watermark stalls happen when a single slow source freezes the global watermark. A Pub/Sub subscription with a backlog of one stuck message can hold the whole pipeline back, because the runner cannot prove that newer events have all arrived. Monitor the data-watermark metric per stage, and use withTimestampAttribute carefully so a misbehaving publisher cannot poison your event time.
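A stall is easier to see once the pipeline watermark is modelled as the minimum across its inputs. This is a simplification of what a real runner tracks, and the function names, source names, and lag threshold are hypothetical.

```python
def pipeline_watermark(source_watermarks):
    """The pipeline can only be as far along as its slowest input."""
    return min(source_watermarks.values())


def lagging_sources(source_watermarks, max_lag):
    """Names of sources whose watermark trails the newest one by more than max_lag."""
    newest = max(source_watermarks.values())
    return sorted(
        name for name, wm in source_watermarks.items() if newest - wm > max_lag
    )


# Event-time watermarks in seconds; source-c is stuck behind one old message.
watermarks = {"source-a": 1000, "source-b": 1005, "source-c": 400}

overall = pipeline_watermark(watermarks)
stuck = lagging_sources(watermarks, max_lag=60)
```

Two healthy inputs do not help: windows cannot close until the slowest input catches up, so alerting on per-source lag finds the culprit faster than watching the overall number.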
Cost surprises usually come from leaving streaming jobs running over weekends, forgetting that a streaming job keeps billing for its warm workers even when traffic is zero, or sending tiny rows into BigQuery legacy streaming inserts, where each row is billed at a minimum size and the charges add up fast. Set up budget alerts on the Dataflow and BigQuery SKUs separately so you see the line item before the invoice.
Best Practices
A short list of the habits that separate pipelines that age well from pipelines that get rewritten in eighteen months.
- Default to batch when the business cadence is daily or longer; only move to streaming when freshness directly drives a decision or revenue line.
- Use the Storage Write API for any new BigQuery ingestion, in committed mode for streaming and pending mode for backfills.
- Pick exactly-once at the system level, not per stage. End-to-end means Pub/Sub exactly-once delivery plus Dataflow exactly-once processing plus Storage Write API stream offsets.
- Keep state small. Persistent timers and very wide aggregations explode Streaming Engine state size and slow down autoscaling.
- Treat your pipeline code as a product. Unit-test transforms with TestPipeline, run integration tests against a small Dataflow job in a sandbox project, and version your pipeline templates.
- Archive raw events to Cloud Storage even if you stream them to BigQuery. Cheap storage today is the only way to enable a Kappa-style replay tomorrow.
- Tag every resource (Dataflow jobs, BigQuery tables, Pub/Sub topics) with the pipeline name so cost attribution and incident response are straightforward.
Real-World Use Case
Consider a mid-sized fintech with about four million end users and a payments switch that emits roughly twenty thousand transaction events per second at peak. The data team has three jobs to do with that stream.
The first is fraud scoring. Each transaction must be scored against a model in under 200 milliseconds so that the switch can approve or decline. This is a streaming workload, end of story. Pub/Sub ingests the events, a Dataflow streaming job in exactly-once mode reads them, calls a Vertex AI online prediction endpoint, and writes the score back to a Bigtable table that the switch reads on the next request. The same job also writes a copy of each scored event to BigQuery through the Storage Write API for analytics.
The second job is a daily settlement report for the operations team. This needs to be exact, must reconcile against the bank's own ledger, and is consumed by a human at 9am the next morning. There is no point streaming this. A scheduled BigQuery job runs at 02:00, joins the previous day's transactions against the ledger drop in Cloud Storage, and produces a settlement table that Looker dashboards read.
The third job is a six-month historical reprocess whenever the fraud model is retrained. The team archives every Pub/Sub message into Cloud Storage with a 24-month retention. When a new model goes live, a batch Dataflow job reads the archive, scores each historical event against the new model, and writes the results to a parallel BigQuery table for backtesting. This is the Kappa pattern in action: one Beam codebase, two execution modes, one source of truth for the events.
The architecture costs roughly half what an equivalent always-on streaming-only design would cost, because the settlement and reprocess work runs on bounded jobs that finish and release their workers.
Exam Tips
The PDE exam phrases batch versus streaming questions in a few recognisable shapes. Knowing the shapes helps you answer in seconds rather than minutes.
When the scenario mentions "real-time dashboard" or "sub-second alerting," the answer almost always involves Pub/Sub plus Dataflow streaming plus either BigQuery (via the Storage Write API) or Bigtable as the sink. The discriminator between BigQuery and Bigtable is query pattern: ad-hoc analytical SQL points to BigQuery, point lookups by key point to Bigtable.
When the scenario mentions "daily report," "monthly aggregation," or "historical analysis," the answer involves Cloud Storage as a landing zone, a batch Dataflow or Dataproc job, and BigQuery as the warehouse. If the question stresses cost optimisation, look for FlexRS or Dataflow Shuffle in the answer choices.
When the question pits tabledata.insertAll against the Storage Write API, the Storage Write API is the modern answer unless the question is specifically asking about legacy behaviour.
When you see "exactly-once" in the question, scan the answers for the combination of Pub/Sub exactly-once delivery, Dataflow streaming in exactly-once mode (the default; at-least-once mode is opt-in), and the Storage Write API. Any answer that relies on application-level dedup logic alone is usually wrong on a modern exam form.
When the question asks about replaying historical data through the same logic, Kappa with Pub/Sub retention or Cloud Storage archive plus Beam in batch mode is the answer. Lambda is rarely the right choice on a new design today, although the exam may still mention it as background.
If two answers look correct but one uses a managed service (Dataflow, BigQuery, Pub/Sub) and the other uses a self-managed equivalent (Spark on GCE, Kafka on GKE), the managed-service answer is almost always what Google wants. https://cloud.google.com/architecture/data-lifecycle-cloud-platform
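The question shapes in this section can be condensed into a small decision function. It is a study aid under stated assumptions, not an official rule: the thresholds (one hour for batch, five minutes for micro-batch) and the name recommend are hypothetical, chosen to mirror the one-hour staleness test used in this note.

```python
def recommend(freshness_seconds, access_pattern):
    """Map a freshness target and a query pattern to a candidate stack.

    access_pattern is "sql" for ad-hoc analytics or "key_lookup" for point reads.
    """
    if access_pattern not in ("sql", "key_lookup"):
        raise ValueError(f"unknown access pattern: {access_pattern}")

    if freshness_seconds >= 3600:
        # An hour or more of staleness is acceptable: bounded jobs win on cost.
        return ["Cloud Storage", "Dataflow batch", "BigQuery"]

    if freshness_seconds >= 300 and access_pattern == "sql":
        # Minutes of lag on an analytical dashboard: micro-batch is often enough.
        return ["BigQuery scheduled queries"]

    # Anything fresher needs a streaming path; the sink follows the query pattern.
    sink = "BigQuery (Storage Write API)" if access_pattern == "sql" else "Bigtable"
    return ["Pub/Sub", "Dataflow streaming", sink]


fraud = recommend(1, "key_lookup")
dashboard = recommend(600, "sql")
settlement = recommend(86400, "sql")
```

Real exam questions add constraints such as existing Spark skills or strict cost ceilings, so treat the output as a first guess to test against the answer choices.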
Frequently Asked Questions
When should I pick streaming over batch on GCP?
Pick streaming when the business value of an event decays in minutes or seconds, not hours. Fraud detection, real-time personalisation, operational alerting, and IoT anomaly detection all qualify. If the downstream consumer is a human looking at a dashboard once a morning, batch is cheaper and simpler. Run a quick thought experiment: if the data were one hour stale, would anyone act differently? If not, batch wins.
What is the difference between Lambda and Kappa architectures in practice?
Lambda runs two pipelines, one batch and one streaming, and merges their outputs at query time. It is forgiving because the batch layer corrects mistakes in the streaming layer, but it doubles your maintenance load. Kappa runs one streaming pipeline against a durable log, and replays the log when logic changes. It is simpler operationally but requires a log that retains enough history. Most new GCP designs lean Kappa because Apache Beam plus Pub/Sub plus Cloud Storage make replay practical.
Should I use BigQuery streaming inserts or the Storage Write API?
Use the Storage Write API. It is the modern unified path, costs less per gigabyte, supports higher throughput, and offers exactly-once semantics through stream offsets. Legacy streaming inserts via tabledata.insertAll still work but are no longer the recommended choice for new pipelines. If you use Dataflow's BigQueryIO, configure it to use the Storage Write API and you get the benefit without writing low-level client code.
How does Dataflow guarantee exactly-once processing?
Dataflow tracks every record through the pipeline using deterministic shuffles and persistent state in Streaming Engine. When a worker fails, the runner replays the in-flight bundle on a new worker and discards the old worker's partial output, so each record is processed exactly once. End-to-end exactly-once also requires the source (Pub/Sub exactly-once delivery) and the sink (Storage Write API stream offsets, or idempotent writes to Bigtable) to cooperate.
What is replay and why does it matter?
Replay means re-running a pipeline against historical events to either fix a bug, test a new logic version, or rebuild a downstream table from scratch. It matters because data pipelines have bugs and models get retrained, and without replay you have to write one-off backfill scripts every time. Designing for replay from day one means archiving raw events to Cloud Storage, keeping pipeline logic deterministic, and making downstream tables idempotent under reprocessing.
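One way to make a downstream table idempotent under reprocessing is to rebuild whole partitions from the archive instead of appending to them. The sketch below uses dictionaries as stand-ins for the archive and the table; rebuild_partition, to_row, and the date key are hypothetical.

```python
def to_row(event):
    """Deterministic transform: the same event always yields the same row."""
    return {"user": event["user"], "amount": event["cents"] / 100}


def rebuild_partition(table, partition, raw_events, transform):
    """Replace the partition wholesale, so reruns never append duplicates."""
    table[partition] = [transform(event) for event in raw_events]


archive = {
    "2024-01-01": [{"user": "a", "cents": 1250}, {"user": "b", "cents": 375}],
}
table = {}

# Run the backfill twice, as would happen after a failed or repeated replay.
for _ in range(2):
    for partition, events in archive.items():
        rebuild_partition(table, partition, events, to_row)

row_count = len(table["2024-01-01"])
```

Because each run overwrites the partition with the same deterministic output, a replay can be repeated as often as needed without inflating row counts.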
Is micro-batching a form of batch or streaming?
It depends on who is selling it. Technically, micro-batching processes unbounded data in small bounded slices, so it sits between the two. Operationally it feels like streaming because the pipeline runs continuously, but the latency floor is set by the slice interval. On GCP, BigQuery scheduled queries running every few minutes are a form of micro-batching (the scheduler enforces a minimum interval, so check the current limit), and they are a sensible cheap alternative to Dataflow streaming for dashboards that tolerate several minutes of lag.
Related Topics
- Dataflow Architecture Selection for choosing between batch, streaming, FlexRS, and Dataflow Prime.
- BigQuery Data Modeling and Clustering for the sink-side schema decisions that affect ingestion design.
- Cost Optimization Architectures for keeping always-on streaming bills under control.
Further Reading
- Dataflow streaming pipelines overview at https://cloud.google.com/dataflow/docs/concepts/streaming-pipelines covers windowing, watermarks, and Streaming Engine in depth.
- BigQuery Storage Write API documentation at https://cloud.google.com/bigquery/docs/write-api walks through committed, pending, and buffered modes with code samples.
- Pub/Sub exactly-once delivery at https://cloud.google.com/pubsub/docs/exactly-once-delivery explains the regional guarantees and how to enable them on a subscription.
- The Apache Beam programming guide at https://beam.apache.org/documentation/programming-guide/ remains the canonical reference for the unified model behind Dataflow.