Pub/Sub Messaging and Streaming Ingestion

Introduction to Pub/Sub Messaging and Streaming Ingestion

Pub/Sub messaging and streaming ingestion sits at the front door of nearly every real-time data pipeline on Google Cloud. It is the asynchronous, globally distributed messaging service that absorbs unpredictable bursts of events, hands them to downstream processors at a steady pace, and keeps publishers and subscribers from ever needing to know about each other. For the Professional Data Engineer exam, expect Pub/Sub messaging and streaming ingestion to appear in scenarios about IoT telemetry, change data capture, log aggregation, click streams, and any architecture where decoupling matters more than synchronous responses.

This study note walks through the moving parts you need to design, operate, and debug a production Pub/Sub messaging and streaming ingestion system. We will cover topics, subscription models, ordering keys, retention windows, ack deadlines with auto-extension, dead-letter handling, snapshots and seek, the trimmed-down Pub/Sub Lite product, the direct-write BigQuery subscription, the Cloud Storage subscription, the schema registry, exactly-once delivery semantics, filter expressions, and IAM. Each section is written to map directly onto the kinds of trade-off questions the PDE exam loves to ask about Pub/Sub messaging and streaming ingestion.

白話文解釋（Plain English Explanation）

Before drowning in API names, it helps to anchor Pub/Sub messaging and streaming ingestion in everyday objects. Three analogies, drawn from the suggested analogy library, capture different facets.

The Post Office Sorting Hub

Picture the central sorting hub of a national postal system. People drop letters into a single inbound chute marked with a destination keyword, the hub sorts them by that keyword, and parallel teams of carriers pick them up to deliver to many neighborhoods at once. In Pub/Sub messaging and streaming ingestion, the chute is the topic, the destination keyword is implicit in which topic you publish to, and each carrier team is a subscription. The publisher walks away the moment the envelope leaves their hand, and the hub keeps the letter safe until a carrier confirms delivery. If a single neighborhood goes offline for a snowstorm, the hub holds onto undelivered mail rather than throwing it away.

The Telephone Switchboard

Old-school telephone switchboards routed callers to whichever operator was free, and operators could ring multiple downstream extensions when needed. Pub/Sub messaging and streaming ingestion behaves similarly: the topic is the switchboard, subscriptions are the trunks heading to specific departments, and a single inbound message can ring many lines at once. Push subscriptions are like the switchboard actively dialing the department on your behalf, while pull subscriptions are like the department repeatedly calling the switchboard to ask whether anything is waiting. Both models reach the same end state, just with different control over flow.

The Restaurant Kitchen Ticket Rail

Walk into a busy kitchen and you will see paper tickets clipped to a rail above the line. The expediter pins each ticket up, and stations downstream (grill, fry, salad) read the tickets independently and start cooking. Tickets stay clipped until a cook tears one off, signaling completion. If a cook gets pulled away, the ticket is still hanging there for the next cook to grab. Pub/Sub messaging and streaming ingestion is that rail, the ack is the cook tearing off the ticket, the ack deadline is the time before the expediter assumes the cook abandoned the ticket and re-clips it, and the dead-letter topic is the bin where tickets that fail too many times end up for the chef to investigate.

Core Concepts of Pub/Sub Messaging and Streaming Ingestion

This section catalogs the vocabulary that the PDE exam expects you to use precisely. Pub/Sub messaging and streaming ingestion has its own dialect, and mixing terms with Kafka or RabbitMQ vocabulary will trip you up.

Topics

A topic is a named resource that represents a feed of messages. Publishers send data to a topic without knowing or caring who reads it. Topics in Pub/Sub messaging and streaming ingestion are global resources within a project, accessible from any region, and they store no messages on their own beyond what subscriptions retain.

Subscriptions

A subscription is the bridge between a topic and one or more subscriber clients. Each subscription receives its own copy of every message published to its parent topic. This is the fan-out model that makes Pub/Sub messaging and streaming ingestion so flexible: you can have one subscription feeding Dataflow for analytics, another feeding Cloud Run for real-time alerting, and a third feeding BigQuery for archival, all reading the same source.

Messages and Attributes

A message carries a binary payload plus optional string attributes. Attributes are tiny key-value tags that travel alongside the payload and are useful for filtering, routing, and ordering. Pub/Sub messaging and streaming ingestion limits message size to 10 MB, but most workloads stay well under 1 MB to keep throughput high.

The acknowledgement deadline is the window of time, after Pub/Sub delivers a message, during which the subscriber must call acknowledge before Pub/Sub redelivers the message. Default is 10 seconds, configurable up to 600 seconds. See Pub/Sub subscriber overview.

Publishers and Subscribers

Publishers are the clients (apps, services, devices) that call the publish API. Subscribers are the clients that consume messages, either by pulling or by receiving pushed HTTP callbacks. Pub/Sub messaging and streaming ingestion uses gRPC under the hood for both directions, with REST as a secondary surface.

Subscription Models in Pub/Sub Messaging and Streaming Ingestion

Choosing the right subscription type is one of the most common PDE exam questions. Each model trades latency, control, and operational overhead differently.

Pull Subscriptions

Pull subscribers ask Pub/Sub for messages on their own schedule. The client opens a streaming pull connection or makes synchronous pull RPCs, processes the batch, then acks. Pull is the right pick when subscribers run inside a VPC where they cannot accept inbound HTTPS, when batching boosts throughput, or when downstream rate limits demand client-side flow control. Dataflow, GKE workloads, and most Java or Python services use pull.

Push Subscriptions

Push subscribers register an HTTPS endpoint, and Pub/Sub messaging and streaming ingestion sends each message as an HTTP POST. The endpoint must respond with a 2xx status to ack. Push is convenient for serverless targets like Cloud Run, Cloud Functions, and App Engine because the platform scales the receivers up and down based on traffic. Push uses adaptive flow control: Pub/Sub watches your endpoint's response codes and latencies, then throttles or expands delivery accordingly.

BigQuery Subscriptions

A BigQuery subscription writes messages directly into a BigQuery table with no Dataflow job in between. This is a relatively new addition to Pub/Sub messaging and streaming ingestion and slashes the cost of straightforward ingestion-to-warehouse pipelines. The subscription handles schema mapping in two modes: writing the message data as a single column, or parsing the payload against a Pub/Sub schema and projecting fields into typed columns.

Cloud Storage Subscriptions

A Cloud Storage subscription batches messages and writes them as files (Avro or text) to a GCS bucket on a size or time trigger. This is ideal for cheap long-term archival, for feeding data lakes, or for satisfying compliance retention without standing up a streaming job.

BigQuery and Cloud Storage subscriptions remove the need for a Dataflow streaming job for the simplest ingestion patterns. They cost noticeably less than running Dataflow continuously, but they offer no in-flight transformation. If you need any reshaping beyond schema projection, you still need Dataflow. Reference: Pub/Sub BigQuery subscriptions.

Architecture and Design Patterns

Pub/Sub messaging and streaming ingestion shines in a handful of recurring patterns. Knowing them by name helps when the exam describes a scenario in business language.

Fan-Out for Multi-Consumer Pipelines

A single topic feeds many subscriptions, each driving an independent consumer. This is how teams add a new analytics use case without touching producers. Marketing's clickstream topic might fan out to a Dataflow job for real-time dashboards, a BigQuery subscription for warehouse loading, and a Cloud Run service for personalization, all in parallel.

Buffering for Spike Smoothing

When traffic is bursty, putting Pub/Sub messaging and streaming ingestion in front of slower processors absorbs the peaks. The topic acts as a shock absorber: producers publish at peak rate, consumers drain at their own sustainable rate, and message retention covers the gap.

Dead-Letter Routing

Messages that repeatedly fail processing are forwarded to a dead-letter topic for inspection. The main pipeline keeps moving, and a separate workflow handles the bad apples. This pattern is essential for any consumer that depends on external services that might temporarily reject specific records.

Replay via Snapshots and Seek

When a downstream bug corrupts data, you do not want to lose every message that arrived during the bad window. Snapshots and seek let you rewind a subscription to a known-good moment and re-process. This is the safety net that makes streaming pipelines recoverable.

Always create a snapshot before you deploy a non-trivial change to a streaming consumer. If the new code mishandles messages, seek the subscription back to the snapshot timestamp and replay. Snapshots are cheap, regret is expensive. Reference: Replaying and discarding messages.

Change Data Capture Hub

Datastream, Debezium-style connectors, and Spanner change streams all funnel into Pub/Sub messaging and streaming ingestion as a normalization layer before downstream sinks. The topic becomes the single source of truth for "things that changed in our operational stores."

GCP Service Deep Dive

This is where the exam-focused detail lives. Each subsection covers a distinct capability of Pub/Sub messaging and streaming ingestion that you should be able to recognize from a scenario description.

Ordering Keys

By default, Pub/Sub messaging and streaming ingestion does not guarantee delivery order. Ordering keys change that: if a publisher tags messages with the same ordering key, Pub/Sub delivers them to subscribers in the order they were published. Different keys can still be processed in parallel, so you do not lose all concurrency. Ordering must be enabled on both the topic and the subscription, and the subscriber must process messages from the same key serially. Common keys are user IDs, device IDs, or session IDs.

Message Retention

Topics retain messages for a configurable window (up to 31 days) so that subscriptions created later can still see historical data, and subscriptions retain unacked messages for up to 7 days. Retention costs money per GB, so set the window to the smallest value that still covers your replay needs.

Ack Deadlines and Auto-Extension

When Pub/Sub messaging and streaming ingestion delivers a message, it starts a timer (the ack deadline). If the subscriber does not ack within that window, the message is redelivered. The default is 10 seconds, but the high-level client libraries auto-extend the deadline using the lease management feature: as long as the message is held in the client buffer, the library calls modifyAckDeadline to keep the lease alive, up to the maxAckExtensionPeriod (default 60 minutes). This is why most subscribers never need to manually tune the deadline.

Setting the ack deadline very high to avoid redelivery is an anti-pattern. If a subscriber crashes mid-processing, you will wait the full deadline before redelivery, delaying recovery. Let the client library auto-extend instead, and keep the base deadline at 10 to 60 seconds. Reference: Lease management for messages.

Dead-Letter Topics

Configure a dead-letter topic on a subscription, and Pub/Sub messaging and streaming ingestion will forward any message that exceeds the configured maximum delivery attempts (5 to 100) to that topic. The dead-letter topic is itself a regular topic, so you attach a subscription to it for triage, alerting, or manual reprocessing. Without dead-lettering, a single poison message can block a subscription forever as it loops on redelivery.

Snapshots and Seek

A snapshot captures the current state of a subscription's acked and unacked messages. Seek replays the subscription to either a snapshot or an arbitrary timestamp within the retention window. Snapshots themselves expire after 7 days. The combination is the foundation of disaster recovery for streaming pipelines.

Pub/Sub Lite

Pub/Sub Lite is a separate, lower-cost zonal product for workloads where you can pre-provision capacity and tolerate zonal failures. You buy fixed publish and subscribe throughput and storage, the way you would with a Kafka cluster. It is roughly an order of magnitude cheaper per GB but lacks global routing, and it does not support BigQuery subscriptions, schemas, or filters. Use Lite only when cost dominates and the workload is predictable.

BigQuery Subscription Direct Write

The BigQuery subscription writes via the BigQuery Storage Write API, which gives you exactly-once semantics into the destination table. Schema can be inferred from a Pub/Sub schema or supplied via the use_topic_schema flag. The subscription handles schema evolution by adding new fields as the topic schema evolves.

Cloud Storage Subscription

The Cloud Storage subscription batches messages on a max bytes or max duration trigger and writes Avro or newline-delimited text. Filenames follow a configurable prefix and a timestamp suffix. Useful for hydrating a data lake without a streaming compute layer.

Schema Registry

The Pub/Sub schema registry stores Avro and Protocol Buffer schemas at the topic level. When a topic has a schema, Pub/Sub validates every published message against it and rejects malformed messages at publish time. Schemas integrate with the BigQuery subscription so that fields project into typed columns, and they support compatible evolution rules so that you can add optional fields without breaking existing publishers.

Exactly-Once Delivery

By default, Pub/Sub messaging and streaming ingestion is at-least-once: a message can be delivered more than once if an ack is lost or a subscriber times out. Exactly-once delivery, an opt-in feature on pull subscriptions, narrows that window. With it enabled, Pub/Sub guarantees that an acked message will not be redelivered to the same subscription, and acks become reliable rather than best-effort. Exactly-once does not eliminate duplicates across different subscriptions, and it requires that the subscription region match the publisher region for the strongest guarantees.

Exactly-once delivery in Pub/Sub messaging and streaming ingestion is per-subscription, pull-only, and depends on regional alignment. It does not make consumers idempotent for free; you still need stable message IDs if you write to non-transactional sinks. Reference: Exactly-once delivery.

Filter Expressions

Subscriptions can declare a filter expression over message attributes. Pub/Sub messaging and streaming ingestion evaluates the filter at delivery time and only sends matching messages to the subscriber. Non-matching messages are auto-acked from the subscription's perspective, so you do not pay subscriber-side processing for irrelevant data. Filters operate on attributes, not payload contents, so design your publishers to set attributes you might want to filter on.

IAM and Access Control

Pub/Sub messaging and streaming ingestion uses fine-grained IAM roles. Key roles to remember: roles/pubsub.publisher lets a principal publish to a topic, roles/pubsub.subscriber lets it pull or receive push messages from a subscription, roles/pubsub.viewer grants read-only metadata access, and roles/pubsub.editor grants resource management. For BigQuery subscriptions, the Pub/Sub service account also needs roles/bigquery.dataEditor on the destination dataset. For push subscriptions to private endpoints, you authenticate with a service account token that the receiver validates.

Common Pitfalls and Trade-Offs

Real systems break in patterns. These are the ones that show up both on the exam and in production postmortems.

Treating Push as a Free Lunch

Push subscriptions look simple, but they can hammer a downstream service that is not auto-scaling fast enough, leading to a cascade of 5xx responses and message redelivery storms. Pub/Sub will back off, but only after damage is done. For unpredictable traffic, pull with controlled concurrency is often safer.

Forgetting Ordering Costs Throughput

Ordering keys serialize processing within a key. If your key has high cardinality (millions of users), throughput stays high. If your key has low cardinality (a handful of regions), you bottleneck on the largest key. Choose ordering keys with care, and avoid ordering when you do not need it.

Mismatched Retention and Replay Plans

Teams often set retention to the default and discover, mid-incident, that the data they wanted to replay was already discarded. If your recovery plan involves seek, set retention to cover your worst-case detection-plus-fix time, not just typical operations.

Dead-Letter Topic Without a Consumer

Configuring a dead-letter topic and then never reading from it is the same as logging errors to /dev/null. Always attach a subscription that at minimum alerts on volume, and ideally feeds a triage workflow.

Confusing Pub/Sub Lite with Pub/Sub

The two products share a name but differ in availability model, pricing, and feature set. Lite is zonal, capacity-provisioned, and lacks half the features described above. Picking Lite for a global, mission-critical pipeline is a costly mistake.

Schema Evolution Surprises

Adding a required field to a topic schema breaks existing publishers. Schema evolution rules require new fields to be optional or have defaults. Always test schema changes against a non-production topic first.

Best Practices

A short, opinionated checklist for production Pub/Sub messaging and streaming ingestion deployments:

Default to pull subscriptions for any consumer that runs continuously, push only for genuinely event-driven serverless workloads.
Enable dead-letter topics on every subscription that processes business-critical data, and wire alerts to the dead-letter volume metric.
Set message retention to cover your worst-case replay window, not the default.
Use ordering keys only when downstream logic actually depends on order, and pick high-cardinality keys.
Turn on exactly-once delivery for pipelines whose sinks are not idempotent and that cannot tolerate duplicates.
Attach a Pub/Sub schema to every production topic; it catches malformed payloads at publish time rather than in subscribers.
Take snapshots before risky deployments so you can seek back if needed.
Grant the minimum IAM role per principal: publishers get roles/pubsub.publisher on specific topics, never project-level.
Monitor oldest_unacked_message_age and num_undelivered_messages; these are the leading indicators of consumer lag.

Real-World Use Case

Consider a mid-sized rideshare company processing roughly 50,000 trip events per second across three continents. Each trip generates events for ride request, driver assignment, pickup, route updates, and completion. The data team needs three downstream uses: real-time dispatch dashboards, fraud detection, and warehouse loading for finance.

The architecture lands on Pub/Sub messaging and streaming ingestion as the central hub. Trip-event publishers in each region post to a single global topic with a Pub/Sub schema enforcing the event payload structure. Three subscriptions fan out: a Dataflow streaming job for the dispatch dashboards (pull, with ordering keys on trip_id so events for a single trip stay sequential), a Cloud Run push subscription powering the fraud model, and a BigQuery subscription writing directly into a partitioned table for finance.

Exactly-once delivery is enabled on the Dataflow subscription because dashboard duplicate counts caused customer complaints in the past. A dead-letter topic captures messages that fail validation, and a Cloud Function reads it to alert on-call engineers. Retention is set to 3 days so that any incident detected within 72 hours can replay using seek. The BigQuery subscription costs a fraction of the previous Dataflow-to-BigQuery streaming insert pipeline it replaced, freeing budget for additional compute on the analytics side.

When a fraud model bug ships and starts mislabeling trips, the on-call engineer takes a snapshot of the fraud subscription, rolls back the Cloud Run revision, then seeks the subscription back two hours to replay events through the fixed model. No data is lost, no downstream sink is corrupted, and the dispatch and finance pipelines are entirely unaffected because they use independent subscriptions.

Exam Tips

The PDE exam asks scenario questions where Pub/Sub messaging and streaming ingestion is one of several plausible answers. Use these heuristics to pick correctly:

If the scenario stresses global scale, automatic capacity, and managed operations, choose Pub/Sub. If it stresses fixed-cost predictability and the team is willing to manage capacity, Pub/Sub Lite is the trap-door alternative.
When the question describes ingesting straight into BigQuery with no transformation, prefer the BigQuery subscription over a Dataflow job. The exam rewards the simpler, cheaper option.
When the question mentions ordered processing per user or per device, ordering keys are the answer, not a separate ordering service.
If the scenario describes a poison message blocking processing, dead-letter topics are the fix.
For replay after a downstream bug, snapshots and seek are the right tools, not re-publishing from source.
Exactly-once delivery is pull-only and per-subscription; if a question asks about exactly-once on a push subscription, that option is wrong.
Filter expressions reduce subscriber-side processing cost for fan-out scenarios where each consumer wants only a slice of the topic.
For long-term archival to a data lake, Cloud Storage subscriptions with Avro output are the cost-optimal choice.
IAM at the topic and subscription level matters: project-wide roles/pubsub.editor on a service account is almost always the wrong answer.
Schema registry catches bad payloads at publish time; the exam favors prevention over downstream cleanup.

On the PDE exam, when you see a scenario that includes "decouple producers and consumers," "absorb spiky traffic," "fan-out to multiple downstream systems," or "ingest streaming events into BigQuery," Pub/Sub messaging and streaming ingestion is almost certainly part of the correct answer. Reference: Pub/Sub overview.

Frequently Asked Questions (FAQ)

How does Pub/Sub messaging and streaming ingestion differ from Apache Kafka?

Both are publish-subscribe systems, but Pub/Sub is fully managed, globally available, and capacity-elastic, while Kafka is a partitioned log that requires capacity planning and operational care. Pub/Sub does not expose partitions or offsets to clients (Pub/Sub Lite does), and it scales horizontally without operator intervention. Choose Pub/Sub when you want managed simplicity, Kafka when you need fine-grained control over the log or have an existing Kafka ecosystem.

When should I pick a BigQuery subscription instead of a Dataflow streaming job?

Pick the BigQuery subscription when the messages already match the destination schema (or can be projected via a topic schema) and you do not need to enrich, join, aggregate, or window the data before it lands. The BigQuery subscription is dramatically cheaper for pure ingestion. Switch to Dataflow the moment any in-flight transformation is required.

Does enabling ordering keys hurt throughput?

It limits parallelism within a single ordering key but not across keys. If you publish with millions of distinct keys (for example, per-user IDs), throughput stays close to non-ordered performance. If you publish with a handful of keys, those keys become bottlenecks because messages with the same key must be processed sequentially.

What happens if a subscriber never acks a message?

Pub/Sub messaging and streaming ingestion redelivers the message after the ack deadline expires. If a dead-letter topic is configured and the message exceeds the maximum delivery attempts, it is forwarded there. Without a dead-letter topic, the message keeps being redelivered until it is acked or the subscription's message retention window passes (default 7 days, max 7 days).

Is exactly-once delivery truly exactly once?

Exactly-once delivery in Pub/Sub messaging and streaming ingestion guarantees that, within a single subscription, an acked message will not be redelivered. It does not deduplicate messages that the publisher sent twice (different message IDs), and it does not coordinate across multiple subscriptions to the same topic. For end-to-end exactly-once, combine the feature with idempotent writes downstream or with the BigQuery Storage Write API exactly-once mode.

How do filter expressions interact with billing?

Filtered-out messages are still delivered (and billed) up to the point where the filter rejects them, then auto-acked from the subscription. You pay for ingress to the topic regardless. Filters reduce subscriber-side compute and egress, not topic-level publish charges.

Can I push messages to an endpoint inside a private VPC?

Yes, but the endpoint must be reachable via a public hostname or via Internal HTTPS Load Balancer with the appropriate Pub/Sub service-attachment configuration. For most private workloads, pull subscriptions are simpler because they do not require any inbound connectivity.

Dataflow architecture selection — how to pair Pub/Sub with Dataflow for in-flight transformation.
Batch vs streaming design — trade-offs between micro-batches and pure streaming pipelines.
BigQuery data modeling and clustering — how the BigQuery subscription's destination table should be designed.

Introduction to Pub/Sub Messaging and Streaming Ingestion

白話文解釋（Plain English Explanation）

The Post Office Sorting Hub

The Telephone Switchboard

The Restaurant Kitchen Ticket Rail

Core Concepts of Pub/Sub Messaging and Streaming Ingestion

Topics

Subscriptions

Messages and Attributes

Publishers and Subscribers

Subscription Models in Pub/Sub Messaging and Streaming Ingestion

Pull Subscriptions

Push Subscriptions

BigQuery Subscriptions

Cloud Storage Subscriptions

Architecture and Design Patterns

Fan-Out for Multi-Consumer Pipelines

Buffering for Spike Smoothing

Dead-Letter Routing

Replay via Snapshots and Seek

Change Data Capture Hub

GCP Service Deep Dive

Ordering Keys

Message Retention

Ack Deadlines and Auto-Extension

Dead-Letter Topics

Snapshots and Seek

Pub/Sub Lite

BigQuery Subscription Direct Write

Cloud Storage Subscription

Schema Registry

Exactly-Once Delivery

Filter Expressions

IAM and Access Control

Common Pitfalls and Trade-Offs

Treating Push as a Free Lunch

Forgetting Ordering Costs Throughput

Mismatched Retention and Replay Plans

Dead-Letter Topic Without a Consumer

Confusing Pub/Sub Lite with Pub/Sub

Schema Evolution Surprises

Best Practices

Real-World Use Case

Exam Tips

Frequently Asked Questions (FAQ)

How does Pub/Sub messaging and streaming ingestion differ from Apache Kafka?

When should I pick a BigQuery subscription instead of a Dataflow streaming job?

Does enabling ordering keys hurt throughput?

What happens if a subscriber never acks a message?

Is exactly-once delivery truly exactly once?

How do filter expressions interact with billing?

Can I push messages to an endpoint inside a private VPC?

Related Topics

Further Reading

Official sources

More PDE topics