Cloud Monitoring and Logging for Data Pipelines

Introduction to Cloud Monitoring and Logging for Data Pipelines

Cloud Monitoring and Logging for data pipelines is the discipline of turning raw signals from Dataflow, Dataproc, BigQuery, Pub/Sub, and supporting infrastructure into actionable insight. On the PDE exam you are expected to know which signal answers which question, where the data lives, and how to alert on it without drowning the on-call rotation in noise.

Plain English Explanation

Think of Pub/Sub backlog like a sushi conveyor belt

A sushi restaurant prepares plates and drops them onto the conveyor. If diners eat slowly, the belt fills up. The chef can see the queue building and either slows down or asks more diners to start eating. A Pub/Sub subscription works the same way. The publisher drops messages onto the belt, the subscriber consumes them, and the num_undelivered_messages metric is exactly the count of plates still sitting on the conveyor. When that number climbs, you either scale out subscribers or slow the publisher.

Think of Cloud Logging like a courier tracking page

When you ship a parcel internationally, you get a tracking page that lists every checkpoint: picked up, sorted at hub, cleared customs, out for delivery. Cloud Logging is that tracking page for a record passing through your pipeline. Each Dataflow worker, each BigQuery job, each Pub/Sub publish writes a line. Filter by your correlation_id and you can replay the whole journey, including the line where things went wrong.

Think of an SLO error budget like a monthly data plan

Your phone plan gives you, say, 20 GB per month. You can binge YouTube on day one and ration the rest, or you can spread usage evenly. An SLO error budget works the same way. If your target is 99.9% successful pipeline runs over 30 days, up to 0.1% of runs may fail within that window. Burn the whole allowance on a single bad deploy and you must freeze risky changes until the window rolls forward. The math is identical to the data plan, just denominated in failed runs instead of gigabytes.

Think of Cloud Trace like a relay race baton

In a 4x100m relay, the baton is the only thing the timing system truly cares about. Coaches review where the baton was at every handoff to find the slow leg. Cloud Trace attaches a trace ID to a request and records every span (every leg) it travels through. When a synchronous data API takes too long, you do not guess which microservice is slow. You open the trace and see exactly which leg ate the seconds.

Core Concepts of Cloud Monitoring and Logging

Cloud Monitoring is a time-series database optimized for numerical metrics. Cloud Logging is an indexed event store for structured and unstructured text. The two share a control plane through the Google Cloud Operations Suite, but they answer different questions. Monitoring tells you something is wrong now. Logging tells you what exactly happened, with full context.

A few terms keep showing up on the exam.

A metric type is the schema for a measurement, identified by a string like pubsub.googleapis.com/subscription/num_undelivered_messages. Each metric type has a kind (gauge, delta, cumulative) and a value type (int64, double, distribution). A monitored resource is the entity producing the metric, such as a dataflow_job or bigquery_project. A time series is the combination of a metric type and a labeled resource over time.

A log-based metric is a user-defined metric whose value is computed from log entries that match a Logging filter. Two flavors exist: counter metrics (count matching entries) and distribution metrics (extract a numeric field from the log payload).

Reference: https://cloud.google.com/logging/docs/logs-based-metrics

Monitoring Query Language (MQL) is a text-based query language for Cloud Monitoring time series, more expressive than the basic threshold UI. MQL lets you join metrics, apply window functions, and define alert conditions that the point-and-click editor cannot express.

Reference: https://cloud.google.com/monitoring/mql

A log sink routes log entries that pass through the Log Router to a destination: a log bucket, a Pub/Sub topic, BigQuery, or Cloud Storage. The built-in _Default and _Required sinks write to the buckets of the same names. Sinks you add yourself are how you implement long-term retention, real-time streaming to SIEM tools, or downsampling to manage cost.

Architecture and Design Patterns

A production pipeline rarely lives inside a single service. A typical streaming architecture looks like this. Pub/Sub ingests events from mobile clients. Dataflow reads the subscription, joins against a Bigtable lookup, and writes enriched rows to BigQuery. Composer orchestrates a nightly batch reconciliation job that exports BigQuery results back to Cloud Storage for downstream partners.

Observability for that stack is layered.

At the infrastructure layer, the Ops Agent runs on Compute Engine VMs and any self-managed worker nodes, shipping host metrics (CPU, memory, disk) and syslog. Managed services like Dataflow and BigQuery emit metrics natively, no agent required.

At the service layer, each managed service exposes its own namespace. Dataflow publishes job/system_lag, job/data_watermark_age, and per-stage element counts. BigQuery exposes slots/allocated_for_project, query/execution_times, and storage/byte_count. Pub/Sub gives you publish, subscription, and topic-level series.

At the application layer, your code emits structured logs (JSON payloads) and optionally OpenTelemetry traces. This is where you embed business identifiers like customer_id or pipeline_run_id, which are the only labels that make a pipeline replay tractable.
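A minimal sketch of the application-layer logging described above, using only the Python standard library. The helper name structured_log and the identifier values are assumptions for illustration. On runtimes such as Cloud Run or GKE, a JSON line written to stdout is typically ingested as a structured payload, but confirm the behavior for your own runtime.

```python
import json
import sys
from datetime import datetime, timezone


def structured_log(severity, message, **fields):
    """Build one JSON log line with business identifiers as top-level fields."""
    entry = {
        "severity": severity,
        "message": message,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    entry.update(fields)  # e.g. pipeline_run_id, correlation_id
    return json.dumps(entry, sort_keys=True)


if __name__ == "__main__":
    line = structured_log(
        "ERROR",
        "enrichment lookup failed",
        pipeline_run_id="run-0001",  # hypothetical identifier
        correlation_id="abc-123",    # hypothetical identifier
    )
    print(line, file=sys.stdout)
```

Keeping the identifiers as separate fields, rather than embedding them in the message string, is what lets a later query filter on them exactly.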

The pattern that ties it together is the golden signals dashboard per service: latency, traffic, errors, saturation. Each pipeline gets one such dashboard, plus a parent rollup that aggregates SLO compliance across all pipelines owned by the team.

Native service metrics are emitted automatically with no opt-in, but VM-level metrics for Dataproc require the Ops Agent or the Stackdriver legacy agent. Dataproc clusters created without the --enable-cloud-monitoring flag will be missing CPU per-core utilization and memory pressure series, which is the most common reason for blind YARN rebalancing failures.

Reference: https://cloud.google.com/dataproc/docs/guides/dataproc-metrics

GCP Service Deep Dive

Dataflow metrics that matter

Five Dataflow metrics carry most of the operational weight.

job/system_lag is the maximum amount of time a piece of data has been waiting at any stage of the pipeline. For a healthy streaming job this should hover near a few seconds. A growing system lag means a downstream stage is slower than the source, and autoscaling has either hit its max workers cap or is constrained by a non-parallelizable hot key.

job/data_watermark_age measures how far behind real-time the watermark is. It differs from system lag because watermark age accounts for the windowing semantics. A late-arriving record can move the watermark even when system lag is fine.

job/elements_produced_count per stage is the throughput meter. Compare consecutive stages to find the one that is dropping or buffering elements.

job/current_num_vcpus paired with job/total_vcpu_time answers the autoscaling question. If vCPUs are flat at the maximum and system lag is climbing, you are throughput-bound, not CPU-bound.

job/user_counter is your escape hatch. Beam's Metrics.counter() and Metrics.distribution() API surfaces here, so any business event you increment in a DoFn becomes a first-class time series.

Dataproc and Spark metrics

Dataproc emits cluster-level metrics by default: HDFS capacity, YARN memory, NodeManager count. To get application-level Spark metrics (executor count, shuffle read bytes, GC time) you enable the Spark metric sink that ships to Cloud Monitoring, or you scrape the Spark UI through a Prometheus sidecar.

The Component Gateway gives you authenticated access to the Spark History Server and YARN ResourceManager UIs without opening firewall rules, which is the right answer when an exam scenario asks how a developer should debug a stuck Spark stage in a private cluster.

BigQuery monitoring

BigQuery is unusual because the relevant signals are split between Cloud Monitoring metrics and the INFORMATION_SCHEMA views.

Cloud Monitoring covers project-wide and reservation-wide series: slots/allocated_for_project, slots/total_allocated_for_reservation, query/count, and storage bytes by table. These are right for capacity planning and reservation sizing.

INFORMATION_SCHEMA.JOBS_BY_PROJECT and JOBS_BY_ORGANIZATION give per-job detail: bytes scanned, slot-ms consumed, query text, error result. These are right for cost attribution and identifying expensive query patterns. A common mistake is trying to compute per-user cost from Cloud Monitoring metrics, which lacks the principal label. Use JOBS_BY_PROJECT instead.
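The per-user attribution can be sketched in a few lines once the job rows are exported. This example assumes rows shaped like JOBS_BY_PROJECT output with user_email and total_bytes_billed fields. The usd_per_tib rate is a placeholder parameter, not a quoted price, and the sample rows are invented.

```python
from collections import defaultdict

TIB = 2 ** 40


def cost_by_user(job_rows, usd_per_tib):
    """Sum billed bytes per principal and convert to an estimated on-demand cost."""
    billed = defaultdict(int)
    for row in job_rows:
        # Jobs that billed nothing (cache hits, failures) may carry a null value.
        billed[row["user_email"]] += row["total_bytes_billed"] or 0
    return {user: (b / TIB) * usd_per_tib for user, b in billed.items()}


rows = [
    {"user_email": "analyst@example.com", "total_bytes_billed": 2 * TIB},
    {"user_email": "etl@example.com", "total_bytes_billed": TIB // 2},
    {"user_email": "analyst@example.com", "total_bytes_billed": None},
]
# Placeholder rate of 5.0 per TiB, for illustration only.
print(cost_by_user(rows, usd_per_tib=5.0))
# {'analyst@example.com': 10.0, 'etl@example.com': 2.5}
```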

Schedule a query that aggregates JOBS_BY_PROJECT daily into a flat fact table, then build your cost dashboard on the materialized table rather than the live view. The view re-scans 180 days of metadata each query and gets expensive on busy projects.

Reference: https://cloud.google.com/bigquery/docs/information-schema-jobs

Pub/Sub backlog and ack health

The two Pub/Sub metrics that show up on every PDE pipeline scenario are subscription/num_undelivered_messages and subscription/oldest_unacked_message_age. The first tells you how deep the backlog is. The second tells you whether any specific message is stuck (a poison pill that crashes a worker on every redelivery).

A third, underrated signal is the ratio of subscription/ack_message_count to subscription/sent_message_count. When that ratio drops below 1, messages are being delivered more often than they are acknowledged, which points to nack storms or expired ack deadlines, often because a downstream API is slow.
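As a rough illustration of the ack-to-sent ratio, the sketch below takes two counts for the same time window and flags a suspiciously low ratio. The function names and the 0.9 floor are assumptions chosen for the example, not recommended values.

```python
def ack_ratio(ack_count, sent_count):
    """Return acked/sent for one window, or None when nothing was sent."""
    if sent_count == 0:
        return None
    return ack_count / sent_count


def looks_like_nack_storm(ack_count, sent_count, floor=0.9):
    """Flag windows where far fewer messages were acked than delivered."""
    ratio = ack_ratio(ack_count, sent_count)
    return ratio is not None and ratio < floor


print(looks_like_nack_storm(ack_count=500, sent_count=1000))   # True
print(looks_like_nack_storm(ack_count=1000, sent_count=1000))  # False
```

Treating an empty window as "no data" rather than as a ratio of zero avoids paging on idle subscriptions.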

Notification channels

Cloud Monitoring supports email, SMS, Slack (via webhook), PagerDuty, Pub/Sub, and webhook channels. Pub/Sub is the right answer when an exam scenario wants alerts to trigger automated remediation: an alert fires, posts to a topic, a Cloud Function consumes it and restarts a Dataflow job or rolls back a deployment.
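The remediation path can be sketched as a small handler. This example assumes the alert arrives as a Pub/Sub message whose base64-encoded data field holds JSON with an incident object containing state and policy_name. Verify the field names against the notification schema you actually receive. The action names returned here are hypothetical.

```python
import base64
import json


def decode_alert(pubsub_message):
    """Decode the base64 'data' field of a Pub/Sub message into a dict."""
    raw = base64.b64decode(pubsub_message["data"])
    return json.loads(raw.decode("utf-8"))


def choose_action(alert):
    """Map an incident to a remediation step (action names are hypothetical)."""
    incident = alert.get("incident", {})
    if incident.get("state") != "open":
        return "noop"  # ignore closed incidents
    policy = incident.get("policy_name", "")
    if "dataflow" in policy.lower():
        return "restart_dataflow_job"
    return "notify_humans"


# Invented sample payload, encoded the way Pub/Sub delivers message data.
sample = {"incident": {"state": "open", "policy_name": "Dataflow system lag"}}
message = {"data": base64.b64encode(json.dumps(sample).encode("utf-8")).decode("ascii")}
print(choose_action(decode_alert(message)))  # restart_dataflow_job
```

Checking the incident state first matters because the same channel also delivers a message when the incident closes.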

Log-based Metrics, Logs Explorer, and Log Analytics

Logs Explorer queries you actually use

The Logs Explorer query language has a small surface area but a few patterns cover most exam scenarios.

resource.type="dataflow_step"
resource.labels.job_name="enrich-clickstream"
severity>=ERROR
jsonPayload.message=~"watermark.*stuck"

The =~ operator is a regex match. Use it sparingly because regex matches cannot use the underlying index, so they get slow on broad time windows.

resource.type="bigquery_resource"
protoPayload.methodName="jobservice.jobcompleted"
protoPayload.serviceData.jobCompletedEvent.job.jobStatus.error.code!=""

This finds every failed BigQuery job. Save it as a query, attach a log-based metric, and you have a "BigQuery error rate" series with no extra code.
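If you generate these filters from code, a small builder keeps quoting consistent. The sketch below rebuilds the Dataflow filter shown earlier in this section. build_filter and Bare are hypothetical helpers, not part of any Google client library. It relies on the Logging query language treating comparisons on separate lines as implicitly ANDed.

```python
class Bare(str):
    """Marker for values that must not be quoted, such as severity levels."""


def _render(value):
    if isinstance(value, Bare):
        return str(value)
    # Escape backslashes first, then double quotes, before wrapping in quotes.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_filter(clauses):
    """Join (field, operator, value) clauses, one comparison per line."""
    return "\n".join(f"{field}{op}{_render(value)}" for field, op, value in clauses)


query = build_filter([
    ("resource.type", "=", "dataflow_step"),
    ("resource.labels.job_name", "=", "enrich-clickstream"),
    ("severity", ">=", Bare("ERROR")),
    ("jsonPayload.message", "=~", "watermark.*stuck"),
])
print(query)
```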

Log Analytics

Log Analytics lets you run SQL against a log bucket and, through a linked dataset, makes the bucket queryable from BigQuery. This helps ad-hoc investigation because you can join logs against your own dimension tables (a customer table, a deployment table) directly in SQL.

The catch is that Log Analytics must be enabled on the bucket, either at creation time or by upgrading an existing bucket. In older projects the _Default bucket is not Log Analytics enabled.

Queries run from the Log Analytics page do not carry a separate query charge, but queries that read the logs through a linked BigQuery dataset are billed under BigQuery pricing for the bytes scanned. A naive SELECT * over a year of logs from a busy production project can scan terabytes. Always include a timestamp filter and project only the columns you need.

Reference: https://cloud.google.com/logging/docs/log-analytics

Building log-based metrics

A counter log-based metric matches a Logging filter and increments by one for each match. A distribution metric extracts a numeric value from the log payload (typically a latency field) and builds a distribution histogram. Both can be labeled with up to ten user labels extracted from the log entry, which is what makes them useful in dashboards (group by customer_tier, by region, by pipeline_name).

The label cardinality limit matters. If you label by user_id for an app with millions of users, you blow past the active time-series quota and Cloud Monitoring will start dropping series silently.
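The cardinality problem is easy to see with arithmetic: the worst-case number of time series for one metric is the product of the distinct values of each label. The label names below come from the text. The counts are invented for illustration.

```python
from math import prod


def series_count(label_cardinalities):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


safe = {"customer_tier": 3, "region": 10, "pipeline_name": 20}
risky = dict(safe, user_id=2_000_000)

print(series_count(safe))   # 600
print(series_count(risky))  # 1200000000
```

One unbounded label multiplies every other label's cardinality, which is why user_id belongs in the log payload rather than in a metric label.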

Alerting Policies: Threshold, MQL, and Anomaly Detection

Threshold conditions

The simplest alert is a threshold on a single metric: "fire if num_undelivered_messages exceeds 10000 for 5 minutes." The UI lets you choose the aggregator (mean, max, sum, percentile), the duration, and the threshold value.

Two parameters trip people up.

The duration is how long the condition must be continuously true before the alert opens. Set it too low and you page on every transient spike. Set it too high and you miss real incidents.

The alignment period is the resampling window before the aggregator runs. A 60-second alignment period with a mean aggregator means each evaluated point is the average over the previous minute. Mismatched alignment and duration is a common source of alerts that "should have fired but did not."
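The interaction between alignment and duration can be sketched with a simplified model: resample raw points into per-period means, then require the threshold to be exceeded for enough consecutive aligned points to cover the duration. This is an illustration of the idea, not a reproduction of Cloud Monitoring's exact evaluation semantics, and the sample values are invented.

```python
from statistics import mean


def align_mean(samples, period_s):
    """Resample (timestamp_s, value) points into per-period means, ordered by time."""
    buckets = {}
    for ts, value in samples:
        buckets.setdefault(ts // period_s, []).append(value)
    return [mean(buckets[k]) for k in sorted(buckets)]


def fires(aligned, threshold, period_s, duration_s):
    """True if the threshold is exceeded for enough consecutive aligned points."""
    needed = max(1, duration_s // period_s)
    run = 0
    for point in aligned:
        run = run + 1 if point > threshold else 0
        if run >= needed:
            return True
    return False


# One 30-second spike versus a sustained backlog, sampled every 30 seconds.
spike = [(t, 30000 if t == 90 else 5000) for t in range(0, 360, 30)]
sustained = [(t, 12000) for t in range(0, 360, 30)]

print(fires(align_mean(spike, 60), 10000, 60, 300))      # False
print(fires(align_mean(sustained, 60), 10000, 60, 300))  # True
```

The spike lifts one aligned point above the threshold, but the 5-minute duration needs five consecutive points, so only the sustained backlog opens an incident.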

MQL conditions

MQL lets you express conditions the threshold UI cannot. A classic example: "alert if the ratio of failed Dataflow jobs to total jobs exceeds 5% over a rolling hour."

{ fetch dataflow_job
  | metric 'dataflow.googleapis.com/job/is_failed'
  | group_by 1h, [failed: sum(value.is_failed)]
; fetch dataflow_job
  | metric 'dataflow.googleapis.com/job/is_failed'
  | group_by 1h, [total: count(value.is_failed)]
}
| join
| value [ratio: val(0) / val(1)]
| condition val() > 0.05

MQL is also the right answer when you need rate calculations, multi-metric correlations, or topk filtering inside an alert.

Anomaly detection (forecast and ML-based conditions)

For metrics with strong seasonality (daily, weekly), static thresholds produce false alarms. Cloud Monitoring offers forecast conditions that use a model to predict the expected range and alert when the actual value diverges. This is exposed in the UI as a percent change condition with a learned baseline, and in MQL as the forecast operator.

The exam framing is usually: "static thresholds keep paging the on-call team during normal weekend traffic dips. What should you do?" The right answer is a forecast or anomaly condition, not a higher static threshold (which would hide real incidents during weekday peaks).

A common pitfall: configuring an alert with notification_rate_limit set to "no limit" plus auto-close set to 7 days. A flapping condition can then page hundreds of times in an hour. Always set a notification rate limit (one notification per hour is a sensible default) and consider an auto_close of 1-2 days to clean up stale incidents.

Reference: https://cloud.google.com/monitoring/alerts/concepts-indepth

SLOs and Error Budgets

Service Monitoring is the GCP product that hosts SLO definitions. You first define a service (a logical grouping, often one per pipeline). Then you define an SLI: either a request-based SLI (good requests / total requests, computed from a metric or log-based metric) or a windows-based SLI (good time windows / total time windows).
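The two SLI styles can give different answers on the same data. The sketch below computes both from per-window (good, total) counts. The numbers and the 0.99 goodness threshold are invented for illustration.

```python
def request_based_sli(windows):
    """Good events divided by total events across all windows."""
    good = sum(g for g, _ in windows)
    total = sum(t for _, t in windows)
    return good / total


def windows_based_sli(windows, goodness_threshold):
    """Fraction of windows whose own good ratio meets the threshold."""
    good_windows = sum(1 for g, t in windows if g / t >= goodness_threshold)
    return good_windows / len(windows)


# (good, total) per window; the third window is low-traffic but half failed.
windows = [(990, 1000), (10, 10), (5, 10), (1000, 1000)]

print(round(request_based_sli(windows), 4))  # 0.9926
print(windows_based_sli(windows, 0.99))      # 0.75
```

A low-traffic bad window barely moves the request-based figure but costs a full window in the windows-based one, which is the trade-off to weigh when choosing between them.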

A typical pipeline SLO looks like: "99.5% of records published to Pub/Sub are written to BigQuery within 5 minutes, measured over a rolling 28-day window." The good event count comes from a log-based metric that fires when a record is successfully landed. The total event count comes from the Pub/Sub publish counter.

The error budget is (1 - SLO target) * total events. With 99.5% target and 100M weekly publishes, the budget is 500K failed records per week.
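The arithmetic above, written out as a small sketch. The function names are assumptions for illustration, and the 125,000 bad-record figure is invented.

```python
def error_budget(slo_target, total_events):
    """Allowed bad events: (1 - SLO target) * total events, rounded to whole events."""
    return round((1 - slo_target) * total_events)


def budget_remaining(slo_target, total_events, bad_events):
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget(slo_target, total_events)
    return (budget - bad_events) / budget


print(error_budget(0.995, 100_000_000))               # 500000
print(budget_remaining(0.995, 100_000_000, 125_000))  # 0.75
```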

The product exposes two derived metrics worth alerting on. Burn rate is how fast the budget is being consumed relative to a normal pace. A burn rate of 2 means you will exhaust the budget in half the window. Fast-burn alerts (1-hour window, burn rate > 14.4) catch acute incidents. Slow-burn alerts (6-hour window, burn rate > 6) catch chronic degradations that would otherwise eat the budget without ever tripping a threshold alert.
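The burn-rate thresholds are not arbitrary. In the SRE workbook recipe they follow from how much of the budget you are willing to lose within the alert window, assuming a 30-day (720-hour) SLO period. The sketch below shows both the observed burn rate and the derivation of the 14.4 and 6 thresholds. The function names and the sample counts are assumptions for illustration.

```python
def burn_rate(bad_events, total_events, slo_target):
    """Observed error rate divided by the error rate the SLO allows."""
    allowed = 1 - slo_target
    return (bad_events / total_events) / allowed


def threshold_for(budget_fraction, alert_window_h, slo_window_h=720):
    """Burn rate that spends budget_fraction of the budget within alert_window_h."""
    return budget_fraction * slo_window_h / alert_window_h


# Spending 2% of the budget in 1 hour, or 5% in 6 hours, of a 30-day window.
print(round(threshold_for(0.02, 1), 1))  # 14.4
print(round(threshold_for(0.05, 6), 1))  # 6.0

# 720 bad out of 10,000 events against a 99.5% target.
print(round(burn_rate(720, 10_000, 0.995), 1))  # 14.4
```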

Standard SRE burn-rate alert recipe, based on a 30-day SLO window: fast-burn at 14.4x over 1 hour for high-severity paging, slow-burn at 6x over 6 hours for ticket-creating notifications, and 1x over 3 days for engineering review. Memorize these multipliers; they appear directly in PDE scenarios.

Reference: https://sre.google/workbook/alerting-on-slos/

Dashboards and Visualization

Dashboards in Cloud Monitoring are JSON definitions you can author in the UI, in Terraform via google_monitoring_dashboard, or via the Dashboards API. The Terraform path is the answer whenever an exam scenario mentions "consistent monitoring across multiple environments" or "version-controlled observability."

A useful pipeline dashboard layout has three rows.

The top row shows business-level SLO compliance: current SLO status, error budget remaining, burn rate trend. This is what executives see.

The middle row shows golden signals per service: Pub/Sub backlog, Dataflow system lag, BigQuery slot utilization, error rate. This is what the on-call engineer sees first.

The bottom row shows infrastructure detail: worker counts, CPU per worker, memory pressure, network throughput. This is what the engineer drills into after isolating the failing service.

Each chart should declare a clear time-series legend (one line per resource, not one line per metric type) and use the LINE or STACKED_AREA visualization deliberately, not by default.

Cloud Trace for Distributed Pipelines

Cloud Trace is most often discussed in the context of microservices, but it earns its keep in data pipelines too. Synchronous data APIs (a feature store lookup, a real-time scoring service, a metadata service called from inside a Beam DoFn) benefit from end-to-end tracing.

Beam itself does not auto-instrument trace spans across stages because the asynchronous, batched nature of the model does not map cleanly onto the request-scoped trace concept. But you can manually start spans inside a DoFn for any synchronous outbound call, using the OpenTelemetry SDK with the Cloud Trace exporter.

For Dataflow pipelines that call external services, the right pattern is to propagate a trace context through the pipeline as a side input or attached metadata, and emit a span per external call. The latency analysis view in Cloud Trace then shows you which external dependency is slowing the pipeline.

The PDE exam will rarely ask deep tracing questions, but it will ask which observability pillar covers latency analysis across services. The answer is Cloud Trace, not Cloud Monitoring or Cloud Logging.

Common Pitfalls and Trade-offs

A handful of mistakes show up over and over in real production incidents and on exam questions.

Logging too much. Default INFO logs from a busy Dataflow job can generate gigabytes per hour. Without exclusion filters, you ingest all of it into the _Default bucket and pay for every byte. The fix is a log exclusion filter that drops INFO from high-volume sources while keeping WARNING and above. Pair that with a separate sink that routes ERROR and above to a long-retention bucket.

Alerting on every metric. A team that wires up alerts for CPU, memory, disk, network, and queue length on every service ends up paging on noise. The answer is to alert on user-facing symptoms (SLO burn rate, end-to-end latency) and reserve resource alerts for capacity planning emails, not pages.

Cardinality explosion in custom metrics. Adding a high-cardinality label like user_id or request_id to a custom metric blows past the active time-series limit and gets your metrics dropped silently. High-cardinality dimensions belong in logs or trace attributes, not metric labels.

Confusing log retention with archival. The _Default bucket retains for 30 days. Extending retention on the bucket gets expensive. For multi-year retention (compliance, audit), set up a sink to BigQuery (queryable, columnar) or Cloud Storage (cheap, sequential).

Forgetting that audit logs are different. Admin Activity logs are always-on and free. Data Access logs are off by default and must be enabled per service. For BigQuery and Cloud Storage, enabling Data Access logs can produce massive volume; scope them carefully to the resources that actually need it.

Best Practices

A short list of practices that consistently pay off:

Treat dashboards and alerts as code. Define them in Terraform alongside the pipeline they monitor, so a new pipeline cannot ship without observability.
Standardize log payloads as JSON with a fixed set of fields (severity, pipeline_id, run_id, correlation_id, event_type). Free-form strings make Logs Explorer queries painful.
Define SLOs before alerts. The SLO drives which symptoms matter; alerts on anything else are by definition non-symptom alerts and should be tickets, not pages.
Use exclusion filters aggressively for INFO logs from chatty services. Save the savings for higher retention on the logs that matter.
For each pipeline, document the runbook URL in the alert policy documentation field. The on-call engineer should never have to guess what to do when the page arrives.
Enable Log Analytics on at least one bucket per environment so you can join logs against business data in SQL during incident response.
Send a weekly SLO report (auto-generated query result) to the team channel. Visibility drives improvement.

Real-World Use Case

A mid-size online retailer processes clickstream events for a personalization model. Their pipeline ingests through Pub/Sub at peak 80K messages per second, enriches in Dataflow against a user profile lookup, scores in a Vertex AI online endpoint, and writes results back to BigQuery and Bigtable. Daily volume is around 4 billion events.

Their observability stack looks like this.

Pub/Sub publishes the standard backlog and oldest-unacked metrics. A threshold alert fires at 100K backlog or 5-minute oldest age. Both alerts route to a Pub/Sub topic; a Cloud Function consumes the topic and posts to Slack with a Looker Studio link.

Dataflow emits system lag and watermark age. An MQL alert fires when system lag exceeds 60 seconds for 10 minutes AND current vCPUs equals max vCPUs (meaning autoscaling is saturated). This compound condition avoids false alarms during normal scale-up periods.
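The compound condition can be sketched as a plain function over per-minute samples. This is an illustration of the logic rather than the MQL the team would deploy. The function name, the sample values, and the 64-vCPU cap are assumptions.

```python
def autoscaling_saturated(points, lag_threshold_s, min_consecutive, max_vcpus):
    """points: per-minute (system_lag_s, current_vcpus) samples, oldest first."""
    run = 0
    for lag_s, vcpus in points:
        if lag_s > lag_threshold_s and vcpus >= max_vcpus:
            run += 1
            if run >= min_consecutive:
                return True
        else:
            run = 0  # either signal recovering resets the streak
    return False


# Lag is high in both cases, but only the second is pinned at the worker cap.
scaling_up = [(90, 16 + 4 * i) for i in range(10)]
saturated = [(90, 64)] * 10

print(autoscaling_saturated(scaling_up, 60, 10, 64))  # False
print(autoscaling_saturated(saturated, 60, 10, 64))   # True
```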

BigQuery cost is tracked through a daily scheduled query against JOBS_BY_PROJECT, materialized into a fact table, and surfaced in a Looker dashboard. A log-based metric counts queries with bytes_processed over 1 TB and alerts the data platform team to investigate.

The Vertex AI endpoint is wrapped in OpenTelemetry tracing. Cloud Trace shows that endpoint p99 latency degrades on Mondays at 9 AM, correlating with model retraining jobs sharing the project quota. The team moved retraining to a separate project, eliminating the contention.

SLO: 99.5% of clickstream events are scored and written to BigQuery within 60 seconds, measured over 28 days. At roughly 4 billion events per day, the error budget is about 20 million events per day, or about 560 million events per 28 days (0.5% of roughly 112 billion). A fast-burn alert (1h, 14.4x) wakes the on-call engineer; a slow-burn alert (6h, 6x) creates a Jira ticket. In the past quarter the team has triggered three fast-burn alerts (all genuine incidents) and twelve slow-burn tickets, exactly the signal-to-noise ratio they want.

Exam Tips

A few patterns the PDE exam loves.

When a scenario describes "growing Pub/Sub backlog" or "messages not being processed fast enough," the expected metric is subscription/num_undelivered_messages. When it describes "a single message stuck for hours," the metric is subscription/oldest_unacked_message_age. The first is throughput, the second is poison-pill detection.

When asked "how do I know my Dataflow pipeline is keeping up with real-time," the answer is job/system_lag or job/data_watermark_age. CPU utilization is a distractor; a pipeline can be CPU-saturated and still be on time, or CPU-idle and falling behind on a hot key.

When the scenario mentions "alerting on a specific log message" or "tracking the count of a particular error string," the answer is a log-based counter metric plus an alert on that metric.

When the scenario calls for "long-term retention of audit logs for compliance," the answer is a log sink to BigQuery (for queryable retention) or Cloud Storage (for cheap archival), not increasing the log bucket retention period.

When the question contrasts "static threshold causing false alarms during normal traffic patterns" with options including forecast/anomaly conditions, choose the anomaly condition.

When asked "how do I correlate latency across multiple services," the answer is Cloud Trace. Cloud Logging is for events; Cloud Monitoring is for metrics; Cloud Trace is for spans.

When the scenario involves "running SQL on logs to join with a customer dimension table," the answer is Log Analytics, not exporting logs to BigQuery via a sink (the sink is a heavier solution that you would prefer only if Log Analytics were unavailable).

When the right answer involves "automated remediation triggered by an alert," the notification channel is Pub/Sub, with a Cloud Function or Cloud Run service consuming the topic.

Frequently Asked Questions (FAQ)

How do Cloud Monitoring metrics differ from Cloud Logging log entries?

Cloud Monitoring stores numeric time series indexed by metric type and resource labels, optimized for fast aggregation across millions of points. Cloud Logging stores text or structured event records indexed for substring search. Use Monitoring for "is the rate of X above threshold," use Logging for "show me the exact event that triggered Y." Log-based metrics bridge the two by counting log matches as a metric.

What is the difference between Logs Explorer and Log Analytics?

Logs Explorer uses the Logging Query Language (LQL), a filter-style syntax optimized for fast browse-and-filter on the live log stream. Log Analytics treats a log bucket as a BigQuery dataset and lets you run standard SQL, including joins against external tables. Use Logs Explorer for incident-time browsing, use Log Analytics for analytical questions like "how many distinct customers hit this error in the last week."

When should I use MQL instead of the threshold alerting UI?

Use MQL when you need a ratio between two metrics, a multi-metric join, a topk filter, a windowed aggregation that the UI does not expose, or any condition that requires arithmetic across series. The UI handles "single metric exceeds threshold for duration" cleanly; everything beyond that is easier and more maintainable in MQL.

How do SLO error budgets help reduce alert fatigue?

A symptom-based SLO defines what "good service" means in user-visible terms. Burn-rate alerts only fire when the error budget is being consumed faster than the SLO allows over the chosen window. This filters out resource alerts (high CPU, low memory) that do not actually hurt users, which are the largest source of pager noise in most teams.

Can I trace a record through a Beam pipeline end-to-end?

Beam does not auto-create Cloud Trace spans across the pipeline because the model is asynchronous and batched. You can propagate a trace context through the data itself (as a field on each record) and create spans manually for synchronous external calls inside DoFns. For end-to-end record tracking inside the pipeline, the typical pattern is structured logs with a correlation_id field, queried via Logs Explorer, rather than Cloud Trace.

What notification channel should I use for automated remediation?

Use a Pub/Sub notification channel. The alert publishes a structured payload to a topic; a Cloud Function or Cloud Run service subscribes, parses the alert, and takes action (restart a job, scale a cluster, roll back a deployment). Email, SMS, and PagerDuty are for human notifications, not for triggering code.

How long are metrics and logs retained by default?

Cloud Monitoring retention depends on the metric type: most Google Cloud service metrics are kept for 6 weeks, while user-defined (custom) metrics are kept for 24 months. Cloud Logging's _Default bucket retains for 30 days, and the _Required bucket (Admin Activity audit logs) retains for 400 days and cannot be reduced. For longer retention on operational logs, either extend the bucket retention (more expensive, queryable) or sink to BigQuery / Cloud Storage (cheaper, more steps to query).
