Introduction to Debugging Dataflow and Dataproc Jobs
Debugging Dataflow and Dataproc jobs is the daily reality of any data engineer running production pipelines on Google Cloud. A job that worked last week now hangs at 70% throughput. A streaming pipeline shows a watermark stuck three hours behind wall time. A Spark job on Dataproc dies overnight with cryptic Java stack traces. None of these symptoms tell you the root cause directly — your skill at debugging Dataflow and Dataproc jobs is what turns the symptoms into a fix.
This study note walks through the tools, signals, and recovery patterns that the PDE exam expects you to know, and that real teams actually use when their pipelines are on fire.
Plain English Explanation
Think of a Dataflow job like a sushi conveyor belt restaurant
The chef puts plates of sushi onto a moving belt. Customers grab them as the plates pass by. If one customer takes forever to decide, plates pile up in front of them — that pile is your system_lag. If the chef stops making fresh plates, the belt still moves but no new sushi appears, and the time stamped on the most recent plate (the watermark) falls further and further behind the actual clock on the wall. Debugging Dataflow and Dataproc jobs in this scenario means walking down the belt, asking which station is the slow one, and either adding a second chef (autoscaling) or moving the picky customer to a separate counter (sharding the hot key).
Think of Dataproc like running your own restaurant kitchen
Dataflow is the fully managed conveyor belt. Dataproc is when you rent the kitchen, hire the cooks, and pick the recipes (Spark, Hive, Tez) yourself. You get more control, but you also have to read the cooks' notebooks (driver logs), check what each line cook actually did (executor logs), and unlock the back door so the health inspector can see the kitchen (Component Gateway). When something burns, you walk into the kitchen with a flashlight; you don't get a fancy dashboard handed to you for free.
Think of debugging as being a doctor with three different scopes
A general practitioner uses a stethoscope (the Dataflow monitoring UI) to listen for an obvious wheeze — high system_lag, stuck watermark, OOM crashes. If something sounds off, they order an X-ray (the execution graph) to see which bone is broken — which fused step is the bottleneck. For deeper problems, they order an MRI (Cloud Profiler flame graphs) to see soft tissue — which line of your DoFn.processElement is burning the CPU. The exam loves to test whether you reach for the right scope at the right time, instead of always defaulting to "add more workers."
Core Concepts of Debugging Dataflow and Dataproc Jobs
A Dataflow pipeline has two graphs you need to keep separate in your head. The job graph is what your code expressed: every ParDo, GroupByKey, Window, and Combine you wrote. The execution graph is what the Dataflow service actually runs after applying optimizations — most importantly, fusion, where adjacent steps get collapsed into a single physical stage so data does not have to be serialized between them. Fusion makes pipelines faster but harder to inspect, because the timing you see in the UI may belong to the fused unit rather than your original step.
Two timing signals dominate streaming debugging. System lag measures how long the oldest unprocessed element has been waiting inside a step. Watermark is Dataflow's estimate of "we have probably seen all event-time data up to this timestamp." A healthy streaming job has system_lag in seconds and a watermark that closely tracks wall-clock time. A sick job shows minutes-long lag, a watermark that has stopped moving, or both.
For Dataproc, the mental model is different. You are running open-source engines on managed VMs. The two primary log streams are the driver log (what the application coordinator did) and the executor logs (what each worker container did). Cluster-level events — startup failures, autoscaling decisions, decommissioning — show up in Dataproc's own operational logs in Cloud Logging.
Architecture and Design Patterns for Observability
A well-instrumented data platform is one where debugging starts before the incident. The pattern most teams converge on has four layers.
The first layer is the managed UI: the Dataflow monitoring interface for Dataflow jobs, and the Component Gateway plus the Dataproc job UI for Dataproc. These give you a fast visual triage path — green or red, fast or slow.
The second layer is structured logs in Cloud Logging. Both products stream to it automatically, but you should add your own structured log entries from inside DoFn or Spark code, with severity levels and consistent labels. Log-based metrics let you turn "saw a poison pill record" into a counter you can alert on.
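As a rough illustration of "severity levels and consistent labels", the sketch below formats each log record as one JSON object per line using only Python's standard library. The field names (`severity`, `message`, `labels`) and the `labels` extra attribute are assumptions made for this example; whether a JSON line is parsed into a structured payload depends on the logging agent in your environment.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object on one line."""

    def format(self, record):
        entry = {
            # Python level names (INFO, WARNING, ERROR) double as severities.
            "severity": record.levelname,
            "message": record.getMessage(),
            # `labels` is a hypothetical attribute passed through `extra=`.
            "labels": getattr(record, "labels", {}),
        }
        return json.dumps(entry, sort_keys=True)


def build_logger(name, stream=sys.stdout):
    """Return a logger whose only handler writes JSON lines to `stream`."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger("bids").warning("poison pill record skipped", extra={"labels": {"step": "parse"}})` then produces an entry that a log-based metric could count by label.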
The third layer is Cloud Monitoring metrics: built-in metrics like job/system_lag, job/data_watermark_age, plus custom metrics emitted via Beam's Metrics API or Spark accumulators.
The fourth layer is deep profiling: Cloud Profiler for Dataflow, and the Spark UI / YARN ResourceManager / Tez UI for Dataproc, reached through Component Gateway.
Wire all four layers up before you launch a pipeline to production. Trying to retrofit observability while a job is on fire is the most expensive way to do it. Reference: https://cloud.google.com/dataflow/docs/guides/troubleshoot-pipelines
GCP Service Deep Dive: Dataflow Monitoring
The Dataflow monitoring interface in the Cloud Console is the first place to go when a job misbehaves. The execution graph view shows every step as a box. Each box exposes elements added per second, estimated size, system lag for streaming, and a tiny per-step watermark. Click into a step and you see the wall time consumed and the count of elements processed.
The "Job metrics" tab gives you the time-series chart for System Lag, Data Freshness, Backlog seconds, and Backlog bytes for streaming jobs, plus throughput and worker count over time. For batch jobs, focus on the Elements added rate per stage and the duration per stage.
Worker logs are reachable from the right-hand panel. There are several log streams: job-message (high-level lifecycle events), worker (what the harness saw), worker-startup (VM boot), shuffler, and your own user logs. When a job fails, do not stop at the first ERROR you see — search backwards in time for the first WARNING that preceded the cascade.
GCS plays a quiet but critical role. The --stagingLocation and --tempLocation flags point to GCS buckets where the SDK uploads JARs and other staged pipeline files and where the job writes temporary files. If your service account loses write access to those buckets (for example, roles/storage.objectAdmin), new job submissions fail before any work is processed. The same applies to template submission via --templateLocation.
The Dataflow SDK uploads your job's classpath to the staging bucket on every submission. If the staging bucket is in a different region than your worker zone, you pay egress and your job startup gets slower. Keep staging, temp, and worker location in the same region. Reference: https://cloud.google.com/dataflow/docs/guides/specifying-exec-params
GCP Service Deep Dive: Cloud Profiler for Dataflow
Cloud Profiler attaches a low-overhead sampling profiler to your Dataflow workers. You enable it by adding --dataflowServiceOptions=enable_google_cloud_profiler (Java) or --dataflow_service_options=enable_google_cloud_profiler (Python) when you submit the job; Java heap profiling additionally needs --dataflowServiceOptions=enable_google_cloud_heap_sampling. Once running, Profiler shows flame graphs split by service name (your job name) and version (your job ID).
The two profile types you will use most are CPU time and Heap. CPU time shows where processor time goes, usually inside DoFn.processElement or, painfully often, inside coder serialization. Heap shows what is allocated, which catches the memory leak that eventually becomes an OOM.
A typical workflow: see a slow stage in the execution graph, open Profiler filtered to the same job ID, look for a wide bar in the flame graph that corresponds to user code, then check whether that code is doing something silly like compiling a regex inside the inner loop or fetching a large lookup table on every element. The wins from Profiler tend to be 2x to 10x speedups, not 5%.
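The "something silly" pattern above can be sketched in plain Python. The two classes below are not Beam code; they only mimic a DoFn's `setup()` and `process()` lifecycle, and `load_table` is a hypothetical loader standing in for an expensive lookup such as a BigQuery read.

```python
import re


class EnrichSlow:
    """Anti-pattern: compiles the regex and loads the table for every element."""

    def __init__(self, load_table):
        self.load_table = load_table  # hypothetical expensive loader

    def process(self, line):
        pattern = re.compile(r"^(\w+),(\d+)$")
        table = self.load_table()
        match = pattern.match(line)
        if match:
            yield table.get(match.group(1), "unknown"), int(match.group(2))


class EnrichFast:
    """Same output, but the one-time work is hoisted into setup()."""

    def __init__(self, load_table):
        self.load_table = load_table
        self.pattern = None
        self.table = None

    def setup(self):
        # Runs once per instance, before any element is processed.
        self.pattern = re.compile(r"^(\w+),(\d+)$")
        self.table = self.load_table()

    def process(self, line):
        match = self.pattern.match(line)
        if match:
            yield self.table.get(match.group(1), "unknown"), int(match.group(2))
```

In a flame graph, the slow variant would show the loader and regex compilation as wide frames under the per-element call; the fast variant moves that cost out of the hot path.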
GCP Service Deep Dive: Dataproc Component Gateway
Dataproc clusters expose web UIs for the underlying open-source components, but they live on private network ports inside your VPC. Component Gateway provides an authenticated HTTPS proxy through Google's frontend, so you can click into the Spark UI, YARN ResourceManager, MapReduce JobHistory, Tez UI, HDFS NameNode, Zeppelin, JupyterLab, and Hive UIs without opening firewall holes or running an SSH tunnel.
Enable it at cluster creation with --enable-component-gateway. The links then appear in the Dataproc cluster details page in the Cloud Console.
For Spark debugging specifically, the Spark UI is irreplaceable. The "Stages" tab tells you which stage is slow and which task within that stage is the laggard. The "SQL" tab shows you the physical plan and the time spent at each operator — this is how you catch a Cartesian join you did not intend, or a BroadcastHashJoin that fell back to SortMergeJoin because the broadcast threshold was exceeded.
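To make "the laggard" concrete, here is a small sketch that flags tasks whose duration is far above the stage median. It assumes you have copied per-task durations (in seconds) out of the Stages tab into a dictionary; the function name and the 3x factor are illustrative choices, not Spark defaults.

```python
from statistics import median


def find_stragglers(task_durations, factor=3.0):
    """Return task ids slower than `factor` x the median, slowest first.

    task_durations: dict mapping task id -> duration in seconds.
    """
    if not task_durations:
        return []
    typical = median(task_durations.values())
    slow = [task for task, secs in task_durations.items() if secs > factor * typical]
    return sorted(slow, key=lambda task: task_durations[task], reverse=True)
```

One task far above the median usually points to a skewed partition rather than a slow cluster.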
Component Gateway: a Dataproc feature that publishes web UIs of cluster components (Spark, YARN, Tez, Hive, JupyterLab, etc.) over an authenticated HTTPS proxy, eliminating the need for SSH tunnels or public IPs. Reference: https://cloud.google.com/dataproc/docs/concepts/accessing/dataproc-gateways
To fix a bug in a running streaming Dataflow pipeline without losing in-flight state, launch the replacement job with --update and the running job's name. Do not cancel or drain first, because --update replaces a job that is still running. The service runs a job-compatibility check on the new execution graph; if step names changed, use --transformNameMapping so state migrates from the old GroupByKey/window steps to the new ones. For Dataproc Spark jobs, the equivalent diagnostic surface is the Spark UI's Stages and SQL tabs reached via Component Gateway; that is where you catch an unintended Cartesian join or a BroadcastHashJoin that fell back to SortMergeJoin.
Reference: https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline
Worker Logs vs Job Logs vs Driver vs Executor Logs
Mixing these up is one of the most common mistakes in debugging Dataflow and Dataproc jobs. They live in different places and tell different stories.
For Dataflow, job logs (the job-message stream) are emitted by the Dataflow service: "starting job", "autoscaling to N workers", "job succeeded". They tell you what the control plane did. Worker logs are emitted by the worker VMs and contain stack traces from your user code, JVM warnings, and the Beam SDK's own messages. Both streams use the resource type dataflow_step, so narrow by log name when you need only one of them. When you have an exception, you almost always want worker logs.
For Dataproc, the driver log is the output of the application's main method — for Spark, this is your SparkSession and any println or logger.info calls from the driver code. It is captured to GCS at gs://<staging-bucket>/google-cloud-dataproc-metainfo/<cluster-uuid>/jobs/<job-id>/driveroutput*. The executor logs are per-container YARN logs, which Dataproc forwards to Cloud Logging when the cluster has logging enabled.
The query patterns are worth memorizing. To find Dataflow worker errors:
resource.type="dataflow_step"
resource.labels.job_id="<JOB_ID>"
severity>=ERROR
To find Dataproc executor logs for a specific job:
resource.type="cloud_dataproc_cluster"
resource.labels.cluster_name="<CLUSTER>"
jsonPayload.application="<APPLICATION_ID>"
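If you keep these filters in a runbook, a small helper avoids copy-paste mistakes. The sketch below only assembles the two filter strings shown above; the function names are hypothetical, and the identifiers you pass in are placeholders for your own job, cluster, and application IDs.

```python
def dataflow_worker_errors(job_id, min_severity="ERROR"):
    """Build the Dataflow filter above; one condition per line means AND."""
    return "\n".join([
        'resource.type="dataflow_step"',
        f'resource.labels.job_id="{job_id}"',
        f"severity>={min_severity}",
    ])


def dataproc_application_logs(cluster_name, application_id):
    """Build the Dataproc filter above for one application on one cluster."""
    return "\n".join([
        'resource.type="cloud_dataproc_cluster"',
        f'resource.labels.cluster_name="{cluster_name}"',
        f'jsonPayload.application="{application_id}"',
    ])
```

Passing `min_severity="WARNING"` widens the Dataflow filter, which helps when you are searching backwards for the first warning before a cascade of errors.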
Common Pitfalls and Trade-offs
The first pitfall is misreading fused steps. You see a single box in the execution graph that takes 80% of the job's wall time, and you assume the bottleneck is the last transform in that fused unit. It might be the first one. To break apart fusion for diagnostic purposes, insert a Reshuffle.viaRandomKey() between the steps you want to measure separately. Remove it once you have your answer — Reshuffle adds shuffle cost.
The second pitfall is conflating slow worker with deadlock. A truly deadlocked pipeline shows zero throughput across all workers and a watermark that has stopped advancing. A slow worker shows uneven CPU — one or two workers pegged at 100%, the rest mostly idle, and throughput proportionally low. The fix for deadlock is usually to look at external dependencies (a downstream sink that stopped accepting writes). The fix for a slow worker is usually a hot key.
The third pitfall is trusting the autoscaler blindly. Streaming Engine and the Horizontal Autoscaler will add workers when backlog grows, but if your bottleneck is a hot key or a slow sink, more workers will not help. You will pay for idle CPU while system_lag stays stuck.
The Dataflow UI's "Hot key" warning fires when one key dominates the work distribution. Adding more workers does nothing — those workers will sit idle while the unlucky one keeps grinding. The fix is to redistribute the key with salting, an upstream Combine.perKey, or a windowed pre-aggregation.
Reference: https://cloud.google.com/dataflow/docs/guides/common-errors#hot-key-detected
The fourth pitfall is using the default service account in production. Both Dataflow and Dataproc default to the Compute Engine default service account, which has Project Editor in many older projects. When something later breaks because of a missing permission, you cannot tell whether the issue is in the service account or in your IAM bindings, because the SA can do too much. Use a custom least-privilege service account.
Best Practices for Debugging in Production
A short list of habits that pay back the time you invest in them.
- Always set --jobName to something human-readable that includes the environment and date. Auto-generated job names sort poorly in the console.
- Tag jobs with labels (--labels=team=data,env=prod) so you can filter logs and bills.
- Emit a custom Beam metric at the start and end of each major transform; counters cost nothing and give you a fast sanity check.
- For streaming jobs, set up an alert on job/data_watermark_age over five minutes. Watermark drift is your earliest warning.
- For batch jobs, set a job-level deadline. A pipeline that should finish in 30 minutes but ran for 6 hours is silently telling you something is wrong.
- Keep an "incident runbook" per pipeline that lists the top three failure modes and the first command to run for each.
- For Dataproc, prefer ephemeral clusters per job over long-lived shared clusters. Easier to reason about, easier to diff between runs.
- Pin Beam SDK and Dataproc image versions explicitly. Auto-latest will eventually break you.
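The first two habits are easy to automate. The sketch below builds a job name from pipeline, environment, and date, then emits the two flags in the form used in this note. The naming rule in `JOB_NAME_RE` (lowercase letters, digits, and hyphens, starting with a letter) is an assumption to verify against the current Dataflow documentation, and the function names are hypothetical.

```python
import re
from datetime import date

# Assumed naming rule: lowercase letters, digits, hyphens; starts with a letter.
JOB_NAME_RE = re.compile(r"^[a-z]([-a-z0-9]*[a-z0-9])?$")


def build_job_name(pipeline, env, run_date):
    """Return a readable job name such as 'bid-aggregator-prod-20241126'."""
    raw = f"{pipeline}-{env}-{run_date:%Y%m%d}".lower()
    # Replace any run of disallowed characters with a single hyphen.
    name = re.sub(r"[^a-z0-9-]+", "-", raw).strip("-")
    if not JOB_NAME_RE.match(name):
        raise ValueError(f"cannot build a valid job name from {raw!r}")
    return name


def submission_flags(pipeline, env, team, run_date):
    """Return the --jobName and --labels flags for one submission."""
    return [
        f"--jobName={build_job_name(pipeline, env, run_date)}",
        f"--labels=team={team},env={env}",
    ]
```

Putting the date last keeps runs of the same pipeline and environment adjacent when the console sorts by name.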
For streaming Dataflow debugging, the diagnostic order is: watermark age first, then system_lag per stage, then per-stage throughput, then worker logs, then Profiler. Going out of order wastes time. Reference: https://cloud.google.com/dataflow/docs/guides/using-monitoring-intf
Real-World Use Case: A Mid-Size Adtech Pipeline Goes Sideways
A 40-engineer adtech company runs a streaming Dataflow pipeline that ingests bid request events from Pub/Sub, joins them against an advertiser metadata cache loaded from BigQuery, windows them into one-minute tumbling windows, and writes aggregated counts to Bigtable. Steady-state throughput is around 80,000 events per second, with 25 n1-standard-4 workers.
One Tuesday morning at 09:14, the on-call engineer gets paged: data_watermark_age exceeded five minutes. They open the Dataflow monitoring UI. The execution graph shows the join stage has system_lag of 8 minutes and climbing. Throughput on that stage has collapsed from 80k/sec to 12k/sec. Worker count is at 80, the autoscaler maximum, and they are mostly idle.
They check worker logs filtered to severity>=WARNING and find the smoking gun: a Hot key detected warning naming a single advertiser ID. A quick BigQuery query confirms that one advertiser's traffic spiked 50x. It turns out to be a Black Friday early-access campaign that nobody warned the data team about.
The fix has two parts. Short term: relaunch with --update under the same job name so in-flight state is preserved (do not cancel, and do not drain first, because --update replaces a job that is still running), this time with a salted key (advertiser_id concatenated with random.nextInt(16)) for the per-minute aggregation, then a second pass that strips the salt and re-aggregates. Long term: add an alert on per-key throughput skew so the team catches campaign launches before they become incidents.
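One possible shape for the per-key skew alert is sketched below: sample the grouping keys over a short interval and alert when the hottest key exceeds some share of the total. The 20% threshold and the function names are assumptions chosen for illustration, not values from the incident.

```python
from collections import Counter


def key_skew(keys):
    """Return (hottest_key, share_of_total) for a sample of grouping keys."""
    counts = Counter(keys)
    total = sum(counts.values())
    if total == 0:
        return None, 0.0
    key, count = counts.most_common(1)[0]
    return key, count / total


def should_alert(keys, max_share=0.2):
    """True when a single key accounts for more than `max_share` of the sample."""
    _key, share = key_skew(keys)
    return share > max_share
```

A sensible threshold depends on how many distinct keys you normally see; with thousands of advertisers, even a few percent for one key may be worth a warning.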
Total time from page to recovery: 47 minutes. Total time the engineer spent in the Dataflow UI vs. logs vs. Profiler: 8 minutes UI, 5 minutes logs, 0 minutes Profiler. The UI was enough because the hot key warning was already explicit.
Exam Tips for the PDE Exam
The PDE exam treats this topic as scenario-based. You will get a paragraph describing symptoms and you must pick the most likely cause or the right next diagnostic step. A few patterns repeat.
If the question mentions "watermark not advancing" or "data freshness increasing" in a streaming pipeline, the answer is almost always either a slow external sink, a stuck windowing function waiting for late data, or a hot key. It is almost never "add more workers."
If the question mentions "OOM" or "java.lang.OutOfMemoryError" on Dataflow workers, look for answers that involve switching to a highmem machine type, reducing numberOfWorkerHarnessThreads, or rewriting GroupByKey followed by an in-memory list into a Combine.perKey with an associative-commutative function.
If the question mentions Dataproc cluster creation failing in PROVISIONING state, the cause is one of three things: an init action script that returned a non-zero exit code, a missing service account permission for Compute Engine or GCS, or a quota issue. For an init action failure, start with the script's output log in the staging bucket.
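Why the Combine rewrite helps with memory can be seen in a plain-Python sketch. The class below mirrors the method names of Beam's Python CombineFn (`create_accumulator`, `add_input`, `merge_accumulators`, `extract_output`) but does not depend on Beam; `combine_in_shards` is a hypothetical helper that simulates partial combining on separate workers.

```python
class MeanCombiner:
    """Mean via a (sum, count) accumulator; the full list is never held."""

    def create_accumulator(self):
        return (0.0, 0)

    def add_input(self, acc, value):
        total, count = acc
        return (total + value, count + 1)

    def merge_accumulators(self, accs):
        # Merging is associative and commutative, so order does not matter.
        total, count = 0.0, 0
        for part_total, part_count in accs:
            total += part_total
            count += part_count
        return (total, count)

    def extract_output(self, acc):
        total, count = acc
        return total / count if count else float("nan")


def combine_in_shards(combiner, shards):
    """Simulate partial combines per shard, then one final merge."""
    partials = []
    for shard in shards:
        acc = combiner.create_accumulator()
        for value in shard:
            acc = combiner.add_input(acc, value)
        partials.append(acc)
    return combiner.extract_output(combiner.merge_accumulators(partials))
```

Each shard keeps two numbers instead of every value for the key, which is the difference between constant memory and an OOM on a large group.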
If the question mentions "need to fix a bug in a running streaming Dataflow pipeline without losing in-flight data", the answer is the --update flag, which performs a job-compatibility check and migrates state from the old job to the new one. If the pipeline graph changed in incompatible ways, you may need to use a --transformNameMapping to tell the service which old step maps to which new step.
If the question mentions "need to inspect Spark UI for a Dataproc cluster without opening firewall ports", the answer is Component Gateway.
If the question mentions a Dataflow job that worked yesterday and fails today with no code change, suspect three things in order: an SDK auto-upgrade if you did not pin a version, a service account permission change (check Audit Logs), or upstream data shape change (a new field, a new high-cardinality value).
Frequently Asked Questions
What is the difference between Dataflow's job graph and execution graph?
The job graph is the logical pipeline you constructed in code — every ParDo, GroupByKey, and windowing transform as you wrote it. The execution graph is what the Dataflow service runs after optimization, most importantly fusion, which collapses adjacent steps into a single physical stage. The execution graph is what you see in the monitoring UI, and it is where wall time is actually charged. When you cannot find a step from your code in the UI, it has likely been fused into its neighbor.
How do I tell whether my Dataflow streaming pipeline is slow or stuck?
Look at three signals together. data_watermark_age tells you how old the latest fully-processed event-time is. system_lag tells you how long the oldest in-flight element has been waiting. Per-stage throughput tells you whether work is being completed at all. A slow pipeline shows steady positive throughput but rising lag and watermark age. A stuck pipeline shows zero throughput and a watermark that has stopped moving, often pointing to a downstream sink that is rejecting writes or a deadlocked external dependency.
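The decision rule above can be written down as a small function. This is a simplified sketch, assuming each argument is a list of recent samples ordered oldest to newest and treating "rising" as "the last sample is higher than the first"; real alerting would use proper thresholds and time windows.

```python
def is_rising(samples):
    """True when the newest sample is higher than the oldest."""
    return len(samples) >= 2 and samples[-1] > samples[0]


def classify_streaming_health(throughput, system_lag, watermark_age):
    """Return 'healthy', 'slow', or 'stuck' from three signal histories."""
    falling_behind = is_rising(system_lag) or is_rising(watermark_age)
    if not falling_behind:
        return "healthy"
    # Zero throughput in every sample means no work is completing at all.
    if all(rate == 0 for rate in throughput):
        return "stuck"
    return "slow"
```

A "stuck" result is the cue to check sinks and external dependencies first; a "slow" result sends you to per-stage lag and hot key warnings.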
When should I use the --update flag instead of cancelling and restarting?
Use --update whenever you need to deploy a code change to a streaming pipeline without losing in-flight state, buffered windows, or the current watermark. The service performs a compatibility check on the new job graph against the running one; if compatible, it transfers state across. Use cancel-and-restart only when state can be safely dropped, when the change is incompatible (different windowing, removed steps), or when you want to start from a clean checkpoint. For batch jobs, --update does not apply — re-run from the source.
My Dataproc job's driver log shows success but the Spark UI shows failed tasks. Which one do I trust?
Both, because they describe different things. The driver log records what the application coordinator decided. Spark is fault-tolerant: individual tasks can fail and be retried, and as long as the retry succeeds, the application as a whole succeeds. Failed tasks in the Spark UI are normal in small numbers; they only become a problem when you see many failures from the same executor (suggesting a node-level issue) or many failures of the same task across multiple executors (suggesting a data issue, like a corrupt input record or a skewed partition that exceeds memory).
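The two warning patterns can be checked mechanically. The sketch below assumes you have extracted failed task attempts as (task_id, executor_id) pairs, for example by hand from the Spark UI; the function name and the threshold of 3 are illustrative assumptions.

```python
from collections import Counter


def diagnose_task_failures(failures, threshold=3):
    """Classify failed attempts given (task_id, executor_id) pairs."""
    failures = list(failures)
    by_executor = Counter(executor for _task, executor in failures)
    executors_per_task = {}
    for task, executor in failures:
        executors_per_task.setdefault(task, set()).add(executor)

    findings = []
    # Many failures on one executor suggest a node-level problem.
    for executor, count in sorted(by_executor.items()):
        if count >= threshold:
            findings.append(f"node issue suspected on executor {executor}")
    # One task failing on many executors suggests bad or skewed data.
    for task, executors in sorted(executors_per_task.items()):
        if len(executors) >= threshold:
            findings.append(f"data issue suspected for task {task}")
    return findings or ["failures look like normal retries"]
```

Anything below the threshold is treated as ordinary retry noise, which matches the point that a few failed tasks are normal.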
What is a "hot key" and why does adding more workers not fix it?
A hot key is a single grouping key that receives a disproportionate share of the data, for example one user_id with 1,000,000 events while typical users have 100. In a GroupByKey or per-key aggregation, all elements for one key must be processed by the same worker, so that one worker becomes the bottleneck regardless of how many other workers are idle. Adding more workers does nothing because the work cannot be subdivided further. Fixes include: pre-aggregating with Combine.perKey (which can run partial combines on every worker before the shuffle), salting the key with a random suffix and aggregating in two passes, or moving the hot key's traffic into a separate pipeline branch.
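The two-pass salting fix is easier to follow as a simulation. The sketch below is plain Python, not Beam: pass 1 counts per (key, salt) pair, which a distributed runner could spread across workers, and pass 2 strips the salt and sums the partial counts. The function name, the default of 16 salts, and the `seed` parameter are assumptions for the example.

```python
import random
from collections import Counter


def salted_count(events, num_salts=16, seed=None):
    """Count events per key in two passes using a random salt.

    events: iterable of keys, one entry per event.
    Returns (partial_counts_by_salted_key, total_counts_by_key).
    """
    rng = random.Random(seed)

    # Pass 1: aggregate per (key, salt); a hot key is split into
    # up to `num_salts` smaller groups.
    partial = Counter()
    for key in events:
        partial[(key, rng.randrange(num_salts))] += 1

    # Pass 2: strip the salt and sum the partial counts per original key.
    totals = Counter()
    for (key, _salt), count in partial.items():
        totals[key] += count
    return partial, totals
```

The second pass handles at most `num_salts` records per key, so it stays cheap even when the first pass absorbed millions of events. Salting in this form only works for aggregations that can be combined in stages, such as counts and sums.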
How do I enable Cloud Profiler on a Dataflow job, and what should I look for first?
For Java jobs, add --dataflowServiceOptions=enable_google_cloud_profiler (and grant roles/cloudprofiler.agent to the worker service account). For Python jobs, use --dataflow_service_options=enable_google_cloud_profiler. Once data flows in, open Cloud Profiler in the console and filter by service name (your job name) and time range. Look first at CPU time flame graphs for unusually wide frames inside DoFn.processElement; those are your hot loops. Then check Heap profiles for allocations that grow over time, which point to leaks or unbounded caches.