Data Processing with Dataproc and Spark

Q: How do I prevent data loss when scaling down a Dataproc cluster?

Set gracefulDecommissionTimeout in the autoscaling policy to give in-flight tasks time to complete before workers are removed. A typical value is 1 hour for batch workloads. Avoid autoscaling clusters that cache data on executors with cache() or persist() because removed workers lose cached partitions and force stage recomputation. For caching-heavy workloads, use a fixed worker count or migrate to Dataproc Serverless which manages capacity transparently.

Q: What is the difference between repartition and coalesce in Spark?

repartition(n) performs a full shuffle to redistribute data evenly across n partitions, and can increase or decrease partition count. coalesce(n) merges existing partitions without a shuffle, can only decrease the count, and may leave data skewed. Use coalesce to reduce the number of output files cheaply before writing. Use repartition when you need balanced partitions or want to increase parallelism. Avoid coalesce(1) for large datasets because it forces all data through a single executor.

Introduction to Data Processing with Dataproc and Spark

Data Processing with Dataproc and Spark sits at the centre of every Hadoop and Spark migration story on Google Cloud. The PDE exam expects you to pick the right cluster shape, tune Spark jobs, and decide when serverless beats a long-running cluster.

This study note walks through the mechanics that actually move the needle on cost and performance. You will see when a single-node sandbox is enough, when high availability earns its keep, and when Dataproc Serverless for Spark removes the cluster entirely. The goal is muscle memory: by the end you should reach for the right pattern without checking the docs.

白話文解釋（Plain English Explanation）

Before diving into YARN, executors, and shuffle partitions, three analogies help anchor the mental model behind Data Processing with Dataproc and Spark.

Analogy 1: A pop-up restaurant kitchen

A traditional Hadoop cluster is like owning a restaurant. You sign a long lease, hire chefs full time, and pay rent whether the dining room is empty or packed. Dataproc flips this into a pop-up kitchen. You rent the venue for one wedding, bring in chefs only for the prep window, then tear everything down at midnight. An ephemeral Dataproc cluster behaves exactly the same way: spin it up for one job, run it, delete it.

Dataproc Serverless for Spark goes further. There is no kitchen to rent at all. You hand the recipe to a catering company, they cook in their own facility, and you only pay for portions served. No cluster, no idle workers, no init action timing.

Analogy 2: A library reshelving crew

Spark partitioning is like a reshelving crew at a library. If one worker has the entire fiction section while four others sit idle, the job drags on for hours. That is data skew. Repartition redistributes books across workers, but it costs a full reshuffle of the building. Coalesce just merges nearby workers' carts without crossing the floor, which is cheaper but cannot fix imbalance.

Broadcast join is the librarian printing a small genre lookup card for every worker so nobody has to walk to the reference desk. Adaptive Query Execution (AQE) is the floor manager watching live and reassigning carts mid-shift when one worker falls behind.

Analogy 3: A toolbox versus a Swiss Army knife

Picking between Dataproc on Compute Engine, Dataproc on GKE, and Dataproc Serverless is like choosing between a full mechanic's toolbox, a multi-bay garage with shared lifts, and a Swiss Army knife. The toolbox (Dataproc on GCE) gives you every wrench and the freedom to mount custom attachments through initialization actions. The garage (Dataproc on GKE) shares lifts with other tenants so you save floor space. The Swiss Army knife (Dataproc Serverless) fits in your pocket and handles eighty percent of jobs without setup, but you cannot bolt on a custom socket set.

Core Concepts of Data Processing with Dataproc and Spark

Dataproc is Google Cloud's managed Hadoop and Spark service. It provisions clusters in roughly ninety seconds, exposes the full open-source surface (Spark, Hive, Pig, Presto, Flink, HBase), and integrates with GCS, BigQuery, and Pub/Sub through purpose-built connectors.

A managed group of Compute Engine VMs running YARN, HDFS, and selected open-source components. The master node hosts the ResourceManager and HDFS NameNode; workers run NodeManagers and DataNodes. See cluster overview.

The PDE exam tests three things repeatedly: choosing the right cluster mode, tuning Spark for skew and shuffles, and using GCS as the source of truth instead of HDFS. Internalise those and most Dataproc questions become pattern matching.

A few terms surface again and again:

Driver: the JVM that builds the DAG and schedules tasks
Executor: a JVM on a worker that runs tasks and caches data
Stage: a set of tasks separated by shuffle boundaries
Shuffle: the network step that redistributes data by key
Catalyst: Spark SQL's query optimiser
Tungsten: Spark's off-heap execution engine

Architecture and Design Patterns

The reference architecture for Data Processing with Dataproc and Spark on Google Cloud follows a clean separation: storage in GCS, metadata in Dataproc Metastore (or BigQuery), compute in ephemeral clusters or Serverless batches, and orchestration in Cloud Composer or Workflows.

Storage-compute separation

The single biggest win when migrating from on-prem Hadoop is moving HDFS data to GCS. With on-prem Hadoop, HDFS lives on the same nodes as compute, which forces you to keep clusters running just to keep data accessible. On GCP, GCS holds the data permanently and clusters become disposable.

Treat HDFS on Dataproc as scratch space only. Persistent data belongs in GCS so clusters can be deleted without data loss. This is the foundation for ephemeral cluster patterns. See the GCS connector docs.

Ephemeral cluster pattern

The canonical pattern: one cluster per job (or per pipeline DAG). Cloud Composer triggers a DataprocCreateClusterOperator, runs the Spark job via DataprocSubmitJobOperator, then DataprocDeleteClusterOperator tears it down. Total runtime equals job runtime plus ninety seconds, and you only pay for the minutes you used.

Long-running shared cluster pattern

Some teams need a long-running Dataproc cluster: interactive notebooks via Jupyter component gateway, ad-hoc Hive queries, or shared dev environments. These benefit from autoscaling policies and component gateway, but they sacrifice the cost discipline of ephemeral clusters.

Workflow templates

Workflow templates package a job graph with cluster config. The cluster is created on demand, jobs run in order, and the cluster is deleted on completion. This is ephemeral pattern in a single API call without needing Composer.

GCP Service Deep Dive

This section breaks down the three Dataproc form factors and the Spark optimisations that matter for the PDE exam.

Dataproc cluster modes

Dataproc supports three deployment modes that you should be able to distinguish on sight.

Standard mode uses one master and two or more workers. This is the default and covers most production workloads. The master runs YARN ResourceManager, HDFS NameNode, and the Spark History Server.

Single-node mode combines master and worker roles on a single VM. Used for development, learning, and small batch jobs under a few gigabytes. Single-node clusters cannot run YARN in distributed mode and cannot autoscale.

High availability mode runs three masters with ZooKeeper quorum. The HDFS NameNode and YARN ResourceManager run in active-standby pairs. Use this when cluster uptime matters more than cost, typically for long-running shared clusters that host critical pipelines.

Single-node Dataproc clusters look attractive for cheap testing, but they cannot represent production behaviour. Skewed partitions, executor OOMs, and shuffle bottlenecks all behave differently with one worker. Always test scale issues on a small standard cluster instead.

Dataproc Serverless for Spark

Dataproc Serverless removes the cluster entirely. You submit a Spark batch workload and Google Cloud provisions, runs, and tears down the underlying infrastructure. There are no master nodes, no init actions, no SSH access.

Two flavours exist:

Serverless Spark Batches: one-shot batch jobs submitted via gcloud dataproc batches submit spark
Serverless Spark Sessions: interactive sessions for notebooks (BigQuery Studio, Vertex AI Workbench)

Serverless excels at predictable Spark batch workloads under a few hours. You configure executor count, executor memory, and Spark properties, and the platform handles the rest. Cold start is around sixty seconds.

Default to Dataproc Serverless for Spark when the workload is pure Spark, runs in batches, and does not need custom Hadoop components. It eliminates cluster management and bills per DCU-second. Move to Dataproc on GCE only when you need Hive, HBase, custom JARs, or sub-minute startup. See Serverless overview.

Serverless does not support all components. No Hive Metastore on the cluster (use Dataproc Metastore externally), no HBase, no init actions. You bring your own JARs through GCS-hosted dependency lists.

Dataproc on GKE

Dataproc on GKE runs Spark jobs as Kubernetes pods inside an existing GKE cluster. This is the right choice when your organisation already standardises on GKE and wants Spark to share node pools with other workloads.

Trade-offs: you get bin-packing and shared resource pools, but you lose the YARN-native tooling. HDFS is not provided (use GCS). Init actions do not apply; you customise the Spark image instead.

A common pattern is to dedicate a GKE node pool with a taint like dataproc=spark:NoSchedule, then let Dataproc on GKE schedule executors there while application workloads stay on other pools.

Spark optimisation: DataFrame and Dataset APIs

Always prefer the DataFrame or Dataset API over RDDs. DataFrames go through Catalyst, which applies predicate pushdown, column pruning, and join reordering. RDDs are opaque to the optimiser; whatever Scala or Python code you write runs literally.

When using PySpark, prefer DataFrame operations over Python UDFs. A Python UDF forces serialisation between the JVM and a Python worker per row, which destroys Catalyst's vectorised execution. Use built-in functions, pandas_udf (vectorised), or rewrite in Spark SQL.

Spark optimisation: partitioning

Partitioning controls how data is distributed across executors. Two layers matter:

File-level partitioning: how Parquet files are laid out in GCS (e.g., dt=2026-05-12/)
In-memory partitioning: how Spark splits a DataFrame across executors

Aim for partition sizes of 128 MB to 256 MB after filters. Too small and you pay overhead per task; too large and you lose parallelism and risk OOMs.

Spark optimisation: broadcast join

When joining a large fact table to a small dimension, broadcast the small side. Spark sends the dimension to every executor, so each row of the fact joins locally without a shuffle.

from pyspark.sql.functions import broadcast
df = fact.join(broadcast(dim), "country_code")

The default spark.sql.autoBroadcastJoinThreshold is 10 MB. Raise it (cautiously) for tables up to a few hundred megabytes, but do not broadcast tables that risk executor OOM.

Spark optimisation: repartition vs coalesce

Both change partition count, but they behave differently.

repartition(n): full shuffle, can increase or decrease partitions, balances data evenly
coalesce(n): no shuffle, can only decrease partitions, may leave skew

Use coalesce to reduce output file count before writing without a shuffle penalty. Use repartition when data is skewed or you need to increase parallelism.

Calling coalesce(1) before writing to GCS to get a single file looks tidy, but it forces all data through one executor. Memory blows up and the job stalls on the final write. Use repartition(1) if you genuinely need a single file, or write multiple files and merge with gsutil compose afterwards.

Spark optimisation: Adaptive Query Execution (AQE)

AQE, enabled by default since Spark 3.2, dynamically optimises query plans at runtime based on shuffle statistics. It can:

Coalesce small post-shuffle partitions to reduce task overhead
Switch sort-merge join to broadcast join when shuffle data is smaller than expected
Handle skewed joins by splitting large partitions across multiple tasks

Enable with spark.sql.adaptive.enabled=true and spark.sql.adaptive.skewJoin.enabled=true. On Dataproc 2.0+ both are on by default. AQE alone often resolves skew issues that would otherwise need manual salting.

Ephemeral vs long-running clusters

The decision tree is simple:

One scheduled job per night, deterministic SLA: ephemeral cluster
Interactive notebook for a data science team: long-running with autoscaling
Hundreds of ad-hoc queries from many users: long-running with autoscaling and component gateway
Pure Spark batch with no custom components: skip clusters, use Serverless

Ephemeral clusters give you cost isolation, version isolation, and blast-radius isolation. A bad init action does not poison tomorrow's job. A library upgrade does not break a neighbour pipeline.

Autoscaling policies

Dataproc autoscaling adjusts the number of secondary (preemptible or non-preemptible) workers based on YARN pending memory. Primary workers stay fixed because shrinking them risks losing HDFS data.

workerConfig:
  minInstances: 2
  maxInstances: 10
basicAlgorithm:
  cooldownPeriod: 2m
  yarnConfig:
    scaleUpFactor: 0.05
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 1h

scaleUpFactor controls aggressiveness when YARN has pending containers. gracefulDecommissionTimeout lets running tasks finish before a worker is removed, which prevents shuffle data loss.

Autoscaling does not work well with Spark workloads that cache data on executors. When workers scale down, cached partitions are lost and stages restart. Use autoscaling for Hive or batch Spark jobs without persistent caching, and pin worker count for caching-heavy workloads. See autoscaling guide.

Preemptible workers

Preemptible (now called Spot) VMs cost 60-91% less than standard workers but can be reclaimed by Compute Engine with thirty seconds notice. Dataproc handles this gracefully because YARN reschedules failed tasks.

Use them for the secondary worker pool, never for primary workers (which hold HDFS DataNodes). A typical setup: 2 primary workers for stability, 8-50 preemptible secondary workers for elastic capacity. Mix carefully; if more than half your workers are preemptible, a Spot pool drain can stall the job.

BigQuery connector

The Spark BigQuery connector reads and writes BigQuery natively from Spark, using the BigQuery Storage Read API for fast parallel reads.

df = spark.read.format("bigquery") \
    .option("table", "project.dataset.events") \
    .option("filter", "event_date >= '2026-05-01'") \
    .load()

Predicate pushdown applies the filter at BigQuery side, so Spark only pulls matching rows. For writes, the connector supports direct write (recommended) and the legacy GCS staging mode.

On Dataproc 2.1+ the BigQuery connector is preinstalled. On older images, attach via --jars or initialization action. Always pin a connector version compatible with your Spark version (e.g., spark-3.3-bigquery for Dataproc 2.1).

GCS connector

The GCS connector ships with every Dataproc image and presents GCS as a Hadoop-compatible filesystem (gs:// URIs). It handles authentication via the cluster service account and supports parallel reads.

Two performance settings worth knowing:

fs.gs.inputstream.fadvise=AUTO: lets the connector pick sequential or random read mode per file
fs.gs.block.size=128m: aligns block size with Spark's preferred partition size

For wide tables in Parquet or ORC format, GCS plus the connector typically outperforms HDFS at lower cost, especially for read-heavy workloads.

Initialization actions

Initialization actions are shell scripts that run on every node at cluster creation. Use them to install custom Python packages, mount NFS, configure Kerberos, or pre-warm caches.

gcloud dataproc clusters create my-cluster \
  --initialization-actions gs://my-bucket/install-libs.sh \
  --initialization-action-timeout 10m

Best practice: keep init scripts idempotent and under five minutes. A failed init action fails the cluster, so test scripts on a single-node cluster first. Google maintains a public repository of init actions at gs://goog-dataproc-initialization-actions-<region>/.

Init actions run as root on every node. A buggy script can corrupt the cluster image or open security holes. Pin init action versions to a specific GCS object generation (gs://bucket/script.sh#1234567890) so a quiet upstream change does not break tomorrow's deployment.

Common Pitfalls and Trade-offs

Real-world Data Processing with Dataproc and Spark deployments hit the same pitfalls repeatedly.

Treating Dataproc like on-prem Hadoop. Teams migrate HDFS-centric mental models and keep clusters running 24/7 with persistent HDFS data. The whole point of Dataproc is ephemeral compute against GCS storage. Long-running clusters with persistent HDFS recreate every problem you wanted to leave behind.

Over-provisioning master nodes. The master rarely needs more than n1-standard-4. Spending on a high-memory master while starving workers is a classic misconfiguration.

Ignoring shuffle partitions. The default spark.sql.shuffle.partitions=200 is wrong for almost every workload. For a 10 GB dataset, 200 partitions of 50 MB each is fine. For a 1 TB dataset, 200 partitions of 5 GB each will OOM. Tune to roughly 128 MB per partition after filters.

Mixing too many preemptible workers. Beyond 60-70% preemptible ratio, Spot reclaim events start cascading and shuffle data is lost faster than it can be recomputed. Keep a stable spine of standard workers.

Forgetting Dataproc Metastore. When clusters are ephemeral, Hive metadata vanishes with them. Dataproc Metastore is a managed external Hive Metastore that survives cluster deletion. Without it, every new cluster rebuilds table definitions from scratch.

Underestimating cold start. Dataproc Serverless cold start is around 60 seconds; Dataproc on GCE is around 90 seconds. For sub-second SLA, neither fits; route those workloads to BigQuery or Dataflow.

Best Practices

A pragmatic checklist for production Data Processing with Dataproc and Spark:

Default to Dataproc Serverless for batch Spark jobs unless you need Hive or custom components
Use ephemeral clusters with workflow templates or Composer for orchestrated pipelines
Keep persistent data in GCS, not HDFS; use Dataproc Metastore for table metadata
Enable AQE and tune spark.sql.shuffle.partitions to match data size
Use the BigQuery Storage API connector for high-throughput BigQuery reads
Mix 30-50% preemptible workers in the secondary pool, never as primary workers
Pin Dataproc image versions and initialization action GCS object generations
Right-size machine types: prefer fewer larger machines for shuffle-heavy jobs to reduce cross-node shuffles

Real-World Use Case

A regional retailer runs a nightly inventory reconciliation job that joins point-of-sale events from 800 stores with the central product catalogue, then writes the deltas to BigQuery for the buying team's morning dashboards.

Volume: about 200 GB of POS events per day, plus a 5 GB product dimension table. SLA: complete by 06:00 local time.

The original on-prem implementation ran on a 20-node Hadoop cluster that sat idle 22 hours a day. After migration:

POS events land in GCS via Pub/Sub to GCS Dataflow streaming pipeline (raw zone)
A Cloud Composer DAG runs at 02:00 daily
Composer creates an ephemeral Dataproc cluster: 2 standard primary workers, 10 Spot secondary workers, autoscaling to 30
The Spark job reads POS Parquet from GCS, broadcasts the product dim, computes deltas, writes to BigQuery via the storage write API
Composer deletes the cluster on success

Total runtime: 38 minutes. Total cost: about $4.20 per night, down from $1,800 per month for the always-on on-prem cluster. AQE handles the seasonal skew when one product category dominates, so the team has not had to add manual salting.

For a recent A/B test, they ran the same job on Dataproc Serverless. Cold start added 55 seconds, but the cost dropped another 30% because they no longer paid for cluster idle time during the BigQuery write phase. They are migrating production to Serverless once they finish moving a Hive-based fallback path off the cluster.

Exam Tips

The PDE exam asks Data Processing with Dataproc and Spark questions in predictable patterns. Spot these and the answer usually picks itself.

Migration questions that mention "lift and shift Hadoop" almost always point to Dataproc on Compute Engine with GCS as the storage layer. If the question says "minimal code changes" and lists Spark or Hive jobs, Dataproc is the answer over Dataflow.

Cost optimisation questions about Dataproc usually want ephemeral clusters, preemptible workers, and autoscaling. If the workload is pure batch Spark with no Hive, the most cost-effective answer is often Dataproc Serverless.

Performance questions about slow Spark jobs typically point to one of: data skew (answer: AQE skew join or salting), too many small files (answer: repartition before write), or Python UDFs (answer: rewrite in Spark SQL or pandas_udf).

Architecture questions about persistent metadata across ephemeral clusters point to Dataproc Metastore. Persistent table data points to GCS plus Hive external tables.

Trick comparisons: Dataproc vs Dataflow. Dataproc is for existing Hadoop/Spark code. Dataflow is for new pipelines, especially streaming with windows and watermarks. If the question mentions Apache Beam or "exactly-once streaming with windowing", the answer is Dataflow regardless of how Spark-friendly the team is.

Dataproc default ports: YARN ResourceManager 8088, HDFS NameNode 9870, Spark History Server 18080, Jupyter 8123. Component Gateway exposes these via authenticated HTTPS without needing SSH tunnels.

Frequently Asked Questions (FAQ)

When should I choose Dataproc Serverless over a Dataproc cluster?

Choose Dataproc Serverless for Spark when the workload is pure Spark batch, runs less than a few hours, does not require Hive or HBase, and does not need custom OS-level configuration via init actions. Serverless eliminates cluster management and bills per DCU-second, which is usually cheaper than even an ephemeral cluster for short jobs. Stick with Dataproc on GCE when you need Hive, HBase, custom JARs at the OS level, sub-minute startup, or interactive long-running notebooks.

How do I prevent data loss when scaling down a Dataproc cluster?

Set gracefulDecommissionTimeout in the autoscaling policy to give in-flight tasks time to complete before workers are removed. A typical value is 1 hour for batch workloads. Avoid autoscaling clusters that cache data on executors with cache() or persist() because removed workers lose cached partitions and force stage recomputation. For caching-heavy workloads, use a fixed worker count or migrate to Dataproc Serverless which manages capacity transparently.

What is the difference between repartition and coalesce in Spark?

repartition(n) performs a full shuffle to redistribute data evenly across n partitions, and can increase or decrease partition count. coalesce(n) merges existing partitions without a shuffle, can only decrease the count, and may leave data skewed. Use coalesce to reduce the number of output files cheaply before writing. Use repartition when you need balanced partitions or want to increase parallelism. Avoid coalesce(1) for large datasets because it forces all data through a single executor.

How do preemptible workers differ from primary workers in Dataproc?

Primary workers are standard Compute Engine VMs that run HDFS DataNodes and YARN NodeManagers. They cost full price but provide stable storage and compute. Preemptible (Spot) workers are secondary workers that cost 60-91% less but can be reclaimed by Compute Engine with 30 seconds notice. They run NodeManagers but not DataNodes, so losing them does not lose HDFS data. Use 2-4 primary workers for stability and add preemptible workers for elastic capacity, keeping the preemptible ratio under 70%.

Should I use HDFS or GCS for data storage on Dataproc?

Use GCS for all persistent data. HDFS on Dataproc should be treated as scratch space only. GCS is durable, cheaper than persistent disks, separates storage lifecycle from cluster lifecycle, and lets you delete clusters without losing data. The GCS connector ships with every Dataproc image and presents GCS as a Hadoop filesystem with the gs:// prefix. The only reason to use HDFS is for intermediate shuffle data that benefits from local SSD throughput.

How does the BigQuery connector work with Spark on Dataproc?

The Spark BigQuery connector uses the BigQuery Storage Read API to perform fast parallel reads directly from BigQuery storage, bypassing the standard query API. It supports predicate pushdown so filters apply at the BigQuery side. For writes, the direct write mode uses the BigQuery Storage Write API for high-throughput inserts. On Dataproc 2.1+ the connector is preinstalled; on older images attach it via --jars or an initialization action. Always pin a connector version compatible with your Spark version.

Can I run multiple Spark jobs concurrently on the same Dataproc cluster?

Yes. YARN scheduling lets you submit multiple jobs and they share cluster resources according to the configured scheduler (capacity or fair). For ephemeral cluster patterns this is rarely useful because each pipeline gets its own cluster. For long-running shared clusters, fair scheduler is the default and gives every job an equal share of resources. For strict isolation between teams, use separate clusters or Dataproc Serverless where each batch is fully isolated.

Dataflow Architecture Selection — when to choose Dataflow over Dataproc for streaming and complex windowing
Dataproc Modernization Strategies — migrating on-prem Hadoop to Dataproc with minimal code change
BigQuery Data Modeling and Clustering — designing the BigQuery target tables that Spark jobs write to

Introduction to Data Processing with Dataproc and Spark

白話文解釋（Plain English Explanation）

Analogy 1: A pop-up restaurant kitchen

Analogy 2: A library reshelving crew

Analogy 3: A toolbox versus a Swiss Army knife

Core Concepts of Data Processing with Dataproc and Spark

Architecture and Design Patterns

Storage-compute separation

Ephemeral cluster pattern

Long-running shared cluster pattern

Workflow templates

GCP Service Deep Dive

Dataproc cluster modes

Dataproc Serverless for Spark

Dataproc on GKE

Spark optimisation: DataFrame and Dataset APIs

Spark optimisation: partitioning

Spark optimisation: broadcast join

Spark optimisation: repartition vs coalesce

Spark optimisation: Adaptive Query Execution (AQE)

Ephemeral vs long-running clusters

Autoscaling policies

Preemptible workers

BigQuery connector

GCS connector

Initialization actions

Common Pitfalls and Trade-offs

Best Practices

Real-World Use Case

Exam Tips

Frequently Asked Questions (FAQ)

When should I choose Dataproc Serverless over a Dataproc cluster?

How do I prevent data loss when scaling down a Dataproc cluster?

What is the difference between repartition and coalesce in Spark?

How do preemptible workers differ from primary workers in Dataproc?

Should I use HDFS or GCS for data storage on Dataproc?

How does the BigQuery connector work with Spark on Dataproc?

Can I run multiple Spark jobs concurrently on the same Dataproc cluster?

Related Topics

Further Reading

Official sources

More PDE topics