Amazon EMR with Spark/Hive and AWS Step Functions Orchestration

DEA-C01 Domain 3 Tasks 3.1/3.2 EMR + Step Functions: EMR on EC2 vs EKS vs Serverless, master/core/task sizing, instance fleets vs groups, Spark DRA, Hive + S3 external tables, Iceberg/Hudi/Delta, Step Functions Map + Retry, EMR-vs-Glue + MWAA-vs-SFN cost traps.

Amazon EMR with Spark plus AWS Step Functions for orchestration is the canonical large-scale data processing pattern on AWS, and on the DEA-C01 exam EMR Spark and Step Functions are tested in roughly one out of every six Domain 3 questions. The exam frames most EMR questions as "we need to process X TB of data with Y semantics, EMR or Glue or Athena" — and the right answer almost always depends on whether the workload requires custom Spark code, whether the cluster needs to be long-running, and whether the cost model permits per-job billing or requires reserved capacity. Community study guides from Tutorials Dojo, Camille Chang, and VivekR's 30-day roadmap all flag the same gap: candidates know EMR runs Spark but cannot pick between EMR on EC2, EMR on EKS, and EMR Serverless under exam scenarios, and they cannot pick between Step Functions and MWAA under cost scenarios.

This guide is built to put EMR Spark and Step Functions into operational muscle memory. It covers what EMR is and the three deployment models, cluster architecture with master and core and task nodes, instance fleets versus instance groups, Spark on EMR with dynamic resource allocation, Hive on EMR for SQL workloads, Iceberg and Hudi and Delta Lake support, EMR versus Glue selection, Step Functions state machines, the Map state for fan-out parallelism, error handling with Retry and Catch, Standard versus Express workflows, and the canonical EMR-versus-Glue plus MWAA-versus-Step-Functions exam traps that hide in scenario questions.

What Is Amazon EMR?

Definition — EMR with Spark and Step Functions Orchestration. Amazon EMR is AWS's managed big data platform; Step Functions is its serverless workflow orchestrator. Confirm the canonical definitions in the official AWS documentation before relying on third-party summaries — service names and feature scoping have shifted over time (EMR Serverless and the Distributed Map state are both relatively recent additions).

Amazon EMR is the managed big data platform that runs open-source frameworks (Hadoop, Spark, Hive, HBase, Presto, Trino, Flink) on AWS-provisioned clusters. EMR exists for the workload class where data engineers need to write custom Spark code, run multi-hour jobs over petabyte-scale datasets, or use open-source tools that AWS does not offer as managed services individually.

Three Deployment Models

EMR ships in three distinct deployment models. EMR on EC2 is the original — long-running clusters of EC2 instances with the EMR software stack installed. EMR on EKS runs the same Spark workloads on a Kubernetes cluster (EKS), letting teams that have standardized on Kubernetes share the same orchestration plane for ML, microservices, and data processing. EMR Serverless is the per-job-billed model — submit a Spark or Hive job and AWS provisions, runs, and tears down compute behind the scenes with no cluster management.

When EMR Is The Right Tool

EMR is correct when the workload requires custom Spark code (PySpark, Scala, Java) that exceeds what AWS Glue's managed ETL supports, when the data volume is large enough that Spark's distributed shuffle and partitioning matter, when the team uses Spark-ecosystem libraries (Delta Lake, Hudi, Iceberg with custom write paths, MLlib for in-Spark ML), or when an EMR-specific feature like custom JAR submission, Hadoop ecosystem integration, or HBase is required. EMR is the wrong tool when the same job could run on Glue at lower operational overhead and the team does not need EMR-specific features.

EMR vs Glue

This is the single most-tested EMR boundary on DEA-C01. Glue is for managed ETL with relatively simple Spark logic, schema discovery via crawlers, job bookmarks for incremental processing, and Glue Data Catalog integration. EMR is for custom Spark code with full control over the runtime, library versions, and cluster configuration. Cost: Glue bills per DPU-hour (metered per second) at a higher unit rate; EMR bills per instance-hour at lower EC2 rates plus the EMR service charge. For long-running heavy workloads EMR is cheaper; for periodic light ETL Glue is cheaper.

EMR Cluster Architecture

Understanding the master/core/task node split and instance configuration drives every EMR cost and performance question on the exam.

Master Node

The master node (which AWS documentation now calls the primary node) runs the cluster coordinator services — YARN ResourceManager, HDFS NameNode, Spark Master. There is exactly one master per cluster (or three on HA-enabled clusters). The master is the single point of coordination; it does not run task work. Master node failure on a non-HA cluster terminates the cluster, so for long-running production clusters either enable HA or accept the restart-on-failure pattern.

Core Nodes

Core nodes run the YARN NodeManager and HDFS DataNode services — they store HDFS data and run task containers. Avoid Spot instances for core nodes in production, because losing a core node loses the HDFS blocks it stored (unless EMR uses S3 via EMRFS as the storage layer, which is the modern pattern). Core node count is sized to the HDFS storage requirement plus the desired baseline compute.

Task Nodes

Task nodes run only the YARN NodeManager — they execute task containers but do not store HDFS data. Losing a task node does not lose data, only the in-progress tasks (which YARN reschedules). This makes task nodes ideal for Spot instances, often delivering 70 to 90 percent cost savings versus On-Demand. The canonical production pattern: small core node fleet for stability, large task node fleet on Spot for elastic compute.

Instance Fleets vs Instance Groups

Instance Groups define each node group as a single instance type — for example, "core: 5x m5.4xlarge, task: 20x m5.4xlarge." Simple but inflexible — if m5.4xlarge Spot capacity is unavailable, the cluster waits.

Instance Fleets define each node group as a list of acceptable instance types with a target capacity in EC2 units — for example, "task: 200 EC2 units across m5.4xlarge, m5.8xlarge, r5.4xlarge, c5.4xlarge, prefer Spot." EMR allocates whichever combination is available and cheapest, falling back to On-Demand if Spot is fully exhausted. Instance Fleets are the recommended pattern for cost optimization on Spot-heavy clusters.

Use Instance Fleets with Spot capacity for task nodes to deliver 70 to 90 percent cost savings versus On-Demand for elastic compute on EMR. Core nodes should remain On-Demand (or Reserved) because Spot interruption loses HDFS data unless the cluster is fully S3-backed; task nodes are stateless and ideal for Spot. The Instance Fleet model lets EMR allocate across multiple instance types to reduce the chance of Spot capacity exhaustion — declaring "task fleet: 200 EC2 units across six instance types" gives EMR enough flexibility to maintain capacity even when one instance family is constrained. Combine with EMR-managed scaling that adjusts task node count based on YARN pending containers. For DEA-C01 exam scenarios about EMR cost reduction, the answer is almost always "task nodes on Spot via Instance Fleets," not "scale the cluster down."
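
A minimal sketch of this pattern via boto3, assuming placeholder names, subnet, and capacities — an On-Demand master and core fleet plus a Spot task fleet spread across several instance types with On-Demand fallback:

    import boto3

    emr = boto3.client("emr")

    resp = emr.run_job_flow(
        Name="etl-cluster",                         # placeholder
        ReleaseLabel="emr-7.1.0",                   # pick your EMR release
        Applications=[{"Name": "Spark"}],
        ServiceRole="EMR_DefaultRole",
        JobFlowRole="EMR_EC2_DefaultRole",
        Instances={
            "Ec2SubnetIds": ["subnet-0abc1234"],    # placeholder
            "InstanceFleets": [
                {"InstanceFleetType": "MASTER",
                 "TargetOnDemandCapacity": 1,
                 "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}]},
                # Core stays On-Demand: it holds HDFS blocks.
                {"InstanceFleetType": "CORE",
                 "TargetOnDemandCapacity": 4,
                 "InstanceTypeConfigs": [{"InstanceType": "m5.4xlarge"}]},
                # Task fleet: 200 units of Spot across several types; EMR fills
                # capacity with whichever combination is available and cheapest.
                {"InstanceFleetType": "TASK",
                 "TargetSpotCapacity": 200,
                 "InstanceTypeConfigs": [
                     {"InstanceType": "m5.4xlarge", "WeightedCapacity": 16},
                     {"InstanceType": "m5.8xlarge", "WeightedCapacity": 32},
                     {"InstanceType": "r5.4xlarge", "WeightedCapacity": 16},
                     {"InstanceType": "c5.4xlarge", "WeightedCapacity": 16},
                 ],
                 "LaunchSpecifications": {"SpotSpecification": {
                     "TimeoutDurationMinutes": 10,
                     "TimeoutAction": "SWITCH_TO_ON_DEMAND",  # On-Demand fallback
                     "AllocationStrategy": "price-capacity-optimized",
                 }}},
            ],
        },
    )
    print(resp["JobFlowId"])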

EMR Serverless — No Cluster Management

EMR Serverless is the deployment model that eliminates cluster management entirely.

How It Works

You create an EMR Serverless application (a logical container scoped to a Spark or Hive runtime version) and submit jobs to it. AWS provisions worker capacity for the job, runs it, and bills per-second of vCPU and memory used. There is no cluster to size, no master to manage, no task nodes to provision.
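
A minimal sketch of that submit-and-forget flow via boto3; the application name, role ARN, and script path are placeholders:

    import boto3

    emrs = boto3.client("emr-serverless")

    # A logical container pinned to a runtime version -- created once, reused.
    app = emrs.create_application(
        name="nightly-etl-app",        # placeholder
        releaseLabel="emr-7.1.0",
        type="SPARK",
    )

    # Submit a job; AWS provisions workers, runs it, and tears them down.
    emrs.start_job_run(
        applicationId=app["applicationId"],
        executionRoleArn="arn:aws:iam::123456789012:role/emr-serverless-job-role",
        jobDriver={
            "sparkSubmit": {
                "entryPoint": "s3://my-bucket/jobs/myjob.py",
                "sparkSubmitParameters": "--conf spark.executor.memory=16g",
            }
        },
    )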

When EMR Serverless Is Right

Best for sporadic workloads where a long-running cluster would idle. A nightly ETL that runs 2 hours and then needs nothing for 22 hours pays only for those 2 hours on EMR Serverless; on EMR on EC2 you either keep the cluster running (paying for 24 hours of compute to use 2) or eat the 10-15 minute cluster startup on every run. EMR Serverless is also right for ad hoc analyst Spark jobs where cluster sizing is unpredictable.

When EMR Serverless Is Wrong

Long-running streaming applications need EMR on EC2 or EMR on EKS — Serverless is per-job, not for continuous workloads. Workloads requiring HDFS or HBase need EMR on EC2. Very-low-latency interactive workloads sometimes prefer warm clusters over Serverless cold start.

Pricing Model Trade-Off

Serverless bills per second of allocated capacity at a per-unit premium versus EMR on EC2. The break-even depends on cluster utilization: if a cluster runs at 40 percent utilization 24/7, Serverless is cheaper; if it runs at 80+ percent utilization, EMR on EC2 with Reserved or Spot is cheaper.
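
A back-of-envelope sketch of that trade-off; the hourly rates are illustrative placeholders chosen to land on the 40 percent break-even above, not published AWS prices:

    # Assumed rates -- check current AWS pricing for your region.
    ec2_cluster_per_hour = 10.0        # all-in cost of an always-on cluster
    serverless_per_busy_hour = 25.0    # equivalent capacity, billed only while busy

    # Serverless costs less whenever utilization is below the rate ratio.
    break_even = ec2_cluster_per_hour / serverless_per_busy_hour
    print(break_even)                  # 0.4 -> below 40% busy, Serverless wins

    for utilization in (0.2, 0.4, 0.8):
        ec2_monthly = ec2_cluster_per_hour * 720               # 24 h x 30 days
        serverless_monthly = serverless_per_busy_hour * 720 * utilization
        print(utilization, ec2_monthly, serverless_monthly)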

Spark On EMR

Spark is by far the most-used EMR framework for data engineering.

Job Submission

spark-submit --deploy-mode cluster --num-executors 20 --executor-memory 16G --executor-cores 4 myjob.py is the canonical submission. EMR also accepts EMR Steps (a step is a one-shot job submitted via the EMR API), which is the integration point for Step Functions and other orchestrators.
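
A sketch of the same submission expressed as an EMR Step via boto3 — the API surface an orchestrator like Step Functions drives; the cluster ID and S3 path are placeholders:

    import boto3

    emr = boto3.client("emr")
    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",          # placeholder cluster ID
        Steps=[{
            "Name": "nightly-etl",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",  # EMR's generic command runner
                "Args": [
                    "spark-submit", "--deploy-mode", "cluster",
                    "--num-executors", "20",
                    "--executor-memory", "16G",
                    "--executor-cores", "4",
                    "s3://my-bucket/jobs/myjob.py",
                ],
            },
        }],
    )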

Dynamic Resource Allocation

DRA lets Spark scale executors up or down based on workload. Idle executors release after a timeout, freeing resources for other jobs in the same cluster. Required when running multi-tenant Spark on shared clusters; less critical on single-job clusters.

Spark UI And History Server

The Spark UI on the running cluster shows live job progress; the Spark History Server (which EMR archives to S3) shows completed-job analysis. Spark UI is the primary debugging tool for slow Spark jobs — DAG visualization, stage durations, executor utilization.

Spark Configuration Tuning

Critical Spark properties: spark.sql.shuffle.partitions (default 200, often too low for large data), spark.executor.memory and spark.executor.cores (sized to instance type), spark.dynamicAllocation.enabled (DRA toggle), spark.serializer=org.apache.spark.serializer.KryoSerializer (faster serialization). EMR applies sensible defaults but production tuning is workload-specific.
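
A minimal sketch of those knobs set at session build time in PySpark; the values are illustrative, not recommendations:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("tuned-job")
        .config("spark.sql.shuffle.partitions", "2000")  # size to data, not the 200 default
        .config("spark.executor.memory", "16g")          # fit to the instance type
        .config("spark.executor.cores", "4")
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.minExecutors", "2")
        .config("spark.dynamicAllocation.maxExecutors", "100")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .getOrCreate()
    )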

Hive On EMR

Hive is the SQL-on-Hadoop layer with strong S3 external table support.

When Hive Is Right

Hive is correct for SQL-driven batch ETL where Spark SQL would also work but Hive's mature query optimizer, partition handling, and HiveQL syntax are preferred. Hive on EMR commonly serves as the SQL execution engine for legacy ETL jobs migrated from on-premises Hadoop. New workloads usually pick Spark SQL or Athena instead.

External Tables On S3

CREATE EXTERNAL TABLE ... LOCATION 's3://bucket/path/' lets Hive treat S3 prefixes as tables. The Glue Data Catalog can be the metastore (instead of the EMR-local Hive metastore), which makes the same tables visible to Athena, Spark, and other engines.
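
A sketch of that DDL issued from Spark SQL (the same statements work in Hive); the database, table, and bucket are placeholders:

    from pyspark.sql import SparkSession

    # Assumes the cluster is configured to use the Glue Data Catalog
    # as the Hive metastore.
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS sales.orders (
            order_id STRING,
            amount   DOUBLE
        )
        PARTITIONED BY (dt STRING)
        STORED AS PARQUET
        LOCATION 's3://my-bucket/warehouse/orders/'
    """)

    # Discover partitions already present under the LOCATION prefix.
    spark.sql("MSCK REPAIR TABLE sales.orders")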

Open Table Formats — Iceberg, Hudi, Delta Lake

EMR supports all three open table formats with native runtime integration.

Apache Iceberg

ACID transactions, time-travel queries via FOR TIMESTAMP AS OF, partition evolution (changing partitioning without rewriting data), and schema evolution. Iceberg on EMR uses the Glue Data Catalog as the metadata store. Athena engine v3 queries Iceberg tables natively. The DEA-C01 exam guide explicitly lists Iceberg as in-scope.
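
A sketch of a time-travel query from Spark SQL on EMR, assuming the cluster is configured with an Iceberg catalog named glue_catalog backed by the Glue Data Catalog; the catalog name, table, and timestamp are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the table as it existed at a point in time (Spark 3.3+ syntax).
    df = spark.sql("""
        SELECT * FROM glue_catalog.sales.orders
        FOR TIMESTAMP AS OF TIMESTAMP '2024-06-01 00:00:00'
    """)
    df.show()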

Apache Hudi

Optimized for upsert-heavy workloads (CDC pipelines from RDBMS into a data lake). Two table types: Copy-on-Write (rewrites data files on update, faster read) and Merge-on-Read (logs deltas, faster write).

Delta Lake

Originally Databricks-developed, now broadly supported. Similar ACID semantics to Iceberg with slightly different metadata layout. Common on teams that started on Databricks and migrated to EMR.

Choosing Among Them

Iceberg is the canonical AWS-recommended choice for new lake-house architectures because of Glue Data Catalog integration and Athena native support. Hudi is the right answer for upsert-heavy CDC. Delta Lake is the right answer for teams already invested in Delta. The DEA-C01 exam mentions all three but emphasizes Iceberg.

What Are AWS Step Functions?

AWS Step Functions is the state machine orchestration service for chaining AWS service calls into a coordinated workflow.

State Machine Model

A workflow is a JSON-defined state machine: each state is either a Task (call an AWS service or Lambda), a Choice (branch on a condition), a Wait (delay), a Parallel (run states in parallel), a Map (run a state for every item in an array), a Pass (passthrough), or a Fail/Succeed terminal state. Transitions between states are explicit — output of one state becomes input to the next.
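
A minimal sketch of that model — a two-state machine defined as ASL (here a Python dict) and registered via boto3; the Glue job name and role ARN are placeholders:

    import json
    import boto3

    definition = {
        "StartAt": "RunGlueJob",
        "States": {
            "RunGlueJob": {
                "Type": "Task",
                # .sync = wait for the job to finish before transitioning
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": "nightly-etl"},
                "Next": "Done",
            },
            "Done": {"Type": "Succeed"},
        },
    }

    sfn = boto3.client("stepfunctions")
    sfn.create_state_machine(
        name="nightly-pipeline",
        definition=json.dumps(definition),
        roleArn="arn:aws:iam::123456789012:role/sfn-pipeline-role",  # placeholder
    )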

Direct AWS Service Integrations

Step Functions has direct integrations with over 200 AWS services, including Glue (run a job), EMR (add a step, terminate a cluster), Athena (run a query), Lambda (invoke), Redshift Data API (run a SQL statement), and S3 (read/write objects). Direct integrations skip Lambda glue code, reducing cost and complexity.

Standard vs Express Workflows

Standard workflows support up to 1-year execution duration, exactly-once execution semantics, full execution history retention, and per-state-transition billing. Best for long-running data pipelines.

Express workflows support up to 5-minute duration, at-least-once semantics, and per-execution plus per-millisecond-runtime billing. Best for high-volume short-lived workflows like API request processing.

For data engineering pipelines, Standard is almost always correct because pipelines run minutes to hours.

Step Functions Map State — Fan-Out Parallelism

The Map state runs a sub-workflow for every item in an input array — the canonical fan-out pattern.

Inline Map vs Distributed Map

Inline Map runs up to 40 concurrent iterations and supports up to 256 KB state size. Sufficient for small fan-out (process up to a few hundred files in parallel).

Distributed Map runs up to 10,000 concurrent iterations and supports millions of items by reading the input array from S3. The right answer for processing every file in a large S3 prefix or every row of a large dataset.
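
A sketch of a Distributed Map state as ASL (expressed as a Python dict): one child workflow per object under an S3 prefix; the bucket and Lambda function are placeholders:

    distributed_map_state = {
        "Type": "Map",
        # Read the item list from S3 instead of the (256 KB-limited) state input.
        "ItemReader": {
            "Resource": "arn:aws:states:::s3:listObjectsV2",
            "Parameters": {"Bucket": "my-ingest-bucket", "Prefix": "landing/"},
        },
        "ItemProcessor": {
            "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "STANDARD"},
            "StartAt": "TransformFile",
            "States": {
                "TransformFile": {
                    "Type": "Task",
                    "Resource": "arn:aws:states:::lambda:invoke",
                    "Parameters": {
                        "FunctionName": "transform-file",   # placeholder
                        "Payload": {"key.$": "$.Key"},      # the S3 object key
                    },
                    "End": True,
                }
            },
        },
        "MaxConcurrency": 1000,
        "End": True,
    }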

Common Use Cases

Process every Glue table in parallel, run a Glue job for every partition of a large dataset, transform every file landed in an S3 ingestion zone. The Distributed Map state fundamentally changed Step Functions from "linear orchestrator" to "true parallel executor" for data engineering use cases.

Error Handling — Retry And Catch

Production data pipelines fail in transient ways constantly; Step Functions error handling is the discipline that keeps them resilient.

Retry Configuration

Each Task state can declare Retry blocks with error type, max attempts, interval seconds, and backoff rate. A typical Glue job retry: Retry: [{ErrorEquals: ['Glue.AWSGlueException'], MaxAttempts: 3, IntervalSeconds: 30, BackoffRate: 2.0}] retries up to 3 times with exponential backoff.

Catch Blocks

Catch routes specific errors to alternate states — typically a notification state, a compensation state, or a graceful failure state. Catch: [{ErrorEquals: ['States.ALL'], Next: 'NotifyOps'}] routes any uncaught error to an SNS notification.
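
A sketch of the combined pattern on a single Glue Task state (ASL as a Python dict); the job name, error list, and target state names are placeholders for sibling states defined elsewhere in the machine:

    run_glue_job = {
        "Type": "Task",
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {"JobName": "nightly-etl"},
        "Retry": [{
            "ErrorEquals": ["Glue.AWSGlueException", "States.Timeout"],
            "MaxAttempts": 3,
            "IntervalSeconds": 30,
            "BackoffRate": 2.0,            # 30s, 60s, 120s between attempts
        }],
        "Catch": [{
            "ErrorEquals": ["States.ALL"],  # anything Retry did not recover
            "Next": "NotifyOps",
        }],
        "Next": "Done",
    }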

Activity And Callback Patterns

Activity states wait for an external worker to claim and complete the task — useful for human approval steps or external system coordination. Callback patterns let a Lambda or Glue job signal completion back to Step Functions via the SDK, enabling long-running tasks to fit the state-transition model.
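
A sketch of the callback half of the pattern, assuming a Task state that uses .waitForTaskToken and passes the token into the worker's input; the output payload is illustrative:

    import json
    import boto3

    sfn = boto3.client("stepfunctions")

    def handler(event, context):
        # 'taskToken' must be injected into the task input by the state machine,
        # e.g. "TaskToken.$": "$$.Task.Token" in the state's Parameters.
        sfn.send_task_success(
            taskToken=event["taskToken"],
            output=json.dumps({"status": "APPROVED"}),
        )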

Always declare Retry blocks on every Task state in production data pipelines, with exponential backoff configured. AWS service throttling, transient network errors, and ephemeral compute failures are constant in production; without Retry blocks every transient failure becomes a pipeline failure that wakes someone up. The recommended baseline: 3 attempts with 30-second initial interval and 2.0 backoff rate. Combine with Catch blocks routing terminal failures to an SNS notification or a CloudWatch alarm. The cost of Retry blocks in Step Functions is per-state-transition (a few microcents); the cost of pipeline failures from missing them is real. For DEA-C01 exam scenarios about resilient pipelines, the answer is Retry plus Catch, not external monitoring.

Step Functions vs MWAA vs EventBridge

Three orchestration tools cover different parts of the data engineering workflow space. The DEA-C01 exam tests this decision repeatedly.

Step Functions

Best for: state-machine-style pipelines with branching, error handling, parallel fan-out, and direct AWS service integrations. Pricing: per-state-transition (pennies per execution). No baseline cost when idle. Authoring: JSON or AWS Workflow Studio visual editor.

Amazon MWAA (Managed Apache Airflow)

Best for: complex DAG-based pipelines with extensive open-source ecosystem (Airflow operators for hundreds of services), Python-based pipeline definitions, large existing Airflow codebases. Pricing: per-environment-per-hour with a baseline running cost (the environment runs continuously). Authoring: Python DAG files.

EventBridge Scheduler

Best for: scheduled invocation of Lambda, Step Functions, Glue, ECS tasks. Replaces the older EventBridge Rules cron-style scheduling and CloudWatch Events. Pricing: per-invocation. Best for the trigger layer, not the orchestration layer itself.

Decision Rule

If the team has existing Airflow DAGs or strong preference for Python-defined pipelines, MWAA. If the workflow is state-machine-shaped (sequential steps with branching and parallelism), Step Functions. For scheduling triggers that kick off either, EventBridge Scheduler. The cost trap: MWAA has a baseline running cost (around 300 USD/month minimum) that hits hard for sparse pipelines; Step Functions has zero idle cost. The DEA-C01 exam plants this with "the team runs one pipeline a day, what is most cost-effective" — Step Functions plus EventBridge Scheduler.

MWAA has a continuous baseline running cost even when no DAGs execute, making it expensive for sparse data pipeline workloads where Step Functions plus EventBridge Scheduler is dramatically cheaper. A team running 1 pipeline per day might pay 300 USD/month for an MWAA environment versus 1-5 USD/month for the equivalent Step Functions plus EventBridge setup. MWAA is the right answer when DAG complexity, Python-based pipeline definitions, or existing Airflow code outweigh the cost; Step Functions wins on cost for typical workflows. The DEA-C01 exam plants this as "the team has a single nightly batch pipeline and limited orchestration needs, recommend a service" — MWAA is the trap; Step Functions is the answer.

Plain-Language Explanation: EMR Spark And Step Functions

Three concrete analogies make EMR's deployment model and Step Functions' orchestration intuitive.

Analogy 1 — The Construction Site With Three Hiring Models

Imagine a construction company that builds custom houses (Spark jobs). The company has three hiring models. EMR on EC2 is hiring a permanent crew of construction workers (master, core, task nodes) on a long-term contract — the crew shows up every day whether there is work or not, and the foreman (master) coordinates the daily build. EMR on EKS is sharing a worker pool with the company's other divisions (Kubernetes for ML, microservices, batch) — workers come from a pool, dispatched to whichever division needs them today. EMR Serverless is calling a temp agency every time a house needs building — workers arrive, finish the project, leave, and the company pays only for hours worked. Task nodes on Spot are the day laborers — cheaper, but if the day laborer takes a better gig elsewhere mid-shift the project foreman just calls another. Instance Fleets are saying "I need 200 worker-hours total, I do not care if they are carpenters or electricians or general laborers, send what is cheapest today." Step Functions is the project plan binder the foreman carries — a step-by-step schedule with "if framing fails three times, call architect, otherwise proceed to roofing, in parallel to plumbing." MWAA is hiring a full-time project manager who shows up every day to coordinate the binder, paid whether or not there is a project that day. The DEA-C01 cost question is whether you have enough projects to justify the full-time manager (MWAA) or whether the foreman with a binder (Step Functions) suffices.

Analogy 2 — The Restaurant Kitchen And Recipe Coordinator

Picture a restaurant kitchen scaled to industrial volume. The line cooks (Spark executors) need a kitchen (EMR cluster) to work in. Some restaurants own their kitchen 24/7 (EMR on EC2) — high fixed cost but the kitchen is warmed up and ready every minute. Some restaurants share a co-working kitchen with other businesses (EMR on EKS) — pay for the slots you use. Some restaurants only rent kitchen space when an order comes in (EMR Serverless) — zero idle cost but a few minutes of warm-up before each shift. The head chef on duty (master node) coordinates the line cooks (core and task nodes); core cooks store ingredients in the walk-in (HDFS), task cooks just execute orders. Iceberg, Hudi, and Delta Lake are different recipe systems for tracking changes to recipes over time — Iceberg is the AWS-blessed standard, Hudi is for fast-change recipes (CDC), Delta Lake is for restaurants that started on Databricks. Step Functions is the order ticket coordinator who reads "make appetizer, then main, then dessert; if appetizer fails three times call manager; do mains and sides in parallel" — a deterministic state machine routing tickets through the kitchen. The Map state is duplicating the same ticket to every cook for every customer at the table. MWAA is hiring an Airflow DAG coordinator who is great at complex flows but charges a daily rate even on slow days; Step Functions is per-ticket pricing.

Analogy 3 — The Postal Sorting Facility With Sorting Machines And Workflow Software

EMR is the sorting facility itself — racks of sorting machines that physically process packages. EMR on EC2 is owning the facility 24/7 with full staff. EMR on EKS is renting space in a multi-tenant logistics center. EMR Serverless is paying a third-party logistics company per package sorted. Master node is the supervisor checking the daily plan; core nodes are sorters with package storage; task nodes are temp sorters who can leave anytime (Spot instances) without losing inventory. Spark is the sorting program running on the machines; Hive is the SQL-driven label-printing program; Iceberg is the warehouse manifest system that supports point-in-time queries (what was the inventory yesterday at noon). Step Functions is the workflow software running in the office above — "receive truck at 8am, run sorting program, generate labels, dispatch trucks at 5pm; if sorting program fails 3 times, alert manager." Map state is "for every truck in today's incoming list, run the same sorting workflow in parallel." Standard workflows handle 1-year-long shipments; Express workflows handle 5-minute API request processing. MWAA is hiring a full-time workflow consultant; Step Functions is paying per workflow event. EventBridge Scheduler is the office clock that fires "8am, start daily workflow." The DEA-C01 exam asks: read the package volume and frequency, choose the right deployment model, then choose the right orchestrator.

Common Exam Traps For EMR And Step Functions

The DEA-C01 exam plants a stable set of traps. Memorize all five.

Trap 1 — Glue When EMR Is Needed

A scenario describes custom Spark code with non-trivial libraries and large data volumes. Wrong answer: Glue (because "managed ETL" sounds easier). Right answer: EMR. Glue is best for relatively standard ETL; complex Spark belongs on EMR.

Trap 2 — Core Nodes On Spot

A candidate moves all nodes to Spot for cost. Wrong — losing a core node loses HDFS data. Task nodes on Spot, core nodes On-Demand or Reserved.

Trap 3 — MWAA For Sparse Workflows

A scenario has 1 nightly pipeline. MWAA's baseline cost makes it wrong; Step Functions plus EventBridge Scheduler is correct.

Trap 4 — Express Workflows For Long Pipelines

A candidate picks Express for cost. Wrong — Express has a 5-minute limit. Long pipelines must be Standard.

Trap 5 — Inline Map For Large Fan-Out

A scenario fans out to 10,000 items and chooses Inline Map. Wrong — Inline Map tops out at 40 concurrent iterations; Distributed Map handles up to 10,000.

EMR for custom Spark and large data; Glue for managed ETL with simpler logic. EMR Serverless for sparse jobs, EMR on EC2 for steady-state workloads. Task nodes on Spot, core nodes On-Demand. Step Functions Standard for data pipelines (1-year duration); Express for high-volume short workflows (5-minute cap). Distributed Map for fan-out beyond 40 items. MWAA for complex Python DAGs at higher cost; Step Functions for state-machine pipelines at per-transition cost. This is the one paragraph to memorize for every EMR plus Step Functions question on DEA-C01. The recurring exam pattern is choosing among service combinations under cost, scale, and complexity constraints; the right answer is almost always the one that minimizes idle cost while matching workflow shape.

Key Numbers And Must-Memorize EMR And Step Functions Facts

EMR Deployment Models

  • EMR on EC2: long-running clusters, full control
  • EMR on EKS: shared Kubernetes pool
  • EMR Serverless: per-job billing, no cluster

EMR Cluster Architecture

  • Master: 1 (or 3 HA), coordinator only
  • Core: HDFS storage, On-Demand or Reserved
  • Task: stateless, ideal for Spot (70-90% savings)
  • Instance Fleets across multiple instance types for resilience

Open Table Formats On EMR

  • Iceberg: AWS-recommended, Glue Catalog, Athena native
  • Hudi: upsert-optimized, CDC-friendly
  • Delta Lake: Databricks-origin, broadly supported

Step Functions Workflow Types

  • Standard: up to 1 year, exactly-once, per-transition billing
  • Express: up to 5 minutes, at-least-once, per-execution-plus-runtime

Step Functions State Types

  • Task, Choice, Wait, Parallel, Map, Pass, Fail, Succeed
  • Inline Map: up to 40 concurrent
  • Distributed Map: up to 10,000 concurrent

Orchestration Cost Model

  • Step Functions: per-state-transition, zero idle cost
  • MWAA: per-environment-per-hour, ~300 USD/month baseline
  • EventBridge Scheduler: per-invocation triggers

DEA-C01 exam priority — EMR with Spark and Step Functions Orchestration. This topic carries weight on the DEA-C01 exam. Master the trade-offs, decision boundaries, and the cost/performance triggers each AWS service exposes — the exam will test scenarios that hinge on knowing which service is the wrong answer, not just which is right.

FAQ — EMR Spark And Step Functions Top Questions

Q1 — When should I use EMR on EC2 versus EMR Serverless?

Use EMR on EC2 for steady-state workloads where the cluster runs continuously or near-continuously and cluster utilization is high (60+ percent). Reserved or Savings Plans pricing on EC2 plus Spot for task nodes delivers the lowest per-compute-hour cost when utilization is high. Use EMR Serverless for sparse workloads where a cluster would spend most of its time idle, ad hoc Spark jobs where cluster sizing is unpredictable, and teams that want zero cluster management overhead. The break-even is roughly 40 percent utilization — below that EMR Serverless wins on total cost; above that EMR on EC2 wins. EMR on EKS sits between, ideal for organizations that have already standardized on Kubernetes and want to share the cluster pool across data, ML, and microservices workloads.

Q2 — How do I decide between EMR Spark and AWS Glue for an ETL workload?

Use Glue for relatively standard ETL with managed crawlers, job bookmarks for incremental processing, Glue Data Catalog integration, and PySpark code that fits the Glue runtime. Use EMR for custom Spark code with non-standard libraries, large data volumes that benefit from cluster-scale optimization, Hive or HBase requirements, or open-source ecosystem features Glue does not expose. Cost: Glue is per-DPU-hour at a higher rate but with no cluster management overhead; EMR is per-EC2-hour at lower rates but requires cluster sizing and operational care. For periodic light ETL, Glue wins. For long-running heavy workloads, EMR with Spot task nodes wins. The DEA-C01 exam tests this with scenarios about workload characteristics and asks for the correct service.

Q3 — Why use Spot instances on EMR task nodes but not on core nodes?

Core nodes store HDFS blocks; losing a core node loses the data it stored unless the cluster is configured to use S3 as the storage layer (the modern EMRFS pattern, where HDFS is essentially unused). On a cluster that uses HDFS, Spot interruption of a core node corrupts the cluster. Task nodes are stateless — they execute YARN containers but store nothing durable. Losing a task node only loses in-progress task work, which YARN reschedules on a surviving node. This is why the production pattern is core nodes on On-Demand (or Reserved Instances for cost) and task nodes on Spot via Instance Fleets. Spot savings on task nodes are typically 70 to 90 percent versus On-Demand, applied to the largest portion of the cluster.

Q4 — When should I use Step Functions Standard versus Express workflows?

Use Standard for data engineering pipelines that run minutes to hours and require exactly-once execution semantics — Glue jobs, EMR steps, multi-stage ETL, periodic batch processing. Standard supports up to 1-year duration, retains full execution history for auditing, and bills per state transition. Use Express for high-volume short-lived workflows like API request orchestration, IoT event processing, or microservice coordination — Express tops out at 5 minutes per execution, supports at-least-once semantics, and bills per execution plus per-millisecond runtime. For DEA-C01 data pipeline scenarios, Standard is almost always correct because pipelines exceed Express limits and need exactly-once. The trap: candidates pick Express for cost without checking the duration cap.

Q5 — How do I choose between Step Functions, MWAA, and EventBridge Scheduler?

Step Functions is the state-machine orchestration choice for AWS-native pipelines with branching, parallelism, error handling, and direct service integrations. MWAA (Managed Apache Airflow) is the right choice when the team has existing Airflow DAGs, prefers Python-based pipeline definitions, or needs the broad Airflow operator ecosystem. EventBridge Scheduler is the trigger layer — it schedules invocations of Step Functions, MWAA DAGs, Lambda, Glue, or ECS tasks but does not orchestrate state machines itself. Cost rule: MWAA has a continuous baseline cost (~300 USD/month for the smallest environment) that makes it expensive for sparse workloads; Step Functions has zero idle cost. For a single nightly pipeline, Step Functions plus EventBridge Scheduler is dramatically cheaper than MWAA. For dozens of complex DAGs with branching, MWAA's productivity benefit may justify the cost.

Q6 — How do I handle errors and retries in a Step Functions data pipeline?

Declare Retry blocks on every Task state with sensible defaults — typically 3 attempts, 30-second initial interval, exponential backoff with rate 2.0. Pair Retry with Catch blocks that route uncaught errors to a notification state (publish to SNS for ops alerts) or a compensation state (clean up partial work). For Glue or EMR jobs that are idempotent, Retry is safe; for non-idempotent operations like INSERT statements without idempotency tokens, Retry must be paired with deduplication logic to avoid double-processing. The Retry plus Catch pattern transforms transient failures (AWS service throttling, ephemeral instance failures) into successful completions and reserves human-paged failures for true production issues. For DEA-C01 exam scenarios about resilient pipelines, Retry plus Catch is the answer.

Q7 — When does Distributed Map state make sense in Step Functions?

Use Distributed Map when fan-out exceeds the 40-concurrent limit of Inline Map or when the input array exceeds 256 KB. Distributed Map reads the input from S3 (a JSON, JSONL, or CSV file or a list of S3 keys) and runs up to 10,000 concurrent iterations of a child workflow. The canonical use case: process every file in a large S3 prefix, run a Glue job for every partition of a multi-thousand-partition table, transform every row of a large dataset. Inline Map remains correct for small fan-outs (a few dozen items) where the input fits in state. The DEA-C01 exam tests this with scenarios about processing thousands of files in parallel — Distributed Map is the answer.

Further Reading — Official AWS Documentation For EMR And Step Functions

The authoritative AWS sources are the Amazon EMR Management Guide (deployment models, cluster sizing, Instance Fleets), the EMR Release Guide (Spark, Hive, Iceberg, Hudi, Delta Lake configuration), the Step Functions Developer Guide (state types, error handling, Standard vs Express), and the Distributed Map state documentation. The AWS Big Data Blog has multiple deep-dive posts on EMR Spot patterns, Iceberg on EMR, and Step Functions data pipeline patterns. The AWS Well-Architected Data Analytics Lens covers EMR and Step Functions in the data processing and orchestration phases. The AWS Samples GitHub repository contains end-to-end pipeline examples combining EMR, Step Functions, Glue, and Athena. Finally, the Skill Builder DEA-C01 Exam Prep Standard Course has dedicated modules on EMR and orchestration that walk through the canonical exam traps in scenario form.
