
AWS Glue ETL — Job Bookmarks, DynamicFrames, and PySpark

4,000 words · ≈ 20 min read

Master AWS Glue ETL for DEA-C01 Domain 1 Tasks 1.2 and 1.4 — DynamicFrames vs DataFrames, job bookmarks for incremental processing, PySpark transformations, crawler partition handling, the small-file problem and groupFiles, DPU allocation and worker types (G.1X, G.2X), Glue Workflows and triggers, and the high-frequency exam traps that data engineers miss.


AWS Glue is the workhorse ETL service of the AWS data engineering stack and one of the most heavily tested services on the DEA-C01 exam, where Domain 1 (Data Ingestion and Transformation, 34 percent weight) leans on Glue across multiple task statements. Community study guides from Tutorials Dojo, Digital Cloud Training, and ExamCert.App all flag the same pain points — candidates conflate DynamicFrames with DataFrames, misuse job bookmarks, fall into the crawler partition gotcha that creates duplicate tables, and pick the wrong worker type for the workload. The wrong choice on the exam is also the wrong choice in production: a crawler that creates fifty tables instead of one breaks every downstream Athena query, a bookmark configured against the wrong key reprocesses terabytes every run, and a G.1X cluster sized for a Parquet aggregation OOMs an hour into a batch.

This guide is built for the data engineer perspective. It covers what AWS Glue is, the three job types and when to use each, the DynamicFrame abstraction, PySpark transformations on Glue, job bookmarks for incremental processing, crawler behavior including the multi-table partition gotcha, the small-file problem and the groupFiles option, worker types and DPU allocation, Glue Workflows for multi-step pipelines, connection types and VPC configuration, and the canonical exam traps that catch most data engineers. By the end, the Glue ETL surface should feel as familiar as your own kitchen workstation.

What Is AWS Glue?

AWS Glue is a fully managed, serverless ETL (extract, transform, load) service that combines a Spark-based execution engine, a metadata catalog, and a workflow orchestrator into one platform. It exists because building data pipelines on raw EMR or self-managed Spark requires a data engineer to handle cluster provisioning, autoscaling, library management, scheduling, and metadata governance — Glue handles all of that and bills only for the compute consumed during job runs. The DEA-C01 exam treats Glue as the default answer for "managed ETL on AWS" and tests the operational specifics that distinguish it from EMR or Lambda-based alternatives.

Glue Job Types — Spark, Python Shell, Ray

Glue offers three job types, each suited to a different workload size and language. Spark jobs (the most common) run distributed Apache Spark with PySpark or Scala on multiple worker nodes — use them for any ETL workload over a few gigabytes. Python shell jobs run a single Python process with up to 1 DPU, ideal for lightweight tasks like calling APIs, running small SQL transforms, or invoking other AWS services from a scheduled context. Ray jobs (newer) run distributed Python workloads using the Ray framework, suited for parallel Python that does not fit Spark's RDD model — typical for ML preprocessing or graph computations.

Glue Versions And Runtime

Glue version pins the underlying Spark, Python, and library versions. Glue 3.0 ships Spark 3.1, Glue 4.0 ships Spark 3.3 with adaptive query execution, and Glue 5.0 ships Spark 3.5. Each version carries a different price-per-DPU-hour and different feature support — for example, Iceberg native support arrived in Glue 4.0. The exam expects you to know that newer Glue versions ship newer Spark and that adaptive query execution and dynamic partition pruning are post-3.0 features that improve performance without code changes.

Plain-Language Explanation: AWS Glue ETL

The Glue ETL surface is the kind of system where naming alone does not convey the trade-offs. Three concrete analogies make the structure stick.

Analogy 1 — The Restaurant Prep Kitchen And The Recipe Card

Picture a restaurant prep kitchen. The head chef writes a recipe card describing exactly what to chop, in what order, into what container, with what seasoning. The recipe card is the PySpark script — it spells out the transformation logic in code. The prep cooks are the DPU workers — distributed labor that follows the recipe card in parallel. The walk-in cooler holds raw ingredients (the source S3 buckets) and finished prep containers (the target S3 buckets). The head chef's notebook marks which deliveries have already been prepped so the morning team does not redo yesterday's work — that notebook is the Glue job bookmark. The catalog of every ingredient and container in the kitchen is the Glue Data Catalog.

When a Glue job runs, the recipe card (script) is sent to the prep cooks (DPUs), they read raw ingredients (S3 source), follow the recipe (transformations), write into prep containers (S3 target), and the head chef updates the notebook (bookmark). If the kitchen is busy, more prep cooks (more DPUs) finish faster but cost more — pick G.1X for routine recipes (4 vCPU, 16 GB RAM per worker) and G.2X for memory-heavy recipes like aggregating millions of rows (8 vCPU, 32 GB RAM per worker). The crawler is the expediter who walks through the walk-in cooler and updates the catalog when new ingredients arrive — and yes, a sloppy expediter who labels each new delivery as a separate ingredient creates fifty entries for the same tomato, which is exactly the crawler partition gotcha.

Analogy 2 — The Library Acquisitions And Cataloging Department

Picture a research library. The acquisitions department receives boxes of new books (raw S3 data), the catalogers read each book and assign metadata (the crawler populates the Glue Data Catalog), and the transcription team rewrites old manuscripts into modern editions (the ETL job transforms data). The cataloger's logbook records which books have been processed so they do not get cataloged twice — that logbook is the bookmark. The central card catalog is the Glue Data Catalog, queryable by Athena, Redshift Spectrum, and EMR.

The library has different cataloging staff for different volumes of work — a single librarian (Python shell job, 1 DPU) handles a small donation, a team of distributed catalogers (Spark job with multiple DPUs) handles a thousand-box estate donation. DynamicFrames are like the cataloger's working notes — schema-flexible, allowing entries with missing fields or ambiguous types, with resolveChoice for deciding what to do when the same field has two types across books. DataFrames are the final card catalog format — strict schema, optimized for retrieval. The cataloger drafts in DynamicFrame, polishes the schema, then converts to DataFrame for the final Parquet output.

Analogy 3 — The Postal Service Sorting Facility

Picture a postal sorting facility. Mail trucks arrive from regional offices (S3 sources), sorters route each parcel by destination (PySpark transformations on DPUs), and outbound trucks depart to delivery routes (S3 targets). The facility keeps a delivery log marking which mail batches have been sorted so the next shift does not re-sort yesterday's mail — that log is the bookmark. The address book describing every street and zip code is the Data Catalog.

The supervisor decides how many sorters to staff per shift — that is DPU allocation. The job bookmark only works if every parcel has a unique tracking number that the sorter can compare against the log — that is bookmark key selection, and getting the key wrong (using a non-unique field) means parcels get re-sorted or skipped. The crawlers are the inspectors who walk the receiving dock and add new sender addresses to the address book — and a sloppy inspector who treats every parcel from "5th Avenue building 100 unit A" and "5th Avenue building 100 unit B" as completely different addresses is the crawler partition gotcha that creates duplicate tables when partition prefixes are misread.

DynamicFrames — Schema Flexibility For Messy Data

DynamicFrame is Glue's schema-flexible data abstraction layered on top of Spark.

Why DynamicFrames Exist

Apache Spark DataFrames require a schema upfront — every column has one type, and rows must conform. Real data, especially semi-structured JSON from APIs or streaming logs, regularly violates this. A field might be a string in one record and an integer in another. A column might be missing from older records. Schema-on-read inference fails when Spark sees a String where it expected Integer. DynamicFrames solve this by carrying multiple type choices per column and deferring resolution to explicit operations the engineer controls.

resolveChoice — Picking A Type When Multiple Exist

When a DynamicFrame column carries Choice(string, int), the resolveChoice transform tells Glue what to do: cast everything to a single type, project onto two separate columns, or drop the ambiguous rows. Common patterns include make_cols (split into colname_string and colname_int), cast:string (force everything to string), or match_catalog (use the Glue Data Catalog schema as the target type). The exam expects you to know that DynamicFrames handle dirty data and that resolveChoice is the operation that flattens type ambiguity before downstream consumers query the result.
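A minimal sketch of the three resolveChoice patterns, assuming a hypothetical sales_db catalog whose price column arrives as both string and int — the database, table, and column names are illustrative, not from a real dataset:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",        # hypothetical database
    table_name="raw_orders",    # hypothetical table
)

# Option 1: split the ambiguous column into price_string and price_int
split_dyf = dyf.resolveChoice(specs=[("price", "make_cols")])

# Option 2: force everything to a single type
cast_dyf = dyf.resolveChoice(specs=[("price", "cast:string")])

# Option 3: defer to the Glue Data Catalog schema as the target type
catalog_dyf = dyf.resolveChoice(
    choice="match_catalog", database="sales_db", table_name="orders_clean"
)
```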

When To Convert DynamicFrame To DataFrame

DynamicFrames are convenient for ingestion but slower than DataFrames for SQL-style transformations because they carry extra metadata. The standard pattern is: read source as DynamicFrame, run schema-flexibility operations like resolveChoice and relationalize, then call toDF() to convert to a DataFrame for joins, aggregations, and writes. The reverse path fromDF() re-wraps a DataFrame as a DynamicFrame when you need to write through Glue's catalog-aware sink.
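A short sketch of the standard DynamicFrame-then-DataFrame pattern; the catalog and column names are hypothetical:

```python
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Ingest as a DynamicFrame (schema-flexible), resolve type ambiguity, then
# drop to a DataFrame for heavy SQL-style work.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"
)
df = dyf.resolveChoice(specs=[("order_total", "cast:double")]).toDF()

agg_df = df.groupBy("customer_id").sum("order_total")

# Re-wrap as a DynamicFrame to write through Glue's catalog-aware sink.
agg_dyf = DynamicFrame.fromDF(agg_df, glue_context, "agg_dyf")
```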

DynamicFrames vs DataFrames On The Exam

A scenario asking "raw JSON with inconsistent fields needs ETL" picks DynamicFrames. A scenario asking "fastest aggregation on a known schema" picks DataFrames. A scenario asking "best of both" picks the DynamicFrame-then-toDF pattern. Memorize: DynamicFrames are for the dirty-data problem, DataFrames are for the performance problem, and converting between them is a one-line API call.

A DynamicFrame is Glue's schema-flexible alternative to a Spark DataFrame, designed for semi-structured data with type inconsistencies, missing fields, and nested records. The defining feature is the Choice type that allows a single column to carry multiple type options simultaneously, resolved later via resolveChoice operations like make_cols, cast, project, or match_catalog. DynamicFrames also include native relationalize for flattening nested structures, unbox for parsing string-encoded JSON columns, and splitFields for partitioning data into multiple frames. They are not a replacement for DataFrames but a complement — use DynamicFrames at ingestion to absorb messy data, then convert via toDF() to DataFrames for performance-critical transformations.

PySpark On Glue — The Transformation Vocabulary

PySpark is the dominant language for Glue Spark jobs.

Reading Sources With GlueContext

The GlueContext.create_dynamic_frame.from_catalog method reads a source registered in the Glue Data Catalog, applying any partition pushdown predicate you provide. The from_options variant reads directly from S3, JDBC, or other sources without a catalog entry. Catalog-based reads are the recommended pattern because they pick up schema updates from crawlers automatically.
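A hedged sketch of the two read paths — the database, table, S3 path, and predicate values are illustrative:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Catalog-based read with partition pruning pushed down to the source.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="raw_orders",
    push_down_predicate="year == '2026' AND month == '05'",
)

# Direct read without a catalog entry.
dyf_direct = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/raw/orders/"]},
    format="json",
)
```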

Common Transformations

PySpark on Glue supports the full Spark API plus Glue-specific transforms. Filtering uses Filter.apply, joining uses Join.apply for DynamicFrames or standard DataFrame.join for DataFrames, aggregating uses groupBy().agg(), and column selection uses SelectFields or DropFields. The Map.apply transform applies a Python function row-by-row to a DynamicFrame, useful for complex per-row logic that does not fit a SQL expression. For schema reshaping, ApplyMapping renames and casts columns in one step.
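A small sketch combining Glue-specific transforms with the DataFrame API; all field and catalog names are illustrative:

```python
from awsglue.context import GlueContext
from awsglue.transforms import Filter, ApplyMapping
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"   # hypothetical catalog entry
)

# Keep only completed orders.
completed = Filter.apply(frame=dyf, f=lambda row: row["status"] == "COMPLETED")

# Rename and cast columns in one step with ApplyMapping.
mapped = ApplyMapping.apply(
    frame=completed,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("order_total", "string", "order_total", "double"),
        ("order_date", "string", "order_date", "string"),
    ],
)

# Or drop to the Spark DataFrame API for aggregations.
totals = mapped.toDF().groupBy("order_date").sum("order_total")
```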

Writing Targets

GlueContext.write_dynamic_frame.from_options writes a DynamicFrame to S3 in formats including Parquet (preferred), ORC, Avro, JSON, and CSV with compression options. The from_catalog variant writes through a catalog table definition, which automatically registers new partitions in the catalog if the target is partitioned. For DataFrames, the standard df.write.parquet() works directly.

Partitioned Writes

Writing partitioned output requires specifying partitionKeys in the connection options. Glue then writes one Parquet file (or several, depending on partition data volume) per partition value. Partition columns are stripped from the data files and represented as S3 prefixes — the canonical Hive layout year=2026/month=05/day=02/. Athena and Redshift Spectrum auto-discover these partitions through the Glue catalog.
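A minimal partitioned-write sketch, assuming hypothetical bucket and catalog names:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders_clean"   # hypothetical catalog entry
)

# Write partitioned Parquet; the partition columns become Hive-style S3 prefixes.
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={
        "path": "s3://my-bucket/curated/orders/",
        "partitionKeys": ["year", "month", "day"],
    },
    format="parquet",
)
```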

Job Bookmarks — Incremental Processing Without Reprocessing

Job bookmarks are the core mechanism for processing only new data on each run.

What A Bookmark Is

A Glue job bookmark is a per-job state object that tracks which input data has already been processed. On the first run, the job processes everything. On subsequent runs with bookmarks enabled, the job processes only data that has appeared since the last successful run — based on file modification time for S3 sources or a tracked column for JDBC sources.

Bookmark Modes — Enable, Disable, Pause

Bookmarks have three modes. Enabled updates state at the end of the run and uses it as the lower bound on the next run. Disabled ignores any prior bookmark state and processes all input. Pause uses the existing bookmark as a lower bound but does not update it at the end — useful for retries that should resume from the last good checkpoint without advancing.

Bookmark Key Selection

For S3 sources, Glue uses file modification time as the bookmark key by default — files modified after the bookmark are processed. For JDBC sources, the bookmark key must be a column you specify, typically a monotonically increasing primary key or a timestamp column. Choosing a non-monotonic column (like a string ID that does not sort lexicographically with insertion order) breaks bookmark semantics and causes silent data loss or duplication.
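A hedged sketch of pinning a JDBC bookmark key via additional_options, assuming an illustrative orders_jdbc catalog table with a monotonically increasing order_id:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# transformation_ctx must be set for bookmarks to track this read.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="orders_jdbc",
    transformation_ctx="orders_read",
    additional_options={
        "jobBookmarkKeys": ["order_id"],          # monotonically increasing key
        "jobBookmarkKeysSortOrder": "asc",
    },
)
```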

Bookmark Scope — Per Job, Not Global

A bookmark is per-job, identified by the job name. Two different jobs reading the same S3 source maintain independent bookmarks. Renaming a job resets its bookmark — the new name has no history. Cloning a job does not copy the bookmark. The exam plants this as "we deleted and recreated the job and now it reprocessed everything" — answer is the bookmark was tied to the old job name and the new job has none.

Glue job bookmarks track processed data per-job, not per-source — renaming or recreating a job loses the bookmark history and triggers a full reprocess of the source. The bookmark is identified by the combination of job name plus run context. For S3 sources, files modified after the last successful run are processed; for JDBC sources, a configured bookmark key column drives incremental selection. Bookmarks must be enabled on the job and the script must call job.commit() at the end for state to persist. Failures before commit do not advance the bookmark — the next run reprocesses the same data, which is the correct retry semantics. Disabling bookmarks reprocesses everything; pausing bookmarks reads with the existing lower bound but does not advance.
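The skeleton every bookmark-enabled script follows — job.init ties the run to the job's bookmark state, and job.commit at the end persists it; the argument and job names here are placeholders:

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)   # ties this run to the job's bookmark state

# ... reads with transformation_ctx, transforms, writes ...

job.commit()                       # advances the bookmark only if the run succeeds
```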

Glue Crawlers — Catalog Population And The Partition Gotcha

Crawlers are the metadata-discovery side of Glue.

What A Crawler Does

A Glue crawler scans a data store (S3 prefix, JDBC database, MongoDB), infers the schema, and registers tables in the Glue Data Catalog. It runs on a schedule or on-demand. Once cataloged, the data is queryable by Athena, Redshift Spectrum, EMR, and any consumer that uses the Glue Data Catalog as a metastore.

Crawler Configuration — Classifiers And Exclusion Patterns

Classifiers tell the crawler how to parse files: built-in classifiers handle CSV, JSON, Parquet, ORC, Avro, and XML; custom classifiers handle proprietary formats via grok patterns. Exclusion patterns tell the crawler to skip files matching a pattern — for example, **/_temp/** to skip staging directories or **.bak to skip backup files. Misconfigured exclusion patterns are a common cause of incomplete tables.

The Multi-Table Partition Gotcha

The most cited Glue exam trap: a crawler scanning an S3 prefix with multiple subfolders may create one table per subfolder instead of one partitioned table for the whole prefix. This happens when the schemas of subfolders differ slightly (different column count, different types, different file formats) — the crawler treats them as distinct tables. The fix is the crawler grouping behavior option — set "Create a single schema for each S3 path" to force the crawler to merge schemas across subfolders into one partitioned table. The community-confirmed pain point: data engineers see fifty Athena tables where they expected one, debug for hours, and only later discover this checkbox.

Incremental Crawls

By default, a crawler scans the entire data store on every run. Incremental crawls scan only new partitions added since the last run, dramatically reducing crawler cost on large data lakes. Enable via the "Crawl new sub-folders only" option. This requires that new data lands in new partition prefixes following the existing layout — adding files to existing partitions is invisible to incremental crawls.
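A hedged boto3 sketch showing where the grouping-behavior and incremental-crawl settings live when creating a crawler programmatically — the crawler name, role ARN, database, and paths are illustrative:

```python
import json
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="orders-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{
        "Path": "s3://my-bucket/raw/orders/",
        "Exclusions": ["**/_temp/**", "**.bak"],
    }]},
    # "Create a single schema for each S3 path" (grouping behavior)
    Configuration=json.dumps({
        "Version": 1.0,
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
    }),
    # "Crawl new sub-folders only" (incremental crawls)
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
    # Incremental crawls expect LOG behavior for schema changes
    SchemaChangePolicy={"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
)
```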

A Glue crawler scanning an S3 prefix with subfolders of differing schemas creates one table per subfolder by default, not one partitioned table — the result is dozens of duplicate Athena tables for what should be a single dataset. The fix is to set the crawler's "Create a single schema for each S3 path" option (also called "grouping behavior") which forces schema merging across subfolders. Without it, slight schema variations like an extra column added in newer partitions, a CSV mixed with a Parquet, or different compression codecs cause the crawler to split the data into separate tables. Data engineers debug this for hours before discovering the checkbox. On the DEA-C01 exam, scenarios describing "crawler creates many tables for what should be one partitioned table" are answered by enabling grouping or by adjusting the crawler target to a single layout-consistent prefix.

DPU Allocation And Worker Types

Glue Spark jobs run on configurable worker fleets.

What A DPU Is

A DPU (Data Processing Unit) is Glue's billing unit. One DPU equals 4 vCPUs and 16 GB of memory. You pay per-DPU-hour with per-second billing and a 1-minute minimum. DPU count multiplied by job duration is the bill.

Worker Types — Standard, G.1X, G.2X, G.4X, G.8X

Worker types define the resource shape per worker. Standard (legacy) is 4 vCPU + 16 GB RAM = 1 DPU, 50 GB local disk. G.1X (recommended for most workloads) is 4 vCPU + 16 GB RAM = 1 DPU, 64 GB local disk, with better network throughput than Standard. G.2X doubles to 8 vCPU + 32 GB RAM = 2 DPU per worker, 128 GB local disk — for memory-heavy jobs like large aggregations and joins. G.4X (16 vCPU + 64 GB RAM = 4 DPU) and G.8X (32 vCPU + 128 GB RAM = 8 DPU) scale further for very large datasets or memory-pressured workloads such as extra-large fact-table joins and large window functions.

Number Of Workers vs Auto-Scaling

You set the number of workers manually or enable auto-scaling, which lets Glue add and remove workers based on Spark stage demand. Auto-scaling reduces cost on jobs with uneven parallelism profiles — a final stage with little parallelism runs on a handful of workers instead of the full originally-provisioned fleet. Auto-scaling requires Glue 3.0+ and is enabled per job.

Picking The Right Worker Type

The recommended starting point is G.1X with auto-scaling. Move to G.2X when you see executor OOM errors in CloudWatch Logs or when shuffle stages spill to disk excessively. Move to G.4X+ for very large fact-table joins or memory-pressured aggregations. The exam plants this as "job is OOMing on a 200 GB shuffle" — answer is upgrade to G.2X or G.4X.
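A hedged boto3 sketch of sizing a job at G.2X with auto-scaling enabled; the job name, role, script location, and worker count are placeholders:

```python
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="orders-etl",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/orders_etl.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    WorkerType="G.2X",          # 8 vCPU / 32 GB per worker for memory-heavy work
    NumberOfWorkers=10,
    DefaultArguments={
        "--enable-auto-scaling": "true",             # Glue 3.0+ auto-scaling
        "--job-bookmark-option": "job-bookmark-enable",
    },
)
```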

The Small File Problem And groupFiles

The S3 small-file problem is one of the highest-yield Glue topics.

Why Small Files Hurt Spark

Each file in S3 is opened, read, and closed by Spark as a task. With 100,000 small files, Spark schedules 100,000 tasks, most of which do almost no work but pay full per-task overhead. Job duration is dominated by scheduler overhead, not data processing. Costs balloon, and the job times out or runs hours longer than it should.

The groupFiles Option

Glue's groupFiles connection option tells the reader to group multiple small files into one task. Set groupFiles to inPartition and Glue groups files within each partition. Combined with groupSize (the target number of bytes per group, typically set to around 128 MB), this collapses thousands of tiny tasks into hundreds of right-sized tasks. Use this on any S3 source with files under 100 MB.
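A minimal sketch of a grouped S3 read; the path is illustrative and groupSize is expressed in bytes:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-bucket/raw/clickstream/"],
        "groupFiles": "inPartition",
        "groupSize": "134217728",   # target bytes per group (128 MB)
    },
    format="json",
)
```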

Compaction As The Permanent Fix

groupFiles is a read-time mitigation. The permanent fix is to compact small files into larger ones through a periodic Glue job that reads, repartitions, and writes back. Iceberg-managed tables in Glue 4.0+ have built-in compaction. For non-Iceberg layouts, write a scheduled Glue job that does coalesce(N).write.parquet() to consolidate. Target file size is typically 128 MB to 1 GB depending on downstream query patterns.
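A minimal compaction sketch, assuming hypothetical source and target prefixes and a rough coalesce target of 200 output files (total data size divided by the desired file size):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read the small-file dataset, shrink the output file count, and write back
# as larger Parquet files.
df = spark.read.parquet("s3://my-bucket/raw/clickstream/")
df.coalesce(200).write.mode("overwrite").parquet(
    "s3://my-bucket/compacted/clickstream/"
)
```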

Set groupFiles=inPartition and groupSize=134217728 (128 MB) on any Glue source reading from S3 with thousands of small files — without it, Spark schedules one task per file and per-task overhead dominates job time. The community-confirmed pain point: a Glue job processing 50 GB of data in 200,000 small files takes hours longer than the same data in 200 large files. With groupFiles, Spark batches files into reasonable-sized read tasks before processing, cutting scheduling overhead by orders of magnitude. This is the first thing to check on slow Glue jobs reading from S3 — open CloudWatch Logs, look at the number of input files, and if it is in the thousands per partition, enable groupFiles immediately.

Glue Workflows And Triggers — Multi-Step Orchestration

Glue Workflows orchestrate multi-step ETL inside Glue itself.

Workflow Structure

A Glue Workflow is a directed acyclic graph of triggers and jobs (and crawlers). Triggers fire on schedules, on completion of upstream jobs, or on demand. Each trigger fires one or more downstream jobs or crawlers. The visual editor in the Glue console builds these graphs, or you author via CloudFormation or the SDK.

Trigger Types — Schedule, Conditional, On-Demand

Schedule triggers fire on a cron expression. Conditional triggers fire when a watched job or crawler reaches a target state (succeeded, failed, stopped). On-demand triggers fire only when invoked via the SDK or console. The exam plants "ETL must run after the daily crawler" as a conditional trigger answer.

Workflow Run Properties

Run properties are workflow-level key-value state shared across jobs in one workflow run. Use them to pass small data between steps — for example, the timestamp of the run, a partition value to process, or a status flag from an early validation step. Run properties are not a substitute for proper data passing through S3 — they are sized in kilobytes, not gigabytes.
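A hedged sketch of reading and writing run properties from inside a job that a workflow triggered — the property names are illustrative, and WORKFLOW_NAME and WORKFLOW_RUN_ID are the arguments Glue passes to workflow-triggered jobs:

```python
import sys
import boto3
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["WORKFLOW_NAME", "WORKFLOW_RUN_ID"])
glue = boto3.client("glue")

# Read the shared run properties for this workflow run.
props = glue.get_workflow_run_properties(
    Name=args["WORKFLOW_NAME"], RunId=args["WORKFLOW_RUN_ID"]
)["RunProperties"]
partition_to_process = props.get("partition_value")

# Write a small status flag back for downstream steps to read.
glue.put_workflow_run_properties(
    Name=args["WORKFLOW_NAME"],
    RunId=args["WORKFLOW_RUN_ID"],
    RunProperties={"validation_status": "passed"},
)
```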

Glue Workflows vs Step Functions vs MWAA

Glue Workflows are the right answer when the entire pipeline is Glue jobs and crawlers. Step Functions are right when the pipeline spans Glue, Lambda, EMR, Redshift, and external services. MWAA (Managed Airflow) is right for complex, long-running, Python-heavy DAGs with tight observability requirements. The exam plants this as a service-selection question; default to Glue Workflows for Glue-only pipelines.

Glue Connections And VPC Configuration

Connections describe how Glue reaches data stores.

Connection Types

Glue supports JDBC, Kafka, MongoDB, Network, and Marketplace connectors. JDBC connections store the database URL, credentials (referenced from Secrets Manager), and the JDBC driver class. Network connections describe a VPC subnet and security group for Glue jobs to use when they need to access private resources.

VPC Configuration

To access a resource in a private VPC (an RDS database, a private API), the Glue job runs in your VPC via a Network connection. This requires a NAT gateway or VPC endpoints for S3 and any other AWS services the job uses, plus security groups allowing Glue ENIs to reach the target. Misconfigured VPC connections are the most common cause of Glue jobs hanging at startup with "no available IP addresses" or "cannot reach destination" errors.

Cross-Region Access

Glue jobs in one region cannot directly reach private resources in another region without VPC peering or a path over the public internet. The exam plants this as "Glue job in us-east-1 needs to read from RDS in us-west-2" — the answer involves VPC peering and a Network connection, not a direct cross-region read.

Common Exam Traps For AWS Glue ETL

The DEA-C01 exam plants a consistent set of traps around Glue. Memorize all seven.

Trap 1 — Crawlers Transform Data

A scenario claims "we use a crawler to clean and transform CSV data." Wrong. Crawlers only discover schema and register catalog entries — they never modify data. The transformation happens in a Glue ETL job, not in a crawler.

Trap 2 — Bookmarks Are Global Across Jobs

A candidate assumes one bookmark covers all jobs reading the same source. Wrong. Each job has its own bookmark keyed on the job name. Renaming a job loses its bookmark.

Trap 3 — DynamicFrames Are Always Faster

A candidate picks DynamicFrames for every workload because Glue documentation features them. Wrong. DynamicFrames are slower than DataFrames for known-schema operations. Use DynamicFrames for messy ingestion, convert to DataFrames for heavy SQL transformations.

Trap 4 — G.1X Is Always Enough

A candidate sizes every job at G.1X to save cost. Wrong for memory-heavy workloads — large joins and window aggregations on G.1X cause executor OOM. Size G.2X or G.4X for memory-pressured jobs and accept the higher per-DPU rate.

Trap 5 — Crawlers Always Create One Partitioned Table

A candidate assumes pointing a crawler at s3://bucket/data/ always produces one partitioned table for the dataset. Wrong if subfolders have schema variations — the crawler creates one table per subfolder unless grouping behavior is enabled.

Trap 6 — groupFiles Is The Default

A candidate assumes Glue automatically batches small files. Wrong. groupFiles=inPartition must be set explicitly. Default behavior is one task per file, which dominates job time on small-file datasets.

Trap 7 — Lambda Can Replace Glue

A scenario suggests Lambda for batch ETL. Wrong for any workload over a few minutes — Lambda's 15-minute timeout, 10 GB memory cap, and 6 MB payload limit make it unsuitable for batch ETL. Glue (or Step Functions invoking Glue) is the right answer for batch.

Glue ETL = serverless Spark with DynamicFrames for messy data, DataFrames for performance, bookmarks for incremental processing, crawlers for schema discovery (not transformation), G.1X for routine workloads and G.2X+ for memory-heavy, and groupFiles for small-file optimization. This sentence covers 80 percent of DEA-C01 Glue questions. If the scenario word is "schema flexibility" or "messy JSON," answer DynamicFrames. If the word is "large aggregation" or "fast SQL," answer DataFrames after toDF conversion. If the word is "incremental" or "process only new data," answer bookmarks. If the word is "schema discovery" or "register table," answer crawlers. If the word is "OOM" or "memory pressure," answer G.2X or G.4X. If the word is "thousands of small files," answer groupFiles. If the word is "multi-step pipeline of Glue only," answer Workflows.

Key Numbers And Must-Memorize Glue Facts

Job Types

  • Spark — distributed PySpark or Scala, 2-100+ DPU
  • Python shell — single Python process, max 1 DPU
  • Ray — distributed Python via Ray framework

Worker Types

  • Standard — 4 vCPU, 16 GB RAM, 1 DPU, 50 GB disk (legacy)
  • G.1X — 4 vCPU, 16 GB RAM, 1 DPU, 64 GB disk (recommended default)
  • G.2X — 8 vCPU, 32 GB RAM, 2 DPU, 128 GB disk (memory-heavy)
  • G.4X / G.8X — larger memory-optimized variants

Job Bookmarks

  • Modes: enabled, disabled, pause
  • Per-job state, identified by job name
  • S3 sources use file modification time
  • JDBC sources use a configured key column
  • job.commit() must be called for state to persist

Crawlers

  • Discover schema, register catalog entries (never transform data)
  • Grouping behavior option merges subfolder schemas into one partitioned table
  • Incremental crawl scans new partitions only
  • Exclusion patterns skip matching files

DynamicFrames

  • Schema-flexible alternative to DataFrames
  • Choice type carries multiple type options per column
  • resolveChoice flattens type ambiguity
  • Convert to DataFrame via toDF() for performance

Small File Optimization

  • groupFiles=inPartition batches small files into right-sized read tasks
  • groupSize target bytes per group (typically 128 MB)
  • Permanent fix is periodic compaction jobs

Glue Workflows

  • DAG of triggers and jobs/crawlers
  • Trigger types: schedule, conditional, on-demand
  • Run properties pass small state between steps
  • Use Step Functions for non-Glue-only pipelines

DEA-C01 exam priority — AWS Glue ETL, job bookmarks, DynamicFrames, and PySpark carry significant weight on the DEA-C01 exam. Master the trade-offs, decision boundaries, and the cost/performance triggers each AWS service exposes — the exam will test scenarios that hinge on knowing which service is the wrong answer, not just which is right.

FAQ — AWS Glue ETL Top Questions

Q1 — When should I use Glue ETL vs EMR vs Lambda for data processing?

Use Glue ETL when you need managed Spark for batch ETL with minimal operational overhead — schema discovery via crawlers, catalog integration with Athena and Redshift Spectrum, bookmarks for incremental processing, and per-DPU-second billing. Use EMR when you need custom Spark configurations, non-Spark frameworks (Hive, HBase, Presto, Flink), long-running clusters that amortize startup time, or fine-grained control over the runtime environment. Use Lambda only for short event-driven processing under 15 minutes, payloads under 6 MB, and memory under 10 GB — typical patterns are object-creation triggers from S3 doing lightweight transforms or routing. Lambda is wrong for any batch ETL over a few minutes; Glue is wrong for ad-hoc Spark with custom Hadoop ecosystem dependencies; EMR is wrong for short-lived workloads where cluster startup time dominates.

Q2 — How do I prevent a Glue crawler from creating multiple tables for the same dataset?

Enable the crawler's "Create a single schema for each S3 path" option (also called grouping behavior) which forces schema merging across subfolders into one partitioned table. The default behavior treats each subfolder with a slightly different schema as a separate table — extra columns, different file formats, or different compression codecs all trigger the split. Other prevention strategies include making the dataset's schema strictly consistent across all partitions (no extra columns added in newer partitions), using a single file format and compression codec across the prefix, and pointing the crawler at the partitioned root rather than at multiple subfolders. The exam plants "crawler creates 50 tables instead of 1" with grouping behavior as the canonical fix.

Q3 — Should I use DynamicFrames or DataFrames in my Glue job?

Use DynamicFrames for ingestion of messy, semi-structured, or schema-inconsistent data — JSON from APIs with optional fields, CSV with mixed types, or any source where types vary across rows. The Choice type and resolveChoice operations are the value-add. Use DataFrames for performance-critical operations on known-schema data — large joins, aggregations, window functions. The standard pattern is read source as DynamicFrame, run schema-flexibility operations like resolveChoice and relationalize, then toDF() to convert and run the heavy transformations as DataFrames. Convert back via fromDF() if you need to write through Glue's catalog-aware sink. Do not pick DynamicFrames for everything just because they are Glue-native — they are slower than DataFrames for known-schema work.

Q4 — How do bookmarks behave when a job fails or is rerun?

Bookmarks advance only when the job script calls job.commit() at the end of a successful run. A failure before commit leaves the bookmark unchanged — the next run reprocesses the same data, which is the correct retry semantics. A failure after commit but before the orchestrator marks the run successful still advances the bookmark — the next run does not reprocess. Disabling bookmarks reprocesses everything regardless of state. Pausing bookmarks reads with the existing lower bound but does not advance — useful for testing reruns without losing the production bookmark. Renaming a job loses the bookmark; the new name has no history. Cloning does not copy bookmark state. The exam plants "after recreating the job, the next run reprocessed all data" with the answer being the bookmark was tied to the old job name.

Q5 — How do I right-size DPU and worker type for a Glue job?

Start with G.1X (4 vCPU, 16 GB RAM per worker) and 10 workers as a default for moderate ETL. Run the job, watch CloudWatch metrics for executor OOM errors, shuffle spill, and DPU utilization. If executors OOM or shuffle spills heavily to disk, upgrade to G.2X (8 vCPU, 32 GB RAM) and reduce worker count to keep total DPU similar — fewer, fatter workers handle memory-heavy work better than many small ones. For very large fact-table joins or window aggregations, go to G.4X or G.8X. Enable auto-scaling (Glue 3.0+) to let Glue add and remove workers based on stage demand — this saves cost on jobs with uneven parallelism profiles. Monitor glue.driver.aggregate.shuffleBytesWritten and glue.driver.aggregate.shuffleLocalBytesRead in CloudWatch to detect shuffle pressure that calls for fatter workers.

Q6 — How do I solve the small-file problem on Glue inputs?

Two layers of fix. Read-time mitigation uses Glue's groupFiles=inPartition connection option with groupSize set to a target like 128 MB — this batches small files into reasonable-sized read tasks at the Spark scheduler level, avoiding per-file task overhead. Apply this to any S3 source with files under 100 MB or with thousands of files per partition. Permanent fix is periodic compaction: a scheduled Glue job that reads the small-file dataset, repartitions to a sensible file count, and writes back as larger Parquet files (target 128 MB to 1 GB each). Iceberg-managed tables in Glue 4.0+ have built-in compaction that handles this automatically. Without compaction, the small-file count grows monotonically and queries get slower over time — a critical operational hygiene task on long-running data lakes.

Q7 — What is the difference between Glue Workflows, Step Functions, and MWAA for orchestration?

Glue Workflows are the right answer when the entire pipeline is Glue jobs and crawlers — visual DAG editor in the Glue console, native trigger types (schedule, conditional, on-demand), workflow-level run properties for shared state, and tight integration with Glue job and crawler lifecycle. Step Functions are right when the pipeline spans Glue plus Lambda, EMR, Redshift, SageMaker, and external services — full state machine model with Map for fan-out, Choice for branching, error handling with Retry and Catch, and per-state-transition billing. MWAA (Managed Airflow) is right for complex, long-running, Python-heavy DAGs with tight observability requirements, when teams already know Airflow, or when the DAG portability across cloud or on-premises matters. The cost trade-off: Glue Workflows and Step Functions are usage-priced (per execution), MWAA has a continuous environment cost regardless of DAG count. Default to Glue Workflows for Glue-only pipelines; Step Functions for cross-service AWS pipelines; MWAA for Airflow-team pipelines.

Further Reading — Official AWS Documentation For AWS Glue

The authoritative AWS sources are: the AWS Glue Developer Guide overview page, the DynamicFrames documentation covering the Choice type and resolveChoice, the job bookmarks documentation covering modes and commit semantics, the crawlers documentation covering grouping behavior and exclusion patterns, the workers and DPU sizing documentation, the groupFiles and small-file optimization documentation, and the Glue Workflows orchestration documentation.

The AWS Glue Best Practices whitepaper is the single most exam-aligned resource for DEA-C01 — it covers DynamicFrames vs DataFrames trade-offs, bookmark semantics, crawler partition handling, and DPU sizing in one document. The AWS Big Data Blog has multiple deep-dive posts on specific Glue topics including small-file handling, Iceberg integration, and cross-account catalog sharing. The Glue API reference covers every transform method signature and connection option in detail. Finally, the AWS Glue GitHub samples repository contains end-to-end PySpark scripts for common ETL patterns including incremental loads, slowly-changing dimensions, and CDC processing — read these as worked examples for the patterns DEA-C01 tests.
