AWS Glue is the workhorse ETL service that sits between raw data in S3 and ML-ready datasets in Feature Store, training jobs, and inference pipelines. On the MLA-C01 (Machine Learning Engineer Associate) exam, Glue questions are heavily weighted in Domain 1 because the cert positions itself as an engineering exam — you are not the data scientist deciding which model architecture to use, you are the engineer deciding whether the nightly transformation runs on Glue Spark, Glue Python shell, Glue Ray, EMR, or SageMaker Processing. Get the service-selection trade-offs right and you bank a chunk of Domain 1 points before touching a single line of PySpark.
This guide covers the four flavors of Glue (Spark, Python shell, Ray, streaming) and exactly when each is the right answer, the DataBrew low-code visual transformation tool and the trap of confusing it with SageMaker Data Wrangler, the Glue Data Quality rule types you will be asked to identify by name, the Glue Crawler vs Glue ETL distinction the exam tests with surgical precision, and the integration story between Glue output and SageMaker training jobs. By the end you should be able to read an MLA-C01 stem like "preprocess 500 GB of nested JSON nightly with feature derivation and emit Parquet to S3 with provable lineage" and immediately answer Glue Spark with bookmarks, knowing why Python shell and Ray are wrong for that volume.
What AWS Glue Is and Why ML Pipelines Need It
AWS Glue is a serverless data integration service combining a managed Spark/Python execution engine, a metadata catalog (Glue Data Catalog, shared with Athena and Lake Formation), a schema discovery tool (Crawlers), a low-code transformation UI (DataBrew), data quality validation (Glue Data Quality), and orchestration primitives (Workflows, Triggers). For ML pipelines, Glue is the right choice when you need to transform structured or semi-structured data at scale (GBs to TBs nightly), with managed infrastructure, pay-per-use pricing, and tight integration with the rest of the AWS data lake.
Where Glue Fits In The ML Pipeline
Raw data lands in S3 from operational sources (CDC streams, batch dumps, IoT). Glue ETL jobs read this raw data, apply transformations (joins, aggregations, derived features, type conversions), validate quality (Glue Data Quality rules), and write ML-ready Parquet to a curated S3 location. SageMaker training jobs then consume the curated S3 path, or Feature Store ingestion jobs push the Glue output into the Feature Store. Glue Workflows orchestrate the multi-step DAG; EventBridge triggers retraining when new partitions land.
Why MLA-C01 Tests Glue Heavily
The MLA-C01 community pain-point research repeatedly flags Glue Crawler vs ETL confusion, DataBrew vs Data Wrangler conflation, and Glue job-type selection (Spark vs Python shell vs Ray) as common exam mistakes. The Pluralsight blog explicitly calls out "ML lifecycle from ingestion to retraining" as a tested end-to-end pattern, and Glue is in that path. K21 Academy lists "Amazon EMR with Spark for data transformation" as a study area, where the trick is that EMR is the wrong answer in most MLA-C01 stems because Glue covers it more cheaply and with less management.
Plain-Language Explanation: AWS Glue ETL
Glue is one of those AWS services that looks like ten products glued together (the irony is not lost). Three concrete analogies make the structure stick.
Analogy 1 — The Restaurant Prep Kitchen
Picture a high-end restaurant where the dining-room chefs (your SageMaker training jobs) cannot work with raw ingredients straight from the loading dock — meat needs trimming, vegetables need washing and dicing, sauces need reducing, allergen labels need checking. The prep kitchen behind the line is AWS Glue — it takes raw deliveries (S3 raw data), applies standardized prep procedures (PySpark transformations), and stages everything in labelled containers ready for the dinner-service chefs to grab. The prep cook running the food processor for high-volume dicing is a Glue Spark job — handles the heavy lifting in parallel, scales to whatever volume comes in. The junior cook running a single-bench task like making vinaigrette is a Glue Python shell job — small dataset, single-threaded, much cheaper than firing up the food processor. The specialist cook handling exotic Python libraries that don't fit the food processor model is a Glue Ray job — distributed Python without the Spark constraint. The visual-recipe iPad on the wall showing a no-code drag-and-drop transformation flow is Glue DataBrew — for the line cooks who want to chop and clean without writing code. The food-safety inspector sampling every batch for contamination is Glue Data Quality — rules like "no expired ingredients, no missing labels, completeness above 95 percent." The inventory clipboard listing every prep batch's name, quantity, and shelf location is the Glue Data Catalog. The head prep chef's whiteboard sequencing today's prep tasks ("first stock, then sauces, then garnishes, then plating") is Glue Workflows.
Analogy 2 — The Manufacturing Assembly Line With Quality Stations
Picture a car factory taking raw materials at one end and shipping finished cars at the other. Glue is the subassembly building between the foundry (raw S3 data) and the final assembly line (SageMaker training). The CNC machining stations transforming raw metal into precise parts are Glue Spark jobs — heavy parallelized work on large volumes. The manual workbench where one technician does small precision tasks is a Glue Python shell job. The specialty tooling area where a Ray-based parallel framework runs distributed Python ML libraries is a Glue Ray job. The drag-and-drop CAD tool that lets a junior technician design a transformation visually without programming is Glue DataBrew. The quality-control inspector with a checklist of go/no-go criteria (completeness, uniqueness, freshness, range checks) is Glue Data Quality — every batch that fails the checklist is rejected and the production manager (CloudWatch alarm) is notified. The part-number clipboard the QC inspector consults to know what every part should look like is the Glue Data Catalog. The production scheduler chaining stations into a daily build sequence is Glue Workflows.
Analogy 3 — The Newspaper Editorial Pipeline
Picture a newspaper taking raw wire-service feeds (raw S3 data) at one end and producing the printed edition (ML-ready dataset) at the other. The copy desk that takes raw wire stories and edits them for length, style, and accuracy is Glue ETL — the central transformation function. The deputy editor running a heavy parallel desk for breaking-news bulk content is a Glue Spark job. The lone editor at a desk handling a single small story is a Glue Python shell job. The specialty data-journalism team running parallel Python scripts for graphics generation is a Glue Ray job. The visual page-layout tool that an art director uses to compose a story without writing markup is Glue DataBrew. The fact-checking team with a checklist (every name spelled, every number sourced, every quote attributed) is Glue Data Quality. The morgue and reference index where every published story is cataloged for future search is the Glue Data Catalog. The production manager's wallchart sequencing today's editorial cycle is Glue Workflows. When the deputy editor walks the wallchart in the morning, they know exactly which desks must finish before press time — that is the dependency graph the exam expects you to build with Glue Workflows.
Glue ETL Job Types — Spark, Python Shell, Ray
Glue offers three execution engines for ETL jobs. The MLA-C01 exam tests which one matches which workload.
Glue Spark Jobs (The Default)
Glue Spark jobs run Apache Spark on managed clusters. Two engine choices: standard Spark (Scala or Python via PySpark) and Glue ETL DynamicFrames (an AWS extension over Spark DataFrames with schema flexibility). Worker types: G.1X (4 vCPU, 16 GB), G.2X (8 vCPU, 32 GB), G.4X, G.8X, G.025X (lightweight). Auto Scaling on Glue 3.0+. Use for GB-to-TB-scale transformations, joins across large tables, partitioned writes, distributed feature engineering.
Glue Python Shell Jobs
Glue Python shell jobs run a single Python script on a single worker (1 DPU = 4 vCPU + 16 GB, or 0.0625 DPU lightweight = 1 vCPU + 1 GB). No Spark, no distributed processing. Use for small data (typically under 10 GB), scripting tasks that pull data from APIs, lightweight cleanup that doesn't justify Spark overhead, calling other AWS services as part of a workflow. Significantly cheaper than Spark for small workloads.
Glue Ray Jobs
Glue Ray jobs run on Ray, an open-source distributed Python framework. Use for distributed Python ML libraries that don't fit Spark (Hugging Face transformers, custom PyTorch preprocessing, parallel inference at scale). Newer offering, narrower use case than Spark. Worker types Z.2X (8 vCPU, 64 GB) and similar.
Choosing The Right Job Type
- Data size > 10 GB or needs joins/aggregations across large tables → Spark.
- Data size < 10 GB, single-script logic, API integration, lightweight → Python shell.
- Distributed Python ML libraries (Hugging Face, etc.) → Ray.
- Real-time streaming ETL → Spark Streaming on Glue (a Spark variant).
Glue Python shell jobs are NOT for big data — they run on a single worker. The MLA-C01 trap is choosing Python shell for a 100 GB dataset because it sounds simpler. Python shell maxes at 1 DPU (16 GB memory), which fits roughly 5-10 GB of in-memory data. For anything bigger, Spark is mandatory. The signal "data fits in memory" or "small lookup table" or "API-driven script" is Python shell. The signal "TB-scale" or "joining two large tables" or "distributed feature derivation" is Spark. Picking wrong will get the job killed by OOM and earn a wrong answer on the exam.
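The selection heuristic above can be encoded as a tiny decision function. This is a sketch only: the 10 GB threshold and the signal names come from the text, not from any AWS API.

```python
def choose_glue_job_type(data_size_gb: float,
                         needs_joins: bool = False,
                         distributed_python_ml: bool = False,
                         streaming: bool = False) -> str:
    """Map workload signals to a Glue job type, following the exam heuristic.

    Thresholds and signal names are illustrative assumptions, not AWS values.
    """
    if streaming:
        return "spark-streaming"   # real-time ETL -> Spark Streaming on Glue
    if distributed_python_ml:
        return "ray"               # Hugging Face / parallel Python -> Ray
    if data_size_gb > 10 or needs_joins:
        return "spark"             # big data or large-table joins -> Spark
    return "python-shell"          # small, single-script, API-driven work
```

The 100 GB trap from the paragraph above falls out directly: `choose_glue_job_type(100)` returns `"spark"`, never `"python-shell"`.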
Glue DataBrew — Low-Code Visual Transformations
DataBrew is the no-code/low-code transformation tool. Engineers are tempted to dismiss it; the exam tests it because business analysts and ML engineers use it for rapid prototyping.
What DataBrew Does
DataBrew provides a spreadsheet-like UI to build transformation recipes — sequences of named functions (categorize, filter, join, encode) applied to a data sample. A recipe is then applied to the full dataset via a DataBrew job, producing transformed output in S3.
250+ Built-In Transformations
DataBrew ships with 250+ pre-built transforms covering: type casting, string manipulation, date parsing, statistical aggregation, one-hot encoding, normalization, outlier handling, missing-value imputation, group-by aggregation, joins, pivots. No code required.
When DataBrew Fits
- Rapid prototyping — a data analyst can compose a 20-step transformation in 30 minutes.
- Repeatable batch transformations — recipes run on a schedule against new data.
- Data quality checking — DataBrew has its own data profile that flags missing values, outliers, distribution shifts.
- Self-service for non-engineers — business analysts produce ML-ready datasets without involving the data engineering team.
DataBrew vs Data Wrangler — The Exam Trap
This is the most heavily-tested confusion point in MLA-C01 Domain 1. Both have visual interfaces and feel similar — but they live in different services and integrate at different points.
- Glue DataBrew — part of AWS Glue, integrates with Glue Data Catalog and Lake Formation, runs as Glue jobs, output to S3, used by data engineers and business analysts for general data prep before ML.
- SageMaker Data Wrangler — part of SageMaker, integrates with SageMaker Studio, Feature Store, and Pipelines, runs in SageMaker Processing, used by data scientists for ML-specific feature engineering inside the SageMaker workflow.
Glue DataBrew and SageMaker Data Wrangler look almost identical visually but solve different problems at different stages. DataBrew is general-purpose data prep with a Glue/Lake Formation-centric integration story — output goes to S3, the Glue Catalog gets updated, downstream consumers can be Athena, Redshift, OR SageMaker. Data Wrangler is ML-specific feature engineering with a SageMaker-centric integration story — output goes directly to Feature Store or a SageMaker Pipeline step. The MLA-C01 stem signal "ML engineer working inside SageMaker Studio" or "export to Feature Store" or "part of a SageMaker Pipeline" maps to Data Wrangler. The signal "data engineer producing curated S3 datasets" or "Glue Catalog table for Athena" or "shared across non-ML consumers" maps to DataBrew.
Glue Data Quality — Built-In Rule Types
Glue Data Quality validates datasets against declared rules and fails the pipeline on violations. MLA-C01 expects you to recognize the rule taxonomy by name.
The Six Rule Categories
- Completeness — column has no null values, or null rate is below a threshold (e.g. Completeness "user_id" > 0.99).
- Uniqueness — column values are unique, or uniqueness rate is above a threshold (e.g. Uniqueness "transaction_id" > 0.999).
- Freshness — newest record is within a time window (e.g. Freshness "event_time" < 1 day).
- Conformance — values match a regex or one of a set (e.g. ColumnValues "country" in ["US", "UK", "JP"]).
- Accuracy — column values match expected statistics (e.g. Mean "amount" between 100 and 200).
- Drift — distribution comparison against a baseline (advanced, requires a baseline ruleset).
Data Quality Definition Language (DQDL)
Rules are written in DQDL — a declarative language. Example:
Rules = [
Completeness "user_id" > 0.99,
Uniqueness "transaction_id" = 1.0,
ColumnValues "country" in ["US", "UK", "JP", "AU"],
IsComplete "amount",
Mean "amount" between 50.0 and 500.0
]
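Glue evaluates DQDL natively, but the metrics behind two of these rules are easy to illustrate in plain Python. The sample rows and the exact uniqueness definition below are assumptions of this sketch, not the DQDL specification.

```python
from collections import Counter

# Made-up sample rows: one null user_id, one duplicated transaction_id.
rows = [
    {"user_id": "u1", "transaction_id": "t1"},
    {"user_id": "u2", "transaction_id": "t2"},
    {"user_id": None, "transaction_id": "t3"},
    {"user_id": "u4", "transaction_id": "t2"},
]

def completeness(rows, col):
    """Fraction of rows where col is non-null (cf. DQDL Completeness)."""
    return sum(r[col] is not None for r in rows) / len(rows)

def uniqueness(rows, col):
    """Fraction of rows whose value appears exactly once (cf. DQDL Uniqueness).

    One common definition; the exact DQDL formula is not reproduced here.
    """
    counts = Counter(r[col] for r in rows)
    return sum(1 for v in counts.values() if v == 1) / len(rows)
```

Against these rows, `completeness(rows, "user_id")` is 0.75 (fails `Completeness "user_id" > 0.99`) and `uniqueness(rows, "transaction_id")` is 0.5 (fails `Uniqueness "transaction_id" = 1.0`), so both rules would block the pipeline.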
Integration Points
Data Quality rules can run as a step in a Glue ETL job, as a node in a Glue Workflow, or against a table in the Data Catalog. Failed rules emit CloudWatch metrics and can fire EventBridge events to trigger remediation workflows or block downstream training.
Why MLA-C01 Cares
Garbage data produces garbage models. The exam expects you to know that Glue Data Quality is the AWS-native rule-based validation, distinct from SageMaker Data Wrangler's profile reports (which are visual and one-shot, not ongoing rules) and distinct from SageMaker Model Monitor (which monitors deployed model inputs, not training-data prep).
Glue Crawlers and Data Catalog — Schema Discovery
Crawlers are the most-confused Glue feature. MLA-C01 will test the crawler-vs-ETL distinction directly.
What Crawlers Do
A Glue Crawler scans an S3 location (or a JDBC source), infers the schema from file format and contents, and creates or updates a Glue Data Catalog table. The crawler does not transform data, does not copy data, does not validate data quality. It only writes metadata.
What Crawlers Do NOT Do
The single most common MLA-C01 trap: assuming a Crawler transforms or moves data. Wrong. A Crawler reads file headers and sample rows, builds a schema definition, registers it in the Data Catalog, and exits. Files in S3 remain untouched.
Crawler Schedules and Triggers
Crawlers run on a schedule (cron) or on demand. Best practice for ML pipelines: schedule Crawlers nightly to detect new partitions and schema changes; alternatively use Glue ETL enableUpdateCatalog to update partitions directly without rerunning the Crawler.
Schema Evolution Handling
When a Crawler detects a schema change (new column, type change), it updates the Data Catalog table. Existing queries against the old schema may break or return null for missing columns. Production ML pipelines should pin schema explicitly via Lake Formation or via DDL-managed tables rather than letting Crawlers update production schemas automatically.
A Glue Crawler discovers schema and updates the Data Catalog. It does NOT transform, copy, or validate data. This is the single most-tested distinction in MLA-C01 Domain 1. The wrong answer pattern: a stem describes "I need to convert CSV to Parquet for SageMaker" and includes "run a Glue Crawler on the source bucket" as a candidate answer. Wrong — a Crawler does not convert formats. The right answer is "run a Glue ETL job that reads CSV and writes Parquet, then run a Crawler (or use enableUpdateCatalog) to register the Parquet output in the Data Catalog." Crawler = metadata. ETL job = data movement and transformation.
PySpark on Glue — Feature Derivation at Scale
Most production Glue ETL for ML is written in PySpark via Glue's notebook or local development workflow. The exam doesn't test PySpark syntax, but it does test conceptual feature-engineering patterns.
Common Feature Derivation Patterns
- Aggregation features — groupBy("user_id").agg(F.count("*"), F.sum("amount")) produces user-level features from transaction-level rows.
- Window features — F.row_number().over(Window.partitionBy("user_id").orderBy("timestamp")) creates sequence numbers; rolling windows compute "transactions in last 5 minutes."
- Time-derived features — F.hour("event_time"), F.dayofweek("event_time") produce categorical time features.
- Joins to lookup tables — broadcast small tables, sort-merge for large tables.
- Encoding — Spark MLlib's StringIndexer, OneHotEncoder for categorical variables.
DynamicFrame vs DataFrame
Glue's DynamicFrame is an AWS extension that handles schema flexibility better than Spark DataFrames — useful when source data has inconsistent types per row. Convert to a DataFrame for Spark MLlib operations, convert back to DynamicFrame to leverage Glue's Catalog integration.
Pushdown Predicates
For partitioned tables, push partition filters into the read so Glue only loads relevant partitions. glueContext.create_dynamic_frame.from_catalog(database="db", table_name="t", push_down_predicate="year=2026 AND month=05") reads only May 2026 partitions, not the full table. Critical for large-table cost control.
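For nightly jobs, the predicate string is usually generated per run. A small helper might look like this, assuming year=/month=/day= partition keys as in the example above:

```python
from datetime import date, timedelta

def daily_predicate(d: date) -> str:
    """Build a pushdown predicate for one day of a date-partitioned table.

    Assumes year=/month=/day= partition keys with zero-padded month and day,
    matching the partition layout used in the surrounding examples.
    """
    return f"year={d.year} AND month={d.month:02d} AND day={d.day:02d}"

# A nightly job would typically process yesterday's partition:
yesterday = date.today() - timedelta(days=1)
pred = daily_predicate(yesterday)
```

`daily_predicate(date(2026, 5, 2))` yields `"year=2026 AND month=05 AND day=02"`, which plugs straight into the push_down_predicate argument shown above.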
Glue Job Bookmarks — Incremental Processing
Job bookmarks track which data has already been processed, so re-runs only handle new data. This is the standard pattern for incremental retraining.
How Bookmarks Work
When enabled, a Glue ETL job persists state about which input data was read in each run (S3 keys for file sources, primary keys for JDBC sources). On the next run, the job skips already-processed data. State is stored per job per source.
Bookmark Modes
- Enable — default; tracks state and processes only new data.
- Disable — every run reprocesses everything (use for one-shot transformations or backfills).
- Pause — suspends bookmark advancement (useful for debugging without losing state).
Bookmark Limitations
Bookmarks work cleanly for append-only sources (new files in S3, new rows in JDBC tables). They do not track updates or deletes — for CDC-style sources, use AWS DMS into S3 followed by a Glue ETL job that handles upserts via merge logic.
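The merge logic such a job applies can be sketched in plain Python. The change-record shape here (op, pk, ts fields) is an assumption of this sketch, not a DMS output contract.

```python
def apply_cdc(base: dict, changes: list) -> dict:
    """Apply insert/update/delete change records to a pk-keyed snapshot.

    base:    current snapshot, keyed by primary key.
    changes: change records with "op" (I/U/D), "pk", "ts", and "row".
    """
    merged = dict(base)
    for rec in sorted(changes, key=lambda r: r["ts"]):  # replay in time order
        if rec["op"] == "D":
            merged.pop(rec["pk"], None)                 # delete if present
        else:
            merged[rec["pk"]] = rec["row"]              # insert or update
    return merged

base = {"t1": {"amount": 10}, "t2": {"amount": 20}}
changes = [
    {"op": "U", "pk": "t1", "ts": 1, "row": {"amount": 15}},
    {"op": "D", "pk": "t2", "ts": 2},
    {"op": "I", "pk": "t3", "ts": 3, "row": {"amount": 30}},
]
```

Replaying these changes updates t1, deletes t2, and inserts t3, which is exactly the upsert behavior bookmarks alone cannot provide.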
When To Use Bookmarks For ML
Nightly ML retraining typically reads only the previous day's data, joins to slowly-changing dimension tables, and produces a new training partition. Bookmarks make this idempotent and cheap.
Integration With SageMaker
The downstream of Glue ETL is almost always SageMaker. The integration patterns are not glamorous but they are exam-relevant.
S3 Path Hand-Off
The simplest pattern: Glue ETL writes Parquet to a curated S3 prefix; SageMaker training reads from that prefix. The integration point is the S3 path string, passed as a SageMaker training job input.
Glue Data Catalog as SageMaker Table
SageMaker Data Wrangler can read from the Glue Catalog directly, treating tables as sources. This avoids hard-coding S3 paths and inherits Lake Formation permissions.
Glue ETL As A SageMaker Pipeline Step
SageMaker Pipelines does not run Glue ETL natively — a ProcessingStep launches SageMaker Processing jobs, not Glue — so the common pattern is a LambdaStep or CallbackStep that starts the Glue job and reports completion back to the Pipeline. Best practice for production ML pipelines: orchestrate Glue ETL → Feature Store ingestion → SageMaker training → Model Registry through a single Pipeline.
Feature Store Ingestion From Glue
Glue ETL output can feed Feature Store via a SageMaker Processing job that calls PutRecord for online or appends to the offline S3 location. Direct Glue-to-Feature-Store ingestion is via PySpark code calling the Feature Store APIs.
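The PutRecord payload shape is easy to illustrate. The helper and feature group name below are hypothetical, but the Record structure (FeatureName plus ValueAsString pairs, every value serialized as a string) matches the Feature Store runtime API.

```python
def to_feature_store_record(row: dict) -> list:
    """Convert a row dict into the Record list that put_record expects.

    Feature Store requires every value serialized as a string (ValueAsString).
    """
    return [{"FeatureName": k, "ValueAsString": str(v)} for k, v in row.items()]

# The actual ingestion call (not executed here) would be roughly:
#   boto3.client("sagemaker-featurestore-runtime").put_record(
#       FeatureGroupName="user-features",            # hypothetical group name
#       Record=to_feature_store_record(row))
row = {"user_id": "u1", "txn_count": 7, "event_time": "2026-05-02T00:00:00Z"}
record = to_feature_store_record(row)
```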
Amazon EMR vs Glue — When To Use Each
EMR is the more powerful, more flexible Hadoop/Spark platform; Glue is the serverless, simpler ETL service. The MLA-C01 exam tests when EMR is the right answer.
EMR Strengths
- Custom Spark/Hadoop versions — EMR supports specific open-source versions Glue doesn't.
- Custom bootstrap actions — install arbitrary libraries at cluster bootstrap.
- Long-running clusters — EMR clusters can stay up for days, Glue jobs are per-run.
- Mixed workloads — Hive, Presto, HBase, Flink alongside Spark on the same cluster.
- Cost control at very large scale — EMR with Spot and Reserved Instances can be cheaper than Glue at 24/7 high-utilization scale.
Glue Strengths
- Serverless — no cluster management, scales to zero between runs.
- Faster startup — minutes vs EMR's longer provisioning.
- Built-in catalog integration — Glue Data Catalog is the default; EMR can use it but doesn't ship integrated.
- Per-job pricing — pay only for execution time.
Decision Heuristic
For most ML pipelines with intermittent ETL jobs (nightly, hourly), Glue is the right answer. For 24/7 streaming workloads, very large persistent Spark clusters, or workloads needing custom Spark versions, EMR is the right answer. The MLA-C01 stem signal "managed serverless Spark for nightly ETL" is Glue. The signal "long-running cluster with custom packages" is EMR.
For most MLA-C01 ETL stems, Glue is the right answer over EMR — Glue is the AWS-preferred default for serverless ETL. Choose EMR only when the stem includes specific signals: "custom Spark version not supported by Glue," "long-running 24/7 cluster," "mixed Hadoop ecosystem (Hive + Presto + HBase)," or "very high sustained utilization where Spot Instances make EMR cheaper at scale." Without those signals, defaulting to EMR over Glue costs you simplicity, integration, and exam points.
Glue Workflows and Triggers — Multi-Step Orchestration
Glue Workflows orchestrate multiple Glue jobs and crawlers as a DAG.
Workflow Components
- Triggers — start jobs based on schedule, on-demand, or on completion of another job/crawler.
- Jobs — Glue ETL jobs as workflow nodes.
- Crawlers — Glue Crawlers as workflow nodes.
- Conditional triggers — fire when upstream nodes succeed (or fail).
Workflow vs SageMaker Pipelines vs Step Functions
Three orchestration options exist. Glue Workflows are scoped to Glue resources only — pure ETL DAGs. SageMaker Pipelines orchestrate ML-specific steps including training, tuning, registry. Step Functions are general-purpose, can orchestrate Glue, SageMaker, Lambda, and any AWS service. For pure-ETL DAGs, Glue Workflows are simplest. For ML-end-to-end (ETL → train → deploy), SageMaker Pipelines is preferred.
EventBridge As The Cross-Service Trigger
EventBridge schedules trigger Glue Workflows; EventBridge events from Glue (job state change) trigger downstream actions like SageMaker Pipeline executions. The standard pattern: nightly EventBridge schedule → Glue Workflow (ETL + Crawler) → on success, EventBridge event → SageMaker Pipeline.
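The "on success" hook is an EventBridge rule matching Glue's job state-change event. A sketch of that event pattern, with a hypothetical job name (`aws.glue` and `Glue Job State Change` are the documented source and detail-type):

```python
import json

# EventBridge event pattern: match only SUCCEEDED runs of one Glue job.
glue_success_pattern = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {
        "jobName": ["nightly-feature-etl"],   # hypothetical job name
        "state": ["SUCCEEDED"],
    },
}

# This JSON is what you would paste into the rule's event pattern.
pattern_json = json.dumps(glue_success_pattern)
```

The rule's target would then be the downstream action, e.g. a Lambda that starts the SageMaker Pipeline execution.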
Cost Model And Optimization For ML Workloads
Glue cost is driven by DPU-hours. Knowing the levers prevents budget surprises.
DPU Pricing
- Glue 4.0+ Spark Standard worker (G.1X): $0.44 per DPU-hour.
- G.2X, G.4X, G.8X scale linearly per DPU.
- Python shell (1 DPU): $0.44/hour or 0.0625 DPU lightweight at $0.0275/hour.
- Glue Ray: priced per Ray worker capacity.
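Worked arithmetic using the Spark rate quoted above; the worker count and runtime are made-up examples, and G.1X = 1 DPU per worker (G.2X = 2, and so on).

```python
RATE_PER_DPU_HOUR = 0.44   # Glue 4.0 Spark rate quoted above (us-east-1)

def glue_spark_cost(workers: int, dpu_per_worker: int, hours: float) -> float:
    """Estimate a Glue Spark job's cost in dollars from DPU-hours consumed."""
    return workers * dpu_per_worker * hours * RATE_PER_DPU_HOUR

# Example: 10 G.1X workers (1 DPU each) running for 2 hours.
cost = glue_spark_cost(workers=10, dpu_per_worker=1, hours=2.0)  # -> 8.8
```

So a nightly 2-hour run on 10 G.1X workers costs about $8.80 per run, or roughly $264/month, which is the baseline the optimization levers below attack.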
Auto Scaling
Glue 3.0+ supports Auto Scaling — workers scale up to handle load, then scale down. Reduces over-provisioning waste when workload is bursty.
Cost Optimization Levers
- Push down predicates to read only relevant partitions.
- Use bookmarks to process only new data.
- Right-size workers — start with G.1X, scale to G.2X if you hit memory pressure.
- Use Python shell for small jobs — at 0.0625 DPU lightweight, a tiny job costs cents.
- Use the Glue Spark UI to identify slow stages and skewed joins.
- Avoid count() and show() in production code — these trigger full passes.
Glue job bookmarks plus pushdown predicates are the two cost-optimization levers that pay for themselves on every retraining run. Bookmarks ensure the job only processes new partitions; pushdown predicates ensure only relevant partitions are scanned. Without them, a nightly retraining job rereads the full historical dataset every night — a 1 TB historical dataset processed nightly burns 30 TB/month of read I/O instead of 30 GB/month. The MLA-C01 stem signal "incremental nightly ML retraining" or "minimize cost of reprocessing" maps to bookmarks + predicate pushdown.
Common Exam Traps For Glue ETL
Trap 1 — Glue Crawler Transforms Data
Wrong. Crawler discovers schema and writes metadata. ETL jobs transform data. The two are separate.
Trap 2 — Python Shell Job For Big Data
Wrong. Single worker, max ~10 GB in-memory. Use Spark for big data.
Trap 3 — DataBrew And Data Wrangler Are Interchangeable
Wrong. Different services, different integration points. DataBrew = Glue/Lake Formation-centric general data prep. Data Wrangler = SageMaker-centric ML feature engineering.
Trap 4 — EMR Is Always Cheaper Than Glue
Wrong. Glue is cheaper for intermittent workloads. EMR with Spot becomes cheaper only at 24/7 high utilization.
Trap 5 — Bookmarks Track Updates And Deletes
Wrong. Bookmarks track append-only progress. For CDC, use DMS to S3 and a Glue merge job.
Trap 6 — Glue Data Quality Is The Same As Model Monitor
Wrong. Glue Data Quality validates training data before training. Model Monitor watches inference data after deployment. Different stages.
Trap 7 — Glue Workflows Can Orchestrate SageMaker
Wrong. Glue Workflows are scoped to Glue jobs/crawlers only. Use SageMaker Pipelines or Step Functions for ML-end-to-end orchestration.
Decision Matrix — Glue Service Selection
| Use Case | Right Service | Wrong Answer Trap |
|---|---|---|
| Nightly TB-scale ETL with joins | Glue Spark + bookmarks | Glue Python shell |
| Small API-driven script (< 1 GB) | Glue Python shell | Glue Spark (overkill) |
| Distributed Hugging Face inference | Glue Ray | Glue Spark |
| Visual transformation by analyst | Glue DataBrew | Glue ETL notebook |
| ML feature engineering in SageMaker Studio | SageMaker Data Wrangler | Glue DataBrew |
| Discover schema of new S3 data | Glue Crawler | Glue ETL |
| Convert CSV to Parquet | Glue ETL job | Glue Crawler |
| 24/7 long-running Spark cluster | Amazon EMR | Glue |
| Validate training data quality | Glue Data Quality | SageMaker Model Monitor |
Key Numbers To Memorize For MLA-C01
- Glue Spark G.1X: 4 vCPU, 16 GB, $0.44/DPU-hour (Glue 4.0)
- Glue Python shell: 1 DPU or 0.0625 DPU lightweight
- Glue job bookmark: tracks append-only progress per source
- Glue Crawler: writes metadata only, does NOT transform data
- DataBrew: 250+ built-in transforms, recipe-based
- Glue Data Quality DQDL: completeness, uniqueness, freshness, conformance, accuracy, drift
- Glue Workflows: orchestrates Glue jobs/crawlers only, not SageMaker
- EMR: long-running cluster, custom Spark version, Hive/Presto/HBase support
FAQ — AWS Glue ETL For ML Top Questions
Q1 — When should I choose Glue Spark over Glue Python shell?
The decision is data volume and parallelism. Glue Python shell runs on a single worker (max 1 DPU = 16 GB memory) and cannot leverage parallel processing. For datasets larger than ~5-10 GB or workloads requiring joins, aggregations, or window functions across partitioned data, Glue Spark is mandatory. For datasets under 5 GB, single-stream API integrations, or scripts that orchestrate other AWS services, Python shell is dramatically cheaper (especially the 0.0625 DPU lightweight option at $0.0275/hour). The MLA-C01 stem signal "TB-scale" or "joining two large tables" points to Spark; "small API integration" or "fits in memory" points to Python shell.
Q2 — What's the difference between Glue DataBrew and SageMaker Data Wrangler?
Both are visual no-code/low-code data transformation tools, and both feel similar — but they live in different services and integrate at different points in the pipeline. Glue DataBrew is part of AWS Glue, integrates with the Glue Data Catalog and Lake Formation, runs as Glue jobs, and outputs to S3. It is used by data engineers and business analysts for general data preparation that may feed ML, BI dashboards, or data warehouses. SageMaker Data Wrangler is part of SageMaker Studio, integrates directly with Feature Store and SageMaker Pipelines, runs in SageMaker Processing, and is used by data scientists and ML engineers for ML-specific feature engineering as part of the SageMaker workflow. The exam trap is asking "an ML engineer working in SageMaker Studio wants to do feature engineering" — that is Data Wrangler, not DataBrew, even though both could technically do the work.
Q3 — Does a Glue Crawler transform my data into a different format?
No. A Glue Crawler only discovers schema and writes metadata to the Glue Data Catalog. The data files in S3 remain untouched in their original format. If you need to convert CSV to Parquet, you must run a Glue ETL job that reads CSV and writes Parquet — and then optionally run a Crawler (or use enableUpdateCatalog directly in the ETL job) to register the new Parquet output in the Data Catalog. This Crawler-vs-ETL distinction is one of the most-tested points in MLA-C01 Domain 1, and the wrong answer pattern is to suggest a Crawler alone solves a transformation problem.
Q4 — How do I handle incremental ML retraining without reprocessing the full historical dataset every night?
Two mechanisms work together. Glue job bookmarks track which data has already been processed (S3 keys for file sources, primary keys for JDBC sources) and skip already-seen data on subsequent runs. Pushdown predicates filter at read time so only relevant partitions are loaded — push_down_predicate="year=2026 AND month=05 AND day=02" reads only one day of partitioned data. Combine the two: bookmarks for "what is new since last run" and pushdown predicates for "only the partition I care about right now." Without these, a nightly retraining job rereads the full history every night, burning massive I/O cost. With them, the job processes only the new daily partition in seconds-to-minutes.
Q5 — When should I use Amazon EMR instead of AWS Glue for ML preprocessing?
EMR is the right choice in three scenarios. First, when you need a specific Apache Spark version Glue doesn't support yet (Glue trails the latest Spark releases by months). Second, when you need a 24/7 long-running cluster for streaming or interactive workloads — Glue jobs are per-run with startup overhead, EMR can stay up persistently. Third, when you need mixed Hadoop ecosystem services like Hive, Presto, HBase, or Flink alongside Spark on the same cluster. For typical MLA-C01 ETL workloads (nightly batch ETL, hourly transformation, on-demand feature derivation), Glue is the AWS-preferred answer because of serverless economics and tighter Catalog integration. The exam stem signal "custom Spark version" or "long-running" or "Hive + HBase" points to EMR; otherwise default to Glue.
Q6 — How do Glue Data Quality rules fit into an ML pipeline?
Glue Data Quality runs as a step in a Glue ETL job or as a node in a Glue Workflow, validating datasets against rules written in DQDL (Data Quality Definition Language). Common ML rules: Completeness "user_id" > 0.99 (no missing user IDs), Uniqueness "transaction_id" = 1.0 (no duplicate transactions), Freshness "event_time" < 1 day (data is current), ColumnValues "country" in ["US", "UK", "JP"] (only expected categories). On rule failure, the job either fails (blocking downstream training) or emits CloudWatch metrics and EventBridge events for alerting. Place Data Quality between raw-data ingestion and ML training so bad data never reaches the model. This is distinct from SageMaker Model Monitor, which runs against deployed model inputs and monitors for drift in production — Data Quality is pre-training, Model Monitor is post-deployment.
Further Reading — Official AWS Documentation
For depth beyond MLA-C01 scope, the authoritative AWS sources are: AWS Glue Developer Guide (job types, bookmarks, workflows), AWS Glue DataBrew Developer Guide, AWS Glue Data Quality documentation (DQDL reference), Amazon EMR Management Guide (for the EMR-vs-Glue comparison), AWS Lake Formation Developer Guide (catalog permissions), and the Machine Learning Lens of the AWS Well-Architected Framework. The AWS Big Data Blog has extensive Glue ETL pattern coverage, and the AWS Machine Learning Blog has multiple end-to-end posts showing Glue feeding SageMaker training.