SageMaker Data Wrangler is the visual feature-engineering tool inside SageMaker Studio that lets ML engineers and data scientists build, test, and export transformation flows without writing every line of PySpark by hand. On the MLA-C01 (Machine Learning Engineer Associate) exam, Data Wrangler appears throughout Domain 1 and shows up again in Domain 3 because exporting a Data Wrangler flow to a SageMaker Pipeline is the standard production handoff pattern. Pluralsight's MLA-C01 study guide flags Data Wrangler as a "high-frequency exam topic," and community pain-point research repeatedly cites the Data-Wrangler-vs-DataBrew confusion as one of the top three Domain 1 mistakes — the visual interfaces are similar enough that candidates who only studied one of them get burned by the other.
This guide covers:
- the Data Wrangler flow model and where it sits in the ML pipeline,
- the categorical encoding options the exam tests by name (one-hot, label, ordinal, binning),
- the numeric scaling and normalization transforms with their exam-relevant trade-offs,
- the missing-value imputation strategies including MICE,
- the class-imbalance handling toolbox (SMOTE, random oversampling/undersampling, class weights),
- the time-series and text feature pipelines,
- the SageMaker Processing job model that turns interactive flows into production pipelines, and
- the export paths to Feature Store, S3, and SageMaker Pipelines.
By the end you should be able to read a stem like "imbalanced fraud dataset, 0.5 percent positive rate, build an ML-ready dataset with proper class balance and reproducible feature pipeline" and immediately answer: Data Wrangler with SMOTE oversampling, exported as a Pipeline step.
What SageMaker Data Wrangler Is and Where It Sits
Data Wrangler is a feature inside SageMaker Studio that provides a spreadsheet-like, visual data preparation environment with 300+ built-in transforms, integrations to S3, Athena, Redshift, Snowflake, EMR, and Databricks as data sources, automatic data-quality reports, and one-click export to S3, Feature Store, SageMaker Processing, or SageMaker Pipelines. It targets the ML engineer and data scientist who want to iterate on features quickly and then operationalize the same transformation as a production pipeline step.
The Feature Engineering Step In MLA-C01 Context
Feature engineering — turning raw fields into model-ready signals — is roughly half of every ML project. The MLA-C01 exam doesn't expect you to know which features make XGBoost predict best (that's the data scientist's call); it expects you to know which AWS service does which transformation and how the transformation flows from interactive exploration into the production training pipeline. Data Wrangler is the visual interactive tool; SageMaker Processing is the production execution engine; SageMaker Pipelines is the orchestration layer.
Where Data Wrangler Differs From Glue DataBrew
Both are visual transformation tools. The key distinction is where they sit in the integration story. Data Wrangler lives inside SageMaker Studio, exports natively to Feature Store and Pipelines, and is targeted at ML-specific feature engineering. DataBrew lives inside AWS Glue, integrates with the Glue Data Catalog and Lake Formation, and is targeted at general data preparation. The MLA-C01 exam tests this distinction by stem context — "ML engineer in SageMaker Studio" → Data Wrangler; "data engineer producing curated S3 datasets" → DataBrew.
Plain-Language Explanation: SageMaker Data Wrangler
Feature engineering is one of those topics where the textbook makes it sound mathematical and the reality looks like Excel on steroids. Three concrete analogies make the structure stick.
Analogy 1 — The Tailor's Workshop Before A Wedding
Picture a wedding boutique where every gown arrives from the manufacturer in standard sizes (raw data) and must be altered for the specific bride (model-ready features) before the wedding. The measuring station where the tailor takes the bride's measurements and notes every adjustment needed is the Data Wrangler flow's analyze step — it generates the data quality report, identifies missing measurements (null values), spots outliers (anomalies), and produces the to-do list. The alteration table with hundreds of pre-built techniques (taking in seams, hemming, beading) is the 300+ Data Wrangler transforms — categorical encoders, numeric scalers, missing-value imputers, all ready to apply with a click. The specific alteration of converting the dress's standard size labels into precise measurements (one-hot encoding "size XL" into a vector) is categorical encoding. The balancing of an asymmetric design (left shoulder altered more than right) is class imbalance handling with SMOTE. The alteration log book that records every change so the next dress in the same line can be altered the same way is the Data Wrangler flow definition — a JSON document recording every transformation step, ready to replay on new data. The production sewing room that takes the alteration log and applies it to thousands of dresses for chain stores is SageMaker Processing running the exported flow at scale. The boutique's master fabric storage holding pre-altered components for reuse across weddings is Feature Store.
Analogy 2 — The Coffee Roaster's Recipe Lab
Picture a specialty coffee roaster where every shipment of green beans (raw data) must be roasted, blended, and packaged for sale (model-ready features). The recipe development lab where a master roaster experiments with roast profiles, blend ratios, and grind sizes is Data Wrangler in interactive mode — small samples, visual feedback, rapid iteration. The recipe card the master writes down — "230 °C for 12 minutes, blend 60% Brazilian + 40% Ethiopian, grind to 15 microns" — is the Data Wrangler flow definition. The production roaster in the warehouse that takes the recipe card and applies it to a 1000-pound shipment is SageMaker Processing. The specific operations on the recipe card — "remove defective beans" (missing-value imputation), "blend in a 60:40 ratio" (one-hot encoding categorical type), "grind to 15 microns" (normalization), "balance the chocolate notes" (SMOTE class balancing) — are the individual Data Wrangler transforms. The packaged roasted bags going to retail are the curated training-ready dataset in S3 or Feature Store. The flavor analysis report the master generates before approving the recipe is the Data Wrangler quick model and feature correlation reports — fast feedback on whether the recipe will produce sellable coffee before committing to the production batch.
Analogy 3 — The Hospital Lab Pre-Test Workup
Picture a hospital pathology lab preparing patient samples for a major diagnostic panel (the trained model). Raw samples arrive (raw data) but cannot go straight into the analyzer — they need centrifugation, dilution, reagent addition, and balancing. The lab tech's workstation with hundreds of pre-validated protocols is Data Wrangler with 300+ built-in transforms. The specific protocols — "centrifuge at 3000 rpm for 5 minutes" (normalization), "dilute 1:10 with saline" (min-max scaling), "label every tube with patient and timestamp" (one-hot encoding patient categories), "add reagent to balance pH" (class balance) — are individual transformations on the data. The lab's sample prep log recording every step applied to every sample is the flow definition. The automated robotic prep station in the back room that takes the prep log and processes 5000 samples overnight is SageMaker Processing. The central biobank where validated, prep-completed samples are stored for reuse across studies is Feature Store. When a new study (model retraining) starts, the biobank ships pre-prepped samples directly to the diagnostic analyzer — no rework needed. That biobank-as-feature-cache role is exactly Feature Store's place in the MLA-C01 architecture.
Data Wrangler Flow — Connecting To Data Sources
A Data Wrangler flow starts with one or more data sources, applies a sequence of transforms, and produces an output. The flow is saved as a .flow file (JSON) in your SageMaker Studio user space.
Supported Data Sources
Data Wrangler reads from: Amazon S3 (Parquet, CSV, JSON, ORC), Amazon Athena (queries against any Glue Catalog table), Amazon Redshift (direct queries), Snowflake, Databricks, Amazon EMR, Salesforce, and additional SaaS connectors. Sample size is configurable (default 50,000 rows for fast interactive feedback) but the flow applies to the full dataset at export time.
Importing And Joining Multiple Sources
You can import multiple datasets into one flow and use the Join transform to combine them. Common pattern: import a transactions table from Athena, import a users table from Redshift, join on user_id, and produce a unified feature table.
Sample Size And Iteration Speed
The interactive flow runs on a sample for fast feedback. When you finalize the flow and export, the full dataset is processed. Always test edge cases (large categorical cardinalities, extreme outliers) on the sample before assuming the production run will succeed.
Built-In Transforms — Categorical Encoding
The MLA-C01 exam expects you to recognize each encoding by name and know when to use which.
One-Hot Encoding
Converts a categorical column with N unique values into N binary columns. country with values US, UK, JP becomes three columns country_US, country_UK, country_JP, each 0 or 1. Use when categorical values have no inherent order and cardinality is low (typically under 100 unique values). High cardinality (1000+ values) explodes the feature space — use embedding or hashing instead.
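A plain-Python sketch of what the expansion produces (the `one_hot` helper and `country` prefix are illustrative, not a Data Wrangler API):

```python
# Minimal one-hot sketch: expand one nominal column into N binary indicator columns.
def one_hot(values, prefix):
    categories = sorted(set(values))  # fix column order for reproducibility
    return [{f"{prefix}_{c}": int(v == c) for c in categories} for v in values]

# "US" becomes {"country_JP": 0, "country_UK": 0, "country_US": 1}
rows = one_hot(["US", "UK", "JP", "US"], prefix="country")
```

Note that the number of output columns equals the cardinality of the input, which is exactly why high-cardinality nominals blow up the feature space.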
Label Encoding (Ordinal)
Converts a categorical column to integers by assigning each value a number. size with values S, M, L, XL becomes integers 0, 1, 2, 3. Implies order — appropriate for ordinal categories (size, education level, severity). Wrong for nominal categories (country, color) because the model interprets country=2 as twice country=1, which is meaningless.
Ordinal Encoding (Explicit Order)
Like label encoding but with explicit user-specified order. severity with low, medium, high, critical mapped to 0, 1, 2, 3 by your design, not by alphabetical default. Use when ordinal order matters and you don't trust auto-mapping.
Binning (Bucketing)
Converts continuous numeric values into categorical buckets. age becomes age_bucket with values 0-18, 19-34, 35-54, 55+. Useful when the relationship between feature and target is non-monotonic (older isn't strictly riskier — riskiest is teens and very old).
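A minimal sketch of the bucketing logic using Python's stdlib `bisect` (the edge values and labels come from the example above, not from any Data Wrangler default):

```python
import bisect

def bucketize(age, edges=(18, 34, 54), labels=("0-18", "19-34", "35-54", "55+")):
    # bisect_left treats each edge as an inclusive upper bound:
    # 18 -> "0-18", 19 -> "19-34", 54 -> "35-54", 55 -> "55+"
    return labels[bisect.bisect_left(edges, age)]
```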
Frequency / Count Encoding
Replaces a categorical value with its frequency in the dataset. Common for high-cardinality features where one-hot is impractical.
Target / Mean Encoding
Replaces a categorical value with the mean of the target variable for that category. Powerful for high-cardinality features but prone to leakage — must be computed only on training data, never on the validation set. Data Wrangler offers this with leakage protection.
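The leakage-safe discipline can be sketched in plain Python: fit the per-category means on training rows only, then apply them everywhere, falling back to the global mean for categories unseen in training (helper names are illustrative, not Data Wrangler's implementation):

```python
from collections import defaultdict

def fit_target_encoding(categories, targets):
    """Compute per-category target means on TRAINING data only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    global_mean = sum(targets) / len(targets)
    return {c: sums[c] / counts[c] for c in sums}, global_mean

def apply_target_encoding(categories, encoding, global_mean):
    # Unseen categories at validation/inference time fall back to the global mean.
    return [encoding.get(c, global_mean) for c in categories]
```

Fitting on the full dataset (training plus validation) is the leak: the validation rows' own labels would bleed into their features.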
Choose the right encoding by checking whether the categorical variable has inherent order. Nominal (no order: country, color, product type) → one-hot encoding if cardinality is low, hashing or embedding if high. Ordinal (has order: education level, satisfaction rating) → ordinal encoding with explicit ordering. The MLA-C01 wrong-answer trap is using label encoding on nominal data — the model treats country=2 as twice country=1, which is meaningless. Label encoding with auto-numbering is only safe on truly ordinal categories where alphabetical happens to match the order.
Numeric Feature Engineering
For numeric columns, the transforms manage scale and distribution.
Min-Max Scaling
Scales values to a fixed range, usually [0, 1]: (x - min) / (max - min). Use when you need bounded values (e.g., for neural networks with sigmoid outputs). Sensitive to outliers — one extreme value compresses everything else into a narrow band.
Standardization (Z-Score)
Subtracts the mean and divides by standard deviation: (x - mean) / std. Output has mean 0, std 1, but no fixed range. Use for scale-sensitive algorithms (linear models, neural networks, SVMs), especially when inputs are roughly bell-shaped. Less sensitive to outliers than min-max.
Robust Scaling
Uses median and interquartile range instead of mean and std: (x - median) / IQR. Use when data has outliers — robust scaling ignores them rather than letting them dominate.
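The three scalers side by side as a stdlib-Python sketch (production libraries differ in details, e.g. the quantile method used for the IQR, so exact numbers can vary):

```python
import statistics

def min_max(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    mu, sd = statistics.mean(xs), statistics.pstdev(xs)  # population std
    return [(x - mu) / sd for x in xs]

def robust(xs):
    q1, med, q3 = statistics.quantiles(xs, n=4)  # quartile cut points
    return [(x - med) / (q3 - q1) for x in xs]
```

Feeding the same column with one extreme outlier through all three makes the trade-off concrete: min-max compresses everything else toward 0, z-score shifts moderately, robust barely moves.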
Log Transform
log(x + 1), often written log1p. Use for highly-skewed distributions (income, count of events) — compresses the long tail and produces approximately normal output. Most valuable for linear models and neural networks; tree-based models are largely insensitive to it because their splits depend only on value order.
Power / Box-Cox Transform
Like log transform but with a learned exponent. Auto-selects the transformation that produces the most normal-looking distribution.
Decision Heuristic
- Tree-based models (XGBoost, Random Forest) — scaling matters little because splits depend only on value order; a log transform on heavy-tailed features is optional rather than required.
- Linear models, neural networks, k-NN, SVMs — scaling matters; default to standardization, use min-max if you need bounded output.
- Heavy outliers — robust scaling.
Handling Missing Values
Missing data is the silent killer of ML pipelines. MLA-C01 expects you to know the imputation toolbox.
Drop Rows / Drop Columns
If missingness is rare and uncorrelated, drop affected rows. If a column is mostly null, drop the column. Last resort, not first.
Mean / Median / Mode Imputation
Replace nulls with the column's mean (continuous), median (continuous, robust to outliers), or mode (categorical). Simple, fast, but introduces bias when missingness is correlated with the target.
Forward Fill / Backward Fill
For time-series, replace null with the previous (or next) value. Standard for sensor data with intermittent gaps.
KNN Imputation
Find the K most similar rows by other features, take the mean of their values for the missing column. Better than simple mean for correlated features.
MICE (Multiple Imputation by Chained Equations)
Iteratively imputes each column using regression on the others. Most accurate, most expensive. Data Wrangler offers MICE as a built-in option.
Indicator Column
Add a binary _was_missing column alongside the imputed value. Lets the model learn that missingness itself is informative.
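A stdlib sketch combining median imputation with an indicator column (the helper name is illustrative; Data Wrangler offers both as separate transforms):

```python
import statistics

def impute_with_indicator(xs):
    """Median-impute a numeric column and add a was-missing indicator."""
    observed = [x for x in xs if x is not None]
    fill = statistics.median(observed)  # robust to outliers, unlike the mean
    imputed = [x if x is not None else fill for x in xs]
    was_missing = [int(x is None) for x in xs]
    return imputed, was_missing
```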
When To Choose Which
- < 5% missing, MCAR (missing completely at random) — drop rows.
- 5-20% missing — mean/median/mode for speed, KNN for accuracy.
- 20-40% missing — MICE if accuracy matters and compute is available.
- > 40% missing — consider dropping the column unless it carries signal.
Class Imbalance Handling
Imbalanced classification — fraud, churn, equipment failure — is one of MLA-C01's most heavily-tested feature engineering topics.
Why Imbalance Matters
A fraud dataset with 0.5% positive rate naively trained on accuracy will predict "not fraud" 100% of the time and score 99.5% accuracy — useless. The model never learns the minority class because the loss is dominated by the majority.
SMOTE (Synthetic Minority Over-sampling Technique)
SMOTE generates synthetic minority-class examples by interpolating between existing minority samples. For each minority sample, SMOTE finds K nearest minority neighbors and creates new samples along the line connecting them. Output is a balanced dataset with realistic-looking synthetic data. Data Wrangler has SMOTE as a one-click transform.
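A toy illustration of the interpolation idea in stdlib Python (not Data Wrangler's or imbalanced-learn's implementation; real SMOTE scales features and handles edge cases, but the core is exactly this neighbor interpolation):

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating each sampled minority
    point toward one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbors = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(p, x),
        )[:k]
        nb = rng.choice(neighbors)
        u = rng.random()  # position along the segment between x and nb
        synthetic.append(tuple(xi + u * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic
```

Every synthetic point lies on a segment between two real minority points, which is why the output looks realistic for tabular data and meaningless for raw pixels.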
Random Oversampling
Duplicate minority-class rows until classes are balanced. Simple but introduces overfitting risk — the model memorizes the duplicated samples.
Random Undersampling
Drop majority-class rows until classes are balanced. Simple, reduces dataset size, may lose information from the majority class.
Class Weights
Instead of resampling, assign higher loss weight to minority-class errors during training. The model learns from imbalanced data but pays more attention to minority misses. SageMaker built-in algorithms (XGBoost) support scale_pos_weight.
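Computing the weight is just the negative-to-positive ratio, which is the value XGBoost's `scale_pos_weight` expects:

```python
def scale_pos_weight(labels):
    """Negative/positive count ratio for binary labels (1 = minority/positive)."""
    pos = sum(labels)
    neg = len(labels) - pos
    return neg / pos

# A 0.5% positive rate gives 995 / 5 = 199, the ~200 quoted in exam stems.
labels = [1] * 5 + [0] * 995
```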
Combining Techniques
SMOTE + random undersampling is a common combo: oversample minority synthetically, then drop some majority to balance. Tune the ratio per dataset.
For severely imbalanced classification (less than 1% positive class), default to SMOTE plus class weights, not naive oversampling. Naive oversampling duplicates minority rows verbatim, which causes the model to overfit to those exact samples and perform poorly on truly novel minority examples. SMOTE creates synthetic but realistic minority samples by interpolating between existing ones, which generalizes better. Class weights (e.g., scale_pos_weight=200 for a 0.5% positive rate) tell the loss function to penalize minority-class mistakes 200x harder. Combining the two — SMOTE for richer training distribution, class weights for loss-aware learning — is the production pattern Data Wrangler offers. The MLA-C01 stem signal "0.5% fraud rate" or "extreme class imbalance" maps to SMOTE + weights; naive oversampling is usually the wrong-answer trap.
Text Feature Engineering
For NLP and text-classification tasks, Data Wrangler offers a focused but useful toolbox.
Tokenization
Split text into words or subwords. Data Wrangler offers basic tokenization; for production NLP, custom Processing jobs with Hugging Face tokenizers are common.
TF-IDF (Term Frequency - Inverse Document Frequency)
Convert text into a numeric matrix where each cell is the importance of a word in a document. Standard for traditional ML on text (logistic regression on TF-IDF vectors).
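A toy TF-IDF in stdlib Python using the classic tf × ln(N/df) weighting (sklearn's TfidfVectorizer uses a smoothed idf and row normalization, so its numbers differ, but the structure is the same):

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of tokenized documents. Returns one {word: weight} dict per doc."""
    n = len(docs)
    df = Counter()                     # document frequency per word
    for doc in docs:
        df.update(set(doc))
    out = []
    for doc in docs:
        tf = Counter(doc)
        out.append({w: (tf[w] / len(doc)) * math.log(n / df[w]) for w in tf})
    return out
```

A word that appears in every document gets idf = ln(N/N) = 0 and is weighted out, which is the whole point: common words carry no discriminative signal.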
Word Embeddings
For deep-learning text models, words are mapped to dense vectors (Word2Vec, GloVe). Within SageMaker, the BlazingText built-in algorithm generates Word2Vec-style embeddings.
Character / Subword Encoding
For modern transformer models, BPE (Byte-Pair Encoding) or WordPiece tokenization. Typically done in custom Processing or directly in the training script.
When Data Wrangler Is The Right Text Tool
For traditional ML on text (TF-IDF + linear/tree models), Data Wrangler suffices. For deep NLP (transformers, fine-tuning), use SageMaker JumpStart or custom Processing with Hugging Face — Data Wrangler's text capabilities are not designed for that.
Time-Series Feature Engineering
Time-series data needs special features that capture temporal structure.
Lag Features
Create columns representing the value N time periods ago. sales_lag_7 = sales 7 days ago. Critical for forecasting models that need to learn from recent history.
Rolling Window Statistics
Compute mean, sum, std, min, max over a rolling window. sales_rolling_7d_mean = average of last 7 days' sales. Smooths out noise and reveals trends.
Timestamp Decomposition
Extract hour_of_day, day_of_week, month, is_weekend, is_holiday from a timestamp. Captures cyclical patterns.
Cyclic Encoding
For cyclical features like hour-of-day, encode as sin(2π * hour / 24) and cos(2π * hour / 24) so the model learns that hour 23 is close to hour 0.
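A quick check of that claim in stdlib Python (the helper name is illustrative):

```python
import math

def cyclic_hour(hour):
    """Encode hour-of-day as a point on the unit circle."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)

# Hours 23 and 0 are adjacent on the clock and close in encoded space,
# whereas a raw integer encoding puts them 23 units apart.
d_adjacent = math.dist(cyclic_hour(23), cyclic_hour(0))
d_opposite = math.dist(cyclic_hour(12), cyclic_hour(0))
```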
Differencing
Replace value with value - lag_1 to make a series stationary (remove trend). Often required for ARIMA-style models.
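The three time-series transforms above as stdlib sketches (in practice pandas `shift`, `rolling`, and `diff` do this; the helper names are illustrative, and `None` marks positions with insufficient history):

```python
def lag(xs, n):
    """Value n periods ago; the first n entries have no history."""
    return [None] * n + xs[:-n]

def rolling_mean(xs, window):
    """Trailing mean over the last `window` values, inclusive of the current one."""
    return [
        sum(xs[i - window + 1 : i + 1]) / window if i >= window - 1 else None
        for i in range(len(xs))
    ]

def difference(xs):
    """First difference: value minus its lag-1, removing trend."""
    return [None] + [b - a for a, b in zip(xs, xs[1:])]
```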
Why The Exam Tests Time-Series Feature Engineering
DeepAR is a SageMaker built-in algorithm for time-series forecasting (covered in the Training Jobs topic), and Forecast is an AWS managed service. Both work better with engineered time features. The MLA-C01 stem signal "forecast next quarter's sales" or "predict equipment failure 7 days ahead" expects the candidate to know lag and rolling-window features.
SageMaker Processing — From Interactive To Production
Data Wrangler is interactive. To run the same flow on production-scale data, you export to SageMaker Processing.
What SageMaker Processing Is
SageMaker Processing is a managed service for running data processing workloads on demand. It launches a container (built-in Scikit-Learn, Spark, PyTorch, or BYOC) on managed instances, runs your script against input data in S3, writes output to S3, and tears down the cluster. Pay per instance-hour.
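As a sketch of what "runs your script against input data in S3" looks like at the API level, here is the shape of a boto3 `create_processing_job` request (field names follow the SageMaker API; every value — job name, image URI, bucket paths, instance type — is a placeholder assumption):

```python
def processing_job_request(job_name, role_arn, image_uri, input_s3, output_s3):
    """Build the request dict for boto3's sagemaker.create_processing_job.
    All argument values are placeholders for illustration."""
    return {
        "ProcessingJobName": job_name,
        "RoleArn": role_arn,
        "AppSpecification": {"ImageUri": image_uri},
        "ProcessingResources": {
            "ClusterConfig": {
                "InstanceCount": 1,
                "InstanceType": "ml.m5.xlarge",
                "VolumeSizeInGB": 30,
            }
        },
        "ProcessingInputs": [
            {
                "InputName": "raw",
                "S3Input": {
                    "S3Uri": input_s3,
                    "LocalPath": "/opt/ml/processing/input",
                    "S3DataType": "S3Prefix",
                    "S3InputMode": "File",
                },
            }
        ],
        "ProcessingOutputConfig": {
            "Outputs": [
                {
                    "OutputName": "features",
                    "S3Output": {
                        "S3Uri": output_s3,
                        "LocalPath": "/opt/ml/processing/output",
                        "S3UploadMode": "EndOfJob",
                    },
                }
            ]
        },
    }
```

The SageMaker Python SDK's processor classes (next section) generate requests of this shape for you.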
How Data Wrangler Exports To Processing
Data Wrangler's "Export" menu offers four targets: SageMaker Processing job (Python or Spark), SageMaker Pipelines (as a Processing step), Amazon S3 (one-shot run), and Feature Store (direct ingestion). Choosing Processing produces a Python notebook that runs your flow on Processing — runnable as-is.
Built-In Processing Containers
- SKLearnProcessor — Scikit-Learn 0.20+ pre-installed; use for sklearn-based pipelines.
- PySparkProcessor — Apache Spark 3.x; use for distributed feature engineering.
- PyTorchProcessor / TensorFlowProcessor — for ML-framework-specific preprocessing.
- ScriptProcessor — generic Python script with a Docker image of your choice.
When To Use Processing Directly Instead Of Data Wrangler
Data Wrangler is interactive; for fully programmatic feature pipelines (CI/CD, no-UI workflows), write Processing scripts directly. The MLA-C01 stem signal "production pipeline, no manual UI step" maps to Processing without Data Wrangler.
Data Wrangler is the interactive design tool; SageMaker Processing is the production execution engine. Data Wrangler exports its flow definition AS a Processing job — they are complementary, not competing. The MLA-C01 mental model: a data scientist iterates on features in Data Wrangler in SageMaker Studio, exports the finalized flow as a Processing job script, and the ML engineer wires that script into a SageMaker Pipeline as a ProcessingStep. Both stages use the same flow definition, ensuring training and inference use identical preprocessing and eliminating training-serving skew.
Exporting Data Wrangler Flows
The export step turns interactive work into reusable production assets.
Export To Amazon S3
One-shot run that processes the full dataset using the flow definition and writes output to a specified S3 prefix. Useful for ad-hoc dataset generation.
Export To Feature Store
Directly ingests the flow output into a SageMaker Feature Store feature group. The export wizard creates the feature group schema from the flow output schema. After export, features are queryable from the online store (low-latency real-time inference) and offline store (S3 Parquet for training).
Export As SageMaker Pipeline Step
Generates a ProcessingStep definition that can be added to a SageMaker Pipeline. The flow runs as a step in the pipeline DAG, with input and output S3 paths parameterized at pipeline runtime.
Export As Python Code
For full programmatic control, export the flow as a Python script (PySpark or Pandas) that can be edited and version-controlled. This is the recommended path for production workflows where the visual UI was used for prototyping but the deployed pipeline lives in Git.
Feature Importance and Quick Model Reports
Data Wrangler builds in fast feedback loops on feature quality.
Quick Model Report
A one-click XGBoost model trained on a sample, with cross-validated F1 (classification) or MSE (regression) and feature importance. Use to validate that features have predictive signal before committing to a full training pipeline.
Bias Report
Built-in fairness analysis on a sensitive feature (e.g. demographic attribute). Output: class imbalance, label imbalance, and other fairness metrics. Narrower in scope than SageMaker Clarify, the dedicated full bias-and-explainability service; Data Wrangler's report covers pre-training bias metrics only.
Target Leakage Detection
Quick check that no feature trivially predicts the target (often a sign of a data leak — e.g. including is_fraud_label as a feature). Critical pre-training check.
Multicollinearity Detection
Reports highly-correlated features. Helps decide which to drop for linear models that assume independent features.
Why The Reports Matter
Quick model and bias reports reduce the iteration cycle from "build full pipeline, train full model, see it fails" to "click report, see the issue, fix the flow." On the MLA-C01 exam, "quickly validate feature usefulness before training" → Data Wrangler quick model.
Common Exam Traps For Data Wrangler
Trap 1 — Data Wrangler And DataBrew Are Interchangeable
Wrong. Data Wrangler is SageMaker-centric, ML-specific, exports to Feature Store and Pipelines. DataBrew is Glue-centric, general-purpose, exports to S3 and Glue Catalog.
Trap 2 — Data Wrangler Is Production-Ready As-Is
Partial. The interactive flow runs on a sample for fast feedback. Production runs require export to SageMaker Processing or Pipelines. The flow is a design artifact, not the production runtime.
Trap 3 — Label Encoding Is Always Safe For Categoricals
Wrong. Label encoding implies ordinal order. Safe only for truly ordinal categories. For nominal categories (country, color), use one-hot encoding (low cardinality) or hashing/embedding (high cardinality).
Trap 4 — Mean Imputation Is Always The Right Default
Wrong. Mean imputation introduces bias when missingness is correlated with the target. For careful imputation, use KNN or MICE; for missingness-as-signal, add an indicator column.
Trap 5 — SMOTE Is The Universal Imbalance Solution
Partial. SMOTE works well for moderate imbalance and structured tabular data. For extreme imbalance (under 0.1%) or text/image data, combine with class weights and consider undersampling. SMOTE on raw text or images directly does not work — interpolation in pixel space is meaningless.
Trap 6 — Data Wrangler Replaces Glue ETL
Wrong. Data Wrangler is for ML feature engineering inside SageMaker. Glue ETL is for general data lake transformations feeding multiple consumers (BI, warehouse, ML). Production architectures often use both — Glue for data-lake-wide curation, Data Wrangler for ML-specific feature derivation.
Trap 7 — Quick Model Report Equals Full Model Evaluation
Wrong. Quick model is a fast XGBoost on a sample for feature-validation feedback. It does NOT replace full training, hyperparameter tuning, and rigorous evaluation.
Data Wrangler is for INTERACTIVE feature engineering in SageMaker Studio; SageMaker Processing is the PRODUCTION execution engine; Data Wrangler exports flows AS Processing jobs and Pipeline steps to bridge the two. This is the mental model the MLA-C01 exam keeps coming back to. The interactive UI is for exploration and prototyping. The exported flow is the production artifact. The same flow definition runs both — guaranteeing training-time and inference-time preprocessing match, which is how Feature Store and Pipelines eliminate training-serving skew. Anytime a stem says "feature engineering in SageMaker Studio" → Data Wrangler. "Production pipeline running the same transformations" → exported as Processing or Pipeline step.
Decision Matrix — Feature Engineering Service Selection
| Use Case | Right Service | Wrong Answer Trap |
|---|---|---|
| Interactive ML feature engineering in Studio | SageMaker Data Wrangler | Glue DataBrew |
| Visual data prep by data engineer | Glue DataBrew | Data Wrangler |
| Production feature engineering at TB scale | SageMaker Processing (Spark) | Data Wrangler interactive |
| Categorical feature with no order, low cardinality | One-hot encoding | Label encoding |
| Categorical feature with order (S/M/L/XL) | Ordinal encoding | One-hot encoding |
| 0.5% positive class fraud detection | SMOTE + class weights | Naive oversampling |
| 30% missing values, accuracy critical | MICE imputation | Drop rows |
| Heavy-tailed numeric (income, counts) | Log transform | Min-max scaling |
| Time-series forecasting features | Lag + rolling window + timestamp decomposition | Raw timestamp only |
| Export feature pipeline to production | Pipeline ProcessingStep | One-shot S3 export |
Key Numbers To Memorize For MLA-C01
- Data Wrangler default sample: 50,000 rows (configurable)
- 300+ built-in transforms
- Export targets: S3, Feature Store, Processing job, Pipelines step, Python code
- Built-in Processing containers: SKLearnProcessor, PySparkProcessor, PyTorchProcessor, ScriptProcessor
- One-hot encoding cardinality threshold: typically <100 unique values
- Class weights for imbalanced training: scale_pos_weight in XGBoost
- SMOTE: synthetic minority oversampling, interpolation between K-nearest minority neighbors
FAQ — SageMaker Data Wrangler and Feature Engineering Top Questions
Q1 — When should I use SageMaker Data Wrangler instead of AWS Glue DataBrew?
Both are visual data transformation tools and feel similar, but they live in different services and integrate at different points. Use Data Wrangler when you are an ML engineer or data scientist working in SageMaker Studio, your output destination is Feature Store or a SageMaker Pipeline, and your transformations are ML-specific (encoding for downstream models, SMOTE for class balance, time-series lag features). Use Glue DataBrew when you are a data engineer producing curated datasets in S3, your output feeds multiple consumers (BI, warehouse, ML), and your work integrates with the Glue Data Catalog and Lake Formation. The MLA-C01 stem signal "ML engineer in SageMaker Studio, export to Feature Store" → Data Wrangler. "Data engineer producing Glue Catalog table for Athena" → DataBrew. They are not competitors — production architectures often use both.
Q2 — How do I handle a fraud detection dataset with only 0.5% positive class?
Combine three techniques. First, use SMOTE in Data Wrangler to generate synthetic minority-class examples by interpolating between existing minority samples — produces a balanced training set with realistic synthetic data. Second, set class weights on the training algorithm (scale_pos_weight=200 for XGBoost given a 0.5% positive rate) so the loss function penalizes minority-class errors 200x harder than majority. Third, evaluate with metrics that handle imbalance — precision, recall, F1, AUC-ROC, and especially AUC-PR (precision-recall curve) — never accuracy alone, which would score 99.5% on a model that always predicts "not fraud." Naive oversampling (duplicating minority rows) is the wrong-answer trap because it causes the model to memorize duplicates and overfit. The combination of SMOTE + class weights + appropriate metrics is the production-grade answer.
Q3 — Should I use one-hot encoding or label encoding for a country column?
One-hot encoding. Country is a nominal categorical (no inherent order — Japan is not "greater than" the UK), so label encoding (mapping countries to integers 0, 1, 2, 3) introduces a false ordinal relationship the model will exploit. Use one-hot encoding for low-cardinality nominals (typically under 100 unique values). For high-cardinality nominals (e.g., ZIP codes with 40,000+ values), use feature hashing or embedding instead — one-hot would explode the feature space. Label encoding is only safe on truly ordinal categories like education level (high_school, bachelor, master, phd) where the integer order matches the semantic order. The MLA-C01 trap is using label encoding on nominal data because it produces fewer columns; the trap costs you model accuracy.
Q4 — How do I move a Data Wrangler flow from interactive prototyping to a production pipeline?
The standard path has three steps. First, finalize the flow in Data Wrangler interactively, validating with the Quick Model report and the data quality / bias reports. Second, export the flow via the Data Wrangler Export menu — choose "SageMaker Pipeline" to generate a ProcessingStep definition, or "Python code" for full programmatic control. Third, wire the step into a SageMaker Pipeline alongside training, evaluation, and deployment steps. The exported step is parameterized — pipeline parameters override input/output S3 paths at runtime, so the same flow definition runs against different datasets in dev, staging, and production. This export-to-Pipeline pattern is what eliminates training-serving skew: the same flow runs at training time and at inference preprocessing time, guaranteeing consistent feature transformations.
Q5 — When should I use MICE imputation instead of mean imputation for missing values?
Choose based on missingness severity and accuracy requirements. Mean / median / mode imputation is fast and simple — fine for under 5% missing where bias from naive imputation is negligible. MICE (Multiple Imputation by Chained Equations) is more accurate because it iteratively imputes each column using regression on the others — fine for 20-40% missingness where naive imputation would distort the distribution. Use MICE when accuracy matters and compute is available, and especially when missingness is correlated with other features (MICE captures the correlation; mean imputation does not). Above 40% missing, consider whether the column should be dropped entirely or replaced with a _was_missing indicator if the missingness itself carries signal. Data Wrangler offers both as built-in transforms; the choice is about cost (MICE is more expensive) and accuracy requirements.
Q6 — Does Data Wrangler replace SageMaker Processing or are they used together?
They are complementary, not competing. Data Wrangler is the interactive feature-engineering UI in SageMaker Studio — used for exploration, prototyping, and rapid iteration on a sample. SageMaker Processing is the production execution engine that runs preprocessing scripts at full data scale on managed infrastructure. The standard production pattern: a data scientist designs the feature pipeline interactively in Data Wrangler, then exports the flow as a Processing job script (or as a SageMaker Pipeline step that wraps Processing), and the ML engineer wires that into the production training and inference pipeline. Both stages use the same flow definition — Data Wrangler at design time, Processing at runtime — guaranteeing the production pipeline replicates exactly what worked in prototyping. The MLA-C01 stem "ML engineer iterates on features in Studio, then ships to production" maps to Data Wrangler flow → Processing job export.
Further Reading — Official AWS Documentation
For depth beyond MLA-C01 scope, the authoritative AWS sources are: the SageMaker Developer Guide Data Wrangler section (especially the Transform, Balance, and Export pages), the SageMaker Processing job documentation, the SageMaker Feature Store developer guide, the SageMaker Pipelines step type reference, and the Machine Learning Lens of the AWS Well-Architected Framework. The AWS Machine Learning Blog has multiple end-to-end tutorial posts showing Data Wrangler flows exported to production pipelines, and the AWS Skill Builder MLA-C01 exam prep course includes a hands-on lab where you build a Data Wrangler flow, export to Feature Store, and run a training job against the exported features.