ML data quality, integrity, and labeling is the third pillar of MLA-C01 Domain 1 (Data Preparation for Machine Learning, 28 percent weight) and corresponds directly to Task 1.3 — "Ensure data integrity and prepare data for modeling." Community study guides from K21 Academy and the Pluralsight editorial team consistently flag this surface as under-studied by candidates, who over-rotate on Glue ETL and Feature Store while skimming the data quality and labeling content. The exam punishes that gap with scenario questions of the form "the model is accurate in evaluation but degrades within two weeks in production" — where the correct answer is a data quality rule that should have failed the pipeline before a flawed dataset ever reached the trainer.
This guide covers what data quality means in an ML context, the five quality dimensions the exam expects you to know, how AWS Glue Data Quality enforces rules at pipeline scale, how SageMaker Data Wrangler surfaces quality reports interactively, how schema validation and data contracts prevent type drift, how SageMaker Model Monitor baselines tie training-time data quality to production drift detection, the four labeling paths in SageMaker Ground Truth and Ground Truth Plus, automated labeling with active learning, PII and PHI compliance with Macie and Comprehend, and the data residency rules that AWS Config audits and SCPs enforce. Each section maps to a specific MLA-C01 trap.
What Is ML Data Quality?
ML data quality is the discipline of ensuring training data is accurate, complete, consistent, timely, and unique before a model is trained on it. Poor data quality is the root cause of most production ML failures — far more than algorithm choice or hyperparameter tuning. A model trained on a dataset where 30 percent of one feature is missing, or where a label column is silently flipped for one customer segment, will produce predictions that look reasonable in evaluation but degrade in deployment. Data quality is therefore the first ML engineering discipline the MLA-C01 exam tests, because every later pipeline stage assumes the data is clean.
Why Data Quality Is The Root Cause Of Most Model Degradation
Industry surveys of failed production models consistently identify data issues — missing values, schema drift, mislabeled records, training-serving skew — as the dominant failure mode. Algorithms are commodities; data is the differentiator. An ML engineering team that invests in quality gates, schema validation, and labeling discipline ships models that survive production; a team that does not ships models that look great until launch and then quietly rot. The exam reflects this reality with questions about quality rules, baseline drift, and labeling workflow design.
The Five Data Quality Dimensions
The five quality dimensions you must memorize:
- Completeness — every required field is populated; missing-value rate stays under threshold.
- Accuracy — values reflect ground truth; for example, a postal code matches the city.
- Consistency — same entity has same representation across rows and tables; a user ID is not "USR-123" in one feed and "user_123" in another.
- Timeliness — data arrives within an acceptable freshness window; yesterday's training set is fresh, last quarter's is not.
- Uniqueness — primary keys are unique; duplicate rows do not skew model fit.
Glue Data Quality and Data Wrangler quality reports both organize their checks around these dimensions.
Plain-Language Explanation: ML Data Quality
Data quality is the kind of topic where six AWS services compete for the same paragraph. Three concrete analogies make the structure stick.
Analogy 1 — The Restaurant Health Inspection
Picture a busy restaurant. Every morning before service, the head chef runs a station inspection — refrigerator temperature within range, produce within use-by date, knives sharp, prep portions match the day's forecast. If any check fails, the kitchen does not open until the failure is remediated. The inspection is AWS Glue Data Quality — a defined ruleset that runs before downstream consumption and blocks the pipeline if the data does not meet specification. The chef's tasting spoon, used to verify a sauce reduction tastes right before plating, is SageMaker Data Wrangler's quality report — interactive sampling and visual confirmation during recipe development.
The health inspector who arrives unannounced once a quarter to verify the kitchen meets regulatory standards is AWS Config — automated, scheduled compliance checks evaluating whether resources match a policy. The inspector who searches the freezer for unlabeled allergens is Amazon Macie — scanning storage for sensitive contents that should not be there. The food critic who checks whether the dish on the table matches what the customer ordered is SageMaker Ground Truth — comparing the labeled deliverable against the ground-truth specification.
If the chef skips the morning inspection, the kitchen ships unsafe food and customers get sick (model degrades in production). If the chef fakes the inspection paperwork, the regulator finds out and shuts the restaurant down (audit failure). The discipline of running every check, every shift, every dish, is what keeps the restaurant operating — and what keeps an ML system in production.
Analogy 2 — The Pharmaceutical Manufacturing Quality Control Lab
Picture a drug manufacturing plant. Every batch of API (active pharmaceutical ingredient) passes through a quality control lab that measures purity, potency, sterility, and dissolution rate against a written specification. Batches outside spec are quarantined, never shipped. The QC lab is Glue Data Quality running rule sets against ML datasets — every batch is the dataset for a training run, every check is a rule like "completeness of customer_id column ≥ 99 percent" or "mean of transaction_amount within ±2 standard deviations of historical baseline."
The certificate of analysis (CoA) that travels with each batch is the Data Wrangler quality report — a snapshot document attesting to what the QC team measured and signed off. The signed batch record stored in regulatory archives is the SageMaker Model Monitor baseline — a saved statistical reference that downstream production monitoring compares ongoing batches against to detect drift. The trace label on every vial that lets investigators recall a contaminated lot is AWS Config and CloudTrail — the audit chain that documents which dataset was used to train which model and where it was deployed.
The compliance officer scanning the warehouse for unlabeled or expired products is Macie scanning S3 buckets for PII that should not be in training data. The training program that teaches new lab techs to spot adulterated raw materials is SageMaker Ground Truth — humans in a workflow producing labeled data so the automated systems learn what good and bad look like. Every step is mandatory; skipping any one of them puts the patient (the production model and its end users) at risk.
Analogy 3 — The University Dissertation Defence
Picture a graduate student preparing a dissertation. The data they collected is their training dataset. The dissertation committee has rules — methodology must be sound (data quality), citations must be complete (completeness), references must agree across chapters (consistency), data must be from the relevant decade (timeliness), and no source can be plagiarized or duplicated (uniqueness). The committee runs these checks at the proposal defence before the student writes the final dissertation, and again at the final defence before the degree is granted. Each defence is a quality gate that can block progression.
The proposal defence is Glue Data Quality at ingestion — fail fast before the heavy work happens. The final defence is Model Monitor baseline establishment plus drift detection — once the work is done, ongoing comparison to a saved benchmark catches divergence in production. The institutional review board (IRB) that approves the use of human-subject data is Macie plus Comprehend PII detection — explicit verification that sensitive data is handled within ethical and regulatory rules. The librarian who taught the student how to cite is SageMaker Ground Truth labeling guidelines — documented standards that produce consistent, high-quality labeled data across many graduate students working on the same field.
A dissertation that passes proposal defence and final defence and IRB approval is a defensible scholarly contribution. An ML model that passes Glue Data Quality at ingestion, Model Monitor baseline at training, and Macie scans at storage is a defensible production system — both rest on the same discipline of running every check at every stage.
AWS Glue Data Quality — Pipeline-Scale Quality Enforcement
AWS Glue Data Quality is the AWS-native service for defining and running data quality rules at pipeline scale.
Built-In Rule Types
Glue Data Quality ships with around twenty built-in rule types that map to the five quality dimensions:
- Completeness rules — `Completeness "column" > 0.95` requires at least 95 percent of values populated.
- Uniqueness rules — `Uniqueness "column" > 0.99` requires near-unique values.
- Range rules — `ColumnValues "column" between X and Y`.
- Pattern rules — `ColumnValues "column" matches "[regex]"`.
- Statistical rules — `Mean "column" between X and Y` or `StandardDeviation "column" between X and Y`.
- Freshness rules — `RowCountMatch` between datasets or `DataFreshness` against a last-updated timestamp.
- Foreign-key rules — `ReferentialIntegrity` between two tables.
Rules are written in DQDL (Data Quality Definition Language), a domain-specific syntax purpose-built for these checks.
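As a concrete illustration, here is a minimal sketch of a DQDL ruleset registered against a catalog table with boto3. The database, table, column names, and thresholds are hypothetical placeholders, not values from any particular pipeline.

```python
import boto3

glue = boto3.client("glue")

# DQDL ruleset covering completeness, uniqueness, a range check, and a
# statistical bound. All names and thresholds are illustrative.
ruleset = """
Rules = [
    Completeness "customer_id" > 0.95,
    Uniqueness "transaction_id" > 0.99,
    ColumnValues "transaction_amount" between 0 and 100000,
    Mean "transaction_amount" between 10 and 500
]
"""

glue.create_data_quality_ruleset(
    Name="training-data-gate",
    Ruleset=ruleset,
    TargetTable={
        "TableName": "transactions",       # hypothetical catalog table
        "DatabaseName": "ml_training_db",  # hypothetical catalog database
    },
)
```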
How Glue Data Quality Runs
You attach a ruleset to a Glue Data Catalog table or define it inline in a Glue ETL job. When the job runs, the ruleset evaluates against the data, producing a score (percentage of rules passed) and a report (per-rule pass or fail with the offending row count). You configure the job to fail and stop on a score below threshold, or to continue but emit a CloudWatch event for asynchronous handling. The fail-stop pattern is the right default for production ML pipelines — you do not want to train a model on a dataset that failed quality.
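A hedged sketch of the fail-stop pattern using the Glue data quality run APIs follows; the role ARN, resource names, and the 0.9 threshold are illustrative assumptions.

```python
import time
import boto3

glue = boto3.client("glue")

# Evaluate the ruleset against the catalog table (names and role are placeholders).
run_id = glue.start_data_quality_ruleset_evaluation_run(
    DataSource={
        "GlueTable": {"DatabaseName": "ml_training_db", "TableName": "transactions"}
    },
    Role="arn:aws:iam::123456789012:role/GlueDataQualityRole",
    RulesetNames=["training-data-gate"],
)["RunId"]

# Poll until the evaluation run finishes.
while True:
    run = glue.get_data_quality_ruleset_evaluation_run(RunId=run_id)
    if run["Status"] in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)

# Each result carries an overall score (a 0-1 ratio of rules passed).
# Fail-stop: raise so the pipeline halts before training on bad data.
for result_id in run.get("ResultIds", []):
    result = glue.get_data_quality_result(ResultId=result_id)
    if result["Score"] < 0.9:  # illustrative threshold
        raise RuntimeError(f"Data quality gate failed with score {result['Score']}")
```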
Integration With SageMaker Pipelines
A Glue Data Quality job is a natural step early in a SageMaker Pipelines DAG: ingest → quality check → training data preparation → training → evaluation → registry → deploy. The ConditionStep can branch on the Data Quality score, sending bad data to a remediation queue and good data forward to training. This is the canonical MLOps pattern and a heavily tested exam scenario.
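One way the branch might look in the SageMaker Python SDK, assuming a preceding step named QualityCheck writes the Glue score to a quality.json property file, and that training_step and remediation_step are defined elsewhere in the pipeline. All names here are hypothetical.

```python
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
from sagemaker.workflow.functions import JsonGet
from sagemaker.workflow.properties import PropertyFile

# Property file emitted by the (hypothetical) QualityCheck step, containing
# the Glue Data Quality score as {"score": ...}.
quality_report = PropertyFile(
    name="QualityReport", output_name="quality", path="quality.json"
)

# Branch: proceed to training only if the score clears the threshold;
# otherwise route the dataset to remediation.
quality_gate = ConditionStep(
    name="QualityGate",
    conditions=[
        ConditionGreaterThanOrEqualTo(
            left=JsonGet(
                step_name="QualityCheck",
                property_file=quality_report,
                json_path="score",
            ),
            right=0.9,  # illustrative threshold
        )
    ],
    if_steps=[training_step],       # defined elsewhere in the pipeline
    else_steps=[remediation_step],  # defined elsewhere in the pipeline
)
```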
What Glue Data Quality Is Not
Glue Data Quality runs on batch data at rest in the Data Catalog or Glue ETL job context. It does not monitor streaming data in real time (use Kinesis Data Analytics or custom Lambda for that), and it does not validate that a deployed model's inference inputs match the training distribution (that is SageMaker Model Monitor's job). The exam tests this distinction: a question asking "how do I detect production input drift" is answered by Model Monitor, not Glue Data Quality.
AWS Glue Data Quality enforces data quality rules at pipeline scale before downstream ML stages consume the data, and is the canonical "fail the pipeline if data is bad" gate for ML training. Rules are defined in DQDL and cover completeness, uniqueness, freshness, pattern matching, statistical bounds, and referential integrity. Glue Data Quality runs on batch data in the Data Catalog or Glue ETL — it does not monitor streaming inputs and does not detect production inference drift. For drift detection at inference time, use SageMaker Model Monitor with a baseline; for ML pipeline gating, use Glue Data Quality.
SageMaker Data Wrangler Quality Report — Interactive Quality Surface
SageMaker Data Wrangler is the interactive feature-engineering tool, and its quality and insights report is the interactive complement to Glue Data Quality.
What The Report Shows
The Data Wrangler quality report displays:
- Missing value summary — percentage missing per column.
- Outlier detection — values outside statistical bounds per numeric column.
- Class imbalance — distribution of label classes for classification datasets.
- Feature-target correlation — how strongly each feature correlates with the label.
- Feature-feature correlation — multicollinearity risk across input features.
- Anomaly samples — specific rows flagged as outliers, viewable inline.
- Quick model — a fast XGBoost baseline showing predictive lift before any feature engineering.
When To Use The Wrangler Report Vs Glue Data Quality
Use Data Wrangler quality report during exploratory data analysis when a data scientist is interactively shaping features in Studio — it is human-in-the-loop, sampling-based, and informational. Use Glue Data Quality in production pipelines when an automated check must pass or fail without human review — it runs at full scale, deterministically, and integrates with pipeline orchestration. The two are complementary: the data scientist explores in Wrangler, distills the findings into DQDL rules, and the rules guard the production pipeline going forward.
Schema Validation And Data Contracts
Schema validation is the structural layer beneath statistical quality.
What A Schema Validates
A schema declares the expected types and structure of a dataset: column names, types (string, integer, float, boolean), nullability, and acceptable enum values. Schema validation rejects records whose structure does not match — a record with a string in a column declared integer is rejected before any statistical check runs.
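A minimal sketch of the idea in pandas, with a hand-declared schema dict standing in for a real schema source such as the Glue Data Catalog; the column names and dtypes are illustrative.

```python
import pandas as pd

# Hypothetical expected schema: column name -> pandas dtype.
EXPECTED_SCHEMA = {
    "customer_id": "int64",
    "transaction_amount": "float64",
    "country_code": "object",
}

def validate_schema(df: pd.DataFrame) -> None:
    """Reject a dataframe whose structure does not match the declared schema."""
    missing = set(EXPECTED_SCHEMA) - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    for column, expected_dtype in EXPECTED_SCHEMA.items():
        actual = str(df[column].dtype)
        if actual != expected_dtype:
            raise TypeError(f"{column}: expected {expected_dtype}, got {actual}")
```

Structural rejection runs first because it is cheap; only rows that pass the schema gate are worth the cost of statistical quality checks.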
Glue Data Catalog As Schema Source
The Glue Data Catalog is the central schema registry for AWS analytics and ML. Glue Crawlers infer schemas from data and register them; you can also define schemas explicitly. Downstream consumers — Athena, EMR, SageMaker Processing — read the schema from the catalog. Schema drift (a new column appears, a type changes) is detectable by comparing crawler runs over time.
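For example, a drift check might compare the two most recent catalog table versions via boto3; the database and table names below are hypothetical.

```python
import boto3

glue = boto3.client("glue")

# Fetch recent schema versions of a catalog table (names hypothetical),
# then sort newest-first by version ID.
versions = glue.get_table_versions(
    DatabaseName="ml_training_db", TableName="transactions", MaxResults=10
)["TableVersions"]
versions.sort(key=lambda v: int(v["VersionId"]), reverse=True)

def columns(version):
    cols = version["Table"]["StorageDescriptor"]["Columns"]
    return {c["Name"]: c["Type"] for c in cols}

if len(versions) >= 2:
    current, previous = columns(versions[0]), columns(versions[1])
    added = set(current) - set(previous)
    removed = set(previous) - set(current)
    retyped = {c for c in current.keys() & previous.keys() if current[c] != previous[c]}
    if added or removed or retyped:
        print(f"Schema drift: added={added}, removed={removed}, retyped={retyped}")
```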
Data Contracts Across Teams
A data contract is an agreement between a data producer and a data consumer fixing the schema, the freshness SLA, and the quality rules. In practice on AWS, the contract is encoded as: a Glue Data Catalog table definition (schema), a Glue Data Quality ruleset (quality), and an EventBridge schedule plus freshness rule (freshness). Producers commit to maintaining all three; consumers commit to handling failures gracefully. Data contracts are an MLOps maturity practice and an emerging exam theme.
Schema Evolution Mechanics
When a schema must change — a new column added, a column renamed, a type widened — schema evolution rules govern how. Backward-compatible changes (adding nullable columns) are safe; breaking changes (removing columns, narrowing types) require coordinated producer and consumer updates. Glue Schema Registry (separate from Data Catalog) manages versioned schemas with compatibility rules and is the right tool for evolving schemas across many producers and consumers.
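A sketch of versioned evolution with the Glue Schema Registry APIs, assuming a hypothetical Avro schema and registry name. Adding the nullable channel field with a default passes the BACKWARD compatibility check; removing a field or narrowing a type would be rejected.

```python
import json
import boto3

glue = boto3.client("glue")

base_fields = [
    {"name": "customer_id", "type": "long"},
    {"name": "transaction_amount", "type": "double"},
]

# Create a versioned schema with BACKWARD compatibility enforcement.
glue.create_schema(
    RegistryId={"RegistryName": "ml-registry"},
    SchemaName="transactions",
    DataFormat="AVRO",
    Compatibility="BACKWARD",
    SchemaDefinition=json.dumps(
        {"type": "record", "name": "Transaction", "fields": base_fields}
    ),
)

# Evolve the schema: a new field with a default is backward-compatible,
# so the registry accepts this version.
glue.register_schema_version(
    SchemaId={"RegistryName": "ml-registry", "SchemaName": "transactions"},
    SchemaDefinition=json.dumps(
        {
            "type": "record",
            "name": "Transaction",
            "fields": base_fields
            + [{"name": "channel", "type": ["null", "string"], "default": None}],
        }
    ),
)
```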
SageMaker Model Monitor Baseline — Bridging Training And Production
SageMaker Model Monitor baselines are the link between training-time data quality and production drift detection.
What A Baseline Is
A baseline is a saved snapshot of statistical properties of the training dataset — column means, standard deviations, distinct value counts, missing-value rates, and so on. The baseline is computed once via a SuggestBaseline Processing job and saved to S3 as a constraints JSON and a statistics JSON.
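In the SageMaker Python SDK the baseline job looks roughly like the following; the role ARN and S3 URIs are placeholders.

```python
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

# Compute baseline statistics and constraints from the training dataset.
# Role ARN and S3 URIs are hypothetical placeholders.
monitor = DefaultModelMonitor(
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/train/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/monitor/baseline",
)
# Writes statistics.json and constraints.json under output_s3_uri.
```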
How Model Monitor Uses The Baseline
Once deployed, a Model Monitor schedule runs periodically against captured inference inputs. It compares the inference input statistics to the baseline statistics; if any feature's mean, distribution, or distinct count drifts beyond the baseline's constraint thresholds, the monitor emits a CloudWatch metric and a violation report. This is production data quality monitoring — the runtime complement to training-time Glue Data Quality.
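Continuing the sketch above, attaching a monitoring schedule to a deployed endpoint might look like this; the endpoint and schedule names are hypothetical.

```python
from sagemaker.model_monitor import CronExpressionGenerator

# Hourly comparison of captured inference inputs against the saved baseline;
# reuses the `monitor` object from the baseline sketch above.
monitor.create_monitoring_schedule(
    monitor_schedule_name="fraud-endpoint-data-quality",
    endpoint_input="fraud-detector-endpoint",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
```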
Baseline Must Be Regenerated After Model Update
A common bug: the team trains a new model on a slightly different dataset, deploys the new model, but forgets to regenerate the baseline. The Monitor now compares inference inputs against an old baseline and emits constant false positive drift alerts. The discipline: every model retrain regenerates the baseline as a Pipeline step.
Every SageMaker model retrain must regenerate the Model Monitor baseline, otherwise production drift detection compares inference inputs to stale training statistics and produces false alerts. The baseline is a snapshot tied to a specific training dataset; when the training dataset changes (new features, refreshed records, schema evolution), the baseline becomes invalid for the new model. The canonical pattern in SageMaker Pipelines is: training step → baseline regeneration step (a SuggestBaseline Processing job on the same training dataset) → model registration step → conditional deployment. Skipping the baseline step is one of the most common MLOps bugs and a popular MLA-C01 exam trap.
SageMaker Ground Truth — Human Labeling Workflows
SageMaker Ground Truth is the AWS-native data labeling service.
Labeling Job Types
Ground Truth supports built-in labeling job templates for common tasks:
- Image classification — assign one or more labels to an image.
- Bounding box — draw rectangles around objects in an image.
- Semantic segmentation — pixel-level labeling of image regions.
- Text classification — assign labels to text spans or whole documents.
- Named entity recognition — label entities (person, location, organization) in text.
- Video classification and tracking — frame-level or object-level video labeling.
- 3D point cloud annotation — labeling LiDAR data for autonomous driving.
- Custom workflows — bring your own labeling UI via Liquid templates.
Workforce Options
Three workforce types: public workforce via Amazon Mechanical Turk (cheapest, lowest privacy, suitable for non-sensitive data), private workforce of your own employees or contractors (most control, suitable for sensitive data), and vendor workforce via the AWS Marketplace (specialized vendors with domain expertise like medical imaging or legal documents).
RLHF Workflows
Reinforcement learning from human feedback (RLHF) is supported via Ground Truth's preference-ranking labeling tasks — workers compare two model outputs and select the preferred one, producing the preference data used to fine-tune large language models. RLHF is an emerging MLA-C01 topic given Bedrock and JumpStart's prominence.
Active Learning And Automated Labeling
Ground Truth's automated labeling feature uses a two-model active learning loop. As workers label a starter set of records, an internal model trains on those labels and predicts labels for unlabeled records. Records where the model is confident (above a confidence threshold) are auto-labeled; records below the threshold are sent to humans. As more human labels accumulate, the model improves and the auto-label rate increases. The result: labeling cost drops by 30-70 percent on large datasets while quality remains comparable to fully-human labeling. Active learning has minimum dataset size requirements (typically thousands of records) and works only with select task types.
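A trimmed create_labeling_job sketch with the active-learning loop enabled via LabelingJobAlgorithmsConfig. Every ARN, bucket path, and name below is a placeholder; the pre-human-task and annotation-consolidation Lambdas are region-specific built-ins, and a real image classification job also needs a label category configuration file matching your classes.

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_labeling_job(
    LabelingJobName="product-images-v1",
    LabelAttributeName="product-class",
    InputConfig={
        "DataSource": {
            "S3DataSource": {"ManifestS3Uri": "s3://my-bucket/manifests/input.manifest"}
        }
    },
    OutputConfig={"S3OutputPath": "s3://my-bucket/labels/"},
    LabelCategoryConfigS3Uri="s3://my-bucket/labels/class-config.json",
    RoleArn="arn:aws:iam::123456789012:role/GroundTruthRole",
    # Enables the two-model active-learning loop; omit for human-only labeling.
    LabelingJobAlgorithmsConfig={
        "LabelingJobAlgorithmSpecificationArn": (
            "arn:aws:sagemaker:us-east-1:027400017018:"
            "labeling-job-algorithm-specification/image-classification"
        )
    },
    HumanTaskConfig={
        # Private workteam ARN (placeholder) -- ambiguous records go here.
        "WorkteamArn": "arn:aws:sagemaker:us-east-1:123456789012:workteam/private-crowd/my-team",
        "UiConfig": {"UiTemplateS3Uri": "s3://my-bucket/templates/image-class.liquid"},
        "PreHumanTaskLambdaArn": "arn:aws:lambda:us-east-1:432418664414:function:PRE-ImageMultiClass",
        "TaskTitle": "Classify product images",
        "TaskDescription": "Select the single best category for each image",
        "NumberOfHumanWorkersPerDataObject": 3,
        "TaskTimeLimitInSeconds": 300,
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn": "arn:aws:lambda:us-east-1:432418664414:function:ACS-ImageMultiClass"
        },
    },
)
```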
Ground Truth Plus
Ground Truth Plus is the fully managed labeling option — AWS provides the workforce, the labeling instructions, the quality assurance, and the project management. You hand off the data and the requirements; AWS returns labeled data within the agreed SLA. Use it when you do not have a labeling team and labeling is not your core competency. The price is higher per label but the total cost (no internal team to hire and manage) is often lower.
Use Ground Truth automated labeling for any dataset over a few thousand records — the cost savings of 30-70 percent versus fully-human labeling are substantial and the quality difference is small. Active learning sends only ambiguous records to humans, while confident records are auto-labeled by the internal model. The dataset size threshold is task-dependent — image classification needs about 1,250 minimum, bounding box about 5,000, semantic segmentation about 2,000. Below those sizes, human-only labeling is required. On the MLA-C01 exam, scenarios about "labeling 100,000 images cost-effectively" point at automated labeling; scenarios about "labeling 500 highly sensitive medical records" point at private workforce, fully human, no automation.
PII And PHI Compliance In Training Data
PII (personally identifiable information) and PHI (protected health information) handling is the data integrity surface that K21 Academy flags as under-studied.
Amazon Macie For S3 Sensitive Data Discovery
Amazon Macie scans S3 buckets and discovers sensitive data — credit card numbers, SSNs, names, addresses, health record patterns, secrets like API keys. Findings publish to Security Hub and EventBridge for automated remediation. Macie covers S3 only; it does not scan streaming data, RDS, or other sources.
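Kicking off a one-time Macie discovery job over a training bucket is a single API call; the account ID and bucket name here are placeholders.

```python
import boto3

macie = boto3.client("macie2")

# One-time sensitive data discovery job over a training-data bucket.
macie.create_classification_job(
    jobType="ONE_TIME",
    name="scan-training-bucket",
    s3JobDefinition={
        "bucketDefinitions": [
            {"accountId": "123456789012", "buckets": ["ml-training-data"]}
        ]
    },
)
```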
Comprehend PII Detection And Redaction
Amazon Comprehend has dedicated PII detection APIs that identify and optionally redact PII inline from text. The flow: a Lambda or Glue job reads training text, calls Comprehend ContainsPiiEntities or DetectPiiEntities, and replaces identified spans with redacted tokens. Use Comprehend for text data being prepared for NLP training; use Macie for bulk S3 discovery.
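A minimal redaction sketch around DetectPiiEntities; the bracketed replacement-token format is an illustrative choice, not a Comprehend feature.

```python
import boto3

comprehend = boto3.client("comprehend")

def redact_pii(text: str) -> str:
    """Replace detected PII spans with bracketed entity-type tokens."""
    entities = comprehend.detect_pii_entities(Text=text, LanguageCode="en")["Entities"]
    # Replace from the end of the string so earlier offsets stay valid.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        text = (
            text[: entity["BeginOffset"]]
            + f"[{entity['Type']}]"
            + text[entity["EndOffset"]:]
        )
    return text

print(redact_pii("Contact Jane Doe at jane@example.com"))
# e.g. "Contact [NAME] at [EMAIL]"
```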
Data Residency And Region Pinning
Regulatory regimes (GDPR, HIPAA, regional data sovereignty laws) often require data to remain in specific AWS regions. The enforcement layer is:
- AWS Config rules — `s3-bucket-cross-region-replication-enabled` plus custom rules verifying the replication target is an approved region.
- S3 bucket policies — `aws:RequestedRegion` conditions denying access from disallowed regions.
- SCPs at the OU level — denying `s3:PutBucketReplication` to non-approved regions and denying SageMaker actions outside permitted regions (a sketch follows this list).
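For illustration, a region-pinning SCP might look like the following; the approved-region list and the action scope are assumptions to adapt, and the policy JSON is built as a Python dict only for readability.

```python
import json

# Illustrative SCP pinning SageMaker and replication activity to one region.
region_pinning_scp = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyOutsideApprovedRegions",
            "Effect": "Deny",
            "Action": ["sagemaker:*", "s3:PutBucketReplication"],
            "Resource": "*",
            "Condition": {
                "StringNotEquals": {"aws:RequestedRegion": ["eu-west-1"]}
            },
        }
    ],
}

print(json.dumps(region_pinning_scp, indent=2))
```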
Encryption Is Not Integrity
A common candidate confusion: assuming KMS encryption protects data integrity. Encryption protects confidentiality — unauthorized readers cannot read the data. Integrity is a separate guarantee — the data has not been altered. KMS does not detect tampering after decryption; integrity requires hashing, signing, or WORM storage. Macie does not address integrity either; it addresses sensitivity discovery. Integrity for ML training data is enforced through schema validation, Glue Data Quality rules, and S3 Object Lock for immutable archives.
Macie scans S3 buckets at rest and is not a real-time scanner — data flowing through Kinesis or Firehose is not detected by Macie until it lands in S3. Engineers often assume Macie watches "the data" and miss that streaming pipelines need separate PII handling, typically Lambda transforms calling Comprehend on each batch before delivery to S3. The exam plants this as a scenario: "we need to redact PII from streaming data before it reaches our ML training bucket" — the answer is a Firehose Lambda transform calling Comprehend, not Macie. Macie's role is post-hoc discovery — it finds what already landed and should not have. Use Comprehend in-stream for prevention; use Macie at rest for detection.
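A minimal sketch of that Firehose transform handler, assuming newline-free UTF-8 text records; production code would batch calls, respect Comprehend's document size limits, and mark unprocessable records as ProcessingFailed.

```python
import base64
import boto3

comprehend = boto3.client("comprehend")

def handler(event, context):
    """Kinesis Data Firehose transformation Lambda: redact PII before S3 delivery."""
    output = []
    for record in event["records"]:
        text = base64.b64decode(record["data"]).decode("utf-8")
        entities = comprehend.detect_pii_entities(Text=text, LanguageCode="en")["Entities"]
        # Replace spans from the end so earlier offsets stay valid.
        for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
            text = (
                text[: entity["BeginOffset"]]
                + f"[{entity['Type']}]"
                + text[entity["EndOffset"]:]
            )
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(text.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```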
Common Exam Traps For Data Quality, Integrity, And Labeling
The MLA-C01 exam plants a recurring set of traps in this surface. Memorize them.
Trap 1 — Data Integrity Equals Encryption
Wrong. Encryption protects confidentiality. Integrity protects against tampering and is enforced by hashing, schema validation, and immutable storage. KMS-encrypting an S3 bucket does not prevent an authorized writer from corrupting the data.
Trap 2 — Macie Scans Streams In Real Time
Wrong. Macie is an S3-at-rest scanner. Real-time PII detection on streams uses Comprehend in a Firehose Lambda transform.
Trap 3 — Glue Data Quality Detects Production Drift
Wrong. Glue Data Quality runs on batch data in the pipeline before training. Production inference drift detection is SageMaker Model Monitor's job.
Trap 4 — Ground Truth Public Workforce For Sensitive Data
Wrong. Mechanical Turk's public workforce is appropriate only for non-sensitive data. Sensitive data (PII, PHI, internal documents) requires the private workforce or Ground Truth Plus with appropriate vendor agreements.
Trap 5 — Automated Labeling Works On Any Dataset Size
Wrong. Automated labeling has minimum dataset sizes per task type (typically 1,000 to 5,000 records). Below the minimum, fully human labeling is required.
Trap 6 — Schema Validation And Data Quality Are The Same Thing
Wrong. Schema validates structure (types, columns, nullability). Data quality validates statistical properties (completeness, distribution, ranges). Both are needed — schema first as a fast gate, then quality.
Trap 7 — Model Monitor Baseline Is Created Once
Wrong. Every model retrain regenerates the baseline. A stale baseline against a new model produces constant false-positive drift alerts.
Trap 8 — Ground Truth Plus Is Just A Pricier Ground Truth
Wrong. Ground Truth Plus is a managed service where AWS provides the workforce and project management. Ground Truth requires you to provide and manage the workforce yourself.
Trap 9 — Comprehend PII Detection Works On Any Language
Partial. Comprehend PII detection is supported in select languages — primarily English plus a smaller set including Spanish, French, German, Italian, Portuguese, and others. Verify language support before designing a workflow around it.
Trap 10 — AWS Config Enforces Data Residency Automatically
Wrong. Config detects non-compliance but does not block actions. Blocking requires SCPs, IAM conditions, or S3 bucket policies. Config is the audit and detection layer; SCPs are the enforcement layer.
The five data quality dimensions are completeness, accuracy, consistency, timeliness, and uniqueness — every Glue Data Quality rule and every Data Wrangler quality report check maps to one of these. Memorize the list because the exam phrases questions around individual dimensions: "the dataset has duplicate customer rows" is uniqueness; "the dataset is from last quarter" is timeliness; "30 percent of the values are null" is completeness; "values are outside reasonable range" is accuracy; "user IDs use different formats across rows" is consistency. Each maps to a specific Glue Data Quality rule type. Knowing the dimension lets you pick the rule type quickly under exam time pressure.
Key Numbers And Must-Memorize Data Quality Facts
Five Data Quality Dimensions
- Completeness
- Accuracy
- Consistency
- Timeliness
- Uniqueness
AWS Glue Data Quality
- DQDL (Data Quality Definition Language) syntax
- Around twenty built-in rule types
- Score plus per-rule report
- Fail-stop or continue-and-emit-event modes
- Integrates with SageMaker Pipelines via ConditionStep
SageMaker Data Wrangler Quality Report
- Interactive, sampling-based
- Surfaces missing values, outliers, class imbalance, correlations
- Quick model baseline included
- Complementary to Glue Data Quality, not a replacement
Schema Validation
- Glue Data Catalog as central schema registry
- Glue Schema Registry for versioned schema evolution
- Backward-compatible changes safe; breaking changes need coordination
SageMaker Model Monitor Baseline
- Computed via SuggestBaseline Processing job
- Constraints JSON plus statistics JSON in S3
- Must be regenerated every model retrain
- Compares production inference inputs to training statistics
SageMaker Ground Truth
- Built-in templates for image, text, video, 3D point cloud
- Three workforces: public (Mechanical Turk), private, vendor
- Active learning automated labeling cuts cost 30-70 percent
- Ground Truth Plus is fully managed by AWS
PII and PHI Compliance
- Macie scans S3 at rest, not streams
- Comprehend PII in-line redaction for text
- AWS Config detects, SCPs enforce
- Encryption is not integrity
FAQ — ML Data Quality, Integrity, and Labeling Top Questions
Q1 — When should I use Glue Data Quality versus SageMaker Data Wrangler quality reports?
Use Glue Data Quality for automated, deterministic, production pipeline gates — it runs as a step in your ML pipeline, evaluates a defined ruleset, and fails the pipeline if quality drops below threshold. Use Data Wrangler quality reports for interactive, exploratory data analysis when a data scientist is shaping features in Studio — it samples the data, surfaces visualizations, and supports human-in-the-loop investigation. The two are complementary: the data scientist explores in Wrangler, identifies the quality risks, distills them into DQDL rules, and the rules guard the production Glue Data Quality job going forward. The exam plants scenarios where you must pick one — interactive exploration is Wrangler, automated pipeline gating is Glue Data Quality.
Q2 — How do I detect data drift before it impacts production model accuracy?
Two complementary mechanisms. At the pipeline level, run Glue Data Quality with statistical rules (mean, standard deviation, distribution) against the latest dataset and compare the score to historical scores — sudden drops indicate distribution shift. At the production inference level, run SageMaker Model Monitor with a baseline; the monitor evaluates captured inference inputs against the training-time baseline and emits CloudWatch metrics on drift. The pipeline catches drift before retraining; the monitor catches drift in real time during inference. Both are required for full coverage. The pipeline is preventive (do not train on drifted data); the monitor is detective (alert if production inputs drift from training assumptions).
Q3 — When should I use Ground Truth versus Ground Truth Plus?
Use Ground Truth if you have an internal labeling team or established vendor relationships, want fine-grained control over labeling instructions and quality assurance, and have ongoing labeling needs that justify in-house tooling. Use Ground Truth Plus if you do not have a labeling team, need labeling done quickly without standing up infrastructure, or have a one-off labeling project. Ground Truth Plus is fully managed — AWS provides the workforce, the labeling tool configuration, the quality assurance, and project management. Per-label cost is higher but the total project cost (no team to hire, no tooling to build) is often lower. The exam plants scenarios where the operational maturity of the customer determines the answer — large ML team with existing labelers picks Ground Truth, lean team without labeling staff picks Plus.
Q4 — How do I handle PII in training data without breaking model performance?
Three complementary strategies. Redaction — replace PII spans with placeholder tokens before training; the model learns from the structure but never sees the actual values. Use Comprehend DetectPiiEntities to identify spans and replace inline. Tokenization — replace PII with deterministic tokens that preserve referential integrity (the same email always maps to the same token) but reveal nothing about the underlying identity. Use AWS-managed tokenization or build with KMS-encrypted lookup tables. Synthetic substitution — replace PII with realistic but fake values from a synthetic data generator. The choice depends on whether the model needs to learn from the patterns in PII (use tokenization or synthetic) or whether PII is incidental and can be removed (use redaction). Macie scans the training bucket post-hoc as the final compliance gate to confirm no PII slipped through.
Q5 — How do I enforce data residency for ML training data?
Three layers of control. SCPs at the AWS Organizations level deny s3:PutObject to buckets outside approved regions and deny sagemaker:CreateTrainingJob outside approved regions — preventing actions before they happen. AWS Config managed rules like s3-bucket-cross-region-replication-enabled and custom rules detect non-compliance, surface in Security Hub, and trigger SSM auto-remediation. S3 bucket policies with aws:RequestedRegion conditions deny access from non-approved regions. The combination prevents data from leaving the approved region, detects accidental policy gaps, and enforces at the request level. The exam tests this with stems about GDPR or financial-regulation residency requirements where the answer must include all three layers.
Q6 — When does Ground Truth automated labeling save money versus pure human labeling?
Automated labeling pays off when the dataset is large enough to amortize the cost of training the internal active-learning model — typically 1,000 to 5,000 records minimum depending on task type. Below the minimum, human-only labeling is required because the active-learning model cannot reach useful confidence. Above the minimum, savings of 30-70 percent are typical, with quality comparable to human labeling because the auto-labeled records are only those where the internal model is highly confident — humans label the ambiguous tail. Choose automated labeling for large datasets where labeling cost is a meaningful budget line; choose pure-human for small high-stakes datasets where automation does not meet the size threshold.
Q7 — How do I link training-time data quality to production drift detection?
The mechanical link is the SageMaker Model Monitor baseline generated from the training dataset. The flow: at training time, run a SuggestBaseline Processing job on the training data, producing a constraints JSON and a statistics JSON in S3. The model is deployed, data capture is enabled on the endpoint, and a Model Monitor schedule runs periodically. The monitor compares captured inference inputs to the baseline and emits CloudWatch metrics on drift. Critically, every model retrain regenerates the baseline against the new training dataset — a stale baseline produces constant false drift alerts. In SageMaker Pipelines, this is encoded as: training step → baseline regeneration step → model registration step → conditional deployment. The pipeline ensures the baseline is always fresh and tied to the deployed model.
Further Reading — Official AWS Documentation For Data Quality And Labeling
The authoritative AWS sources are: AWS Glue Data Quality documentation (DQDL syntax, rule types, integration patterns), SageMaker Data Wrangler documentation (quality and insights report, data quality features), SageMaker Ground Truth documentation (job templates, workforces, automated labeling), Ground Truth Plus documentation (managed service workflow), Amazon Macie documentation (sensitive data discovery, S3 scanning, findings), Amazon Comprehend PII documentation (DetectPiiEntities, redaction patterns), SageMaker Model Monitor baseline documentation (SuggestBaseline, constraints, drift detection), AWS Config managed rules documentation (compliance enforcement), and the AWS Organizations SCP documentation (preventive enforcement).
The AWS Machine Learning Blog and the AWS Big Data Blog have multiple deep-dive posts on Glue Data Quality patterns, Ground Truth automated labeling case studies, and end-to-end MLOps pipelines with quality gates. The AWS Well-Architected ML Lens covers data quality and labeling in the data preparation phase. The Amazon SageMaker Examples GitHub repository contains end-to-end notebooks showing Pipelines integration with Data Quality, Model Monitor baseline generation, and Ground Truth labeling jobs with Python SDK code.