SageMaker Model Monitor is the production-grade safety net for deployed ML models on AWS — without it, a model that quietly degraded last Tuesday because an upstream data source changed a column's meaning will keep returning seemingly valid predictions while business KPIs erode silently for weeks. On the MLA-C01 exam, SageMaker Model Monitor anchors Task 4.1 (Monitor model inference) of Domain 4 (ML Solution Monitoring, Maintenance, and Security, 24 percent weight). The associate exam pushes hard on two precise boundaries: the boundary between the four Model Monitor types, and the boundary between SageMaker Model Monitor and SageMaker Clarify — community pain-point reports consistently call out this conflation as the most common mistake on the live exam.
This guide is written from the ML engineer's perspective. It covers the four monitor types at conceptual and configuration depth, baseline creation mechanics, monitoring schedules and data capture wiring, the ground-truth merge problem unique to model quality, training-serving skew as a special drift case, CloudWatch and EventBridge integration for automated retraining triggers, and the exact exam traps that cause candidates to fail Domain 4 questions. It then walks through architecture decisions and an FAQ that mirrors live exam stem patterns.
What Is SageMaker Model Monitor and Where It Fits
SageMaker Model Monitor is a managed service that schedules processing jobs against captured inference traffic from a SageMaker real-time endpoint or batch transform job and emits violation reports plus CloudWatch metrics whenever the live distribution diverges from a captured training-time baseline. It is the post-deployment half of the SageMaker monitoring story; the pre-deployment half is SageMaker Clarify and SageMaker Debugger. Model Monitor runs inside Amazon SageMaker Processing as a recurring job — typically hourly — and writes results to S3 plus CloudWatch.
Why Model Monitor Exists — The Drift Problem
Production models degrade. Data drift happens when input feature distributions move away from training (a sensor recalibration, a marketing campaign that brings new user demographics, a currency rebasing). Concept drift happens when the relationship between features and labels itself changes (consumer tastes evolve, fraud patterns mutate). Label drift happens when the proportion of classes shifts. Without monitoring, all three failure modes are invisible until the business notices conversion or accuracy collapsed weeks ago. Model Monitor gives ML engineers a contractual SLA on the data and prediction quality of the live system.
The Four Model Monitor Types — Memorize the Boundary
Model Monitor offers exactly four monitor types, and the MLA-C01 exam tests the boundary precisely. The four are: Data Quality Monitor (input feature distribution drift), Model Quality Monitor (prediction-vs-ground-truth metric drift), Bias Drift Monitor (fairness metric drift across protected groups), and Feature Attribution Drift Monitor (SHAP value distribution shift). The first two are covered in this guide; the latter two run on SageMaker Clarify under the hood and are covered in the sibling topic. The exam will plant a stem describing a symptom and ask which monitor type detects it — pick the wrong type and the question is lost.
Plain-Language Explanation: SageMaker Model Monitor
SageMaker Model Monitor is a managed service that bundles four distinct detection capabilities under one console panel. Three concrete analogies make the structure stick.
Analogy 1 — The Hospital Continuous Patient Monitoring System
Picture a hospital intensive-care unit where every patient is hooked up to a vital-signs monitor — heart rate, blood pressure, oxygen saturation, body temperature — and the nursing station receives an alarm the moment any reading drifts outside the patient's personal baseline. SageMaker Model Monitor is the same architecture for production ML models, where the ML model is the patient and the inference traffic is the vital-signs telemetry.
The vital-signs sensors clipped to the patient's finger and chest are the data capture configuration on a SageMaker real-time endpoint — every request and every response is recorded to S3 like a continuous EKG strip. The patient's admission baseline captured during intake (resting heart rate, normal blood pressure range) is the SuggestBaseline processing job that runs against training data and emits two JSON files — statistics.json describing per-feature distributions and constraints.json describing the rules the live data must satisfy. The nursing station alarm that beeps when oxygen saturation drops below threshold is the CloudWatch alarm wired to a Model Monitor metric like feature_baseline_drift_TotalCharges. The shift-change check the head nurse performs every hour is the monitoring schedule — a cron-driven SageMaker Processing job comparing the last hour of captured data against the baseline. The morning rounds where a senior physician compares actual recovery progress against the expected recovery curve is the Model Quality Monitor — comparing predictions against ground-truth labels that arrive later. The fall-risk reassessment when the patient's vitals shift outside their personal norm is the violation report in S3, listing exactly which features breached which constraints. When the ICU monitor goes quiet, the trouble-shooting walk-through is: is the sensor clipped on (data capture enabled), is the cable plugged in (S3 destination accessible), is the chart recording (monitoring schedule active), and was the baseline taken from this patient (baseline regenerated after model update). That walk-through is exactly the MLA-C01 Task 4.1 troubleshooting tree.
Analogy 2 — The Bakery Quality Control Station
Imagine a commercial bakery where the head baker tasted the first ten batches of sourdough on opening day and wrote down the exact texture, crumb size, salt level, and crust thickness in a quality logbook. From day eleven forward, every hour a quality-control inspector pulls a random loaf from the production line and compares it to the logbook. If today's loaves are saltier than the baseline says they should be, the inspector files a violation report, the head baker investigates the salt supplier, and either the recipe is corrected or a new baseline is captured.
The opening-day taste test is SuggestBaseline against training data — the snapshot of what "good" looks like. The logbook entries are statistics.json (the numerical signature of the baseline) and constraints.json (the rules the live distribution must satisfy). The hourly random pull is the monitoring schedule running a SageMaker Processing job. The flagged loaf delivered to the head baker is the violation record written to the S3 output prefix and emitted as a CloudWatch metric. The separate inspector who measures whether customers actually liked the bread by reading review cards days later is the Model Quality Monitor — predictions arrive immediately but ground truth (did the customer enjoy it) only arrives after delay, so the merge job stitches predictions to labels using a shared identifier. The rule that says "if salt level breaches twice in a row, retrain the recipe" is EventBridge subscribing to the CloudWatch alarm and triggering a SageMaker Pipeline to retrain. When a baker swaps the flour supplier without telling anyone, the symptom is training-serving skew — the model was trained on one supplier's flour and is now being served loaves made from another, and the inspector's distribution drift detects it within hours.
Analogy 3 — The Power Grid SCADA Monitoring
Picture a regional electricity grid where every substation reports voltage, frequency, and load every five seconds to a central SCADA system. The SCADA system was trained on six months of normal operating envelope and now flags every sensor reading that drifts outside the trained envelope. Model Monitor is conceptually identical — the SageMaker endpoint is the substation, every inference call is a sensor reading, and the monitoring schedule is the SCADA evaluator.
The trained operating envelope captured during commissioning is the baseline with statistics.json and constraints.json. The per-substation sensor stream is the data capture configuration writing every input feature and every prediction to S3 as JSONL records partitioned by hour. The SCADA evaluator running every five minutes is the Model Monitor scheduled processing job. The alarm panel in the control room is CloudWatch metrics with names like feature_constraint_check_violations and feature_baseline_drift_voltage. The automatic load-shedding trigger that fires when voltage drops below 95 percent of nominal is EventBridge → SageMaker Pipeline → automated retraining. The post-incident review that compares predicted demand against actual demand is Model Quality Monitor — predictions made an hour ago can finally be scored when actual consumption data lands. The rule that demands resampled baselines after a winter-to-summer load profile shift is the operational practice of regenerating the baseline whenever the underlying training distribution legitimately changes, not patching constraints to accept the drift. Each piece of the analogy lands on a piece of MLA-C01 syllabus.
Data Capture — The Prerequisite for All Four Monitor Types
Before any Model Monitor can run, the SageMaker endpoint must be capturing inference traffic to S3. Data capture is opt-in and configured at endpoint creation or via UpdateEndpoint.
Configuring Data Capture On a Real-Time Endpoint
DataCaptureConfig is set on the endpoint configuration, not the endpoint itself, and includes: EnableCapture (true), InitialSamplingPercentage (1 to 100; for production, often 100 unless cost-prohibitive), DestinationS3Uri (where captured records land), CaptureOptions (Input, Output, or both), and optional KmsKeyId for SSE-KMS encryption. The endpoint writes JSONL records grouped into hourly prefixes — s3://<bucket>/<prefix>/<endpoint-name>/<variant-name>/<yyyy>/<mm>/<dd>/<hh>/.
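A minimal sketch with the SageMaker Python SDK, assuming a model object already exists and using placeholder bucket, prefix, and endpoint names:

```python
from sagemaker.model_monitor import DataCaptureConfig

# Placeholder capture destination; records land under hourly prefixes below it
capture_uri = "s3://my-ml-bucket/datacapture"

data_capture_config = DataCaptureConfig(
    enable_capture=True,              # EnableCapture
    sampling_percentage=100,          # InitialSamplingPercentage
    destination_s3_uri=capture_uri,   # DestinationS3Uri
    # capture_options defaults to capturing both request and response payloads
)

# `model` is an existing sagemaker.model.Model; the capture config becomes part
# of the endpoint configuration that deploy() creates
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="churn-endpoint",
    data_capture_config=data_capture_config,
)
```

The same fields map to the DataCaptureConfig block of CreateEndpointConfig when capture is configured through boto3, CloudFormation, or the CDK instead of the Python SDK.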
Sampling Trade-Offs
Sampling at 10 percent reduces capture cost by 90 percent but introduces statistical noise into drift detection, especially for low-volume endpoints where 10 percent of 1,000 daily requests is only 100 captures per day — too few for reliable distribution comparison. For low-traffic endpoints, sample at 100 percent. For very high-traffic endpoints, sample at a rate that delivers at least 10,000 captures per monitoring window.
Capture Format And Limits
Each captured record contains the full input payload, the full output payload, and metadata (timestamp, model name, custom attributes). Maximum captured payload size: 6 MB per inference. Endpoints serving images or other payloads near that limit risk truncated captures that downstream Model Monitor parsing cannot handle — for image and large-payload endpoints, use a custom processing script that captures features (not raw payloads) into a parallel stream.
Why Async and Batch Endpoints Differ
Asynchronous inference endpoints write request and response objects directly to S3 by design, so data capture is automatic — no separate DataCaptureConfig needed. Batch transform jobs similarly write inputs and outputs to S3 by default. Only real-time endpoints require explicit DataCaptureConfig. The exam loves to plant a stem with an async endpoint and ask "how do you enable capture for monitoring" — the answer is the captures already exist; you simply point the monitoring schedule at the existing input/output S3 paths.
Data Quality Monitor — Detecting Input Feature Drift
Data Quality Monitor compares the statistical distribution of incoming inference features against a captured training baseline and flags violations.
What Data Quality Detects
Per-feature distribution drift on numerical and categorical columns: mean shift, standard deviation shift, distinct-value count shift, missing-value rate shift, and categorical proportion shift. The underlying engine is Deequ (an Apache Spark library for declarative data quality). Each constraint in constraints.json declares a rule like "feature Age must have completeness ≥ 0.99" or "feature Region must take values from {US, EU, APAC}". The monitoring job evaluates every constraint against the current capture window and lists the violations.
Baseline Creation — The SuggestBaseline Processing Job
The baseline is built by running DefaultModelMonitor.suggest_baseline() (Python SDK) or the equivalent CreateProcessingJob API call, pointed at training data in S3. The processing job emits two artifacts: statistics.json (per-feature distribution descriptors — mean, std, min, max, distinct count, missing count) and constraints.json (data type, completeness threshold, distinct count tolerance, categorical value list). Both files land in the configured S3 output prefix. The training data passed to baseline must be in the same format as inference inputs — same columns, same order, same encoding.
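A hedged sketch of the baseline job with the SageMaker Python SDK; the role ARN, bucket, and file names are placeholders:

```python
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

# Placeholder role and S3 locations
data_quality_monitor = DefaultModelMonitor(
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

# Runs a processing job over the training data and writes statistics.json
# and constraints.json to the output prefix
data_quality_monitor.suggest_baseline(
    baseline_dataset="s3://my-ml-bucket/train/train_with_header.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-ml-bucket/baselines/data-quality",
    wait=True,
)
```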
Format Requirements For Baseline Data
Baseline data must be CSV or JSONL. CSV without a header row requires dataset_format=DatasetFormat.csv(header=False) in the SDK call; mismatched header configuration produces silent column-shifting that ruins drift detection. The features in the baseline file must be in the same order as features delivered to the endpoint at inference time — in the default flow, Model Monitor compares by column index, not column name.
Customizing Constraints
The auto-generated constraints.json is a starting point, not a final policy. Engineers should review and edit it: relax thresholds for features expected to drift seasonally, tighten thresholds for features that should be perfectly stable, and add custom rules for compound conditions. The edited constraints.json is uploaded to S3 and referenced by the monitoring schedule. Keeping constraints.json under version control alongside the model code is a mature MLOps practice and an exam-friendly answer pattern.
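One way to script the review-and-edit step, assuming the auto-generated file's usual layout (a top-level features array with per-feature completeness thresholds); the bucket, key, and feature name are hypothetical and the exact fields should be checked against your own baseline output:

```python
import json

import boto3

s3 = boto3.client("s3")
bucket = "my-ml-bucket"                                  # placeholder
key = "baselines/data-quality/constraints.json"          # placeholder

# Download the auto-generated constraints, adjust one threshold, re-upload
constraints = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())

for feature in constraints.get("features", []):
    if feature["name"] == "MonthlyCharges":              # hypothetical feature
        # Relax completeness for a feature that is legitimately sparse at times
        feature["completeness"] = 0.95

s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(constraints, indent=2))
```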
A baseline must be regenerated whenever the model is retrained on new training data — the baseline is bound to a specific training distribution, not to the model itself. Reusing the previous baseline against a model trained on a refreshed dataset produces false-positive drift alarms (the new training data legitimately differs from the old) or, worse, false-negative misses (the new training data already accounts for what should have been flagged as drift). The MLOps pipeline must include a baseline-regeneration step that runs after every successful training and before deployment. On the MLA-C01 exam, scenarios that mention "we retrained the model and now Model Monitor alerts constantly" are testing this exact pattern.
Monitoring Schedule — When And How Often Drift Is Evaluated
A monitoring schedule binds a baseline, a captured-data S3 location, a processing script, and a cron expression into a recurring evaluation.
Cron Expressions For Monitoring Schedules
Schedules use standard cron syntax. The most common is hourly — cron(0 * ? * * *) evaluates at the top of every hour against the previous hour's capture. Daily evaluation — cron(0 8 ? * * *) — runs once at 08:00 UTC against the prior 24 hours. A stopped schedule is resumed with start_monitoring_schedule(); for ad-hoc evaluation after a suspected anomaly, the monitoring job can also be run manually against the capture window in question.
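A sketch of wiring the schedule with the SageMaker Python SDK, reusing the data_quality_monitor object from the baseline step; the schedule, endpoint, and S3 names are placeholders:

```python
from sagemaker.model_monitor import CronExpressionGenerator

data_quality_monitor.create_monitoring_schedule(
    monitor_schedule_name="churn-data-quality-hourly",
    endpoint_input="churn-endpoint",
    output_s3_uri="s3://my-ml-bucket/monitor-reports/data-quality",
    statistics=data_quality_monitor.baseline_statistics(),
    constraints=data_quality_monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),  # cron(0 * ? * * *)
    enable_cloudwatch_metrics=True,
)
```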
Schedule Status — Pending, Scheduled, Stopped, Failed
A newly-created schedule is Pending until the first job starts, then Scheduled while running. Failed runs do not stop the schedule — the next cron tick runs again. A misconfigured schedule (bad S3 path, bad role) that produces persistent failures must be debugged via the processing job logs in CloudWatch. The schedule itself can be paused with stop_monitoring_schedule() and resumed with start_monitoring_schedule().
Output Artifacts Per Run
Each monitoring run writes three artifacts to S3 — statistics.json (current-window statistics), constraint_violations.json (the violation list), and constraint_checks.csv (per-constraint pass/fail) — plus the processing job's logs to CloudWatch. The violation file is the primary diagnostic — read it first when an alarm fires.
A SageMaker Model Monitor monitoring schedule binds four things — a baseline reference, a captured-data S3 path, a processing container, and a cron expression — and produces a violation report on every tick. The processing container is the built-in Model Monitor container for data quality and model quality jobs, or the Clarify container for bias and feature attribution jobs. The cron expression governs cadence, but an ad-hoc run can be launched outside the schedule for incident response. The schedule lives independently of the endpoint — deleting an endpoint does not delete its monitoring schedules, and orphaned schedules continue to run, fail (because the capture S3 prefix is empty), and emit failure metrics. Always include schedule cleanup in the endpoint teardown automation, as sketched below.
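A minimal teardown sketch with boto3, using a placeholder endpoint name; a schedule with an execution still in flight may need to be stopped before deletion:

```python
import boto3

sm = boto3.client("sagemaker")
endpoint_name = "churn-endpoint"   # placeholder

# Delete every schedule attached to the endpoint before tearing it down,
# so no orphaned schedule keeps failing against an empty capture prefix
schedules = sm.list_monitoring_schedules(EndpointName=endpoint_name)
for summary in schedules["MonitoringScheduleSummaries"]:
    sm.delete_monitoring_schedule(
        MonitoringScheduleName=summary["MonitoringScheduleName"]
    )

sm.delete_endpoint(EndpointName=endpoint_name)
```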
Model Quality Monitor — Comparing Predictions To Ground Truth
Model Quality Monitor goes one layer deeper than Data Quality Monitor. Where Data Quality watches inputs, Model Quality watches whether the predictions are actually right.
Why Model Quality Is Harder To Run
Predictions are emitted instantly, but ground-truth labels arrive late — sometimes hours, sometimes weeks after inference. A fraud-detection model predicts "fraud" or "not fraud" at the moment a transaction is authorized, but the actual ground truth (was it really fraud) only emerges when the customer disputes a charge thirty days later. A churn model predicts "will churn next month" but cannot be scored for accuracy for thirty more days. Model Quality Monitor handles this delay by ingesting ground-truth labels from a separate S3 path and merging them with captured predictions using a shared inference ID.
The Ground-Truth Merge Mechanism
The merge job is configured via the GroundTruthS3Input field on the monitoring schedule. Predictions in the capture path and ground-truth labels in the ground-truth path share an inferenceId field. A merge processing job (built into Model Quality Monitor) joins them on inferenceId and produces the merged dataset that the metrics evaluator scores. If inferenceId is missing from either side, the merge fails for those records — which is why every prediction must include a unique inference ID set at request time via the InferenceId parameter on InvokeEndpoint.
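A sketch of the Model Quality schedule wiring with the SageMaker Python SDK, assuming labels land under a placeholder ground-truth prefix and that predictions old enough to have labels are selected via time offsets; all names, attributes, and offsets are illustrative:

```python
from sagemaker.model_monitor import (
    CronExpressionGenerator,
    EndpointInput,
    ModelQualityMonitor,
)

# Placeholder role; the monitor object is reused in the baseline snippet below
model_quality_monitor = ModelQualityMonitor(
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

model_quality_monitor.create_monitoring_schedule(
    monitor_schedule_name="churn-model-quality-daily",
    endpoint_input=EndpointInput(
        endpoint_name="churn-endpoint",
        destination="/opt/ml/processing/input_data",
        # Score predictions old enough for their ground truth to have arrived
        start_time_offset="-P8D",
        end_time_offset="-P7D",
        inference_attribute="prediction",   # key of the prediction in captured output
    ),
    ground_truth_input="s3://my-ml-bucket/ground-truth/",   # labels keyed by inferenceId
    problem_type="BinaryClassification",
    constraints="s3://my-ml-bucket/baselines/model-quality/constraints.json",
    output_s3_uri="s3://my-ml-bucket/monitor-reports/model-quality",
    schedule_cron_expression=CronExpressionGenerator.daily(),
)
```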
Metrics Computed By Model Quality Monitor
For binary classification: accuracy, precision, recall, F1, AUC, true-positive rate, false-positive rate, confusion matrix counts. For regression: MAE, MSE, RMSE, R-squared. For multi-class classification: weighted versions of the binary metrics. Each metric has a baseline value (computed from the training-evaluation set) and a current value (computed on the merged window). Drift is flagged when the current value diverges beyond the configured threshold.
Baseline Creation For Model Quality
The Model Quality baseline is created from a labeled validation dataset — a CSV or JSONL with both predictions and ground-truth labels. The processing job emits statistics.json (the metric values on the validation set) and constraints.json (the threshold ranges the live metrics must stay within). Like Data Quality, the file format must match what the endpoint emits.
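A sketch of the Model Quality baseline job, reusing the model_quality_monitor object from the previous snippet; column names and S3 paths are illustrative:

```python
from sagemaker.model_monitor.dataset_format import DatasetFormat

# Validation file (placeholder path) holds ground-truth labels, predicted labels,
# and predicted probabilities as named columns
model_quality_monitor.suggest_baseline(
    baseline_dataset="s3://my-ml-bucket/validation/predictions_with_labels.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-ml-bucket/baselines/model-quality",
    problem_type="BinaryClassification",
    ground_truth_attribute="label",        # column names are illustrative
    inference_attribute="prediction",
    probability_attribute="probability",
)
```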
Operational Reality — Ground Truth Pipelines
Building the ground-truth pipeline is the hardest part of Model Quality monitoring in production. The pattern: a downstream system (CRM, billing, support) writes labeled outcomes to S3 in JSONL with the matching inferenceId. EventBridge schedules trigger a Lambda that compresses and partitions labels, lands them in the configured ground-truth S3 path, and the next monitoring run merges them with old predictions. This pipeline is application-specific and is the one place where Model Monitor cannot abstract the work.
Always set the InferenceId parameter on every InvokeEndpoint call from the moment Model Quality monitoring is on the roadmap — even months before you actually deploy the monitor. The InferenceId becomes the join key between predictions and ground truth, and there is no retroactive way to add it to historical captures. Teams that bolt on Model Quality monitoring late often discover their existing capture stream has no inference IDs, forcing a backfill via custom Lambda or accepting that historical Model Quality data is unrecoverable. Set a UUID per request from the calling service and propagate it as InferenceId. Storage cost is trivial; rework cost if you skip it is enormous.
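A minimal example of stamping each request with a UUID via boto3; the endpoint name and payload are placeholders:

```python
import uuid

import boto3

runtime = boto3.client("sagemaker-runtime")

# Generate the ID in the calling service and propagate it to whatever system
# later records the ground-truth outcome for this request
inference_id = str(uuid.uuid4())

response = runtime.invoke_endpoint(
    EndpointName="churn-endpoint",          # placeholder
    ContentType="text/csv",
    Body="42,99.5,0,1",                     # illustrative feature row
    InferenceId=inference_id,
)
prediction = response["Body"].read().decode("utf-8")
```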
Training-Serving Skew — A Special Case Of Data Drift
Training-serving skew is the situation where the feature pipeline at training time produces subtly different feature values than the feature pipeline at serving time, even though both pipelines were intended to be identical.
Why Training-Serving Skew Happens
Common causes: training uses pandas with one timezone interpretation, serving uses Python datetime with another; training one-hot-encodes a categorical with a fitted encoder, serving uses a different encoder fit independently; training imputes missing values with the training-set median, serving imputes with the production median; training uses string casefolding, serving does not. Each minor divergence shifts feature distributions and degrades model accuracy at deployment.
Detecting Skew With Model Monitor
Data Quality Monitor detects skew naturally — the captured features at inference time will not match the baseline derived from training data, and the constraint violations point at exactly which features drifted. Skew shows up as an immediate, persistent drift on day one of deployment, distinct from gradual drift that emerges over weeks.
Preventing Skew — SageMaker Feature Store
The architectural solution to training-serving skew is SageMaker Feature Store with a single feature pipeline writing to both online and offline stores. Training reads from offline store; serving reads from online store; both stores are populated by the same transformation code. Without Feature Store, teams must rigorously share preprocessing code between training and serving paths and keep it under joint version control.
CloudWatch Integration — From Detection To Action
Detecting drift is half the value. The other half is wiring detection to alerting and to automated remediation.
CloudWatch Metrics Emitted By Model Monitor
Each monitoring run publishes per-feature and per-metric CloudWatch metrics under the aws/sagemaker/Endpoints/data-metrics and aws/sagemaker/Endpoints/model-metrics namespaces. Metric names follow patterns like feature_baseline_drift_<feature_name> (Data Quality) and model_metric_<metric_name> (Model Quality). Each metric carries dimensions for endpoint name, monitoring schedule name, and feature/metric name.
Alarms And EventBridge Patterns
CloudWatch alarms on Model Monitor metrics fire SNS notifications, trigger Lambda remediations, or kick off Step Functions workflows. EventBridge can listen for SageMaker Model Monitor schedule events (event source aws.sagemaker) or for CloudWatch alarm state changes and trigger SageMaker Pipelines for automated retraining when violations exceed thresholds.
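A hedged boto3 sketch of an alarm on a single feature's drift metric; the alarm name, threshold, SNS topic, and dimension values are placeholders and should be checked against the metrics your schedule actually emits:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="churn-endpoint-TotalCharges-drift",
    Namespace="aws/sagemaker/Endpoints/data-metrics",
    MetricName="feature_baseline_drift_TotalCharges",
    Dimensions=[
        {"Name": "Endpoint", "Value": "churn-endpoint"},
        {"Name": "MonitoringSchedule", "Value": "churn-data-quality-hourly"},
    ],
    Statistic="Maximum",
    Period=3600,                       # one hourly monitoring window
    EvaluationPeriods=1,
    Threshold=0.1,                     # drift distance tolerated before alarming
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",   # no capture this hour is not an alarm
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:model-drift-alerts"],
)
```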
Automated Retraining Pipeline
Mature Model Monitor deployments wire alarms to an EventBridge rule that triggers a SageMaker Pipeline executing: regenerate baseline → retrain model → evaluate against held-out test set → conditional approve in Model Registry → blue/green deploy. The retraining is idempotent and self-documenting because every Pipeline run records its lineage. This is the canonical MLOps loop the MLA-C01 exam tests under Domain 3 and Domain 4 jointly.
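A sketch of the EventBridge wiring with boto3, reacting to the drift alarm's state change and starting a SageMaker Pipeline; all ARNs, rule names, and pipeline parameters are hypothetical:

```python
import json

import boto3

events = boto3.client("events")

# React when the drift alarm enters ALARM state
events.put_rule(
    Name="model-monitor-drift-retrain",
    EventPattern=json.dumps({
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {
            "alarmName": ["churn-endpoint-TotalCharges-drift"],
            "state": {"value": ["ALARM"]},
        },
    }),
    State="ENABLED",
)

# Start the retraining pipeline as the rule's target
events.put_targets(
    Rule="model-monitor-drift-retrain",
    Targets=[{
        "Id": "retrain-pipeline",
        "Arn": "arn:aws:sagemaker:us-east-1:111122223333:pipeline/churn-retrain",
        "RoleArn": "arn:aws:iam::111122223333:role/EventBridgeStartPipeline",
        "SageMakerPipelineParameters": {
            "PipelineParameterList": [
                {"Name": "TriggerSource", "Value": "model-monitor-drift"}
            ]
        },
    }],
)
```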
Model Monitor publishes CloudWatch metrics under aws/sagemaker/Endpoints/data-metrics and aws/sagemaker/Endpoints/model-metrics automatically — there is no extra wiring needed to surface drift in CloudWatch. Any CloudWatch alarm or dashboard can consume these metrics directly. The dimensional model is endpoint name + variant + schedule name + feature/metric name, which means a single alarm can either watch one specific feature on one endpoint (high precision) or aggregate across an entire endpoint (high recall). Pair high-precision alarms with on-call paging for known sensitive features and aggregate alarms with weekly digest reports. The exam favors answers that use the native CloudWatch path; custom log-parsing and shipping to third-party SIEM is an over-engineered wrong answer.
SageMaker Model Monitor vs SageMaker Clarify — The Critical Distinction
The single most-failed boundary on MLA-C01 Domain 4: when does a question call for Model Monitor and when does it call for Clarify.
Clarify Has Two Lives — Training-Time And Production-Time
SageMaker Clarify is a single service that runs in two contexts. Training-time Clarify runs as a processing job during model development, computing pre-training bias metrics on the dataset and post-training bias metrics on the trained model's predictions, plus SHAP explainability. Production-time Clarify runs as the engine inside Model Monitor's Bias Drift and Feature Attribution Drift monitor types. Same Clarify container, two deployment contexts.
Model Monitor Has Four Types — Two Run On Clarify
Of the four Model Monitor types, Data Quality Monitor and Model Quality Monitor use the built-in Model Monitor container. Bias Drift Monitor and Feature Attribution Drift Monitor delegate to Clarify processing jobs under the hood. The exam rarely tests this internal detail directly but does test the user-facing distinction: which symptom calls for which monitor type.
The Decision Tree
If the symptom is "input feature distribution looks different from training data" — Data Quality Monitor. If the symptom is "predictions no longer match ground truth labels well" — Model Quality Monitor. If the symptom is "fairness metrics across protected groups have shifted in production" — Bias Drift Monitor (covered in sibling topic). If the symptom is "the model's reliance on individual features has shifted" or "feature importance moved" — Feature Attribution Drift Monitor (covered in sibling topic). If the symptom is "this dataset is unbalanced before training even started" — training-time Clarify pre-training bias, not a Model Monitor type at all.
Do not conflate SageMaker Clarify with SageMaker Model Monitor on the MLA-C01 exam. Clarify is the analysis service that computes bias and explainability metrics; it has both training-time and production-time uses. Model Monitor is the scheduling and capture service that runs four types of monitors against live endpoints. Bias Drift and Feature Attribution Drift monitors happen to use Clarify internally, but the user-facing API is Model Monitor — you create monitoring schedules, not Clarify processing jobs, when you want continuous post-deployment monitoring. The exam stem will describe a scenario like "team needs to track production bias drift over time" — the answer is Model Monitor Bias Drift type, not Clarify alone. A stem that says "team needs a one-time fairness audit on the trained model before deployment" — the answer is Clarify post-training bias analysis, not Model Monitor. Read the stem for "continuous", "production", "drift", or "scheduled" to decide.
Common MLA-C01 Exam Traps For Model Monitor
Trap 1 — Data Quality Monitor Detects Concept Drift
Wrong. Data Quality Monitor watches input features only. Concept drift (the relationship between features and labels has changed) only shows up in Model Quality Monitor, because detecting it requires comparing predictions to ground truth.
Trap 2 — Model Quality Monitor Works Without Ground Truth
Wrong. Model Quality requires ground-truth labels to merge with captured predictions. Without a ground-truth ingestion pipeline, Model Quality Monitor has nothing to score. Engineers who forget this expect immediate metric reporting and find empty violation files.
Trap 3 — Reusing The Old Baseline After Retraining
Wrong. Every retraining round must regenerate statistics.json and constraints.json from the new training data. Reusing an old baseline produces meaningless drift signals.
Trap 4 — Data Capture Works Without Configuration For Real-Time Endpoints
Wrong. Real-time endpoints require explicit DataCaptureConfig on the endpoint configuration. Asynchronous and batch endpoints write inputs and outputs to S3 by design, but real-time endpoints discard payloads unless capture is enabled.
Trap 5 — Sampling At 10 Percent Is Always Fine
Wrong. Low-traffic endpoints do not produce enough samples for reliable drift detection at low sampling rates. Production endpoints with under 10,000 daily requests should capture at 100 percent.
Trap 6 — Model Monitor Replaces Ground-Truth Labeling Pipelines
Wrong. Model Monitor schedules the merge job; the application team owns the ground-truth ingestion pipeline. Building the labeled-outcomes feed is application-specific work.
Trap 7 — Bias Drift And Feature Attribution Use Model Monitor Container
Wrong. They use Clarify processing jobs under the hood. The user creates a monitoring schedule of type bias or feature attribution, but the worker is Clarify.
Trap 8 — Constraint Violations Always Mean The Model Is Broken
Partially wrong. Violations indicate the live distribution diverged from baseline; whether this requires retraining depends on the cause. Seasonal patterns, legitimate population shifts, or onboarding new customer segments all produce drift without indicating model degradation. Retraining decisions need a human-in-the-loop review of the violation report, not an unconditional retrain on every alarm.
Trap 9 — Monitoring Schedules Auto-Pause When The Endpoint Is Deleted
Wrong. Schedules persist independently. Failing schedules from deleted endpoints continue to run and emit failure metrics. Always include delete_monitoring_schedule() in endpoint teardown.
Trap 10 — Drift Detection Replaces Application-Level Quality Gates
Wrong. Model Monitor is statistical drift detection on a window of captured data; it does not replace synthetic monitoring, integration tests, or business KPI dashboards. A drop in conversion rate detected by the product team is a different signal layer.
Key Numbers And Must-Memorize Model Monitor Facts
Monitor Types
- Four types: Data Quality, Model Quality, Bias Drift, Feature Attribution Drift
- First two run on Model Monitor container; last two run on Clarify under the hood
- Each type requires its own baseline; they are not interchangeable
Data Capture
- Must be enabled on real-time endpoints via DataCaptureConfig
- Sampling 1 to 100 percent; 100 for low-traffic, 10-50 for very high traffic
- Maximum payload 6 MB; output written to hourly S3 prefix
- Async and batch endpoints capture by design without separate configuration
Baseline
- Created via SuggestBaseline processing job pointed at training data
- Two artifacts: statistics.json and constraints.json
- Must regenerate on every retraining round
- File format must match endpoint inference format
Schedule
- Cron-driven; hourly is the most common cadence
- Ad-hoc evaluation possible by running the monitoring job outside the schedule
- Independent of endpoint lifecycle — clean up explicitly
Outputs
- statistics.json (current window stats)
- constraint_violations.json (failed constraints)
- constraint_checks.csv (per-constraint result)
- CloudWatch metrics under aws/sagemaker/Endpoints/data-metrics and model-metrics
FAQ — Model Monitor Top Exam Questions
Q1 — A team enables Data Quality Monitor and sees zero violations even though the live endpoint is clearly producing odd predictions. What is the most likely cause?
The captured data is missing or the baseline was generated from a sample that does not represent training reality. First, verify that data capture is enabled on the endpoint and that S3 objects are landing in the configured capture prefix. Second, verify that constraints.json has meaningful thresholds — auto-generated constraints sometimes accept very wide tolerances on noisy training features, masking real drift. Third, check whether the symptom is concept drift (feature distributions look fine but predictions are wrong against ground truth) — Data Quality cannot detect concept drift; switch to Model Quality Monitor. The exam stem favoring this pattern usually has a hint like "predictions are wrong but inputs look normal" — that points at Model Quality, not Data Quality.
Q2 — How do I prevent training-serving skew from being misdiagnosed as gradual data drift?
Detect it by timing. Training-serving skew presents as immediate, persistent drift starting on day one of deployment; gradual drift presents as slowly widening drift over weeks. The architectural prevention is SageMaker Feature Store with one feature pipeline writing both online and offline stores so training and serving paths share identical transformations. Without Feature Store, version-control the preprocessing code in a single repo and import it into both the training script and the inference container; do not allow two implementations to diverge.
Q3 — A team needs Model Quality monitoring but ground-truth labels arrive 30 days after inference. How is this configured?
Configure the monitoring schedule to merge predictions with the ground-truth S3 path; the merge job only scores predictions whose corresponding labels exist. For 30-day-late labels, the schedule running today merges predictions from 30 days ago with labels delivered today — set the start and end time offsets on the schedule's endpoint input so the job evaluates the matching historical capture window. Build a Lambda that, on each ground-truth arrival, lands JSONL with the matching inferenceId into the configured ground-truth path. Set the monitoring schedule cadence to daily rather than hourly so each run has enough data to compute stable metrics. Plan for a 30-day delay between deployment and the first meaningful Model Quality metric; this is unavoidable when ground truth is delayed.
Q4 — Model Monitor is alerting on drift after a successful retraining. The new model is verified accurate but every monitoring run files violations. What is wrong?
The baseline was not regenerated from the new training data. Reusing the old baseline against a model trained on a refreshed dataset produces persistent false-positive violations because the new training distribution differs from the old. The fix: rerun SuggestBaseline on the new training data, upload the new statistics.json and constraints.json, and update the monitoring schedule to point at the new baseline location. Embed this regeneration step into the SageMaker Pipeline so it is automatic on every retraining round.
Q5 — What is the difference between Data Quality Monitor finding "feature_baseline_drift" violations and the model actually being broken?
Drift detection reports statistical divergence between live features and baseline; broken-model symptoms include accuracy collapse against ground truth. The two are correlated but not identical. Drift can be benign — population shifts, seasonal patterns, onboarding new geographies all produce feature-level drift without harming model accuracy. The actionable signal is the combination of Data Quality drift plus Model Quality metric degradation — together they indicate the model is failing because of a real distributional change. Wire alarms on the joint condition rather than on Data Quality alone to suppress false-positive retraining triggers.
Q6 — How do I monitor an asynchronous inference endpoint with Model Monitor?
Async endpoints already write input and output to S3 by design. No separate DataCaptureConfig is needed. Create the monitoring schedule pointing the input data path at the async input S3 location and the output data path at the async output S3 location. The processing script in the schedule reads from these prefixes the same way it would read from real-time capture prefixes. The exam loves this distinction — the wrong answer is "enable data capture on the async endpoint"; the right answer is "configure the monitoring schedule against the async endpoint's existing input/output S3 paths."
Q7 — A monitoring schedule keeps failing with "S3 prefix is empty". What is the diagnosis?
Three common causes. First, the monitoring schedule's start time is before any captures landed — bring the schedule into a time window after capture began. Second, the capture prefix differs from the prefix the schedule expects — the endpoint writes to <prefix>/<endpoint-name>/<variant-name>/<yyyy>/<mm>/<dd>/<hh>/ and the schedule must look at the same prefix root. Third, IAM role permission issues — the monitoring processing job's role needs s3:GetObject on the capture prefix and s3:PutObject on the output prefix; missing permissions surface as empty-prefix or AccessDenied errors. Read the processing job's CloudWatch log for the exact diagnostic message before guessing.
Further Reading — Official AWS Documentation
The authoritative AWS sources are the SageMaker Model Monitor Developer Guide (overview, data quality, model quality, bias drift, feature attribution drift sections), the SageMaker Clarify Developer Guide (training-time and production-time usage), the SageMaker Pipelines Developer Guide (orchestrating retraining triggered by Model Monitor alarms), and the CloudWatch User Guide (creating alarms on SageMaker metric namespaces). The MLA-C01 official exam guide and the AWS Skill Builder MLA-C01 prep plan reinforce the four-types boundary that the exam tests heavily.