
ML Deployment Strategies — A/B Testing, Shadow, and Blue/Green


Master ML deployment strategies for MLA-C01 Domain 3 Tasks 3.1/3.2: SageMaker production variants for A/B testing, blue/green deployment, canary and linear traffic shifting, shadow deployment for risk-free model validation, deployment guardrails with auto-rollback, statistical significance, rollback mechanics, and the deployment strategy decision matrix the exam tests heavily.


ML deployment strategy is where MLOps maturity is tested most directly on MLA-C01. Replacing a model in production is not the same as replacing a Lambda function — model behaviour can drift in production for reasons that never showed up in offline evaluation (training-serving skew, distribution shift, edge-case inputs the test set never contained), and the wrong cutover strategy can degrade business metrics for hours before anyone notices. The exam plants stems with phrases like "compare a new model against current production with real traffic" (A/B test), "validate the new model with zero customer impact" (shadow), "shift traffic gradually with automatic rollback" (canary or linear), and "fast cutover with instant rollback" (blue/green) — each phrase maps to exactly one strategy. Confusing these is the most-cited MLA-C01 mistake on the Tutorials Dojo and ExamCert.App pain-point lists.

This guide walks through five deployment strategies (production variants for A/B testing, blue/green, canary, linear, shadow), then the SageMaker deployment guardrails feature that automates rollback, then the integration with Model Registry and SageMaker Pipelines for end-to-end CI/CD. It is written from the ML Engineer perspective — what gets configured, which CloudWatch metrics matter for rollback triggers, and the precise distinctions the exam tests.

Why Deployment Strategy Matters For ML Specifically

Software deployments fail in well-understood ways: a 5xx error rate spike, a memory leak, a database migration gone wrong. ML deployments fail in additional, subtler ways: a new model that scores fine offline but underperforms online due to training-serving skew, a model that handles 99 percent of inputs better but degrades dramatically on a critical 1 percent edge case, or a model whose latency profile is acceptable on average but has a heavy P99 tail that violates SLA. The deployment strategy is the engineering control that catches these failures before all traffic is exposed.

The Cost Of A Bad Cutover

A direct in-place model replacement (delete old endpoint config, attach new model) means 100 percent of traffic instantly hits the new model. If the new model is bad, every user sees the bad behaviour from second one until the team detects, decides to roll back, and executes the rollback — typically minutes to hours. For high-stakes models (fraud detection, recommendations, pricing), even minutes of bad behaviour translates to revenue loss, customer churn, or regulatory exposure. Deployment strategy is the buffer that limits blast radius.

The MLOps Maturity Curve

Maturity tier 1: replace endpoint config, fingers crossed. Tier 2: blue/green with manual rollback. Tier 3: canary with automated rollback on CloudWatch alarm. Tier 4: shadow deployment validating new model on real traffic with zero customer exposure before any cutover. The MLA-C01 exam expects awareness of all four tiers and ability to pick the right one for a given risk level.

Plain-Language Explanation: ML Deployment Strategies

The five strategies feel similar in the abstract until you map them to physical deployment patterns most people have encountered.

Analogy 1 — The Restaurant Menu Update

Imagine a restaurant chain rolling out a new recipe for its signature dish across 100 locations. Direct cutover is changing the recipe everywhere on Monday morning at 9 AM — every customer in every store gets the new dish from minute one. If the recipe is wrong, every customer is unhappy at once. Blue/green is running both kitchens in parallel — the existing kitchen (blue) keeps serving the old recipe while the new kitchen (green) is staffed and fully prepared, then at the chosen moment the chain switches all orders from blue to green at once. If green is bad, switch back to blue (still warm and running). Canary is rolling out the new recipe in two stores first, watching for complaints and food-safety issues for a week, then expanding to ten stores, then to all 100 — small initial blast radius, gradual expansion. Linear is the same idea but in equal scheduled steps: 10 percent on Monday, 20 percent on Tuesday, all the way to 100 percent on Friday, regardless of feedback. Shadow deployment is the most cautious — every customer order goes to the existing recipe (blue) and is served, but a parallel kitchen also cooks the new recipe (green) and discards it for taste-testing comparison; no customer ever eats the new recipe until shadow validation completes. A/B testing is letting two recipes run simultaneously in the same stores, with half of customers randomly served the new and half the old, and then comparing satisfaction metrics statistically — keep the recipe that wins. Each pattern matches a different risk tolerance and a different question being asked. MLA-C01 stems use the phrasing of the question to tell you which strategy is the answer.

Analogy 2 — The Hospital Drug Trial Pipeline

Picture a pharmaceutical company moving a new drug from development to widespread use. Direct cutover is releasing the drug to every pharmacy on day one — fastest, riskiest, never done in regulated industries. Blue/green is keeping the old drug fully stocked and parallel-distributing the new drug to all pharmacies on a switch date — if the new drug shows side effects, recall to the old (still stocked). Canary is the FDA Phase 1 / Phase 2 / Phase 3 trial structure — give to 100 patients, then 1,000, then 10,000, watching for adverse events at each stage and stopping if any threshold is breached. Linear is scheduled rollout once approved — 25 percent of pharmacies week 1, 50 percent week 2, 75 percent week 3, full coverage by week 4, on a fixed schedule. Shadow deployment is the hospital lab running a new diagnostic assay on every patient sample alongside the established assay: clinicians act only on the established test's results, and the new test's outputs are recorded for comparison but never drive a care decision. A/B testing is the randomised controlled trial — half on the existing standard of care, half on the new drug, measure outcomes statistically, adopt the winner. The hospital industry's risk-aversion is exactly the mental model the MLA-C01 exam uses for high-stakes ML — fraud models, medical-imaging models, financial-pricing models — where shadow and canary are preferred over blue/green or direct cutover.

Analogy 3 — The Bridge Construction Replacement

Picture replacing a heavily-used bridge across a river. Direct cutover is closing the old bridge on Saturday morning and opening the new bridge that afternoon — fastest, but if the new bridge has a structural issue, the river crossing is blocked. Blue/green is building the new bridge in parallel beside the old one, keeping both open during a transition weekend, then closing the old bridge once traffic flows smoothly on the new — instant rollback by reopening the old bridge if needed. Canary is opening only one lane of the new bridge, watching for problems for a week, then opening more lanes as confidence grows. Linear is a scheduled lane-by-lane opening — one lane Monday, two by Wednesday, full open by Friday. Shadow deployment is the engineering parallel testing — the new bridge is structurally complete and bears simulated traffic loads (testing with sandbags and water trucks driven across it) but no public traffic crosses it until validation passes. A/B testing is having both bridges open simultaneously with different routing — half of southbound traffic uses the old, half the new, and engineers measure travel time, accident rate, and structural metrics on each. The civil engineering analogy maps to ML deployment because both are high-stakes irreversible decisions where failure recovery is expensive — exactly why the MLA-C01 exam tests these patterns at depth.

Production Variants — The Foundation For A/B Testing

SageMaker real-time endpoints support production variants — multiple model versions hosted on the same endpoint with traffic distributed by configurable weights.

Variant Configuration

An endpoint configuration declares one or more ProductionVariant entries, each with a VariantName, ModelName, InstanceType, InitialInstanceCount, and InitialVariantWeight. Traffic to the endpoint is distributed proportionally across variants by weight — variant A weight 9 and variant B weight 1 means 90 percent of traffic to A and 10 percent to B.
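A minimal boto3 sketch of a two-variant endpoint configuration. The model names, endpoint config name, instance counts, and instance types are illustrative placeholders, not values from this guide:

```python
import boto3

sm = boto3.client("sagemaker")

# Hypothetical names throughout -- substitute your own models and config name.
sm.create_endpoint_config(
    EndpointConfigName="churn-ab-config",
    ProductionVariants=[
        {
            "VariantName": "variant-a",        # current production model
            "ModelName": "churn-model-v1",
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 3,
            "InitialVariantWeight": 0.9,       # ~90 percent of traffic
        },
        {
            "VariantName": "variant-b",        # candidate model
            "ModelName": "churn-model-v2",
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.1,       # ~10 percent of traffic
        },
    ],
)
```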

A/B Testing Workflow

  1. Create endpoint configuration with current production model as variant A (weight 1.0)
  2. Deploy endpoint
  3. Update endpoint configuration to add variant B (new model) at weight 0.1, keeping variant A at weight 0.9
  4. Use UpdateEndpointWeightsAndCapacities to shift traffic gradually as confidence grows (see the sketch after this list)
  5. Compare per-variant metrics (latency, business metric like CTR or conversion) for statistical significance
  6. Promote variant B to 1.0 weight if it wins, or remove if it loses
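A hedged sketch of step 4: the gradual shift is a single API call against the live endpoint (same placeholder names as above):

```python
import boto3

sm = boto3.client("sagemaker")

# Move from a 90/10 to a 50/50 split once early metrics look healthy.
sm.update_endpoint_weights_and_capacities(
    EndpointName="churn-endpoint",
    DesiredWeightsAndCapacities=[
        {"VariantName": "variant-a", "DesiredWeight": 0.5},
        {"VariantName": "variant-b", "DesiredWeight": 0.5},
    ],
)
```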

Statistical Significance

Production variant traffic is distributed randomly per request in proportion to the variant weights; it is not sticky per user, so the same caller can land on different variants across requests. If the experiment design requires controlled assignment, pin a request to a specific variant with the TargetVariant parameter on InvokeEndpoint. To run a statistically valid A/B test, ensure traffic volume per variant is large enough that the comparison metric exceeds the noise floor, and run for enough time to capture daily and weekly seasonality. Typical A/B tests run for one to four weeks at a meaningful traffic split.
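To make the significance check concrete, here is a standard two-proportion z-test on per-variant conversion counts. The counts are invented for illustration, and the test itself is generic statistics rather than a SageMaker API:

```python
from math import sqrt

from scipy.stats import norm

# Invented per-variant totals collected over the test window.
conv_a, n_a = 4_310, 90_000   # variant A conversions / invocations
conv_b, n_b = 530, 10_000     # variant B conversions / invocations

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)               # pooled rate under H0
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

z = (p_b - p_a) / se
p_value = 2 * norm.sf(abs(z))                          # two-sided test

print(f"lift={p_b - p_a:+.4f}  z={z:.2f}  p={p_value:.4f}")
```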

Per-Variant Auto-Scaling

Each variant has its own auto-scaling configuration. Variant A at 90 percent and variant B at 10 percent will scale independently based on per-variant InvocationsPerInstance — the small variant scales down to its MinCapacity while the large variant scales out as load grows.
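A sketch of per-variant target tracking with Application Auto Scaling; the resource ID and the 700-invocations target are assumptions for illustration:

```python
import boto3

aas = boto3.client("application-autoscaling")

# Each variant registers its own scalable target; repeat for variant-a.
resource_id = "endpoint/churn-endpoint/variant/variant-b"

aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

aas.put_scaling_policy(
    PolicyName="variant-b-invocations-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 700.0,  # invocations per instance per minute (assumed headroom)
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)
```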

A SageMaker production variant is a hosted model version on a real-time endpoint with an assigned traffic weight, used for A/B testing model performance with live traffic. Multiple variants share an endpoint, traffic distributes proportionally by weight, and variants can be scaled independently. A/B testing with production variants is the standard pattern when the goal is to compare a new model against current production using real customer traffic and a measurable business metric (conversion rate, click-through rate, prediction accuracy against ground truth). It is NOT shadow deployment (which serves no traffic to the new model) and NOT canary (which gradually shifts all traffic away from the old). A/B testing keeps both models serving in parallel with statistical comparison as the goal.

Blue/Green Deployment — Parallel Stack With Instant Cutover

Blue/green is the classic "two parallel stacks, switch traffic" pattern.

Mechanics On SageMaker

The "blue" environment is the current endpoint configuration serving production. The "green" environment is a new endpoint configuration with the new model fully provisioned. SageMaker UpdateEndpoint with the new endpoint config triggers the blue/green swap — green is brought online, traffic shifts 100 percent to green, then blue instances are torn down. This is SageMaker's default UpdateEndpoint behaviour; a BlueGreenUpdatePolicy in the DeploymentConfig controls how the traffic shift happens.

Traffic Shift Modes

Within blue/green, the traffic shift can be:

  • All-At-Once — instant 100 percent cutover from blue to green (the default traffic shift when no DeploymentConfig is specified)
  • Canary — small percentage to green first, validate, then full shift
  • Linear — equal-step shift over time

Despite the names, AWS treats canary and linear as traffic shift modes within a blue/green deployment — not as separate deployment strategies. This is a heavily-tested vocabulary distinction.

Rollback Mechanics

If green fails the post-deployment validation period (configurable, with CloudWatch alarms as the trigger), SageMaker shifts traffic back to blue and tears down green. Blue stays warm during the validation period precisely so rollback is instant.

When To Use Blue/Green

  • Need fast cutover with instant rollback capability
  • Tolerable to run double infrastructure cost during deployment window
  • Confidence in pre-production testing high enough that gradual shift is unnecessary
  • Time-pressured deployment with confidence backed by automated guardrails

Canary Deployment — Small-Percentage Initial Traffic

Canary in the SageMaker blue/green context means the green environment first receives a small fraction of traffic (the "canary") for a baking period before full cutover.

Configuration

BlueGreenUpdatePolicy.TrafficRoutingConfiguration.Type = "CANARY" with a CanarySize specifying either a percentage or instance count. The canary phase runs for the configured duration; if no CloudWatch alarms fire and the wait period elapses, traffic shifts to 100 percent green.
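A hedged sketch of the full call. Endpoint, config, and alarm names are placeholders, and the wait intervals are illustrative rather than recommended values:

```python
import boto3

sm = boto3.client("sagemaker")

sm.update_endpoint(
    EndpointName="churn-endpoint",
    EndpointConfigName="churn-model-v2-config",  # the "green" configuration
    DeploymentConfig={
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "CANARY",
                "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": 10},
                "WaitIntervalInSeconds": 900,    # 15-minute canary bake
            },
            "TerminationWaitInSeconds": 600,     # keep blue warm after full shift
        },
        "AutoRollbackConfiguration": {
            "Alarms": [
                {"AlarmName": "churn-endpoint-5xx"},
                {"AlarmName": "churn-endpoint-p99-latency"},
            ]
        },
    },
)
```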

CloudWatch Alarms For Auto-Rollback

The canary's purpose is to expose problems on a small percentage of users before all users are affected. Pair the canary phase with CloudWatch alarms on:

  • Endpoint error rate (Invocation5XXErrors)
  • Endpoint latency (ModelLatency P90 or P99)
  • Custom business metrics (model-specific accuracy, fairness)

If any alarm fires during the canary period, SageMaker auto-rolls back. This is the SageMaker Deployment Guardrails feature.

When To Use Canary

  • Risk-averse deployments where 1-5 minute exposure on 5-10 percent of users is acceptable but not 100 percent
  • Have meaningful real-time CloudWatch alarms that can trigger rollback
  • Production traffic large enough that 5-10 percent in canary period gives signal

Linear Deployment — Scheduled Equal-Step Shift

Linear deployment shifts traffic in equal steps on a schedule.

Configuration

BlueGreenUpdatePolicy.TrafficRoutingConfiguration.Type = "LINEAR" with LinearStepSize (percentage) and WaitIntervalInSeconds. SageMaker shifts a fixed percentage of traffic from blue to green every wait interval until 100 percent.
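Relative to the canary sketch earlier, only the traffic-routing block changes; a fragment under the same placeholder names:

```python
# Swap this into the DeploymentConfig of the canary example above.
linear_routing = {
    "Type": "LINEAR",
    "LinearStepSize": {"Type": "CAPACITY_PERCENT", "Value": 20},  # five equal steps
    "WaitIntervalInSeconds": 600,  # ten minutes between steps, ~50 minutes total
}
```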

When To Use Linear

  • Want predictable scheduled rollout pace
  • Each step is paired with CloudWatch alarms for incremental validation
  • Time-bounded deployment window (e.g. complete by end of business day)

Linear vs Canary

Canary is "small slice first, then jump to 100 percent." Linear is "equal-step gradual shift." Both can integrate with auto-rollback. Canary is faster but less granular; linear is slower but each step is independently validated. The MLA-C01 exam tests this distinction by stem language: "small percentage first" → canary; "gradual equal steps" → linear.

Blue/green, canary, and linear are NOT independent deployment strategies in SageMaker — canary and linear are TRAFFIC SHIFT MODES within a blue/green deployment. This is the most frequently misunderstood vocabulary on MLA-C01. AWS implements all three as a single BlueGreenUpdatePolicy with three values for TrafficRoutingConfiguration.Type: ALL_AT_ONCE, CANARY, and LINEAR. Tutorials Dojo's pain-point list flags this confusion as the most common MLA-C01 mistake. When the exam writes "blue/green deployment," confirm whether the question is asking about the parallel-stack pattern in general (which includes the all-at-once subcase) or the specific traffic-shift mode. Reading the specific phrasing — "all-at-once cutover" vs "small percentage first" vs "equal-step shift" — disambiguates the right answer.

Shadow Deployment — Risk-Free Model Validation

Shadow deployment is the most conservative pattern: the new model receives a duplicate of live traffic but its predictions are never returned to the customer.

Architecture

A shadow variant is configured on the endpoint alongside the production variant. Every incoming request is sent to both variants. The production variant's response is returned to the client; the shadow variant's response is captured (logged to S3 for analysis) and discarded. Customers see no impact; engineers compare shadow predictions against production predictions and against eventual ground truth.
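A sketch of this topology via CreateEndpointConfig with a shadow variant. Names are placeholders, and the mirroring semantics of the shadow variant's weight should be checked against current AWS docs:

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="churn-shadow-config",
    ProductionVariants=[
        {
            "VariantName": "production",
            "ModelName": "churn-model-v1",
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 3,
            "InitialVariantWeight": 1.0,
        }
    ],
    # Shadow variant receives mirrored copies of requests; its responses
    # are captured for analysis and never returned to the caller.
    ShadowProductionVariants=[
        {
            "VariantName": "shadow",
            "ModelName": "churn-model-v2",
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 1.0,  # assumed: mirror all production traffic
        }
    ],
)
```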

SageMaker Shadow Tests

SageMaker shadow tests, created in the console or via the CreateInferenceExperiment API with a ShadowMode experiment type, configure a structured shadow test with a defined duration, the percentage of production traffic to mirror, and a post-test comparison report. The report compares the shadow model's predictions against production for divergence and, where ground truth is available, for accuracy.

When To Use Shadow Deployment

  • Lowest risk tolerance scenarios, where any customer exposure to a bad model is unacceptable
  • Validating new model on real production traffic before any cutover
  • Testing model latency under real load before committing
  • Validating that retrained model agrees with current production on most predictions before promoting
  • Regulated industries where every production-impacting change requires evidence of safety

Shadow vs A/B Test

A/B test serves both models to real customers with statistical comparison of business outcomes. Shadow serves only the production model to customers and validates the new model offline. A/B is for "which model performs better in production"; shadow is for "is the new model safe to put in front of customers at all."

Shadow deployment validates a new model against live traffic with ZERO customer impact — production responses are served from the existing model, shadow predictions are captured for analysis only. This is the right answer when the exam stem says "test the new model against real traffic with no risk to customers" or "compare predictions before exposing any users." It is the only deployment strategy where customers literally never see the new model's output. Shadow is the highest-confidence validation pattern but does not measure business outcomes — for that, you need an A/B test where customers actually receive the new model's predictions. The pattern: shadow first to confirm safety, then A/B to measure business impact, then promote on win.

Deployment Guardrails — Auto-Rollback Configuration

SageMaker Deployment Guardrails wrap the blue/green deployment with automatic CloudWatch-alarm-driven rollback.

Configuration

When calling UpdateEndpoint, attach a DeploymentConfig containing:

  • BlueGreenUpdatePolicy — traffic shift mode (all-at-once, canary, linear)
  • AutoRollbackConfiguration — list of CloudWatch alarms to monitor

If any monitored alarm fires during the blue/green deployment, SageMaker:

  1. Stops shifting traffic to green
  2. Shifts all traffic back to blue
  3. Marks the deployment as failed
  4. Leaves both blue and green configurations in place for forensic analysis

Alarm Choice

Typical alarms (the first is sketched below):

  • Invocation5XXErrors exceeds threshold
  • ModelLatency P95 exceeds latency SLA
  • CPUUtilization saturation indicating undersized green instances
  • Custom business metric alarm (e.g. CTR drops below baseline by N percent)
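As a sketch, the first alarm in the list might be created like this; the alarm name matches the guardrails example above, and the threshold is an assumption to tune:

```python
import boto3

cw = boto3.client("cloudwatch")

cw.put_metric_alarm(
    AlarmName="churn-endpoint-5xx",           # referenced by AutoRollbackConfiguration
    Namespace="AWS/SageMaker",
    MetricName="Invocation5XXErrors",
    Dimensions=[
        {"Name": "EndpointName", "Value": "churn-endpoint"},
        {"Name": "VariantName", "Value": "AllTraffic"},  # aggregate across variants
    ],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=5.0,                            # >5 server errors/min for 3 minutes
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```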

Manual Rollback

Even outside a deployment guardrails window, manual rollback is just an UpdateEndpoint call pointing back at the previous endpoint configuration (which is preserved in your EndpointConfig history). Always tag endpoint configurations with model version metadata so the previous version is identifiable for instant rollback.
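Manual rollback in sketch form: one call pointing back at the prior known-good configuration (names assumed):

```python
import boto3

sm = boto3.client("sagemaker")

# Revert the endpoint to the previous configuration; no retraining involved.
sm.update_endpoint(
    EndpointName="churn-endpoint",
    EndpointConfigName="churn-model-v1-config",  # prior known-good config
)
```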

Always pair every blue/green deployment (canary or linear) with at least three CloudWatch alarms in the AutoRollbackConfiguration: error rate, P99 latency, and a model-specific quality metric. Deployment Guardrails only rolls back on alarms it is told to monitor — without configuration, the deployment proceeds even if metrics degrade. The minimum production set is Invocation5XXErrors (catches outright failures), ModelLatency P95 or P99 (catches latency regressions that slip past unit tests), and a custom CloudWatch metric emitted by the model itself or by Model Monitor (catches accuracy drift, fairness regression, or business-metric degradation). For high-stakes models, add a fourth and fifth alarm covering downstream business metrics measurable within the canary window.

Integration With Model Registry — End-To-End MLOps Flow

Deployment strategies become operationally clean when wired through SageMaker Model Registry.

Registry-Driven Deployment

The flow:

  1. SageMaker Pipeline trains a new model
  2. ConditionStep evaluates the new model's metrics against a threshold
  3. If passing, RegisterModel step adds the model to a Model Package Group with PendingManualApproval status
  4. ML Engineer reviews and approves the model package, status becomes Approved
  5. Approval event triggers an EventBridge rule
  6. EventBridge invokes a Lambda or CodePipeline that calls UpdateEndpoint with deployment guardrails configuration (rule pattern sketched after this list)
  7. Blue/green canary deployment proceeds with auto-rollback on CloudWatch alarms
  8. Deployment success records the new model version as the active production version
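A sketch of the EventBridge wiring in steps 5 and 6; the rule name, model package group, and Lambda ARN are placeholders:

```python
import json

import boto3

events = boto3.client("events")

# Fire when a package in the group transitions to Approved.
events.put_rule(
    Name="model-package-approved",
    EventPattern=json.dumps({
        "source": ["aws.sagemaker"],
        "detail-type": ["SageMaker Model Package State Change"],
        "detail": {
            "ModelPackageGroupName": ["churn-models"],
            "ModelApprovalStatus": ["Approved"],
        },
    }),
)

# Target the Lambda that performs the guarded UpdateEndpoint.
events.put_targets(
    Rule="model-package-approved",
    Targets=[{
        "Id": "deploy",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:deploy-model",
    }],
)
```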

Why Registry-Driven Wins

The model registry provides the audit trail (who approved, when, against which metrics) and the deployment trigger (approval event). Combined with deployment guardrails, every model deployment is reviewable, reversible, and recorded.

Deployment Strategy Decision Matrix

| Strategy | Risk level | Speed | Customer impact | Use case |
|---|---|---|---|---|
| Direct (replace endpoint) | Highest | Fastest | All-or-nothing | Dev/test only |
| Blue/green all-at-once | Medium | Fast | Brief swap | High-confidence prod releases |
| Canary | Low | Medium | Small % first | Risk-averse with auto-rollback |
| Linear | Low | Medium-slow | Gradual shift | Time-bounded scheduled rollout |
| Shadow | Lowest | Slowest | None | Safety-critical validation |
| A/B test | Variable | N/A (long-running) | Full to both | Compare two models on business metrics |

How To Map Stem To Strategy

  • "minimise risk during deployment" → canary or shadow
  • "compare new model to current model" → A/B testing with production variants
  • "validate new model with no customer exposure" → shadow
  • "fast cutover with instant rollback" → blue/green
  • "gradual scheduled traffic shift" → linear
  • "small percentage first then full" → canary

Common MLA-C01 Deployment Strategy Traps

Trap 1 — Blue/Green Means All-At-Once

Wrong. Blue/green is a category that includes all-at-once, canary, and linear traffic shift modes. The exam plants stems where "blue/green" alone is ambiguous and the answer depends on the traffic shift mode specified.

Trap 2 — Canary And Linear Are Independent Of Blue/Green

Wrong on SageMaker. Canary and linear are traffic-shift modes within a blue/green deployment. Outside SageMaker, in CodeDeploy and other AWS services, canary and linear are separate concepts.

Trap 3 — A/B Testing Is The Same As Canary

Wrong. A/B is steady-state comparison of two models for statistical significance, typically running for weeks. Canary is a transient stage of a blue/green deployment lasting minutes to hours. The intent differs: A/B measures business outcomes; canary catches deployment failures.

Trap 4 — Shadow Deployment Serves Predictions To Some Users

Wrong. Shadow never serves predictions to customers. The production model serves all customers; the shadow variant runs in parallel, and its predictions are captured to S3 for analysis rather than returned to the caller.

Trap 5 — Rollback Requires Re-Training

Wrong. Rollback is updating the endpoint configuration to point at the previous model package. The model is already in the registry; deployment is just a pointer change.

Trap 6 — Deployment Guardrails Auto-Rollback On Any Alarm

Wrong. Auto-rollback fires only on alarms listed in the AutoRollbackConfiguration. Other CloudWatch alarms exist but do not affect deployment.

Trap 7 — Production Variants Require Separate Endpoints

Wrong. Production variants share a single endpoint with traffic distributed by weight. Separate endpoints would be a different deployment topology.

Trap 8 — Shadow Tests Are Free

Wrong. The shadow variant runs on its own instances; mirroring all production traffic roughly doubles inference compute cost during the shadow window. Plan capacity and budget accordingly.

FAQ — ML Deployment Strategies

Q1 — When should I pick canary deployment over linear deployment for a new model release?

Canary is right when you want to expose the new model to a small slice of traffic for a relatively short validation window, watch CloudWatch alarms during that window, and then jump to 100 percent if no alarms fire. Linear is right when you want a longer, more gradual shift in equal steps, with each step independently validated against alarms before the next step proceeds. Canary is faster end-to-end (one short canary window then 100 percent); linear is slower but each step is incrementally observed. For most production releases with strong CloudWatch coverage, canary is the default — the canary window catches the bulk of deployment-time issues, and the speed benefit matters. For models with subtle degradation that only shows up at scale or over time, linear's longer shift period gives more validation surface.

Q2 — What is the difference between a SageMaker production variant A/B test and a canary deployment?

A/B testing with production variants is a long-running steady-state configuration where two models serve real customer traffic in parallel for statistical comparison of business metrics — runs for days or weeks, intent is to measure which model wins. Canary deployment is a transient stage in a blue/green deployment lasting minutes to hours, intent is to catch deployment failures before all traffic is exposed. They are not interchangeable. If the question asks "which model has higher conversion rate over a month of traffic," the answer is A/B testing. If the question asks "deploy the new model with reduced risk," the answer is canary. The vocabulary cue in MLA-C01 stems: "compare two models" or "measure business metric" → A/B; "minimise deployment risk" or "small percentage first then full rollout" → canary.

Q3 — How do I implement automated rollback for a SageMaker blue/green deployment?

Configure a DeploymentConfig on the UpdateEndpoint call containing both a BlueGreenUpdatePolicy (specifying the traffic shift mode — all-at-once, canary, or linear) and an AutoRollbackConfiguration listing the CloudWatch alarms to monitor during the deployment. Typical alarms: endpoint 5XX error rate above threshold, P95 or P99 model latency exceeding the SLA, and a custom CloudWatch metric emitted by Model Monitor or by the application measuring prediction quality. If any monitored alarm enters the ALARM state during the deployment window, SageMaker automatically stops the traffic shift, reverts traffic back to blue (the previous configuration), and marks the deployment as failed. Both blue and green configurations remain in place for post-mortem analysis. The deployment timer plus alarm monitoring period are configurable; typical canary window is 30 minutes to two hours with 1-3 minute alarm evaluation period.

Q4 — Why would I use shadow deployment instead of an A/B test for a new model?

Use shadow when customer exposure to a potentially bad model is unacceptable — common in high-stakes domains like healthcare, fraud detection, financial pricing, or regulated industries. Shadow runs the new model against duplicated live traffic but never returns its predictions to customers; production model serves all real predictions. You compare shadow predictions against production predictions and against ground truth to validate safety. A/B test, by contrast, exposes the new model to real customers — half of users get the new model's predictions and act on them. If the new model has a critical failure mode (e.g. wrongly approves fraud, wrongly denies a loan, wrongly recommends a medication), A/B test customers experience the failure. Shadow is the safety-validation step before A/B; A/B is the business-impact-measurement step before full rollout. The MLA-C01 exam tests shadow as the answer to "test new model against real traffic with no customer impact."

Q5 — What CloudWatch metrics should I monitor during a model deployment?

At minimum: Invocation5XXErrors (catches outright request failures), ModelLatency P95 or P99 (catches latency regressions), and Invocations (sanity check that traffic is reaching the new variant). For higher confidence: CPUUtilization and MemoryUtilization per variant (catches an under-provisioned green fleet), OverheadLatency (SageMaker-side overhead, catches misconfiguration), and a custom CloudWatch metric for model-specific quality emitted by Model Monitor or by the application code (catches accuracy regression, fairness regression, or business-KPI degradation). Configure each as a CloudWatch alarm with sensible thresholds (e.g. 5XX rate above 1 percent, P99 latency above 500 ms, accuracy below 0.85), then list those alarms in the deployment guardrails AutoRollbackConfiguration. Without listed alarms, deployment proceeds even when metrics degrade — the alarm-driven rollback only fires on alarms it is configured to watch.

Q6 — Can I run a shadow deployment and an A/B test on the same endpoint at the same time?

Practically yes — production variants on a real-time endpoint can include both standard variants (with non-zero traffic weight) and shadow variants (mirroring traffic from a designated production variant without serving responses). The configuration becomes complex: variant A at weight 0.7 (current production), variant B at weight 0.3 (A/B test of new model), and variant C as shadow on variant A (validating a candidate model without exposure). This is rare in practice because the cost is high (three sets of inference instances) and the analysis surface gets complex. The typical pattern is to sequence: shadow first to validate safety, A/B second to measure impact, then full rollout. Combining shadow and A/B simultaneously is reserved for organisations with deep MLOps maturity and dedicated experimentation infrastructure.

Q7 — How does the integration between Model Registry and deployment strategies work in an end-to-end MLOps pipeline?

The Model Registry holds versioned model packages with Approved / PendingManualApproval / Rejected status. A SageMaker Pipeline trains the model, runs a ConditionStep to gate quality (e.g. AUC > 0.85), and on pass calls a RegisterModel step that creates a new model package in PendingManualApproval. An ML Engineer or business stakeholder reviews the model package (training metrics, evaluation report, model card, lineage) and either approves or rejects. Approval emits an EventBridge event (SageMaker Model Package State Change) which triggers a CodePipeline or Lambda that calls UpdateEndpoint with a deployment guardrails configuration — typically a canary blue/green with auto-rollback alarms. If the deployment fails (alarms trigger), the model package status can be flipped to Rejected automatically, preventing further deployment attempts. If the deployment succeeds, the previous model package is marked Archived (a custom status convention) and the new package becomes the deployed production version. This closed-loop pattern — train → register → approve → deploy guarded → succeed or rollback — is the canonical end-to-end MLOps flow tested on MLA-C01.
