
SageMaker Endpoint Types — Real-Time, Async, Serverless, and Batch


Master SageMaker endpoint type selection for MLA-C01 Domain 3 Task 3.1: real-time vs asynchronous vs serverless inference vs Batch Transform, multi-model and multi-container endpoints, Inference Recommender benchmarking, SageMaker Neo compilation, auto-scaling policies, payload size limits, latency SLAs, cost trade-offs, and the decision tree the exam tests in nuanced ML Engineer scenarios.


SageMaker endpoint type selection is the most heavily tested decision in MLA-C01 Domain 3, and the exam will not let you off easy with "real-time vs batch" — it pushes the nuance between four production inference modes (real-time, asynchronous, serverless, Batch Transform) plus two hosting topologies (multi-model endpoints, multi-container endpoints). The ML Engineer who picks the wrong endpoint type will either over-pay by 10x for a real-time endpoint serving sporadic traffic, miss a latency SLA by routing a 30-second model behind a synchronous request, or hit the 6 MB payload ceiling with a 200 MB video file. The exam plants stems with an explicit traffic pattern, payload size, latency target, and cost constraint, and only one endpoint type fits all four. Memorize the selection criteria; guessing fails.

This guide walks the four endpoint types in operational depth, then the multi-model and multi-container topologies, then the optimisation services (Inference Recommender, Neo, auto-scaling), then the decision matrix that maps a problem statement directly to the right answer. It is written for the MLOps perspective — what an ML Engineer actually configures, which CloudWatch metrics matter, and which traps the MLA-C01 exam plants in the answer choices.

What Are SageMaker Endpoint Types and Why Selection Matters

A SageMaker inference endpoint is the runtime surface where a deployed model serves predictions. Unlike training jobs (which run, finish, and shut down), endpoints are long-lived hosting infrastructure with their own lifecycle, scaling policies, networking, and billing model. SageMaker exposes four distinct endpoint architectures because no single design is right for every workload. Latency-sensitive request-response APIs need persistent capacity. Long-running large-payload jobs need queue buffering. Sporadic traffic with no SLA needs scale-to-zero. Periodic offline scoring of millions of rows needs ephemeral compute that disappears when finished.

The Four-Way Decision Surface

Every endpoint-selection question on MLA-C01 collapses to four concrete dimensions: latency SLA (sub-second / seconds / minutes / hours), payload size (kilobytes / megabytes / gigabytes), traffic pattern (steady / bursty / sporadic / one-shot), and cost shape (always-on / pay-per-request / pay-per-job). The endpoint type that matches all four wins; any answer that violates one constraint is wrong, no matter how technically reasonable it looks.

Why ML Engineers Get This Wrong

Data-scientist mental models default to "real-time endpoint" because it is the SageMaker tutorial default. Production ML Engineers think first about traffic shape and payload size — a model invoked once a day for an overnight batch job should never be a real-time endpoint, and a vision model processing 500 MB videos should never try to fit a 6 MB synchronous payload. The MLA-C01 exam rewards the engineering instinct over the textbook default.

Plain-Language Explanation: SageMaker Endpoint Types

The four endpoint types feel abstract until you map them to everyday delivery patterns. Three concrete analogies make the differences stick.

Analogy 1 — The Restaurant Service Models

Imagine four ways a restaurant serves food, each matched to a different customer pattern. Real-time inference is the dine-in counter at a fast-food chain — a kitchen line is always staffed, every customer is served in under thirty seconds, and the chain pays cooks even during slow hours so the wait stays short when a rush hits. The cost is high because labor runs continuously, but the experience is instant. Asynchronous inference is the catering desk at a hotel — you place an order for a 200-person banquet, drop off the menu and the deposit, leave, and come back four hours later when a notification says the food is ready. The kitchen runs the job in the background, holds your order in a queue if other catering events are ahead, and the staff size scales with demand instead of staying at peak. Serverless inference is the food truck that opens only when an event is happening — no truck means no cost, but the first customer of the night waits while the grill heats up (cold start). When the event ends, the truck closes and bills nothing. Batch Transform is the meal-prep service that produces ten thousand boxed lunches once a week — the kitchen spins up Monday morning, processes the entire batch, ships it, and shuts down. No standing capacity, no queue, no per-request billing — one job, one invoice, gone. Multi-model endpoints are the food-court arrangement where five restaurants share one kitchen and one cashier, swapping ingredients in and out as orders arrive. Multi-container endpoints are the assembly-line cafeteria where every tray passes the salad station, then the entree station, then the dessert station in fixed sequence. Pick the right service model for the customer pattern, and the kitchen runs efficiently. Pick wrong, and you either lose customers to slow service or burn cash heating empty grills.

Analogy 2 — The Postal and Delivery Hierarchy

Picture four delivery services at a mail-order company. Real-time inference is the same-store-pickup service — the warehouse is staffed continuously, every order ships within an hour, customers expect instant fulfilment, and the company pays for warehouse staff to stand ready even on quiet days. Async inference is the international freight forwarder — large shipments (up to 1 GB payload, the SageMaker async limit) drop into a queue at the dock, the forwarder processes them in the order they arrived, and the customer is notified when the container clears customs hours or days later. The dock can be sized for average load because the queue absorbs surges. Serverless inference is the on-demand courier service that activates when called — no idle drivers, no fleet cost, but the courier needs ten minutes to reach the pickup point (cold start). For sporadic shipments this is the cheapest option; for continuous traffic the courier-call pattern would never end. Batch Transform is the once-a-month bulk mailing run — every address in the database is processed in one massive job, the postage is bought in bulk at a discount, the truck rolls out once, and no infrastructure remains afterward. The MLA-C01 exam asks "you have 50,000 records to score nightly with results back by morning, what is the cheapest pattern?" — the answer is Batch Transform, not a real-time endpoint that idles 22 hours a day.

Analogy 3 — The Hospital Emergency Department Triage

Picture a hospital with four patient flows. Real-time inference is the trauma bay — every minute matters, full medical staff is on duty 24x7, and the cost of empty beds is accepted because a stroke patient cannot wait. Latency SLA is sub-second to single-digit seconds. Async inference is the radiology imaging suite where MRI scans are queued, run sequentially, and the radiologist reviews the result hours later. Payload (the MRI scan, hundreds of MB to a gigabyte) is too large for the trauma bay's instant flow, but the queue ensures every scan eventually completes without overwhelming the radiologist. Serverless inference is the on-call specialist — a rare endocrinology consult is requested, the specialist arrives in fifteen minutes (cold start), handles the case, and goes home. No salary while idle. Batch Transform is the annual cohort study — every patient record from the past year is run through a risk-scoring model in one offline job, results are written to the data warehouse, and no compute remains afterward. Multi-model endpoints are the shared examination room used by five different specialists in rotation throughout the day; multi-container endpoints are the surgical pipeline where every patient passes anesthesia, surgery, and recovery in fixed sequence. Choosing the right flow saves lives and dollars; mismatching the patient to the flow is the mistake the MLA-C01 exam tests.

Real-Time Inference Endpoints — The Default

Real-time endpoints are the SageMaker default and the right answer for genuine request-response APIs.

Architecture and Provisioning

A real-time endpoint hosts a model on one or more dedicated EC2 instances behind a SageMaker-managed load balancer. Each InvokeEndpoint call is synchronous — the client sends a payload (up to 6 MB), the model produces a prediction, the response returns within the connection timeout (60 seconds default). Instances are persistent — they run until you scale them down or delete the endpoint, and they bill per second whether traffic is flowing or not.
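
A minimal invocation sketch with boto3 — the endpoint name and payload are placeholders; the call is synchronous, so the client blocks until the prediction returns or the timeout fires:

    import boto3

    # The SageMaker Runtime client issues InvokeEndpoint calls (synchronous, payload <= 6 MB).
    runtime = boto3.client("sagemaker-runtime")

    response = runtime.invoke_endpoint(
        EndpointName="churn-model-prod",   # hypothetical endpoint name
        ContentType="text/csv",
        Body="42,0.7,1532.50,3",           # one feature row, well under the 6 MB limit
    )

    print(response["Body"].read().decode("utf-8"))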

Latency Profile

Real-time endpoints deliver single-digit-millisecond to sub-second latency for typical models. P50 latency depends on model size and instance type; P99 latency depends on cold-start avoidance, GC pauses, and whether auto-scaling is mid-scale-up. Real-time is the only endpoint type that can credibly hit a sub-100ms SLA.

Auto-Scaling Configuration

SageMaker endpoint auto-scaling supports target tracking (most common — scale on InvocationsPerInstance to a target value), step scaling (CloudWatch alarm thresholds trigger discrete scaling steps), and scheduled scaling (anticipated load patterns). Target tracking is the right answer for unpredictable production traffic; step scaling for known spike patterns; scheduled for known periodicity (e.g. business-hours-only services).

When to Pick Real-Time

  • API serving with strict latency SLA (< 1 second)
  • Continuous moderate-to-high traffic where idle cost is offset by request volume
  • Payload under 6 MB
  • Need for warm models to avoid any cold-start delay
  • Production variants for A/B testing or shadow deployment

A SageMaker real-time inference endpoint is a persistent, synchronous, request-response model serving surface with sub-second latency and a 6 MB payload limit. It is the right choice when latency SLA is strict, traffic is steady or moderate, and payload fits in 6 MB. It is the wrong choice for sporadic traffic (use serverless), large payloads (use async or Batch Transform), or once-a-day batch scoring (use Batch Transform). The endpoint runs continuously and bills per instance-second whether or not traffic is flowing — this is the cost trap that causes ML Engineers to over-pay for low-traffic models that should have been serverless.

Asynchronous Inference Endpoints — Large Payload, Long Processing

Async inference solves the cases real-time cannot: large payloads, long-running model invocations, and tolerable wait times measured in minutes.

Architecture

The client uploads input to S3, calls InvokeEndpointAsync with the S3 URI, and receives an InferenceId immediately. The request lands in a SageMaker-managed SQS queue. Backend instances pull from the queue, run the model, write the response to an S3 output location, and optionally publish to an SNS topic on completion. Maximum payload: 1 GB. Maximum processing time: 1 hour per request.
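
A hedged sketch of the calling pattern — the endpoint name, bucket, and key are placeholders; note that only an S3 URI travels in the request, never the payload itself:

    import boto3

    runtime = boto3.client("sagemaker-runtime")

    # Input was uploaded to S3 beforehand; the request carries only its URI.
    response = runtime.invoke_endpoint_async(
        EndpointName="video-tagging-async",                      # hypothetical
        InputLocation="s3://my-bucket/async-in/clip-0001.json",  # hypothetical
        ContentType="application/json",
    )

    # Returns immediately; the result appears at OutputLocation once the queue drains,
    # optionally announced via the SNS topic configured on the endpoint.
    print(response["InferenceId"], response["OutputLocation"])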

Scale-To-Zero

Async endpoints support scale-to-zero by setting the MinCapacity of the auto-scaling policy to zero. When the queue drains and traffic stops, instance count drops to zero and billing pauses. New requests in the queue trigger a cold start (instances boot, model loads), then processing resumes. Scale-to-zero is unique to async (and serverless) — real-time endpoints always keep at least one instance running.
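
A sketch of the scale-to-zero policy with Application Auto Scaling, assuming a hypothetical async endpoint and variant name; the backlog-per-instance metric is the CloudWatch metric async endpoints publish:

    import boto3

    autoscaling = boto3.client("application-autoscaling")
    resource_id = "endpoint/video-tagging-async/variant/AllTraffic"  # hypothetical names

    # MinCapacity=0 is what allows the async endpoint to drop to zero instances.
    autoscaling.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=0,
        MaxCapacity=4,
    )

    # Track queue backlog per instance so instances come back when requests arrive.
    autoscaling.put_scaling_policy(
        PolicyName="async-backlog-target-tracking",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 5.0,
            "CustomizedMetricSpecification": {
                "MetricName": "ApproximateBacklogSizePerInstance",
                "Namespace": "AWS/SageMaker",
                "Dimensions": [{"Name": "EndpointName", "Value": "video-tagging-async"}],
                "Statistic": "Average",
            },
        },
    )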

When to Pick Async

  • Payload between 6 MB and 1 GB (above real-time limit)
  • Processing time more than 60 seconds and up to 1 hour
  • Traffic that can tolerate queue-based delivery
  • Cost-sensitive workloads where scale-to-zero between bursts saves money
  • Long-running inference like video processing, OCR on large documents, multi-step LLM reasoning

Async vs Batch Transform — The Confusion

Async holds a queue and serves individual requests asynchronously over time; Batch Transform processes one defined dataset in one job and finishes. If you need to serve incoming requests one by one with results delivered minutes later, use async. If you have a fixed dataset of N records to score in one shot with no further requests until next time, use Batch Transform.

Serverless Inference Endpoints — Sporadic Traffic

Serverless inference is the cost-optimised pattern for low-volume sporadic models.

Architecture and Cold Start

No persistent instances. AWS provisions capacity per request, charges per millisecond of compute used, and tears down when idle. The first request after idle triggers a cold start — the runtime container loads the model, which can take seconds to tens of seconds for large models. Subsequent requests within a warm window reuse the loaded container.

Configuration Limits

  • Memory: 1 GB to 6 GB (in 1 GB increments)
  • Max concurrency: 1 to 200 concurrent requests (provisioned concurrency reduces cold start at higher cost)
  • Payload: 4 MB request, 4 MB response (smaller than real-time)
  • Timeout: 60 seconds per request
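
Within these limits, a serverless endpoint is configured by attaching a ServerlessConfig to the production variant instead of an instance type — a minimal sketch with placeholder names:

    import boto3

    sm = boto3.client("sagemaker")

    # No InstanceType / InitialInstanceCount: ServerlessConfig replaces them.
    sm.create_endpoint_config(
        EndpointConfigName="fraud-model-serverless",   # hypothetical
        ProductionVariants=[
            {
                "VariantName": "AllTraffic",
                "ModelName": "fraud-model",            # hypothetical
                "ServerlessConfig": {
                    "MemorySizeInMB": 2048,   # 1024-6144 in 1 GB steps
                    "MaxConcurrency": 5,      # 1-200 concurrent invocations
                },
            }
        ],
    )

    sm.create_endpoint(
        EndpointName="fraud-model-serverless",
        EndpointConfigName="fraud-model-serverless",
    )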

Provisioned Concurrency

To trade cost for cold-start avoidance, configure provisioned concurrency — a number of containers kept warm at all times. Warm invocations are billed at a lower per-millisecond rate than on-demand, but the reserved capacity incurs a base hourly charge. Useful when known low traffic still needs sub-second first-request latency.

When to Pick Serverless

  • Sporadic traffic with multi-minute idle gaps
  • Tolerable cold start (or budget for provisioned concurrency)
  • Payload under 4 MB
  • Cost-sensitive models that would idle a real-time endpoint
  • Internal tools, dev/test endpoints, low-volume APIs

Pick serverless inference when expected traffic has long idle gaps and the workload tolerates cold starts. A real-time endpoint serving 100 requests per day costs the same as one serving orders of magnitude more traffic on the same instance count, because instances run continuously — at low volumes, the effective per-request cost can be 50x to 500x higher than at scale. Serverless flips the model: you pay only when invoked, scale to zero between requests, and accept a cold-start penalty on the first request after an idle window (typically 5-30 seconds for medium-sized models). For internal dashboards, low-traffic APIs, and dev/test, serverless is consistently the cost-optimal choice. For latency-critical production with a sub-100ms SLA, the cold-start penalty is unacceptable and real-time wins.

Batch Transform — Offline Bulk Scoring

Batch Transform is not an endpoint at all — it is an ephemeral inference job pattern.

Architecture

You submit a CreateTransformJob request specifying the model, input S3 prefix, output S3 prefix, instance type, and instance count. SageMaker spins up the requested instances, partitions the input across them, runs inference in parallel, writes results to the output prefix, and tears down the instances. No persistent endpoint, no queue, no auto-scaling — just a single bounded job.
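
A minimal job submission sketch — model name, S3 prefixes, and instance settings are placeholders; the job tears down on its own once the output prefix is fully written:

    import boto3

    sm = boto3.client("sagemaker")

    sm.create_transform_job(
        TransformJobName="nightly-scoring-2024-06-01",            # hypothetical
        ModelName="churn-model",                                  # hypothetical
        TransformInput={
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-bucket/scoring-input/",         # hypothetical
            }},
            "ContentType": "text/csv",
            "SplitType": "Line",          # one record per newline-delimited row
        },
        TransformOutput={"S3OutputPath": "s3://my-bucket/scoring-output/"},
        TransformResources={"InstanceType": "ml.m5.xlarge", "InstanceCount": 4},
        BatchStrategy="MultiRecord",      # pack records up to MaxPayloadInMB per request
        MaxPayloadInMB=6,
    )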

Input Modes

  • MultiRecord — one HTTP request carries many records, batched up to MaxPayloadInMB (default 6 MB)
  • SingleRecord — one HTTP request per record, slower but simpler for variable-size inputs

Splitting Strategy

SplitType parameter controls how SageMaker chunks the input file: Line (split on newlines, common for CSV/JSONL), RecordIO (binary RecordIO format), TFRecord (TensorFlow records), or None (whole file per request).

When to Pick Batch Transform

  • Defined dataset to score in one job (no streaming requests)
  • Latency SLA in hours (overnight batch is fine)
  • Cost-sensitive — pay only for the job duration
  • Large datasets where parallelisation across many instances reduces wall-clock
  • Periodic scoring (daily/weekly/monthly) where standing endpoints waste money

Why Batch Transform Beats An Endpoint For Bulk

Running ten million records through a real-time endpoint requires either keeping the endpoint hot for the duration (paying per second) or building a custom client that throttles requests to avoid 5xx errors. Batch Transform handles partitioning, parallelism, and tear-down automatically with one API call. Cost is dramatically lower because the job duration is exactly the work duration — no idle time.

Async inference and Batch Transform are NOT the same — confusing them is the canonical MLA-C01 trap. Both are non-real-time, both can handle large payloads, both run in the background. The difference: async is a persistent endpoint serving an ongoing stream of individual requests asynchronously over time, with a queue that may never empty. Batch Transform is a bounded job that processes one defined dataset in one shot and disappears. If the question describes "ongoing streaming requests with results delivered later," it is async. If the question describes "a fixed dataset to score in one run," it is Batch Transform. The MLA-C01 exam writes stems where both look plausible — the giveaway is the word "streaming," "ongoing," or "queue" (async) versus "dataset," "nightly batch," or "one-shot job" (Batch Transform).

Multi-Model Endpoints — Hosting Many Models On Shared Infrastructure

Multi-Model Endpoints (MME) host thousands of models on a shared instance fleet, dynamically loading models from S3 on demand.

Architecture

A single SageMaker endpoint with a single inference container points at an S3 prefix containing many model artifacts. When InvokeEndpoint arrives with a TargetModel header naming a specific model, the container checks if that model is loaded in memory; if not, it pulls the artifact from S3, loads it, evicts the least-recently-used model if memory is full, and serves the prediction. Subsequent calls to the same model are served from the in-memory cache.
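
Invocation differs from a single-model endpoint only by the TargetModel parameter, which names the artifact relative to the endpoint's S3 prefix — a sketch with placeholder names:

    import boto3

    runtime = boto3.client("sagemaker-runtime")

    response = runtime.invoke_endpoint(
        EndpointName="per-tenant-models",     # hypothetical MME
        TargetModel="tenant-4711.tar.gz",     # artifact key under the MME S3 prefix
        ContentType="text/csv",
        Body="3.2,0,17,1",
    )

    # First call after eviction pays the S3 load; repeat calls hit the in-memory cache.
    print(response["Body"].read().decode("utf-8"))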

When To Pick MME

  • Many similar models with the same framework (e.g. one model per customer in a SaaS scenario)
  • Many models, each with low individual traffic
  • Models small enough to fit several in memory simultaneously
  • Acceptable cold-load latency on first call after eviction (seconds)

When MME Is Wrong

  • Heterogeneous frameworks (one model in TF, another in PyTorch — MME requires one container)
  • Strict per-model latency SLA where cold-load is unacceptable
  • Few large models that each saturate one instance's memory

MME vs Multi-Container Endpoints

MME hosts many models behind one container; multi-container endpoints chain multiple containers in a sequence behind one endpoint. MME = horizontal model fanout; multi-container = vertical container pipeline. Confusing the two is a heavily-tested trap.

Multi-Container Endpoints — Sequential Pipeline Of Containers

Multi-container endpoints chain up to fifteen containers behind one endpoint, with options for direct (single-container invocation) or serial (pipeline) modes.

Serial Mode — The Pipeline Pattern

Each request flows through containers in sequence: container 1 (preprocessing) → container 2 (model) → container 3 (postprocessing). Used for ML pipelines that need consistent preprocessing applied at inference time matching what was applied at training time, eliminating training-serving skew.
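
The pipeline is declared on the model object itself: a Containers list plus an InferenceExecutionConfig. A sketch with placeholder image URIs, artifact locations, and role ARN:

    import boto3

    sm = boto3.client("sagemaker")

    sm.create_model(
        ModelName="preprocess-predict-postprocess",                        # hypothetical
        ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerRole",   # placeholder
        Containers=[
            {"Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/preprocess:latest"},
            {"Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/xgb-model:latest",
             "ModelDataUrl": "s3://my-bucket/models/xgb/model.tar.gz"},
            {"Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/postprocess:latest"},
        ],
        # "Serial" chains the containers per request; "Direct" exposes each individually.
        InferenceExecutionConfig={"Mode": "Serial"},
    )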

Direct Mode

Multiple models share an endpoint but each is invoked independently with TargetContainerHostname. Useful when you want endpoint-level cost consolidation but per-call container isolation.

When To Pick Multi-Container

  • Need to apply preprocessing/postprocessing at inference without packaging into the model container
  • Want a clean separation between feature transformation and model code
  • Hosting an ensemble where outputs combine sequentially

When Multi-Container Is Wrong

  • Hosting many independent models — that is MME territory
  • Need for independent scaling of each step — multi-container shares one endpoint's scaling

SageMaker Inference Recommender — Benchmark-Driven Instance Selection

Picking the right instance type for an endpoint is non-trivial — Inference Recommender automates the benchmarking.

What It Does

Inference Recommender runs your model against a list of candidate instance types (or a fully managed default list) with simulated load, measures latency and throughput, and produces a recommendation report ranked by latency, cost, or throughput-per-dollar.

Default vs Advanced Job

  • Default job — quick benchmark across a SageMaker-curated instance list
  • Advanced job — custom instance list, custom traffic pattern, custom model latency target, auto-scaling configuration recommendation
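
Either flavour is started with a single API call; a hedged sketch of a Default job against a registered model package (job name, role, and ARN are placeholders):

    import boto3

    sm = boto3.client("sagemaker")

    sm.create_inference_recommendations_job(
        JobName="churn-model-recommender",     # hypothetical
        JobType="Default",                     # "Advanced" adds custom instances, traffic, SLAs
        RoleArn="arn:aws:iam::123456789012:role/SageMakerRole",   # placeholder
        InputConfig={
            "ModelPackageVersionArn":
                "arn:aws:sagemaker:us-east-1:123456789012:model-package/churn/1",  # placeholder
        },
    )

    # The ranked benchmark report is available once the job finishes.
    report = sm.describe_inference_recommendations_job(JobName="churn-model-recommender")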

Output

A report listing each candidate with measured P50/P90/P99 latency, max throughput, cost per million inferences, and the recommended instance type for the specified objective. The recommendation can be applied directly to a real-time endpoint config.

When To Use

Always run Inference Recommender before pinning an instance type for production. Manual instance picks are usually wrong by 2-3x in either direction (over-provisioning or under-provisioning).

Always run SageMaker Inference Recommender before deciding on a production endpoint instance type. Manual instance selection based on training instance type or "what worked in dev" is wrong almost every time — inference workloads have a different CPU/GPU/memory profile than training. Inference Recommender produces a benchmark report measuring P50/P90/P99 latency, max throughput, and cost-per-million-inferences for each candidate instance, and ranks them by your objective (lowest latency, lowest cost, highest throughput). The MLA-C01 exam asks "what is the recommended way to pick an endpoint instance type for production?" — the answer is Inference Recommender, not engineering judgement.

SageMaker Neo — Compiling Models For Target Hardware

SageMaker Neo compiles trained models to run optimised on specific target hardware, including edge devices and cloud instances.

How It Works

Submit a trained model and a target hardware spec; Neo compiles the model graph into an optimised binary using TVM or other compilers, often achieving up to a 2x speedup and a smaller memory footprint. Targets include ml.c5, ml.p3, ml.inf1 (Inferentia), Jetson devices, Raspberry Pi, and many more.
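
A compilation job sketch targeting Inferentia — artifact location, framework, input shape, and role are placeholders, and some frameworks also require a FrameworkVersion:

    import boto3

    sm = boto3.client("sagemaker")

    sm.create_compilation_job(
        CompilationJobName="resnet50-inf1",                       # hypothetical
        RoleArn="arn:aws:iam::123456789012:role/SageMakerRole",   # placeholder
        InputConfig={
            "S3Uri": "s3://my-bucket/models/resnet50/model.tar.gz",   # placeholder
            "DataInputConfig": '{"input": [1, 3, 224, 224]}',         # expected input shape
            "Framework": "PYTORCH",
            "FrameworkVersion": "1.13",
        },
        OutputConfig={
            "S3OutputLocation": "s3://my-bucket/compiled/",
            "TargetDevice": "ml_inf1",    # or jetson_xavier, rasp3b, ml_c5, ...
        },
        StoppingCondition={"MaxRuntimeInSeconds": 900},
    )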

Inferentia With Neo

For maximum cost-efficient inference at scale, compile to AWS Inferentia (ml.inf1, ml.inf2) via Neo. Inferentia delivers up to 70 percent lower cost per inference vs comparable GPU instances for many workloads.

When To Use Neo

  • Edge deployment (Greengrass, IoT devices) where the optimised binary must fit constrained hardware
  • Cost optimisation on cloud endpoints — Inferentia compiled via Neo
  • Latency reduction without changing instance type

Endpoint Auto-Scaling — Configuration Patterns

All endpoint types except Batch Transform support auto-scaling.

Target Tracking — The Default

The default metric is SageMakerVariantInvocationsPerInstance — scale out when the per-instance invocation rate exceeds the target value, scale in when it falls below. Simple, robust, and recommended for most production endpoints.
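
A sketch of the policy with Application Auto Scaling, assuming the variant is already registered as a scalable target (as in the async scale-to-zero example earlier) and using placeholder names:

    import boto3

    autoscaling = boto3.client("application-autoscaling")

    autoscaling.put_scaling_policy(
        PolicyName="invocations-target-tracking",
        ServiceNamespace="sagemaker",
        ResourceId="endpoint/churn-model-prod/variant/AllTraffic",   # hypothetical
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            # Target average invocations per instance per minute.
            "TargetValue": 200.0,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleOutCooldown": 60,
            "ScaleInCooldown": 300,
        },
    )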

Step Scaling

CloudWatch alarm thresholds trigger discrete scaling steps. More control than target tracking but harder to tune.

Scheduled Scaling

Pre-scale before known traffic events (Black Friday, end-of-quarter reporting). Combine with target tracking as a safety net.

Scale-To-Zero

Serverless endpoints scale to zero by design; async endpoints scale to zero by setting MinCapacity = 0 in their auto-scaling policy. Not available for real-time endpoints — they always run at least one instance.

SageMaker Endpoint Decision Matrix

The single page that answers most MLA-C01 endpoint-selection questions.

Constraint  | Real-Time     | Async                     | Serverless                  | Batch Transform
Latency SLA | Sub-second    | Minutes to hours          | Sub-second after warm       | Hours (job duration)
Payload     | 6 MB          | 1 GB                      | 4 MB                        | Full S3 file
Traffic     | Steady-bursty | Bursty, large payloads    | Sporadic                    | One-shot
Cost shape  | Always-on     | Scale-to-zero             | Pay-per-millisecond         | Pay-per-job
Use case    | API serving   | Video, OCR, large docs    | Internal tools, low traffic | Nightly scoring
Cold start  | None          | Yes (after scale-to-zero) | Yes (5-30 sec)              | N/A (job init)

How To Read The Matrix On The Exam

The stem gives you constraints. Cross out columns that violate any constraint. The remaining column is the answer. If two columns survive, look for the cost-optimisation cue or the latency cue and pick the tighter fit.

Memorize the four-way decision matrix and apply it ruthlessly: real-time for steady sub-second SLA with payload under 6 MB, async for large payload (up to 1 GB) with multi-minute tolerance, serverless for sporadic traffic with cold-start tolerance, Batch Transform for one-shot offline scoring of fixed datasets. The MLA-C01 exam writes stems containing exactly the cues that map to one cell of this matrix — payload size in MB, latency in seconds vs minutes, traffic pattern (steady / bursty / sporadic / one-shot), and cost shape (always-on vs scale-to-zero vs pay-per-job). Memorize the table and the right answer becomes obvious. Manual reasoning from first principles is slower and error-prone under exam time pressure.

Common MLA-C01 Endpoint Type Traps

Trap 1 — Real-Time For Sporadic Traffic

A model invoked 100 times per day on a real-time endpoint costs the same per hour as a heavily used one on the same instance count — the instance bills whether or not traffic flows. The exam plants this stem with a "twice an hour" or "infrequent" cue — the right answer is serverless.

Trap 2 — Async For Hard Deadlines

Async endpoints have a queue. If the queue depth grows, response latency grows linearly. Async is wrong for "request must complete in under 30 seconds" scenarios — that is real-time territory.

Trap 3 — Batch Transform For Streaming Requests

Batch Transform processes a fixed dataset and shuts down. If the workload is a continuous stream of incoming requests with results required minutes later, it is async, not Batch Transform.

Trap 4 — Multi-Model Endpoint For Heterogeneous Frameworks

MME hosts many models behind one inference container. All models must use the same framework and container. Mixing TensorFlow and PyTorch models requires multi-container endpoint or separate endpoints, not MME.

Trap 5 — Serverless For Sub-100ms SLA

Cold start makes serverless unsuitable for hard sub-100ms SLAs. Provisioned concurrency keeps a fixed number of containers warm, but traffic that exceeds the provisioned level still hits cold starts. Latency-critical production = real-time.

Trap 6 — Inference Recommender Output Is Optional

The exam expects you to use Inference Recommender for production sizing. "Pick the instance type that ran training" is always wrong.

Trap 7 — Neo Compilation Is Free

Neo compilation jobs run on SageMaker compute and incur cost. Neo is a one-time investment that pays back through lower endpoint cost — but the compilation itself is not free.

Trap 8 — Real-Time Supports 1 GB Payload

Real-time payload limit is 6 MB request and 6 MB response. For larger payloads, use async (up to 1 GB) or upload to S3 first and pass the S3 URI in a small payload (a common pattern).

FAQ — SageMaker Endpoint Type Selection

Q1 — How do I choose between async inference and Batch Transform when both could fit my workload?

Look at the request pattern. Async serves an ongoing stream of individual requests asynchronously, where each request arrives at a different time and gets its own response. Batch Transform processes one fixed dataset in one job, with the entire dataset available as input before the job starts. If your input is a continuous stream where requests arrive throughout the day, use async. If your input is a known dataset materialised as a set of S3 objects, use Batch Transform. Cost-wise, Batch Transform is cheaper for one-shot jobs because there is no persistent endpoint between runs; async is cheaper than real-time for streaming with scale-to-zero, but more expensive than Batch Transform when the work could be batched into a single job. The MLA-C01 stem giveaway: "ongoing requests" or "throughout the day" → async; "nightly batch" or "process the dataset" → Batch Transform.

Q2 — When does a multi-model endpoint outperform separate single-model endpoints?

When you have many similar models (same framework, same container) with low individual traffic. Each separate single-model endpoint runs at minimum one instance, paying full instance cost regardless of traffic. With one hundred low-traffic models, that means one hundred instances minimum. Behind one MME, those one hundred models share a few instances, with models loaded from S3 on demand and evicted under memory pressure. Cost can drop 10x to 100x. The break-even depends on per-model traffic — once individual model traffic justifies its own instance, MME no longer helps. MLA-C01 stems with "thousands of customer-specific models" or "one model per tenant in a SaaS app" almost always have MME as the right answer.

Q3 — What is the practical cold-start cost of serverless inference and how do I mitigate it?

Cold start is typically 5-30 seconds for medium models, depending on container size and model artifact size. The first request after a long idle window pays the full cost; subsequent requests within the warm window (typically 5-15 minutes) hit the warm container with sub-second latency. Mitigation options: enable provisioned concurrency (warm containers always reserved, lower per-ms rate but base hourly cost), use smaller models that load faster, cache the model in a smaller artifact format, or accept the cold start if it only affects the first user of a session. For internal tools and dev/test, raw serverless is fine. For external-facing low-traffic APIs where the first user must not see a cold start, provisioned concurrency at concurrency=1 is the typical fix.

Q4 — Can I use auto-scaling with a real-time endpoint that has multiple production variants?

Yes — each production variant has its own auto-scaling configuration. You can scale variant A and variant B independently based on traffic to each, useful when running an A/B test where one variant gets 90 percent of traffic and the other gets 10 percent. The 10 percent variant scales down to its minimum capacity (typically 1 instance for real-time) while the 90 percent variant scales out as load grows. Auto-scaling on production variants is one of the configurations exercised by SageMaker deployment guardrails for safe rollouts — guardrails monitor the new variant's metrics and can trigger automated rollback if the variant scales unexpectedly or violates a CloudWatch alarm threshold.

Q5 — When should I compile a model with SageMaker Neo and what is the typical performance gain?

Compile with Neo when you are deploying to edge hardware (constrained memory or compute) or when you want to run on AWS Inferentia for cost-optimised cloud inference. Typical gains: 2x latency reduction on the same hardware, 3x to 10x cost reduction when moving from GPU to Inferentia for compatible models. Neo is most effective for vision and NLP models with well-defined operator sets (ResNet, BERT, YOLO derivatives); cutting-edge custom architectures may not compile cleanly. The workflow: train on GPU, compile with Neo targeting ml.inf1 or ml.inf2, deploy the compiled artifact behind a real-time endpoint on the Inferentia instance type, validate latency and accuracy parity, then cut over. The MLA-C01 exam tests Neo as the answer for edge deployment and as the answer for "reduce inference cost without retraining."

Q6 — How does Inference Recommender differ from manually testing instance types?

Manual testing requires you to deploy the model to each candidate instance, run a load generator, measure metrics, tear down, repeat for each instance type — typically four to eight instance types means four to eight full deployment cycles, each taking thirty minutes to an hour. Inference Recommender automates this: one API call kicks off the benchmark, SageMaker provisions every candidate instance in parallel, runs simulated load, captures latency and throughput metrics, and produces a comparison report in fifteen to thirty minutes. The output ranks instances by your chosen objective (lowest latency, lowest cost per million inferences, highest throughput) and recommends auto-scaling configuration. For production deployment decisions, Inference Recommender is the AWS-recommended pattern; manual testing is acceptable for one-off experiments. The MLA-C01 exam reliably tests Inference Recommender as the right answer for "what is the recommended way to pick an endpoint instance type."

Q7 — Is there a SageMaker endpoint pattern that combines preprocessing, model inference, and postprocessing in one deployment?

Yes — the multi-container endpoint in serial mode. You define a sequence of up to fifteen containers (typically preprocessing → model → postprocessing), and SageMaker chains them per request: input goes to container 1, container 1's output becomes container 2's input, and so on. This pattern keeps preprocessing logic out of the model container, which is useful when the preprocessing must match training-time preprocessing (eliminating training-serving skew) and when you want to share one preprocessing container across multiple model versions. The alternative is to bundle preprocessing into the model container itself (simpler, but couples preprocessing and model versions), or to do preprocessing client-side (risky for skew). For MLOps maturity, multi-container serial mode is the right answer when the question mentions "consistent preprocessing across training and inference" or "feature transformation pipeline before the model."
