Distributed training on SageMaker is the discipline of splitting a single training job across multiple GPUs and multiple instances so that a model that would otherwise take days on one machine completes in hours, or so that a model that would not fit in a single GPU's memory can be trained at all. On the MLA-C01 exam, distributed training anchors Task 2.2 (Train and refine models) of Domain 2 (ML Model Development, 26 percent weight). This is an engineering exam, not a data-science exam — the question is rarely "explain backpropagation across a pipeline-parallel mesh" and almost always "given a model size, dataset size, instance budget, and time-to-train constraint, choose the right distributed training strategy and instance type."
This guide is built for the ML engineer perspective. It covers the three parallelism strategies (data, model, tensor/pipeline), the two SageMaker libraries (SMDP for data parallelism, SMP for model parallelism), the instance types that distributed training actually runs on (ml.p3.16xlarge, ml.p4d.24xlarge, ml.p5.48xlarge), the networking layer that makes multi-node training viable (Elastic Fabric Adapter), the SageMaker Training Compiler that adds XLA-based graph optimisation on top, and the checkpointing patterns that keep distributed training resilient against Spot interruptions and node failures. It also covers the equally important question the exam likes to test — when not to use distributed training, because synchronisation overhead can make a small model train slower on eight GPUs than on one.
What Is Distributed Training and Why MLA-C01 Cares
Distributed training is the technique of running a single training job across multiple compute units — multiple GPUs in one instance, or multiple instances each with multiple GPUs — to reduce wall-clock time or to fit larger models. The MLA-C01 exam tests distributed training because modern ML workloads have crossed the threshold where single-GPU training is impractical: foundation model fine-tuning, computer vision on large datasets, and time-series forecasting at retail scale all require distributed training to be commercially viable. The community signal is consistent — distributed training depth is harder than candidates expect, and the test plants scenarios where the wrong parallelism choice multiplies compute cost without reducing training time.
Distributed Training Is About Two Different Bottlenecks
Distributed training solves two distinct problems, and the parallelism strategy you pick depends on which one you have. Time-to-train is the throughput problem — your model fits on one GPU but training takes 50 hours and the business needs results in 5. Model size is the memory problem — your model has 70 billion parameters and the activations plus optimiser state will not fit in a single 80 GB H100, no matter how patient you are. Data parallelism solves time-to-train; model parallelism solves model size; tensor parallelism solves an even narrower slice of model size where individual layers exceed one GPU's memory. Mixing the wrong strategy with the wrong bottleneck wastes money — a 100M-parameter model run with model parallelism on eight GPUs can train slower than it would on a single GPU while paying for eight.
Why SageMaker, Not Self-Managed EC2
You can run distributed training on raw EC2 with NCCL, MPI, and your own orchestration. SageMaker distributed training adds three things the exam expects you to recognise. First, the SageMaker Data Parallel and Model Parallel libraries are AWS-tuned implementations that exploit AWS network topology (placement groups, EFA) better than naive open-source defaults. Second, SageMaker manages the cluster lifecycle — spinning up instances in the same placement group, distributing the training image, configuring inter-node networking, and tearing down on completion — so the engineer focuses on the training script. Third, SageMaker integrates distributed training with Spot, checkpoints, warm pools, Experiments, and Model Registry, all of which are tested. Self-managed EC2 distributed training is a valid architectural choice, but for MLA-C01 answers, SageMaker is almost always the expected service unless the stem explicitly rules it out.
Why Distributed Training - The Three Drivers
Three forces push a workload from single-GPU to distributed.
Driver 1 - Time-to-Train Constraints
A 30-billion-parameter LLM fine-tune on 100 GB of text takes roughly 80 hours on a single ml.p4d.24xlarge (8 A100 GPUs). With data parallelism across four ml.p4d.24xlarge nodes (32 A100 GPUs), the same fine-tune completes in roughly 22 hours. Note that 4x the GPUs does not deliver 4x the speed: sublinear scaling puts the speed-up closer to 3.6x, and because the four nodes together accrue slightly more instance-hours than the single node did, the total bill rises modestly. The reduction in wall-clock time is what the business is paying for. Time-to-train shows up on the exam as "the team needs results before the demo on Friday" or "competitive analysis requires daily retraining." When wall-clock time is the constraint, scale out.
Driver 2 - Model Size That Exceeds One GPU's Memory
A 175-billion-parameter model in FP16 needs 350 GB just for weights, plus roughly 2x that for activations and optimiser state during training — well beyond the 80 GB on an A100 or H100. No amount of patience trains this on one GPU; the model literally cannot load. Model parallelism splits the model across GPUs so each GPU holds a fraction of the parameters. Tensor parallelism slices individual matrix multiplications across GPUs. Pipeline parallelism splits the model into sequential stages that run on different GPUs in a pipelined schedule. When the model does not fit, scale across.
Driver 3 - Dataset Size That Makes Single-GPU IO the Bottleneck
A 10 TB image dataset is too large to stage on a single instance's local NVMe, and reading it from S3 in File mode every epoch wastes hours. Data parallelism with each node reading a shard of the dataset in parallel, combined with FSx for Lustre as a shared high-throughput layer, removes IO as a bottleneck. The exam plants this as "training is slow but GPU utilisation is 30 percent" — the answer is rarely "more GPUs" and usually "remove the IO bottleneck first," then scale.
Plain-Language Explanation: Distributed Training
Distributed training is one of those topics where the abstractions ("data parallel," "tensor parallel") feel academic until you map them to physical work. Three analogies make it concrete.
Analogy 1 - The Restaurant Kitchen Scaling Up
Imagine a restaurant kitchen training (cooking for) a banquet. The recipe is the model, the ingredients are the dataset, and the time until the banquet starts is the time-to-train budget.
Data parallelism is hiring eight identical cooks, each given the same recipe and a different one-eighth of the ingredient pile. Every cook prepares their portion in parallel; at the end, they all compare notes ("how much salt did you use") and average their adjustments — that comparison step is the AllReduce gradient synchronisation. The recipe (model) fits in every cook's head; the only constraint is throughput. This is the SageMaker Data Parallel Library (SMDP) pattern.
Model parallelism is when the recipe is so massive — a 200-page cookbook — that no single cook can memorise it. You split the cookbook into chapters: cook 1 handles the appetiser chapter, cook 2 handles the main course chapter, cook 3 handles the dessert chapter. They work in a pipeline — cook 1 finishes the appetiser, hands plates to cook 2, who starts the main course while cook 1 starts the next batch's appetiser. This is pipeline parallelism, the SageMaker Model Parallel Library's default.
Tensor parallelism is when even one chapter is too long for one cook — say the main course chapter has 50 pages. You split that single chapter across four cooks who each work on different paragraphs simultaneously, talking constantly to coordinate. Tensor parallelism is heavy on inter-cook communication, so the cooks must stand right next to each other (same instance, NVLink) — placing them in different kitchens (different instances) ruins throughput.
Elastic Fabric Adapter (EFA) is a high-speed pneumatic tube system between kitchens that lets cooks in different buildings pass plates as fast as if they were standing next to each other. Without EFA, cross-kitchen communication is by carrier pigeon (standard Ethernet) and the gradient sync at the end of every batch ruins the speedup. The Training Compiler is a sous-chef who reads the recipe in advance and rewrites it for efficiency — combining steps, eliminating redundant prep work, picking the optimal order — so each cook does less work to produce the same dish.
Checkpointing to S3 is the head chef photocopying the kitchen's progress every 15 minutes and locking the photocopies in the safe; if the kitchen burns down (Spot interruption), training resumes from the last photocopy instead of starting over from raw ingredients.
Analogy 2 - The Construction Crew Building a Skyscraper
Picture a construction firm building a 100-floor skyscraper (training a large model). The blueprints are the model architecture, the building materials are the training data, and the deadline is the time-to-train.
Data parallelism is dividing the materials evenly across eight identical crews, each building an identical small replica skyscraper from their portion. At the end of each day, they meet, compare what they learned ("the steel beams expand more than expected"), and average their building techniques into a unified standard for the next day. This works because the entire blueprint fits in every foreman's binder — the constraint is how fast you process materials, not blueprint size.
Model parallelism via pipeline parallelism is when the blueprint is so massive that no foreman can hold it all. Crew 1 specialises in the foundation, crew 2 in floors 1-30, crew 3 in floors 31-70, crew 4 in floors 71-100. Each crew passes their finished section up the chain. The pipeline keeps crews busy — when crew 1 finishes the foundation of building B, crew 2 is still working on floors 1-30 of building A. Pipeline bubbles (idle time when the pipeline starts and ends) are unavoidable overhead.
Tensor parallelism is when even one floor's blueprint is too detailed for one crew. You split the floor itself across four crews working on different rooms simultaneously, with constant radio coordination. Heavy radio chatter only works if the crews are on the same site (same instance) with walkie-talkies (NVLink); putting them in different cities ruins coordination.
EFA is a high-bandwidth radio network between sites for the rare occasions when cross-site coordination is needed. Training Compiler is the structural engineer who reviews the blueprint and rewrites it for buildability — combining adjacent walls, eliminating redundant supports, picking better materials — so each crew builds the same building with 30 percent less effort. S3 checkpointing is photographing the construction site at the end of every shift and storing the photos in a fireproof archive; if a hurricane destroys the site, you resume from the most recent photo.
Analogy 3 - The Factory Assembly Line for Cars
Picture a car factory training (manufacturing) a fleet of vehicles. The assembly process is the model, the parts inventory is the training data, and the production deadline is the time-to-train.
Data parallelism is running eight identical assembly lines in parallel, each producing the same car from its share of the parts inventory. At the end of each day, the factory managers compare quality metrics across lines and update the standard procedure on every line. Every line follows the entire assembly process — the constraint is throughput, not process complexity.
Model parallelism is when the assembly process has so many steps (5,000 stations) that no single assembly line is long enough. You split the process across multiple lines: line 1 handles stations 1-1500, line 2 handles 1501-3000, and so on. Each line passes the partially-assembled car to the next. The pipeline overlap means while line 1 is starting car B, line 2 is in the middle of car A.
Tensor parallelism is when even one station is too complex for one worker — say the engine assembly requires four workers operating simultaneously on the same engine, each handling different bolts. They must coordinate in real time, so they stand side-by-side at one station (same instance, NVLink connection); separating them across the factory ruins the synchronisation.
EFA is a high-speed conveyor system between buildings of the factory — the equivalent of standing side-by-side even when physically separated. Training Compiler is the industrial engineer who reviews the assembly process and rewrites it for efficiency — eliminating redundant steps, combining stations, optimising the order of operations. Checkpointing to S3 is the factory shift supervisor photographing every car's state at every station every 15 minutes and storing the records in a fireproof safe; if a tornado destroys the factory, production resumes from the last snapshot rather than starting from raw materials.
Data Parallelism - The Default for Most Distributed Training
Data parallelism is the most common and the simplest distributed training strategy. Master this first because it covers the majority of MLA-C01 distributed training scenarios.
How Data Parallelism Works
Each GPU gets a complete copy of the model and a different mini-batch slice of the data. All GPUs perform forward and backward passes independently on their slice. After the backward pass, gradients are synchronised across all GPUs — the AllReduce operation — so every GPU ends up with the same averaged gradient. The optimiser then updates every replica's parameters identically. The model stays in sync across all GPUs at every step. Data parallelism scales near-linearly until network bandwidth saturates.
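To make the mechanics concrete, here is a minimal sketch of the data-parallel pattern using plain PyTorch DistributedDataParallel; the model, dataset, and hyperparameters are placeholders, and SMDP (covered below) optimises the same AllReduce that DDP performs implicitly during backward().

```python
# Minimal sketch of the data-parallel pattern with plain PyTorch DDP.
# Model, dataset, and hyperparameters are placeholders.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(model, dataset, epochs=1):
    dist.init_process_group(backend="nccl")                      # one process per GPU
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    model = DDP(model.to(local_rank), device_ids=[local_rank])   # full replica on every GPU
    sampler = DistributedSampler(dataset)                        # each rank reads a different shard
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)                                 # reshuffle shard assignment each epoch
        for x, y in loader:
            x, y = x.to(local_rank), y.to(local_rank)
            loss = torch.nn.functional.cross_entropy(model(x), y)
            loss.backward()                                      # gradients are AllReduce-averaged here
            optim.step()
            optim.zero_grad()
```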
When Data Parallelism Is the Right Choice
Three signals point to data parallelism. First, the model fits comfortably in one GPU's memory — typical for models under 10 billion parameters when counting weights alone, and closer to 1 billion parameters once gradients and optimiser state are counted against a smaller GPU's memory. Second, the dataset is large enough that single-GPU training is bound by wall-clock time rather than by memory. Third, the team wants minimal code changes — data parallelism in SageMaker Data Parallel requires only a few wrapper lines around the existing PyTorch DataLoader and optimiser.
The SageMaker Data Parallel Library (SMDP)
SMDP is an AWS-optimised data-parallel library tuned for the EC2 network topology used by ml.p4d, ml.p4de, and ml.p5 instances. SMDP replaces the open-source NCCL AllReduce with a custom AllReduce that exploits the EFA fabric and the placement-group topology. The library handles gradient compression, overlapping communication with computation, and adaptive batch sizing. Activate it with distribution={"smdistributed": {"dataparallel": {"enabled": True}}} in the Estimator configuration. The training script changes are minimal — initialise the process group with the library's backend (or wrap the model with its DistributedDataParallel class), give the DataLoader a distributed sampler, and continue.
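Here is a minimal, hedged sketch of that Estimator configuration with the SageMaker Python SDK; the script name, IAM role, S3 paths, and framework versions are illustrative placeholders, not recommendations.

```python
# Hedged sketch: launching an SMDP data-parallel job with the SageMaker Python SDK.
# Entry point, role, bucket, and versions are placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                    # existing training script (assumed name)
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder role
    instance_type="ml.p4d.24xlarge",
    instance_count=4,                          # 4 nodes x 8 A100s = 32 GPUs
    framework_version="2.0",
    py_version="py310",
    # Enable the SageMaker Data Parallel library; SageMaker wires up the cluster and EFA.
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit({"train": "s3://example-bucket/dataset/train/"})
```

In recent library versions the in-script change is typically to initialise torch.distributed with the smddp backend (import smdistributed.dataparallel.torch.torch_smddp, then dist.init_process_group(backend="smddp")) and keep the rest of the DDP-style training loop unchanged.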
Open-Source Alternatives - PyTorch DDP and Horovod
If your team already uses PyTorch DistributedDataParallel (DDP) or Horovod, both run on SageMaker without modification — SageMaker provides launcher scripts that wire up the cluster. SMDP typically outperforms DDP on AWS-specific instance types because of EFA optimisations, but the performance delta narrows on smaller clusters. Horovod is a vendor-neutral choice that works across PyTorch, TensorFlow, and MXNet — useful for portability but slower than SMDP on AWS.
The SageMaker Data Parallel Library is optimised specifically for ml.p3.16xlarge, ml.p3dn.24xlarge, ml.p4d.24xlarge, ml.p4de.24xlarge, and ml.p5.48xlarge instances, the larger of which carry EFA networking. Outside this instance set — on ml.g4dn, ml.g5, or smaller p-family — SMDP either falls back to standard NCCL or refuses to enable. The performance argument for SMDP only holds when the workload is large enough to justify the EFA-equipped instances; for smaller jobs, PyTorch DDP on ml.g5 or ml.p3.8xlarge is more cost-effective. On the MLA-C01 exam, a stem citing "ml.p4d.24xlarge cluster, 50-billion-parameter LLM fine-tune, minimal code change" points to SMDP; a stem citing "ml.g5.12xlarge for budget reasons, smaller model" points to PyTorch DDP without the SageMaker library.
Model Parallelism - When the Model Does Not Fit
Model parallelism is the answer when a model is too large for any single GPU. The MLA-C01 exam tests when to switch from data parallel to model parallel.
Pipeline Parallelism
Pipeline parallelism splits the model layer-wise across GPUs. Layers 1-25 of a 100-layer transformer might run on GPU 0, layers 26-50 on GPU 1, and so on. Each microbatch flows through the pipeline; while GPU 0 is processing microbatch 2, GPU 1 is processing microbatch 1. The pipeline keeps GPUs busy except at the start (filling the pipeline) and end (draining it) — the "pipeline bubble" overhead. Pipeline parallelism reduces per-GPU memory by the number of stages, letting models that need 200 GB run on 4x80GB GPUs.
Tensor Parallelism
Tensor parallelism slices individual operations across GPUs. A large matrix multiplication that needs 100 GB of memory becomes four 25 GB partial matrix multiplications running on four GPUs in parallel, with results combined via AllReduce. Tensor parallelism has high inter-GPU communication — it should only span GPUs connected by NVLink (within one instance), not across instances. Crossing the network for tensor parallelism kills throughput.
The SageMaker Model Parallel Library (SMP)
SMP is an AWS library that automates pipeline and tensor parallelism for PyTorch and TensorFlow. SMP partitions the model graph automatically, places partitions on GPUs, schedules microbatches through the pipeline, and inserts the necessary collective operations. Engineers configure parallelism degrees (pipeline_parallel_degree, tensor_parallel_degree) in the Estimator distribution dictionary, and SMP handles the rest. SMP supports automatic mixed precision (FP16/BF16), activation checkpointing (recompute activations during backward to reduce memory), and integration with FSDP (Fully Sharded Data Parallel) for additional memory savings.
Combining Data and Model Parallelism
For very large models, you combine both. Across nodes, replicate the model with data parallelism (each node trains on its own data shard). Within a node, split the model with tensor and pipeline parallelism (the model is too large for one GPU but fits across the 8 GPUs of one ml.p4d.24xlarge). This is sometimes called "3D parallelism" — data, pipeline, tensor — and it is what trains models above 100 billion parameters. SMP supports this hybrid out of the box.
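A hedged sketch of what that hybrid configuration can look like in the Estimator; the exact parameter keys vary by SMP library version, and the script name, role, and parallelism degrees here are illustrative assumptions rather than a drop-in setup.

```python
# Hedged sketch: SMP model parallelism combined with data parallelism across nodes.
# Parameter keys follow the SMP naming used in this guide; verify against the library
# version you deploy. Script, role, and degrees are placeholders.
from sagemaker.pytorch import PyTorch

smp_parameters = {
    "pipeline_parallel_degree": 2,    # split layers into 2 pipeline stages
    "tensor_parallel_degree": 4,      # shard large layers across 4 NVLink-connected GPUs
    "microbatches": 8,                # microbatches flowing through the pipeline
    "ddp": True,                      # replicate the partitioned model with data parallelism
}

estimator = PyTorch(
    entry_point="train_smp.py",       # hypothetical script instrumented for SMP
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder role
    instance_type="ml.p4d.24xlarge",
    instance_count=2,                 # model parallel within each node, data parallel across
    framework_version="1.13",
    py_version="py39",
    distribution={
        "smdistributed": {"modelparallel": {"enabled": True, "parameters": smp_parameters}},
        "mpi": {"enabled": True, "processes_per_host": 8},
    },
)
```

With 2 x 8 = 16 GPUs and a pipeline degree of 2 times a tensor degree of 4 (8 GPUs per model replica), the remaining factor of 2 becomes the data-parallel degree — the hybrid described above.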
Tensor parallelism across separate instances destroys throughput because the AllReduce on every layer crosses the network. Tensor parallelism should be confined to GPUs within a single instance connected by NVLink (which inside an ml.p4d.24xlarge provides 600 GB/s GPU-to-GPU bandwidth). Pipeline parallelism is more network-tolerant because activations only cross between GPUs at stage boundaries, not at every operation. The exam will plant a stem like "configured tensor parallelism across 4 nodes and saw 30 percent throughput of single-node training" — the correct fix is keep tensor parallelism within one node and use pipeline parallelism plus data parallelism across nodes. Mistaking the parallelism types is one of the top community-cited distributed training pain points.
Instance Selection for Distributed Training
Instance choice is half the distributed training decision. The exam tests recognition of which instances support which features.
The GPU Instance Family
For MLA-C01, a handful of GPU instance types dominate distributed training questions.
- ml.p3.16xlarge — 8 V100 GPUs (16 GB), 25 Gbps networking. Older but cost-effective for moderate-sized training. Within the p3 family, EFA is available on ml.p3dn.24xlarge (8 V100s with 32 GB each, 100 Gbps networking), not on the 16xlarge or smaller sizes.
- ml.p4d.24xlarge — 8 A100 GPUs (40 GB), 400 Gbps EFA networking. The workhorse for large-scale data-parallel and model-parallel training in the 2023-2024 era.
- ml.p4de.24xlarge — 8 A100 GPUs (80 GB), 400 Gbps EFA networking. The 80GB A100 lets bigger models fit per GPU.
- ml.p5.48xlarge — 8 H100 GPUs (80 GB), 3200 Gbps EFA networking. The current state-of-the-art for foundation model training on AWS.
- ml.g5.48xlarge — 8 A10G GPUs (24 GB), 100 Gbps networking. Cheaper than p4d, suitable for smaller-scale data parallelism but lacks EFA.
- ml.g4dn.12xlarge — 4 T4 GPUs (16 GB), modest networking. Suitable for inference and small training; not recommended for large distributed training.
Why GPU Memory Matters Even With Model Parallelism
Larger GPU memory shifts the threshold at which you need model parallelism. A 70B model in FP16 weighs ~140 GB; on 80GB H100s you need only 2 GPUs of model parallelism for weights, leaving room for data parallelism across many nodes. On 40GB A100s you need 4 GPUs of model parallelism, halving your data-parallel scaling. The exam asks "given a 30B parameter model, which instance lets us train with the simplest configuration" — pick the largest GPU memory you can afford.
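A back-of-envelope way to reason about that threshold, assuming 2 bytes per parameter for FP16/BF16 weights and roughly 16 extra bytes per parameter for gradients plus Adam optimiser state — illustrative rules of thumb, not AWS-published figures.

```python
# Rough rule-of-thumb memory estimate (illustrative assumptions, not an AWS formula).
def min_model_parallel_degree(params_billions, gpu_mem_gb=80, bytes_per_param=2,
                              include_optimizer_state=False):
    per_param = bytes_per_param
    if include_optimizer_state:
        # FP16 grads + FP32 master weights + two FP32 Adam moments ~ 16 extra bytes/param
        per_param += 16
    total_gb = params_billions * 1e9 * per_param / 1e9
    degree = -(-total_gb // gpu_mem_gb)          # ceiling division
    return int(degree), total_gb

# 70B weights in FP16 on 80 GB GPUs -> ~140 GB, so 2-way model parallelism for weights alone
print(min_model_parallel_degree(70, gpu_mem_gb=80))
# The same model on 40 GB A100s -> 4-way, matching the paragraph above
print(min_model_parallel_degree(70, gpu_mem_gb=40))
```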
Networking - Why EFA Is the Distributed-Training Multiplier
Elastic Fabric Adapter (EFA) is a Libfabric-based networking interface that bypasses the OS kernel, giving far lower latency and higher throughput between instances than standard TCP networking. EFA is required for distributed training to scale beyond a few nodes — without it, AllReduce collective operations across the network become the bottleneck at 8+ nodes. EFA is available on ml.p3dn.24xlarge, ml.p4d, ml.p4de, ml.p5, and a handful of other large instances. SageMaker enables EFA automatically when you configure distributed training on supported instances and request a placement group.
Placement Groups and Co-location
SageMaker automatically places distributed training instances in a cluster placement group, packing all instances close together within a single Availability Zone for low-latency inter-node communication. You do not configure this manually — SageMaker handles it. But you do need to be aware that distributed training jobs may queue waiting for placement-group capacity in busy regions.
Always enable EFA when running distributed training across more than two ml.p3dn.24xlarge, ml.p4d, ml.p4de, or ml.p5 instances. SageMaker enables EFA automatically when you select these instance types and configure SMDP or SMP. EFA delivers 4-10x AllReduce throughput compared to standard ENI networking on the same instances, which translates directly into linear vs sublinear scaling. The cost difference is zero — EFA is included with these instances at no extra charge — so the only reason to skip it is if your workload genuinely fits on a single instance, in which case you do not need distributed training at all. On the exam, any answer that proposes ml.p4d for distributed training but omits EFA networking is suboptimal.
SageMaker Training Compiler - The Optimisation Layer
Training Compiler is an XLA-based graph optimisation layer that compiles PyTorch and TensorFlow models for the target accelerator hardware, often delivering 20-50 percent speed-up on the same hardware.
What the Training Compiler Does
The Training Compiler analyses the training graph and applies kernel fusion, memory layout optimisation, mixed-precision conversion, and operator scheduling. It produces a hardware-optimised compiled graph that runs faster than the unoptimised PyTorch/TF version. The speed-up varies by model — transformer architectures often see 30 percent improvements, while convolutional models may see less.
When to Use Training Compiler vs Skip It
Training Compiler adds compilation time at the start of every training job — typically 5-15 minutes. For short training runs, the compilation overhead may exceed the speed-up gain. Training Compiler is most beneficial for runs of 4 hours or more. It also has model-architecture restrictions — not every PyTorch operator is supported, and unsupported operators force fallback to eager execution, eroding the speed-up. Test compatibility on a small representative training job before committing to a full multi-day distributed run with the compiler enabled.
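A hedged sketch of turning the compiler on for a Hugging Face transformer job; the framework versions, script name, and role below are placeholders, and the supported version matrix should be checked in the Training Compiler documentation before committing to a long run.

```python
# Hedged sketch: enabling SageMaker Training Compiler on a Hugging Face estimator.
# Versions, script name, and role are placeholders; verify supported combinations first.
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

estimator = HuggingFace(
    entry_point="train.py",              # hypothetical fine-tuning script
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder role
    instance_type="ml.p4d.24xlarge",
    instance_count=1,
    transformers_version="4.21",
    pytorch_version="1.11",
    py_version="py38",
    compiler_config=TrainingCompilerConfig(),   # turn on XLA-based graph compilation
)
```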
Combining Training Compiler with SMDP and SMP
Training Compiler stacks with both SMDP (data parallelism) and SMP (model parallelism). The configuration is layered: SMDP/SMP define how the work is partitioned across GPUs; Training Compiler optimises the work that runs on each GPU. The combined speed-up is multiplicative rather than additive — a 30 percent compiler speed-up on top of 80 percent data-parallel scaling efficiency is meaningful at scale.
Checkpointing for Resilient Distributed Training
Distributed training jobs run for hours or days. Without checkpointing, any interruption — Spot reclaim, instance failure, network partition — destroys all progress.
S3 Checkpoint Configuration
SageMaker training jobs accept a checkpoint_s3_uri and checkpoint_local_path in the Estimator. Files written to the local checkpoint path during training are automatically synced to S3, and on job restart, S3 contents are restored to the local path. The training script must (1) periodically save model state to the local path and (2) check for existing checkpoints at start and resume from them.
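A hedged sketch of the Estimator side, combining Managed Spot with checkpointing; bucket names, time limits, and versions are placeholders.

```python
# Hedged sketch: Managed Spot Training with S3 checkpointing.
# The training script must save to the local checkpoint path and resume from it on restart.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder role
    instance_type="ml.p4d.24xlarge",
    instance_count=2,
    framework_version="2.0",
    py_version="py310",
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    use_spot_instances=True,
    max_run=24 * 3600,                      # max training time in seconds
    max_wait=36 * 3600,                     # must be >= max_run; allows waiting for Spot capacity
    checkpoint_s3_uri="s3://example-bucket/checkpoints/job-name/",
    checkpoint_local_path="/opt/ml/checkpoints",   # SageMaker syncs this path to S3
)
```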
Distributed Checkpoint Patterns
For data-parallel training, only one GPU writes the checkpoint — typically rank 0. The other ranks have identical model state, so duplicate writes waste S3 throughput. SMP provides distributed checkpoint utilities that gather sharded model state from all ranks into a single coherent checkpoint, then write it from rank 0.
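On the training-script side, the rank-0 pattern looks roughly like this sketch; paths and filenames are illustrative, and for sharded model-parallel state the SMP checkpoint utilities should be used instead of plain torch.save.

```python
# Hedged sketch of the rank-0 checkpoint pattern for a data-parallel job.
import os
import torch
import torch.distributed as dist

CKPT_DIR = "/opt/ml/checkpoints"             # synced to checkpoint_s3_uri by SageMaker
CKPT_FILE = os.path.join(CKPT_DIR, "latest.pt")

def save_checkpoint(model, optimizer, step):
    if dist.get_rank() == 0:                 # only one replica writes; all hold identical state
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step": step}, CKPT_FILE)
    dist.barrier()                           # keep ranks in step around the write

def maybe_resume(model, optimizer):
    if os.path.exists(CKPT_FILE):            # restored from S3 before the job restarts
        ckpt = torch.load(CKPT_FILE, map_location="cpu")
        model.load_state_dict(ckpt["model"])
        optimizer.load_state_dict(ckpt["optimizer"])
        return ckpt["step"]
    return 0                                 # fresh start if no prior checkpoint exists
```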
Checkpoint Frequency Trade-off
Frequent checkpointing protects against interruption but costs IO time. Typical patterns: every epoch for short jobs (under 1 hour per epoch), every N steps for long jobs where one epoch may take 8+ hours. Spot training jobs benefit from more frequent checkpointing — the goal is to lose less than 30 minutes of work on a Spot reclaim.
Distributed training without S3 checkpointing is not production-grade — every Spot interruption, every node failure, and every job restart destroys all progress. SageMaker's checkpoint_s3_uri parameter automates the sync between local instance storage and S3; on restart, the prior checkpoint is restored before training resumes. For Managed Spot Training (up to 90 percent cost savings), checkpointing is not optional — Spot reclaim probability over a multi-hour training job is high enough that an unrecoverable job will eventually happen. The exam tests this combination: any answer combining "Managed Spot Training" with distributed training MUST also configure checkpoint paths; an answer that proposes Spot without checkpoints is wrong.
When NOT to Use Distributed Training
Distributed training has overhead. Synchronisation cost across GPUs and across nodes adds latency that can exceed the speed-up benefit for small workloads.
The Synchronisation Tax
Every batch in data parallelism requires an AllReduce across all GPUs. AllReduce time grows with cluster size and is bounded by network bandwidth. For a small model with a small per-GPU batch, the sync time may exceed the per-GPU compute time — meaning more GPUs makes the job slower, not faster. The break-even point depends on model size, batch size, and network. Rule of thumb: under 100 million parameters with batches under 64, distributed training rarely pays off.
The Communication-Computation Ratio
Data-parallel scaling efficiency depends on the ratio of compute time per batch to communication time per batch. Large models with large batches have high compute time and amortise the communication cost — they scale efficiently to dozens of nodes. Small models with small batches have low compute time, so communication dominates — they scale poorly past 2-4 nodes.
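A rough way to see the ratio at work, using the standard ring-AllReduce estimate that each GPU moves about 2 x (N-1)/N x gradient-bytes per step; the bandwidth and compute-time figures below are illustrative assumptions, not benchmarks.

```python
# Back-of-envelope scaling-efficiency estimate (illustrative assumptions, not a benchmark).
# Ring AllReduce moves roughly 2 * (N-1)/N * gradient_bytes per GPU each step.
def scaling_efficiency(params, bytes_per_grad, compute_s_per_step, n_gpus, net_gbytes_per_s):
    grad_bytes = params * bytes_per_grad
    comm_s = 2 * (n_gpus - 1) / n_gpus * grad_bytes / (net_gbytes_per_s * 1e9)
    return compute_s_per_step / (compute_s_per_step + comm_s)

# Large model, large batch: 10B FP16 gradients, 2 s of compute per step, 32 GPUs, ~50 GB/s fabric
print(scaling_efficiency(10e9, 2, 2.0, 32, 50))     # communication is mostly amortised
# Small model, small batch: 100M gradients, 5 ms of compute per step, same cluster
print(scaling_efficiency(100e6, 2, 0.005, 32, 50))  # communication dominates the step time
```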
Single-Node Multi-GPU First
Before adding instances, first scale within one instance. An ml.p4d.24xlarge has 8 GPUs connected by NVLink at 600 GB/s — far faster than EFA between instances. Training that fits in one ml.p4d.24xlarge with data parallelism across the 8 internal GPUs scales near-linearly without any cross-instance overhead. Only when one instance is insufficient should you scale to multiple instances.
Increase Batch Size Instead
If GPU memory has slack, increasing batch size improves throughput without distribution overhead. An ml.p4d.24xlarge with batch size 32 may be GPU-memory-underutilised; bumping to batch size 256 keeps the same instance count but processes 8x more data per second. When the larger effective batch does not fit in memory all at once, gradient accumulation reaches the same effective batch size by accumulating gradients over several smaller micro-batches before each optimiser step (see the sketch below).
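A minimal sketch of gradient accumulation in PyTorch; the tiny model and random data stand in for your real training loop.

```python
# Hedged sketch of gradient accumulation: reach a large effective batch without holding
# it all in GPU memory at once. The tiny model and random data are placeholders.
import torch

model = torch.nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accumulation_steps = 8                                   # effective batch = micro-batch x 8

optimizer.zero_grad()
for step in range(64):
    x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))   # one micro-batch
    loss = torch.nn.functional.cross_entropy(model(x), y) / accumulation_steps
    loss.backward()                                      # gradients accumulate in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                                 # one update per 8 micro-batches
        optimizer.zero_grad()
```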
Common MLA-C01 Exam Traps for Distributed Training
The exam plants specific misconceptions. Recognise them.
Trap 1 - Choosing Model Parallelism for a Small Model
Engineers fresh from foundation-model articles default to model parallelism. For a 100M-parameter model that fits comfortably in one GPU, model parallelism adds pipeline overhead with zero benefit. Data parallelism (or no parallelism at all) is correct.
Trap 2 - Using Tensor Parallelism Across Instances
Tensor parallelism is communication-heavy and must stay within one instance (NVLink). Spreading it across instances ruins throughput. Pipeline parallelism is the cross-instance model-parallel choice; tensor parallelism is the within-instance choice.
Trap 3 - Skipping EFA on Multi-Node Distributed Training
Without EFA, multi-node training is bottlenecked by ENI networking. The exam lists ml.p4d without EFA as a wrong answer; the correct choice always includes EFA.
Trap 4 - Distributed Training Without Checkpointing
Especially with Managed Spot Training. Spot reclaim is probabilistic; a multi-hour job will eventually be reclaimed. No checkpoint means starting over.
Trap 5 - Assuming Linear Scaling
Doubling GPUs does not halve training time. AllReduce overhead, IO, and pipeline bubbles cause sublinear scaling. Real-world scaling efficiency at 32 GPUs is typically 60-80 percent; at 256 GPUs it can drop to 40-60 percent without careful tuning.
Trap 6 - Training Compiler Always Speeds Things Up
False. Compilation time can exceed the speed-up for short jobs. Test on a representative job first.
Trap 7 - Ignoring IO as the Real Bottleneck
If GPU utilisation is below 80 percent, the bottleneck is IO or CPU preprocessing, not GPU compute. Adding more GPUs does not help; fixing the data pipeline does.
Trap 8 - Spot Distributed Training Without Hyperparameter Considerations
Distributed Spot can interrupt mid-training. Hyperparameters that depend on consistent step counts (warmup schedules, LR decay schedules) only behave correctly if the script checkpoints and restores the step counter and scheduler state, so the schedule resumes where it left off rather than restarting from step zero.
Trap 9 - Confusing SMDP With SMP
SMDP is data parallel — every GPU has the full model. SMP is model parallel — each GPU has a fraction. They are different libraries solving different problems. The exam plants this distinction repeatedly.
Trap 10 - Mixing Horovod with SMDP
Pick one distribution framework. Mixing them produces unpredictable behaviour. SageMaker Estimator's distribution argument selects exactly one strategy.
FAQ - Distributed Training on SageMaker
Q1 - When should I use SageMaker Data Parallel (SMDP) vs PyTorch DDP?
Use SMDP when you are running on EFA-equipped instances (ml.p3dn.24xlarge, ml.p4d, ml.p4de, ml.p5) and want maximum AllReduce throughput on AWS-specific topology. SMDP outperforms NCCL on these instances by 20-40 percent on multi-node configurations. Use PyTorch DDP when you are on smaller instances without EFA (ml.g5, ml.p3.8xlarge), when your team has existing DDP code and minimal AWS-specific tuning budget, or when portability between AWS and other clouds matters. The MLA-C01 exam favours SMDP for any large p4d/p5 distributed training scenario; PyTorch DDP shows up as the answer for smaller cost-constrained training.
Q2 - My 70-billion-parameter model does not fit on a single A100 80GB. What is the simplest distributed training configuration?
Pipeline parallelism across the 8 GPUs of one ml.p4de.24xlarge or ml.p5.48xlarge. The 70B model in FP16 weights is ~140 GB; spread across 8 GPUs that is 17.5 GB per GPU for weights, leaving room for activations and optimiser state. Configure SMP with pipeline_parallel_degree=8, enable activation checkpointing to reduce activation memory, and use BF16 mixed precision so weights stay at 2 bytes per parameter rather than FP32's 4. If a single node is insufficient (e.g., training also requires data parallelism for throughput), scale to multiple nodes with hybrid pipeline + data parallelism — pipeline parallel within node, data parallel across nodes.
Q3 - I configured 8 ml.p4d.24xlarge nodes but training is barely faster than 1 node. What did I miss?
Run through the diagnostic tree. (1) Is EFA enabled — without EFA, ENI bandwidth bottlenecks AllReduce. SageMaker enables EFA automatically on supported instances; verify in CloudWatch metrics. (2) Is the data loader the bottleneck — check GPU utilisation. If under 80 percent, your dataset IO or preprocessing is starving GPUs. Switch to FastFile mode or FSx for Lustre. (3) Is the batch size too small — small per-GPU batches make communication cost dominate. Increase batch size or use gradient accumulation. (4) Is the AllReduce strategy correct — SMDP's optimised AllReduce often outperforms naive NCCL. (5) Are the nodes co-located — SageMaker should place them in a placement group automatically; if you are running on raw EC2, you must configure this manually.
Q4 - Can I use Managed Spot Training for distributed jobs?
Yes, but you must configure checkpointing. Managed Spot saves up to 90 percent on training cost, but Spot capacity can be reclaimed with a two-minute warning. For distributed training, configure checkpoint_s3_uri and checkpoint_local_path in the Estimator. The training script must save state to the local path periodically and restore from S3 on start. SageMaker handles the S3 sync automatically. Expect interruptions on jobs longer than a few hours; tune checkpoint frequency so each interruption costs less than 30 minutes of wall-clock time.
Q5 - What is the difference between pipeline parallelism and tensor parallelism, and when do I use each?
Pipeline parallelism splits the model along the layer axis — different GPUs hold different layers. Communication happens at layer boundaries (activations passed between stages). Pipeline parallelism is communication-light and tolerates the EFA latency between instances, making it the right choice for cross-instance model parallelism. Tensor parallelism splits individual operations (matrix multiplications) across GPUs. Communication happens within every operation, demanding NVLink-class bandwidth — tensor parallelism only works within one instance. Combine them: tensor parallelism across the 8 GPUs within a node (NVLink), pipeline parallelism across nodes (EFA), data parallelism replicating the whole structure across more groups of nodes.
Q6 - I want to use the SageMaker Training Compiler with my distributed PyTorch job. What should I expect?
Expect 5-15 minutes of compilation overhead at job start, then 20-50 percent speed-up on the per-GPU compute. The compiler stacks cleanly with SMDP — configure distribution={"smdistributed": {"dataparallel": {"enabled": True}}} and compiler_config=TrainingCompilerConfig(). Run a short representative test (e.g., 30 minutes) first to verify (1) all your operators are supported (unsupported operators fall back to eager execution and erode speed-up) and (2) your batch size still fits — the compiler can change memory layout and a batch that fit pre-compilation might OOM post-compilation. The compiler is most worthwhile on training runs of 4+ hours where the compilation cost amortises.
Q7 - My cluster sometimes fails to launch with "insufficient capacity" errors. How do I handle this?
Distributed training requires placement-group capacity in the same AZ, which can be scarce in busy regions for ml.p4d and ml.p5. Four mitigations. (1) Use SageMaker's training-job retry logic — set max_retry_attempts so the job re-queues automatically. (2) Choose a less-busy region for training (us-west-2 often has more p4d/p5 capacity than us-east-1). (3) Reserve capacity via Capacity Reservations or Savings Plans for predictable training schedules. (4) Use warm pools — once you successfully claim cluster capacity, warm pools keep it allocated for subsequent training jobs at modest cost. For one-off large fine-tunes, the simple retry is usually enough; for production retraining cadences, reserve capacity.
Further Reading - Official AWS Documentation for Distributed Training
For depth beyond MLA-C01 scope, the authoritative AWS sources are: SageMaker Distributed Training landing page (overview and decision guide), SageMaker Data Parallel Library documentation (SMDP API reference and tuning), SageMaker Model Parallel Library documentation (SMP configuration and parallelism strategies), SageMaker Training Compiler documentation (supported operators and benchmark results), Elastic Fabric Adapter documentation (networking concepts and supported instances), and the SageMaker Training instance type matrix (which instances support which features).
The AWS Machine Learning Blog has practical case studies on training large models with SMP — including walkthroughs of fine-tuning Llama-class models, BERT pre-training, and computer vision at scale. The AWS Well-Architected Machine Learning Lens covers performance pillar guidance for distributed training architectures. AWS re:Invent and re:MARS sessions on distributed training include hands-on demos of SMP and SMDP. Finally, the GitHub repositories awslabs/sagemaker-distributed-training-workshop and the SageMaker examples under aws/amazon-sagemaker-examples contain end-to-end notebook examples for both data parallelism and model parallelism scenarios.