IAM, KMS, and VPC for SageMaker workloads is the security backbone of every production ML system on AWS — and the MLA-C01 exam tests it heavily because it is exactly the layer where engineering candidates fall down. Domain 4 Task 4.3 asks you to design execution roles with least privilege, encrypt training data and model artifacts with customer-managed KMS keys, isolate training jobs and endpoints inside a VPC with no internet access, and prove the chain holds up to confused-deputy attacks and cross-account abuse. The traps are concrete: a KMS key policy that omits the SageMaker service principal, a VPC config that activates without an S3 VPC endpoint, a network isolation toggle that conflicts with the very Docker image pull the training job needs. This note covers IAM, KMS, and VPC for SageMaker end-to-end, then walks the troubleshooting decision trees for the AccessDenied scenarios MLA-C01 plants in scenario stems.
This guide is engineered for the MLA-C01 ML engineer perspective. It assumes you can launch a SageMaker training job and create an endpoint, but want to see how IAM, KMS, and VPC for SageMaker compose into a production-grade security architecture that satisfies regulated workloads — HIPAA, PCI-DSS, FedRAMP — and the kinds of compliance drills the exam loves to set up. We cover IAM execution roles, role chaining for cross-account training, KMS at rest and in transit, VPC isolation with private subnets, the network isolation mode toggle, the S3 VPC endpoint requirement, PrivateLink interface endpoints, and CloudTrail audit. Every callout points to the official AWS documentation so the source of truth for IAM, KMS, and VPC for SageMaker is one click away.
What Is IAM, KMS, and VPC for SageMaker?
IAM, KMS, and VPC for SageMaker is the three-layer security model AWS expects you to apply to every SageMaker workload. IAM controls who and what can call SageMaker APIs and what those callers can do once a SageMaker job is running. KMS controls how data at rest (training datasets in S3, EBS volumes attached to training instances, model artifacts, endpoint storage) is encrypted with keys you own and can revoke. VPC controls the network boundary inside which the training container or inference container runs, whether it can reach the public internet, and which AWS services it can talk to via private endpoints. Together, these three layers form the security envelope around every SageMaker workload.
Why Three Layers, Not One
A common engineering mistake is to assume that IAM alone is enough. IAM, KMS, and VPC for SageMaker each protect a different attack surface. IAM protects against an attacker who steals a credential and tries to call a SageMaker API. KMS protects against an attacker who reads bytes from an S3 bucket or an EBS snapshot but does not have the key. VPC protects against an attacker who sits inside the training container itself — through a malicious dependency or a poisoned base image — and tries to exfiltrate data over the public internet. A well-architected SageMaker workload assumes any one layer can be breached and still keeps the data inside the envelope. The MLA-C01 exam tests this layered thinking through scenario stems where two layers are correct and one is missing.
Where the Shared Responsibility Line Sits
AWS manages the SageMaker control plane, the EC2 hosts that run training containers, and the underlying network fabric. The customer manages the IAM execution role, the KMS keys and key policies, the VPC and subnet design, the security groups, the S3 bucket policies that scope training data access, and the container code itself. For IAM, KMS, and VPC for SageMaker, the customer side is where exam questions live. AWS does not give you a "secure by default" SageMaker — you must configure each layer.
Plain-Language Explanation: IAM, KMS, and VPC for SageMaker
Three concrete analogies make the structure of IAM, KMS, and VPC for SageMaker stick. Choose the one that matches your mental model.
Analogy 1 — The Hospital Research Lab
Imagine a hospital research lab studying patient genetic data. IAM, KMS, and VPC for SageMaker maps to three real-world controls: the lab access badges (IAM execution role) that say which researchers can enter which rooms and which equipment they can operate, the biometric vault (KMS) that holds patient samples and only opens when the chief medical officer's fingerprint authorizes a specific researcher to retrieve a specific sample, and the isolation chamber (VPC private subnet with network isolation mode) where a researcher analyzing a dangerous pathogen works in a sealed room with no door to the outside world — only filtered air vents (S3 VPC endpoint, ECR VPC endpoint) connect them to the supply room. A bad actor who steals a badge cannot retrieve samples without the vault. A bad actor who breaks into the vault without a researcher's credentials is denied. A bad actor inside the chamber cannot leak samples to the outside because the chamber has no internet door. That triple-control pattern is exactly what IAM, KMS, and VPC for SageMaker enforces.
Analogy 2 — The Bank Vault Inside a Bank Inside a Compound
Picture a private bank with a perimeter wall, a bank building, a vault room, and a safety-deposit box. The perimeter wall and guards (VPC and security groups) decide who gets onto the property and which doors they can approach. The vault key cards (IAM execution role) decide which staff can swipe into the vault room. The safety-deposit box keys (KMS customer-managed keys) decide which specific box a staff member can open even after they reach the vault. A robber who scales the perimeter wall but has no key card cannot enter the vault. A robber with a stolen key card who reaches the vault but lacks the box key cannot open any deposit. A robber who somehow steals all three is still constrained by the dual-control rule (KMS grants combined with IAM conditions and CloudTrail audit) requiring two principals to authorize a withdrawal of high-value items. This is the layered defence-in-depth pattern that IAM, KMS, and VPC for SageMaker is built on.
Analogy 3 — The Top-Secret Workshop
Imagine a defence contractor's top-secret R&D workshop. The employee ID and clearance level (IAM execution role and IAM conditions) say which engineer can sign into which workstation and what work they may perform. The encrypted hard drive in each workstation (KMS encryption of EBS volumes) means that if a thief steals the workstation, the data is unreadable without the corporate HSM. The secure compartmentalized facility (VPC with private subnets, network isolation mode, and PrivateLink endpoints) means the workshop has no public phone line: engineers communicate with the parts warehouse (S3) only through a sealed pneumatic tube (S3 VPC endpoint), and the security officer (VPC endpoint policy) inspects every package that passes through it. Even engineers cleared for the project cannot exfiltrate designs, because the room has no external network. That is IAM, KMS, and VPC for SageMaker in production.
SageMaker IAM Execution Roles — The Foundation
The SageMaker execution role is the IAM role assumed by the SageMaker service when it runs a training job, processing job, or hosts an endpoint on your behalf. It is the most heavily tested IAM concept in IAM, KMS, and VPC for SageMaker.
What the Execution Role Actually Does
When you call CreateTrainingJob, you pass a RoleArn. SageMaker assumes that role and uses the resulting credentials to do everything on your behalf for the duration of the job: pull the Docker image from ECR, download the training data from S3, write checkpoints back to S3, write logs to CloudWatch, encrypt EBS volumes with KMS, attach ENIs in your VPC, and upload the final model artifact to S3. Every API call the training container needs is authorized by the execution role's permissions. If you omit s3:GetObject for the training-data prefix, the job fails with AccessDenied during data download. If you omit kms:Decrypt for the dataset KMS key, the job fails before training starts.
Trust Policy — Who Can Assume the Role
The execution role's trust policy must allow the SageMaker service principal sagemaker.amazonaws.com to assume the role. To defeat confused-deputy attacks, add an aws:SourceAccount condition pinning the role to your AWS account ID and an aws:SourceArn condition pinning to the specific resource (training job ARN, endpoint ARN). Without these conditions, a misconfigured account could be tricked into letting another customer's SageMaker job assume your role.
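As a minimal sketch of the trust policy described above, the following Python builds the JSON document with both confused-deputy conditions. The account ID, region, and function name are illustrative assumptions; a tighter aws:SourceArn pattern (for example, a specific training job prefix) may suit your setup better.

```python
import json


def build_trust_policy(account_id: str, region: str) -> dict:
    """Trust policy that lets SageMaker assume the role only for this account."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": "sts:AssumeRole",
                "Condition": {
                    # Pin the caller to our own account.
                    "StringEquals": {"aws:SourceAccount": account_id},
                    # Pin the calling resource to SageMaker resources in our account.
                    "ArnLike": {
                        "aws:SourceArn": f"arn:aws:sagemaker:{region}:{account_id}:*"
                    },
                },
            }
        ],
    }


# Hypothetical account ID and region, for illustration only.
trust_policy = build_trust_policy("111122223333", "us-east-1")
print(json.dumps(trust_policy, indent=2))
```

The resulting document is what you would supply as the role's assume-role policy when creating the execution role.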
The Permission Boundary Pattern for SageMaker Roles
Production teams attach a permission boundary to the execution role that caps the maximum permissions the role can ever have. This protects against a developer who accidentally adds iam:* to the role policy — the boundary still blocks it. For IAM, KMS, and VPC for SageMaker, a typical boundary allows S3 read/write to specific bucket prefixes, ECR pull from specific repositories, KMS encrypt/decrypt for specific keys, CloudWatch Logs write, and EC2 ENI operations only inside a tagged VPC. Anything else is denied even if the inline policy attempts to grant it.
The SageMaker execution role is the single most over-permissioned IAM role in most AWS accounts. Because the AWS console wizard offers an AmazonSageMakerFullAccess managed policy as the default, teams ship to production with s3:* on every bucket and iam:PassRole on *. Both are violations of least privilege. For IAM, KMS, and VPC for SageMaker, scope the execution role to: specific S3 bucket prefixes for input data and output artifacts, specific ECR repositories for the training image, specific KMS keys for encrypt/decrypt, specific CloudWatch Logs groups for the job, and iam:PassRole only to the same execution role for nested operations. Use SageMaker Role Manager or hand-craft the policy from the API permissions reference. Never ship AmazonSageMakerFullAccess to production.
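The scoping rules above can be sketched as a policy builder. This is an illustrative starting point, not a complete policy: the bucket, prefixes, ARNs, and Sid names are assumptions, and the sketch leaves out the EC2 network interface permissions that VPC mode needs and the iam:PassRole statement.

```python
import json


def build_execution_role_policy(bucket, input_prefix, output_prefix,
                                ecr_repo_arn, kms_key_arn, log_group_arn):
    """Least-privilege identity policy for a training execution role (sketch)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Sid": "ReadTrainingData", "Effect": "Allow",
             "Action": "s3:GetObject",
             "Resource": f"arn:aws:s3:::{bucket}/{input_prefix}/*"},
            {"Sid": "WriteArtifacts", "Effect": "Allow",
             "Action": "s3:PutObject",
             "Resource": f"arn:aws:s3:::{bucket}/{output_prefix}/*"},
            {"Sid": "ListScopedPrefixes", "Effect": "Allow",
             "Action": "s3:ListBucket",
             "Resource": f"arn:aws:s3:::{bucket}",
             "Condition": {"StringLike": {
                 "s3:prefix": [f"{input_prefix}/*", f"{output_prefix}/*"]}}},
            # ecr:GetAuthorizationToken does not support resource-level scoping.
            {"Sid": "EcrLogin", "Effect": "Allow",
             "Action": "ecr:GetAuthorizationToken", "Resource": "*"},
            {"Sid": "PullTrainingImage", "Effect": "Allow",
             "Action": ["ecr:BatchCheckLayerAvailability",
                        "ecr:GetDownloadUrlForLayer",
                        "ecr:BatchGetImage"],
             "Resource": ecr_repo_arn},
            {"Sid": "UseDataKey", "Effect": "Allow",
             "Action": ["kms:Decrypt", "kms:Encrypt", "kms:GenerateDataKey"],
             "Resource": kms_key_arn},
            {"Sid": "WriteJobLogs", "Effect": "Allow",
             "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
             "Resource": f"{log_group_arn}:*"},
        ],
    }


# All identifiers below are hypothetical.
role_policy = build_execution_role_policy(
    bucket="ml-data-example",
    input_prefix="curated/train",
    output_prefix="artifacts",
    ecr_repo_arn="arn:aws:ecr:us-east-1:111122223333:repository/training-image",
    kms_key_arn="arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
    log_group_arn="arn:aws:logs:us-east-1:111122223333:log-group:/aws/sagemaker/TrainingJobs",
)
print(json.dumps(role_policy, indent=2))
```

The only wildcard resource left is the ECR login call; every other statement names a specific bucket prefix, repository, key, or log group.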
Role Chaining for Cross-Account SageMaker Workflows
Real ML organizations split data across accounts — a Data account owns the curated S3 datasets, a Modelling account owns SageMaker training, a Production account owns the endpoint. Role chaining is how IAM, KMS, and VPC for SageMaker spans these accounts.
The Three-Account Pattern
The Data account's S3 bucket policy grants s3:GetObject on the dataset prefix to the Modelling account's SageMaker execution role ARN. The dataset's KMS key policy in the Data account grants kms:Decrypt to the same Modelling-account role. The Modelling account's execution role has s3:GetObject and kms:Decrypt in its inline policy targeting the Data-account bucket and key. Both sides of the cross-account permission must align — bucket policy grant and execution-role policy allow.
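To make the "both sides must align" point concrete, here is a hedged sketch that generates the Data-account bucket policy and the matching Modelling-account role statements from the same inputs. The function names, bucket, prefix, and ARNs are hypothetical; the Data-account key policy would need a corresponding statement naming the same role.

```python
def data_account_bucket_policy(bucket, prefix, modelling_role_arn):
    """Resource policy on the Data-account bucket (one side of the grant)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Sid": "CrossAccountRead", "Effect": "Allow",
             "Principal": {"AWS": modelling_role_arn},
             "Action": "s3:GetObject",
             "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*"},
        ],
    }


def modelling_role_statements(bucket, prefix, data_key_arn):
    """Identity-policy statements on the Modelling-account execution role (other side)."""
    return [
        {"Sid": "ReadDataAccountObjects", "Effect": "Allow",
         "Action": "s3:GetObject",
         "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*"},
        {"Sid": "DecryptDataAccountObjects", "Effect": "Allow",
         "Action": "kms:Decrypt",
         "Resource": data_key_arn},
    ]


# Hypothetical identifiers: account 111111111111 is Data, 222222222222 is Modelling.
role_arn = "arn:aws:iam::222222222222:role/sagemaker-training-exec"
key_arn = "arn:aws:kms:us-east-1:111111111111:key/EXAMPLE-KEY-ID"
bucket_policy = data_account_bucket_policy("curated-datasets-example", "fraud/v3", role_arn)
role_statements = modelling_role_statements("curated-datasets-example", "fraud/v3", key_arn)

# The same object ARN pattern must appear on both sides of the account boundary.
both_sides_match = (
    bucket_policy["Statement"][0]["Resource"] == role_statements[0]["Resource"]
)
print(both_sides_match)
```

Generating both documents from one set of inputs is one way to keep the two accounts from drifting apart.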
Cross-Account Role Assumption for Pipelines
For SageMaker Pipelines or CodePipeline that orchestrate training across accounts, the orchestration role assumes a deployment role in the target account. The trust policy on the deployment role allows the orchestration account principal with an aws:PrincipalTag condition. The orchestration role has sts:AssumeRole permission on the target deployment role ARN. The chain — orchestration role → assume → deployment role → assume → execution role — is the production CI/CD pattern.
Why aws:SourceArn Matters in Cross-Account Setups
A bucket policy grant of the form "any SageMaker role from account 222" is a confused-deputy vulnerability. Another customer's SageMaker training job, if their role ARN is reused or guessed, could read your data. Always pin aws:SourceArn to specific training job ARNs or specific role ARNs, and pair with aws:SourceAccount condition. The MLA-C01 exam plants stems where the bucket policy is missing these conditions and asks "what is the security flaw" — the answer is confused deputy.
KMS Encryption for SageMaker — At Rest and In Transit
KMS in IAM, KMS, and VPC for SageMaker is where regulated workloads either pass or fail audit. The exam tests three encryption surfaces: training data, model artifacts, and endpoint storage.
Training Data Encryption — the S3 Side
SageMaker downloads training data from S3 at job start. If the bucket uses SSE-KMS with a customer-managed key, the execution role must have kms:Decrypt on the key, and the key policy must allow the role principal to call Decrypt. Both sides — IAM policy and key policy — must align. The key policy is the gating layer; if the key policy denies, no IAM grant can override it. This is the single most common KMS failure on the MLA-C01 exam.
EBS Volume Encryption — the Training Instance Side
When SageMaker provisions training instances, the attached EBS volumes can be encrypted with either an AWS-managed KMS key or a customer-managed key supplied via the VolumeKmsKeyId parameter on the CreateTrainingJob API. For regulated workloads, always use a customer-managed key — it gives you key rotation control and CloudTrail audit of every Encrypt/Decrypt call. The execution role needs kms:CreateGrant to the volume key, because SageMaker creates a grant at job start that allows the underlying EC2 host to encrypt and decrypt the volume.
Model Artifact Encryption — the Output Side
The training job uploads the model artifact (model.tar.gz) to S3 at completion. Encrypt the output prefix with a customer-managed KMS key — separate from the input data key if you want to enforce different access patterns. Pass the output KMS key ARN via the OutputDataConfig.KmsKeyId parameter. The execution role needs kms:Encrypt, kms:GenerateDataKey, and kms:Decrypt on the output key.
Endpoint Storage Encryption — the Inference Side
When you create an endpoint, SageMaker provisions inference instances with EBS volumes. The endpoint configuration accepts a KmsKeyId parameter that encrypts these volumes. For real-time endpoints with network isolation, this is required to prevent disk-level data leakage. For asynchronous endpoints, the input and output S3 locations also need KMS configured.
Customer-managed KMS keys for SageMaker need the key policy and the execution role's IAM policy to line up. The KMS key policy is the authoritative gate: if it neither names the SageMaker execution role for the relevant operations (Decrypt, Encrypt, GenerateDataKey, CreateGrant) nor delegates access to IAM in the key's account, no IAM policy on the role can grant access. In cross-account setups the reverse also holds: if the key policy allows the role but the role's IAM policy does not allow the same operations, the call still fails. Within a single account, a key policy statement that names the role directly can be enough on its own, but granting on both sides keeps the configuration portable and easy to audit. The console wizard creates the policies correctly; CloudFormation or Terraform configurations regularly miss the key policy update and produce a KMSAccessDenied failure at job start.
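A sketch of the key-policy side, plus a small check that could run in CI to catch the missing-key-policy mistake before a job does. The helper names and ARNs are assumptions, and the check only reads plain Allow statements with a directly named principal; it is not a full policy evaluator.

```python
NEEDED_ACTIONS = {"kms:Decrypt", "kms:Encrypt", "kms:GenerateDataKey", "kms:CreateGrant"}


def key_policy_statements_for_role(role_arn):
    """Key-policy statements that name the execution role directly."""
    return [
        {"Sid": "AllowUseOfKey", "Effect": "Allow",
         "Principal": {"AWS": role_arn},
         "Action": ["kms:Decrypt", "kms:Encrypt", "kms:GenerateDataKey"],
         "Resource": "*"},  # in a key policy, "*" means "this key"
        {"Sid": "AllowGrantsForAwsResources", "Effect": "Allow",
         "Principal": {"AWS": role_arn},
         "Action": "kms:CreateGrant",
         "Resource": "*",
         # Only allow grants created on behalf of integrated AWS services.
         "Condition": {"Bool": {"kms:GrantIsForAWSResource": "true"}}},
    ]


def missing_key_policy_actions(statements, role_arn, needed=NEEDED_ACTIONS):
    """Return the needed actions that no Allow statement grants to role_arn."""
    allowed = set()
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        principals = stmt.get("Principal", {}).get("AWS", [])
        if isinstance(principals, str):
            principals = [principals]
        if role_arn not in principals:
            continue
        actions = stmt["Action"]
        allowed.update([actions] if isinstance(actions, str) else actions)
    return sorted(set(needed) - allowed)


# Hypothetical role ARN.
exec_role = "arn:aws:iam::111122223333:role/sagemaker-training-exec"
complete = key_policy_statements_for_role(exec_role)
print(missing_key_policy_actions(complete, exec_role))      # nothing missing
print(missing_key_policy_actions(complete[:1], exec_role))  # CreateGrant missing
```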
Inter-Container Traffic Encryption for Distributed Training
For distributed training with multiple instances, set EnableInterContainerTrafficEncryption=true on the training job. This encrypts the AllReduce traffic between training nodes with TLS. It is required for HIPAA, PCI-DSS, and FedRAMP-aligned workloads. The trade-off is a small throughput penalty — typically 5 to 10 percent — because TLS termination consumes CPU. For single-instance training, the flag is a no-op.
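The encryption parameters from this section can be gathered into a single CreateTrainingJob request. The sketch below only builds the request dictionary; the job name, image URI, S3 paths, instance type, and key ARNs are placeholders, and in practice the dictionary would be passed as keyword arguments to the boto3 SageMaker client's create_training_job call.

```python
def build_training_request(job_name, role_arn, image_uri, input_s3_uri,
                           output_s3_uri, volume_kms_key_arn,
                           output_kms_key_arn, instance_count=2):
    """CreateTrainingJob parameters with the three encryption settings applied."""
    return {
        "TrainingJobName": job_name,
        "RoleArn": role_arn,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": input_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        # Output side: encrypts model.tar.gz in S3.
        "OutputDataConfig": {
            "S3OutputPath": output_s3_uri,
            "KmsKeyId": output_kms_key_arn,
        },
        # Instance side: encrypts the attached EBS volume.
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 50,
            "VolumeKmsKeyId": volume_kms_key_arn,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
        # Only meaningful when more than one instance exchanges traffic.
        "EnableInterContainerTrafficEncryption": instance_count > 1,
    }


# Placeholder values throughout.
training_request = build_training_request(
    job_name="fraud-model-example",
    role_arn="arn:aws:iam::111122223333:role/sagemaker-training-exec",
    image_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/training-image:latest",
    input_s3_uri="s3://ml-data-example/curated/train/",
    output_s3_uri="s3://ml-data-example/artifacts/",
    volume_kms_key_arn="arn:aws:kms:us-east-1:111122223333:key/VOLUME-KEY-EXAMPLE",
    output_kms_key_arn="arn:aws:kms:us-east-1:111122223333:key/OUTPUT-KEY-EXAMPLE",
)
```

Using separate keys for the volume and the output, as here, is optional; it simply mirrors the suggestion to separate access patterns.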
VPC Isolation for SageMaker Training Jobs
VPC configuration is the network layer of IAM, KMS, and VPC for SageMaker. It is where compliance drills live.
Default Mode — No VPC
Without a VPC config on the training job, SageMaker runs the training container in an AWS-managed VPC with full internet access. This is fine for non-sensitive workloads but unacceptable for regulated data. The container can reach any internet endpoint, exfiltrate data, or download malicious code. Always configure a VPC for production.
Customer VPC Mode — Private Subnets
Pass VpcConfig with subnets and security groups to CreateTrainingJob. SageMaker creates ENIs in your subnets and runs the training container with those ENIs as its only network interface. The subnets should be private — no route to an Internet Gateway. Outbound traffic flows through a NAT Gateway (if internet is allowed) or through VPC endpoints (if internet is blocked).
Network Isolation Mode — No Internet At All
Set EnableNetworkIsolation=true to block all network access from the training container. The container cannot reach the internet, cannot reach S3 directly, cannot reach any AWS service except through the limited container-to-host channels SageMaker establishes for log delivery and metric publication. This is the highest-security mode. For training data ingress, SageMaker downloads the data to the EBS volume before the container starts, so the container does not need network access to read data. For S3 output, SageMaker uploads from the host after the container exits.
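The two network settings are separate fields on the CreateTrainingJob request. As a small sketch, the helper below layers them onto an existing request dictionary; the helper name, subnet IDs, and security group IDs are hypothetical.

```python
def with_network_controls(request, subnet_ids, security_group_ids, isolate=True):
    """Return a copy of a CreateTrainingJob request with VPC and isolation settings."""
    secured = dict(request)  # shallow copy so the caller's dict is untouched
    secured["VpcConfig"] = {
        "Subnets": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
    }
    secured["EnableNetworkIsolation"] = isolate
    return secured


# Hypothetical private subnets and security group.
base_request = {"TrainingJobName": "fraud-model-example"}
isolated_request = with_network_controls(
    base_request,
    subnet_ids=["subnet-0aaa1111bbbb2222c", "subnet-0ddd3333eeee4444f"],
    security_group_ids=["sg-0123456789abcdef0"],
)
```

Because the two fields are independent, passing isolate=False with the same subnets gives plain customer VPC mode.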
The S3 VPC Endpoint Requirement
When the training job is in a VPC and you allow internet access (network isolation off), but the S3 bucket is policy-restricted to your VPC, you need an S3 gateway VPC endpoint in the route table for the training subnets. Without it, the SageMaker host's call to S3 traverses the public internet, the bucket policy's aws:SourceVpc condition rejects it, and the training job fails to download data. The S3 gateway endpoint is free; always include it.
ECR VPC Endpoint for Image Pull
The training container image lives in ECR. When network isolation is on and the VPC has no internet route, the host must reach ECR through interface VPC endpoints — com.amazonaws.<region>.ecr.api and com.amazonaws.<region>.ecr.dkr — plus an S3 gateway endpoint (ECR stores image layers in S3 internally). Forgetting the ECR endpoints is the canonical "network isolation mode does not start" failure.
Network isolation mode is not the same as VPC mode, and combining them naively breaks training jobs. VPC mode places the container's network interface in your subnets, but the container can still reach the internet via NAT or VPC endpoints. Network isolation mode blocks the container from making any outbound network calls — even to AWS services. If you enable network isolation without also providing an S3 gateway endpoint, an ECR API endpoint, an ECR DKR endpoint, and a CloudWatch Logs interface endpoint, the SageMaker host cannot pull the image, log to CloudWatch, or upload model artifacts. The job fails before training starts. The MLA-C01 exam plants stems with "the security team enabled network isolation and now training jobs fail" — the answer is missing VPC endpoints, not the isolation flag itself.
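A pre-flight check for the endpoint list above might look like the following sketch. The function names are illustrative; the set of existing service names would typically be collected from the EC2 DescribeVpcEndpoints response for the training VPC, which this sketch assumes has already been done.

```python
def required_endpoint_services(region, container_calls_sts=False):
    """Map each required VPC endpoint service name to its endpoint type."""
    services = {
        f"com.amazonaws.{region}.s3": "Gateway",
        f"com.amazonaws.{region}.ecr.api": "Interface",
        f"com.amazonaws.{region}.ecr.dkr": "Interface",
        f"com.amazonaws.{region}.logs": "Interface",
    }
    if container_calls_sts:
        services[f"com.amazonaws.{region}.sts"] = "Interface"
    return services


def missing_endpoints(region, existing_service_names, container_calls_sts=False):
    """Return the required service names that the VPC does not yet have."""
    existing = set(existing_service_names)
    required = required_endpoint_services(region, container_calls_sts)
    return sorted(name for name in required if name not in existing)


# Example: a VPC where only the S3 and ECR API endpoints were created.
found = {"com.amazonaws.us-east-1.s3", "com.amazonaws.us-east-1.ecr.api"}
gaps = missing_endpoints("us-east-1", found)
print(gaps)
```

Running a check like this before submitting the job turns the "job hangs in Starting" symptom into an explicit list of missing endpoints.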
VPC Configuration for SageMaker Endpoints
Endpoints are inference, not training, but the VPC story for IAM, KMS, and VPC for SageMaker is symmetric.
Endpoint VPC Config
For the classic model-based hosting path, VpcConfig with subnets and security groups is set on the model via CreateModel, and the endpoint configuration that references the model picks it up. SageMaker provisions inference instances with ENIs in your subnets. Real-time endpoint clients (your application) reach the endpoint either via the public SageMaker Runtime API (which routes through AWS's network internally) or via a SageMaker Runtime VPC interface endpoint for fully private invocation paths.
PrivateLink Interface Endpoints for SageMaker API
Two distinct endpoints: com.amazonaws.<region>.sagemaker.api for the control-plane SageMaker API (CreateModel, CreateEndpoint, DescribeEndpoint), and com.amazonaws.<region>.sagemaker.runtime for the data-plane InvokeEndpoint API. For a fully VPC-isolated workload, both endpoints must exist in the consuming account's VPC. The runtime endpoint is what makes inference calls flow over PrivateLink instead of the public internet — required for many compliance regimes.
Endpoint Policies on the SageMaker VPC Endpoints
The VPC interface endpoint accepts an endpoint policy that limits which SageMaker resources can be invoked through it. A common pattern: the runtime endpoint policy allows only sagemaker:InvokeEndpoint on endpoints owned by the account, with an aws:ResourceTag condition restricting to endpoints tagged Environment=prod. This blocks accidental invocation of other accounts' endpoints even if a credential leak occurs.
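The endpoint policy pattern described above could be written as follows. This is a sketch under assumptions: the account ID, region, and tag value are placeholders, and the function name is hypothetical.

```python
import json


def runtime_vpce_policy(account_id, region, environment="prod"):
    """VPC endpoint policy for the SageMaker Runtime interface endpoint (sketch)."""
    return {
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": "*",
                "Action": "sagemaker:InvokeEndpoint",
                # Only endpoints owned by this account...
                "Resource": f"arn:aws:sagemaker:{region}:{account_id}:endpoint/*",
                # ...and only those tagged for the given environment.
                "Condition": {
                    "StringEquals": {"aws:ResourceTag/Environment": environment}
                },
            }
        ]
    }


vpce_policy = runtime_vpce_policy("111122223333", "us-east-1")
print(json.dumps(vpce_policy, indent=2))
```

Anything not matched by the single Allow statement is implicitly denied when traffic passes through the VPC endpoint.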
Cross-VPC Inference
For multi-VPC architectures, the inference endpoint lives in one VPC while its consumers run in other VPCs. Two patterns: each consumer VPC runs its own SageMaker Runtime VPC interface endpoint and invokes by endpoint name (cleanest), or a Transit Gateway connects the VPCs and consumers route to the private IPs of a shared SageMaker Runtime interface endpoint. The endpoint policy and security group rules govern which consumers can connect.
Data Capture for Endpoints — Encrypted
Data capture (the request/response logging used by Model Monitor) writes to S3. The capture configuration accepts a KmsKeyId parameter. For IAM, KMS, and VPC for SageMaker, always encrypt capture data — request payloads can contain PII, output payloads can contain sensitive predictions. The endpoint execution role needs kms:Encrypt and kms:GenerateDataKey on the capture key.
CloudTrail Audit for SageMaker
CloudTrail captures every SageMaker API call — CreateTrainingJob, CreateEndpoint, UpdateEndpointWeightsAndCapacities, all configuration changes. For IAM, KMS, and VPC for SageMaker compliance audit, enable a multi-region organization trail and:
- Filter for eventSource = sagemaker.amazonaws.com to find all SageMaker activity.
- Watch for eventName = UpdateEndpoint outside change windows; unscheduled updates are a security signal.
- Alert on eventName = DeleteTrainingJob or rapid CreateTrainingJob from unfamiliar principals.
- Use CloudTrail Lake to SQL-query training job creation patterns over time.
CloudTrail does not capture data-plane events for SageMaker by default — InvokeEndpoint calls are not in CloudTrail. For inference audit, enable data capture on the endpoint and use Model Monitor's request log.
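The audit rules listed above can be prototyped over exported CloudTrail records before wiring them into an alerting pipeline. The sketch below assumes events are already loaded as dictionaries with the standard eventSource, eventName, eventTime, and userIdentity.arn fields; the change window, principal list, and sample events are made up for illustration.

```python
from datetime import datetime


def flag_sagemaker_events(events, known_principals, window_start_hour, window_end_hour):
    """Flag UpdateEndpoint calls outside the change window and
    CreateTrainingJob calls from principals that are not on the known list."""
    flagged = []
    for event in events:
        if event.get("eventSource") != "sagemaker.amazonaws.com":
            continue
        # CloudTrail timestamps are UTC, e.g. 2024-05-01T03:15:00Z.
        when = datetime.fromisoformat(event["eventTime"].replace("Z", "+00:00"))
        principal = event.get("userIdentity", {}).get("arn", "")
        in_window = window_start_hour <= when.hour < window_end_hour
        if event["eventName"] == "UpdateEndpoint" and not in_window:
            flagged.append((event["eventName"], principal, "outside change window"))
        elif event["eventName"] == "CreateTrainingJob" and principal not in known_principals:
            flagged.append((event["eventName"], principal, "unfamiliar principal"))
    return flagged


deployer = "arn:aws:iam::111122223333:role/deployer"
stranger = "arn:aws:iam::111122223333:role/unknown"
sample_events = [
    {"eventSource": "sagemaker.amazonaws.com", "eventName": "UpdateEndpoint",
     "eventTime": "2024-05-01T03:15:00Z", "userIdentity": {"arn": deployer}},
    {"eventSource": "sagemaker.amazonaws.com", "eventName": "CreateTrainingJob",
     "eventTime": "2024-05-01T14:00:00Z", "userIdentity": {"arn": stranger}},
    {"eventSource": "s3.amazonaws.com", "eventName": "GetObject",
     "eventTime": "2024-05-01T14:00:00Z", "userIdentity": {"arn": stranger}},
    {"eventSource": "sagemaker.amazonaws.com", "eventName": "UpdateEndpoint",
     "eventTime": "2024-05-01T14:30:00Z", "userIdentity": {"arn": deployer}},
]
# Change window assumed to be 13:00 to 17:00 UTC.
alerts = flag_sagemaker_events(sample_events, {deployer}, 13, 17)
for alert in alerts:
    print(alert)
```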
Use IAM condition keys to lock SageMaker training jobs to approved instance types, KMS keys, and VPC configurations. The condition keys sagemaker:InstanceTypes, sagemaker:VolumeKmsKey, sagemaker:NetworkIsolation, and sagemaker:VpcSubnets evaluate at job creation time. A statement of the form Deny CreateTrainingJob unless sagemaker:NetworkIsolation = true enforces network isolation across the entire account. A statement Deny CreateTrainingJob unless sagemaker:VolumeKmsKey is in [list] enforces customer-managed KMS keys. These conditions are how you turn IAM, KMS, and VPC for SageMaker into a guardrail rather than a documentation guideline.
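A guardrail built from these condition keys might look like the sketch below. The key ARN is a placeholder and the Sid names are invented. One caveat, stated as an assumption: the Bool test denies requests where the isolation flag is present and false, so whether a request that omits the flag is also caught should be verified before relying on it.

```python
import json


def training_guardrail_policy(approved_volume_key_arns):
    """Deny statements that enforce network isolation and approved volume keys."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Sid": "RequireNetworkIsolation", "Effect": "Deny",
             "Action": "sagemaker:CreateTrainingJob",
             "Resource": "*",
             "Condition": {"Bool": {"sagemaker:NetworkIsolation": "false"}}},
            {"Sid": "RequireApprovedVolumeKey", "Effect": "Deny",
             "Action": "sagemaker:CreateTrainingJob",
             "Resource": "*",
             # Negated operators also match when the key is absent from the
             # request, so jobs with no volume key are denied as well.
             "Condition": {"ArnNotEquals": {
                 "sagemaker:VolumeKmsKey": list(approved_volume_key_arns)}}},
        ],
    }


guardrail = training_guardrail_policy(
    ["arn:aws:kms:us-east-1:111122223333:key/APPROVED-KEY-EXAMPLE"]
)
print(json.dumps(guardrail, indent=2))
```

The same document can be attached as an identity policy or adapted into an SCP, depending on how widely the guardrail should apply.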
Troubleshooting AccessDenied in SageMaker — The Decision Trees
This section is the heart of MLA-C01 Task 4.3. Memorize the symptom-to-cause mappings.
Training Job Fails Immediately with AccessDenied to S3
Symptom: training job state goes Failed within seconds, error message references S3 GetObject AccessDenied on the input data path.
Decision tree:
- Verify the execution role has s3:GetObject and s3:ListBucket on the bucket and prefix.
- Verify the S3 bucket policy does not explicitly deny the role principal.
- Verify the bucket's KMS key policy allows the role to call Decrypt (the most common failure).
- Verify any SCP at the OU level does not deny S3 actions for the role.
- Verify the training job's VPC config (if used) includes an S3 gateway endpoint in the subnet route table. Without it, the bucket policy's aws:SourceVpc condition rejects the call.
Training Job Fails with KMSAccessDenied During EBS Volume Mount
Symptom: training job state Failed, error message mentions KMS or VolumeKmsKey.
Decision tree:
- Verify the execution role has kms:CreateGrant, kms:Decrypt, kms:Encrypt, kms:GenerateDataKey on the volume KMS key.
- Verify the KMS key policy lists the execution role principal in the allowed-principals statement for the same operations.
- Verify the KMS key is in the same region as the training job — cross-region KMS for EBS does not work.
- Verify the key is enabled and not pending deletion.
Endpoint Creation Hangs or Fails
Symptom: endpoint state stays Creating for unusually long, or fails with InternalServerError.
Decision tree:
- Check CloudWatch Logs for the endpoint container, log group /aws/sagemaker/Endpoints/<endpoint-name>. The container's stdout/stderr will print the failure.
- If logs show ECR pull failure: VPC config is set with network isolation but ECR VPC endpoints are missing.
- If logs show S3 model artifact download failure: execution role missing S3 GetObject on the artifact path, or KMS Decrypt missing on the artifact key.
- If logs show "Application failed to start within X seconds": the model server inside the container is failing health checks. Increase ContainerStartupHealthCheckTimeoutInSeconds and inspect the container's startup logs.
Network Isolation Mode Causes Container to Hang
Symptom: training job state Starting for very long, no log output, eventually times out.
Decision tree:
- Verify VPC endpoints exist for com.amazonaws.<region>.s3 (gateway), com.amazonaws.<region>.ecr.api (interface), com.amazonaws.<region>.ecr.dkr (interface), com.amazonaws.<region>.logs (interface).
- Verify subnet route tables associate the gateway endpoint.
- Verify the security group on the training ENI allows outbound HTTPS (443) to the VPC endpoints.
- Verify the interface endpoints' DNS is enabled (private DNS) so the container resolves the AWS service hostnames to private IPs.
CloudTrail Shows InvokeEndpoint Failures from Unexpected Source
Symptom: CloudTrail Lake query shows InvokeEndpoint denials sourced from public internet IPs when the policy expects only VPC-internal.
Decision tree:
- Confirm the SageMaker Runtime VPC interface endpoint exists in the consumer VPC.
- Confirm consumer applications are using the endpoint's regional DNS, not the public DNS.
- Confirm the endpoint policy on the runtime VPC endpoint restricts to internal principals.
- If using a SageMaker endpoint with EndpointConfig.VpcConfig, confirm the consumer is in a peered VPC or shares the same Transit Gateway.
SageMaker Studio Security — A Different Surface
SageMaker Studio domains have their own IAM model — domain execution role, user profile execution role, and notebook execution role — that interacts with IAM, KMS, and VPC for SageMaker. Domain-level VPC config places Studio's app instances in a VPC. Studio supports the same network isolation mode toggle. KMS keys for Studio's EFS storage are configured at domain creation and cannot change later. The exam can plant a Studio scenario where the user profile role lacks iam:PassRole for the training execution role — the user can open notebooks but cannot launch training jobs.
Common Exam Traps for IAM, KMS, and VPC for SageMaker
Trap 1 — AmazonSageMakerFullAccess Is Production-Ready
Wrong. The managed policy grants s3:* on every bucket and iam:PassRole on *. Always replace with scoped inline policies for production.
Trap 2 — KMS Key Policy Updates Are Not Required When the IAM Policy Allows
Wrong. KMS key policies are the authoritative gate. IAM policy alone cannot grant KMS access if the key policy does not also list the principal.
Trap 3 — Network Isolation Mode Replaces VPC Mode
Wrong. They are independent toggles. Network isolation alone (without VPC config) blocks the container from internet access; VPC mode alone places the container in your subnets but allows internet via NAT. Combine both for maximum isolation.
Trap 4 — S3 VPC Endpoint Costs Money
Wrong for gateway endpoints (S3 and DynamoDB) — they are free. Interface endpoints (most other services) cost per hour and per GB. Always use the S3 gateway endpoint for SageMaker training data.
Trap 5 — Inter-Container Traffic Encryption Affects Single-Instance Training
Wrong. The flag only affects multi-instance distributed training. For single-instance jobs, the flag has no effect.
Trap 6 — CloudTrail Captures InvokeEndpoint Calls
Wrong. SageMaker InvokeEndpoint is a data-plane API and not in standard CloudTrail. Use endpoint data capture to log invocations.
Trap 7 — Customer-Managed KMS Keys Auto-Rotate Like AWS-Managed Keys
Partial. AWS-managed keys rotate yearly automatically. Customer-managed keys rotate yearly only if you enable EnableKeyRotation. Without it, the same key material is used forever.
Trap 8 — A SageMaker Endpoint in a VPC Cannot Be Invoked from the Internet
Wrong. The endpoint is still reachable via the public SageMaker Runtime API unless you also restrict via endpoint policy or use the runtime VPC interface endpoint exclusively. VPC config on the endpoint controls outbound network from the inference container, not inbound invocation paths.
Trap 9 — Training Jobs Inherit the Notebook's IAM Role
Wrong. Each training job has its own execution role passed via RoleArn. The notebook role calls CreateTrainingJob and must have iam:PassRole on the execution role ARN.
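The notebook-side permissions can be sketched as a two-statement identity policy. The role ARN and Sid names are placeholders; the iam:PassedToService condition is an extra restriction added here so the role can be handed only to SageMaker.

```python
import json


def notebook_role_policy(execution_role_arn):
    """Identity policy for the principal that launches training jobs (sketch)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Sid": "LaunchTrainingJobs", "Effect": "Allow",
             "Action": "sagemaker:CreateTrainingJob",
             "Resource": "*"},
            {"Sid": "PassOnlyTheTrainingRole", "Effect": "Allow",
             "Action": "iam:PassRole",
             # Scoped to one role ARN, never "*".
             "Resource": execution_role_arn,
             "Condition": {"StringEquals": {
                 "iam:PassedToService": "sagemaker.amazonaws.com"}}},
        ],
    }


# Hypothetical execution role.
launcher_policy = notebook_role_policy(
    "arn:aws:iam::111122223333:role/sagemaker-training-exec"
)
print(json.dumps(launcher_policy, indent=2))
```

Without the second statement, CreateTrainingJob is rejected even though the first statement allows the API call itself.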
Trap 10 — Disabling Public Access on the Training Bucket Is Sufficient Network Isolation
Wrong. Bucket-level public access blocks inbound; it does not affect what the training container can reach outbound. Only network isolation mode and VPC endpoint policies control the container's outbound traffic.
For a fully isolated SageMaker training job (VPC mode plus network isolation mode plus customer-managed KMS plus inter-container encryption), you must provision exactly five VPC endpoints in the training subnets: S3 gateway endpoint, ECR API interface endpoint, ECR DKR interface endpoint, CloudWatch Logs interface endpoint, and (if calling other AWS services from the container) an STS interface endpoint. Skip any of these and the job fails: missing S3 endpoint blocks data download, missing ECR endpoints block image pull, missing CloudWatch Logs endpoint silently loses training logs, missing STS endpoint blocks any container code that calls AssumeRole. This five-endpoint pattern is the definitive answer for IAM, KMS, and VPC for SageMaker fully-isolated stems.
Key Numbers and Must-Memorize Facts for IAM, KMS, and VPC for SageMaker
IAM
- iam:PassRole is required by the principal that calls CreateTrainingJob to pass the execution role
- AmazonSageMakerFullAccess is for prototyping only; never production
- Permission boundaries cap maximum permissions even if inline policy grants more
- aws:SourceArn and aws:SourceAccount conditions defeat confused deputy
KMS
- Key policy is the authoritative gate; IAM cannot override
- Customer-managed keys do not rotate unless EnableKeyRotation = true
- VolumeKmsKeyId encrypts EBS for training instances
- OutputDataConfig.KmsKeyId encrypts model artifacts in S3
- KMS keys are regional; cross-region requires multi-region keys
VPC
- VPC mode and network isolation mode are independent toggles
- Network isolation requires S3 gateway, ECR API, ECR DKR, CloudWatch Logs interface endpoints
- S3 and DynamoDB gateway endpoints are free; interface endpoints cost per hour and per GB
- SageMaker API and SageMaker Runtime are separate VPC endpoints
- Inter-container encryption applies only to multi-instance distributed training
FAQ — IAM, KMS, and VPC for SageMaker Top Questions
Q1 — A training job fails immediately with KMSAccessDenied. The IAM policy on the execution role allows kms:Decrypt. Why?
The KMS key policy in the data-account does not list the Modelling-account execution role principal. KMS key policies are the authoritative gate, and IAM-level allow does not override key-policy silence. The fix: open the key policy, add a statement allowing the execution role ARN to call Decrypt and CreateGrant, and re-run. This is the single most common KMSAccessDenied scenario on the MLA-C01 exam — it tests whether you understand that KMS is policy-on-policy, not policy-or-policy.
Q2 — How do I enforce that all SageMaker training jobs in my account use customer-managed KMS keys, not AWS-managed keys?
Attach a Deny statement on sagemaker:CreateTrainingJob, either in an SCP at the organization level or in an IAM policy on the principals that launch jobs in the account, that applies when sagemaker:VolumeKmsKey is null or not in an approved list. The condition key evaluates at API call time and rejects jobs missing the volume key parameter. Pair with a similar Deny on sagemaker:OutputKmsKey to enforce model artifact encryption. This pattern turns KMS into a guardrail rather than a guideline.
Q3 — A training job is in network isolation mode but the container hangs at startup. What did we forget?
The VPC endpoints for the AWS services SageMaker needs at host-level: S3 gateway endpoint (for the data download SageMaker performs before container start), ECR API and ECR DKR interface endpoints (for the image pull), and CloudWatch Logs interface endpoint (for log delivery). Without these endpoints, the SageMaker host cannot complete its bootstrap operations, and the container never starts. Check the subnet's route table for the gateway endpoint, the VPC's interface endpoints list for ECR and CloudWatch, and the security group on the host's ENI for outbound HTTPS to the endpoints' private IPs.
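The endpoint checklist in this answer can be turned into a small offline check. The sketch assumes you have already listed the VPC's endpoints (for example from the console or the EC2 API) as pairs of service name and endpoint type. The function and variable names are hypothetical.

```python
# Service-name suffix -> endpoint type expected by the checklist above.
REQUIRED_ENDPOINTS = {
    "s3": "Gateway",
    "ecr.api": "Interface",
    "ecr.dkr": "Interface",
    "logs": "Interface",
}


def missing_endpoints(region, existing):
    """Return the required endpoint service names that are absent.

    `existing` is an iterable of (service_name, endpoint_type) tuples
    describing the endpoints already present in the VPC.
    """
    have = set(existing)
    missing = []
    for suffix, endpoint_type in REQUIRED_ENDPOINTS.items():
        service_name = f"com.amazonaws.{region}.{suffix}"
        if (service_name, endpoint_type) not in have:
            missing.append(service_name)
    return missing


if __name__ == "__main__":
    present = [("com.amazonaws.us-east-1.s3", "Gateway")]
    print(missing_endpoints("us-east-1", present))
```

This only covers the first check in the answer. Route tables and security groups on the endpoint ENIs still need to be inspected separately.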
Q4 — How do I prevent a SageMaker training job from being created with an unapproved instance type?
Use the sagemaker:InstanceTypes IAM condition key in a Deny statement. Example: "Condition": {"ForAnyValue:StringNotEquals": {"sagemaker:InstanceTypes": ["ml.m5.large", "ml.m5.xlarge", "ml.p3.2xlarge"]}}. Attach to the role or to an SCP at the OU level. The condition evaluates at CreateTrainingJob time and rejects jobs requesting any instance type not in the list. Combine with sagemaker:MaxRuntimeInSeconds to cap job duration and prevent runaway costs.
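To see why ForAnyValue:StringNotEquals gives the intended result in a Deny statement, the sketch below re-implements that one operator locally. It is a simplified model for reasoning about the condition, not the IAM policy engine, and the function names are hypothetical.

```python
APPROVED = ["ml.m5.large", "ml.m5.xlarge", "ml.p3.2xlarge"]


def for_any_value_string_not_equals(request_values, policy_values):
    """Simplified model of ForAnyValue:StringNotEquals.

    True when at least one requested value differs from every policy value.
    An empty request (key absent) yields False, so the Deny does not fire.
    """
    return any(
        all(requested != allowed for allowed in policy_values)
        for requested in request_values
    )


def is_denied(requested_instance_types):
    """Would the Deny statement from the example above match this request?"""
    return for_any_value_string_not_equals(requested_instance_types, APPROVED)


if __name__ == "__main__":
    print(is_denied(["ml.m5.large"]))       # approved type
    print(is_denied(["ml.p4d.24xlarge"]))   # type outside the list
```

Because the Deny fires only when some requested type is outside the list, any request whose types are all approved passes through to the normal Allow evaluation.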
Q5 — A SageMaker endpoint in a VPC can be invoked from the public internet via the SageMaker Runtime API. How do I block that?
VPC config on the endpoint controls outbound network traffic from the inference container, not inbound invocation paths. To block public internet invocation: (1) provision a SageMaker Runtime VPC interface endpoint in your consumer VPC, (2) attach an endpoint policy to the runtime VPC endpoint that allows only sagemaker:InvokeEndpoint on your account's endpoints, and (3) use an SCP or IAM policy to deny sagemaker:InvokeEndpoint calls that do not arrive through the VPC endpoint, by checking the aws:SourceVpce condition key. The combination forces all invocations to flow over PrivateLink and rejects public internet calls.
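Steps (2) and (3) can be sketched as two policy documents. The function names, Sids, VPC endpoint ID, and endpoint ARN pattern are hypothetical placeholders to replace with your own values.

```python
import json


def runtime_vpce_endpoint_policy(endpoint_arn_pattern):
    """Step 2: endpoint policy for the SageMaker Runtime VPC endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowInvokeOwnEndpoints",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "sagemaker:InvokeEndpoint",
            "Resource": endpoint_arn_pattern,
        }],
    }


def deny_invoke_outside_vpce(vpce_id, endpoint_arn_pattern):
    """Step 3: deny invocations that do not arrive through the VPC endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyInvokeOutsideVpce",
            "Effect": "Deny",
            "Action": "sagemaker:InvokeEndpoint",
            "Resource": endpoint_arn_pattern,
            # Matches when the request did not come via the named endpoint,
            # including when the key is absent (a public-internet call).
            "Condition": {"StringNotEquals": {"aws:SourceVpce": vpce_id}},
        }],
    }


if __name__ == "__main__":
    arn = "arn:aws:sagemaker:us-east-1:111111111111:endpoint/*"
    print(json.dumps(runtime_vpce_endpoint_policy(arn), indent=2))
    print(json.dumps(deny_invoke_outside_vpce("vpce-0example", arn), indent=2))
```

The Allow in the endpoint policy narrows what the PrivateLink path can do, while the Deny closes every other path. Either one alone leaves a gap.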
Q6 — What is the cleanest pattern for cross-account SageMaker training where the data lives in account A and the training runs in account B?
Three components: (1) Account A's S3 bucket policy grants s3:GetObject and s3:ListBucket to account B's training execution role ARN, with aws:SourceArn and aws:SourceAccount conditions for confused-deputy protection. (2) Account A's KMS key policy grants kms:Decrypt to the same role principal, so the bucket policy and the key policy align. (3) Account B's training execution role has an inline policy granting s3:GetObject and kms:Decrypt on the cross-account bucket and key ARNs. Test before production with a dry-run training job that simply downloads the data and exits.
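The three components can be laid side by side in one sketch to check that they name the same role, bucket, and key. The function name, Sids, and all ARNs are hypothetical. The confused-deputy conditions mentioned above are left out to keep the example short, and s3:ListBucket is added to the role policy as an assumption so that it mirrors the bucket policy.

```python
def cross_account_policies(role_arn, bucket_name, key_arn):
    """Return the bucket policy, key-policy statement, and role policy."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    bucket_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowTrainingRoleRead",
            "Effect": "Allow",
            "Principal": {"AWS": role_arn},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            # ListBucket applies to the bucket, GetObject to its objects.
            "Resource": [bucket_arn, f"{bucket_arn}/*"],
        }],
    }
    key_policy_statement = {
        "Sid": "AllowTrainingRoleDecrypt",
        "Effect": "Allow",
        "Principal": {"AWS": role_arn},
        "Action": "kms:Decrypt",
        "Resource": "*",  # in a key policy this means "this key"
    }
    role_inline_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow",
             "Action": ["s3:GetObject", "s3:ListBucket"],
             "Resource": [bucket_arn, f"{bucket_arn}/*"]},
            {"Effect": "Allow",
             "Action": "kms:Decrypt",
             "Resource": key_arn},
        ],
    }
    return {
        "bucket_policy": bucket_policy,          # lives in account A
        "key_policy_statement": key_policy_statement,  # lives in account A
        "role_inline_policy": role_inline_policy,      # lives in account B
    }
```

Generating all three from the same inputs avoids the most common failure in this setup, where one document references a slightly different role or key ARN than the others.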
Q7 — How do I audit who launched a SageMaker training job that exfiltrated data?
CloudTrail captures CreateTrainingJob events with the calling principal, the role ARN, the input data S3 path, and the VPC config. Query CloudTrail Lake with SQL, using your event data store ID in the FROM clause: SELECT eventTime, userIdentity.arn, requestParameters FROM <event-data-store-id> WHERE eventName='CreateTrainingJob' AND eventTime > TIMESTAMP'2026-04-01'. To audit the actual data download, S3 server access logs or S3 data events in CloudTrail show every GetObject call with the requesting role ARN. Cross-reference the training job creation event with the S3 GetObject events using the role ARN and the time window. For inference exfiltration, enable endpoint data capture and review the captured request/response payloads.
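The cross-referencing step can be sketched offline once both result sets have been exported. The record shapes below (eventTime, roleArn, jobName, key) are simplified, hypothetical stand-ins for the real CloudTrail fields, and the 24-hour window is an arbitrary assumption.

```python
from datetime import datetime, timedelta


def correlate(job_events, s3_events, window=timedelta(hours=24)):
    """Pair each training job with S3 reads made by the same role.

    A read matches when its role ARN equals the job's role ARN and it
    happened within `window` after the job was created.
    """
    matches = []
    for job in job_events:
        start = datetime.fromisoformat(job["eventTime"])
        for read in s3_events:
            when = datetime.fromisoformat(read["eventTime"])
            same_role = read["roleArn"] == job["roleArn"]
            if same_role and start <= when <= start + window:
                matches.append((job["jobName"], read["key"]))
    return matches


if __name__ == "__main__":
    role = "arn:aws:iam::222222222222:role/ExampleTrainingRole"
    jobs = [{"eventTime": "2026-04-02T10:00:00", "roleArn": role,
             "jobName": "example-job"}]
    reads = [
        {"eventTime": "2026-04-02T10:05:00", "roleArn": role,
         "key": "pii/records.csv"},
        {"eventTime": "2026-04-05T09:00:00", "roleArn": role,
         "key": "other/file.csv"},
    ]
    print(correlate(jobs, reads))
```

In real CloudTrail records the S3 caller usually appears as an assumed-role session rather than the bare role ARN, so the role match would need to be adapted to that format.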
Further Reading — Official AWS Documentation for IAM, KMS, and VPC for SageMaker
The authoritative AWS sources are: the SageMaker Developer Guide security chapter (especially the IAM, encryption-at-rest, encryption-in-transit, train-vpc, and host-vpc sections), the IAM User Guide (especially confused deputy and condition keys), the KMS Developer Guide (key policy structure and grants), and the VPC User Guide (gateway and interface endpoints). The AWS Security Reference Architecture whitepaper codifies the multi-account pattern. The AWS re:Inforce conference archives include multiple deep dives on SageMaker security architectures. For compliance regimes such as HIPAA, PCI-DSS, and FedRAMP, the AWS Compliance Center documents the controls each service satisfies.