
ML Data Ingestion and Storage Patterns


Master ML data ingestion and storage patterns for MLA-C01 Domain 1 Task 1.1: S3 as primary ML store, batch vs streaming ingestion (Kinesis, Firehose, Glue), Lake Formation governance, FSx for Lustre vs EFS for SageMaker training, Glue Data Catalog schema evolution, and the cross-account access patterns the exam tests via decision-matrix questions.


ML data ingestion and storage patterns are the unglamorous half of every successful ML project — the model is only as good as the pipeline that fed it, and on the MLA-C01 exam roughly one third of Domain 1 questions live in this space. The MLA-C01 (Machine Learning Engineer Associate) is an engineering exam, not a data-science exam, so the questions are not "which feature scaling method is best" but "your team has 50 TB of training data sitting in an on-prem NetApp filer and the SageMaker training job needs it in S3 by Friday — pick the right ingestion service and quote the throughput." This guide takes that engineer's perspective end to end.

We will cover the S3-as-data-lake foundation that every SageMaker training job assumes, the batch and streaming ingestion services and exactly when each is the right answer, the Lake Formation and Glue Data Catalog governance plane that keeps multi-team data lakes from collapsing into chaos, the high-throughput shared-storage options (FSx for Lustre and EFS) that change the math for very large training jobs, and the cross-account and cross-region patterns the exam loves to mix into ordering and matching questions. By the end you should be able to read an MLA-C01 stem like "1 PB of clickstream data in Snowflake, real-time recommendations needed, must train hourly" and know without looking that the answer involves Kinesis Data Streams, Glue streaming ETL, Feature Store online and offline, and SageMaker File mode for the batch retraining path.

What ML Data Ingestion Means in MLA-C01 Context

Data ingestion in MLA-C01 is the discipline of moving data from its system of record (operational databases, SaaS APIs, IoT devices, log streams, on-prem warehouses) into a storage layer where SageMaker training jobs and inference endpoints can consume it cheaply, repeatably, and with full lineage. This is broader than ETL — it includes the storage architecture choices (S3 prefix layout, file format, partitioning), the catalog (Glue Data Catalog as the metadata backbone), and the governance plane (Lake Formation permissions, KMS encryption, VPC endpoints).

Why Storage Architecture Decisions Are Made Before The First Model

A common rookie mistake is treating ingestion and storage as plumbing to defer until after the model works on a notebook sample. By the time the model is good and you scale to the full dataset, you discover that your S3 prefix layout makes parallel reads impossible, that your file format triggers Spark task skew, that your VPC has no S3 endpoint and pays NAT egress on every training run, and that your data is duplicated across four buckets with no canonical version. MLA-C01 questions probe this — "your team's training job takes 12 hours; how do you cut it to 2?" usually has an ingestion-and-storage answer, not a model-architecture answer.

Three Things The Exam Tests About Ingestion

The first is service selection — given a latency, volume, and frequency profile, pick the right ingestion service from the AWS toolbox (Kinesis Data Streams, Kinesis Firehose, Glue, DMS, Snowball, Direct Connect, Transfer Family). The second is storage placement — pick the right S3 storage class, the right cross-account access pattern, and the right shared-storage option for SageMaker training. The third is failure-mode awareness — knowing the gotchas (Firehose buffering delay, FSx for Lustre vs EFS latency, Pipe mode unsupported by certain algorithms) that turn a plausible-looking answer into a wrong one.

Plain-Language Explanation: ML Data Ingestion and Storage

ML data ingestion is a topic where six AWS services fight for the same paragraph. Three concrete analogies make the structure stick.

Analogy 1 — The Restaurant Supply Chain and Walk-In Freezer

Picture a Michelin-starred restaurant (your SageMaker training job) that needs ingredients (training data) every morning. The walk-in freezer at the back of the kitchen is Amazon S3 — it is huge, it is cheap per kilogram, it is the shared source of truth that every chef pulls from. The freezer has shelves labeled by year, month, day (S3 prefixes) so the sous-chef finds the right batch fast. Some ingredients arrive on a daily 4 a.m. truck (batch ingestion via Glue or Snowball) — a fixed, predictable delivery you can schedule overnight. Other ingredients arrive minute-by-minute on a conveyor from the rooftop herb garden (streaming ingestion via Kinesis Data Streams) — continuous, low-latency, smaller volume. The dispatcher who buffers the conveyor for 60 seconds before bundling deliveries to the freezer is Kinesis Firehose — efficient, but introduces a known buffering delay. The high-end sashimi tank for ultra-fast access during dinner service (SageMaker training reading huge datasets) is FSx for Lustre — a special-purpose, high-throughput cache that is much faster than walking back to the freezer for every plate but more expensive to keep cold. The clipboard hanging on the freezer door listing every ingredient, supplier, expiry date, and shelf location is the Glue Data Catalog. The general manager who decides which chefs can pull which ingredients (Lake Formation) writes the access list. When the head chef wants a specific tuna shipment from last Tuesday, they consult the clipboard, walk to the right shelf, and pull the right batch — that is exactly how SageMaker training reads from S3 via the Glue catalog.

Analogy 2 — The Library Acquisitions and Reserve Stacks

Picture a national research library (your data lake) with eighty branch libraries (member accounts) feeding it. New books arrive in two ways: weekly bulk donations from publishers (batch ingestion via Glue jobs and Snowball) or daily journal subscriptions arriving by post (streaming ingestion via Kinesis). The acquisitions department checks every incoming book against an existing record (Glue Crawler updating the Data Catalog), assigns a Library of Congress classification (schema), and shelves it in the open stacks (S3 Standard) or the closed stacks (S3 Glacier Instant Retrieval) depending on how often it gets requested. A graduate student writing a dissertation (your ML training job) needs hundreds of references at the same time — the research carrel with a high-speed cart pulling stacks of books all day is FSx for Lustre, while the general-purpose reading room with shared bookshelves is EFS. The librarian who decides which scholars get into which sections (Lake Formation) writes the access cards. When a researcher from another university (cross-account access) walks in, the librarian checks their inter-library-loan agreement (IAM role and bucket policy), and either grants access or refuses. The catalog terminals throughout the library (Glue Data Catalog and Athena) let any scholar search "every book on machine learning published since 2020" without walking the stacks. This entire library system — acquisitions, classification, shelving, access control, search — is the AWS data-lake-for-ML reference architecture.

Analogy 3 — The Hospital Lab Specimen Pipeline

Picture a regional hospital with sixty clinics (member accounts) feeding lab specimens into a central pathology lab (your data lake). Specimens arrive in two ways: a daily courier route at 6 a.m. (batch ingestion) bundling overnight collections, and a real-time pneumatic tube system (streaming ingestion via Kinesis Data Streams) for STAT samples that need processing within 60 seconds. The receiving desk that batches incoming tubes for 60 seconds before forwarding to the lab is Kinesis Firehose. The cold-chain freezer warehouse holding ten years of specimens at minus-80 °C is S3 Glacier Deep Archive. The active lab bench refrigerator with this morning's specimens is S3 Standard. The high-speed centrifuge that processes hundreds of specimens at once for a research study (SageMaker training) reads from a temporary parallel file system — that is FSx for Lustre, hydrated from S3 only for the duration of the study, then released. The specimen barcoding system that records every specimen's origin, type, collection time, and storage location is the Glue Data Catalog. The hospital's HIPAA officer who decides which researchers can request which specimens is Lake Formation. When a researcher from another hospital network requests samples (cross-account access), the HIPAA officer checks the data-use agreement (IAM role + bucket policy + KMS key policy) and either grants access via a controlled tunnel (S3 VPC endpoint) or refuses. Every part of this — barcoding, freezer tiers, courier vs pneumatic tube, access control — maps directly to the MLA-C01 ingestion architecture.

S3 as the Primary ML Data Store

Almost every MLA-C01 ingestion question lands on S3. SageMaker training, Feature Store offline store, Ground Truth labeling, Data Wrangler imports, and Pipelines artifacts all default to S3. Master S3 layout patterns and you remove half the trick-question surface.

Bucket Layout and Prefix Strategy for ML

Single-bucket-per-team is fine; what matters is the prefix structure inside. The standard ML prefix pattern is s3://team-ml-data/<dataset>/<version>/<partition>/data.parquet, where <partition> is Hive-style (year=2026/month=05/day=02/) so Athena and Spark can prune partitions. Dataset versioning lives in the prefix, not in S3 object versioning — object versioning is for accidental-deletion protection, not for training-set lineage. Avoid sequential prefixes (timestamps as the leading characters) at very high request rates because S3 partitions by prefix; modern S3 auto-scales partitions, but for sustained 5,000+ requests/second a hash-prefixed key (<hash4>/<dataset>/...) still helps.
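A minimal sketch of that layout in code — the bucket, dataset, and version names here are hypothetical placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical names -- substitute your own bucket, dataset, and version.
BUCKET = "team-ml-data"
DATASET = "clickstream"
VERSION = "v17"

def partition_key(year: int, month: int, day: int, filename: str) -> str:
    """Build a Hive-style partitioned key so Athena/Spark can prune partitions."""
    return (
        f"{DATASET}/{VERSION}/"
        f"year={year:04d}/month={month:02d}/day={day:02d}/{filename}"
    )

key = partition_key(2026, 5, 2, "data.parquet")
s3.upload_file("data.parquet", BUCKET, key)
print(f"s3://{BUCKET}/{key}")
# s3://team-ml-data/clickstream/v17/year=2026/month=05/day=02/data.parquet
```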

File Formats — Parquet, ORC, JSON, CSV

For ML training, Parquet is the default — columnar, compressed, splittable, schema-embedded, and natively understood by Spark, Athena, and SageMaker built-in algorithms for tabular workloads. Use ORC if your team standardized on Hive. Use CSV only for small datasets or strict interoperability needs (Athena cost is 5-10x higher on CSV vs Parquet for the same query). Use JSON Lines for nested data that resists flattening. Image datasets stay as .jpg or .png with metadata in a sidecar manifest; audio stays as .wav or .flac; text for NLP is typically tokenized and stored as Parquet for training.
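A conversion sketch for the tabular case using pandas with the pyarrow engine — file paths and column names are assumptions, and s3fs must be installed for direct s3:// writes:

```python
import pandas as pd

# Hypothetical conversion: CSV in, Hive-partitioned Parquet out.
df = pd.read_csv("clickstream.csv", parse_dates=["event_time"])
df["year"] = df["event_time"].dt.year
df["month"] = df["event_time"].dt.month

# partition_cols writes year=YYYY/month=MM/ directories that Athena can prune.
df.to_parquet(
    "s3://team-ml-data/clickstream/v17/",
    engine="pyarrow",
    partition_cols=["year", "month"],
    index=False,
)
```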

S3 Versioning vs Dataset Versioning

S3 Object Versioning protects against accidental delete and overwrite — every version is retained until lifecycle expires it. This is operationally critical but is not training-set lineage. For training reproducibility — "rerun the model exactly as it ran on 2026-04-30" — you need explicit dataset versioning either in the prefix (s3://bucket/datasets/clickstream/v17/...) or in a tool like DVC (Data Version Control) that writes pointer files into Git and stores binary data in S3. SageMaker Pipelines records the input S3 URI per run, so locking the URI to a versioned prefix gives provable lineage.

S3 Storage Classes for Training Datasets

Hot training data lives in S3 Standard. Datasets accessed monthly drop to S3 Standard-IA (cheaper storage, retrieval fee per GB). Archived datasets older than 90 days move to S3 Glacier Instant Retrieval (millisecond retrieval, archive pricing) or S3 Glacier Flexible Retrieval (minutes-to-hours retrieval). Truly cold compliance archives at 7-10 year retention go to S3 Glacier Deep Archive at $1/TB/month. Lifecycle policies automate the transitions.

For SageMaker training, the dataset must be reachable in milliseconds — S3 Glacier Flexible and Deep Archive are NOT suitable for direct training input. A common MLA-C01 trap is "we stored last year's training data in Glacier Deep Archive to save cost; now retraining takes 24 hours to start." The fix is either lifecycle to Glacier Instant Retrieval (millisecond access at archive pricing) or pay the Glacier Flexible restore time and cost. Plan storage tiers around the retraining cadence — if you retrain monthly with last-12-months data, keep all 12 months in Standard or Standard-IA, not Glacier.
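A hedged boto3 sketch of a lifecycle policy matching that tiering — the bucket name and prefix are placeholders, and the day thresholds should follow your actual retraining cadence:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical tiering: Standard -> Standard-IA at 30 days -> Glacier Instant
# Retrieval at 90 days, keeping millisecond access for retraining.
s3.put_bucket_lifecycle_configuration(
    Bucket="team-ml-data",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-training-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "clickstream/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER_IR"},
                ],
            }
        ]
    },
)
```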

Batch Ingestion Patterns

Batch ingestion is scheduled, predictable, high-volume bulk transfer. MLA-C01 tests four scenarios.

Scheduled Glue Jobs from RDS, Redshift, DynamoDB

The most common pattern: an AWS Glue job runs nightly via EventBridge schedule, connects to a JDBC source (RDS, Redshift, on-prem DB via JDBC connection), reads with predicate pushdown, transforms, and writes Parquet to S3 partitioned by date. Glue handles the connection auth (IAM, Secrets Manager), the schema discovery (Crawler), and the parallelism (DPUs). Job bookmarks track which rows have been read to avoid reprocessing. Cost: $0.44/DPU-hour for Standard workers.
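A skeleton of such a Glue job script — the prod_oltp catalog database and orders table are hypothetical, and the year/month/day partition columns are assumed to exist on the source (in practice you would derive them in a transform step). The transformation_ctx strings are what job bookmarks key on:

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)  # required for job bookmarks to track progress

# Read from a cataloged JDBC source; transformation_ctx enables bookmarking
# so nightly runs skip rows already ingested.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="prod_oltp",            # hypothetical catalog database
    table_name="orders",             # hypothetical table
    transformation_ctx="orders_src",
)

# Write date-partitioned Parquet to the lake.
glue_context.write_dynamic_frame.from_options(
    frame=orders,
    connection_type="s3",
    connection_options={
        "path": "s3://team-ml-data/orders/v1/",
        "partitionKeys": ["year", "month", "day"],
    },
    format="parquet",
    transformation_ctx="orders_sink",
)

job.commit()  # commits the bookmark state
```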

S3 Transfer Acceleration

For uploading large training datasets to S3 from a remote on-prem location, S3 Transfer Acceleration routes the upload through the nearest CloudFront edge location onto the AWS backbone. Useful when datasets are tens of GB and the on-prem-to-S3 path goes over the public internet. It adds a per-GB fee on top of standard S3 pricing — only enable it when the speedup justifies the cost.

AWS DataSync

For bulk migration from on-prem NFS, SMB, HDFS, or another cloud, AWS DataSync moves TB-scale data over the network at up to 10 Gbps with built-in encryption and integrity verification. Best when you have Direct Connect or a VPN; without dedicated bandwidth, DataSync still works but throughput is constrained by the internet path.

Snowball / Snowmobile for Petabyte-Scale

For initial data lake seeding at PB scale where network transfer would take weeks, AWS Snowball Edge (80 TB rugged appliance) or the legacy AWS Snowmobile (100 PB on a truck) physically ships the data. Common MLA-C01 stem: "200 TB of historical telemetry on-prem, must ingest within 14 days, no Direct Connect available." Network transfer at 100 Mbps would take 200 days; Snowball ships and returns in 14. Cost is per-device fee plus shipping, typically a few thousand dollars per Snowball.
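The arithmetic behind that claim, as a quick sanity check:

```python
# Back-of-envelope check on the stem's numbers: 200 TB over a 100 Mbps link.
TB = 10**12
link_bps = 100 * 10**6            # 100 Mbps
dataset_bytes = 200 * TB

seconds = dataset_bytes * 8 / link_bps
print(f"{seconds / 86400:.0f} days")
# ~185 days at 100% utilization -- "200 days" once protocol overhead is counted
```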

AWS Database Migration Service (DMS) For Continuous Replication

For ongoing replication from operational databases into S3, DMS does change-data-capture (CDC) — reads the source database transaction log, replicates inserts/updates/deletes to S3 as Parquet. Used when the ML pipeline needs near-real-time freshness on relational data without burdening the source DB with full-table scans.

For ML pipelines that need to keep training data in sync with an operational database, AWS DMS with CDC mode to S3 is the right answer over scheduled Glue jobs. Glue jobs do nightly snapshots — you lose changes that happened during the day, and the source DB takes a heavy read hit on every run. DMS CDC reads the transaction log continuously, applies inserts/updates/deletes incrementally to S3 with seconds-of-latency, and never touches the source DB beyond the log reader. The MLA-C01 stem signal is "minimize impact on production database" or "freshness within minutes" — answer with DMS, not Glue.
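A hedged boto3 sketch of that DMS setup — all ARNs and identifiers are placeholders, and the source endpoint and replication instance are assumed to exist already:

```python
import boto3

dms = boto3.client("dms")

# Hypothetical target endpoint: S3, Parquet output.
target = dms.create_endpoint(
    EndpointIdentifier="ml-lake-s3-target",
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "BucketName": "team-ml-data",
        "BucketFolder": "orders_cdc",
        "DataFormat": "parquet",
        "ServiceAccessRoleArn": "arn:aws:iam::111122223333:role/dms-s3-role",
    },
)

# MigrationType="cdc" replays the source transaction log only -- no table scans.
dms.create_replication_task(
    ReplicationTaskIdentifier="orders-cdc-to-s3",
    SourceEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:SRC",  # placeholder
    TargetEndpointArn=target["Endpoint"]["EndpointArn"],
    ReplicationInstanceArn="arn:aws:dms:us-east-1:111122223333:rep:INST",  # placeholder
    MigrationType="cdc",
    TableMappings='{"rules":[{"rule-type":"selection","rule-id":"1","rule-name":"1",'
                  '"object-locator":{"schema-name":"public","table-name":"orders"},'
                  '"rule-action":"include"}]}',
)
```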

Streaming Ingestion Patterns

Streaming ingestion is event-driven, low-latency, continuous. The MLA-C01 exam tests three services and the trade-offs between them.

Amazon Kinesis Data Streams (KDS)

KDS is the foundational stream — producers write records, consumers (Lambda, Flink, KCL apps) read with sub-second latency. Throughput is provisioned in shards (1 MB/s in, 2 MB/s out per shard) or on-demand. Records are retained 24 hours by default, up to 365 days. For ML, KDS is the right answer when you need real-time feature computation (sliding window aggregations) or when multiple consumers need the same stream (enhanced fan-out gives each consumer dedicated read throughput).
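A minimal producer sketch with boto3 — the stream name and event shape are assumptions:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Hypothetical clickstream event; the partition key controls shard routing,
# so all events for one user land on one shard, in order.
event = {"user_id": "u-4711", "page": "/checkout", "ts": "2026-05-02T09:15:00Z"}

kinesis.put_record(
    StreamName="clickstream",            # assumed stream name
    Data=json.dumps(event).encode(),
    PartitionKey=event["user_id"],
)
```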

Amazon Data Firehose

Firehose is the managed delivery stream — producers write records, Firehose buffers and delivers to S3, Redshift, OpenSearch, or HTTP endpoints. Buffering hints: 1-15 minutes or 1-128 MB, whichever fills first. There is no consumer code; Firehose handles delivery, partitioning, format conversion (JSON to Parquet), and retries. The trade-off is the buffering delay — a minimum of 60 seconds before data lands in S3. Firehose is the right answer when you need raw events archived to S3 cheaply with no real-time processing.
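A hedged sketch of such a delivery stream — the ARNs are placeholders, and the buffering hints sit at the exam's 60-second/128 MB boundary values:

```python
import boto3

firehose = boto3.client("firehose")

# Hypothetical archive stream: buffer 60 s / 128 MB, Hive-partitioned prefix.
firehose.create_delivery_stream(
    DeliveryStreamName="clickstream-archive",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::111122223333:role/firehose-s3-role",  # placeholder
        "BucketARN": "arn:aws:s3:::team-ml-data",
        "Prefix": "raw/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/",
        "ErrorOutputPrefix": "raw-errors/",
        "BufferingHints": {"IntervalInSeconds": 60, "SizeInMBs": 128},
        "CompressionFormat": "GZIP",
    },
)
```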

Amazon Managed Service for Apache Flink

Flink is the right answer when you need stateful stream processing — windowed aggregations, session windows, joins between two streams, exactly-once semantics. For ML, this is feature computation: "compute the user's 5-minute click rate" requires a sliding window over the click stream, which Flink does natively. Flink reads from KDS or MSK, writes results to Feature Store online, S3, or another KDS.

KDS vs Firehose Decision

  • KDS when: sub-second latency, multiple consumers, custom processing logic, replay capability needed.
  • Firehose when: archive-to-S3 only, format conversion needed, buffering delay acceptable, no consumer code wanted.
  • Both together when: KDS is the primary stream, Firehose is a tee pipe writing to S3 archive while Flink does real-time processing.

Kinesis Data Firehose has a minimum buffering delay of 60 seconds. It is NOT a real-time pipeline. Engineers who confuse Firehose with Data Streams design "real-time fraud detection" with Firehose and discover they cannot react until at least 60 seconds after the transaction. Firehose buffers 1-128 MB or 1-15 minutes, whichever comes first. For real-time inference (sub-second decisions), you need KDS plus a consumer (Lambda, Flink, or KCL app). The MLA-C01 stem signal "real-time" or "sub-second" or "decision before the customer leaves the page" rules out Firehose. The signal "near-real-time archival" or "minute-level batches to S3" matches Firehose perfectly.

AWS Lake Formation — Centralized Data Lake Governance

Lake Formation is the access-control plane on top of Glue Data Catalog and S3. The MLA-C01 exam tests it as the answer to "how do multiple ML teams share data without duplication or chaos."

What Lake Formation Adds On Top Of S3 + Glue

S3 bucket policies and IAM policies let you grant access at the bucket and prefix level. Lake Formation adds table-level, column-level, row-level, and cell-level permissions on Glue Data Catalog tables. A data scientist can be granted SELECT on the users table excluding the pii_ssn column (column filter) and only rows where region = 'US' (row filter) — without writing custom views, without copying data.
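A sketch of the column-level grant with boto3 — the principal, database, and table names are hypothetical; row-level filters go through the separate create_data_cells_filter API:

```python
import boto3

lf = boto3.client("lakeformation")

# Hypothetical grant: SELECT on every column of `users` except pii_ssn.
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/ds-role"  # placeholder
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "ml_lake",
            "Name": "users",
            "ColumnWildcard": {"ExcludedColumnNames": ["pii_ssn"]},
        }
    },
    Permissions=["SELECT"],
)
```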

Lake Formation Tags (LF-Tags)

LF-Tags attach key-value metadata to databases, tables, and columns; permissions are then granted on tags rather than on individual resources. "Anyone with team=ml-platform and confidentiality=internal can read this column." This scales better than per-resource grants in large organizations.

Cross-Account Sharing Via Lake Formation

For cross-account ML data sharing, Lake Formation Resource Links pair with AWS Resource Access Manager (RAM) to share a Glue catalog table to another account without copying data. The other account creates a Resource Link, which appears as a local table; underlying S3 reads still go to the source bucket via Lake Formation's storage authorization layer.

Glue Data Catalog and Schema Evolution

Glue Data Catalog is the metadata backbone — every Athena query, every SageMaker training job that references a table, every Lake Formation permission resolves through the Catalog.

Glue Crawlers

A Crawler scans an S3 prefix (or JDBC source), infers schema from the file format, creates or updates the Glue table definition, and detects partitions. Schedule a Crawler nightly to catch new partitions and schema changes. Limitation: Crawlers can be flaky on heterogeneous data (mixed schemas in one prefix) — better practice is to enforce a single schema per table prefix.
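A minimal crawler definition via boto3, with placeholder names and a schema-change policy that logs deletions rather than silently dropping columns:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical nightly crawler over a single-schema prefix.
glue.create_crawler(
    Name="clickstream-crawler",
    Role="arn:aws:iam::111122223333:role/glue-crawler-role",  # placeholder
    DatabaseName="ml_lake",
    Targets={"S3Targets": [{"Path": "s3://team-ml-data/clickstream/v17/"}]},
    Schedule="cron(0 2 * * ? *)",  # 02:00 UTC nightly, catches new partitions
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",  # pick up added columns
        "DeleteBehavior": "LOG",                 # don't drop columns silently
    },
)
```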

Schema Evolution Patterns

  • Adding a column — Crawler picks it up, downstream Spark and Athena queries return null for older partitions.
  • Removing a column — Crawler keeps the old column in the table definition unless you delete and rerun; existing queries continue to work but may return null.
  • Renaming a column — destructive; consider this a new column and migrate data with a Glue ETL job.
  • Changing a type — risky; Athena's strict mode rejects the mismatch, while Spark may coerce silently. Always test before pushing to prod.

The Catalog As The Permission Anchor

The Glue table is the unit of permission in Lake Formation. Get the catalog right and downstream permission grants flow naturally; get it wrong and you end up with bucket-policy spaghetti.

High-Throughput Shared Storage — FSx for Lustre vs EFS

For very large training jobs reading the same dataset across many GPUs, S3 read throughput becomes a bottleneck. Two AWS services solve this differently.

Amazon FSx for Lustre

FSx for Lustre is a purpose-built parallel file system for HPC and ML workloads. Throughput scales linearly with provisioned capacity — 200 MB/s per TiB on persistent SSD, 1 GB/s per TiB on the high-performance tier. FSx for Lustre integrates with S3 as a data repository — files in S3 appear in the Lustre file system on first access (lazy load) or via batch hydrate. Submillisecond latency on reads after warmup. This is the right answer for large-scale distributed training where many GPUs read the same dataset.
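A sketch of pointing a SageMaker training job at a Lustre file system with the Python SDK — the file system ID and mount path are placeholders:

```python
from sagemaker.inputs import FileSystemInput

# Hypothetical IDs: a Lustre file system pre-hydrated from the S3 dataset.
train_input = FileSystemInput(
    file_system_id="fs-0123456789abcdef0",   # placeholder
    file_system_type="FSxLustre",
    directory_path="/fsx/clickstream/v17",   # placeholder mount path
    file_system_access_mode="ro",
)
# estimator.fit({"train": train_input}) -- the job's VPC config must reach the
# file system's subnet and security group.
```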

Amazon EFS

EFS is a general-purpose NFS file system for shared application access — good for SageMaker Studio home directories, notebook collaboration, multi-user workflows. Latency is single-digit milliseconds; throughput scales with stored data (Bursting mode) or is provisioned (Provisioned Throughput mode). EFS is NOT the right answer for high-throughput training — Lustre crushes it on parallel reads.

Decision Matrix

  • S3 with File mode — small to medium datasets, copied to instance ephemeral storage at training start. Default and cheapest.
  • S3 with Pipe mode — large datasets streamed during training, reduces startup time. Limited algorithm support.
  • S3 with FastFile mode — large datasets accessed via FUSE, lazy-loaded. The modern default.
  • FSx for Lustre — very large datasets (100s of GB to PB), distributed training across many GPUs, repeated training runs over the same data.
  • EFS — shared notebook storage, not for high-throughput training.

FSx for Lustre is the high-throughput, low-latency file system for SageMaker training; EFS is the general-purpose shared file system for notebooks and collaboration. The MLA-C01 trap is to treat them as interchangeable — they are not. Lustre throughput is 200-1000 MB/s/TiB; EFS bursting is single-digit to low-hundreds MB/s. For distributed training reading the same 5 TB dataset across 32 GPUs, Lustre completes a training run in hours where EFS takes days. The exam stem signal "very large dataset, distributed training, multiple GPUs reading the same data" maps to FSx for Lustre. "Shared notebook storage" or "SageMaker Studio home directory" maps to EFS.

SageMaker Training Input Modes — File, Pipe, FastFile

The mechanism by which SageMaker training jobs read S3 data has direct cost and latency implications.

File Mode

Default. SageMaker copies the entire dataset from S3 to the instance EBS volume before training starts. Simple, predictable, but slow startup for large datasets — a 500 GB dataset takes 30+ minutes to copy before training begins. Disk I/O after copy is local, fastest possible. Use for small to medium datasets.

Pipe Mode

SageMaker streams data from S3 directly into the algorithm via a Unix FIFO pipe. Training starts immediately; data is consumed as the algorithm reads. Saves the upfront copy time and EBS cost. Limitation: not all built-in algorithms support Pipe mode; XGBoost and Linear Learner do, but BlazingText and others may not. Custom containers must implement the Pipe protocol.

FastFile Mode

The modern default. Data is exposed via a POSIX-compliant FUSE mount; algorithm reads files as if local, but underlying reads stream from S3 on demand. Supports random access (which Pipe does not) and most algorithms. Almost universally the right answer for modern training jobs.

Choosing The Mode

  • File mode → small datasets, simplest path.
  • Pipe mode → very large sequential datasets, algorithm supports it.
  • FastFile mode → modern default for everything else.
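A minimal sketch of setting the mode with the SageMaker Python SDK — the image URI, role, and S3 path are placeholders; only input_mode changes between the three options above:

```python
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

# input_mode is set per channel: "File", "Pipe", or "FastFile".
train = TrainingInput(
    s3_data="s3://team-ml-data/clickstream/v17/",
    input_mode="FastFile",   # lazy POSIX reads from S3, training starts in seconds
)

estimator = Estimator(
    image_uri="<training-image-uri>",                     # placeholder
    role="arn:aws:iam::111122223333:role/sm-training",    # placeholder
    instance_count=1,
    instance_type="ml.m5.2xlarge",
)
estimator.fit({"train": train})
```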

Cross-Account and Cross-Region Data Access

Multi-account ML organizations need controlled, auditable data sharing.

IAM Roles For Cross-Account S3 Access

The producer account grants S3 access via a bucket policy referencing the consumer account's IAM role ARN. The consumer's SageMaker training role assumes (or directly uses) the granted permission. KMS keys must also grant the consumer's role kms:Decrypt. Add confused-deputy protection via aws:SourceArn and aws:SourceAccount conditions.
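A hedged sketch of the producer-side bucket policy — account IDs, role names, and prefixes are hypothetical:

```python
import json
import boto3

s3 = boto3.client("s3")

# Hypothetical policy granting the consumer account's SageMaker training role
# read access to one dataset prefix.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ConsumerTrainingRead",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::444455556666:role/sm-training"},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::team-ml-data",
                "arn:aws:s3:::team-ml-data/clickstream/*",
            ],
        }
    ],
}
s3.put_bucket_policy(Bucket="team-ml-data", Policy=json.dumps(policy))
# If the bucket uses SSE-KMS, the same consumer role also needs kms:Decrypt
# in the producer's KMS key policy -- the bucket policy alone is not enough.
```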

S3 Access Points For Multi-Tenant Buckets

S3 Access Points create named endpoints with their own policies. One bucket can have many access points, each scoped to a different consumer or VPC. Simpler than a 100-line bucket policy.

S3 VPC Endpoints

For SageMaker training jobs running in VPC mode (no internet access), an S3 Gateway VPC Endpoint is required to reach S3 without leaving the VPC. Without it, training jobs cannot read training data. The endpoint policy can restrict which buckets are reachable.
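A minimal sketch of creating the Gateway endpoint with boto3 — the VPC and route-table IDs are placeholders:

```python
import boto3

ec2 = boto3.client("ec2")

# A Gateway endpoint adds S3 routes to the given route tables, so in-VPC
# training jobs reach S3 without a NAT or internet gateway.
ec2.create_vpc_endpoint(
    VpcId="vpc-0abc1234",                        # placeholder
    VpcEndpointType="Gateway",
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0def5678"],              # placeholder
)
```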

Cross-Region Replication

For datasets needed in multiple regions (DR, regional ML deployments), S3 Cross-Region Replication (CRR) replicates objects asynchronously. RPO is typically minutes. Lifecycle and KMS keys must be configured per region.

Common Exam Traps For ML Data Ingestion

Trap 1 — Firehose Is Real-Time

Wrong. Minimum 60-second buffering. Use KDS for sub-second.

Trap 2 — FSx For Lustre And EFS Are Interchangeable

Wrong. Lustre is high-throughput parallel; EFS is general NFS. For distributed training, Lustre.

Trap 3 — Glacier Deep Archive For Active Training Data

Wrong. Restore time is hours. Use Glacier Instant Retrieval if archive pricing is needed.

Trap 4 — Pipe Mode Works For Every Algorithm

Wrong. Some built-in algorithms (BlazingText in certain modes, Object2Vec) do not support Pipe. FastFile support is broader.

Trap 5 — S3 Object Versioning Is Dataset Versioning

Wrong. Object versioning is anti-deletion; dataset versioning is reproducibility. Use prefix versioning or DVC.

Trap 6 — Cross-Account S3 Access Without KMS Grant

Wrong. Bucket policy alone is insufficient when SSE-KMS is enabled. Consumer's role also needs kms:Decrypt on the producer's KMS key.

Trap 7 — DataSync For Petabyte Migration

Inefficient. At 10 Gbps, DataSync moves 1 PB in ~10 days; at petabyte scale, physically shipping Snowball devices is faster.

ML Data Ingestion Architecture — operational sources (RDS, on-prem, IoT) flow via batch (Glue, DMS, Snowball, DataSync) and streaming (KDS, Firehose, Flink) paths into S3 Data Lake (governed by Lake Formation + Glue Catalog), with FSx for Lustre as a high-throughput cache layer for SageMaker training jobs.

Ingestion Decision Matrix — Service to Use-Case

  • Real-time fraud detection (sub-second) → Kinesis Data Streams + Lambda. Wrong-answer trap: Firehose (buffering delay).
  • Nightly training-data refresh from RDS → Glue ETL with bookmarks, or DMS CDC. Wrong-answer trap: Athena Federated Query.
  • 200 TB on-prem migration in 14 days → Snowball Edge. Wrong-answer trap: DataSync over the public internet.
  • Continuous DB replication to S3 → DMS CDC mode. Wrong-answer trap: scheduled Glue snapshots.
  • Streaming archive to S3 (tolerable lag) → Firehose to S3. Wrong-answer trap: KDS without a consumer.
  • Distributed training, 5 TB dataset, 32 GPUs → FSx for Lustre. Wrong-answer trap: EFS.
  • SageMaker Studio shared notebooks → EFS. Wrong-answer trap: FSx for Lustre.
  • Compliance archive, 7-year retention → S3 Glacier Deep Archive + Object Lock. Wrong-answer trap: Glacier Flexible without lock.

Key Numbers To Memorize For MLA-C01

  • Kinesis Data Streams shard: 1 MB/s in, 2 MB/s out, 1000 records/s
  • Firehose buffering: minimum 60 seconds, 1-128 MB
  • FSx for Lustre throughput: 200 MB/s/TiB persistent SSD, up to 1 GB/s/TiB
  • S3 Glacier Deep Archive: $1/TB/month, 12-48 hour retrieval
  • S3 Glacier Instant Retrieval: archive price, milliseconds retrieval
  • Snowball Edge capacity: 80 TB usable per device
  • Glue DPU: 4 vCPU + 16 GB memory, $0.44/DPU-hour standard

FAQ — ML Data Ingestion and Storage Top Questions

Q1 — Our team has 50 TB of historical clickstream data on-prem and we need it in S3 by Friday for a model retraining run. Network is 100 Mbps shared. What is the right ingestion path?

At 100 Mbps shared (assume an effective 40 Mbps), 50 TB takes 100+ days — completely unworkable. The right answer is AWS Snowball Edge, which physically ships an 80 TB device to your data center; you load it locally at 10 Gbps, ship it back, and AWS imports to S3. End-to-end is 7-10 days including shipping — it may slip past Friday, but it beats every network option by an order of magnitude. DataSync would also fail on the bandwidth constraint. Direct Connect would take weeks to provision before any data moved. The MLA-C01 stem signal "tens of TB or more, limited bandwidth, deadline in days-to-weeks" is the Snowball signature.

Q2 — Why does my SageMaker training job take 45 minutes to start before any actual training happens?

The most likely cause is File mode with a large dataset — SageMaker is copying the entire S3 dataset to the instance EBS volume before training begins. For a 500 GB dataset, this takes 30-60 minutes depending on instance type and S3 throughput. Switch to FastFile mode (POSIX FUSE mount, lazy load from S3) for the modern default, or Pipe mode if your algorithm supports it (XGBoost, Linear Learner, image classifier). FastFile is almost always the right answer; the algorithm sees a normal file path, but reads stream from S3 on demand and training starts in seconds.

Q3 — When should I choose FSx for Lustre over plain S3 access for SageMaker training?

FSx for Lustre is the right choice when you have (a) a very large dataset (hundreds of GB to PB), (b) distributed training across multiple instances reading the same data, and (c) repeated training runs over the same dataset (HPO sweeps, multiple experiments). The Lustre file system hydrates from S3 once, then serves reads at 200-1000 MB/s/TiB to many GPUs in parallel. For a single training run on a 100 GB dataset, S3 with FastFile mode is simpler and cheaper — the Lustre setup overhead and persistent storage cost only pay off at scale and with reuse. The MLA-C01 trap is to assume Lustre is always faster; for small workloads the S3 path wins on cost and simplicity.

Q4 — Our nightly Glue job reads from RDS and the DBA is complaining about query load. What's the alternative?

Use AWS DMS in CDC (change data capture) mode to replicate from RDS to S3. DMS reads the database transaction log, not the tables themselves, so the source DB takes near-zero query load — only the log reader thread runs. CDC streams inserts/updates/deletes to S3 as Parquet files in near-real-time. This both reduces source load and provides fresher data for ML retraining. The Glue job can still post-process the DMS output, but the heavy extract is now CDC. The MLA-C01 stem signal "minimize impact on production database" or "the DBA is unhappy" is DMS CDC.

Q5 — How do I version training datasets so I can reproduce a model trained six months ago?

Two reliable approaches. Approach 1: prefix versioning — write each dataset version to a unique S3 prefix (s3://bucket/datasets/clickstream/v17/) and lock the SageMaker training job input to that exact prefix. Simple, auditable, no extra tooling. Approach 2: DVC (Data Version Control) — DVC writes pointer files to Git, stores binary data in S3 with content-addressed paths, and gives you git checkout-style data versioning. Better when datasets evolve incrementally. Do NOT rely on S3 Object Versioning alone — it protects against accidental delete but does not provide first-class dataset lineage. SageMaker Pipelines records the input S3 URI per run; lock the URI to a versioned prefix and you have provable training reproducibility.

Q6 — What's the difference between Kinesis Data Streams and Kinesis Data Firehose for ML pipelines?

Kinesis Data Streams (KDS) is the foundational stream — sub-second latency, multiple consumers, replay support, custom processing logic. Use for real-time inference triggers, real-time feature updates to Feature Store online, fraud-detection pipelines. Kinesis Data Firehose is a managed delivery service — buffers 1-15 minutes or 1-128 MB, then writes to S3, Redshift, OpenSearch, or HTTP. No consumer code, no replay, minimum 60-second latency. Use for archiving raw events to S3 cheaply, format conversion (JSON to Parquet), and ETL prep. The most common architecture: KDS as the live stream with two consumers — Flink for real-time feature computation, Firehose tee'd off to S3 for archive. The MLA-C01 stem signal "real-time" rules out Firehose; "archive to S3" rules in Firehose.

Further Reading — Official AWS Documentation

For depth beyond MLA-C01 scope, the authoritative AWS sources are: SageMaker Developer Guide (training data access section), Amazon Kinesis Developer Guides (Data Streams, Data Firehose, Managed Service for Apache Flink), AWS Glue Developer Guide (Data Catalog, ETL jobs, crawlers), AWS Lake Formation Developer Guide, Amazon FSx for Lustre User Guide (S3 integration), Amazon S3 User Guide (storage classes, lifecycle, Object Lock), and the Machine Learning Lens of the AWS Well-Architected Framework. The AWS Big Data Blog and the AWS Machine Learning Blog have multi-year archives of customer ingestion architecture posts that mirror MLA-C01 stem patterns.
