examlab.net · The most efficient path to the most valuable certifications.

SageMaker Feature Store — Online vs Offline

4,000 words · ≈ 20 min read

Master SageMaker Feature Store for MLA-C01 Domain 1 Tasks 1.2 and 1.3 — online store sub-millisecond reads vs offline store S3 Parquet, feature group schema with record identifier and event time, point-in-time joins to prevent training-serving skew, TTL on the online store, cross-account sharing, and the high-frequency online-vs-offline exam traps that confuse most candidates.


SageMaker Feature Store is the centralized, purpose-built repository that holds machine learning features for both training and inference, and on the MLA-C01 exam it is one of the most heavily tested services in Domain 1 (Data Preparation for Machine Learning, 28 percent weight). Community study guides from Tutorials Dojo, ExamCert.App, and K21 Academy all flag the same pain point — candidates conflate the online store with the offline store, and AWS plants a steady stream of scenario questions that hinge on choosing the right one for the right access pattern. The wrong choice on the exam is also the wrong choice in production: pick the offline store for real-time fraud scoring and your model misses its 50ms SLA; pick the online store for training-set generation and you blow up your bill and build the wrong dataset.

This guide is built for the ML engineer perspective. It covers what SageMaker Feature Store is, why both stores exist, the schema rules every feature group must follow, how to ingest features through the four supported paths, how to read for inference vs read for training, the point-in-time join mechanics that prevent training-serving skew, TTL and feature reuse patterns, cross-account sharing, encryption, and the canonical online-vs-offline exam traps that catch most candidates. By the end, the online-vs-offline distinction should feel as natural as choosing a hot kitchen workstation versus a walk-in pantry.

What Is SageMaker Feature Store?

SageMaker Feature Store is a fully managed, purpose-built repository for storing, sharing, and serving machine learning features. It exists because in any production ML system, the same feature — say, "customer's 30-day rolling purchase count" — needs to be available in two very different access patterns. During model training, you batch-read millions of historical feature rows from S3 to fit a model; during online inference, you single-key lookup the latest feature values for one customer in under 10 milliseconds to score a real-time fraud prediction. Without Feature Store, ML teams build separate pipelines for each access pattern and inevitably drift them out of sync — the dreaded training-serving skew where the model trained on one feature definition is served features computed by a slightly different definition.

Why a Centralized Feature Store Exists

Before Feature Store, every team rebuilt the same features in their own pipelines. Customer 30-day purchase count was computed in the fraud team's Glue job, the recommendation team's PySpark job, and the marketing team's Redshift query — three implementations, three subtle bugs, three sources of skew. Feature Store solves this by being the single source of truth: features are computed once, written to the store, and read by every model that needs them. Reuse cuts compute cost, removes drift, and turns features into shareable assets across teams.

The Two-Store Architecture

Every Feature Store feature group has two physical storage backends: the online store for low-latency single-record reads, and the offline store for high-throughput bulk historical reads. You can enable one, the other, or both — most production setups enable both because training and inference both happen for the same model. The online store is a managed in-memory or NoSQL key-value backend optimized for GetRecord calls returning the latest record for a record identifier in a few milliseconds. The offline store is your own S3 bucket holding append-only Parquet files partitioned by event time, optimized for Athena and Spark queries against historical feature snapshots.

Plain-Language Explanation: SageMaker Feature Store

The two-store split is the kind of concept where naming alone does not convey the trade-off. Three concrete analogies make the structure stick.

Analogy 1 — The Restaurant Kitchen and Walk-In Pantry

Picture a high-volume restaurant. The chef has two storage areas. On the line, within arm's reach, is the mise en place station — chopped onions, pre-portioned protein, sauces in squeeze bottles, all cold but instantly accessible. When an order ticket comes in for table seven, the chef grabs ingredients in seconds and plates the dish to the customer's 10-minute SLA. Behind the kitchen, in the walk-in pantry, is the bulk inventory — fifty-pound flour bags, crates of produce, vacuum-sealed proteins for the next two weeks. When the chef preps tomorrow's mise en place, they walk into the pantry and pull bulk ingredients to chop, portion, and stage on the line.

The mise en place station is the online store — small, fast, expensive per cubic foot, holds only the freshest batch of each ingredient. The walk-in pantry is the offline store — large, slow to access, cheap per cubic foot, holds every ingredient from every delivery for the full retention period. Cooking a single order is single-record online inference; preparing tomorrow's prep list is batch dataset building from the offline store. If the chef tries to cook a single order from the walk-in (offline-store-only), service is unbearably slow. If the chef tries to prep tomorrow's stations from mise en place (online-store-only), there is not enough ingredient on the line. Both stores exist because both access patterns exist, and Feature Store is what keeps the two synchronized so the chef always uses the same recipe.

Analogy 2 — The Library Reference Desk and Stacks

Picture a research library. The reference desk at the front holds a small collection of high-demand recent reference books — encyclopedias, current journals, the most-cited monographs. A patron walks up, asks a question, and the librarian hands them an answer in seconds. The stacks in the basement hold thirty years of accumulated material — every back issue, every superseded edition, every PhD dissertation. A graduate student writing a thesis spends a day in the stacks pulling sixty sources to build a literature review.

The reference desk is the online store — instant lookup, latest editions only, expensive shelf space. The stacks are the offline store — slow but cheap, complete historical record, perfect for batch retrieval. Real-time inference is "patron at the desk needs an answer now"; training-set construction is "graduate student needs every paper that ever discussed feature X with timestamp Y." If the librarian only had stacks, every patron would wait an hour. If the librarian only had a reference desk, no thesis could be written. Both shelves exist because both reading patterns exist, and the cataloguing system (Feature Store's record identifier and event time) is what lets a researcher reconstruct what the reference desk looked like on any given date in the past — exactly the point-in-time join semantics that prevent training-serving skew.

Analogy 3 — The Hospital Patient Bedside Chart vs Medical Records Archive

Picture a hospital. At the patient's bedside, a bedside chart holds the latest vital signs, current medications, and the active care plan — refreshed every hour by the nurse, instantly readable when the doctor walks in for rounds. In the basement, the medical records archive holds every chart for every patient for the past twenty years — every blood test, every prescription, every surgery note. When researchers do an outcome study or build a clinical risk model, they pull tens of thousands of historical records.

The bedside chart is the online store — current state only, sub-second access, supports clinical decision-making at the bedside. The archive is the offline store — append-only history, supports analytics, training a sepsis-risk model, and quality audits. The catch: when the researchers train a sepsis model on archived data, they must reconstruct what the bedside chart looked like at each historical decision point — not what we know now in hindsight, but what the doctor actually saw at 3am on that night. Reading the future into the past is label leakage, and point-in-time joins are the discipline that prevents it. SageMaker Feature Store offline reads with point-in-time join return only feature values whose event time was earlier than the prediction time — the bedside chart as of that exact moment, not as of today.

Feature Group Schema — The Building Block

A feature group is a collection of features that describe one entity, like Customer or Product or Transaction. Every feature group has a strict schema that the exam expects you to know cold.

Required Fields — Record Identifier and Event Time

Every feature group must declare exactly two structural fields. The record identifier is the unique key for a record — typically customer_id, product_id, or transaction_id. Online store reads use the record identifier as the key for GetRecord. The event time is a timestamp marking when the feature value was true — used by the offline store to maintain the append-only history and by point-in-time join to filter features to the correct as-of moment. Both fields are mandatory. You cannot omit either at feature group creation.

Feature Definitions — Type System

Beyond record identifier and event time, you declare each feature with a name and a type. Feature Store supports String, Integral, Fractional, and (with newer SDK versions) collection types. Type mismatches at ingestion fail the put — a record with a String value where the schema declared Integral is rejected. Schema evolution is supported through UpdateFeatureGroup to add new features, but you cannot change an existing feature's type or remove a feature without recreating the group.

Online Store Configuration — Standard vs In-Memory

When you enable the online store, you choose between Standard (the default, backed by a managed key-value service, single-digit millisecond reads, suited for most workloads) and In-Memory (backed by ElastiCache for Redis, sub-millisecond reads, for the most latency-sensitive use cases like high-frequency trading or real-time bidding). In-Memory costs more and is overkill for most fraud and recommendation workloads — exam questions about "lowest possible latency" point at In-Memory; questions about "real-time inference under 10ms" are answered by Standard.

Offline Store Configuration — Your S3 Bucket

The offline store writes to an S3 bucket you specify, in a Parquet format partitioned by year, month, day, and hour of event time. AWS Glue automatically registers the offline store as a table in the Data Catalog so Athena and Spark can query it. Storage cost is your S3 bill; query cost is your Athena scan cost. The IAM role passed at feature group creation must have s3:PutObject to the bucket and kms:GenerateDataKey if SSE-KMS is configured.
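The schema and store configuration above can be sketched as a single CreateFeatureGroup call. This is a minimal sketch using boto3 parameter names; the feature group name, feature names, bucket, and role ARN are illustrative placeholders, not values from this article:

```python
def build_feature_group_request(name: str, s3_uri: str, role_arn: str) -> dict:
    """Assemble a CreateFeatureGroup request body (boto3 parameter names).
    All feature names here are illustrative placeholders."""
    return {
        "FeatureGroupName": name,
        # The two mandatory structural fields -- fixed at creation time.
        "RecordIdentifierFeatureName": "customer_id",
        "EventTimeFeatureName": "event_time",
        "FeatureDefinitions": [
            {"FeatureName": "customer_id", "FeatureType": "String"},
            {"FeatureName": "event_time", "FeatureType": "Fractional"},
            {"FeatureName": "purchase_count_30d", "FeatureType": "Integral"},
            {"FeatureName": "avg_basket_value", "FeatureType": "Fractional"},
        ],
        # Standard storage type is the default; "InMemory" targets sub-ms SLAs.
        "OnlineStoreConfig": {"EnableOnlineStore": True},
        # Offline store lands in your S3 bucket and is auto-registered in Glue.
        "OfflineStoreConfig": {"S3StorageConfig": {"S3Uri": s3_uri}},
        # Role needs s3:PutObject (and kms:GenerateDataKey if SSE-KMS is set).
        "RoleArn": role_arn,
    }

# Usage (requires AWS credentials):
# import boto3
# boto3.client("sagemaker").create_feature_group(**build_feature_group_request(
#     "customer_features",
#     "s3://my-feature-bucket/offline/",
#     "arn:aws:iam::111122223333:role/FeatureStoreRole"))
```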

Every SageMaker feature group requires exactly two structural fields — record identifier (the unique key) and event time (the timestamp the value was true). These fields are not optional and cannot be added or changed after creation. The record identifier drives GetRecord lookups in the online store and join keys in the offline store; the event time drives append-only history in the offline store and point-in-time join semantics. On the MLA-C01 exam, any answer that proposes a feature group without one of these two fields, or that proposes a different name like "primary key" or "timestamp" without the structural meaning, is wrong.

Online Store — Sub-Millisecond Reads For Inference

The online store is the side of Feature Store that real-time inference uses.

Latency Characteristics

The Standard online store delivers reads in single-digit milliseconds for a single record. The In-Memory online store delivers reads in sub-millisecond. Both are designed for the inference path: a request comes in for customer 12345, the model serving code calls GetRecord(feature_group_name="customer_features", record_identifier="12345"), gets back the latest values for every feature in the group, runs the model, returns a prediction.

What "Latest" Means In The Online Store

The online store keeps only the most recent record per record identifier. If you PutRecord with customer 12345 at 10:00 and again at 11:00, a GetRecord at 11:30 returns the 11:00 record only. The 10:00 record still exists in the offline store (which is append-only), but the online store discards superseded values. This is the right behavior for inference — you want the freshest customer state to feed the model, not the history.

GetRecord vs BatchGetRecord

GetRecord returns one record for one record identifier from one feature group. BatchGetRecord returns up to 100 records across multiple feature groups in a single call — useful when an inference request needs features from Customer, Product, and Transaction groups all at once. Both return in milliseconds; BatchGetRecord avoids the latency overhead of issuing sequential GetRecord calls.
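A sketch of the two read paths, assuming the boto3 sagemaker-featurestore-runtime client; the feature group names and record identifiers are made up for illustration:

```python
def build_batch_get_request(lookups: dict) -> dict:
    """Build a BatchGetRecord request body from a mapping of
    {feature_group_name: [record_identifier, ...]} (placeholder names)."""
    return {
        "Identifiers": [
            {"FeatureGroupName": fg, "RecordIdentifiersValueAsString": ids}
            for fg, ids in lookups.items()
        ]
    }

# Usage against the online store (requires AWS credentials):
# import boto3
# rt = boto3.client("sagemaker-featurestore-runtime")
# one = rt.get_record(FeatureGroupName="customer_features",
#                     RecordIdentifierValueAsString="12345")
# many = rt.batch_get_record(**build_batch_get_request(
#     {"customer_features": ["12345"], "product_features": ["sku-9"]}))
```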

Online Store TTL — Expiring Stale Features

Each feature group can configure a record-level TTL. Records older than the TTL are automatically expired from the online store but not from the offline store (which retains everything). TTL is the right way to age out customer state for inactive customers and reduce online store cost — set it to 30 or 90 days for typical retail-fraud use cases. The exam loves this configuration: a question of the form "online store keeps growing, how do we control cost without losing training history" is answered by TTL on the online store, retaining offline.
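The TTL lives in the online store configuration and can be set at creation or added later. A hedged sketch of the UpdateFeatureGroup request shape — the TtlDuration field follows the current boto3 API, so confirm it against your SDK version:

```python
def build_ttl_update(feature_group_name: str, ttl_days: int) -> dict:
    """UpdateFeatureGroup body that sets an online-store TTL.
    TTL expires online records only; offline history is untouched."""
    return {
        "FeatureGroupName": feature_group_name,
        "OnlineStoreConfig": {
            "TtlDuration": {"Unit": "Days", "Value": ttl_days}
        },
    }

# Usage (requires AWS credentials):
# import boto3
# boto3.client("sagemaker").update_feature_group(
#     **build_ttl_update("customer_features", 90))
```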

What The Online Store Is Not

The online store is not a feature store query engine. You cannot run a SQL query against the online store, you cannot filter records by feature value, you cannot join across feature groups in one call. It is a key-value lookup. Anything more complex requires going to the offline store, or it requires pulling records into your application memory and filtering there.

Offline Store — Historical Snapshots For Training

The offline store is the side of Feature Store that training and analytics use.

Storage Format And Layout

The offline store writes Parquet files to an S3 prefix you specify, partitioned by year=YYYY/month=MM/day=DD/hour=HH/ based on event time. Files are append-only — every PutRecord appends a new row, even if the same record identifier already has rows. This append-only nature is what enables historical queries and point-in-time joins.

Reading From The Offline Store

You read the offline store like any S3-backed table. Athena queries the Glue Data Catalog table that Feature Store auto-registers. Spark in EMR or Glue reads the Parquet directly. SageMaker Processing jobs read into Spark or pandas DataFrames. The query is your responsibility — Feature Store does not provide a query layer beyond the standard S3 + Athena + Glue stack.
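As a sketch, here is the "latest row per entity" Athena query most teams run first against the auto-registered Glue table. Table and column names are placeholders, and the sagemaker_featurestore database name in the usage comment is an assumption to verify in your account:

```python
def latest_snapshot_query(table: str, record_id_col: str, event_time_col: str) -> str:
    """Athena SQL: latest row per record identifier in the offline store.
    In practice you may also want to filter on the is_deleted flag that
    Feature Store writes alongside your features."""
    return f"""
SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY {record_id_col}
               ORDER BY {event_time_col} DESC
           ) AS rn
    FROM {table}
)
WHERE rn = 1
"""

# Usage (requires AWS credentials; database and output location are assumptions):
# import boto3
# boto3.client("athena").start_query_execution(
#     QueryString=latest_snapshot_query(
#         "customer_features_1700000000", "customer_id", "event_time"),
#     QueryExecutionContext={"Database": "sagemaker_featurestore"},
#     ResultConfiguration={"OutputLocation": "s3://my-athena-results/"})
```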

Build Dataset API — Convenience Layer

For training-set construction, the SageMaker Python SDK provides a create_dataset builder that wraps Athena queries, joins multiple feature groups by record identifier and event time, and applies point-in-time semantics. The builder writes the result to S3 as a single CSV or Parquet ready for Estimator.fit(). Use it for typical training-set creation; for complex feature engineering, write your own Athena or Spark logic against the underlying tables.
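A hedged sketch of the builder flow just described — the method names (create_dataset, with_feature_group, point_in_time_accurate_join, to_csv_file) follow the SageMaker Python SDK v2 documentation, so verify them against your installed version; the column names are placeholders:

```python
def build_training_csv(feature_store, labels_df, feature_group, output_s3_uri):
    """Join a label DataFrame (customer_id, label_time, label columns
    assumed) to a feature group with point-in-time semantics and write
    a CSV to S3. `feature_store` is expected to behave like
    sagemaker.feature_store.feature_store.FeatureStore."""
    builder = feature_store.create_dataset(
        base=labels_df,
        record_identifier_feature_name="customer_id",
        event_time_identifier_feature_name="label_time",
        output_path=output_s3_uri,
    )
    builder = builder.with_feature_group(feature_group)
    builder = builder.point_in_time_accurate_join()  # event_time < label_time
    csv_path, athena_query = builder.to_csv_file()
    return csv_path
```

The returned path can be fed straight to an Estimator's training channel.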

What The Offline Store Is Not

The offline store is not real-time. Records ingested are not immediately visible — there is a buffering and delivery delay typically measured in minutes. If your model trains on data from minutes ago and that is acceptable, fine. If you need real-time analytics, the offline store is the wrong tool — pair it with a streaming pipeline that writes to both Feature Store and a real-time analytics destination separately.

Ingesting Features — The Four Paths

Features can be written through four supported paths, each with a different latency and throughput profile.

Path 1 — PutRecord API (Real-Time, Synchronous)

The PutRecord API call writes a single record synchronously. Both stores receive the record (online immediately, offline after buffering). Use this from a streaming consumer, a Lambda function, or any application code that needs immediate feature availability. Throughput limit is per-second per-account; for very high throughput, use batch ingestion instead.
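A sketch of the synchronous write path, assuming the boto3 runtime client. The feature group and feature names are placeholders, and sending epoch seconds for event_time assumes a Fractional-typed event time feature:

```python
import time

def build_put_record(feature_group_name: str, features: dict) -> dict:
    """PutRecord body: every value travels as a string, and an
    event_time feature (epoch seconds) is appended automatically."""
    record = [{"FeatureName": k, "ValueAsString": v} for k, v in features.items()]
    record.append({"FeatureName": "event_time",
                   "ValueAsString": str(time.time())})
    return {"FeatureGroupName": feature_group_name, "Record": record}

# Usage (requires AWS credentials):
# import boto3
# rt = boto3.client("sagemaker-featurestore-runtime")
# rt.put_record(**build_put_record("customer_features", {
#     "customer_id": "12345", "purchase_count_30d": "7"}))
```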

Path 2 — Batch Ingestion Via SageMaker Processing

For large historical backfills or scheduled bulk recomputation, run a SageMaker Processing job that reads source data, computes features, and writes to Feature Store via the SDK or directly to the offline S3 bucket. The Processing pattern handles billions of rows efficiently and is what teams use for periodic feature refreshes.

Path 3 — Data Wrangler Export

SageMaker Data Wrangler flows can export their final transformed dataset directly to a feature group. The export job creates a Processing job under the hood and writes to both online and offline stores. This is the canonical path for a data scientist building features interactively in Data Wrangler and pushing them into the production feature store with no separate engineering step.

Path 4 — Direct Offline Store Write Plus Iceberg Or Glue

For very large-scale ingestion that bypasses the API limit, you can write Parquet files directly to the offline S3 bucket using the same partition layout. Newer Feature Store versions support Apache Iceberg as the offline format, which adds ACID semantics and performant time-travel queries. Direct writes skip the online store; if you need both, run an additional PutRecord flow for the online side or a hybrid pattern.

PutRecord writes to both online and offline stores when both are enabled, but with different visibility timing — online is immediate, offline has a buffering delay typically in minutes. Engineers building monitoring dashboards or expecting "what I just wrote is queryable now" misread this and chase phantom bugs. The correct mental model is: online store reflects writes within milliseconds and is the truth for inference; offline store reflects writes after buffering and is the truth for training and historical analysis. If a real-time inference path needs to see freshly written features, that path must read from the online store, never the offline store. Confusing the two timing semantics is one of the most common production bugs in Feature Store deployments and a popular MLA-C01 exam trap.

Point-In-Time Joins — Preventing Training-Serving Skew

Point-in-time join is the single most important concept in Feature Store and the highest-yield exam topic.

What Training-Serving Skew Is

Training-serving skew is the bug where a model trained on data prepared one way is served data prepared a different way at inference time. The classic skew: at training, you computed 30_day_purchase_count as of the day after the label was observed (so the count includes the labeled purchase itself), but at inference you compute it as of the day before the prediction. The model trained on artificially future-aware features and at inference is starved of that signal — performance crashes in production.

What Label Leakage Is

A close cousin: label leakage is when a feature value at training time encodes information from after the label was observed. If 30_day_purchase_count for a fraud-labeled transaction includes the fraud transaction itself, the count is a perfect predictor of fraud — but only at training time, where the label is known. At inference the count cannot include the unlabeled transaction (it has not happened yet from the model's perspective).

How Point-In-Time Join Solves Both

Point-in-time join enforces the rule: when you build a training set joining a label table to a feature group, only feature values whose event time is strictly less than the label's event time are eligible. The offline store's append-only history makes this query feasible — for each label row, the join scans the feature history for that record identifier, filters to event time < label time, and picks the latest qualifying row. The result is a feature value at training that exactly matches what would have been the latest value at inference time for that historical moment. Skew gone, leakage gone.

Implementing Point-In-Time Join In Athena Or Build Dataset

The Build Dataset API does point-in-time join automatically when you provide the label table and specify the join key plus event time column. In raw Athena, you write a correlated subquery: for each row in the label table, select the feature row with the maximum event time less than the label time. The Iceberg-backed offline store has native time-travel SQL syntax that simplifies the query to SELECT ... FOR TIMESTAMP AS OF <label_time>.
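The window-function formulation of the point-in-time join can be sketched as follows. Table names, the label column, and the purchase_count_30d feature are illustrative placeholders, not schema from this article:

```python
def point_in_time_join_query(label_table: str, feature_table: str,
                             join_key: str, label_time_col: str,
                             feature_time_col: str) -> str:
    """Athena/Trino SQL for a point-in-time join: for each label row,
    keep only the latest feature row strictly older than the label."""
    return f"""
SELECT {join_key}, {label_time_col}, label, purchase_count_30d
FROM (
    SELECT l.{join_key} AS {join_key},
           l.{label_time_col} AS {label_time_col},
           l.label AS label,
           f.purchase_count_30d AS purchase_count_30d,
           ROW_NUMBER() OVER (
               PARTITION BY l.{join_key}, l.{label_time_col}
               ORDER BY f.{feature_time_col} DESC
           ) AS rn
    FROM {label_table} l
    JOIN {feature_table} f
      ON f.{join_key} = l.{join_key}
     AND f.{feature_time_col} < l.{label_time_col}  -- strictly before: no leakage
)
WHERE rn = 1
"""
```

The strict inequality in the join condition is the entire anti-leakage guarantee; replacing it with a naive equi-join reproduces the trap described above.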

Reading from the offline store without point-in-time join produces label leakage and inflates training metrics that crash in production. Engineers regularly write SELECT * FROM feature_table JOIN label_table USING (customer_id) thinking the join is correct — it is not. That join attaches today's feature value to a historical label, leaking future information. The correct join restricts feature event time to strictly before the label event time. If your training accuracy is suspiciously high and production accuracy is much lower, point-in-time join violation is the first thing to check. The MLA-C01 exam plants this pattern as an answer-pair: one option uses a naive join, another uses MAX(event_time) WHERE event_time < label_time semantics. The second is the right answer; the first is a leakage trap.

Feature Reuse Across Teams And Models

The reuse story is what makes Feature Store worth running in the first place.

Same Features, Many Models

Customer 30-day purchase count is a useful feature for fraud, churn, and recommendation models. Without Feature Store, three teams compute it three ways. With Feature Store, the data engineering team computes it once, writes to the customer feature group, and all three modeling teams GetRecord it. The cost saving compounds across hundreds of features and dozens of models in a mature ML organization.

Feature Discovery

The SageMaker Studio Feature Store browser lists every feature group in the account with searchable metadata, schema, and statistics. Teams discover existing features rather than rebuild duplicates — the discoverability is the second most important benefit after consistency.

Feature Lineage

Feature Store integrates with SageMaker ML Lineage Tracking, which records which feature group was read by which training job, which model the training produced, and which endpoint the model was deployed to. Lineage queries answer "which models depend on this feature" before a feature is deprecated or its computation logic changes.

Cross-Account Feature Sharing

In organizations with multiple AWS accounts, Feature Store supports cross-account access patterns that the exam asks about.

Resource Policies On Feature Groups

Feature groups support resource policies granting sagemaker:GetRecord, sagemaker:BatchGetRecord, sagemaker:PutRecord, and read access to the offline store to other AWS accounts. The pattern: the data engineering account owns the feature group and attaches the cross-account resource policy granting access; the modeling teams in separate accounts attach IAM policies to their own principals allowing the same actions.

Offline Store Cross-Account S3

The offline store S3 bucket is shared via a standard cross-account S3 bucket policy, with the consuming account granting its own principals S3 read permission. Glue Data Catalog cross-account access via Lake Formation completes the picture for Athena queries from another account.

Encryption Across Accounts

KMS keys used to encrypt the online store and offline store must have key policies allowing the consuming accounts to decrypt. If the key policy is missing the consumer account ID, cross-account access fails with opaque KMS access-denied errors. The exam plants this as a "feature group is shared but the other team cannot read" troubleshooting question — the answer is the KMS key policy.
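A sketch of the usual missing piece — a key-policy statement granting the consumer account decrypt rights. The Sid, account ID, and the choice to grant the account root principal are illustrative; merge a statement like this into the key's existing policy rather than replacing it:

```python
def cross_account_kms_statement(consumer_account_id: str) -> dict:
    """Key-policy statement letting a consumer account decrypt feature
    data. Sid and principal scope are placeholders to tighten for
    production."""
    return {
        "Sid": "AllowFeatureStoreConsumerDecrypt",
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{consumer_account_id}:root"},
        "Action": ["kms:Decrypt", "kms:DescribeKey"],
        "Resource": "*",
    }

# Usage: append to the key policy's Statement list, then PutKeyPolicy.
# import json
# print(json.dumps(cross_account_kms_statement("222233334444"), indent=2))
```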

Encryption And Security

Both stores encrypt at rest by default with AWS-managed keys. For compliance workloads, configure customer-managed KMS keys.

Online Store Encryption

The online store accepts a KMS key at feature group creation. All records written to the online store are encrypted with that key. Reads decrypt automatically for principals with kms:Decrypt on the key.

Offline Store Encryption

The offline store S3 bucket can use SSE-S3 or SSE-KMS. SSE-KMS is the recommended option for centralized key management. The IAM role used by the feature group's offline store writer must have kms:GenerateDataKey on the key.

Network Isolation

For highly regulated workloads, configure the SageMaker Studio domain or notebook instance to access Feature Store via VPC endpoints (com.amazonaws.<region>.sagemaker.featurestore-runtime for online and com.amazonaws.<region>.sagemaker.api for control plane). VPC endpoint policies can restrict which feature groups are accessible from a given VPC.

Always pass an explicit KMS key and IAM role to feature group creation in production — do not rely on the AWS-managed key. Customer-managed KMS keys give you key-policy-level control over who can decrypt feature data, audit access via CloudTrail KMS events, rotate keys on a schedule, and revoke access by disabling the key in an incident. The AWS-managed default lacks all of these controls. The cost of a customer-managed key is one dollar per month plus per-API-call charges — trivial compared to the security and audit benefit. Exam questions about "how do we restrict access to PII features at the cryptographic layer" are answered by customer-managed KMS keys with feature-group-specific key policies.

Common Exam Traps For SageMaker Feature Store

The MLA-C01 exam plants a consistent set of traps around Feature Store. Memorize all seven.

Trap 1 — Online Store For Training Set Generation

A scenario asks "we need to build a training dataset of one million rows from the customer feature group." Wrong answer: paginated GetRecord calls against the online store. Right answer: query the offline store via Athena. The online store cannot return historical records, only the latest, and even with paging the cost and latency are wrong.

Trap 2 — Offline Store For Real-Time Inference

A scenario asks "real-time fraud scoring needs feature lookup in under 10ms." Wrong answer: Athena query the offline store. Right answer: GetRecord on the online store. Athena queries take seconds, not milliseconds.

Trap 3 — Online Store TTL Affects The Offline Store

A candidate assumes setting TTL on the online store also expires offline records. Wrong. The offline store is append-only and ignores TTL. TTL only expires online records. This is intentional — you want to age out hot inference state while keeping the full training history.

Trap 4 — Naive Join Without Point-In-Time Semantics

The leakage trap covered above. Any answer that joins feature group to label table without filtering event time is wrong.

Trap 5 — Adding A Required Field After Creation

A candidate proposes "we forgot to add event time, let us update the feature group." Wrong. Record identifier and event time are structural and cannot be added or changed after CreateFeatureGroup. The correct path is to recreate the feature group, ingest historical data into the new group, and migrate consumers.

Trap 6 — Standard vs In-Memory Online Store Confusion

A scenario asks "we need sub-millisecond reads for high-frequency trading." Right answer is In-Memory online store backed by ElastiCache. Standard online store delivers single-digit millisecond reads, which is enough for most workloads but not for the strictest latency SLAs.

Trap 7 — Forgetting Cross-Account KMS

A scenario describes a feature group shared across accounts with the resource policy granting access, but the consumer cannot read. Right answer is the KMS key policy on the encryption key — cross-account share requires KMS key policy plus IAM plus resource policy, all three.

Online store equals latest record only, millisecond-scale reads (sub-millisecond with In-Memory), suited for inference; offline store equals append-only history, S3 Parquet, suited for training and analytics. They share the same record identifier and event time but answer different questions. This is the one sentence to memorize for every Feature Store question on MLA-C01. If the scenario word is "real-time," "single record," "latency-sensitive," or "inference," answer online. If the scenario word is "historical," "training set," "Athena," "Spark," "millions of rows," or "analytics," answer offline. If the scenario word is "consistent across both," answer "use both stores enabled with PutRecord writing to both." The two-store architecture is what Feature Store is — the entire service is built around managing the consistency between them.

Key Numbers And Must-Memorize Feature Store Facts

Feature Group Schema

  • Required: record identifier and event time (cannot be changed after creation)
  • Feature types: String, Integral, Fractional, plus collection types in newer SDKs
  • Schema evolution: add features yes, change types or remove features no

Online Store

  • Standard: single-digit millisecond GetRecord latency
  • In-Memory (ElastiCache for Redis backed): sub-millisecond GetRecord latency
  • Stores latest record only per record identifier
  • TTL configurable to expire stale records
  • BatchGetRecord for cross-feature-group reads in one call

Offline Store

  • S3 Parquet, partitioned by year/month/day/hour of event time
  • Append-only history
  • Buffering delay from PutRecord, typically minutes
  • Athena and Spark queryable via auto-registered Glue table
  • Iceberg format option for ACID and time travel

Ingestion Paths

  1. PutRecord API (real-time, synchronous)
  2. SageMaker Processing batch ingestion
  3. Data Wrangler flow export
  4. Direct offline-store write plus Iceberg or Glue catalog

Point-In-Time Join

  • Filter feature event time strictly less than label event time
  • Prevents training-serving skew and label leakage
  • Build Dataset API does this automatically

Cross-Account Sharing

  • Feature group resource policy
  • Offline store S3 cross-account bucket policy
  • KMS key policy granting decrypt to consumer accounts
  • All three required for cross-account read

FAQ — SageMaker Feature Store Top Questions

Q1 — When should I enable only the online store, only the offline store, or both?

Enable both as the default — most production ML systems need real-time inference and batch training, and the cost of both stores together is small relative to the cost of training-serving skew bugs caused by maintaining two separate pipelines. Enable only the online store if your features are derived from streaming sources and you have a separate analytics warehouse that does not need a copy of the same features (rare). Enable only the offline store if the feature group is purely for training-set construction and is never read at inference (for example, demographic features used only at training in a batch-scoring system that pre-computes predictions). When in doubt, enable both — Feature Store is designed around the assumption that both stores are present, and many SDK convenience APIs require both.

Q2 — How do I avoid training-serving skew when using Feature Store?

The mechanical answer: always use point-in-time join when building training sets, and always use the same feature group at inference that you used at training. The discipline answer: write the feature computation logic once, in one place, and have it write to Feature Store via PutRecord — do not duplicate the logic in a training pipeline and an inference pipeline. The Build Dataset API enforces point-in-time semantics by default; if you write your own Athena queries, validate them by computing the same feature at a known historical timestamp and comparing to what the offline store returns. Skew detection in production uses SageMaker Model Monitor's data quality monitor — feature distribution drift between training and inference is the production-side signal that skew is happening despite your best efforts.

Q3 — How do I expire old data from the online store without losing training history?

Configure record-level TTL on the feature group's online store. TTL applies only to the online store; the offline store is append-only and ignores TTL completely. Set TTL based on how stale a feature can be before inference reads should treat it as missing — 30 days for typical retail customer state, 90 days for slower-moving features, 1 day for very hot session-level features. The offline store retains every record regardless, so training-set construction at any historical horizon still works. This is the recommended pattern for cost control on the online store, and exam questions about "online store cost is growing, how do we control it without breaking training" are answered by TTL.
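As a sketch, TTL is configured through the `TtlDuration` field of `OnlineStoreConfig` on `CreateFeatureGroup` or `UpdateFeatureGroup`; the feature group name below is hypothetical:

```python
# TtlDuration accepts Unit values of Seconds, Minutes, Hours, Days, or Weeks.
# It expires records from the online store only -- the offline store is
# untouched, which is exactly why training history survives.
ttl_update = {
    "FeatureGroupName": "customer-state",  # hypothetical feature group name
    "OnlineStoreConfig": {
        "TtlDuration": {"Unit": "Days", "Value": 30}
    },
}
# With AWS credentials configured, this would be applied via:
# boto3.client("sagemaker").update_feature_group(**ttl_update)
```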

Q4 — How do I share a feature group with a model team in another AWS account?

Three permission surfaces must all align. First, the feature group resource policy in the owner account grants the consumer account principal access to sagemaker:GetRecord, sagemaker:BatchGetRecord, and other relevant actions. Second, the offline store S3 bucket policy grants the consumer account read access to the bucket and prefix. Third, the KMS key policies on both online and offline encryption keys grant kms:Decrypt to the consumer account. Finally, the consumer account must attach identity-based IAM policies giving its own principals permission to call those actions. Forgetting the KMS key policy is the most common bug — the resource policy looks correct, IAM looks correct, but reads fail with a KMS access denied error.
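The most commonly forgotten piece, the KMS key policy statement, might look like the sketch below. The consumer account ID is a hypothetical placeholder:

```python
CONSUMER_ACCOUNT = "222222222222"  # hypothetical consumer account id

# Without a statement like this on the feature group's KMS key, cross-account
# reads fail with an AccessDenied KMS error even when the feature group
# resource policy and the consumer's IAM policies are both correct.
kms_decrypt_statement = {
    "Sid": "AllowConsumerAccountDecrypt",
    "Effect": "Allow",
    "Principal": {"AWS": f"arn:aws:iam::{CONSUMER_ACCOUNT}:root"},
    "Action": ["kms:Decrypt", "kms:DescribeKey"],
    "Resource": "*",  # in a key policy, "*" means this key itself
}
```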

Q5 — What ingestion pattern should I use for streaming features computed from a Kinesis stream?

A Lambda or Kinesis Data Analytics consumer that calls PutRecord per event is the canonical pattern. PutRecord writes to both online and offline stores when both are enabled, so the online side is updated within milliseconds for real-time inference. For very high throughput streams (tens of thousands of records per second), batch the events and use a Spark Structured Streaming job on EMR or Glue that writes via the SageMaker Feature Store Spark connector — the connector batches PutRecord calls efficiently and avoids hitting per-second account limits. The trade-off is latency versus throughput: pure Lambda gives the lowest latency at moderate throughput, while Spark gives the highest throughput at slightly higher latency. Either path uses Feature Store as the destination, not a side-by-side custom store.
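A minimal sketch of the Lambda consumer, assuming a hypothetical feature group name and JSON event payloads whose keys match the feature names; the decode step is factored out so it can be unit-tested without AWS:

```python
import base64
import json

FEATURE_GROUP = "session-features"  # hypothetical feature group name

def kinesis_event_to_records(event: dict) -> list:
    """Decode a Kinesis Lambda event (base64-encoded JSON payloads) into
    the Record lists expected by put_record."""
    records = []
    for rec in event["Records"]:
        payload = json.loads(base64.b64decode(rec["kinesis"]["data"]))
        records.append(
            [{"FeatureName": k, "ValueAsString": str(v)} for k, v in payload.items()]
        )
    return records

def handler(event, context):
    # boto3 imported inside the handler so the decode logic above stays
    # testable without AWS credentials
    import boto3
    runtime = boto3.client("sagemaker-featurestore-runtime")
    for record in kinesis_event_to_records(event):
        # one PutRecord per event: online store fresh within milliseconds,
        # offline store receives a copy when both stores are enabled
        runtime.put_record(FeatureGroupName=FEATURE_GROUP, Record=record)
```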

Q6 — How do I prevent label leakage when training a model on offline-store data?

Use point-in-time join semantics: for each label row at time T, only include feature values whose event time is strictly less than T. The Build Dataset API in the SageMaker Python SDK does this automatically when you supply a label table with an event time column and the feature groups to join. If you write Athena queries directly, the pattern is SELECT ... FROM labels l LEFT JOIN features f ON l.id = f.id AND f.event_time < l.event_time followed by selecting the maximum-event-time feature row per label. The Iceberg-backed offline store simplifies this with native FOR TIMESTAMP AS OF syntax. Validate by spot-checking: pick a known historical label, compute what feature values existed at that label's time, compare to what your training set has — they must match exactly.
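Spelled out as a full Athena query, the pattern described above looks roughly like this. Table names `labels` and `features`, the identifier `id`, and the column `feature_value` are hypothetical placeholders for your Glue catalog tables:

```python
# Point-in-time join over the offline store: for each label row, keep only
# the latest feature row whose event time is strictly before the label's.
POINT_IN_TIME_QUERY = """
WITH joined AS (
    SELECT l.id,
           l.label,
           l.event_time AS label_time,
           f.event_time AS feature_time,
           f.feature_value,
           ROW_NUMBER() OVER (
               PARTITION BY l.id, l.event_time
               ORDER BY f.event_time DESC) AS rn
    FROM labels l
    LEFT JOIN features f
      ON l.id = f.id
     AND f.event_time < l.event_time   -- strictly before the label: no leakage
)
SELECT * FROM joined WHERE rn = 1      -- latest feature row as of each label
"""
```

The `ROW_NUMBER ... WHERE rn = 1` step is the "maximum-event-time feature row per label" selection; the strict `<` in the join condition is what prevents leakage.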

Q7 — Can I change the schema of an existing feature group?

Partially. You can add new features to an existing feature group via UpdateFeatureGroup, and the new features default to null for historical records. You cannot change the type of an existing feature, remove a feature, change the record identifier, or change the event time field — these are structural and require recreating the feature group. The recreation path: create a new feature group with the desired schema, run a backfill Processing job that reads from the old group's offline store and writes to the new group, update consumer applications to point at the new group, and decommission the old group after a transition period. This migration is non-trivial and is one reason exam guides emphasize getting the schema right at creation time.
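The additive path can be sketched as an `UpdateFeatureGroup` request with `FeatureAdditions`; the feature group and feature names below are hypothetical, while the `FeatureType` values (`String`, `Integral`, `Fractional`) are the three types Feature Store supports:

```python
# Additive-only schema change: the new features are appended and read as
# null for all historical records. Types cannot be changed and features
# cannot be removed -- that requires recreating the feature group.
feature_additions = {
    "FeatureGroupName": "customer-state",  # hypothetical name
    "FeatureAdditions": [
        {"FeatureName": "loyalty_tier", "FeatureType": "String"},
        {"FeatureName": "days_since_signup", "FeatureType": "Integral"},
    ],
}
# With AWS credentials configured, this would be applied via:
# boto3.client("sagemaker").update_feature_group(**feature_additions)
```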

Further Reading — Official AWS Documentation For Feature Store

The authoritative AWS sources are: the SageMaker Feature Store overview page (concepts, two-store architecture, when to use), the Create Feature Group documentation (schema rules, online vs offline configuration, KMS), the Ingest Data documentation (PutRecord, batch ingestion, Data Wrangler, Spark connector), the Build Datasets documentation (point-in-time join, Build Dataset API), and the cross-account access documentation (resource policies, S3 bucket policy, KMS).

The AWS Machine Learning Blog has multiple deep-dive posts on Feature Store including streaming ingestion patterns, feature reuse case studies, and the In-Memory online store. The AWS Well-Architected ML Lens covers Feature Store in the data preparation phase. The SageMaker Examples GitHub repository contains end-to-end notebooks showing PutRecord, GetRecord, point-in-time join, and cross-account sharing in working code. Finally, the SageMaker Python SDK documentation has API references for the Feature Store classes that map directly to the concepts on the MLA-C01 exam — read the SDK source if a particular configuration parameter behavior is unclear in the documentation.
