Amazon Redshift is the AWS analytics warehouse, and on the DEA-C01 exam it shows up across all four domains in scenarios that hinge on cluster sizing, distribution and sort key choice, materialized view design, querying S3 in place via Spectrum, federated queries to operational databases, and the constant cost-versus-performance trade-off between RA3 provisioned and Serverless. Community study guides from Tutorials Dojo, Cesar Cordoba's AWS in Plain English series, and Collin Smith's December 2024 retrospective all flag the same pain points: candidates pick DC2 for new workloads when RA3 is the modern default, assume Redshift's PRIMARY KEY constraints are enforced (they are not — this is the most cited Redshift trap), and conflate Spectrum with Athena even though they target different cost profiles. The wrong choice on the exam is the wrong choice in production: deploy DC2 for a 100 TB warehouse and pay for compute you do not need plus storage you cannot scale; assume foreign keys are enforced and the warehouse silently accumulates orphan rows that crash dashboards months later.
This guide is built for the data engineer perspective. It covers Redshift's MPP architecture (leader and compute nodes, slices), RA3 managed storage versus the older DC2 node family, Redshift Serverless and when to choose it, distribution styles and sort keys, materialized views with auto-refresh, Redshift Spectrum for querying S3 directly, federated query for live joins to RDS and Aurora, data sharing across clusters and accounts, why constraints are metadata-only, and the canonical exam traps. By the end the RA3-versus-DC2-versus-Serverless decision should feel as natural as choosing the right truck for a hauling job.
Redshift Architecture — MPP, Leader, And Compute Nodes
Redshift is a massively parallel processing (MPP) columnar warehouse.
The Two-Tier Node Architecture
A Redshift cluster has one leader node plus one to many compute nodes. The leader receives SQL queries from clients, parses and optimizes them, generates a query plan, and distributes the plan to compute nodes. The leader itself does not store user data. Compute nodes hold the data, execute the distributed plan, and return intermediate results to the leader, which assembles the final result and returns it to the client.
Slices Within Compute Nodes
Each compute node is divided into slices. A slice is a logical processing unit that owns a portion of the data and executes plan fragments in parallel. The number of slices per node depends on the node type — typical RA3 nodes have several slices each. Distribution style determines how rows map to slices, and the right distribution maximizes parallelism across all slices.
Columnar Storage And Compression
Redshift stores data column by column on disk, with per-column compression. Selecting two columns from a fifty-column fact table reads two columns of data, not all fifty. Combined with zone maps (per-1MB-block min/max statistics for skip-on-read) and run-length encoding (for sorted columns), Redshift achieves the cost-and-performance characteristics that make MPP warehouses competitive with specialized OLAP engines.
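As a concrete, minimal sketch of per-column encoding (the table, columns, and encoding choices below are illustrative, not a recommendation):

```sql
-- Hypothetical event table with explicit per-column encodings.
-- AZ64 suits numeric and date/time columns, ZSTD suits long text,
-- RUNLENGTH suits low-cardinality columns; the leading sort key
-- column is left RAW so zone-map pruning stays effective.
CREATE TABLE web_events (
    event_date   DATE          ENCODE raw,
    event_type   VARCHAR(20)   ENCODE runlength,
    user_id      BIGINT        ENCODE az64,
    page_url     VARCHAR(512)  ENCODE zstd,
    duration_ms  INTEGER       ENCODE az64
)
SORTKEY (event_date);

-- After an initial load, ANALYZE COMPRESSION samples the rows and
-- reports the encodings Redshift would recommend per column.
ANALYZE COMPRESSION web_events;
```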
Concurrency Scaling And Burst Capacity
When the primary cluster is busy, Redshift can transparently spin up additional read clusters to handle queued read queries — concurrency scaling. The first hour per day per cluster is free; beyond that you pay per second. Use concurrency scaling for unpredictable read spikes; do not rely on it for steady-state load.
Plain-Language Explanation: Redshift RA3 And Materialized Views
Redshift's architectural choices benefit from concrete analogies.
Analogy 1 — The Warehouse With Forklifts And A Dispatcher
Picture a large warehouse fulfilling Amazon orders. The dispatcher at the front desk receives orders, decides which forklift drivers to assign each task, and consolidates results when forklifts return. The dispatcher does not pick items personally — they coordinate. That is the leader node.
The forklift drivers each have an assigned aisle. When the dispatcher hands out an order ("retrieve items 12, 34, and 56"), drivers in the relevant aisles fetch their items in parallel. That is the compute nodes, with each aisle being a slice. Distribution style is how items are placed in aisles — distribute by SKU and queries about one SKU hit one aisle (good for point lookups but bad for scans), distribute evenly and queries hit all aisles (good for scans but join-unfriendly), distribute by joined column and items that join together end up in the same aisle (best for joins).
RA3 managed storage is the equivalent of a warehouse where overflow inventory is kept in a regional storage facility ten miles away — drivers work mostly from local aisles but can pull from the regional facility when needed, automatically and without driver intervention. DC2 is a warehouse where every item must fit in the local aisles — once aisles are full, you must rent a bigger warehouse. RA3 lets you scale storage independently from compute; DC2 couples them.
Analogy 2 — The Restaurant With Pre-Plated Specials And Cooked-To-Order
Picture a restaurant with a daily special. The kitchen pre-plates the specials every morning during prep — the same dish, made once, ready to go out instantly when ordered. That is a materialized view: an expensive aggregate query computed once and stored, returning instantly when re-queried. The cost is morning prep time and refrigerator space; the benefit is sub-second customer service at peak.
Auto-refresh materialized views are like a kitchen automation system that re-pre-plates the special whenever the underlying ingredients change — new beef arrives, the morning special is regenerated automatically without the chef issuing a command. Incremental refresh is the optimization where only the changed batches are recomputed, not the full dish from scratch — if only the sauce changed, replate just the sauce, leave the protein and sides intact.
Analogy 3 — The Library With Open Stacks And Off-Site Archive
Picture a research library. The open stacks in the building hold actively used material — fast access, finite shelf space. An off-site archive ten miles away holds older or rarely accessed material — slower access (call it in, wait an hour for delivery), unlimited capacity. RA3 managed storage works like this: the SSD on each compute node is the open stacks (fast cache for hot data), the underlying S3-backed RMS is the off-site archive (cheap, unlimited, slightly slower). The system automatically promotes hot blocks to local SSD and demotes cold blocks to S3. Storage scales independently of how many compute nodes you provision.
Redshift Spectrum is a different concept — instead of bringing data into the library at all, it lets the librarian read books from a partner library across the street without checking them out. You pay per book read (per terabyte scanned). Spectrum is for cold data that you do not want to load into the warehouse but you still want to query through the same SQL interface.
RA3 Versus DC2 Versus Redshift Serverless
The node choice is the biggest exam decision.
RA3 Node Types And Managed Storage
RA3 nodes (ra3.xlplus, ra3.4xlarge, ra3.16xlarge) decouple compute from storage. Storage is Redshift Managed Storage (RMS) — backed by S3 with hot data cached on local SSD on each compute node. You scale compute by changing node count or size; storage scales automatically based on data volume. Pay separately for compute (per-hour per-node) and managed storage (per-GB-month, similar to S3 Standard pricing).
RA3 is the modern default. Use it for any new cluster larger than a few terabytes, for any workload where storage and compute scale at different rates, and for any workload that needs cross-cluster data sharing (a feature only RA3 supports).
DC2 Node Types
DC2 nodes (dc2.large, dc2.8xlarge) are the older compute-and-storage-coupled family. Each node has fixed local SSD; total cluster storage equals sum of node storage. Cheaper per-node than RA3, but you must scale nodes to scale storage even if you do not need more compute. Use DC2 only for small steady-state workloads under a few terabytes where the cost saving justifies the coupling.
Redshift Serverless
Serverless removes cluster management entirely. You define a workgroup (compute capacity in RPUs — Redshift Processing Units) and a namespace (data and configuration). The workgroup auto-scales between minimum and maximum RPU based on workload. Pay per RPU-hour, billed per second while the workgroup is processing queries (with a 60-second minimum charge).
Use Serverless for: unpredictable workloads where idle time is significant, ad hoc analytics where you cannot forecast capacity, dev/test environments. Do not use Serverless for: steady high-throughput workloads (provisioned RA3 is cheaper at fully utilized capacity), workloads that need specific node-level configuration, very latency-sensitive workloads where the cold start of a paused workgroup is unacceptable.
Decision Tree
For new workloads: start with RA3 if the warehouse is over a few terabytes and steady-state. Start with Serverless if the workload is bursty or unpredictable. Use DC2 only if a small fixed-cost steady warehouse fits all access patterns. Migrate from DC2 to RA3 at the next maintenance window for any growing warehouse — the storage decoupling alone usually pays for the migration.
RA3 nodes decouple compute from storage by using Redshift Managed Storage (RMS) — a tiered storage layer backed by S3 with hot data cached on local SSD — letting compute and storage scale independently, while DC2 nodes couple storage to compute and force you to add nodes to add storage. RA3 is the modern default for any warehouse over a few terabytes. RA3 also enables features that DC2 does not support: cross-cluster data sharing (live data accessible from other Redshift clusters without copying), AQUA (Advanced Query Accelerator), and Redshift Serverless integration. The DEA-C01 exam plants this as a "design choice for a growing data warehouse" question — RA3 is almost always the right answer; DC2 is appropriate only for niche small-and-stable workloads.
Distribution Styles And Sort Keys
How data is laid out on disk determines query performance.
Distribution Styles
Four styles: EVEN (round-robin across slices, good for tables that do not join), KEY (rows with the same value of a column go to the same slice, optimal for joins on that column), ALL (every node holds a full copy, optimal for small dimension tables joined to many fact tables), AUTO (Redshift decides — default for new tables, switches between EVEN, KEY, and ALL based on observed query patterns).
When To Use Each Distribution Style
KEY when two large tables join frequently on the same column — for example, a fact table joined to its largest dimension on that dimension's key. ALL when a small dimension table (under a few million rows) joins to many fact tables. EVEN when the table does not join (purely scanned for aggregates). AUTO is the safe default if you cannot predict the workload.
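A hedged sketch of the three explicit styles on hypothetical tables (AUTO needs no clause because it is the default):

```sql
-- Large fact table: distribute on the column it joins on most, so
-- matching rows from both sides land on the same slice.
CREATE TABLE orders (
    order_id     BIGINT,
    customer_id  BIGINT,
    order_date   DATE,
    amount       DECIMAL(12,2)
)
DISTSTYLE KEY DISTKEY (customer_id);

-- Small dimension joined to many fact tables: replicate to every node.
CREATE TABLE dim_region (
    region_id    INTEGER,
    region_name  VARCHAR(64)
)
DISTSTYLE ALL;

-- Staging table that is only scanned, never joined: spread rows evenly.
CREATE TABLE clickstream_staging (
    event_ts  TIMESTAMP,
    payload   VARCHAR(8192)
)
DISTSTYLE EVEN;
```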
Sort Keys — Compound Versus Interleaved
A sort key determines the on-disk row order, which lets the zone maps skip blocks during predicate evaluation. Compound sort key (the default) sorts rows by the listed columns in order — first by column A, ties broken by B, then C. Optimal when queries filter on a leading prefix of the sort key. Interleaved sort key weights columns equally — optimal when queries filter on any of the sort key columns with similar frequency, but expensive to maintain (requires VACUUM REINDEX periodically).
Sort Key Selection
Pick a sort key on the column most queries filter by — usually a date or timestamp column for time-series fact tables, plus a high-cardinality dimension for filter-heavy workloads. The exam plants compound versus interleaved as a question of query pattern: leading-column predicates equal compound; multi-column equal-frequency predicates equal interleaved.
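A short sketch contrasting the two forms on hypothetical tables:

```sql
-- Compound sort key: time-series fact table where most queries filter
-- on the leading date column, sometimes narrowing further by region.
CREATE TABLE sales_fact (
    sale_date   DATE,
    region_id   INTEGER,
    product_id  BIGINT,
    amount      DECIMAL(12,2)
)
COMPOUND SORTKEY (sale_date, region_id);

-- Interleaved sort key: ad hoc table filtered on any of three columns
-- with similar frequency; needs periodic VACUUM REINDEX to stay effective.
CREATE TABLE customer_activity (
    region_id    INTEGER,
    product_id   BIGINT,
    activity_ts  TIMESTAMP
)
INTERLEAVED SORTKEY (region_id, product_id, activity_ts);

VACUUM REINDEX customer_activity;
```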
Materialized Views — Pre-Computed Aggregates
Materialized views (MVs) are stored query results that refresh on demand or automatically.
Why Materialized Views Exist
A dashboard query that joins a billion-row fact table to ten dimension tables and computes weekly aggregates takes minutes to run from raw data. The same query against a materialized view that has the aggregates pre-computed returns in milliseconds. MVs trade write-time refresh cost for read-time query speed.
Refresh Modes
Manual refresh: explicit REFRESH MATERIALIZED VIEW command. Auto-refresh: Redshift refreshes the view automatically in the background when underlying tables change. Incremental refresh: only the rows affected by changes are recomputed (supported when the MV definition sticks to constructs such as inner joins, GROUP BY, and common aggregate functions — not for all SQL constructs). Full refresh recomputes from scratch (always supported, more expensive).
Auto-Refresh And Auto-Rewrite
Auto-refresh keeps MVs current automatically. Automatic query rewrite rewrites incoming queries to use a matching MV without the user explicitly referencing the view — Redshift detects that a query's pattern matches an MV and substitutes the MV transparently. Combined, the user writes plain SQL against base tables and Redshift transparently uses MVs when faster.
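A minimal sketch, reusing the hypothetical sales_fact table from the sort key example above; the view name and aggregation are illustrative:

```sql
-- Materialized view with auto-refresh: Redshift keeps it current in
-- the background as the base table changes.
CREATE MATERIALIZED VIEW mv_weekly_sales
AUTO REFRESH YES
AS
SELECT DATE_TRUNC('week', sale_date) AS sale_week,
       region_id,
       SUM(amount) AS total_amount,
       COUNT(*)    AS order_count
FROM   sales_fact
GROUP  BY 1, 2;

-- Manual refresh is still available when you need the view current now.
REFRESH MATERIALIZED VIEW mv_weekly_sales;

-- With automatic query rewrite, this plain base-table query can be
-- served from the MV transparently when the planner finds a match.
SELECT DATE_TRUNC('week', sale_date) AS sale_week, SUM(amount)
FROM   sales_fact
GROUP  BY 1;
```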
MV Constraints And Limitations
MVs cannot reference external tables (Spectrum), system tables, or other materialized views. MVs that use OUTER JOINs or certain functions are not eligible for incremental refresh and fall back to full refresh. The exam plants this as "materialized view query is unexpectedly slow" — the answer is often that the MV is doing full refreshes because its SQL pattern blocks incremental refresh.
Redshift Spectrum — Querying S3 In Place
Spectrum extends the warehouse to S3 without loading.
What Spectrum Is
Spectrum is a fleet of AWS-managed compute resources that scan S3 in parallel and feed results back to your Redshift cluster's compute nodes for join, aggregation, and result return. You query S3 data through standard SQL using external schemas and external tables defined in the Glue Data Catalog. Pay per terabyte scanned (similar to Athena pricing) plus your normal Redshift cluster cost.
When To Use Spectrum Versus COPY
Use COPY to load data into Redshift cluster storage when: data is queried frequently, you want best query performance, queries join the data heavily with cluster-resident tables. Use Spectrum when: data is queried infrequently, the data volume is too large to load efficiently, you want to keep cold data on S3 for cost.
A common hybrid pattern: hot data in Redshift cluster (last 90 days), cold data on S3 queried via Spectrum (past 7 years), unified query using UNION ALL of cluster table and external table.
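A hedged sketch of the hybrid pattern. It assumes a Parquet external table named sales_history already exists in the Glue Data Catalog (the schema name, IAM role ARN, and table names are placeholders; the next subsection sketches how such a table can be defined):

```sql
-- External schema backed by the Glue Data Catalog.
CREATE EXTERNAL SCHEMA spectrum_sales
FROM DATA CATALOG
DATABASE 'sales_lake'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Hot/cold query: last 90 days from the cluster-resident table,
-- older history read in place from S3 via Spectrum.
SELECT sale_date, SUM(amount) AS total_amount
FROM (
    SELECT sale_date, amount
    FROM   sales_fact
    WHERE  sale_date >= DATEADD(day, -90, CURRENT_DATE)
    UNION ALL
    SELECT sale_date, amount
    FROM   spectrum_sales.sales_history
    WHERE  sale_date < DATEADD(day, -90, CURRENT_DATE)
) unified
GROUP BY sale_date;
```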
Spectrum Performance Optimization
Spectrum benefits from the same data lake optimizations as Athena: partitioning, columnar formats (Parquet preferred), compression, file size in the 128 MB to 1 GB range. Adding partitioning to a Spectrum table can reduce scan size and cost by an order of magnitude.
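A sketch of what a partitioned, Parquet-backed external table can look like, including the partition registration step (bucket paths and names are placeholders):

```sql
-- Partitioned external table over Parquet files on S3.
CREATE EXTERNAL TABLE spectrum_sales.sales_history (
    sale_date  DATE,
    region_id  INTEGER,
    amount     DECIMAL(12,2)
)
PARTITIONED BY (sale_year VARCHAR(4))
STORED AS PARQUET
LOCATION 's3://example-bucket/sales_history/';

-- Register a partition so queries filtering on sale_year skip
-- every other year's files entirely.
ALTER TABLE spectrum_sales.sales_history
ADD IF NOT EXISTS PARTITION (sale_year = '2024')
LOCATION 's3://example-bucket/sales_history/sale_year=2024/';
```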
Spectrum Versus Athena
Spectrum and Athena both query S3 with SQL. Differences: Spectrum requires a Redshift cluster (no cluster, no Spectrum), Athena is serverless. Spectrum integrates with Redshift cluster joins (you can join external tables to cluster-resident tables in one query); Athena cannot join to Redshift directly. Pricing is similar — both per-TB-scanned. Use Spectrum when you have a Redshift cluster anyway and need to integrate S3 data with cluster data; use Athena when you do not have or do not need a Redshift cluster.
Federated Query — Live Joins To RDS And Aurora
Federated query lets Redshift query operational databases live without ETL.
What Federated Query Is
Configure a Redshift external schema pointing to an RDS or Aurora database (PostgreSQL or MySQL). Queries against the external schema are pushed down to the operational database, results stream back to Redshift, and the data is joined with cluster-resident tables in real time. No ETL, no data copy.
When To Use Federated Query
For dashboards that need the freshest possible operational data joined with historical analytics — for example, current open-orders count from the operational database joined with last-90-day order trends from the warehouse. Avoid for high-throughput queries (the operational database is not sized for analytics scans) or for queries that need to scan most of the operational data (load it instead).
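A minimal sketch of the setup and a live join; the endpoint, IAM role, Secrets Manager ARN, and table names are placeholders:

```sql
-- External schema pointing at an Aurora PostgreSQL operational database.
CREATE EXTERNAL SCHEMA ops_orders
FROM POSTGRES
DATABASE 'orders' SCHEMA 'public'
URI 'orders-db.cluster-abc123.us-east-1.rds.amazonaws.com' PORT 5432
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftFederatedRole'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:ops-orders-credentials';

-- Live open-order counts from the operational database joined with
-- pre-aggregated warehouse history (mv_weekly_sales from earlier).
SELECT w.region_id,
       SUM(w.total_amount)     AS trailing_sales,
       MAX(o.open_order_count) AS open_orders_now
FROM   mv_weekly_sales w
JOIN  (SELECT region_id, COUNT(*) AS open_order_count
       FROM   ops_orders.orders
       WHERE  status = 'OPEN'
       GROUP  BY region_id) o
       ON o.region_id = w.region_id
GROUP  BY w.region_id;
```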
Federated Query Versus Spectrum
Federated query goes to RDS or Aurora; Spectrum goes to S3. Federated query reads live operational data; Spectrum reads cold data lake files. They are complementary. The exam plants questions where the wrong answer suggests Spectrum for an operational database scenario; the right answer is federated query.
Data Sharing — Cross-Cluster And Cross-Account
Redshift Data Sharing on RA3 lets multiple consumers read live data without copy.
How Data Sharing Works
A producer cluster creates a datashare containing specific schemas, tables, or views. The producer grants the datashare to one or more consumer clusters or AWS accounts. Consumers attach the datashare to their cluster and query the shared objects as if they were local. Reads on the consumer side hit the producer's storage directly — no replication, no staleness.
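A hedged end-to-end sketch; the datashare, schema, table, and database names and the namespace GUIDs are placeholders:

```sql
-- On the producer cluster: create the share, add objects, grant it
-- to the consumer's namespace.
CREATE DATASHARE sales_share;
ALTER DATASHARE sales_share ADD SCHEMA analytics;
ALTER DATASHARE sales_share ADD TABLE analytics.daily_orders;
GRANT USAGE ON DATASHARE sales_share
TO NAMESPACE 'aaaaaaaa-1111-2222-3333-bbbbbbbbbbbb';

-- On the consumer cluster: surface the shared objects as a database
-- and query them with three-part names.
CREATE DATABASE sales_shared
FROM DATASHARE sales_share
OF NAMESPACE 'cccccccc-4444-5555-6666-dddddddddddd';

SELECT * FROM sales_shared.analytics.daily_orders LIMIT 10;
```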
Cross-Account And Cross-Region
Data sharing supports cross-account (with AWS account ID grant) and cross-region (with the same datashare attached in any region). Cross-region replicates the datashare metadata; reads still happen against the producer's home-region storage with cross-region latency.
Data Sharing Use Cases
Multi-tenant analytics where each tenant has their own consumer cluster but reads from a shared producer. Centralized data team produces; downstream teams in separate accounts consume. ETL jobs in one cluster, BI dashboards in another.
Redshift Data Sharing on RA3 enables live, read-only access to producer-cluster data from consumer clusters or accounts without copying or replicating data — consumers read directly from producer storage with sub-second freshness. Data Sharing is RA3-only and is the modern alternative to UNLOAD-and-COPY-based replication patterns. Cross-account sharing requires the producer to grant the consumer account ID and the consumer to authorize the share. Data Sharing eliminates ETL pipelines whose only purpose was to copy data from one cluster to another. The DEA-C01 exam plants Data Sharing as the right answer for "share a single source of truth across multiple Redshift consumers" scenarios — UNLOAD-COPY pipelines are the wrong answer because they introduce staleness.
Redshift Constraints — The Non-Enforced Trap
Redshift's constraint behavior is the most-cited exam trap.
Constraints Are Metadata Only
PRIMARY KEY, UNIQUE, and FOREIGN KEY constraints exist in Redshift's syntax and are stored in the system catalog, but they are not enforced during INSERT, UPDATE, or COPY operations. You can insert duplicate primary keys, you can insert orphan foreign-key values, you can violate unique constraints — Redshift will not stop you.
Why Constraints Exist Despite Not Being Enforced
The query planner uses constraints as optimizer hints. If you declare a primary key, the planner trusts that values are unique and chooses a more efficient join plan. If your data violates the constraint, queries return wrong results because the planner's optimization assumed something that is false. So you must enforce constraints in your ETL pipeline, then declare them in the schema for the planner.
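A short sketch of the failure mode and the ETL-side fix; the tables, the staging_orders source, and its load_ts column are hypothetical:

```sql
-- Redshift accepts both inserts even though order_id is the declared
-- primary key: the constraint is metadata only.
CREATE TABLE orders_dim (
    order_id  BIGINT PRIMARY KEY,
    status    VARCHAR(16) NOT NULL
);
INSERT INTO orders_dim VALUES (1, 'OPEN');
INSERT INTO orders_dim VALUES (1, 'OPEN');   -- succeeds, no error raised

-- ETL-side deduplication before merging into the target: keep one row
-- per declared key using ROW_NUMBER() over the staged batch.
CREATE TEMP TABLE orders_dedup AS
SELECT order_id, status
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY order_id
                              ORDER BY load_ts DESC) AS rn
    FROM   staging_orders
) ranked
WHERE rn = 1;
```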
NOT NULL Is Enforced
Among the constraints, only NOT NULL is enforced. INSERTs of NULL into a NOT NULL column fail. Use NOT NULL liberally; treat PRIMARY KEY and FOREIGN KEY as documentation plus optimizer hints, with actual integrity enforced in upstream ETL.
Amazon Redshift PRIMARY KEY, UNIQUE, and FOREIGN KEY constraints are metadata only — they are not enforced during INSERT, UPDATE, or COPY operations and the query planner trusts them blindly, so violating data leads to incorrect query results. Only NOT NULL is enforced. Engineers from PostgreSQL, MySQL, Oracle, or SQL Server backgrounds assume Redshift behaves the same way and skip ETL deduplication, only to discover months later that aggregate dashboards report inflated counts because primary keys are duplicated. The DEA-C01 exam plants this as a single-question gotcha: "Redshift dashboard shows incorrect aggregate counts after recent loads — what is the cause?" The answer is duplicate primary keys not caught by ETL because Redshift does not enforce them. The fix is upstream deduplication in the ETL job, plus optionally adding a runtime UNIQUE check via Glue Data Quality or a downstream validation query.
Common Exam Traps For Redshift RA3 And Materialized Views
The DEA-C01 exam plants a consistent set of traps. Memorize all seven.
Trap 1 — Constraints Enforced
Covered above. The most-cited Redshift trap. PRIMARY KEY, UNIQUE, FOREIGN KEY are metadata only.
Trap 2 — DC2 For New Workloads
A scenario describes designing a new warehouse and lists DC2 as the default. Right answer: RA3 for any non-trivial workload. DC2 is legacy.
Trap 3 — Spectrum For Operational Database Joins
A scenario asks how to join operational RDS data with warehouse data. Wrong answer: Spectrum (Spectrum reads S3, not RDS). Right answer: federated query.
Trap 4 — Materialized View With Full Refresh
A scenario describes a slow MV and the user expects it to be incremental. The answer: check if the SQL pattern allows incremental refresh — outer joins, certain functions, and references to other MVs force full refresh. Refactor the MV definition or accept the full-refresh cost.
Trap 5 — Serverless Always Cheaper
A scenario describes a steady high-throughput workload and asks if Serverless is cheaper. Right answer: provisioned RA3 is cheaper for steady high-utilization workloads; Serverless is cheaper for bursty or unpredictable workloads.
Trap 6 — Data Sharing On DC2
A scenario asks about cross-cluster data sharing on a DC2 cluster. Wrong because Data Sharing requires RA3. The answer is to migrate to RA3 first.
Trap 7 — VACUUM Fully Automatic On RA3
A candidate assumes VACUUM never needs to run on RA3 and stops running it. Right answer: Redshift runs automatic table sort and a background automatic VACUUM DELETE, but these can lag heavy delete and update churn, and VACUUM REINDEX for interleaved sort keys is never automatic. Explicit VACUUM is still occasionally required, as in the commands sketched below.
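The explicit commands involved, shown on the hypothetical tables from earlier sketches:

```sql
-- Reclaim space after a large DELETE without re-sorting the table.
VACUUM DELETE ONLY sales_fact;

-- Re-sort only (rarely needed where automatic table sort keeps up).
VACUUM SORT ONLY sales_fact;

-- Never automatic: required periodically for interleaved sort keys.
VACUUM REINDEX customer_activity;
```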
For dashboards and BI workloads on Redshift, design materialized views with auto-refresh and automatic query rewrite to keep aggregates current without explicit refresh commands and let the planner transparently substitute MVs into matching base-table queries. Auto-refresh runs in the background as underlying tables change, incremental refresh recomputes only affected rows when the MV definition supports it, and automatic query rewrite means analysts can write plain SQL against base tables while the planner uses MVs whenever they accelerate the query. The combination cuts dashboard latency from seconds-to-minutes to sub-second without changing analyst workflow. Avoid SQL constructs that block incremental refresh (OUTER JOIN in some cases, references to other MVs, certain non-deterministic functions) — those force full refresh which can be expensive on large MVs.
Key Numbers And Must-Memorize Redshift Facts
Node Architecture
- Leader node parses, plans, distributes
- Compute nodes hold data and execute
- Slices are processing units within compute nodes
- AUTO distribution lets Redshift decide
RA3 Versus DC2
- RA3: managed storage (RMS, S3-backed), compute and storage scale independently
- DC2: local SSD only, storage tied to node count
- RA3 supports Data Sharing, AQUA, Serverless integration; DC2 does not
- RA3 is the modern default
Redshift Serverless
- RPU (Redshift Processing Unit) capacity
- Workgroup defines compute, namespace defines data
- Auto-scales between min and max RPU
- Pay per RPU-hour, billed per second
Distribution And Sort Keys
- EVEN, KEY, ALL, AUTO distribution styles
- Compound sort key for leading-prefix predicates
- Interleaved sort key for equal-frequency multi-column predicates
- VACUUM REINDEX required on interleaved keys
Materialized Views
- Manual, auto, incremental, full refresh modes
- Automatic query rewrite substitutes MVs transparently
- Outer joins and some functions block incremental refresh
Constraints
- PRIMARY KEY, UNIQUE, FOREIGN KEY are metadata only
- Only NOT NULL is enforced
- Optimizer trusts declared constraints — violations cause wrong results
Spectrum And Federated Query
- Spectrum: query S3 in place via external tables (per-TB pricing)
- Federated query: live joins to RDS and Aurora
- Data Sharing: cross-cluster live access (RA3 only)
DEA-C01 exam priority — Redshift RA3, materialized views, Spectrum, and Serverless all carry weight on the exam. Master the trade-offs, decision boundaries, and the cost/performance triggers each feature exposes — the exam tests scenarios that hinge on knowing which option is the wrong answer, not just which is right.
Key facts to memorize: pin the service-level limits, default behaviors, and pricing boundaries covered above. The exam often tests recall of specific defaults (durations, limits, retry behavior, free-tier boundaries) where memorizing the exact number is the difference between right and 'almost right'.
FAQ — Amazon Redshift RA3 And Materialized Views Top Questions
Q1 — When should I use Redshift RA3 versus Redshift Serverless versus DC2?
Use RA3 provisioned for steady, predictable workloads over a few terabytes — you get the best price-performance at sustained utilization, plus access to features like Data Sharing and AQUA. Use Redshift Serverless for unpredictable or bursty workloads, ad hoc analytics, dev/test environments, or any case where idle time is a significant fraction of total time — Serverless's per-second billing eliminates the cost of paying for idle clusters. Use DC2 only for legacy small-and-stable workloads where the cost saving over RA3 justifies the storage-compute coupling. The decision tree: predictable equals RA3, bursty equals Serverless, legacy small equals DC2. Most new workloads should start on RA3 or Serverless.
Q2 — Why are my Redshift PRIMARY KEY constraints not preventing duplicate inserts?
Because Redshift PRIMARY KEY, UNIQUE, and FOREIGN KEY constraints are metadata only — they are not enforced during INSERT, UPDATE, or COPY operations. Redshift inherits this behavior from its analytics-warehouse heritage where bulk-load throughput is prioritized over row-by-row constraint validation. The constraints serve as query optimizer hints: the planner trusts that declared primary keys are unique and chooses better join plans. If your data violates the constraints, queries return incorrect results because the planner's optimization assumed something false. The fix: enforce constraints in your ETL pipeline (Glue Data Quality rules, deduplication in COPY-then-MERGE patterns, or pre-load validation queries), then declare them in the Redshift schema for the planner. Only NOT NULL is enforced at insert time.
Q3 — When should I use Redshift Spectrum versus loading data with COPY?
Use COPY to load when data is queried frequently, you need the fastest possible query performance, the data joins heavily with cluster-resident tables, or the data fits comfortably in cluster storage. Loaded data benefits from cluster-local SSD, distribution and sort keys, and zone maps for skip-on-read. Use Spectrum to query in place when data is queried infrequently, the data volume is too large to load efficiently, you want to keep cold data on S3 for cost reasons, or the data has a different lifecycle from the cluster (multi-year history while the cluster has 90 days). The hybrid pattern is common: hot 90 days in cluster via COPY, full history on S3 queried via Spectrum, dashboards UNION ALL the two for unified reporting.
Q4 — How do I keep a materialized view current without manual refresh commands?
Configure auto-refresh on the materialized view. Redshift detects when underlying base tables change and refreshes the MV in the background. Combine with automatic query rewrite so user queries against base tables are transparently rewritten to use the MV when matching. The combination eliminates manual refresh entirely: the user writes plain SQL, Redshift uses MVs when faster, and MVs stay current as base data changes. Caveats: auto-refresh runs as background work and may lag heavy write activity by minutes; some SQL constructs in the MV definition block incremental refresh and force full refresh, which can be expensive — review the documented incremental-refresh-supported patterns when designing MVs.
Q5 — How does Redshift Data Sharing differ from UNLOAD plus COPY for replication?
Data Sharing is a live, zero-copy access pattern: the consumer cluster reads directly from the producer cluster's storage, with sub-second freshness, no ETL pipeline. UNLOAD plus COPY is the older replication pattern: producer UNLOADs to S3, consumer COPYs from S3, with significant ETL infrastructure and inherent staleness equal to the pipeline cadence. Data Sharing wins on freshness, operational simplicity, and storage cost (no duplicate). UNLOAD-COPY wins on consumer-side query performance for very hot data (data lives locally on the consumer cluster, no cross-cluster I/O) and on producer-cluster isolation (consumer queries do not touch the producer). For most use cases Data Sharing is the right answer; UNLOAD-COPY is appropriate only when consumer query patterns require local storage. Data Sharing requires RA3 on both producer and consumer.
Q6 — When should I use compound sort keys versus interleaved sort keys?
Use compound sort keys when queries filter on a leading prefix of the sort key — for example, sort by (date, region, product), and queries filter on date alone or date plus region. Compound keys are the default and almost always the right choice for time-series fact tables. Use interleaved sort keys only when queries filter on any of the sort key columns with similar frequency and you cannot determine a clear leading column — for example, a multi-dimensional analytics table where users sometimes filter by region first, sometimes by product first, sometimes by date first. Interleaved keys are expensive to maintain (require periodic VACUUM REINDEX), so the bar to use them is high. The exam plants compound versus interleaved as a query-pattern question — equal-frequency multi-column predicates favor interleaved; otherwise compound.
Q7 — How do I scale Redshift compute independently from storage?
Use RA3 with managed storage (RMS). RA3 nodes have small local SSD that caches hot data, with the bulk of storage on S3-backed RMS. Add or remove RA3 nodes to scale compute without affecting storage capacity; storage scales automatically as data grows without changing node count. Concurrency Scaling adds burst read capacity without resizing the main cluster — extra clusters spin up automatically when query queues form, with the first hour per day per cluster free. Redshift Serverless scales RPUs automatically with workload. The combination of RA3, Concurrency Scaling, and Serverless gives full elasticity in the compute dimension while storage scales transparently. DC2 cannot do this — its storage is tied to node count.
Further Reading — Official AWS Documentation For Redshift
The authoritative AWS sources are the Redshift Cluster Management Guide (architecture, cluster lifecycle), the RA3 Nodes documentation (managed storage, sizing, migration), the Redshift Serverless documentation (workgroups, namespaces, RPU billing), the Redshift Database Developer Guide chapters on materialized views (creation, refresh, automatic query rewrite), Spectrum (external schemas, partitioning, performance), federated query (RDS and Aurora integration), distribution styles, sort keys, data sharing, and constraints.
The AWS Big Data Blog has multiple deep-dive posts on RA3 migration patterns, Serverless cost optimization, materialized view design, and Spectrum-with-Iceberg integration. The AWS Well-Architected Analytics Lens covers warehouse selection criteria and cost optimization strategies. The Redshift Advisor in the console provides automated recommendations for sort keys, distribution keys, and table compression — a useful complement to manual tuning. Finally, the Redshift Query Editor V2 documentation covers the modern SQL workbench that has replaced the legacy Query Editor and integrates with notebooks for analytical workflows.