
Redshift Query Tuning, COPY/UNLOAD, and Operational Commands

4,000 words · ≈ 20 min read

DEA-C01 Domain 3 Tasks 3.1/3.2 Redshift tuning: parallel COPY from S3 (file count + slice rule), manifest, COMPUPDATE/STATUPDATE, UNLOAD PARALLEL ON, VACUUM SORT/DELETE/FULL/REINDEX, ANALYZE, WLM queues, EXPLAIN for skew, sort keys, COPY/UNLOAD/VACUUM operational traps.


Redshift query tuning, COPY, and UNLOAD form the operational backbone of every production Redshift workload, and on the DEA-C01 exam these commands are tested in roughly one out of every five Domain 3 questions. The exam does not just ask "what does COPY do" — it asks scenario questions like "the COPY job from a 10 TB S3 prefix takes eight hours, what changes the most" or "VACUUM has not been run in three months and queries are slow, which VACUUM mode is correct." Community study guides from Tutorials Dojo, Cesar Cordoba's AWS in Plain English series, and Laura Galera's open-source notes all flag the same gap: candidates know the commands exist but cannot pick the right options or sequence under exam scenarios.

This guide is built to put Redshift's operational commands into engineering muscle memory. It covers the COPY command and the file-count-equals-slice-multiple parallelism rule, manifest files for incremental loads, COMPUPDATE and STATUPDATE, the UNLOAD command and PARALLEL ON, VACUUM in its four modes, ANALYZE for planner statistics, automatic versus manual WLM queues, EXPLAIN plan reading for distribution skew and sort key misses, sort key and distribution key choice, Concurrency Scaling for burst read capacity, and the canonical COPY-UNLOAD-VACUUM exam traps that catch most candidates.

COPY — Parallel Bulk Loading From S3

COPY is the single most important command in operational Redshift. Every production data pipeline that lands S3 data in Redshift is COPY-driven.

COPY Syntax And Sources

COPY target_table FROM 's3://bucket/prefix/' IAM_ROLE 'arn:aws:iam::account:role/RedshiftLoad' FORMAT AS PARQUET; — that is the canonical shape. Sources can be S3 (most common), DynamoDB, EMR, or SSH-accessible remote hosts. The IAM role is attached to the cluster and grants S3 read plus optional KMS decrypt; credentials-based COPY is supported but discouraged in production.

Supported Formats

Native formats: CSV, JSON, Avro, ORC, Parquet, plain TEXT. Parquet and ORC are columnar, load fastest, and integrate with Glue Data Catalog. CSV requires DELIMITER and QUOTE options; JSON requires either auto-mapping or a JSONPaths file. Avro is the recommended format when loading from streaming Avro producers because schema evolution semantics line up.

The Parallelism Rule — Files Equal To Slice Count Multiple

Redshift parallelism is bounded by the number of files in the source prefix. The cluster has a fixed total number of slices (compute units) — for example, an 8-node ra3.xlplus cluster has 16 slices total, 2 per node. COPY assigns each input file to exactly one slice. If the source prefix has 16 files, all 16 slices load in parallel. If the source has 1 huge file, only 1 slice loads and 15 slices sit idle — the load becomes 16x slower than necessary. The rule of thumb: split source data into a number of files equal to a multiple of the cluster slice count, ideally with each file in the 1 MB to 1 GB range after compression.
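The slice-multiple rule is simple arithmetic, and a back-of-envelope helper makes it concrete. A hedged Python sketch (the function names and the 256 MB target file size are illustrative assumptions, not AWS tooling):

```python
import math

def copy_parallelism(file_count: int, slice_count: int):
    """Effective COPY parallelism is bounded by min(files, slices)."""
    active = min(file_count, slice_count)
    idle = slice_count - active
    slowdown = slice_count / active  # vs. a fully parallel load
    return active, idle, slowdown

def recommended_file_count(total_bytes: int, slice_count: int,
                           target_file_bytes: int = 256 * 1024**2) -> int:
    """Round the file count up to the next multiple of the slice count."""
    raw = max(1, math.ceil(total_bytes / target_file_bytes))
    return math.ceil(raw / slice_count) * slice_count

# One 100 GB file on a 16-slice cluster: 1 active slice, 15 idle, 16x slower.
print(copy_parallelism(file_count=1, slice_count=16))   # (1, 15, 16.0)
# The same 100 GB split into ~256 MB files, rounded to a slice multiple:
print(recommended_file_count(100 * 1024**3, 16))        # 400
```

The second helper encodes the two halves of the rule at once: files small enough to stream efficiently, and a total count that leaves no slice idle.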

Manifest Files

A manifest is a JSON file listing the exact S3 keys to load — used when the source prefix contains files that should not all be loaded, or when atomically loading a known set of files (avoiding the partial-file race when new files arrive mid-load). Reference: COPY ... FROM 's3://bucket/manifest.json' MANIFEST;. The manifest pattern is the recommended approach for incremental nightly loads because it makes the input set deterministic.
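A manifest is plain JSON: an entries array of url/mandatory pairs. Generating one from a known file list is a few lines — an illustrative Python sketch (the bucket keys are hypothetical):

```python
import json

def build_manifest(keys, mandatory=True):
    """Build a COPY manifest listing the exact S3 objects to load.

    mandatory=True makes COPY fail if an object is missing, which is
    what keeps incremental nightly loads deterministic.
    """
    return json.dumps(
        {"entries": [{"url": k, "mandatory": mandatory} for k in keys]},
        indent=2,
    )

# Hypothetical nightly batch: exactly these two files, regardless of
# whatever else lands in the prefix mid-load.
manifest = build_manifest([
    "s3://bucket/sales/2026-05-01/part-000.parquet",
    "s3://bucket/sales/2026-05-01/part-001.parquet",
])
print(manifest)
```

Upload the result to S3 and point COPY at it with the MANIFEST keyword.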

COMPUPDATE And STATUPDATE

COMPUPDATE ON (default for empty target tables) tells COPY to analyze a sample of the input and choose the optimal compression encoding for each column. After the first load, the encodings are baked in and COMPUPDATE OFF is the right choice for subsequent loads to avoid re-analysis. STATUPDATE ON (default for empty tables) updates table statistics for the query planner during the load. After an initial bulk load, set STATUPDATE OFF for subsequent COPYs and run ANALYZE separately at a controlled time — the in-line statistics update can lengthen COPY duration significantly.

COPY From Other Sources

COPY can read directly from DynamoDB tables (a full-table read, expensive on large tables), from EMR HDFS (for legacy migrations), and from remote hosts over SSH using an SSH manifest. There is no direct Redshift-to-Redshift COPY — the standard pattern is UNLOAD to S3 from the source cluster followed by COPY on the target. For ongoing replication, AWS DMS or Federated Query are usually better than recurring DynamoDB COPY operations.

Always split COPY source data into a number of files equal to a multiple of the cluster slice count before loading large datasets. A single 100 GB file loads on one slice while every other slice idles — a 16-slice cluster loading one file is 16x slower than the same cluster loading 16 evenly-sized files. The remediation: run an upstream Glue ETL or EMR step that writes the source as Parquet partitioned into many output files, or use the MAXFILESIZE parameter on UNLOAD to control output file count when staging to S3 from another Redshift cluster. For DEA-C01 exam scenarios about slow COPY, the answer is rarely "scale the cluster" — the answer is "split the source files." Pair this rule with COMPUPDATE OFF and STATUPDATE OFF on subsequent loads to avoid re-analysis cost.

UNLOAD — Parallel Export To S3

UNLOAD is the inverse of COPY — it exports query results from Redshift to S3 in parallel.

UNLOAD Syntax

UNLOAD ('SELECT * FROM table WHERE date=current_date') TO 's3://bucket/prefix/' IAM_ROLE 'arn:...' FORMAT AS PARQUET PARALLEL ON; — the SELECT runs against the cluster, results are partitioned across slices, and each slice writes its share to S3 as separate files.

PARALLEL ON vs PARALLEL OFF

PARALLEL ON (default) writes one or more files per slice for maximum throughput — fastest for large unloads but produces many output files. PARALLEL OFF writes a single output file (sorted only if the query has an ORDER BY) while all but one slice sit idle — useful only for small extracts where downstream tooling cannot handle multiple files. The exam plants this as a "fastest unload of 1 TB" scenario; PARALLEL ON is the right answer.

Format Options

UNLOAD supports CSV, JSON, Parquet, and TEXT. Parquet is the recommended target format for downstream Athena or Spectrum reads — column pruning and predicate pushdown apply when the unloaded data is queried later. JSON is useful for downstream NoSQL stores; CSV for legacy systems.

MAXFILESIZE And ALLOWOVERWRITE

MAXFILESIZE 100 MB caps each output file at the specified size — the slice writes multiple files if its share exceeds the cap. This is the right control for downstream consumers that prefer a known file size band. ALLOWOVERWRITE permits writing to an S3 prefix that already contains files; without it, UNLOAD fails if the target prefix is non-empty.
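The interaction between PARALLEL ON and MAXFILESIZE determines the output file count. A rough estimate, assuming the export splits evenly across slices (the helper name is made up):

```python
import math

def unload_file_count(total_bytes: int, slice_count: int,
                      maxfilesize_bytes: int) -> int:
    """Estimate UNLOAD output files: each slice writes its share,
    chunked into files no larger than MAXFILESIZE."""
    per_slice = total_bytes / slice_count
    return slice_count * max(1, math.ceil(per_slice / maxfilesize_bytes))

# 1 TiB across 16 slices with MAXFILESIZE 100 MB: 64 GiB per slice,
# chunked into 656 files each -> 10,496 files total.
print(unload_file_count(1024**4, 16, 100 * 1024**2))  # 10496
```

The point of the estimate is planning downstream consumers: a listing of ten thousand files is fine for Athena or Spectrum, painful for a tool that expects one file.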

Encryption On UNLOAD

ENCRYPTED with KMS_KEY_ID writes the output with SSE-KMS using the specified key. The IAM role attached to the cluster must have kms:GenerateDataKey on the key. For compliance-bound workloads this is mandatory; without the option, the output inherits the bucket's default encryption.

VACUUM — Reclaiming Space And Resorting

VACUUM is the maintenance command that reclaims disk space from deleted rows and resorts unsorted regions. The exam tests VACUUM modes and when to use each.

Why VACUUM Is Needed

Redshift is column-oriented and append-mostly. DELETEs and UPDATEs do not physically remove rows — they mark them deleted. Inserts beyond the sorted region accumulate in an unsorted region. Without VACUUM, deleted rows continue to consume disk and the unsorted region grows, slowing query performance because the planner must scan unsorted data on every range query.

The Four Modes

VACUUM SORT ONLY resorts the unsorted region into the sorted region but does not reclaim deleted-row space. Fast when only resorting is needed.

VACUUM DELETE ONLY reclaims space from deleted rows but does not resort. Useful after large DELETEs where the sort order is unaffected.

VACUUM FULL (default if no mode specified) does both — resort and reclaim. Slowest but most thorough; canonical periodic maintenance.

VACUUM REINDEX rebuilds interleaved sort key indexes after large data changes. Required only for tables with INTERLEAVED sort keys; tables with COMPOUND sort keys do not need REINDEX.
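The four modes reduce to a small decision table. A Python study aid mapping a maintenance scenario to a mode (a memorization sketch, not an official decision tree):

```python
def choose_vacuum(needs_resort: bool, needs_reclaim: bool,
                  interleaved_sort_key: bool) -> str:
    """Pick a VACUUM mode from the scenario described in the exam stem."""
    if interleaved_sort_key:
        return "VACUUM REINDEX"      # only INTERLEAVED sort keys need this
    if needs_resort and needs_reclaim:
        return "VACUUM FULL"         # the default: resort + reclaim
    if needs_resort:
        return "VACUUM SORT ONLY"    # unsorted region grew, no big DELETEs
    if needs_reclaim:
        return "VACUUM DELETE ONLY"  # e.g. after a large DELETE
    return "no VACUUM needed"

# Large DELETE on a COMPOUND-sorted table where sort order is unaffected:
print(choose_vacuum(needs_resort=False, needs_reclaim=True,
                    interleaved_sort_key=False))  # VACUUM DELETE ONLY
```

The one branch worth burning in: REINDEX is keyed off the sort key type, not off data volume.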

Automatic VACUUM On RA3

RA3 nodes run automatic VACUUM and ANALYZE in the background — Redshift schedules maintenance during low-load periods. This does not eliminate the need for manual VACUUM, but it reduces the operational burden for typical workloads. DC2 and earlier node types do not run automatic VACUUM and require explicit scheduling. The exam plants this as a "VACUUM has not been run on RA3" scenario; the answer typically involves understanding that RA3 automation reduces but does not eliminate the need.

Boost Mode And Resource Allocation

VACUUM ... BOOST allocates more cluster resources to the VACUUM operation, completing faster at the cost of contention with concurrent queries. Used during off-hours maintenance windows.

VACUUM REINDEX is required only for INTERLEAVED sort keys, not for COMPOUND sort keys, and most Redshift tables should use COMPOUND. Engineers who use INTERLEAVED sort keys without need pay both the slower load cost and the periodic REINDEX cost — INTERLEAVED is correct only when query patterns filter on multiple sort columns with similar selectivity, a narrow use case. The DEA-C01 exam plants this as a sort key choice question or a maintenance question — if the scenario describes equal-weight filtering on three columns and asks about VACUUM, INTERLEAVED plus REINDEX may be correct; if the scenario is a typical date-partitioned fact table queried mostly by date, COMPOUND sort key with no REINDEX is the right answer.

ANALYZE — Updating Planner Statistics

ANALYZE collects column-level statistics that the query planner uses to choose join orders, distribution strategies, and plan shapes. Without current statistics, the planner can choose disastrously wrong plans.

When To ANALYZE

Run ANALYZE after large data changes (more than 10 percent of table rows added or modified), after schema changes, or on a scheduled basis for high-write tables. RA3 nodes auto-analyze in the background but explicit ANALYZE is still appropriate after major loads.

ANALYZE Predicate Columns

ANALYZE table_name PREDICATE COLUMNS analyzes only the columns used in WHERE, JOIN, and GROUP BY clauses — the columns that actually drive plan choice. Faster than full-table ANALYZE and recommended for wide tables.

Statistics Off Threshold

The system view SVV_TABLE_INFO has a stats_off column that reports how stale a table's statistics are, from 0 (current) to 100 (fully stale). Tables with stats_off above 10 are candidates for ANALYZE.
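The stats_off check is easy to script from query results. An illustrative sketch, with hypothetical rows standing in for SVV_TABLE_INFO output:

```python
def tables_needing_analyze(table_info, threshold: float = 10.0):
    """Return tables whose planner statistics are stale, mimicking a
    filter on SVV_TABLE_INFO.stats_off (rows here are made up)."""
    return [t["table"] for t in table_info if t["stats_off"] > threshold]

rows = [
    {"table": "fact_sales", "stats_off": 42.0},  # heavy nightly loads
    {"table": "dim_date",   "stats_off": 0.0},   # static dimension
]
print(tables_needing_analyze(rows))  # ['fact_sales']
```

In production this filter would be a WHERE clause on the system view itself; the Python form just makes the 10-percent threshold explicit.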

WLM — Workload Management

WLM is the queue and concurrency model that isolates query workloads.

Automatic WLM

The default for newly created clusters — Redshift dynamically allocates memory and concurrency based on observed workload. Recommended for almost all workloads; it adapts faster than humans can tune manual queues. The exam asks about Automatic WLM as the recommended baseline.

Manual WLM

Static queue configuration with explicit memory percentage and concurrency per queue. Used when a specific queue must have guaranteed resources — for example, a critical dashboard queue that cannot be starved by ad hoc analyst queries. Manual WLM requires the operator to predict workload mix; misconfiguration causes underutilization or starvation.

Query Priority And Short Query Acceleration

Within Automatic WLM, queries can be assigned priority (HIGHEST, HIGH, NORMAL, LOW, LOWEST). Short Query Acceleration (SQA) is a built-in optimization that fast-tracks queries the planner predicts will run quickly, bypassing queue contention.

Concurrency Scaling

When a queue's concurrency limit is hit, Concurrency Scaling automatically provisions additional cluster capacity to absorb the burst. Read-only queries are eligible. Each cluster gets one free hour of concurrency scaling per 24 hours of main cluster runtime; beyond that the scaling cost is per-second-billed. The exam plants this as a "burst capacity for read queries without resizing" scenario.

Enable Concurrency Scaling on read-heavy workloads to absorb burst load without permanent cluster resize. A typical pattern: weekday business hours see 5x the query rate of nights and weekends. Sizing the main cluster for peak is wasteful (5x the cost for a 5-fold daily load); sizing for average means peak queries queue. Concurrency Scaling solves this — main cluster sized for average, peak burst absorbed by ephemeral scaling clusters at per-second billing. The free credit (one hour per 24 hours of main runtime) often covers normal peak patterns at zero additional cost. Configure scaling on the queue level in WLM. For DEA-C01 exam scenarios about variable read load or "users complain about waiting during business hours," Concurrency Scaling is the right answer; resizing the cluster is a distractor.
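The free-credit arithmetic above is worth internalizing. A minimal sketch, assuming accrued credit is applied before per-second billing starts:

```python
def billable_scaling_seconds(scaling_seconds_used: int,
                             free_credit_seconds: int) -> int:
    """Billable Concurrency Scaling time after the free credit.

    Credit accrues at 1 hour per 24 hours of main-cluster runtime;
    usage beyond the accrued credit is billed per second.
    """
    return max(0, scaling_seconds_used - free_credit_seconds)

# 90 minutes of burst against one day's 1-hour credit: 30 minutes billed.
print(billable_scaling_seconds(90 * 60, 60 * 60))  # 1800
# 30 minutes of burst against the same credit: nothing billed.
print(billable_scaling_seconds(30 * 60, 60 * 60))  # 0
```

This is why the feature so often costs nothing: a daily peak shorter than an hour stays entirely inside the accrued credit.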

EXPLAIN — Reading The Query Plan

EXPLAIN prefixed before a SELECT returns the query plan without executing — the engineer's primary debugging tool.

Plan Operators

Redshift has no secondary indexes, so every table read appears in the plan as a Seq Scan — expected for unfiltered reads. What matters is the filter: a range predicate on a sort-key column lets zone maps skip blocks and makes the scan cheap, while a filter that cannot use the sort key forces a full scan. Hash Join, Merge Join, and Nested Loop are the three join algorithms; a nested loop on large tables is a red flag (often a missing join predicate or stale statistics). DS_BCAST_INNER (broadcast the inner table) and DS_DIST_BOTH (redistribute both tables) on the data motion line indicate that joins are crossing the network because distribution keys do not align — the canonical performance bug; DS_DIST_NONE means the join is co-located.

Distribution Skew

When one slice receives many more rows than its peers, the slice becomes a bottleneck. The system view SVV_TABLE_INFO reports skew_rows (the ratio of the fullest slice to the emptiest) — values above 4.0 indicate significant skew. Caused by DISTSTYLE KEY with a low-cardinality distribution column or many NULL values in the key. Fix: choose a higher-cardinality column, or switch to DISTSTYLE EVEN or DISTSTYLE AUTO.
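The skew ratio itself is trivial arithmetic: fullest slice over emptiest slice. A small sketch with made-up per-slice row counts:

```python
def skew_rows(rows_per_slice):
    """Mimic SVV_TABLE_INFO.skew_rows: ratio of the slice with the most
    rows to the slice with the fewest. Near 1.0 is balanced; a parallel
    scan runs at the speed of the fullest slice."""
    return max(rows_per_slice) / max(1, min(rows_per_slice))

balanced = [1000, 990, 1010, 1005]   # healthy DISTSTYLE EVEN
skewed   = [4000,  100,  120,  110]  # low-cardinality DISTKEY, one hot value
print(round(skew_rows(balanced), 2))  # 1.02
print(round(skew_rows(skewed), 2))    # 40.0
```

The skewed table scans roughly as slowly as if it lived on one slice, which is exactly the symptom the exam scenario describes.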

Sort Key Misses

The plan shows a Seq Scan with a filter that should have been a sort-key range scan. Causes: function applied to the sort column (WHERE DATE_TRUNC('day', ts) = '2026-05-02' does not use the sort key on ts; WHERE ts BETWEEN '2026-05-02' AND '2026-05-03' does), or the column was simply not chosen as a sort key.

Sort Keys And Distribution Keys

These two table-level choices drive query performance more than any other operational setting.

Distribution Styles

DISTSTYLE KEY distributes rows by hash of a column — best for join-heavy workloads where joining tables share the distribution column. DISTSTYLE EVEN round-robins rows across slices — uniform distribution, no co-location benefit. DISTSTYLE ALL replicates the full table on every slice — best for small dimension tables joined to large fact tables. DISTSTYLE AUTO is the default and lets Redshift choose and adjust the style as the table grows; an explicit choice is justified when the join patterns are well understood and stable.

Sort Keys

COMPOUND sort key (default) sorts rows by the listed columns in order — best when queries filter on the leading columns. INTERLEAVED sort key gives equal weight to every column in the key — useful only when queries filter on different sort columns with similar frequency. INTERLEAVED is rarely the right answer because it is slower to load, requires REINDEX, and rarely outperforms COMPOUND in practice.

Redshift Advisor

Redshift Advisor (in the console) automatically detects suboptimal sort key, distribution key, and compression encoding choices and recommends changes based on observed query patterns. A reasonable cadence: run Advisor monthly and apply recommendations on the highest-impact tables.

Non-Enforced Constraints — A Confirmed Trap

Redshift supports PRIMARY KEY, UNIQUE, and FOREIGN KEY syntax, but these constraints are metadata-only — Redshift does not enforce them on insert.

What This Means

A COPY into a table with PRIMARY KEY (id) declared does not reject duplicate id values. The constraint exists for the query planner — knowing that id is unique helps the planner avoid de-duplication steps in some plans. But the data integrity contract is the engineer's responsibility, enforced upstream in ETL.

Why This Trap Recurs

Engineers from PostgreSQL or SQL Server backgrounds assume constraint enforcement is built in. The DEA-C01 exam tests this with scenarios like "an ETL job loads 10 percent duplicates and the dashboard counts are wrong, why" — the answer is that Redshift accepted the duplicates because constraints are not enforced. The fix is upstream deduplication in the ETL or downstream aggregation with COUNT(DISTINCT).
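Since Redshift will not reject the duplicates, the deduplication belongs in ETL code. A Python sketch of the keep-the-latest-row-per-key pattern (mirroring the ROW_NUMBER approach; field names are hypothetical):

```python
def dedupe_latest(rows, key="id", order="load_time"):
    """Keep only the newest row per key — the upstream fix for
    Redshift's non-enforced PRIMARY KEY. Equivalent in spirit to
    ROW_NUMBER() OVER (PARTITION BY key ORDER BY order DESC) = 1."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[order] > latest[k][order]:
            latest[k] = row
    return list(latest.values())

batch = [
    {"id": 1, "load_time": "2026-05-01", "amount": 10},
    {"id": 1, "load_time": "2026-05-02", "amount": 12},  # duplicate id, newer
    {"id": 2, "load_time": "2026-05-01", "amount": 7},
]
print(dedupe_latest(batch))  # two rows; id 1 keeps the newer amount, 12
```

The staging-table variant does the same thing in SQL inside the cluster: COPY into staging, delete matched keys from the target, insert from staging.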

Plain-Language Explanation: Redshift Query Tuning, COPY And UNLOAD

Three concrete analogies make Redshift's operational commands feel intuitive.

Analogy 1 — The Postal Sorting Center

Imagine a postal sorting center with 16 sorting stations (slices) running in parallel. A truck arrives with 10 tons of mail (the COPY source). If the mail is delivered in 16 evenly sorted bags, every station works in parallel and the entire truckload is processed in an hour. If the mail arrives as 1 giant pile, only 1 station can sort it while the other 15 watch — same volume, 16x the time. The COPY parallelism rule is exactly this: split the source into a number of files equal to a multiple of the slice count. Manifest files are the bag tags telling the dispatcher exactly which bags to pull off the truck — preventing accidental double-processing when new bags arrive mid-shift. UNLOAD is the inverse: at end of day the sorting center exports outbound mail in 16 bags so the next truck can carry them efficiently. VACUUM is the night-shift cleaning crew that picks up discarded envelopes (deleted rows), repacks half-full bins (resorting unsorted regions), and prepares the floor for tomorrow. Skip the night cleaning for a month and the floor is cluttered, sorting becomes slow, and the daily throughput drops.

Analogy 2 — The Restaurant Prep Kitchen With Shared Stoves

Picture a restaurant where 16 line cooks share a kitchen (the cluster) and every order goes through the same prep workflow. The dining room manager is the WLM — deciding how many cooks should work on appetizers (queue 1), mains (queue 2), and desserts (queue 3) at any moment. Automatic WLM is a smart manager who watches the queue lengths and reallocates cooks dynamically; manual WLM is a written schedule that fixes "always 4 cooks on mains, 4 on appetizers" regardless of actual demand. Concurrency Scaling is a phone line to the catering company next door — when the dinner rush exceeds capacity, the catering company sends temporary cooks to handle overflow orders, billed by the hour. The first hour a day is free (the credit). EXPLAIN is the menu engineer reading the recipes before the dish is cooked, predicting "this dish requires the saute station and that station is bottlenecked" — letting the chef redesign the recipe before service. Distribution skew is one cook receiving every appetizer order while the other cooks idle; the fix is reassigning customers across cooks (changing the distribution key) or running every cook through every dish in rotation (DISTSTYLE EVEN).

Analogy 3 — The Highway Toll System With Lanes And Maintenance

Imagine a multi-lane highway with toll booths. COPY is the entry ramp — vehicles (rows) flow in from a staging lot (S3) and the parallelism is determined by how many entry lanes the staging lot has open. One huge truck (one big file) blocks 15 of 16 lanes. UNLOAD is the exit ramp — vehicles leave the highway in parallel to a staging lot. VACUUM is the road crew that picks up debris from accidents (deleted rows) and repaints lane lines (resorting). Skip the road crew for a month and traffic slows. ANALYZE is the traffic survey team that counts cars at each intersection — without the survey the planner does not know the traffic patterns and might route trucks through residential side streets (a bad query plan). Sort keys are the highway exit signs — well-placed signs let you take the right exit at speed; missing or hidden signs (sort key missing or function applied) force every vehicle to stop and search. Distribution keys are the home addresses of vehicles — when two related vehicles share an address (same distribution key), they end up at the same exit; when they do not, they cross the highway repeatedly (DS_BCAST_INNER, the cross-network shuffle). Concurrency Scaling is rush-hour express lanes that open dynamically when the main highway is congested, billed per hour of express-lane operation.

Common Exam Traps For Redshift Operations

The DEA-C01 exam plants a stable set of traps. Memorize all five.

Trap 1 — One Huge File COPY

A scenario asks "the COPY of a 10 TB single CSV file is slow." Wrong answer: scale the cluster. Right answer: split the source into many smaller files (16, 32, or more depending on slice count).

Trap 2 — VACUUM REINDEX On Compound Sort Key

A candidate runs VACUUM REINDEX on every table. Wrong — REINDEX is needed only for INTERLEAVED sort keys. On COMPOUND sort keys it is wasted work.

Trap 3 — Constraint Enforcement Assumption

A candidate assumes PRIMARY KEY rejects duplicates on COPY. Wrong — Redshift constraints are metadata only. Deduplication is the ETL's responsibility.

Trap 4 — Manual WLM As Default

A candidate defaults to manual WLM with rigid queue allocations. Wrong for most workloads — Automatic WLM adapts and is the recommended baseline.

Trap 5 — UNLOAD With PARALLEL OFF For Speed

A candidate confuses PARALLEL OFF as faster because "no parallelism overhead." Wrong — PARALLEL OFF uses one slice and is much slower for large unloads. PARALLEL ON is the right choice for throughput.

COPY parallelism equals min(file_count, slice_count). Split source data into a number of files that is a multiple of the cluster slice count, ideally 1 MB to 1 GB compressed per file. UNLOAD with PARALLEL ON is the parallel default; PARALLEL OFF is single-slice and much slower. VACUUM SORT for resort, DELETE for reclaim, FULL for both, REINDEX only for INTERLEAVED sort keys. Constraints are metadata only — not enforced on insert. This is the one paragraph to memorize for every operational Redshift question on DEA-C01. If the scenario word is "slow COPY," answer file splitting. If "slow UNLOAD," PARALLEL ON. If "queries slow over time," VACUUM and ANALYZE. If "duplicates loaded," constraints are not enforced. If "burst read load," Concurrency Scaling. The operational commands are the bedrock; sort key and distribution key tuning is the second layer.
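The scenario-word-to-answer mapping above can be drilled like a flashcard table. A trivial Python study aid (the answer phrasings are mine, condensed from this guide):

```python
# Canonical DEA-C01 cue -> answer pairs for operational Redshift questions.
SCENARIO_ANSWERS = {
    "slow COPY": "split source files into a multiple of the slice count",
    "slow UNLOAD": "PARALLEL ON",
    "queries slow over time": "VACUUM and ANALYZE",
    "duplicates loaded": "constraints are not enforced; dedupe in the ETL",
    "burst read load": "Concurrency Scaling",
}

def exam_answer(scenario: str) -> str:
    """Look up the canonical answer for a scenario keyword."""
    return SCENARIO_ANSWERS.get(scenario, "re-read the scenario")

print(exam_answer("slow COPY"))
print(exam_answer("burst read load"))  # Concurrency Scaling
```

Five pairs, five traps — the same mapping the trap list below walks through one by one.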

Key Numbers And Must-Memorize Redshift Operational Facts

COPY

  • File count should be a multiple of cluster slice count
  • File size: 1 MB to 1 GB compressed each
  • Formats: CSV, JSON, Avro, ORC, Parquet, TEXT
  • COMPUPDATE and STATUPDATE default ON for empty tables, set OFF for subsequent loads
  • IAM_ROLE attached to cluster for S3 read

UNLOAD

  • PARALLEL ON default — fastest for large exports
  • PARALLEL OFF — single output file written by one slice, slow
  • MAXFILESIZE caps per-file size
  • ALLOWOVERWRITE for non-empty target prefixes
  • ENCRYPTED for SSE-KMS output

VACUUM

  • SORT ONLY: resort only
  • DELETE ONLY: reclaim only
  • FULL: resort plus reclaim (default)
  • REINDEX: INTERLEAVED sort keys only
  • BOOST allocates more cluster resources
  • RA3 runs automatic VACUUM in background

ANALYZE

  • Updates planner statistics
  • Run after large changes (more than 10 percent of rows)
  • PREDICATE COLUMNS variant analyzes only WHERE/JOIN/GROUP BY columns
  • RA3 auto-analyzes

WLM

  • Automatic WLM is recommended default
  • Manual WLM for guaranteed-resource queues
  • Query priority HIGHEST to LOWEST
  • Short Query Acceleration fast-tracks predicted-short queries
  • Concurrency Scaling absorbs burst read load with one free hour per 24 hours

Constraints

  • PRIMARY KEY, UNIQUE, FOREIGN KEY are metadata only
  • NOT enforced on insert
  • Used by query planner for plan optimization
  • Data integrity is ETL's responsibility

DEA-C01 exam priority — Redshift Query Tuning, COPY and UNLOAD. This topic carries weight on the DEA-C01 exam. Master the trade-offs, decision boundaries, and the cost/performance triggers each AWS service exposes — the exam will test scenarios that hinge on knowing which service is the wrong answer, not just which is right.


FAQ — Redshift Query Tuning, COPY, And UNLOAD Top Questions

Q1 — What is the single biggest factor in COPY performance from S3?

The number of input files relative to the cluster slice count. Each file is processed by exactly one slice; if the file count is less than the slice count, slices sit idle. The recommendation: split source data into a number of files equal to a multiple of the slice count, with each file in the 1 MB to 1 GB range after compression. For an 8-node ra3.xlplus cluster (16 slices), split into 16, 32, or 48 files. Combine with COMPUPDATE OFF and STATUPDATE OFF for subsequent loads after the initial load has set encodings, and run ANALYZE separately at a controlled time. The DEA-C01 exam tests this directly with "COPY is slow, what changes performance the most" — the answer is file splitting, not cluster scaling.

Q2 — How do I choose between COMPOUND and INTERLEAVED sort keys?

Use COMPOUND for almost all production tables. COMPOUND sorts by the listed columns in order — perfect when queries filter or join on the first one or two columns (the typical pattern for date-partitioned fact tables). Use INTERLEAVED only when queries filter on multiple sort columns with similar selectivity and similar frequency, and when the dataset is large enough that the loading overhead is justified. INTERLEAVED requires periodic VACUUM REINDEX which COMPOUND does not, and INTERLEAVED loads slower because Redshift must compute the interleaved order. In practice, less than 10 percent of production tables benefit from INTERLEAVED. The DEA-C01 exam tests this with sort key choice scenarios — if the workload describes "filtering primarily on date," COMPOUND on date is the answer.

Q3 — When and how should I run VACUUM on RA3 nodes?

RA3 runs automatic VACUUM and ANALYZE in the background, scheduled during low-load periods, which handles most maintenance for typical workloads. Manual VACUUM is still appropriate after very large data changes (a large bulk DELETE, a large UPDATE, or a large INSERT into a sorted table), or when SVV_TABLE_INFO shows a large unsorted percentage or a high deleted-row count. For typical workloads on RA3, scheduling a VACUUM FULL during a maintenance window once a week or once a month is reasonable belt-and-suspenders coverage. For tables with INTERLEAVED sort keys, schedule VACUUM REINDEX after large changes — automatic VACUUM does not handle interleaved REINDEX. On non-RA3 node types (DC2 and earlier), automatic VACUUM does not run and explicit scheduling is required.

Q4 — How do I diagnose a slow query in Redshift?

Start with EXPLAIN on the slow query. Look for: full-table Seq Scans where a sort-key range restriction was expected (sort key miss), DS_BCAST_INNER or DS_DIST_BOTH on the data motion line (distribution key mismatch causing network shuffle), nested loop joins on large tables (the planner picked the wrong join algorithm, often due to stale statistics), and high cost values on operators that should be cheap. Cross-reference with STL_QUERY for actual execution time and SVL_QUERY_REPORT for per-step duration. If statistics look stale, run ANALYZE with PREDICATE COLUMNS. If distribution skew is visible in SVV_TABLE_INFO, change the distribution key or move to DISTSTYLE EVEN. If the workload is concurrency-bound, enable Concurrency Scaling. The DEA-C01 exam tests these diagnostic steps with EXPLAIN-output snippets in the question stem.

Q5 — What is the difference between Automatic WLM and Manual WLM, and which should I use?

Automatic WLM dynamically allocates memory and concurrency across queues based on observed workload patterns; Manual WLM uses static memory percentages and concurrency limits per queue. Automatic WLM is the recommended baseline for almost all clusters because it adapts faster than humans can re-tune manual queues, especially for variable workloads. Use Manual WLM only when a specific queue requires guaranteed resources — for example, a critical executive dashboard queue that cannot be starved by ad hoc analyst queries — or when chargeback requires deterministic per-queue resource allocation. Within either model, enable Short Query Acceleration to fast-track predicted-short queries and Concurrency Scaling to absorb read burst load. The DEA-C01 exam tests this with "the team has unpredictable workload mix" scenarios where Automatic WLM is the right answer.

Q6 — Why does my COPY load duplicate rows even though the table has a PRIMARY KEY?

Because Redshift does not enforce PRIMARY KEY, UNIQUE, or FOREIGN KEY constraints — they are metadata only. The query planner uses constraint metadata to skip de-duplication in some plans (a planner optimization), but inserts and COPYs do not check the constraint. Data integrity is the ETL's responsibility. Common patterns: deduplicate upstream in the ETL with ROW_NUMBER() OVER (PARTITION BY id ORDER BY load_time DESC) = 1 to keep only the latest row, or use a staging-table-plus-merge pattern (COPY into staging, DELETE matched rows from target, INSERT from staging). The DEA-C01 exam plants this with "duplicate rows in dashboard counts" or "primary key is declared but duplicates loaded" scenarios.

Q7 — When should I use Concurrency Scaling?

Enable Concurrency Scaling on workloads where read query volume varies significantly across the day or week — typical patterns include dashboards that spike during business hours, batch reporting that runs nightly, or analyst-driven exploration that bursts unpredictably. Concurrency Scaling automatically provisions additional cluster capacity when the main cluster's queues fill, charges per second when scaling clusters are active, and provides one free hour of scaling per 24 hours of main cluster runtime. Read-only queries are eligible. Configure on the WLM queue level. For DEA-C01 exam scenarios about "users complain of waiting during peak" or "main cluster CPU is fine but queries queue," Concurrency Scaling is the right answer; permanent cluster resize is a distractor that pays for peak capacity 24/7 when the actual peak is a few hours.

Further Reading — Official AWS Documentation For Redshift Operations

The authoritative AWS sources are the Redshift Database Developer Guide (COPY, UNLOAD, VACUUM, ANALYZE references), the Redshift Cluster Management Guide (WLM, Concurrency Scaling, RA3 automatic maintenance), and the Redshift Best Practices documentation (sort key, distribution key, file sizing). The AWS Big Data Blog has multiple deep-dive posts on COPY parallelism, EXPLAIN plan reading, and skew diagnosis. The AWS Well-Architected Data Analytics Lens covers Redshift in the analytics phase. The AWS Samples GitHub repository contains end-to-end pipelines showing COPY-from-S3 with manifest files and UNLOAD-to-Parquet patterns. Finally, the Skill Builder DEA-C01 Exam Prep Standard Course has dedicated modules on Redshift operations that walk through the canonical exam traps in scenario form.
