Amazon Athena is the serverless interactive SQL service that AWS data engineers reach for whenever they need ad hoc analysis directly on S3 data, and on the DEA-C01 exam Athena queries and workgroups are tested in roughly one out of every six Domain 3 questions. The trap is rarely about whether to use Athena; the trap is almost always about cost. Athena bills per terabyte scanned, and a careless engineer who forgets to partition or pick a columnar format can end up with a single dashboard whose queries scan ten terabytes per refresh and add up to a five-figure monthly bill. Community study guides from Tutorials Dojo, ExamCert.App, and Camille Chang's dev.to write-up all flag the same pattern: candidates know Athena exists, but they cannot answer the scenario question that asks "Athena cost is exploding, what do you change first" with the right ordered list.
This guide is built to put Athena queries and workgroups into operational muscle memory. It covers what Athena is, the per-TB scan pricing model that drives every architectural decision, the cost trifecta of partitioning plus columnar format plus compression, partition projection versus MSCK REPAIR TABLE, workgroups for query isolation and data scan limits, CTAS for materializing intermediate results, federated queries through Lambda connectors, Athena Notebooks for Spark, and the canonical Athena versus S3 Select versus Redshift Spectrum exam traps that hide in scenario questions.
What Is Amazon Athena?
Amazon Athena is a fully managed, serverless interactive query service that runs SQL against data sitting in S3 (and via federated query, against many other sources). Under the hood Athena uses an open-source distributed SQL engine — Trino for engine version 3 (formerly Presto for engine version 2) — that AWS operates on your behalf. You write standard SQL, AWS allocates compute, runs the query, returns results to an S3 location, and bills you for the bytes scanned. There is no cluster to provision, no nodes to size, no software to patch.
Why Serverless SQL On S3 Matters
Before Athena, querying S3 data meant spinning up an EMR cluster, loading the data into Redshift, or running a Glue job to extract a sample. Each path took minutes at best, and most required engineers who knew Hadoop or Spark. Athena collapses the entire pattern into a SELECT statement against an S3 prefix that has been registered as a table in the Glue Data Catalog. The serverless model means a single ad hoc query has no minimum cost beyond the bytes scanned, which is what makes Athena the canonical tool for one-off investigations, dashboard backends, and pipeline validation.
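The entire workflow fits in two statements. A minimal sketch, assuming a hypothetical orders CSV prefix and schema: the DDL registers the prefix as a Glue Data Catalog table, and the SELECT is the ad hoc query itself.

```sql
-- Register an existing S3 prefix of CSV files as a Glue Data Catalog table
-- (bucket, columns, and header handling are illustrative)
CREATE EXTERNAL TABLE orders (
  order_id    string,
  customer_id string,
  order_total double,
  order_date  string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/raw/orders/'
TBLPROPERTIES ('skip.header.line.count' = '1');

-- Ad hoc analysis with no cluster and no load step; the bill is the bytes scanned
SELECT customer_id, SUM(order_total) AS lifetime_value
FROM orders
GROUP BY customer_id
ORDER BY lifetime_value DESC
LIMIT 10;
```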
Trino Engine Versions
Athena exposes two engine versions. Engine version 2 is based on Presto and is the legacy default for old workgroups; engine version 3 is based on Trino, supports more SQL functions, has better performance on Iceberg tables, and is required for newer features like SQL UDFs in Lambda. New workgroups default to engine version 3. The exam expects you to know the difference exists; specific function-level differences are out of scope.
Athena Pricing — Per-TB Scanned
Every architectural decision around Athena queries and workgroups traces back to one fact: AWS bills you for the data Athena scans, not the data it returns.
The Pricing Formula
Athena charges per terabyte scanned, with the bytes-scanned counter rounded up to the nearest megabyte and a 10 MB minimum per query. At the commonly cited list price of 5 dollars per TB scanned (most regions), a query that scans 200 GB bills roughly one dollar regardless of what it returns. A query that returns a single row but scans a 100 GB table costs the same as a query that returns 100 GB. Cancelled queries still bill for what was scanned before cancellation. There is no per-query base fee beyond the 10 MB minimum.
What "Scanned" Means
Athena reads from S3 the file ranges it needs to satisfy the query. With raw CSV, "needed" means the entire file because CSV cannot be range-read. With Parquet, "needed" means only the column files referenced in the SELECT clause, only the row groups whose statistics indicate matching rows, and only the partitions whose Hive prefixes match the WHERE clause. The same logical query against CSV versus Parquet on the same dataset can scan ten times more bytes against CSV — and pay ten times more. This is the core reason DEA-C01 candidates must know columnar formats cold.
Compressed File Sizes
Athena bills on the compressed bytes it reads, not the uncompressed logical size. A 1 TB CSV that compresses to 200 GB with gzip costs 0.2 TB to scan, not 1 TB. Parquet with Snappy compression typically delivers 5x to 10x compression on real-world data, compounding with column pruning and predicate pushdown to deliver the dramatic cost reductions that make Athena economical at scale.
Athena bills per terabyte scanned with a 10 MB per-query minimum, where scanned means the compressed bytes that Athena actually reads from S3 to satisfy the query. Engineers who think Athena bills on rows returned, on query duration, or on uncompressed logical size will mis-estimate by an order of magnitude. The DEA-C01 exam plants this as both a cost-calculation question and an architecture-recommendation question: "Athena costs are increasing and the data volume is growing, what action reduces cost the most?" The answer is always the same triad: partition the data, convert to Parquet, and apply Snappy or ZSTD compression. None of these change the data's content; all three reduce the bytes scanned.
The Cost Trifecta — Partitioning, Columnar Format, Compression
Three optimizations multiply together to deliver the Athena cost reductions that make the service economically viable in production.
Partitioning — Hive-Style Layout
Athena understands Hive-style partition layouts where folder prefixes encode partition column values, like s3://bucket/orders/year=2026/month=05/day=02/. When a query has WHERE year=2026 AND month=05 AND day=02, Athena reads only the matching partition prefix and skips every other day, week, and year. A typical date-partitioned dataset can deliver 99 percent scan reduction on a single-day query against a multi-year history.
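A minimal sketch of the layout in DDL and the pruned query, assuming the hypothetical orders table and bucket from the path above (partition columns are declared as strings so the values match the zero-padded prefixes):

```sql
-- External table over the Hive-style layout s3://bucket/orders/year=YYYY/month=MM/day=DD/
CREATE EXTERNAL TABLE orders_partitioned (
  order_id    string,
  customer_id string,
  order_total double
)
PARTITIONED BY (year string, month string, day string)
STORED AS PARQUET
LOCATION 's3://bucket/orders/';

-- The partition predicates let Athena read only the single matching day prefix
-- and skip every other day, month, and year in the table
SELECT customer_id, order_total
FROM orders_partitioned
WHERE year = '2026' AND month = '05' AND day = '02';
```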
Columnar Format — Parquet Or ORC
Parquet and ORC store data column by column rather than row by row. A SELECT customer_id, order_total FROM orders against Parquet reads only the customer_id and order_total column chunks — it does not read product_description, shipping_address, or any of the dozens of other columns. Against CSV the same query reads every byte of every row. Parquet plus column pruning typically delivers 5x to 20x scan reduction depending on column count.
Compression — Snappy, ZSTD, GZIP
Within a Parquet file, individual column chunks are compressed. Snappy is the default and is fast to decompress; ZSTD compresses tighter at slightly higher CPU cost; GZIP is the highest compression ratio but slowest to decompress. For Athena, ZSTD is increasingly the recommended choice because Athena's compute is amortized in the per-TB price and ZSTD reduces the bytes-read counter that drives the bill.
Combining All Three
A real-world production transform: a 10 TB CSV dataset partitioned by date, converted to Parquet with Snappy, queried with WHERE date='2026-05-02' AND customer_segment='premium' selecting three columns. Naive scan: 10 TB. Partitioned scan: 30 GB (one day). Plus columnar pruning: 6 GB (three of fifteen columns). Plus Snappy compression: 1.2 GB. From 10 TB to 1.2 GB is roughly 8000x reduction — the cost goes from fifty dollars per query to less than one cent per query.
Partition Projection vs MSCK REPAIR TABLE
Both are mechanisms for telling Athena what partitions exist; they have very different operational characteristics.
MSCK REPAIR TABLE
MSCK REPAIR TABLE is the legacy SQL command that scans the S3 prefix tree, finds Hive-style partition folders, and registers them as partitions in the Glue Data Catalog. It works but it is slow — the scan time grows linearly with partition count, and very-high-cardinality datasets (hourly partitions for several years) can take minutes per repair. It must be re-run every time new partitions arrive. Glue crawlers are the managed alternative but suffer the same scan-and-register pattern.
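Both the repair command and its manual alternative are shown below with illustrative table and partition values; the cost of MSCK REPAIR TABLE is not in the syntax but in the full prefix scan it triggers on every run.

```sql
-- Re-scan the table's S3 prefix tree and register any new Hive-style partitions found
MSCK REPAIR TABLE orders_partitioned;

-- Manual alternative: register one known partition directly, with no prefix scan
ALTER TABLE orders_partitioned ADD IF NOT EXISTS
  PARTITION (year = '2026', month = '05', day = '02')
  LOCATION 's3://bucket/orders/year=2026/month=05/day=02/';
```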
Partition Projection
Partition projection eliminates the catalog round trip entirely. Instead of asking the catalog "what partitions exist for this table," Athena computes the partition values from a configuration declared on the table: integer ranges, date ranges, enumerated values, or injected values from the WHERE clause. A query against a 10-year hourly-partitioned table with traditional partitions makes thousands of catalog calls; with partition projection it makes zero. Partition projection is configured by table properties like projection.enabled=true, projection.year.type=integer, projection.year.range=2020,2030.
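A sketch of the table properties, assuming a hypothetical events table partitioned by a dt date string; the same projection.* property pattern covers the integer year example above.

```sql
-- Date-partitioned table using partition projection: partition values are computed
-- from the TBLPROPERTIES at query time, so no catalog partition entries exist and
-- MSCK REPAIR TABLE is never needed
CREATE EXTERNAL TABLE events (
  event_id string,
  payload  string
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://bucket/events/'
TBLPROPERTIES (
  'projection.enabled'        = 'true',
  'projection.dt.type'        = 'date',
  'projection.dt.range'       = '2020-01-01,NOW',
  'projection.dt.format'      = 'yyyy-MM-dd',
  'storage.location.template' = 's3://bucket/events/dt=${dt}/'
);
```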
When To Choose Which
Partition projection is the right answer for any table where partition values follow a predictable pattern (date ranges, hourly buckets, region codes, tenant IDs from a known set). MSCK REPAIR TABLE or Glue crawlers are necessary when partition values are unpredictable or arrive sporadically. The exam plants this as a performance trap: "the team runs MSCK REPAIR TABLE hourly and queries are slow," answer is partition projection.
MSCK REPAIR TABLE is correct but expensive at scale, and partition projection makes it unnecessary for predictable partition layouts. Engineers who default to MSCK REPAIR TABLE on every query or on a schedule pay a hidden cost: the command itself takes time to run, and the catalog fills with thousands of partition entries that slow every subsequent query plan. The DEA-C01 exam plants this with phrases like "the table has years of hourly partitions and queries are slow to plan." If the answer choices include partition projection, that is correct; choices that suggest scaling up the catalog or running MSCK REPAIR TABLE more often are distractors. Convert any production table with predictable partition values to partition projection on day one.
Athena Workgroups — Query Isolation And Cost Control
Workgroups are the multi-tenant control plane for Athena. Every query runs inside a workgroup, and workgroups are how data engineering teams enforce cost limits, isolate billing, and apply governance.
Per-Query And Per-Workgroup Data Limits
A workgroup can declare a per-query data scan limit, which cancels any single query that would scan more than the limit, and workgroup-wide data usage controls per time window, which trigger a CloudWatch alarm when the workgroup has scanned more than X TB this hour or this day so the workgroup can be alerted on or disabled. These limits are the primary brake against runaway costs from poorly written queries or analyst sandboxes.
Result Location And Encryption
Each workgroup specifies an S3 location where query results are written. Workgroups can enforce that this location is the only allowed result location (preventing analysts from leaking results to public buckets) and can require server-side encryption of results with SSE-S3, SSE-KMS, or CSE-KMS. When the workgroup is configured to override client-side settings, the workgroup-level result location and encryption always win, which is the right enforcement model for compliance.
IAM-Based Access
A workgroup is an IAM resource. Policies grant athena:StartQueryExecution on a specific workgroup ARN, letting you assign the marketing team to its own workgroup, the data engineering team to another, and the finance team to a third — with separate cost tracking, separate scan limits, and separate result locations.
Query Engine Version Pinning
A workgroup pins its engine version. If your dashboards depend on engine version 2 query semantics, set the dashboard workgroup to engine version 2 and let the engineering workgroup use engine version 3 for new development.
Always create separate Athena workgroups per team or per use case in production; never let everyone share the default workgroup. Workgroups are how you scope cost limits, billing tags, result encryption, and IAM access in Athena. The default workgroup has no scan limits and no enforced encryption, which is the wrong baseline for any production environment. The recommended pattern: a data-engineering workgroup with high scan limits and its own result location, a bi-dashboards workgroup with moderate limits and engine version pinning for stability, and an analysts-sandbox workgroup with a tight per-query limit (say, 10 GB) to stop accidental full-table scans from junior analysts. Tag each workgroup for cost allocation. The DEA-C01 exam tests this with scenarios about runaway costs, isolating compliance teams, or separating production from development.
CTAS — CREATE TABLE AS SELECT
CTAS is Athena's mechanism for materializing query results as a new table.
What CTAS Does
CREATE TABLE new_table WITH (format='PARQUET', external_location='s3://bucket/path/') AS SELECT ... runs the SELECT, writes the results to S3 in Parquet (or ORC, JSON, TEXTFILE, AVRO) at the specified location, and registers the new table in the Glue Data Catalog. Subsequent queries against new_table read the materialized data instead of recomputing from sources.
CTAS For Pipeline Stages
The canonical pipeline pattern: raw CSV ingestion to S3, then a CTAS that converts CSV to partitioned Parquet, registering the curated table for downstream queries. Every analyst query hits Parquet, not CSV — the conversion is paid once at CTAS time and amortized over thousands of subsequent queries.
CTAS With Partitioning And Bucketing
CTAS supports partitioned_by to write Hive-style partitions on the output and bucketed_by to bucket files within partitions for join optimization. A CTAS that writes partitioned Parquet from a single-file CSV is the simplest possible Athena ETL pipeline.
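A sketch of that conversion, assuming the hypothetical raw_orders CSV table is already registered in the catalog; the WITH properties shown are standard CTAS options.

```sql
-- One-time conversion of raw CSV into partitioned, bucketed, Snappy-compressed Parquet
CREATE TABLE curated_orders
WITH (
  format            = 'PARQUET',
  write_compression = 'SNAPPY',
  external_location = 's3://bucket/curated/orders/',
  partitioned_by    = ARRAY['year', 'month', 'day'],
  bucketed_by       = ARRAY['customer_id'],
  bucket_count      = 16
)
AS
SELECT
  order_id,
  customer_id,
  order_total,
  year,   -- partition columns must be the last columns in the SELECT list
  month,
  day
FROM raw_orders;
```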
Limits And Caveats
CTAS writes at most 100 partitions per query; this is a hard cap, not an adjustable quota. INSERT INTO statements (each also limited to 100 partitions) can append the remaining partitions after the initial CTAS, and partition projection on the destination table is recommended once partition counts grow.
Federated Queries — Athena Beyond S3
Athena's federated query feature uses Lambda connectors to read non-S3 sources within the same SQL.
How It Works
You install a connector, a Lambda function published in the AWS Serverless Application Repository, that knows how to read DynamoDB, RDS PostgreSQL, RDS MySQL, Aurora, Redshift, ElastiCache, DocumentDB, OpenSearch, or any other source for which AWS or the community has published a connector. Athena registers the connector as a "data source" in the catalog. Queries can then JOIN an S3 table to a DynamoDB table to a Postgres table in one SELECT.
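A sketch of the cross-source syntax, assuming a DynamoDB connector registered under the data source name ddb (the catalog name is whatever was chosen when the connector was deployed, and the tables are illustrative):

```sql
-- Join an S3-backed Glue Catalog table to a DynamoDB table exposed by a Lambda connector
SELECT o.order_id,
       o.order_total,
       c.loyalty_tier
FROM   curated_orders o
JOIN   "ddb"."default"."customer_profile" c
  ON   o.customer_id = c.customer_id
WHERE  o.year = '2026' AND o.month = '05';
```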
When To Use Federated Queries
Federated queries are correct when you need a one-off cross-source join and ETLing the data into S3 first is overkill. They are wrong when the same join runs repeatedly — the per-query connector latency and Lambda cost compound, and the right answer becomes a Glue or DMS job that lands the source into S3 on a schedule.
Connector Pricing
Federated queries cost per-TB-scanned plus the Lambda invocation cost of the connector. For very small auxiliary tables (a Postgres lookup table of fifty country codes), the cost is trivial; for large source tables the Lambda execution time and memory dominate the cost.
Athena Notebooks — Spark For Interactive Analysis
Athena Notebooks is the Apache Spark interactive notebook environment built into the Athena console.
What It Is
A Jupyter-style notebook backed by Athena-managed Spark workers. You write PySpark, Spark SQL, or Spark Scala in cells, and the notebook executes against a Spark session that reads from the Glue Data Catalog and S3. The session auto-scales workers based on workload.
When To Use Notebooks Over SQL
SQL is sufficient for 90 percent of Athena workloads. Notebooks are correct when the analysis requires Python libraries (pandas, NumPy, scikit-learn for descriptive stats), iterative DataFrame operations, or large-table joins that benefit from Spark's distributed shuffle. Notebook sessions are billed per DPU-hour, similar to Glue Spark jobs.
Athena Notebooks vs Glue Notebooks vs SageMaker Studio
Athena Notebooks are tightly coupled to the Glue Data Catalog and are the path of least resistance for an analyst already in the Athena console. Glue Notebooks are part of the Glue ETL development experience and are the right answer when the notebook prototype will be saved as a Glue ETL job. SageMaker Studio Notebooks are the right answer when ML workflow steps are part of the analysis. The exam asks about the first two; the third is more an MLA-C01 topic.
Athena vs S3 Select vs Redshift Spectrum
Three services can query S3 data with SQL semantics. Knowing when each applies is the highest-yield decision tree on DEA-C01.
Athena
Best for: ad hoc SQL across one or many S3 objects, joins across tables, complex aggregations, federated queries to non-S3 sources. Pay model: per-TB scanned.
S3 Select
Best for: filtering rows or columns from a single S3 object before retrieval. Use when an application needs to read part of a CSV or Parquet file and wants to push the filter down to S3 to save bandwidth. Cannot join, cannot span multiple objects, and supports only a minimal SQL subset (simple aggregates over the one object, no GROUP BY, no UNION). Pay model: per-GB returned plus per-GB scanned, a smaller per-byte cost than Athena for the single-object filter case.
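For contrast, the SQL that S3 Select accepts is deliberately narrow: it addresses a single object as S3Object and is passed to the S3 SelectObjectContent API rather than run from the Athena console. A sketch, assuming a CSV object with a header row and illustrative column names:

```sql
-- S3 Select expression: filter rows and project columns from one CSV object
-- before any bytes leave S3 (no joins, nothing across multiple objects)
SELECT s.customer_id, s.order_total
FROM S3Object s
WHERE s.customer_segment = 'premium'
```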
Redshift Spectrum
Best for: querying S3 data from a Redshift cluster, especially when the same query needs to join S3 data with cluster-resident tables. Useful for data lake plus data warehouse hybrid patterns. Pay model: per-TB scanned (similar to Athena) plus the underlying cluster cost.
Decision Rule
If the workload is a single-object filter from inside an application, S3 Select. If the workload is interactive ad hoc SQL or a dashboard backed by S3-only data, Athena. If the workload is a Redshift-based analytics platform that needs occasional joins to data lake S3, Redshift Spectrum. The exam plants confusion by describing scenarios where two services could work and only one is best — read carefully for "from within an application" (S3 Select), "ad hoc analyst" (Athena), "from Redshift" (Spectrum).
Plain-Language Explanation: Amazon Athena Queries And Workgroups
Three concrete analogies make Athena's cost model and workgroup discipline intuitive.
Analogy 1 — The Restaurant That Charges By Walk-Through
Imagine a warehouse-sized buffet restaurant where the bill is not based on what you eat but on how much of the buffet floor your serving spoon traverses. If the chicken curry station is in aisle 3 and you walk straight there, grab a serving, and leave, you pay a small fee. If you wander through every aisle scanning what is available, you pay for the whole walk even if you eat the same single serving. Athena is this restaurant — the bill is the "walk-through" (bytes scanned from S3), not the food on the plate (rows returned). Partitioning is putting up signs that route you straight to aisle 3 without traversing the others. Columnar format is breaking the buffet into single-dish stations so picking chicken does not require walking past the desserts. Compression is shrinking the aisles. The trifecta of partitioning plus Parquet plus compression is the maitre d' walking you straight to your dish in five steps instead of letting you wander for an hour. Workgroups are the ticket booths at the door that say "the engineering team can spend up to 10 TB of walking per day, the analyst team up to 100 GB" — preventing one diner from wandering the entire warehouse and bankrupting the restaurant.
Analogy 2 — The Library With An Index Card Catalog
Picture an old-fashioned library with millions of books arranged in stacks. Without an index card catalog, finding "all books published in 2025 about machine learning" means walking every aisle and reading every spine — slow, exhausting, expensive in librarian time. The card catalog (the Glue Data Catalog) tells the librarian which aisle and shelf each book is on, but the catalog itself takes time to maintain — every new book needs a card filed (MSCK REPAIR TABLE re-scans the stacks to file new cards, slow at scale). Partition projection is replacing the card catalog with a labeling rule: "every book is shelved by year then by topic, no card needed, the rule tells you the location." Athena with partition projection on a date-partitioned table goes straight to year=2025 month=05 without consulting any catalog. Workgroups are the library memberships — the casual-reader card lets you check out 10 books a month, the researcher card 1000, the school-group card has a strict cap so 30 students cannot accidentally drain the library budget on a single field trip.
Analogy 3 — The Toll Highway With An EZ-Pass And Express Lanes
Athena is a toll highway and bytes scanned is the distance driven. CSV is the slow rural backroad — every mile is a mile of toll because the car has to stop at every house to check the address. Parquet is the express toll lane — at every exit there is a sign telling you whether the answer is in this section, and you can skip whole stretches you do not need. Partitioning is multiple highways, one per state, so a query scoped to California does not drive through Texas. Compression is shrinking the road itself so the same destination is half the toll distance. Workgroups are the EZ-Pass account tiers — the personal account can rack up 100 dollars in tolls before triggering an alert, the corporate fleet account 10000, and the rental car account zero (cancel at first toll attempt). CTAS is the company that builds a private express road from your warehouse to your most-frequent destination, paying the construction cost once and never paying tolls again on that route. Federated queries are the side roads — useful for an occasional detour into Postgres or DynamoDB, expensive if you take them every day. The DEA-C01 exam asks you to read a scenario, identify which lane the workload should be on, and pick the answer that minimizes the toll.
Common Exam Traps For Athena Queries And Workgroups
The DEA-C01 exam plants a stable set of traps. Memorize all five.
Trap 1 — Athena For Real-Time
A scenario asks for sub-second query response on streaming data. Wrong answer: Athena. Right answer: depending on the rest of the scenario, Kinesis Data Analytics for Apache Flink, OpenSearch, or a real-time database. Athena queries take seconds at minimum and scale to many seconds for large scans.
Trap 2 — Athena Charging On Result Size
A candidate calculates Athena cost from SELECT COUNT(*) returning a single row and assumes near-zero cost. Wrong. Unless the file format's metadata lets Athena avoid reading data (as Parquet row-group statistics can), the query still scans every file in the table. The scan, not the result, is the bill.
Trap 3 — MSCK REPAIR TABLE As The Only Partition Mechanism
A scenario describes high partition count and slow query planning. Candidates trained only on legacy patterns suggest running MSCK REPAIR TABLE more often. Wrong — partition projection is the right answer.
Trap 4 — Federated Query For High-Volume Joins
A scenario has Athena joining S3 to a 1 TB Postgres table on every dashboard refresh. Federated query technically works but the per-query Lambda cost and latency are wrong. Right answer: Glue or DMS into S3, then Athena queries S3-resident copies.
Trap 5 — Default Workgroup In Production
A scenario describes runaway costs and asks for the next step. Wrong answer: educate analysts to write better queries. Right answer: create per-team workgroups with per-query and per-workgroup data scan limits, plus tag-based cost allocation.
Use CTAS to convert CSV ingestion zones to partitioned Parquet curated zones once, and let every downstream Athena query benefit forever. The pattern: an ingestion pipeline lands raw CSV in s3://bucket/raw/dataset/, a daily Glue or Athena CTAS job converts the previous day's CSV to partitioned Snappy-Parquet at s3://bucket/curated/dataset/year=YYYY/month=MM/day=DD/, and analyst queries hit only the curated zone. The CTAS run costs Athena scan dollars once per day; analyst queries pay 5x to 20x less per query forever. The exam asks "Athena queries are slow and costly on raw CSV, what is the most cost-effective long-term fix" — answer is CTAS-driven Parquet conversion plus partitioning, not query tuning, not workgroup limits, not engine version upgrades.
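Day-two operations on the curated zone are an append rather than a fresh CTAS. A sketch of the daily job, assuming the hypothetical raw_orders and curated_orders tables from the CTAS section:

```sql
-- Daily job: append the previous day's raw CSV rows into the curated Parquet table,
-- which creates the new year/month/day partition under the curated prefix
-- (in practice the date literals are injected by the scheduler)
INSERT INTO curated_orders
SELECT
  order_id,
  customer_id,
  order_total,
  year,
  month,
  day
FROM raw_orders
WHERE year = '2026' AND month = '05' AND day = '02';
```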
Key Numbers And Must-Memorize Athena Facts
Pricing
- Per terabyte scanned, 10 MB minimum per query
- Bytes-scanned counter rounds up to the nearest megabyte
- Cancelled queries still bill for scanned bytes
- Compressed bytes count, not uncompressed logical size
Cost Trifecta
- Partitioning (Hive-style year/month/day prefixes)
- Columnar format (Parquet preferred for Athena)
- Compression (Snappy default, ZSTD often better, GZIP highest ratio)
Partition Mechanisms
- MSCK REPAIR TABLE — legacy, scans S3, slow at scale
- Glue crawlers — managed scanning, similar pattern
- Partition projection — zero catalog calls, recommended for predictable layouts
Workgroups
- Per-query data scan limit (cancels single query)
- Per-workgroup data scan limit per time window
- Result location and encryption enforcement
- Engine version pinning per workgroup
- IAM resource for fine-grained access
CTAS
- Writes at most 100 partitions per query (hard cap); use INSERT INTO to add more
- Outputs Parquet, ORC, JSON, TEXTFILE, AVRO
- Registers new table in Glue Data Catalog
- INSERT INTO appends partitions beyond CTAS
Federated Query
- Lambda connectors from Serverless Application Repository
- DynamoDB, RDS, Aurora, Redshift, ElastiCache, DocumentDB, OpenSearch
- Per-TB plus Lambda cost
- Best for one-off cross-source joins, not high-volume repetition
DEA-C01 exam priority: Amazon Athena queries and workgroups carry real weight on the exam. Master the trade-offs, decision boundaries, and the cost and performance triggers each related AWS service exposes; the scenarios often hinge on knowing which service is the wrong answer, not just which is right.
Key facts to memorize: pin the service-level limits, default behaviors, and pricing rules listed above. The exam often tests recall of specific defaults (durations, limits, retry behavior, free-tier boundaries), where knowing the exact number is the difference between right and almost right.
FAQ — Amazon Athena Queries And Workgroups Top Questions
Q1 — How do I reduce Athena query cost the fastest?
The fastest impact comes from converting CSV or JSON data to partitioned Parquet with Snappy compression and using WHERE clauses that match the partition columns. The conversion is done once via a CTAS or Glue ETL job; from that point every query reads only the relevant partitions and only the requested columns at compressed byte sizes. Typical reductions are 10x to 1000x depending on the workload selectivity. Workgroup data scan limits are a safety net, not an optimization — they cancel runaway queries but do not make legitimate queries cheaper. The wrong answer is "tune the SQL" — Athena's planner cannot help if the underlying storage forces a full scan.
Q2 — When should I use partition projection versus MSCK REPAIR TABLE or Glue crawlers?
Use partition projection whenever partition values follow a predictable pattern — date ranges, hourly buckets, region codes from a known set, tenant IDs from an enum. Partition projection eliminates catalog round-trips entirely, scales to millions of logical partitions without slowing the query planner, and removes the operational burden of running MSCK REPAIR TABLE or scheduling crawlers. Use MSCK REPAIR TABLE or Glue crawlers when partition values are unpredictable, when the data is delivered by an external system that cannot promise partition naming conventions, or when the table is small enough that operational simplicity outweighs the catalog cost. For DEA-C01 exam scenarios involving "high partition count" or "slow query planning," partition projection is the right answer.
Q3 — How do I prevent analysts from running queries that scan terabytes accidentally?
Create a dedicated Athena workgroup for each analyst team and configure a per-query data scan limit at a sensible threshold — 10 GB for sandboxes, 100 GB for analyst teams, higher for engineering workgroups. Athena cancels any single query that would scan more than the limit, returning an error to the analyst before the bill explodes. Combine with a per-workgroup time-windowed limit (for example, 1 TB per day per workgroup) as a second safety net for users running many small queries that aggregate to a large bill. Tag each workgroup for cost allocation so finance can see per-team spend in Cost Explorer. Force-disable the default workgroup in production accounts.
Q4 — When should I use Athena versus Redshift Spectrum versus S3 Select?
S3 Select is for an application reading part of a single S3 object, pushing the filter down to S3 to save retrieval bandwidth. It cannot join, cannot span multiple objects, and offers only a minimal SQL subset. Athena is for ad hoc SQL across one or many S3 objects with full SQL semantics including joins, aggregations, window functions, and federated queries. Redshift Spectrum is for a Redshift-based analytics platform that needs to join S3 data lake tables to Redshift cluster tables in the same query; Spectrum runs as part of a Redshift query, not standalone. Decision rule: application reading one object => S3 Select; ad hoc analyst SQL on data lake => Athena; existing Redshift cluster joining to S3 data => Redshift Spectrum.
Q5 — How does CTAS fit into a data lake architecture?
CTAS is the most common pattern for converting raw ingestion zones to curated query zones. The pipeline: ingestion lands CSV or JSON in s3://bucket/raw/, a scheduled CTAS job runs CREATE TABLE curated_dataset WITH (format='PARQUET', partitioned_by=ARRAY['year','month','day']) AS SELECT ... FROM raw_dataset WHERE ..., the curated table is registered in the Glue Data Catalog, and dashboards plus analysts query only the curated table. The CTAS run is a one-time scan cost paid by the pipeline; downstream queries pay the lower Parquet cost forever. For very large datasets that exceed the 100-partition CTAS limit, use INSERT INTO to append additional partitions in subsequent queries.
Q6 — Can Athena query data in another AWS account?
Yes, through cross-account Glue Data Catalog access. The data-owning account registers the S3 prefix in its Glue Data Catalog and grants the consuming account permission via Lake Formation cross-account grants or Glue Data Catalog resource policies. The consuming account creates a resource link pointing at the source database, and Athena queries the resource link as if it were a local table. The S3 bucket policy must also grant the consuming account read access to the underlying S3 objects, and KMS key policies must allow decryption if SSE-KMS is in use. The DEA-C01 exam plants this as a "shared data lake across business units" scenario; the answer is Lake Formation cross-account grants plus S3 bucket policy plus KMS key policy.
Q7 — How do Athena Notebooks relate to Glue and SageMaker?
Athena Notebooks are Apache Spark notebooks managed by Athena, billed per DPU-hour, and tightly integrated with the Glue Data Catalog. They are the path of least resistance for an analyst who is already in the Athena console and needs Python plus Spark. Glue Notebooks are part of the Glue ETL development experience, run on the same Spark backend, and are designed for prototyping ETL code that will become a Glue Job. SageMaker Studio Notebooks are general-purpose ML and data science notebooks that can read from Glue Catalog but are oriented around ML workflows. For DEA-C01, focus on Athena Notebooks for ad hoc Spark in the analyst context and Glue Notebooks for ETL development; SageMaker is largely out of scope per the exam guide.
Further Reading — Official AWS Documentation For Athena
The authoritative AWS sources are the Athena User Guide (concepts, partition projection, workgroups, CTAS), the Athena Federated Query documentation (connectors, data sources, IAM), the Athena Notebooks documentation (Spark sessions, notebook lifecycle), and the Glue Data Catalog documentation for catalog management. The AWS Big Data Blog has multiple deep-dive posts on partition projection, CTAS pipeline patterns, and workgroup cost-control case studies. The AWS Well-Architected Data Analytics Lens covers Athena in the analytics phase. The AWS Samples GitHub repository contains end-to-end sample architectures for Athena-based data lake patterns. Finally, the Skill Builder DEA-C01 Exam Prep Standard Course has a dedicated module on Athena that walks through the canonical exam traps in scenario form.