S3 partitioning and file format choice are the two highest-leverage decisions a data engineer makes in an AWS data lake, and on the DEA-C01 exam they show up across Domains 2, 3, and 4 in scenarios that hinge on Athena query cost, Glue ETL throughput, Redshift Spectrum scan size, and schema evolution under streaming ingestion. Community study guides from Tutorials Dojo, ExamCert.App, and AWS in Plain English all flag the same pain points: candidates over-partition small datasets and create the partition-explosion anti-pattern, pick CSV because they recognize it and pay ten times more per Athena query than they would with Parquet, and run MSCK REPAIR TABLE on a thousand-partition table when partition projection would have been free. The wrong choice on the exam is the wrong choice in production: pick a high-cardinality partition key like user ID and a single Athena query scans millions of partition metadata entries before reading any data; pick row-oriented Avro for a billion-row analytics fact table and Athena reads every column even when the query selects two.
This guide is built for the data engineer perspective. It covers what partitioning is and why it matters, the Hive-style partition layout that powers Athena and Glue, partition cardinality trade-offs, the columnar formats Parquet and ORC and the row-based Avro, when each format is preferred, the Apache Iceberg open table format with ACID and time travel, Apache Hudi and Delta Lake on EMR, partition projection as the alternative to MSCK REPAIR TABLE, the small-file problem and compaction strategies, and the canonical exam traps. By the end the partition-and-format decision should feel as natural as choosing the index on a relational database table.
Why Partitioning Matters For S3 Data Lakes
Partitioning is the practice of organizing object storage so that query engines can skip data they do not need to read. Without partitioning, Athena scans every byte of every file in the table prefix to answer a query, even when the query has a tight predicate like "where year equals 2024 and month equals 04." With partitioning, Athena reads only the relevant prefix and skips the rest entirely.
Query Cost And Performance
Athena charges five dollars per terabyte scanned. A naive table with five terabytes of data costs twenty-five dollars to query once with no partitioning, even if the query needs only one day's worth of data. With daily partitions, the same query reads one out of three hundred sixty-five partitions and costs about seven cents. The factor of three hundred sixty-five savings is what makes partitioning the single highest-impact optimization in a data lake.
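A quick back-of-the-envelope check of that arithmetic (the five-dollar-per-terabyte rate and the five-terabyte table come from the scenario above; the snippet is purely illustrative):

```python
# Back-of-the-envelope Athena scan cost, using the figures from the paragraph above.
PRICE_PER_TB = 5.00          # USD per terabyte scanned (Athena's published rate)
table_tb = 5.0               # total table size in terabytes
partitions = 365             # daily partitions in one year

full_scan_cost = table_tb * PRICE_PER_TB                 # unpartitioned: scan everything
one_day_cost = (table_tb / partitions) * PRICE_PER_TB    # partitioned: scan one day's prefix

print(f"Unpartitioned query: ${full_scan_cost:.2f}")     # $25.00
print(f"One daily partition: ${one_day_cost:.4f}")       # ~$0.0685, roughly seven cents
```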
Data Locality And Parallelism
Beyond cost, partitions let query engines parallelize. EMR Spark with one hundred executors can read one hundred partitions simultaneously, each executor pulling files from its assigned prefix. Partitions are also the natural unit of incremental processing — Glue job bookmarks and Iceberg snapshot semantics both operate at the partition or file granularity.
Lifecycle Management
Partitions provide natural boundaries for lifecycle and retention. A daily-partitioned table can have a Glue job that deletes partitions older than ninety days, an S3 lifecycle policy that transitions specific date prefixes to Glacier, or an Iceberg expire-snapshots operation that removes obsolete versions. Without partitions, all of these operations would require object-by-object scanning and tagging.
Plain-Language Explanation: S3 Partitioning And File Formats
The partition-and-format decision tree resists naming-only intuition. Three concrete analogies make the trade-offs stick.
Analogy 1 — The Library Card Catalog And Stack Layout
Picture a research library with five hundred thousand books. The librarian designed the building so books are shelved by subject, then by author, then by year. A patron asking "any book on cloud computing published in 2024" walks straight to the Computer Science aisle, finds the Cloud Computing sub-section, and pulls the 2024 shelf — the library skipped four hundred ninety-nine thousand books that were not on cloud computing or not from 2024. That shelving is partitioning — physical organization that lets the search skip irrelevant material.
Now picture the books themselves. Each book has a table of contents at the front and an index at the back. A reader looking up "EC2 instance types" jumps to the index, finds the page references, and reads only those pages — they did not read every word of the book. That index is column-oriented storage — physical organization within a file that lets a query read only the columns it needs. Parquet and ORC are libraries where every book has an index and a chapter-by-chapter structure; CSV and JSON are books with no index where the reader must turn every page in order.
Analogy 2 — The Warehouse Aisle System
Picture an Amazon fulfillment center. Items are stored by category, then by SKU, then by lot date. When an order comes in for "blue running shoes size 10 from lot dated April 2024," the picker walks to Footwear, then to Athletic, then to the April-2024 row. Aisle signs (partition keys) tell the picker which aisles to skip. If items were instead stored in arrival order with no aisle structure, the picker would walk every aisle for every order — exactly what Athena does on an unpartitioned table.
But there is a trade-off. If the warehouse partitions by SKU plus lot date plus serial number, every individual unit gets its own bin — billions of bins, half empty most of the time, with the picker spending more time reading bin labels than picking items. That is partition explosion — too many tiny partitions, the result of partitioning on keys with too-high cardinality. The right partitioning matches expected query patterns: partition on the columns queries filter by, at a granularity that produces partitions of meaningful size (gigabytes to tens of gigabytes per partition is the rule of thumb).
Analogy 3 — The Newspaper Archive
Picture a newspaper archive on microfilm. Each year is its own reel; within a reel, articles are organized by date then section then page. A historian looking for "all sports articles from 2003" loads the 2003 reel and scrolls to the sports section — they did not load 2002 or 2004, and within 2003 they did not scroll the news section. The reel-per-year structure is the partition layout; the section-on-reel structure is the column-oriented layout within each file.
Now imagine the newspaper publishes a correction. With Parquet immutable files, the correction means rewriting the entire affected file — clumsy. With Apache Iceberg, the correction is a transactional update that writes a new file plus metadata pointing the table snapshot at the new version, while old readers can still read the prior snapshot via time travel. Iceberg adds the equivalent of a "what did the archive look like on April 30, 2026" query that microfilm cannot do without keeping multiple complete copies. That is the leap from Parquet-on-S3 to managed table formats.
Hive-Style Partition Layout
The standard partition layout in AWS is the Hive convention.
The Year-Month-Day Pattern
Hive-style partitions use key=value prefix segments. A daily-partitioned events table looks like s3://bucket/events/year=2024/month=04/day=15/file.parquet. Athena, Glue Data Catalog, and Spark all recognize this convention automatically — the key parts of the prefix become partition columns in the table schema, and queries with WHERE year=2024 AND month=04 push the predicate down to the prefix scan and skip every other partition.
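As a minimal sketch, this is roughly how a Spark job would produce that layout — the bucket, prefixes, and the event_ts column are placeholder names, and the s3:// scheme assumes EMR's S3 filesystem (use s3a:// on vanilla Spark):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partitioned-write").getOrCreate()

# Hypothetical input: raw events with an event_ts timestamp column.
events = spark.read.json("s3://my-bucket/raw/events/")

(events
    .withColumn("year",  F.date_format("event_ts", "yyyy"))
    .withColumn("month", F.date_format("event_ts", "MM"))
    .withColumn("day",   F.date_format("event_ts", "dd"))
    .write
    .partitionBy("year", "month", "day")   # emits year=.../month=.../day=... prefixes
    .mode("append")
    .parquet("s3://my-bucket/events/"))
```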
Partition Key Selection
The right partition keys are the columns most queries filter by — typically date or timestamp components for time-series workloads. Country, region, customer ID, or product category can also be partition keys, but only if the cardinality is bounded and queries actually filter on them. The wrong partition keys are high-cardinality columns like user ID, transaction ID, or any column with more than ten thousand distinct values — partitioning on those creates partition explosion.
Partition Cardinality Trade-Offs
The rule of thumb is: aim for partitions in the gigabyte-to-tens-of-gigabytes range. Smaller partitions increase metadata overhead and small-file problems; larger partitions reduce parallelism and prevent Athena from skipping irrelevant data. For a typical event stream of one terabyte per day, daily partitions of one terabyte each work; for a stream of one gigabyte per day, weekly or monthly partitions are better. The exam plants over-partitioning scenarios — "partition by user ID" is almost always wrong.
Multi-Level Partitions
Year, month, day is the most common multi-level layout. Adding hour as a fourth level works for high-volume streams (trillions of events) but introduces the small-file problem if individual hours produce tiny files. Mixing date with another dimension (year/month/day/region) is fine if region cardinality is bounded (under fifty values).
A partition is a subdirectory of an S3 prefix whose name encodes a column value, allowing query engines to read only the partitions matching the query's filter predicates. The Hive convention key=value/key=value/... is the standard layout AWS analytics services recognize automatically. Partitions are not data — they are pruning instructions for the query planner. Partitioning a table by columns that queries do not filter on adds zero value and adds metadata overhead. Partitioning a table by columns of very high cardinality (thousands or millions of distinct values) creates partition explosion that slows the query planner more than it speeds the data scan. The DEA-C01 exam tests this with scenarios where a candidate is asked to design a partitioning strategy for a workload — the right answer balances expected query patterns with expected partition size.
Parquet — The Default Columnar Format
Parquet is the default file format for AWS data lake workloads.
Why Parquet Is Preferred
Parquet stores data column by column instead of row by row. A query that selects two of fifty columns reads only the bytes for those two columns — the other forty-eight are physically skipped on disk. Combined with predicate pushdown (the file's row-group statistics let the reader skip row groups whose value range cannot match the predicate) and column-level compression (each column is compressed with the codec best suited to its type), Parquet delivers ten-times-or-more cost reduction on Athena versus row-oriented formats like CSV or JSON.
Parquet Internal Structure
A Parquet file is divided into row groups (typically one hundred twenty-eight megabytes each). Within each row group, data is laid out one column at a time, with statistics (min, max, null count) per column per row group. The file footer holds metadata about row groups and columns. Athena reads the footer first, decides which row groups can be skipped using statistics versus predicates, then reads only the column chunks needed for the query.
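A small sketch of inspecting that footer with pyarrow (the file name is a placeholder; pyarrow is one convenient way to look at row-group statistics, not the only one):

```python
import pyarrow.parquet as pq

# Open the file and read only its footer metadata — the same information Athena
# uses to decide which row groups it can skip.
pf = pq.ParquetFile("part-00000.snappy.parquet")   # placeholder local file
meta = pf.metadata

print("row groups:", meta.num_row_groups, "columns:", meta.num_columns)
for rg in range(meta.num_row_groups):
    col = meta.row_group(rg).column(0)             # first column chunk in this row group
    stats = col.statistics
    if stats is not None:
        # min/max let a reader skip the whole row group when a predicate cannot match
        print(rg, col.path_in_schema, stats.min, stats.max, stats.null_count)
```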
Compression Codecs
Snappy is the default — fast, moderate compression ratio. GZIP and ZSTD provide higher compression ratios at higher CPU cost. ZSTD has emerged as the best general-purpose choice for new pipelines (better compression than GZIP, faster decompression than GZIP). Choose codec based on storage cost versus query CPU cost — typically Snappy is fine, ZSTD if storage is the bottleneck.
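A minimal sketch of choosing the codec at write time with pyarrow (the table contents are invented for illustration):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"user_id": [1, 2, 3], "amount": [9.5, 3.2, 7.8]})

# Same data, different codecs — trading storage size against (de)compression CPU.
pq.write_table(table, "events_snappy.parquet", compression="snappy")  # default: fast
pq.write_table(table, "events_zstd.parquet", compression="zstd")      # smaller files
pq.write_table(table, "events_gzip.parquet", compression="gzip")      # smaller, slower
```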
Parquet With Glue, Athena, EMR, Redshift Spectrum
Glue ETL writes Parquet by default when you use the glueparquet format. Athena queries Parquet natively with column pruning and predicate pushdown. EMR Spark reads Parquet with the Spark SQL Parquet reader. Redshift Spectrum scans Parquet from S3 in parallel from compute nodes. Every AWS analytics service is optimized for Parquet first.
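A hedged sketch of the write step inside a Glue ETL script — the database, table, bucket, and partition keys are placeholders, and the exact option names should be verified against the Glue version in use:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Hypothetical source: a raw-events table already registered in the Glue Data Catalog.
dyf = glueContext.create_dynamic_frame.from_catalog(database="raw", table_name="events")

# Write partitioned Parquet with Glue's optimized writer.
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={
        "path": "s3://my-bucket/events/",             # placeholder target prefix
        "partitionKeys": ["year", "month", "day"],    # Hive-style key=value prefixes
    },
    format="glueparquet",
    format_options={"compression": "snappy"},
)
```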
ORC — The Hive Optimization Format
ORC (Optimized Row Columnar) is the second columnar format AWS supports.
When ORC Is Preferred Over Parquet
ORC was designed for Hive specifically and remains slightly more efficient on Hive workloads — better compression, more compact statistics, ACID transaction support natively. EMR Hive jobs writing to ORC see modest gains over Parquet. Newer Hive versions on EMR run well on Parquet too, so the gap has narrowed.
ORC Internal Structure
ORC files have stripes (the equivalent of Parquet row groups), each stripe has indexes for column statistics, and the file footer holds metadata. The structure is conceptually similar to Parquet but optimized for Hive's specific access patterns.
ORC With AWS Services
Glue ETL supports ORC as input and output. Athena reads ORC natively. Redshift Spectrum supports ORC. The exam treats Parquet as the default and ORC as a Hive-specific alternative — pick Parquet unless the workload is specifically Hive-driven.
Avro — The Row-Based Streaming Format
Avro is fundamentally different from Parquet and ORC.
Why Avro Is Row-Based
Avro stores data row by row, with a schema embedded in the file header. Reading any column requires reading the entire row. Writing a row requires no random access — append the row to the end of the file and you are done. This makes Avro ideal for streaming writes where rows arrive one at a time and the writer cannot batch them into columnar layout efficiently.
When Avro Is Preferred
Avro is the canonical format for streaming ingestion with Kafka and Kinesis. Producers write rows as they arrive; consumers read schema from the file header and parse rows. Avro pairs naturally with the Glue Schema Registry which enforces schema compatibility for streaming pipelines — every message in a Kafka topic conforms to a registered Avro schema, with backward, forward, or full compatibility rules preventing breaking changes.
Schema Evolution
Avro's biggest strength is schema evolution. Adding a new field with a default value is backward compatible — consumers on the new schema can still read old data because the missing field takes its default. Removing an optional field is forward compatible — consumers on the old schema can still read new data, filling in the default for the field that is no longer written. Full compatibility requires both. The Schema Registry enforces these rules at schema registration time, preventing producers from breaking consumers.
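A small sketch of the backward-compatible case using the fastavro library (the record and field names are invented): a record written with the v1 schema is read with the v2 schema, and the missing field takes its default.

```python
import io
from fastavro import writer, reader, parse_schema

schema_v1 = parse_schema({
    "type": "record", "name": "Order", "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
})
schema_v2 = parse_schema({
    "type": "record", "name": "Order", "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "currency", "type": "string", "default": "USD"},  # new field with default
    ],
})

# An "old" producer writes v1 records...
buf = io.BytesIO()
writer(buf, schema_v1, [{"order_id": "A-1", "amount": 12.5}])
buf.seek(0)

# ...and a "new" consumer reading with the v2 schema still succeeds:
for rec in reader(buf, reader_schema=schema_v2):
    print(rec)   # {'order_id': 'A-1', 'amount': 12.5, 'currency': 'USD'}
```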
Avro Versus Parquet — When To Convert
A common pipeline pattern: streaming events arrive in Avro to Kinesis or Kafka, are buffered to S3, and then a Glue or Spark job converts them to Parquet for analytics. Avro is the writer-friendly format; Parquet is the reader-friendly format. The conversion is usually scheduled or triggered when a partition closes (e.g., end of day).
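A minimal sketch of that conversion step in Spark, assuming the spark-avro reader is on the classpath (it ships with EMR Spark; add the package explicitly elsewhere). The paths, the dt= staging layout, and the partition columns are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("avro-to-parquet").getOrCreate()

day = "2024-04-15"                                        # partition being closed out
staged = spark.read.format("avro").load(
    f"s3://my-bucket/staging/avro/dt={day}/")             # writer-friendly Avro staging

(staged
    .withColumn("year",  F.lit(day[0:4]))
    .withColumn("month", F.lit(day[5:7]))
    .withColumn("day",   F.lit(day[8:10]))
    .write
    .partitionBy("year", "month", "day")
    .mode("append")                                       # one run per closed partition
    .parquet("s3://my-bucket/events/"))                   # reader-friendly analytics copy
```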
Storing analytics tables in CSV, JSON, or Avro instead of Parquet or ORC inflates Athena scan costs by ten times or more — Athena charges per byte scanned, and row-oriented formats force reading every column even when the query selects two. The DEA-C01 exam tests this with scenarios describing an analytics dashboard that "is too expensive on Athena" and asks for the lowest-cost optimization. The wrong answers add Athena workgroup limits or query result caching; the right answer is converting the underlying data to Parquet (typically through a Glue or EMR job that runs once per partition close). Combine Parquet conversion with partitioning and the cost reduction is two to three orders of magnitude. Engineers familiar with relational databases sometimes resist this because CSV is "easier to inspect" — that is fine for staging but wrong for production analytics.
Apache Iceberg — Open Table Format On S3
Iceberg is the modern table format AWS has standardized on.
What Iceberg Adds Over Plain Parquet
Plain Parquet on S3 is a file layout — there is no concept of "the current state of the table." Iceberg adds a metadata layer on top of Parquet (or ORC or Avro) that tracks which files are part of which table snapshot, enabling ACID transactions, schema evolution, partition evolution, and time-travel queries. Iceberg is to Parquet what Git is to plain files in a folder — same underlying storage, vastly more powerful semantics.
Iceberg ACID Transactions
Inserts, updates, deletes, and merges in Iceberg are atomic — either the operation completes and the new snapshot becomes current, or it fails and the old snapshot is unchanged. Concurrent writers serialize through optimistic concurrency control on the metadata file. This solves the classic data lake problem where readers see partial writes during long-running ETL jobs.
Schema And Partition Evolution
Iceberg lets you add, drop, or rename columns without rewriting historical data. More importantly, it lets you change the partition specification — from monthly to daily, for example — without rewriting old data. Old data keeps the old partition spec; new data uses the new spec; queries handle both transparently. Plain Parquet cannot do this.
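As a sketch, partition evolution is a metadata-only DDL change when issued through Spark with the Iceberg SQL extensions — the glue_catalog name, table, and timestamp column are placeholders, and the full catalog configuration is omitted:

```python
from pyspark.sql import SparkSession

# Assumes an Iceberg runtime on the classpath and a catalog registered as `glue_catalog`.
spark = (SparkSession.builder
    .appName("iceberg-partition-evolution")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate())

# Switch from monthly to daily partitioning; existing files keep the old spec.
spark.sql("ALTER TABLE glue_catalog.analytics.events DROP PARTITION FIELD months(event_ts)")
spark.sql("ALTER TABLE glue_catalog.analytics.events ADD PARTITION FIELD days(event_ts)")
```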
Time Travel
Every Iceberg snapshot is preserved (subject to expire-snapshots policy). Queries can SELECT ... FOR TIMESTAMP AS OF '2024-04-01' and read the table state at that historical moment. Use cases include reproducing model training datasets, auditing data changes, and rolling back accidental mutations.
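A hedged sketch of issuing a time-travel query through the Athena API with boto3 — the database, table, output location, and the exact timestamp-literal form are placeholders to verify against the Athena engine version in use:

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Read the Iceberg table as it existed at a past moment.
resp = athena.start_query_execution(
    QueryString=(
        "SELECT order_id, amount "
        "FROM analytics.orders "
        "FOR TIMESTAMP AS OF TIMESTAMP '2024-04-01 00:00:00 UTC'"
    ),
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print(resp["QueryExecutionId"])
```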
Iceberg In AWS Glue, Athena, EMR, Redshift
Glue Data Catalog supports Iceberg tables natively. Athena reads and writes Iceberg with full ACID semantics. EMR Spark and EMR Serverless support Iceberg. Redshift Spectrum can read Iceberg tables. AWS has standardized on Iceberg as the preferred open table format — Hudi and Delta Lake remain supported on EMR but Iceberg is the default recommendation for new lakehouse workloads.
Apache Hudi And Delta Lake On EMR
Two other open table formats appear on EMR.
Apache Hudi
Hudi (Hadoop Upserts Deletes and Incrementals) was designed for CDC-heavy workloads. Two storage modes: Copy on Write (rewrites files on every update — read-optimized) and Merge on Read (writes deltas, merges at read time — write-optimized). Hudi is strong for streaming upsert pipelines but has more operational complexity than Iceberg.
Delta Lake
Delta Lake is the Databricks-originated table format. EMR supports it for compatibility with Spark workloads designed for Databricks. Functionally similar to Iceberg in ACID semantics, schema evolution, and time travel.
Choosing Among Iceberg, Hudi, Delta Lake
For new AWS workloads choose Iceberg — it is AWS's recommended format with the best service integration. Choose Hudi only for very high-frequency streaming upsert workloads. Choose Delta Lake only when porting existing Databricks Delta tables to EMR.
Partition Projection — Eliminating MSCK REPAIR TABLE
Partition discovery is the second-largest performance pain point in data lakes.
The MSCK REPAIR TABLE Problem
When new partitions land in S3, Athena does not automatically know about them. The classic fix is MSCK REPAIR TABLE table_name, which scans the S3 prefix, discovers partition directories, and adds them to the Glue Data Catalog. The problem: at table sizes of tens of thousands of partitions, MSCK REPAIR TABLE takes minutes to hours and costs significant Athena scan budget every time it runs.
How Partition Projection Works
Partition projection tells Athena how partition values are computed without consulting the catalog at all. You declare the projection rule — year is an integer from 2020 to 2030, month is an integer from 1 to 12, day is an integer from 1 to 31, and the storage location template is year=${year}/month=${month}/day=${day} — and Athena computes the matching prefix list at query time directly from the predicate. No catalog lookup, no MSCK REPAIR TABLE.
When To Use Partition Projection
Use projection when partition values are predictable from a rule — date partitions, integer ranges, enumerated values. Do not use projection when partitions are sparse or have irregular naming. Projection is configured in the table properties when the table is created in Athena or Glue.
Projection Vs Crawler
Glue crawlers also discover partitions, scheduled to run periodically. Projection is faster (no scan needed) and free (no crawler DPU charges). Use projection for time-partitioned analytics tables; use crawlers for tables with unpredictable schema or non-trivial partition discovery.
Partition projection eliminates MSCK REPAIR TABLE and Glue crawler costs for time-partitioned tables by computing partition values from a declared rule at query time — Athena reads only the prefixes matching the query's filter without consulting the Glue Data Catalog for partition metadata. For tables with thousands or tens of thousands of date partitions, projection cuts query planning from seconds to milliseconds and removes the operational burden of running MSCK REPAIR TABLE on every new partition. Configure projection in the table's TBLPROPERTIES at creation time, declaring the type (integer, enum, date, injected) and range of each partition column. The DEA-C01 exam plants this as a "how do we eliminate MSCK REPAIR TABLE in our pipeline" scenario — the right answer is partition projection, not a more frequent crawler schedule.
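A hedged DDL sketch of those table properties, wrapped as a Python string so it can be submitted through the Athena console, boto3, or any SQL client (the bucket, columns, and ranges are placeholders):

```python
# Hypothetical Athena DDL for a projected, date-partitioned events table.
CREATE_EVENTS_TABLE = """
CREATE EXTERNAL TABLE events (
  event_id string,
  payload  string
)
PARTITIONED BY (year int, month int, day int)
STORED AS PARQUET
LOCATION 's3://my-bucket/events/'
TBLPROPERTIES (
  'projection.enabled'        = 'true',
  'projection.year.type'      = 'integer',
  'projection.year.range'     = '2020,2030',
  'projection.month.type'     = 'integer',
  'projection.month.range'    = '1,12',
  'projection.month.digits'   = '2',
  'projection.day.type'       = 'integer',
  'projection.day.range'      = '1,31',
  'projection.day.digits'     = '2',
  'storage.location.template' = 's3://my-bucket/events/year=${year}/month=${month}/day=${day}/'
);
"""
```

Once a table like this exists, a new day's prefix is queryable the moment its objects land — no crawler run, no MSCK REPAIR TABLE.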
The Small-File Problem And Compaction
Small files are the silent performance killer in data lakes.
Why Small Files Hurt
Athena, Spark, and EMR have per-file overhead — opening each file requires an S3 GET, a metadata parse, and reader initialization. A query reading ten thousand 1-MB files takes far longer than the same query reading one hundred 100-MB files even though both contain the same data. The rule of thumb is: target file size 128 MB to 1 GB per file in a Parquet table.
Why Small Files Happen
Streaming pipelines that flush partitions on time intervals (every minute) produce many small files. Glue ETL jobs that do not group input files emit one output per input partition. Spark with high parallelism and low data volume per task emits one file per task.
Glue groupFiles Option
Glue has a groupFiles="inPartition" option that automatically groups small input files into single in-memory partitions before processing. Use for Glue jobs that read many small files — typical for streaming-derived staging data.
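A minimal sketch of that option in a Glue script (paths and sizes are placeholders; groupSize is expressed in bytes, here roughly 128 MB):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read many small staging files, grouping them into ~128 MB chunks in memory
# before processing.
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-bucket/staging/events/"],
        "recurse": True,
        "groupFiles": "inPartition",     # group small files within each S3 partition
        "groupSize": "134217728",        # target group size in bytes (~128 MB)
    },
    format="json",
)
```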
Compaction Strategies
For tables that accumulate small files, run periodic compaction jobs that read all files in a partition and rewrite them as one or a few large files. Iceberg has a built-in RewriteDataFiles procedure that does exactly this. Glue and EMR Spark can run compaction as a scheduled job. The exam plants compaction as a remediation strategy for "Athena performance has degraded over time" scenarios.
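A hedged sketch of Iceberg compaction invoked from Spark (Athena exposes an OPTIMIZE ... REWRITE DATA USING BIN_PACK statement as a serverless alternative for Iceberg tables); the catalog, database, and table names are placeholders and the catalog wiring is omitted:

```python
from pyspark.sql import SparkSession

# Assumes an Iceberg-enabled Spark session with a catalog registered as `glue_catalog`.
spark = SparkSession.builder.appName("iceberg-compaction").getOrCreate()

# Iceberg's built-in compaction: bin-pack small data files into ~512 MB targets.
spark.sql("""
    CALL glue_catalog.system.rewrite_data_files(
      table   => 'analytics.events',
      options => map('target-file-size-bytes', '536870912')
    )
""")
```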
Common Exam Traps For S3 Partitioning And File Formats
The DEA-C01 exam plants a consistent set of traps. Memorize all seven.
Trap 1 — High-Cardinality Partition Key
A scenario describes partitioning by user ID, transaction ID, or any column with thousands or millions of distinct values. Right answer: partition by date or another low-cardinality column; user ID can be a sort key inside files but not a partition key.
Trap 2 — CSV For Analytics Tables
A scenario describes a costly Athena dashboard with underlying CSV data. Right answer: convert to Parquet through a Glue or EMR job and rerun the dashboard.
Trap 3 — Avro For Athena Reads
A scenario describes Avro stored on S3 queried by Athena. Right answer: keep Avro for Kafka/Kinesis ingestion staging, but convert to Parquet for the query layer. Avro is row-based and Athena cannot do column pruning on it.
Trap 4 — MSCK REPAIR TABLE On Large Tables
A scenario describes running MSCK REPAIR TABLE every hour and complains about the cost. Right answer: configure partition projection or Glue crawler with incremental crawls.
Trap 5 — Plain Parquet When ACID Is Required
A scenario describes concurrent writers overwriting plain Parquet on S3 and seeing partial reads. Right answer: migrate to Iceberg (or Hudi or Delta Lake) for transactional semantics.
Trap 6 — Small Files In Streaming Pipelines
A scenario describes a streaming-derived table where queries have gotten slower over months. Right answer: run a compaction job that rewrites small files into larger ones, ideally using Iceberg's RewriteDataFiles or a scheduled Glue job.
Trap 7 — Partition Without Pushdown
A scenario describes partitioning by a column that queries do not filter on. Right answer: re-partition by columns the queries actually filter on — partitioning is wasted unless queries push the filter down to the partition layer.
For AWS data lake analytics: Parquet is the preferred columnar format for reads, Avro is the preferred row format for streaming writes, ORC is a Hive-specific alternative, and Iceberg is the preferred open table format that adds ACID semantics on top of Parquet files. Memorize: Parquet for Athena, ORC for Hive, Avro for Kafka/Kinesis. Memorize: partition by date columns, not by high-cardinality IDs. Memorize: target 128 MB to 1 GB file sizes, compact small files when they accumulate. Memorize: use partition projection instead of MSCK REPAIR TABLE for time-partitioned tables. These five rules answer the majority of partitioning and format scenarios on the DEA-C01 exam.
Key Numbers And Must-Memorize Format Facts
Format Comparison
- Parquet: column-oriented, default for analytics, predicate pushdown, column pruning
- ORC: column-oriented, Hive-optimized, ACID native, slightly more efficient than Parquet on Hive
- Avro: row-oriented, schema in file header, ideal for streaming with Schema Registry
- CSV/JSON: row-oriented, no statistics, ten times more expensive on Athena than Parquet
Compression Codecs
- Snappy: default, fast, moderate compression
- GZIP: higher compression, slower
- ZSTD: best general-purpose, better than GZIP on both axes
- LZO: legacy, deprecated in newer pipelines
Partitioning Rules
- Hive convention: key=value/key=value/...
- Target partition size: 1 GB to 100 GB per partition
- Avoid partitioning by columns with more than 10,000 distinct values
- Partition keys must be columns queries filter on
File Size Targets
- Parquet target: 128 MB to 1 GB per file
- Smaller files: increase metadata overhead and Athena scan time
- Larger files: reduce parallelism
Iceberg Capabilities
- ACID transactions on S3
- Schema evolution without rewriting
- Partition evolution without rewriting
- Time-travel queries with FOR TIMESTAMP AS OF
- Native support in Glue, Athena, EMR, Redshift Spectrum
DEA-C01 exam priority — S3 partitioning and file formats (Parquet, ORC, Avro). This topic carries weight on the DEA-C01 exam. Master the trade-offs, decision boundaries, and the cost/performance triggers each AWS service exposes — the exam will test scenarios that hinge on knowing which service is the wrong answer, not just which is right.
Exam-day tip. When a DEA-C01 scenario stem mentions cost, latency, or operational overhead constraints together, eliminate options that violate any one constraint before comparing trade-offs among the remaining ones. This filter alone resolves a high fraction of DEA-C01 multi-correct scenarios.
FAQ — S3 Partitioning And File Formats Top Questions
Q1 — How do I choose between Parquet and ORC for a new analytics table?
Choose Parquet for almost all new AWS workloads. Parquet has broader integration across AWS services (Athena, Glue, EMR, Redshift Spectrum, and QuickSight all default to Parquet), better tooling, and the same column-pruning and predicate-pushdown benefits as ORC. Choose ORC only when the workload is specifically Hive on EMR with heavy use of Hive ACID transactions and when migration cost matters. The exam treats Parquet as the default; ORC is a niche alternative for legacy Hive shops.
Q2 — When should I use Avro instead of Parquet?
Use Avro at the streaming ingestion boundary. Kafka producers and Kinesis producers write Avro records with a schema registered in the Glue Schema Registry, the registry enforces schema compatibility, and Avro's row-oriented append-friendly format suits one-record-at-a-time writes. Once the data lands in S3 and crosses into the analytics layer, convert it to Parquet through a Glue, EMR, or Lambda job — Athena queries against Avro on S3 cost ten times more than against Parquet on the same data because Avro is row-oriented and Athena cannot do column pruning.
Q3 — What partition strategy should I use for a daily-arriving event stream?
Partition by date components: year=YYYY/month=MM/day=DD/. Optionally add hour for very high-volume streams (over one terabyte per day). Avoid partitioning by user ID, transaction ID, or other high-cardinality columns — those should be sort keys within files, not partition keys. For predictable date partitions, configure Athena partition projection so MSCK REPAIR TABLE is never needed and the Glue Data Catalog is not consulted for partition metadata at query time. Target partition size in the gigabyte range — if your stream is small, partition by month or week instead of day.
Q4 — How do I migrate a plain Parquet table to Apache Iceberg?
Two paths. In-place migration: register the existing Parquet files under Iceberg metadata without copying them, typically with Spark's Iceberg snapshot or migrate procedures (or add_files) — fast, but it requires the existing Parquet layout to be Iceberg-compatible. Rewrite migration: create a new Iceberg table and copy data via INSERT INTO new_table SELECT * FROM old_table or an equivalent CTAS — slower but cleaner, and it lets you change the partition spec or schema at the same time. After migration, downstream queries continue to work because Iceberg tables expose the same SQL interface as Hive tables.
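A hedged sketch of both paths through the Iceberg Spark procedures (catalog, database, and table names are placeholders, catalog wiring is omitted, and only one path is needed in practice):

```python
from pyspark.sql import SparkSession

# Assumes an Iceberg-enabled Spark session with a Glue-backed catalog named `glue_catalog`.
spark = SparkSession.builder.appName("iceberg-migration").getOrCreate()

# Path 1 — in place: take an Iceberg snapshot of an existing Parquet-backed Hive table
# without copying its data files (`migrate` is similar but replaces the source table).
spark.sql("CALL glue_catalog.system.snapshot('raw.events', 'analytics.events_iceberg')")

# Path 2 — rewrite: copy the rows into a fresh Iceberg table, changing the partition
# spec at the same time.
spark.sql("""
    CREATE TABLE glue_catalog.analytics.events_iceberg_v2
    USING iceberg
    PARTITIONED BY (days(event_ts))
    AS SELECT * FROM raw.events
""")
```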
Q5 — How do I prevent the small-file problem in a streaming pipeline?
Three layers. First, configure buffering at the writer — Kinesis Firehose buffers up to 128 MB or 900 seconds before writing to S3, producing larger files. Second, run a compaction job periodically — Iceberg's RewriteDataFiles procedure or a custom Glue job reads many small files in a partition and rewrites them as one or a few larger files. Third, use Glue groupFiles="inPartition" when reading staging data with many small files, so the Glue job groups inputs in memory before processing. The exam pattern is "queries are slow on a streaming-derived table" — the answer is compaction.
Q6 — When should I use partition projection versus a Glue crawler?
Use partition projection when partition values are predictable from a rule — time-based partitions, integer ranges, enumerated values. Projection costs nothing per query, never needs MSCK REPAIR TABLE, and serves partition values from the table definition itself. Use a Glue crawler when partition values are unpredictable or when the table schema itself can change — the crawler scans S3, infers schema, and updates the Data Catalog. For typical analytics tables with date partitions, projection is the right answer; for less structured exploratory data, a crawler is appropriate.
Q7 — What is the right number of partitions for a billion-row daily fact table?
Target one to ten gigabytes per partition. A billion-row fact table with one-kilobyte rows is one terabyte total — daily partitions of one terabyte each work, monthly partitions of thirty terabytes each are too coarse, hourly partitions of forty gigabytes each are reasonable but increase metadata overhead by 24x. The decision is driven by two factors: how queries filter (if queries always specify a day, daily partitioning is fine; if queries sometimes specify ranges of weeks, daily is still fine but monthly groups would also work), and what compaction strategy is in place (if you compact small files routinely, hourly partitions are workable). The exam plants over-partitioning anti-patterns; under-partitioning is the rarer mistake.
Further Reading — Official AWS Documentation For Partitioning And Formats
The authoritative AWS sources are the Athena User Guide chapters on partitioning, partition projection, and columnar storage formats, the Glue Developer Guide on format options for ETL inputs and outputs, the AWS Prescriptive Guidance for Apache Iceberg on AWS (which is the canonical AWS-position document on Iceberg adoption), the Glue documentation on working with Apache Iceberg tables, the EMR Release Guide for Apache Iceberg, Hudi, and Delta Lake, and the Glue Schema Registry documentation.
The AWS Big Data Blog has multiple deep-dive posts on Iceberg migration, compaction strategies, and partition projection patterns. The AWS Well-Architected Analytics Lens covers the data lake foundation including format and partition selection. The Apache Iceberg open-source documentation covers the format specification at a depth that complements the AWS-specific integration documentation. The Apache Parquet specification documents row-group structure, statistics, and compression — useful for engineers tuning Parquet writers. Finally, the AWS Glue Best Practices whitepaper covers DynamicFrames-versus-DataFrames and file-grouping options that intersect with format and partition decisions.