AWS Glue Data Catalog and AWS Lake Formation are the two-headed governance layer of every AWS data lake, and on the DEA-C01 exam they show up across Domains 2 and 4 in scenarios that hinge on metadata management, schema evolution under streaming, fine-grained access at column and row level, and cross-account data sharing. Community study guides from Tutorials Dojo, ExamCert.App, and Camille Chang's dev.to walkthrough all flag the same pain point: candidates conflate Glue Data Catalog with Lake Formation as if they were one product, miss the two-layer access model where IAM bucket policy and Lake Formation must both allow access, and forget that Lake Formation tag-based access is the only way to scale permissions past a few hundred tables. The wrong choice on the exam is the wrong choice in production: build IAM-only access on a bucket Lake Formation has registered and queries silently fail with an unhelpful access-denied error; share a database to a consumer account without the resource-link pattern and the consumer cannot find the catalog at all.
This guide is built for the data engineer perspective. It covers what the Glue Data Catalog is, the catalog-database-table-partition hierarchy, Glue Schema Registry for streaming-data schema management, what Lake Formation adds on top of the catalog, the five levels of Lake Formation permissions (database, table, column, row, and cell), how Lake Formation interacts with S3 bucket policies and IAM, tag-based access control with LF-Tags, cross-account sharing with resource links and AWS RAM, registered locations and the governance enforcement boundary, and the canonical exam traps. By the end the catalog-versus-Lake-Formation distinction should feel as natural as the difference between a library card catalog and the library's lending policy.
What Is The Glue Data Catalog?
The Glue Data Catalog is a centralized, region-scoped metadata repository for tables, databases, and connections used by every AWS analytics service.
Why The Catalog Exists
Before the Glue Data Catalog, every AWS analytics service had its own metadata store: Athena had its own internal table definitions, EMR had the Hive Metastore, and Redshift Spectrum needed its own external schemas. Each had to be configured separately, and they drifted out of sync. The Glue Data Catalog is the single metadata store that all of these services read from — define a table once, query it from Athena, EMR, Redshift Spectrum, Glue ETL, SageMaker, QuickSight, and Lake Formation without redefinition.
Catalog Hierarchy
The catalog has a four-level hierarchy: catalog (one per AWS account per region by default), database (a logical grouping of tables, equivalent to a schema in relational databases), table (a metadata record for a Parquet/Avro/CSV/Iceberg dataset on S3 or a JDBC source), and partition (a sub-directory of an S3 prefix corresponding to a partition value). Each table has columns, types, partition keys, storage location, file format, and SerDe configuration.
What The Catalog Stores Versus What S3 Stores
The catalog stores metadata only — column names, types, partition specs, S3 paths. The actual data lives on S3 (or in JDBC sources). Updating the catalog does not change the data; updating the data does not automatically change the catalog. This decoupling is the source of many bugs and exam questions: the catalog can drift out of sync with S3 if files arrive without partition metadata being added to the catalog.
Plain-Language Explanation: Glue Data Catalog And Lake Formation
The catalog-versus-governance split benefits from concrete analogies.
Analogy 1 — The Library Card Catalog And The Library's Borrowing Rules
Picture a research library. The card catalog at the entrance lists every book in the building — title, author, subject, shelf location. A patron browses the catalog, finds a book, and walks to the shelf. The card catalog does not control access — anyone who walks into the library can browse it, and anyone who finds a book can pick it up off the shelf if no other rule prevents them. That is the Glue Data Catalog — a discoverability index that says "these tables exist, here are their schemas, here are their S3 locations."
The library's borrowing policy is a separate document. It says faculty members can borrow rare books, undergraduates cannot, only PhD students can borrow from the closed stacks, and reserves must be read in the building. The policy is what the librarian at the desk enforces when a patron tries to check out a book. That is Lake Formation — the policy layer on top of the catalog that says "you can read these columns, not those, only these rows, only from this account." The catalog tells you what exists; Lake Formation tells you what you are allowed to read.
Analogy 2 — The Hospital Medical Records Index And HIPAA Access Controls
Picture a hospital. The medical records index lists every patient chart, test result, and imaging study by patient ID — the index lets a clinician find the record they need. The index is comprehensive and lists everything, regardless of who is allowed to view what. That is the Glue Data Catalog.
The hospital's HIPAA access control system decides which clinicians can read which charts. A nurse on the cardiac floor can read cardiac patients' charts but not psychiatric patients'. A research analyst can read de-identified aggregate data but not patient-identifying columns. A pharmacy technician can see medication lists but not diagnosis history. That is Lake Formation with column and row and cell-level permissions. The index says the records exist; the access control system says which records this user can actually open.
The two-layer security model maps cleanly: the building's keycard reader at the front door (IAM and S3 bucket policy) decides who is allowed in the building at all; the chart-level access control (Lake Formation) decides which charts this person can read once inside. Both must allow access — get past the keycard but be denied at the chart, you still cannot read.
Analogy 3 — The Office Building Directory And Floor Access
Picture a corporate office tower. The lobby directory lists every tenant — company name, floor, suite number. Anyone walking through the lobby can read it; the directory is purely informational. That is the Glue Data Catalog.
The building has two security layers: the lobby turnstile (IAM/S3) lets through anyone with a valid keycard for any floor, and each floor has its own access reader (Lake Formation) that decides which tenants on that floor this person can enter. A consultant might have a lobby keycard plus permission to enter floors 5 and 7 but not floor 8. The directory is the same for everyone; the floor-by-floor access is per-person. The exam plants this exact pattern: a query fails because S3 bucket policy allows but Lake Formation denies, or vice versa — both layers must allow the action for it to succeed.
Catalog Components — Database, Table, Partition, Connection
The catalog has structured components every data engineer must know.
Database
A database is a logical grouping of tables — typically one per business domain or data zone. Common patterns: raw_landing, silver_curated, gold_marts, or customer_data, transactions, inventory. Database names are case-insensitive in the catalog. Permissions can be granted at the database level (covers all tables in the database).
Table
A table is a metadata record describing a dataset. Properties include columns (name, type, comment), partition keys, storage location (S3 path or JDBC URL), input/output format (Parquet, ORC, CSV), SerDe (serializer/deserializer for the format), table type (EXTERNAL_TABLE, GOVERNED, ICEBERG_TABLE), and table parameters (compression, projection rules, etc.). Tables can be created by Glue crawlers, by Glue ETL jobs writing to a target, by Athena DDL, by Lake Formation, or by direct Glue API calls.
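To make the table properties concrete, here is a sketch of the request payload an engineer would pass to boto3's `glue.create_table` for a Parquet table partitioned by date. The database, table, and S3 path names are illustrative, not from any real environment; the SerDe and input/output format class names are the standard Hive Parquet ones.

```python
# Sketch of a Glue CreateTable payload for a Parquet table partitioned by dt.
# Names and S3 paths are hypothetical; the boto3 call itself is shown in a comment.

def build_create_table_request(database, table, s3_path):
    """Build the keyword arguments for glue.create_table(**payload)."""
    return {
        "DatabaseName": database,
        "TableInput": {
            "Name": table,
            "TableType": "EXTERNAL_TABLE",
            # Partition keys are declared separately from data columns.
            "PartitionKeys": [{"Name": "dt", "Type": "string"}],
            "StorageDescriptor": {
                "Columns": [
                    {"Name": "order_id", "Type": "bigint"},
                    {"Name": "amount", "Type": "double"},
                ],
                "Location": s3_path,
                "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
                },
            },
        },
    }

payload = build_create_table_request(
    "silver_curated", "orders", "s3://example-lake/silver/orders/"
)
# With credentials configured: boto3.client("glue").create_table(**payload)
```

The same payload shape is what crawlers, Athena DDL, and ETL jobs produce under the hood; only the authoring mechanism differs.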
Partition
A partition is a metadata record for a partition value within a table. For a daily-partitioned table, each day is one partition. Partitions have their own storage location (the S3 prefix for that partition) and column statistics. The Glue Catalog Partition API supports up to ten million partitions per table. Beyond a few hundred thousand partitions, partition projection is preferred over catalog-stored partitions for query planner performance.
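Adding partition metadata without a crawler looks like the sketch below: a `BatchCreatePartition` payload where each entry carries the partition value and its own S3 prefix. Table and path names are made up for illustration.

```python
# Sketch of a BatchCreatePartition payload (glue.batch_create_partition).
# Database, table, and S3 locations are hypothetical.

def build_partition_input(table_location, dt):
    """One PartitionInput entry for a table partitioned by dt=YYYY-MM-DD."""
    return {
        "Values": [dt],  # one value per partition key, in key order
        "StorageDescriptor": {"Location": f"{table_location}dt={dt}/"},
    }

batch_request = {
    "DatabaseName": "silver_curated",
    "TableName": "orders",
    "PartitionInputList": [
        build_partition_input("s3://example-lake/silver/orders/", d)
        for d in ("2024-01-01", "2024-01-02")
    ],
}
# With credentials configured: boto3.client("glue").batch_create_partition(**batch_request)
```

This is exactly the sync step that goes missing when files land on S3 but no one registers the new partition, which is why the catalog drifts.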
Connection
A connection stores credentials and network configuration for non-S3 data sources — JDBC databases, MongoDB, Kafka, Salesforce. Connections include VPC, subnet, security group settings for jobs that need to reach the source, plus IAM credentials or AWS Secrets Manager references for the secrets.
The AWS Glue Data Catalog is a region-scoped, account-level metadata repository organized as catalog → database → table → partition, used by every AWS analytics service (Athena, EMR, Redshift Spectrum, Glue ETL, SageMaker) as the single source of table definitions and storage locations. The catalog stores metadata only — schemas, types, partition keys, storage paths, table parameters — not the actual data, which lives on S3 or in JDBC sources. The catalog is the backbone of the AWS data lake architecture; without it, every analytics service would maintain its own table definitions and they would drift out of sync. On DEA-C01, scenarios that ask "where does Athena get its table definitions from" or "how do EMR and Redshift Spectrum share schema" are answered by Glue Data Catalog.
Glue Schema Registry — Streaming Schema Management
The Schema Registry is a separate but related Glue feature for streaming data.
What The Schema Registry Is
The Schema Registry stores Avro, JSON Schema, and Protobuf schemas for streaming data sources — Kafka topics, Kinesis Data Streams. Producers register schemas; consumers fetch them. The registry enforces compatibility rules so producers cannot publish schemas that break existing consumers.
Compatibility Modes
Seven modes, from least to most restrictive: NONE (no enforcement, any change allowed), BACKWARD (readers on the new schema can read data written with the previous schema), BACKWARD_ALL (new readers compatible with all prior versions), FORWARD (readers on the previous schema can read data written with the new schema), FORWARD_ALL (forward compatible with all prior versions), FULL (both directions against the previous version), and FULL_ALL (both directions against all prior versions). The most common production modes are BACKWARD and FULL: backward means new consumers can read old messages still in the topic; full means producers and consumers can be deployed in any order.
Schema Evolution Rules
Adding a new field with a default value: backward compatible. Removing an optional field: forward compatible. Adding a required field with no default: breaks backward compatibility, because readers on the new schema cannot fill the missing field when reading old data. Renaming a field: breaks both directions. Widening a field type from int to long: backward compatible in Avro (a reader expecting long can promote int data) but breaks forward compatibility. The registry validates these rules at schema-publish time and rejects changes that violate the configured compatibility mode.
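A sketch of how this looks through boto3's Glue client: register a schema with BACKWARD compatibility, then publish a new version that adds a field with a default, which the registry accepts. The registry and schema names are invented for the example; the calls shown in comments (`create_schema`, `register_schema_version`) are the real Glue API operations.

```python
# Sketch of Glue Schema Registry payloads; registry/schema names are hypothetical.
import json

order_v1 = {
    "type": "record",
    "name": "Order",
    "fields": [{"name": "order_id", "type": "long"}],
}

create_schema_request = {
    "RegistryId": {"RegistryName": "streaming-schemas"},
    "SchemaName": "orders-value",
    "DataFormat": "AVRO",
    "Compatibility": "BACKWARD",  # new readers must be able to read old data
    "SchemaDefinition": json.dumps(order_v1),
}
# glue.create_schema(**create_schema_request)

# v2 adds a field WITH a default, so BACKWARD compatibility holds and
# register_schema_version would accept it; drop the default and it is rejected.
order_v2 = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "long"},
        {"name": "channel", "type": "string", "default": "web"},
    ],
}
register_request = {
    "SchemaId": {"RegistryName": "streaming-schemas", "SchemaName": "orders-value"},
    "SchemaDefinition": json.dumps(order_v2),
}
# glue.register_schema_version(**register_request)
```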
Schema Registry Versus Data Catalog
The Glue Data Catalog and Glue Schema Registry are separate components. The Data Catalog stores table metadata for analytics (Athena, EMR, Spectrum). The Schema Registry stores schemas for streaming data with compatibility rules. They are both part of the Glue service but they answer different questions.
The Glue Schema Registry stores and validates Avro, JSON Schema, and Protobuf schemas for streaming data, with compatibility modes (BACKWARD, FORWARD, FULL) that prevent producers from publishing schema changes that would break existing consumers. The Schema Registry is the contract between streaming data producers (Kafka, Kinesis, MSK) and consumers — every message in a registered topic conforms to a registered schema, and changes are validated before they propagate. Without the Schema Registry, schema-evolution bugs are caught only at consumer runtime, often in production. On DEA-C01, the Schema Registry is tested in scenarios about streaming pipeline reliability and schema compatibility — the answer is to register schemas with the appropriate compatibility mode, not to rely on producer-side discipline alone.
Glue Crawlers — Catalog Population
Crawlers are the automated way to populate the catalog from S3 or JDBC sources.
How Crawlers Work
A crawler scans an S3 prefix (or a JDBC schema), infers the schema from sample data, detects partition structure, and writes table definitions to a target Glue database. Run a crawler on a new S3 prefix and you get a table you can immediately query in Athena. Run it again after new files arrive and it adds new partitions and updates schema if necessary.
Crawler Configuration
Crawlers have a target (S3 paths or JDBC URLs), a schedule (on-demand, hourly, daily, custom cron), classifiers (built-in for common formats like Parquet, JSON, CSV; custom Grok patterns for log formats), and exclusion patterns (to skip files like _SUCCESS markers or .tmp files). Permissions: the crawler runs with an IAM role that needs s3:GetObject on the source bucket and glue:CreateTable/glue:UpdatePartition on the catalog.
The Multiple-Tables Gotcha
A common bug: a crawler scans an S3 prefix and creates multiple tables instead of one because subdirectories have different file structures or schemas. Configure the table-level grouping setting (the TableLevelConfiguration key in the crawler's grouping configuration) to the path depth at which the crawler should treat the whole prefix as one table. The exam plants this as "the crawler created 50 tables for what should have been one table" — the answer is the table-level grouping setting.
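The crawler definition below sketches how the pieces fit together: an S3 target with exclusion patterns, plus the grouping configuration that forces one table at a given path depth. The crawler name, role ARN, and bucket are placeholders; `TableLevelConfiguration` counts the bucket as level 1, so 2 groups everything under `s3://example-lake/events/` into a single table.

```python
# Sketch of a glue.create_crawler payload; names, ARNs, and paths are hypothetical.
import json

crawler_request = {
    "Name": "events-crawler",
    "Role": "arn:aws:iam::111122223333:role/GlueCrawlerRole",
    "DatabaseName": "raw_landing",
    "Targets": {
        "S3Targets": [{
            "Path": "s3://example-lake/events/",
            # Skip job markers and temp files so they don't pollute the schema.
            "Exclusions": ["**/_SUCCESS", "**/*.tmp"],
        }]
    },
    # Group all subdirectories at depth 2 (bucket = 1, events = 2) into ONE table,
    # instead of one table per service subdirectory.
    "Configuration": json.dumps({
        "Version": 1.0,
        "Grouping": {"TableLevelConfiguration": 2},
    }),
}
# With credentials configured: boto3.client("glue").create_crawler(**crawler_request)
```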
Crawlers Versus Partition Projection
Crawlers add partitions to the catalog by scanning S3 — slow at scale, costs DPU minutes per run. Partition projection computes partition values from a rule at query time — free, instant. For time-partitioned tables with predictable partition structure, projection is preferred over crawlers. For unpredictable schema or non-trivial partition discovery, crawlers are appropriate.
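Partition projection is configured entirely through table parameters, so switching from crawler-maintained partitions to projection is a metadata change. The sketch below shows the parameter set for a date-partitioned table as it would appear in a table's Parameters map (for example via `glue.update_table`); the path and date range are illustrative.

```python
# Partition projection table parameters for a hypothetical dt-partitioned table.
# These keys go into the table's Parameters map; no crawler runs are needed after this.
projection_params = {
    "projection.enabled": "true",
    "projection.dt.type": "date",
    "projection.dt.format": "yyyy-MM-dd",
    "projection.dt.range": "2023-01-01,NOW",  # open-ended upper bound
    # Template that maps each projected dt value to its S3 prefix.
    "storage.location.template": "s3://example-lake/events/dt=${dt}/",
}
```

At query time Athena enumerates partition values from the rule instead of calling the Glue partition API, which is why projection stays fast no matter how many partitions exist.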
What Lake Formation Adds On Top Of The Catalog
Lake Formation is a governance layer that sits on top of the Glue Data Catalog.
The Governance Layer
Lake Formation provides centralized permission management for data lake resources. Without Lake Formation, access control is configured table-by-table via IAM policies on Athena, S3 bucket policies, and Glue resource policies — each in a different place, hard to audit. With Lake Formation, all of these are unified under one permission model with one console.
Lake Formation Permissions Model
Lake Formation grants are at multiple levels: database (covers all tables and future tables in the database), table (a specific table), column (specific columns within a table), row (a filter expression that returns a subset of rows), and cell (column and row filters combined for fine-grained masking). Grants use familiar SQL semantics (SELECT, INSERT, ALTER, DELETE, DESCRIBE, DROP) plus Lake Formation-specific permissions such as CREATE_TABLE, and each grant can carry a grant option that lets the recipient re-grant its permissions to others.
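A single table-level grant looks like the payload below, shaped for boto3's `lakeformation.grant_permissions`. The role ARN, database, and table names are placeholders; the empty `PermissionsWithGrantOption` list means the analyst cannot re-grant SELECT to anyone else.

```python
# Sketch of a Lake Formation table-level grant; principal and resource names are hypothetical.
grant_request = {
    "Principal": {
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/analyst"
    },
    "Resource": {
        "Table": {"DatabaseName": "silver_curated", "Name": "orders"}
    },
    "Permissions": ["SELECT"],
    "PermissionsWithGrantOption": [],  # analyst cannot pass SELECT on to others
}
# With credentials configured: boto3.client("lakeformation").grant_permissions(**grant_request)
```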
Registered Locations
For Lake Formation to enforce permissions on data, the underlying S3 path must be registered with Lake Formation. Registration tells Lake Formation "I am authoritative for this S3 path." Once registered, Lake Formation issues short-term S3 credentials to authorized queries — services like Athena and Redshift Spectrum receive these credentials and use them to access S3 instead of their own IAM identities.
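Registration itself is one API call: tell Lake Formation the S3 path and the IAM role it should use when vending credentials for that path. The ARN values below are placeholders; the role named here is the "registered location role" that must itself have S3 access, which is why it appears in the troubleshooting checklist later in this guide.

```python
# Sketch of a lakeformation.register_resource payload; ARNs are hypothetical.
registration_request = {
    # The S3 path Lake Formation becomes authoritative for.
    "ResourceArn": "arn:aws:s3:::example-lake/silver/",
    # Role Lake Formation assumes to vend short-term S3 credentials to queries.
    "RoleArn": "arn:aws:iam::111122223333:role/LakeFormationRegisteredRole",
    "UseServiceLinkedRole": False,
}
# With credentials configured: boto3.client("lakeformation").register_resource(**registration_request)
```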
LF-Tags For Scaling Permissions
Direct table-level grants do not scale past a few hundred tables. LF-Tags are key-value tags attached to databases, tables, columns. Permissions are granted on tag values rather than specific resources — "grant SELECT on all resources tagged Department=Finance to the finance-analyst role." Tag a new table with the right tags and it inherits the existing permissions automatically. LF-Tags are the only practical way to manage permissions in catalogs with thousands of tables.
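The LF-Tag workflow has three steps: create the tag, attach it to resources, and grant on the tag expression. The payloads below sketch all three for boto3's Lake Formation client; the tag values, table, and role names are invented for the example.

```python
# Sketch of the LF-Tag lifecycle; names and ARNs are hypothetical.

# 1. Define the tag key and its allowed values (lakeformation.create_lf_tag).
create_tag_request = {"TagKey": "Department", "TagValues": ["Finance", "Marketing"]}

# 2. Attach a value to a table (lakeformation.add_lf_tags_to_resource).
attach_request = {
    "Resource": {"Table": {"DatabaseName": "gold_marts", "Name": "finance_kpis"}},
    "LFTags": [{"TagKey": "Department", "TagValues": ["Finance"]}],
}

# 3. Grant on the tag expression, not the table (lakeformation.grant_permissions).
#    Any table tagged Department=Finance, now or later, inherits this grant.
tag_grant_request = {
    "Principal": {
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/finance-analyst"
    },
    "Resource": {
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "Department", "TagValues": ["Finance"]}],
        }
    },
    "Permissions": ["SELECT"],
}
```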
Lake Formation permissions and S3 bucket policies form a two-layer access model — both must allow the operation for queries to succeed, and conflicts default to deny. Engineers commonly grant Lake Formation SELECT on a table but forget to update the S3 bucket policy when Lake Formation is in IAMAllowedPrincipals fallback mode, or migrate to Lake Formation but leave restrictive bucket policies that block the new flow. The DEA-C01 exam plants this as a troubleshooting scenario: "the user has Lake Formation SELECT on the table but Athena returns access denied" — the answer is to check the S3 bucket policy, the registered-location IAM role, and the KMS key policy if encryption is in use. All four (Lake Formation grant, S3 bucket policy, registered location role, KMS key policy) must allow access. Conflating "I granted Lake Formation" with "access works" is the most cited Lake Formation exam mistake.
Lake Formation Fine-Grained Access — Column, Row, Cell
Fine-grained access is Lake Formation's biggest differentiator versus plain IAM.
Column-Level Security
Grant SELECT on a table with an Included Columns list and the user can read only those columns. Other columns are filtered out at query time — Athena and Redshift Spectrum honor the column filter and the user never sees the excluded data. Use for PII column masking: grant analysts access to all columns except customer-name and email; grant the customer-service team access to all columns including those.
Row-Level Security With Data Filters
Lake Formation data filters define a row-filter expression (SQL boolean) per (table, principal). Grant SELECT with a data filter where region equals 'us-east-1' and the user reads only those rows. Use for tenant isolation in multi-tenant data lakes — each tenant sees only their own rows, all from the same table.
Cell-Level Security
Combine column-level and row-level filters and the result is cell-level. The user reads only specific columns, only on specific rows. The most common use case: the analyst sees the salary column for their own department's employees but not for other departments — the column filter excludes salary for non-allowed departments, the row filter scopes down to allowed departments.
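The department-scoped example above maps to one `create_data_cells_filter` call that bundles a row filter expression with an allowed-column list. The account ID, database, table, and filter name below are placeholders; a real filter would be granted to a principal in a follow-up `grant_permissions` call.

```python
# Sketch of a cell-level data filter (lakeformation.create_data_cells_filter).
# Catalog ID, names, and the filter expression are hypothetical.
filter_request = {
    "TableData": {
        "TableCatalogId": "111122223333",
        "DatabaseName": "hr",
        "TableName": "employees",
        "Name": "finance-dept-with-salary",
        # Row filter: only Finance department rows are visible.
        "RowFilter": {"FilterExpression": "department = 'Finance'"},
        # Column filter: only these columns are visible on the surviving rows.
        "ColumnNames": ["employee_id", "name", "department", "salary"],
    }
}
# With credentials configured:
# boto3.client("lakeformation").create_data_cells_filter(**filter_request)
```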
How Athena, Spectrum, EMR Honor These Filters
Athena and Redshift Spectrum natively integrate with Lake Formation — they ask Lake Formation what columns and rows the calling identity can see, and apply the filter at query plan time. EMR Spark integration requires the EMR cluster to be configured with the Lake Formation EMR security configuration. Glue jobs respect Lake Formation when running with a job role that has Lake Formation grants.
Cross-Account Data Sharing
Sharing catalog resources across AWS accounts is a frequent exam topic.
Resource Links
A resource link is a pointer in the consumer account to a database or table in the producer account. The consumer account's principals see the resource link in their own catalog; queries through the link route to the underlying shared resource in the producer account. The resource-link pattern means the consumer does not have to switch accounts or assume cross-account roles to query the shared data.
AWS RAM Integration
Lake Formation cross-account sharing uses AWS Resource Access Manager (RAM) under the hood. The producer creates a RAM share for the database or table and invites the consumer account or organization. The consumer accepts the invitation, then creates resource links pointing to the shared resources. Without RAM, there is no cross-account share; the resource link alone is not sufficient.
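The two sides of a cross-account share can be sketched as follows: the producer grants to the consumer account ID (Lake Formation creates the RAM share behind the scenes), and the consumer, after accepting the RAM invitation, creates a resource link in its own catalog with `glue.create_database`. All account IDs and names are illustrative.

```python
# Producer account (111122223333): grant SELECT on every table in the database
# to the consumer ACCOUNT; Lake Formation creates the RAM share under the hood.
producer_grant = {
    "Principal": {"DataLakePrincipalIdentifier": "444455556666"},  # consumer account id
    "Resource": {
        "Table": {"DatabaseName": "silver_curated", "TableWildcard": {}}
    },
    "Permissions": ["SELECT"],
}
# producer: boto3.client("lakeformation").grant_permissions(**producer_grant)

# Consumer account (444455556666), after accepting the RAM invitation:
# create a resource link that points at the producer's shared database.
resource_link = {
    "DatabaseInput": {
        "Name": "silver_curated_link",  # local name consumers query through
        "TargetDatabase": {
            "CatalogId": "111122223333",       # producer account
            "DatabaseName": "silver_curated",  # shared database
        },
    }
}
# consumer: boto3.client("glue").create_database(**resource_link)
```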
Cross-Account With LF-Tags
LF-Tag-based grants can be made across accounts: "grant SELECT on all resources tagged Department=Finance to the consumer-account-finance-role." The consumer account creates resource links and queries through them. This pattern scales to thousands of tables shared across dozens of accounts, where individual table-level grants would be unmanageable.
Cross-Account S3 And KMS
The S3 bucket holding the data must allow the consumer account access via bucket policy. If the bucket is encrypted with SSE-KMS, the KMS key policy must grant decrypt to the consumer account. Without these two underlying configurations, a Lake Formation cross-account share fails with opaque permission errors that are hard to diagnose.
Lake Formation Versus IAM — When To Use Each
A key DEA-C01 exam distinction.
IAM-Only For Coarse Access
IAM is appropriate for coarse boundaries: which AWS accounts can reach the data lake, which IAM roles can assume specific service roles, which network principals can use which VPC endpoints. IAM can grant s3:GetObject on bucket prefixes, glue:GetTable on the catalog, athena:StartQueryExecution for query execution. For organizations with simple needs, IAM-only is enough.
Lake Formation For Fine-Grained And Centralized
Lake Formation is appropriate when the data lake needs column or row-level security, when permission management must be centralized across hundreds of tables, when cross-account sharing needs catalog-aware access, or when LF-Tags are needed for scaling. Lake Formation does not replace IAM — it adds a finer-grained governance layer on top.
Hybrid — IAMAllowedPrincipals
The default IAMAllowedPrincipals setting on a catalog object means Lake Formation falls back to IAM permissions for that object — Lake Formation is "transparent" and IAM rules apply. Removing IAMAllowedPrincipals enables Lake Formation strict mode where Lake Formation grants are required. Most organizations migrate gradually: keep IAMAllowedPrincipals for legacy tables, enable strict mode for new sensitive tables, eventually migrate all tables.
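Moving a table to strict mode comes down to revoking the fallback from the virtual IAM_ALLOWED_PRINCIPALS group, after explicit Lake Formation grants are in place. The payload below sketches that revoke; the database and table names are placeholders, and the ordering comment is the critical part.

```python
# Sketch of enabling strict mode for one table. Grant explicit Lake Formation
# permissions to real principals FIRST; revoking the fallback below without
# prior grants cuts off all existing access. Names are hypothetical.
revoke_request = {
    # Virtual group representing "fall back to IAM" (IAMAllowedPrincipals).
    "Principal": {"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"},
    "Resource": {"Table": {"DatabaseName": "silver_curated", "Name": "orders"}},
    "Permissions": ["ALL"],
}
# With credentials configured: boto3.client("lakeformation").revoke_permissions(**revoke_request)
```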
Common Exam Traps For Glue Data Catalog And Lake Formation
The DEA-C01 exam plants a consistent set of traps. Memorize all seven.
Trap 1 — IAM S3 Allow Without Lake Formation Grant Fails
A scenario describes IAM granting s3:GetObject on a Lake Formation-registered bucket and the query failing. The answer: Lake Formation also requires a SELECT grant on the table; both layers must allow.
Trap 2 — Lake Formation Grant Without S3 Access Fails
The reverse: Lake Formation grants SELECT but the underlying S3 bucket policy denies access. The answer: register the S3 location with Lake Formation so Lake Formation can issue short-term S3 credentials, or grant the IAM role direct S3 access.
Trap 3 — Cross-Account Without Resource Link
A scenario describes sharing a database to another account but the consumer cannot find it. The answer: the consumer must accept the AWS RAM share and create resource links in their own catalog.
Trap 4 — Schema Registry Versus Catalog
A scenario describes Avro streaming messages and asks where the schema is stored. Wrong answer: Glue Data Catalog. Right answer: Glue Schema Registry. They are separate components.
Trap 5 — Crawler Created Multiple Tables
A scenario describes a crawler creating fifty tables for what should have been one. The answer: configure table-level grouping in the crawler settings.
Trap 6 — LF-Tags Versus Direct Grants For Scale
A scenario describes a catalog with five thousand tables and asks how to manage permissions. The answer: LF-Tags. Direct table grants do not scale past a few hundred.
Trap 7 — Lake Formation Does Not Encrypt
A candidate assumes Lake Formation encrypts data. Wrong. Lake Formation governs access; encryption is configured separately on S3 (SSE-KMS) and the Glue Catalog (KMS key for catalog metadata).
For data lakes with more than a few hundred tables or more than a handful of consumer principals, use Lake Formation tag-based access control (LF-TBAC) instead of direct table or column grants — LF-Tags scale linearly with the number of distinct policy patterns rather than with the number of tables. Define a small set of tag keys (Department, Sensitivity, DataDomain), tag every table with appropriate values, and grant permissions on tag combinations rather than individual resources. New tables inherit policies automatically when they are tagged correctly. Direct grants on five thousand tables produce thousands of grant records that are impossible to audit; LF-Tags produce dozens of grant records that map cleanly to organizational policy. The DEA-C01 exam plants scaling scenarios where LF-Tags are the only correct answer.
Key Numbers And Must-Memorize Catalog And Lake Formation Facts
Glue Data Catalog
- One catalog per AWS account per region by default
- Catalog hierarchy: catalog → database → table → partition
- Up to 10 million partitions per table
- Supports Iceberg tables natively as ICEBERG_TABLE type
- Crawler creates tables, updates partitions, infers schemas
Glue Schema Registry
- Supports Avro, JSON Schema, Protobuf
- Compatibility modes: NONE, BACKWARD, FORWARD, FULL (with _ALL variants)
- Separate from Glue Data Catalog
- Used by Kafka, Kinesis, MSK producers and consumers
Lake Formation Permissions
- Database, table, column, row, cell-level grants
- Data filters for row-level security (SQL boolean expressions)
- LF-Tags for scaling permissions (key-value attribute-based)
- IAMAllowedPrincipals default for transparent IAM passthrough
- Strict mode requires explicit Lake Formation grants
Cross-Account Sharing
- AWS RAM under the hood
- Resource links in consumer account catalog
- LF-Tag grants work cross-account
- Underlying S3 bucket policy must allow consumer account
- KMS key policy must grant decrypt to consumer account
The Two-Layer Access Model
- IAM and S3 bucket policy (coarse boundary)
- Lake Formation grants (fine-grained governance)
- Both must allow for the operation to succeed
- Default deny when conflict
DEA-C01 exam priority — Glue Data Catalog and Lake Formation. This topic carries real weight on the exam across Domains 2 and 4. Master the decision boundaries (catalog versus Schema Registry, IAM versus Lake Formation, direct grants versus LF-Tags), because scenario questions hinge on knowing which service is the wrong answer, not just which is right.
Key facts to memorize. Pin the specific limits and defaults listed above, such as the ten-million-partition ceiling, the compatibility modes, and the deny-by-default two-layer model; the exam tests exact recall, and the precise number is often the difference between right and "almost right."
FAQ — Glue Data Catalog And Lake Formation Top Questions
Q1 — When should I use Glue Data Catalog versus Lake Formation?
The Glue Data Catalog is required — every AWS analytics service uses it. Lake Formation is a governance layer added on top of the catalog when you need fine-grained access control (column or row level), centralized permission management, cross-account data sharing with catalog awareness, or LF-Tags for scaling permissions. Small data lakes with simple access requirements run on Glue Data Catalog plus IAM and S3 bucket policies. Larger data lakes with sensitive data and many consumers benefit from Lake Formation. The decision is not "which one" — it is "do I need Lake Formation in addition to the catalog."
Q2 — How do I share a database from one AWS account to another?
Three steps. First, in the producer account, use Lake Formation to grant the database (or specific tables, or LF-Tags) to the consumer account principal. Lake Formation creates an AWS RAM resource share under the hood. Second, in the consumer account, accept the RAM resource share invitation. Third, in the consumer account, create a resource link in the consumer's Glue Data Catalog pointing to the shared database. Consumer-account principals can then query through the resource link as if it were a local database. Underlying S3 bucket policies must also allow the consumer account access, and KMS key policies must grant decrypt if encryption is enabled.
Q3 — Why does my user have Lake Formation SELECT but Athena says access denied?
Three possible causes. First, the S3 location is not registered with Lake Formation, so Lake Formation cannot issue short-term S3 credentials and the user's IAM identity is checked against S3 directly — IAM denies because the bucket policy is restrictive. Second, the registered location is registered with the wrong IAM role that lacks S3 access. Third, the KMS key on the S3 bucket does not grant the user's role decrypt permission. The fix path: check Lake Formation grant, check registered location and its IAM role, check S3 bucket policy, check KMS key policy. All four must align.
Q4 — What is the difference between Glue Data Catalog and Glue Schema Registry?
The Glue Data Catalog stores table metadata for analytics services — column names, types, partition keys, S3 paths. Athena, EMR, Redshift Spectrum, Glue ETL all read from it. The Glue Schema Registry stores schemas for streaming data — Avro, JSON Schema, Protobuf — used by Kafka and Kinesis producers and consumers, with compatibility rules to prevent breaking changes. They are separate components inside the Glue service. Confusing the two is a common exam trap: questions about "where do Athena tables live" answer Catalog; questions about "where does Kafka schema live" answer Schema Registry.
Q5 — How does Glue crawler partition handling work and why does it sometimes create multiple tables?
A crawler scans S3 prefixes, infers schema, and groups files into tables based on directory structure and file similarity. By default, the crawler tries to create one table per distinct schema it encounters. If subdirectories within a prefix have different file structures (different columns, different partitioning), the crawler creates multiple tables. Configure the table-level grouping setting (the TableLevelConfiguration key in the crawler's grouping configuration) to the path depth where you want all subdirectories grouped into one table. The classic anti-pattern: a crawler scanning s3://bucket/events/ with subdirectories per service creates one table per service when the intent was one table for all events; setting TableLevelConfiguration to 2 fixes it.
Q6 — How do LF-Tags scale Lake Formation permissions?
Without LF-Tags, you grant SELECT on each individual table, column, and row filter to each principal — for ten principals across one thousand tables, that is ten thousand grant records. With LF-Tags, you tag each table with a small number of attribute values (Department=Finance, Sensitivity=Public) and grant permissions on tag combinations ("grant SELECT to finance-role on resources where Department=Finance"). Now ten principals plus three tag keys with two values each produces sixty grants instead of ten thousand. New tables tagged correctly inherit existing permissions automatically. LF-Tags are the only practical way to manage permissions on large catalogs and the only way to map technical permissions cleanly to organizational policy.
Q7 — How do I configure Lake Formation strict mode versus IAMAllowedPrincipals?
The IAMAllowedPrincipals setting on a catalog object means Lake Formation falls back to IAM permissions — it is "transparent" and existing IAM-based access works. To enforce Lake Formation strict mode, remove IAMAllowedPrincipals from the resource (database or table) and grant explicit Lake Formation permissions to the appropriate principals. Strict mode is required to use column, row, or cell-level filtering — those features have no IAM equivalent. Most organizations keep IAMAllowedPrincipals on legacy catalogs, enable strict mode for new sensitive datasets, and gradually migrate. Going to strict mode without granting Lake Formation permissions first breaks existing access for everyone — always grant first, then remove IAMAllowedPrincipals.
Further Reading — Official AWS Documentation For Glue Catalog And Lake Formation
The authoritative AWS sources are the Glue Developer Guide chapters on Data Catalog components, Schema Registry (overview, getting started, compatibility modes), and crawlers (configuration, classifiers, exclusion patterns). The Lake Formation Developer Guide covers what Lake Formation is, the permissions overview, tag-based access control, cross-account data sharing, data filters for row and cell-level security, and registered locations.
The AWS Big Data Blog has multiple deep-dive posts on Lake Formation migration patterns, LF-Tag adoption strategies, and cross-account data mesh architectures. The AWS Well-Architected Analytics Lens covers governance and access control patterns. The SageMaker Unified Studio documentation describes how Lake Formation is the governance backbone for SageMaker feature access. Finally, the AWS Security Blog has posts on combining Lake Formation with IAM Access Analyzer to identify unintended cross-account access on catalog and S3 resources.