
Data Quality with Glue DataBrew and Glue Data Quality


DEA-C01 Domain 3, Task 3.4 (ensure data quality): Glue DataBrew visual profiling, Glue Data Quality DQDL rules, the completeness, uniqueness, and validity dimensions, fail-fast vs warn patterns, CloudWatch DQ metrics, and data contract enforcement.


Data quality is the silent killer of every data pipeline that crosses the line from prototype to production, and on the DEA-C01 exam Domain 3 Task 3.4 plants questions in roughly one out of every twelve scenarios that test whether the candidate knows how to embed automated quality checks into Glue ETL workflows. The trap is rarely about whether quality matters — the trap is choosing the right AWS tool for the job. Glue DataBrew, Glue Data Quality, custom Glue ETL assertions, and Deequ on EMR all overlap, and a candidate who picks DataBrew when the right answer is Glue Data Quality (or vice versa) is wrong in the same way a candidate who picks Athena when the right answer is Redshift Spectrum is wrong.

This guide walks data quality on AWS through the Data Engineer / MLOps lens — what data quality dimensions matter, when to use Glue DataBrew versus Glue Data Quality, how to write DQDL (Data Quality Definition Language) rules, how to wire DQ checks into ETL pipelines with fail-fast or warn-and-continue patterns, how to publish DQ metrics to CloudWatch, and the canonical exam traps planted around DataBrew versus Data Quality, automated rule recommendations, and data contract enforcement.

Data Quality Dimensions — The Six Standard Checks

Before picking a tool, define what quality means. The industry-standard six dimensions show up directly in DQDL rule types.

Completeness

The fraction of non-null values in a column. A customer_email column with 5 percent NULL is 95 percent complete. DQDL rule: Completeness "customer_email" > 0.95. The DEA-C01 exam plants this as the most common quality dimension because incomplete data breaks downstream joins and aggregations silently.

Accuracy

The fraction of values that match the truth — typically expressed as a custom predicate or referential check against a known-good source. DQDL rule: ColumnValues "country_code" in ["US","CA","UK","DE","JP"]. Accuracy is hardest to measure because it requires ground truth.

Consistency

The fraction of values that satisfy cross-column or cross-table relationships. DQDL rule: ColumnCorrelation "order_total" "line_items_sum" > 0.99. A row where order_total does not equal sum(line_items) is inconsistent.

Timeliness

The freshness of data — how stale is the latest record? DQDL rule: DataFreshness "event_timestamp" <= 1 days. The DEA-C01 exam plants this as a pipeline-monitoring scenario where the DQ rule fires when ingestion stalls.

Uniqueness

The fraction of values without duplicates. DQDL rule: Uniqueness "order_id" > 0.99. Critical for primary-key columns where duplicates corrupt joins.

Validity

The fraction of values that match a format or pattern. DQDL rule: ColumnValues "email" matches ".*@.*\\..*". Distinct from accuracy — a syntactically-valid email may still be wrong.

AWS Glue DataBrew — Visual Data Profiling And Preparation

Glue DataBrew is the no-code visual environment for exploring, profiling, and cleaning datasets before they enter production pipelines.

What DataBrew Does

DataBrew loads a sample of a dataset (from S3, Redshift, RDS, Snowflake, Glue Data Catalog, or JDBC) into a visual grid that looks like a spreadsheet. Analysts and data engineers can sort, filter, transform, and inspect column statistics interactively. Behind the scenes DataBrew compiles the visual operations into a "recipe" — a JSON document of transformation steps — that can be replayed at scale via a DataBrew recipe job.

Profile Jobs

A DataBrew profile job runs against the full dataset (not just the sample) and produces a profile report: column types, null counts, distinct value counts, min/max/median, value distributions, correlations between columns. The profile job runs as a Glue Spark job under the hood and bills per node-hour. Profile reports are stored in S3 and can be referenced from the DataBrew console.
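
Profile jobs can also be created and launched programmatically rather than from the console. A minimal boto3 sketch, assuming a DataBrew dataset already registered and placeholder bucket and role names:

import boto3

databrew = boto3.client("databrew")

# Create a profile job against an existing DataBrew dataset (names are placeholders).
databrew.create_profile_job(
    Name="orders-profile",
    DatasetName="orders-raw",
    RoleArn="arn:aws:iam::123456789012:role/DataBrewServiceRole",
    OutputLocation={"Bucket": "my-profile-reports", "Key": "orders/"},
)

# Run it against the full dataset; the JSON profile report lands in the output location.
run = databrew.start_job_run(Name="orders-profile")
print(run["RunId"])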

Recipe Jobs

A DataBrew recipe job applies the recipe (visual transformations the analyst defined) to the full dataset and writes results to S3, Redshift, or another target. Recipes can rename columns, filter rows, replace values, apply regex extraction, fill nulls, derive new columns from formulas, and PII-mask sensitive fields. Recipe jobs are the production execution path for transformations that started as interactive analyst work.
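
The same applies to recipe jobs: once a recipe is published from the interactive session, a job can replay it at scale. A hedged boto3 sketch with placeholder dataset, recipe, bucket, and role names:

import boto3

databrew = boto3.client("databrew")

# Apply a published recipe version to the full dataset and write Parquet to S3.
databrew.create_recipe_job(
    Name="orders-clean",
    DatasetName="orders-raw",
    RoleArn="arn:aws:iam::123456789012:role/DataBrewServiceRole",
    RecipeReference={"Name": "orders-cleanup", "RecipeVersion": "1.0"},
    Outputs=[{
        "Location": {"Bucket": "my-curated-bucket", "Key": "orders/"},
        "Format": "PARQUET",
    }],
)
databrew.start_job_run(Name="orders-clean")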

When DataBrew Is Right

DataBrew is the right answer when the team has analysts or domain experts who need to clean data without writing code, when the transformation logic is visual and exploratory, and when the dataset needs profiling to discover quality issues before designing pipelines. It is the path of least resistance for "I need to understand what is in this 100-column CSV and produce a cleaned version."

When DataBrew Is Wrong

DataBrew is wrong for production pipelines that require the same checks on every run with no human in the loop, for very large datasets where Spark job cost demands code-level optimization, and for streaming data quality (DataBrew is batch-only). The DEA-C01 trap: candidates who default to DataBrew for every quality scenario miss that Glue Data Quality is the production-pipeline answer.

Glue DataBrew is a visual no-code data preparation tool for analysts that produces profile reports and applies transformation recipes to datasets in S3, Redshift, and other sources — designed for exploratory cleaning and profiling, not for in-pipeline quality enforcement. DataBrew has 250+ built-in transformations including PII handling, format conversion, value replacement, and statistical imputation. Profile jobs surface column statistics; recipe jobs apply transformations at scale. The DEA-C01 exam plants DataBrew as the right answer when the scenario emphasizes "no-code," "visual," "analyst-driven," or "data profiling before pipeline design." When the scenario describes "production ETL pipeline with embedded quality checks," Glue Data Quality is the right answer instead.

AWS Glue Data Quality — Rules In The Pipeline

Glue Data Quality is the rule-based quality evaluation framework integrated directly into Glue ETL jobs and the Glue Data Catalog.

What Glue Data Quality Does

Glue Data Quality lets you define a "ruleset" — a collection of DQ rules expressed in DQDL — and evaluate it against a dataset either as a standalone task or as a step inside a Glue ETL job. The result is a pass/fail score per rule plus an overall score, surfaced in the Glue console, written to CloudWatch metrics, and emitted as events that downstream systems can react to.

DQDL — Data Quality Definition Language

DQDL is the declarative language for specifying rules. A ruleset looks like:

Rules = [
  RowCount > 1000,
  Completeness "customer_email" > 0.95,
  Uniqueness "order_id" > 0.99,
  ColumnValues "country_code" in ["US","CA","UK","DE","JP"],
  ColumnCorrelation "order_total" "line_items_sum" > 0.99,
  CustomSql "SELECT COUNT(*) FROM primary WHERE total < 0" = 0
]

DQDL is human-readable, version-controllable in Git, and supported by Glue Studio's visual rule builder for analysts who prefer the UI.
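
Because the ruleset is plain text, it can live in Git and be registered against a Catalog table with the Glue API. A minimal boto3 sketch, assuming placeholder database, table, and role names:

import boto3

glue = boto3.client("glue")

ruleset = """Rules = [
    RowCount > 1000,
    Completeness "customer_email" > 0.95,
    Uniqueness "order_id" > 0.99
]"""

# Register the ruleset and attach it to a Catalog table (placeholder names).
glue.create_data_quality_ruleset(
    Name="orders-quality",
    Ruleset=ruleset,
    TargetTable={"DatabaseName": "sales_db", "TableName": "orders"},
)

# Evaluate it as a standalone run, outside any ETL job.
run = glue.start_data_quality_ruleset_evaluation_run(
    DataSource={"GlueTable": {"DatabaseName": "sales_db", "TableName": "orders"}},
    Role="arn:aws:iam::123456789012:role/GlueDataQualityRole",
    RulesetNames=["orders-quality"],
)
print(run["RunId"])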

Built-In Rule Types

The DEA-C01 exam expects familiarity with the rule type names: RowCount, Completeness, Uniqueness, IsComplete, IsUnique, ColumnCount, ColumnExists, ColumnDataType, ColumnValues, ColumnLength, ColumnCorrelation, DataFreshness, DistinctValuesCount, Sum, Mean, StandardDeviation, Entropy, CustomSql, ReferentialIntegrity, DatasetMatch. Memorize the common eight: RowCount, Completeness, Uniqueness, ColumnValues, ColumnCorrelation, DataFreshness, CustomSql, ReferentialIntegrity.

Rule Recommendations

Glue Data Quality can analyze a sample of a dataset and recommend rules automatically — column-level completeness thresholds based on observed null rates, uniqueness thresholds based on observed duplicates, value-range constraints based on observed distributions. The exam plants this as the answer to "the team has 200 columns in a new table and needs a starting rule set without manual authoring." Use the recommendation feature to bootstrap, then refine.
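
A hedged boto3 sketch of the bootstrap workflow, assuming placeholder database, table, and role names (the response field holding the proposed DQDL may differ slightly by SDK version):

import time
import boto3

glue = boto3.client("glue")

# Ask Glue Data Quality to analyze the table and propose a starter ruleset.
run = glue.start_data_quality_rule_recommendation_run(
    DataSource={"GlueTable": {"DatabaseName": "sales_db", "TableName": "orders"}},
    Role="arn:aws:iam::123456789012:role/GlueDataQualityRole",
)

# Poll until the run finishes, then print the recommended DQDL for review.
while True:
    status = glue.get_data_quality_rule_recommendation_run(RunId=run["RunId"])
    if status["Status"] in ("SUCCEEDED", "FAILED", "STOPPED"):
        break
    time.sleep(30)

print(status.get("RecommendedRuleset", ""))  # copy into Git, adjust thresholds, commit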

Integration With Glue ETL Jobs

A Glue ETL job can include an EvaluateDataQuality transform that runs a ruleset mid-pipeline. Failing rules can be configured to fail the job (fail-fast pattern) or emit warnings while continuing (warn-and-continue pattern). Failing rows can be routed to a quarantine S3 prefix using the ruleOutcomes and rowLevelOutcomes outputs.
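
Inside the job script the transform is a few lines of PySpark. A hedged sketch of the fail-fast pattern, assuming a DynamicFrame named orders_dyf was read earlier in the job and that the results frame exposes an Outcome column (verify column names against the current Glue docs):

from awsgluedq.transforms import EvaluateDataQuality

dq_ruleset = """Rules = [
    IsComplete "order_id",
    Completeness "customer_email" > 0.95
]"""

# Evaluate the ruleset mid-pipeline and publish results to CloudWatch.
dq_results = EvaluateDataQuality.apply(
    frame=orders_dyf,  # DynamicFrame produced earlier in the job
    ruleset=dq_ruleset,
    publishing_options={
        "dataQualityEvaluationContext": "orders_quality_gate",
        "enableDataQualityCloudWatchMetrics": True,
        "enableDataQualityResultsPublishing": True,
    },
)

# Fail-fast: stop the job if any rule failed.
failed_rules = dq_results.toDF().filter("Outcome = 'Failed'")  # assumed column name
if failed_rules.count() > 0:
    raise Exception("Data quality gate failed; see CloudWatch metrics and DQ results")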

Integration With Glue Data Catalog

Rulesets can be attached to a Glue Data Catalog table, giving the catalog a "quality score" surfaced in the console. Athena and downstream consumers can see the freshness and quality state of the table they are about to query.

Glue Data Quality is the rule-based, in-pipeline quality framework — DQDL rulesets evaluated by Glue ETL jobs that fail fast, quarantine bad rows, or publish CloudWatch metrics for monitoring. The DEA-C01 exam plants Glue Data Quality as the right answer for "production ETL with automated quality enforcement," "embed quality checks before loading curated zone," "publish quality metrics to CloudWatch," and "block bad data from reaching downstream." DataBrew profiles and prepares; Glue Data Quality enforces. The two are complementary — DataBrew during pipeline design, Glue Data Quality during pipeline execution. Wrong-answer trap: picking DataBrew for an automated production pipeline scenario, or picking custom Spark code for a scenario where DQDL rules suffice.

DQDL Patterns — Writing Rules In The Real World

DQDL is the syntax the exam may surface in code-recognition style questions.

Pattern 1 — Schema And Structural Rules

RowCount > 1000
ColumnCount = 25
ColumnExists "customer_id"
ColumnDataType "order_total" = "Double"

These are the structural assertions that catch upstream schema drift. The DEA-C01 exam plants schema drift as a common Glue crawler trap — DQDL prevents the pipeline from continuing past an unexpected schema change.

Pattern 2 — Completeness And Uniqueness

Completeness "customer_email" > 0.95
IsComplete "order_id"
Uniqueness "order_id" > 0.99
IsUnique "transaction_id"

IsComplete is shorthand for Completeness = 1.0 (must be 100 percent non-null); IsUnique is shorthand for Uniqueness = 1.0. Use the strict variants for primary-key-style columns and the percentage variants for tolerant rules.

Pattern 3 — Value Constraints

ColumnValues "country_code" in ["US","CA","UK","DE","JP"]
ColumnValues "order_total" between 0 and 1000000
ColumnValues "email" matches ".+@.+\\..+"
ColumnLength "phone" between 10 and 15

Value constraints catch garbage data and out-of-spec inputs.

Pattern 4 — Cross-Column Consistency

ColumnCorrelation "order_total" "shipping_cost" > 0.5
CustomSql "SELECT COUNT(*) FROM primary WHERE order_total < shipping_cost" = 0

Use ColumnCorrelation for statistical relationships and CustomSql for arbitrary SQL assertions. CustomSql is the escape hatch that handles any check the built-in rules cannot express.

Pattern 5 — Freshness And Referential Integrity

DataFreshness "ingestion_timestamp" <= 1 days
ReferentialIntegrity "customer_id" "reference.customers.id" = 1.0

DataFreshness measures the age of the most recent timestamp in a column. ReferentialIntegrity checks that values in a column match values in a reference dataset — the foreign-key check that Redshift does not enforce.

Fail-Fast vs Warn-And-Continue Patterns

How the pipeline reacts to DQ failures is a design decision that the DEA-C01 exam tests directly.

Fail-Fast

The Glue ETL job fails when any DQ rule fails. Upstream data does not reach the curated zone, downstream consumers do not see corrupted data, and the pipeline operator gets a CloudWatch alarm. Use fail-fast for high-severity quality gates — primary keys must be unique, financial totals must reconcile, regulated data must validate.

Warn-And-Continue

The job logs the failure, publishes a CloudWatch metric, and continues processing. Use warn-and-continue for low-severity informational checks — completeness is below target but not zero, freshness is slightly stale but not critical. The downstream pipeline runs; the operator gets a metric to triage but no automatic stop.

Quarantine — Row-Level Failure Routing

Glue DQ supports row-level outcomes that flag individual records as passing or failing each rule. The job can route failing rows to a quarantine S3 prefix while passing rows continue to the curated zone. This is the "best of both worlds" pattern: the pipeline does not halt, but bad data is segregated for investigation.
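
A hedged PySpark sketch of the quarantine pattern, assuming a DynamicFrame named orders_dyf from earlier in the job, hypothetical S3 paths, and the row-level flag column name used by current Glue versions (DataQualityEvaluationResult):

from awsgluedq.transforms import EvaluateDataQuality

# Evaluate per-row so each record carries a pass/fail flag.
row_outcomes = EvaluateDataQuality().process_rows(
    frame=orders_dyf,
    ruleset='Rules = [ IsComplete "order_id", ColumnValues "order_total" between 0 and 1000000 ]',
    publishing_options={"dataQualityEvaluationContext": "orders_row_gate"},
)

df = row_outcomes.toDF()
passed = df.filter("DataQualityEvaluationResult = 'Passed'")  # assumed flag column
failed = df.filter("DataQualityEvaluationResult = 'Failed'")

# Passing rows continue to the curated zone; failing rows land in quarantine for review.
# In practice, drop the DQ flag columns before writing the curated output.
passed.write.mode("append").parquet("s3://curated-bucket/orders/")
failed.write.mode("append").parquet("s3://quarantine-bucket/orders/")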

Choosing The Pattern

The DEA-C01 exam plants this with scenario detail. "Financial reconciliation pipeline must not load corrupted data" => fail-fast. "Marketing pipeline tolerates incomplete fields but tracks freshness" => warn-and-continue. "Customer data with mixed-quality sources" => quarantine bad rows for manual review.

Use Glue Data Quality rule recommendations to bootstrap a ruleset for a new table, then iterate on the rules as the pipeline matures. The recommendation engine analyzes a sample of the table and proposes rules — completeness thresholds based on observed nulls, uniqueness thresholds for high-cardinality columns, value-range checks for numeric columns, type-and-existence checks for all columns. The output is a starter DQDL ruleset that an engineer can review, adjust, and commit. This pattern accelerates DQ adoption from weeks of manual rule authoring to hours of curation. The DEA-C01 exam plants recommendations as the right answer when "the team has hundreds of columns and no existing rules" — never pick "manually author rules from scratch" or "use a third-party tool" when AWS recommendations are an option. After bootstrapping, schedule periodic re-recommendation as the table evolves.

Publishing DQ Metrics To CloudWatch

Glue Data Quality automatically publishes evaluation results as CloudWatch metrics for monitoring and alerting.

Custom Metrics Per Ruleset

Each ruleset evaluation produces a glue.data.quality.score CloudWatch metric scoped to the dataset and ruleset. The score is the percentage of rules that passed. Dashboards can chart the score over time; alarms can fire when the score drops below a threshold.

Per-Rule Metrics

Individual rule outcomes are emitted as separate metrics, letting operators chart "completeness of customer_email over the past 30 days" and detect gradual quality degradation.

Integration With EventBridge

DQ rule failures emit EventBridge events that downstream rules can react to — sending Slack notifications, triggering remediation Lambdas, or opening tickets in a ticketing system.
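
A hedged boto3 sketch of wiring a remediation Lambda to those events; the source and detail-type strings are assumptions worth confirming against the Glue Data Quality documentation, and the Lambda ARN is a placeholder:

import json
import boto3

events = boto3.client("events")

# Match Glue Data Quality result events (assumed source and detail-type strings).
events.put_rule(
    Name="glue-dq-results",
    EventPattern=json.dumps({
        "source": ["aws.glue-dataquality"],
        "detail-type": ["Data Quality Evaluation Results Available"],
    }),
)

# Route matched events to a remediation or notification Lambda (placeholder ARN).
events.put_targets(
    Rule="glue-dq-results",
    Targets=[{
        "Id": "dq-remediation",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:dq-remediate",
    }],
)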

Integration With CloudWatch Alarms

A typical production pattern: DQ score below 0.95 triggers a CloudWatch alarm; the alarm sends an SNS notification to the pipeline owner; the owner investigates and acknowledges the alarm. Combine with composite alarms for cross-pipeline quality dashboards.
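
A hedged boto3 sketch of that alarm, assuming the glue.data.quality.score metric and namespace described above and a hypothetical SNS topic; the alarm dimensions should match whatever the evaluation run publishes:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the ruleset score drops below 0.95 and notify the pipeline owner via SNS.
cloudwatch.put_metric_alarm(
    AlarmName="orders-dq-score-low",
    Namespace="Glue Data Quality",          # assumed namespace for Glue DQ metrics
    MetricName="glue.data.quality.score",
    # Dimensions=[...]                      # match the dimensions your evaluation publishes
    Statistic="Minimum",
    Period=3600,
    EvaluationPeriods=1,
    Threshold=0.95,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",           # a stalled pipeline also raises the alarm
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:dq-alerts"],
)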

Glue Data Quality vs Glue DataBrew vs Custom Spark — Decision Tree

Three overlapping options demand a clear decision tree.

Glue Data Quality

Right when: production ETL needs in-pipeline rule evaluation, rules are declarative and stable, the team wants CloudWatch metrics for free, integration with Glue Data Catalog for table-level quality scores is desired.

Glue DataBrew

Right when: visual no-code preparation is needed, analysts are the primary users, the workflow is exploratory profiling and one-time cleaning, transformations are simpler than full ETL.

Custom Spark Code (Glue ETL Or EMR)

Right when: rules are highly custom and don't fit DQDL, the team has Spark expertise, performance optimization is critical, integration with Deequ (Amazon's open-source DQ library) is desired.

Combined Patterns

Mature pipelines often combine: DataBrew for initial profiling and analyst-driven prep, Glue Data Quality for production rule enforcement, custom Spark for exotic checks. The DEA-C01 exam may plant this as a multi-step scenario where the right answer is "use DataBrew to profile, then write DQDL rules, then run them in the Glue ETL job."

Common Exam Traps For Data Quality

Memorize all five.

Trap 1 — DataBrew For Production DQ

A scenario describes an automated production pipeline needing quality gates. Wrong answer: DataBrew. Right answer: Glue Data Quality with DQDL rulesets and EvaluateDataQuality transform inside the Glue ETL job.

Trap 2 — Custom Spark For Standard Checks

A scenario describes basic completeness, uniqueness, and value-range rules. Wrong answer: write custom PySpark code. Right answer: Glue Data Quality DQDL — built-in, declarative, with CloudWatch integration.

Trap 3 — Manually Authoring Rules For 200 Columns

A scenario describes a wide table needing initial rules. Wrong answer: hand-write DQDL for every column. Right answer: use Glue Data Quality rule recommendations to bootstrap.

Trap 4 — DataBrew For Streaming Data

A scenario describes streaming Kinesis data needing quality checks. Wrong answer: DataBrew (batch-only). Right answer: Managed Service for Apache Flink with custom validation, or Lambda-based validation, or Glue streaming ETL with embedded checks.

Trap 5 — Confusing DataBrew Profile Jobs With Quality Enforcement

A scenario asks how to "discover" data quality issues. Right answer: DataBrew profile jobs (one-time discovery). A scenario asks how to "enforce" data quality on every pipeline run. Right answer: Glue Data Quality (continuous enforcement). The verbs in the scenario tell you which tool.

DataBrew recipe jobs and Glue Data Quality rulesets solve different problems — recipe jobs transform data, rulesets evaluate data, and the DEA-C01 exam plants questions where candidates conflate the two. A recipe job answers "how do I clean and reshape this dataset?" — rename columns, fill nulls, apply regex, filter rows, mask PII. A ruleset answers "is this dataset valid by my rules?" — completeness above 95 percent, uniqueness on the primary key, values within expected ranges. The exam trap: a scenario asks for "automated cleaning of customer data with quality checks" and the answer choices include "DataBrew recipe job," "Glue Data Quality ruleset," and "DataBrew profile job." If the scenario emphasizes transformation, recipe job. If it emphasizes validation and pass/fail enforcement, ruleset. If it emphasizes discovery of unknown issues, profile job. Read the verb carefully — clean = recipe, validate = ruleset, profile = profile job.

Data Contracts — The Architectural Pattern

Data contracts are the formal agreement between data producers and data consumers about what a dataset contains and how it behaves.

What A Contract Specifies

Schema (column names, types, nullability), quality thresholds (completeness, uniqueness, freshness), update cadence (hourly, daily, on-demand), ownership (which team produces it), and SLA (when consumers can expect new data). The contract is version-controlled, and breaking changes require a new contract version.

Glue Data Quality As Contract Enforcement

The DQDL ruleset attached to a Glue Data Catalog table is the executable contract. When the producer's pipeline runs, the ruleset evaluates whether the new data meets the contract; consumers can query the latest evaluation score before consuming.

Schema Registry For Schema Evolution

Glue Schema Registry handles the schema-evolution side of contracts — backward, forward, or full compatibility modes that determine whether new schema versions can coexist with old consumers. Combine Schema Registry (for schema) with Glue Data Quality (for quality) for the full contract enforcement story.
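
A hedged boto3 sketch of the schema half of that contract, registering an Avro schema with backward compatibility; the registry, schema, and field names are placeholders:

import json
import boto3

glue = boto3.client("glue")

orders_v1 = json.dumps({
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "order_total", "type": "double"},
        {"name": "customer_email", "type": ["null", "string"], "default": None},
    ],
})

# BACKWARD compatibility: new versions must remain readable by existing consumers.
glue.create_schema(
    SchemaName="orders-contract",
    RegistryId={"RegistryName": "data-contracts"},  # registry created beforehand
    DataFormat="AVRO",
    Compatibility="BACKWARD",
    SchemaDefinition=orders_v1,
)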

Plain-Language Explanation: Glue DataBrew And Glue Data Quality

Three concrete analogies make the DataBrew-vs-Data-Quality split intuitive.

Analogy 1 — The Restaurant Health Inspection

DataBrew is the restaurant's prep cook tasting and adjusting recipes during the morning prep — checking ingredients, fixing seasoning, deciding what to throw out, all by visual inspection and expertise. The prep cook is hands-on, exploratory, and trained to spot issues. Glue Data Quality is the health inspector who arrives at noon with a checklist: refrigerator at correct temperature, no expired ingredients, hand-washing log present, allergens labeled. The inspector does not taste anything; they evaluate against pre-defined rules and either pass the kitchen or shut it down. The prep cook (DataBrew) operates during preparation; the inspector (Glue Data Quality) operates at the gate before service starts. The DEA-C01 trap is sending the prep cook to do health inspections — slow, inconsistent, no audit trail — or sending the health inspector to fix bad recipes (they cannot, that is not their job).

Analogy 2 — The Library Acquisitions Versus Circulation Desk

DataBrew is the acquisitions librarian: when a new collection of books arrives from an estate donation, the acquisitions librarian visually inspects every box, profiles the collection (how many fiction, how many non-fiction, what condition, what languages), removes duplicates, fills in missing call numbers, and decides what gets shelved versus discarded. Slow, hands-on, expert work. Glue Data Quality is the circulation desk's automated check: every book that enters the lending system must have a valid ISBN, a non-null call number, a unique barcode, and a checked-out flag set to false on first entry. The check runs on every transaction, fails fast on bad data, and writes a metric to the library's quality dashboard. Acquisitions handles the new and unusual; circulation enforces the standards on every operation. DataBrew profile jobs are the acquisitions inventory report; DataBrew recipe jobs are the acquisitions cleaning workflow; Glue Data Quality is the circulation gatekeeper.

Analogy 3 — The Postal System Sorting Versus Address Validation

DataBrew is the mail-sorting clerk who opens a sack of mixed mail, looks at each piece, decides which destination bin it goes in, and applies rules like "anything with damaged postage goes to the special handling bin." Hands-on, recipe-driven, and rerunnable on the next sack. Glue Data Quality is the address validation system at the moment of postage sale: every package must have a valid ZIP code, a non-empty street address, a recipient name, and a declared weight within limits — fail any check and the package is rejected at the counter, never entering the mail stream. Both serve quality; the sorting clerk handles the messy real world while the validation system enforces the standards. DataBrew is the place where "we have a million records of unknown shape and need to understand them." Glue Data Quality is the place where "we have a known schema and need to enforce it on every batch."

Key Numbers And Must-Memorize Facts

Glue DataBrew

  • 250+ built-in transformations
  • Supports S3, Redshift, RDS, Glue Catalog, Snowflake, JDBC sources
  • Profile jobs produce statistics + correlations
  • Recipe jobs replay visual transformations at scale
  • Bills per node-hour (Spark backend)
  • No-code visual interface, also CLI/API accessible

Glue Data Quality

  • DQDL is the rule definition language
  • 20+ built-in rule types (RowCount, Completeness, Uniqueness, etc.)
  • Rule recommendations bootstrap from sample analysis
  • Integrates with Glue ETL jobs (EvaluateDataQuality transform)
  • Integrates with Glue Data Catalog (table-level quality score)
  • Publishes CloudWatch metrics automatically
  • Emits EventBridge events on rule failures
  • Fail-fast, warn-and-continue, or quarantine row-level patterns

Data Quality Dimensions

  • Completeness, Accuracy, Consistency, Timeliness, Uniqueness, Validity
  • DQDL covers all six via different rule types
  • Custom SQL is the escape hatch for arbitrary checks

Decision Rules

  • DataBrew = visual, exploratory, analyst-driven, profiling
  • Glue Data Quality = declarative rules, in-pipeline, automated enforcement
  • Custom Spark = exotic checks, performance-critical, Deequ integration
  • Schema Registry + Glue DQ = full data contract enforcement

Memorize the eight most-tested DQDL rule types: RowCount, Completeness, Uniqueness, ColumnValues, ColumnCorrelation, DataFreshness, CustomSql, ReferentialIntegrity. RowCount asserts table size. Completeness asserts non-null fraction. Uniqueness asserts no-duplicate fraction. ColumnValues asserts membership in a set or range. ColumnCorrelation asserts statistical relationship between columns. DataFreshness asserts age of most-recent timestamp. CustomSql is the escape hatch for arbitrary SQL assertions. ReferentialIntegrity asserts foreign-key-style match against a reference dataset. The DEA-C01 exam may show DQDL syntax in a code block and ask "what does this rule check?" or "which rule type would catch this issue?" Memorizing the eight names plus their semantic meaning covers nearly all surfaceable scenarios.

DEA-C01 exam priority — Glue DataBrew and Glue Data Quality. This topic carries weight on the DEA-C01 exam. Master the trade-offs, decision boundaries, and the cost/performance triggers each AWS service exposes — the exam will test scenarios that hinge on knowing which service is the wrong answer, not just which is right.

FAQ — Glue DataBrew And Glue Data Quality Top Questions

Q1 — When should I use Glue DataBrew versus Glue Data Quality?

Use Glue DataBrew when the workflow is exploratory data profiling, no-code visual preparation, or analyst-driven cleaning where the user wants to inspect rows visually and apply transformations like a spreadsheet. Use Glue Data Quality when the workflow is automated, declarative, in-pipeline rule enforcement that runs on every Glue ETL execution and emits CloudWatch metrics. The two are complementary — DataBrew during the design phase to discover issues and prototype cleaning logic, Glue Data Quality during the production phase to enforce rules continuously. The DEA-C01 exam plants this with scenario verbs: "discover," "explore," "profile" => DataBrew; "enforce," "validate," "block bad data" => Glue Data Quality. Never pick DataBrew for a production automated pipeline scenario or Glue Data Quality for an analyst-driven exploratory scenario.

Q2 — How do I write DQDL rules for a new table without authoring 50 rules manually?

Use Glue Data Quality rule recommendations. The feature analyzes a sample of the dataset and proposes a starter ruleset based on observed properties — completeness thresholds based on null rates, uniqueness thresholds for likely-key columns, value-range constraints for numeric columns, type-and-existence checks for all columns. The recommendations are produced as DQDL text that an engineer can copy, edit, and commit to version control. After bootstrapping, refine the rules over time as the team learns which thresholds are too tight or too loose. The DEA-C01 exam plants this as the answer to "we have a wide table with no existing rules" — never hand-author when recommendations are available.

Q3 — How do I handle DQ failures without halting the entire pipeline?

Three options. Warn-and-continue: configure the EvaluateDataQuality transform to log failures and emit CloudWatch metrics without failing the job. The pipeline runs to completion; operators see metrics and triage out of band. Quarantine: use row-level outcomes to route failing records to a quarantine S3 prefix while passing rows continue to the curated zone. Compensating actions: use EventBridge events from rule failures to trigger remediation Lambdas (e.g., re-pull from source, fill nulls, or flag for manual review). Choose by severity — critical rules (PK uniqueness, financial reconciliation) should fail-fast, informational rules (slight completeness drop) should warn-and-continue, mixed-severity datasets should quarantine. The DEA-C01 exam plants this as scenario design: read the business context to decide.

Q4 — Can Glue Data Quality rules run on streaming data?

Glue Data Quality is primarily designed for batch ETL on the Spark engine, evaluating rulesets against a finite dataset per job run. For streaming data quality, the patterns are different: Managed Service for Apache Flink for stateful streaming validation, Lambda-based per-record checks on Kinesis or DynamoDB Streams events, or Glue streaming ETL with embedded validation logic. Glue streaming ETL can include EvaluateDataQuality on micro-batches, but the latency profile differs from true streaming validation. The DEA-C01 exam plants streaming DQ as a Flink-or-Lambda question, not a Glue Data Quality question — read the scenario for "real-time" or "sub-second" cues.

Q5 — How does Glue Data Quality integrate with the Glue Data Catalog?

A ruleset can be attached to a Glue Data Catalog table as a "data quality" property, giving the catalog a quality score visible in the console. Athena and downstream consumers can query the latest evaluation score before consuming the table. The integration enables a "data contract" pattern where the producer's pipeline runs the ruleset on every load and the consumer can decide whether the freshness and quality meet their needs before launching expensive analytics queries. The DEA-C01 exam plants this as the right answer for "surface data quality state to consumers without manual coordination."

Q6 — What is the difference between IsComplete and Completeness in DQDL?

IsComplete "column_name" is shorthand for Completeness "column_name" = 1.0 — the column must be 100 percent non-null with no tolerance. Completeness "column_name" > threshold allows a tolerance — for example, Completeness "email" > 0.95 requires 95 percent non-null but tolerates 5 percent missing. Use IsComplete for primary-key columns and other non-negotiable fields; use Completeness for tolerant fields like optional contact info. Similar pattern: IsUnique is shorthand for Uniqueness = 1.0, while Uniqueness > 0.99 allows a tolerance. The DEA-C01 exam may show both in answer choices and test whether the candidate knows the strict-versus-tolerant distinction.

Q7 — What is a data contract and how does Glue Data Quality enforce it?

A data contract is the formal agreement between data producers and consumers specifying schema, quality thresholds, update cadence, ownership, and SLA. The contract is version-controlled and breaking changes require a new version. Glue Data Quality enforces the executable portion of the contract — DQDL rulesets attached to Glue Data Catalog tables that evaluate the dataset against committed quality thresholds on every pipeline run. Combine with Glue Schema Registry for schema-evolution governance (backward, forward, full compatibility modes) to cover the schema portion of the contract. The DEA-C01 exam plants data contracts as the architectural answer to "the producer team and consumer team need to formally agree on dataset quality without ad-hoc coordination" — Glue Schema Registry plus Glue Data Quality is the AWS-native answer.

Further Reading — Official AWS Documentation

The authoritative AWS sources are the AWS Glue Data Quality documentation (DQDL syntax, rule types, recommendations, EvaluateDataQuality transform), the AWS Glue DataBrew documentation (profile jobs, recipe jobs, transformation reference), the AWS Big Data Blog series on data contracts and data quality patterns, and the AWS Well-Architected Data Analytics Lens which covers data quality as part of the analytics phase. The Skill Builder DEA-C01 Exam Prep Standard Course has dedicated modules for Domain 3 covering data quality enforcement. For deeper Spark-based quality engineering, the open-source Deequ library on GitHub (developed by AWS Labs) is the foundation that Glue Data Quality builds on — understanding Deequ helps with custom rule scenarios. The AWS Samples GitHub repository contains end-to-end sample pipelines combining DataBrew profiling, Glue ETL, and Glue Data Quality evaluation.
