
Macie, DataZone, and Data Privacy Governance


DEA-C01 Domain 4 Tasks 4.4/4.5 governance: Amazon Macie ML-driven PII discovery in S3, Amazon DataZone business catalog, CloudTrail data events vs management events, S3 access logs, Redshift audit logging, dynamic data masking, and GDPR/HIPAA compliance patterns.


Macie, DataZone, and the broader data governance toolset are the DEA-C01 Domain 4 services that catch data engineers off guard because they look like security tools but actually serve a different purpose — discovering PII, cataloging business data assets, auditing access, and enforcing privacy regulations. On the DEA-C01 exam Tasks 4.4 and 4.5 plant scenarios in roughly one out of every eight questions that test whether the candidate knows the boundary between Macie (discovery) versus Lake Formation (access control) versus DataZone (business catalog) versus CloudTrail (audit). The trap is rarely about whether to govern — it is choosing the right tool for the question, because every wrong answer in this domain looks plausible.

This guide covers data privacy governance through the Data Engineer / MLOps lens — what Amazon Macie does and does not do, how DataZone's business catalog differs from the Glue Data Catalog, how CloudTrail data events differ from management events, how Redshift dynamic data masking and Lake Formation cell-level masking compose, how to log S3 object-level access for compliance, and the canonical exam traps planted around Macie-versus-Lake-Formation, DataZone-versus-Glue-Catalog, and CloudTrail's two event types.

Data Privacy Governance — The Three Pillars

Before talking about Macie or DataZone, separate the three concerns that "governance" bundles together.

Pillar 1 — Discovery

What sensitive data exists in our data lake, and where? Discovery is the prerequisite for protection — you cannot encrypt or mask what you cannot find. Amazon Macie is the AWS service for automated PII discovery in S3.

Pillar 2 — Access Control

Who can see what data, at what granularity? Access control is enforced by IAM (API-level), Lake Formation (database/table/column/row), and KMS (encryption keys). Discussed in the IAM and Lake Formation topic; covered here only at the boundary with discovery and audit.

Pillar 3 — Audit And Compliance

Who actually accessed what data, when, and from where? Audit is enforced by CloudTrail, S3 access logs, Redshift audit logs, and CloudTrail Lake. Audit is the reactive record that satisfies GDPR, HIPAA, SOX, and other regulations after the fact.

Why DEA-C01 Tests All Three

The exam plants scenarios where a candidate confuses discovery with access control ("use Macie to block PII access") or audit with discovery ("use CloudTrail to find PII"). Each tool has a job; mixing them up is the most common Domain 4 mistake.

Amazon Macie — ML-Driven PII Discovery

Amazon Macie is the AWS service for automated discovery of sensitive data in S3 buckets.

What Macie Does

Macie scans S3 objects (CSV, JSON, Parquet, Avro, plain text, common document formats) using machine learning models and pattern matching to identify sensitive data — credit card numbers, Social Security numbers, names, addresses, AWS access keys, SSH keys, and many other categories. Findings are surfaced in the Macie console, exported to Security Hub, and emitted as EventBridge events for downstream automation.

What Macie Does NOT Do

Macie is a discovery tool: it finds sensitive data and reports it. Macie does NOT mask, redact, encrypt, delete, or block access to the data. Acting on findings is the engineer's responsibility (Lake Formation grants, KMS encryption, S3 Object Lambda, manual remediation). The DEA-C01 trap: candidates pick Macie for "automatically prevent analysts from seeing PII." That is wrong, because Macie reports and Lake Formation enforces.

Managed Data Identifiers

Macie ships with 150+ managed data identifiers: pre-built detection rules for common PII patterns (credit cards via Luhn check, SSNs, IBANs, AWS access keys, IP addresses, names, addresses by country). The detection logic of a managed identifier is fixed; you choose which managed identifiers a job includes or excludes, and you use allow lists to reduce false positives.
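The Luhn check mentioned above is a public checksum, so it can be illustrated directly. The following is a minimal sketch of the checksum only; it is not Macie's implementation, and the function name is made up for this example.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Ignore spaces and dashes so "4111 1111 1111 1111" is accepted as input.
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0
```

A random 16-digit string passes this check only about one time in ten, which is why a checksum step cuts false positives compared with a bare digit-count regex.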

Custom Data Identifiers

Custom data identifiers let you define regex patterns plus context keywords for organization-specific PII — employee IDs, internal product codes, custom medical record numbers. The pattern: regex matches a candidate string, the keyword check requires a contextual word ("employee", "id", "ssn") within a configured proximity.
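The regex-plus-keyword idea can be sketched in plain Python. This is a simplified model of the matching logic described above, not Macie's actual algorithm; the function name, the case-insensitive keyword check, and the symmetric 50-character window are assumptions made for illustration.

```python
import re


def find_custom_ids(text, pattern, keywords, max_distance=50):
    """Return regex matches that have any keyword within `max_distance` characters."""
    lowered = text.lower()
    hits = []
    for match in re.finditer(pattern, text):
        # Look at a window of text on both sides of the candidate match.
        window_start = max(0, match.start() - max_distance)
        window_end = min(len(text), match.end() + max_distance)
        window = lowered[window_start:window_end]
        # Keep the candidate only if a context keyword appears nearby.
        if any(keyword.lower() in window for keyword in keywords):
            hits.append(match.group(0))
    return hits
```

For example, calling `find_custom_ids(document_text, r"EMP\d{6}", ["employee"])` keeps an ID that sits next to the word "employee" and drops an identical-looking string that appears far from any keyword.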

Allow Lists And Suppression Rules

Allow lists exclude known-safe values (test data with fake SSNs, sample credit cards in documentation). Suppression rules archive findings that match a pattern without alerting, useful for known false positives in development buckets.

Macie In The Pipeline Context

The canonical pattern: enable Macie on staging S3 buckets where new data lands, configure scheduled discovery jobs (daily or weekly), surface findings in Security Hub, and route high-severity findings to a remediation Lambda that quarantines the bucket prefix or alerts the data owner. Macie is the gate between "data has arrived" and "data is approved for the production lake."
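The routing step of that pattern might look like the sketch below. It is only the decision logic a remediation Lambda could run on a finding delivered through EventBridge; the field names (`detail.severity.description`, `detail.resourcesAffected.s3Bucket.name`, `detail.resourcesAffected.s3Object.key`) reflect the Macie finding shape as understood here and should be verified against a real event, and the action names are hypothetical.

```python
def plan_remediation(event, quarantine_severities=("High",)):
    """Decide what to do with one Macie finding event (no AWS calls are made)."""
    detail = event.get("detail", {})
    # Macie reports severity as a label (Low / Medium / High) plus a score.
    severity = detail.get("severity", {}).get("description", "Low")
    affected = detail.get("resourcesAffected", {})
    bucket = affected.get("s3Bucket", {}).get("name")
    key = affected.get("s3Object", {}).get("key")
    # Hypothetical actions: a real function would move the object or notify via SNS.
    action = "quarantine" if severity in quarantine_severities else "notify_owner"
    return {"action": action, "bucket": bucket, "key": key, "severity": severity}
```

Keeping the decision separate from the side effects makes the policy easy to unit test before any quarantine code touches a bucket.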

Cost Model

Macie bills per GB of data scanned for sensitive data discovery jobs and per S3 bucket evaluated for the bucket inventory feature. The DEA-C01 exam plants cost as a constraint — Macie scanning a multi-petabyte data lake daily is expensive; the right pattern is targeted scanning of staging buckets, not full-lake scanning.
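The cost argument is simple arithmetic, sketched here under stated assumptions. The `price_per_gb` value is a caller-supplied placeholder, not a quoted AWS rate, and `gb_scanned_per_run` means whatever volume a job run actually analyzes.

```python
def macie_scan_cost(gb_scanned_per_run, runs_per_month, price_per_gb):
    """Estimate monthly discovery-job cost: volume per run x runs x unit price."""
    return gb_scanned_per_run * runs_per_month * price_per_gb


def staging_vs_full_lake(staging_gb, lake_gb, runs_per_month, price_per_gb):
    """Return (staging_cost, full_lake_cost) for the same schedule and rate."""
    return (
        macie_scan_cost(staging_gb, runs_per_month, price_per_gb),
        macie_scan_cost(lake_gb, runs_per_month, price_per_gb),
    )
```

Whatever the real rate is, the ratio between the two costs equals the ratio of the data volumes, which is why scanning a small staging bucket instead of a multi-petabyte lake changes the bill by orders of magnitude.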

Amazon Macie discovers and reports sensitive data in S3 using ML and managed data identifiers — it does NOT mask, redact, or block access to the data; remediation is the engineer's responsibility via Lake Formation, KMS, S3 Object Lambda, or manual handling. Macie's role in the pipeline is the gate: scan staging buckets to identify PII before promotion to the production data lake, surface findings to Security Hub for SOC review, and emit EventBridge events that trigger remediation workflows. The DEA-C01 exam plants Macie as the right answer for "automatically discover PII across thousands of S3 buckets" — never pick Macie for "prevent analysts from seeing credit card numbers" (that is Lake Formation column-level grants) or for "mask SSN values in query results" (that is Redshift dynamic data masking or Lake Formation data filters). Macie answers what and where; other services answer how to control.

Amazon DataZone — Business Data Catalog

Amazon DataZone is the AWS service for cross-team data discovery, publishing, and subscription.

What DataZone Provides

DataZone is a business-facing catalog where data producers publish datasets as discoverable assets, and data consumers (analysts, ML engineers, business stakeholders) browse, search, and request access. Each asset has business metadata (description, owner, classification, data quality score, columns documented), lineage graphs, and a subscription workflow.

Domains, Projects, And Assets

DataZone organizes around three concepts. Domains are the top-level governance scope (typically one per organization or business unit). Projects are collaboration spaces within a domain where teams work on data products. Assets are the published datasets (Glue tables, Redshift tables, custom assets) attached to projects.

Subscription Workflow

A consumer browses the catalog, finds an asset, and requests subscription. The asset owner approves or rejects the request. Upon approval, DataZone provisions the underlying access (Lake Formation grants, IAM role updates) automatically without the consumer or producer manually configuring permissions.

DataZone vs Glue Data Catalog

The Glue Data Catalog is the technical metadata store — schemas, partition layouts, file formats, connection details. DataZone is the business catalog — descriptions, owners, business glossary, subscription workflows, lineage. The two complement each other: Glue catalogs technical assets, DataZone publishes them to business consumers. The DEA-C01 trap: confusing the two — DataZone is not a replacement for Glue, and Glue is not visible to non-engineers.

Built-In ML Recommendations

DataZone uses ML to suggest related assets, recommend asset descriptions, and auto-classify columns. Useful for large catalogs where manual curation is impractical.

Lineage Visualization

DataZone tracks lineage from source to consumer — which Glue ETL job produced this Redshift table, which dashboard uses it, which user subscribed. The lineage graph helps consumers understand data provenance and helps producers see downstream impact of schema changes.

When DataZone Is Right

When the organization has a data mesh architecture with multiple producing teams and many consuming teams, when business stakeholders need to discover data without learning Glue or IAM, and when the access-request workflow benefits from a self-service catalog. DEA-C01 plants DataZone as the answer to "data mesh with self-service discovery."

When DataZone Is Wrong

For small teams with one or two producers and a handful of consumers, the Glue Data Catalog plus Lake Formation is sufficient and DataZone adds operational overhead. The exam plants DataZone as wrong-answer in scenarios about pure technical metadata or single-team data lakes.

CloudTrail — Management Events vs Data Events

CloudTrail is the AWS audit log for API calls, and the management-vs-data distinction is the DEA-C01 staple.

Management Events

Management events log control-plane API calls such as s3:CreateBucket, iam:CreateRole, glue:CreateDatabase, and redshift:CreateCluster. They are recorded by default in the 90-day Event history of every AWS account, and the first copy of management events delivered to a trail is free. Use case: detecting unauthorized resource creation, tracking infrastructure changes for compliance.

Data Events

Data events log data-plane API calls such as s3:GetObject, s3:PutObject, lambda:Invoke, and dynamodb:GetItem. They are NOT enabled by default; you must turn them on explicitly in a trail or event data store, either for all resources of a type or for specific ones (a given S3 bucket, a given Lambda function). Bills per million events captured. Use case: object-level access auditing, regulated data tracking, forensic investigation of who read which file.

Why The Distinction Matters

For S3 data lake compliance, management events tell you "the bucket was created" but say nothing about who read which object. Data events log every GetObject and PutObject, providing the audit trail GDPR and HIPAA require. The DEA-C01 trap: candidates who think CloudTrail's default tracking covers S3 reads — wrong, that requires explicit data event configuration.

Selectors For Cost Control

Data events at full volume on a busy S3 bucket can generate billions of records per day. CloudTrail event selectors (basic and advanced) let you narrow the captured events — only specific buckets, only specific prefixes, only specific principals — to control cost while maintaining audit coverage of the regulated data.
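A scoped selector can be sketched as a plain data structure. The helper below builds one advanced event selector limited to S3 object events under a bucket prefix; the function name, bucket, and prefix are hypothetical, while the field names (`eventCategory`, `resources.type`, `resources.ARN`, `readOnly`) are the ones CloudTrail advanced event selectors use.

```python
def s3_data_event_selector(bucket, prefix="", read_only=None):
    """Build one advanced event selector for S3 data events under a prefix."""
    fields = [
        {"Field": "eventCategory", "Equals": ["Data"]},
        {"Field": "resources.type", "Equals": ["AWS::S3::Object"]},
        # Object ARNs look like arn:aws:s3:::bucket/key, so a StartsWith
        # match on bucket/prefix limits capture to that part of the bucket.
        {"Field": "resources.ARN", "StartsWith": [f"arn:aws:s3:::{bucket}/{prefix}"]},
    ]
    if read_only is not None:
        # Optionally keep only reads (True) or only writes (False).
        fields.append({"Field": "readOnly", "Equals": ["true" if read_only else "false"]})
    return {"Name": f"S3 data events for {bucket}/{prefix}", "FieldSelectors": fields}
```

A list of such dictionaries is what the `AdvancedEventSelectors` parameter of CloudTrail's PutEventSelectors API (`put_event_selectors` in boto3) expects; the call itself is left out so the sketch runs without AWS credentials.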

CloudTrail Lake — SQL-Queryable Audit Store

CloudTrail Lake is a managed event data store that stores CloudTrail events for up to 7 years and lets you query them with SQL. Use case: ad hoc forensic queries ("who accessed bucket X between dates Y and Z"), long-term retention for compliance, audit reporting. It is a managed alternative to the older pattern of shipping CloudTrail logs to S3 and querying them with Athena.
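The "who accessed bucket X between dates Y and Z" question could be expressed roughly as follows. This is a sketch that only assembles a query string; it assumes the event data store ID is used as the table name and that `requestParameters` is queried with `element_at`, as in CloudTrail Lake's SQL dialect, and the helper name and input checks are made up for the example.

```python
import re

TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def bucket_access_query(event_data_store_id, bucket, start, end):
    """Build a CloudTrail Lake SQL string listing S3 calls against one bucket."""
    # Validate inputs because they are interpolated into the SQL text.
    if not re.fullmatch(r"[0-9a-fA-F-]{36}", event_data_store_id):
        raise ValueError("event data store ID should look like a UUID")
    if not re.fullmatch(r"[a-z0-9.-]{3,63}", bucket):
        raise ValueError("unexpected characters in bucket name")
    for value in (start, end):
        if not TIMESTAMP.fullmatch(value):
            raise ValueError("timestamps must be 'YYYY-MM-DD HH:MM:SS'")
    return (
        "SELECT eventTime, eventName, userIdentity.arn AS principal, sourceIPAddress "
        f"FROM {event_data_store_id} "
        "WHERE eventSource = 's3.amazonaws.com' "
        f"AND element_at(requestParameters, 'bucketName') = '{bucket}' "
        f"AND eventTime > '{start}' AND eventTime < '{end}' "
        "ORDER BY eventTime"
    )
```

The query returns object-level rows only if data events for that bucket were being captured into the event data store during the requested window.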

CloudTrail data events are NOT enabled by default and are required for S3 object-level access auditing, Lambda invocation logging, and DynamoDB item-level access tracking — without them, CloudTrail records bucket creation but not who read which file. Management events (control-plane API calls) are free and on by default; data events (data-plane API calls) cost per-million-events and require explicit configuration per resource. The DEA-C01 exam plants this as the canonical CloudTrail trap: a scenario asks for "object-level audit log of who accessed sensitive S3 data," and candidates who pick "default CloudTrail" are wrong because default CloudTrail does not capture S3 GetObject. Right answer: configure CloudTrail data events for the specific S3 bucket (or S3 path prefix) containing regulated data, accepting the per-event cost. Use event selectors to narrow scope and control cost; ship to CloudTrail Lake for SQL-queryable retention.

S3 Access Logs vs CloudTrail S3 Data Events

Two AWS-native ways to log S3 access. The DEA-C01 exam tests when each applies.

S3 Server Access Logs

The older feature: each request to the source bucket is written as a space-delimited text record to log objects in a target bucket. Captures bucket-level access (every request to the source bucket) including requester identity, action, source IP, and response code. Free (you pay only for the storage of the log files).
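Because the records are plain text, reading them takes a little parsing. The sketch below extracts the leading fields of a server access log record (bucket owner, bucket, time, remote IP, requester, request ID, operation, key, request URI, status, error code) and ignores the rest; the regex is an approximation of the documented field order, not an official parser.

```python
import re

# Leading fields of an S3 server access log record, in documented order.
LOG_PATTERN = re.compile(
    r'(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] (?P<remote_ip>\S+) '
    r'(?P<requester>\S+) (?P<request_id>\S+) (?P<operation>\S+) (?P<key>\S+) '
    r'(?:"(?P<request_uri>[^"]*)"|-) (?P<status>\S+) (?P<error_code>\S+)'
)


def parse_access_log_line(line):
    """Return a dict of the leading fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None
```

Filtering the parsed records on `operation` (for example `REST.GET.OBJECT`) and `requester` gives a rough "who read what" view without any CloudTrail cost.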

CloudTrail S3 Data Events

The newer feature: each S3 data API call is logged as a CloudTrail event with full IAM context, integrating with Security Hub, EventBridge, CloudTrail Lake, and the broader audit infrastructure. Bills per million events.

When To Use Which

S3 access logs are simpler and free, useful for high-volume buckets where the per-event CloudTrail cost is prohibitive. CloudTrail data events integrate with the AWS audit ecosystem and are required for compliance frameworks that mandate "audit logs in CloudTrail." For regulated data lakes, CloudTrail data events are the right answer despite the cost; for non-regulated high-volume buckets, S3 access logs are sufficient.

Combining Both

Production patterns often run both — S3 access logs for cheap full-bucket visibility, CloudTrail data events on specific regulated prefixes for the audit-grade trail. The DEA-C01 exam may plant this as a multi-pattern scenario.

Redshift Audit Logging And Dynamic Data Masking

Redshift has two distinct security observability features that the DEA-C01 exam tests.

Redshift Audit Logging

Redshift audit logging captures connection attempts, user activity, and queries executed on the cluster. Logs go to S3 or CloudWatch Logs. Use case: tracking who ran which query against the data warehouse, useful for both security and performance investigation.

Dynamic Data Masking (DDM)

Redshift Dynamic Data Masking applies role-based masking to column values at query time without modifying the underlying data. A CREATE MASKING POLICY defines the mask logic (return null, return SHA-256 hash, return last 4 chars of SSN, etc.); the policy is ATTACHed to columns and applied based on the querying role.

How DDM Differs From Lake Formation Filters

Lake Formation column-level grants and data filters apply to S3-backed Glue tables queried via Athena, Redshift Spectrum, or EMR. Redshift DDM applies to native Redshift tables (RA3 managed storage). The two are complementary — Lake Formation governs the data lake, DDM governs the warehouse.

DDM For Compliance

A common pattern: customer service reps see the last 4 digits of credit card numbers via DDM, while finance team sees full numbers. Same underlying data, different masking based on the role, no separate copies of the table. The DEA-C01 exam plants DDM as the right answer for "role-based view of sensitive Redshift columns."
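The role-based reveal can be modeled in a few lines of plain Python. This is a conceptual model of what a masking policy does at query time, not Redshift SQL and not how Redshift implements it; the role names and the asterisk format are illustrative assumptions.

```python
def mask_card_number(card_number, role, full_access_roles=frozenset({"finance"})):
    """Return the full value for privileged roles, otherwise only the last 4 digits."""
    if role in full_access_roles:
        return card_number
    digits = [ch for ch in card_number if ch.isdigit()]
    # Replace everything except the last four digits with asterisks.
    return "*" * max(len(digits) - 4, 0) + "".join(digits[-4:])
```

The stored value never changes; only the value returned to the caller depends on the role, which is the property that removes the need for separate masked copies of the table.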

Data Masking Strategies — Lake Formation, Redshift DDM, S3 Object Lambda

Three AWS approaches to masking sensitive data, each at a different layer.

Lake Formation Cell-Level Security

Combine column-level grants and row-level data filters in Lake Formation to deliver cell-level security on Glue tables. Note that restricted columns and rows are filtered out of the result rather than rewritten into masked values. Used by Athena, Redshift Spectrum, EMR, and other engines that query the Glue catalog. Right answer for: data lake masking scenarios on S3-backed tables.

Redshift Dynamic Data Masking

Role-based masking applied at query time to Redshift native tables. Right answer for: warehouse-resident data with role-based reveal patterns.

S3 Object Lambda

S3 Object Lambda intercepts S3 GetObject calls and runs a Lambda function that transforms the data before returning. Use case: redact PII columns from CSV files on the fly, apply custom masking logic per requesting role. More flexible but higher latency than Lake Formation or DDM. Right answer for: legacy applications reading raw S3 objects that need masking without changing the application.
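The transformation at the heart of such a function might look like the sketch below, which redacts named columns in CSV text. Only the pure transformation is shown; in a real Object Lambda function the handler would fetch the original object using the presigned URL in the event's `getObjectContext` and return the result with the `WriteGetObjectResponse` API, and that wiring is omitted here. The function name and placeholder value are hypothetical.

```python
import csv
import io


def redact_csv_columns(csv_text, sensitive_columns, placeholder="REDACTED"):
    """Return the CSV text with the given columns replaced by a placeholder."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for column in sensitive_columns:
            if column in row:
                row[column] = placeholder  # overwrite the sensitive value
        writer.writerow(row)
    return output.getvalue()
```

Because the whole object is read and rewritten on every request, this approach adds latency that grows with object size, which matches the trade-off described above.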

Choosing The Layer

Mask at the data lake layer (Lake Formation) when the consumer is Athena/Spectrum/EMR. Mask at the warehouse layer (Redshift DDM) when the consumer is Redshift native queries. Mask at the application layer (S3 Object Lambda) when the consumer is a legacy system reading raw objects. The DEA-C01 exam plants masking choice based on the consumer in the scenario.

Amazon DataZone is a business data catalog with subscription workflows and lineage. It is NOT an access control mechanism, encryption tool, or technical metadata store, and the DEA-C01 exam targets candidates who confuse DataZone with Lake Formation or the Glue Data Catalog. DataZone publishes datasets as business-discoverable assets, processes subscription requests, and surfaces lineage; it depends on Lake Formation for the actual access enforcement and on the Glue Data Catalog for technical metadata. The exam trap describes a scenario like "the team needs to control which columns analysts can see and to discover all assets in one catalog." One wrong answer is "DataZone enforces column-level grants" (that is Lake Formation); another is "DataZone replaces Glue" (it does not). Right answer: DataZone as the discovery and subscription layer on top of Lake Formation governance and Glue catalog metadata. Three layers, three jobs, none replaceable by another.

GDPR, HIPAA, And AWS Compliance Tools

The DEA-C01 exam may surface compliance scenarios in a high-level way.

GDPR — Data Subject Rights

GDPR grants EU residents rights including access (a copy of their data), rectification (correction), erasure (right to be forgotten), and portability (export in machine-readable format). AWS tools: Macie discovers personal data, Lake Formation enforces access, S3 Object Lock prevents tampering, KMS encrypts at rest, CloudTrail audits access.

HIPAA — PHI Handling

HIPAA-covered entities must protect Protected Health Information (PHI). The AWS Business Associate Addendum (BAA) covers HIPAA-eligible services. Tools: Macie to discover PHI, Lake Formation column-level grants to withhold identifying columns (supporting de-identification), KMS for encryption, CloudTrail data events for access logs, and a dedicated VPC for isolation.

AWS Compliance Programs

AWS Artifact provides downloadable compliance reports (SOC, ISO, PCI DSS, HIPAA BAA). AWS Config tracks configuration changes for compliance evidence. AWS Audit Manager automates evidence collection for audits. The DEA-C01 exam plants these as "where do I find compliance reports" type questions.

Data Residency

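Some regulations require data to remain within specific regions. AWS Regions are isolated from one another, and data leaves a Region only when you configure it to. Features such as S3 cross-region replication, KMS multi-region keys, and DynamoDB global tables copy data or key material across Regions, so for residency-bound data either leave them off or restrict their destinations to approved Regions. Tag data with region metadata and use IAM aws:RequestedRegion conditions to enforce.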
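An aws:RequestedRegion guardrail can be sketched as a policy document built in Python. The helper name, statement ID, and default action list are assumptions for illustration; a production policy usually also needs exemptions for global services, which this sketch does not attempt.

```python
import json


def region_restriction_policy(allowed_regions, actions=("s3:*", "glue:*", "redshift:*")):
    """Build a policy that denies the listed actions outside the allowed Regions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideAllowedRegions",
                "Effect": "Deny",
                "Action": list(actions),
                "Resource": "*",
                # Deny when the request targets a Region not in the allowed list.
                "Condition": {
                    "StringNotEquals": {"aws:RequestedRegion": sorted(allowed_regions)}
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(region_restriction_policy({"eu-central-1", "eu-west-1"}), indent=2))
```

An explicit Deny is used because it overrides any Allow elsewhere, so the guardrail holds even if a broader permission is granted later.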

Plain-Language Explanation: Macie, DataZone, And Data Governance

Three concrete analogies.

Analogy 1 — The Hospital With Privacy Officer, Records Department, And Audit Log

Macie is the hospital's automated PII scanner — it walks through every paper file in every cabinet and flags documents containing patient names, medical record numbers, or insurance IDs, producing a report of "where is the sensitive stuff." It does not lock cabinets or shred documents — it just finds and reports. Lake Formation is the lock-and-key system that the records department uses based on Macie's findings. DataZone is the hospital's internal data catalog that lets researchers, billing, and pharmacy teams browse "what data exists, who owns it, what is it about, can I subscribe?" — a polished business interface on top of the messy file room. CloudTrail management events log every time a new cabinet is installed; CloudTrail data events log every time a specific file is opened. S3 access logs are the entry-and-exit logbook at the records-department door. Redshift Dynamic Data Masking is the ER nurse's view that shows the last four digits of the patient's SSN while the billing team sees the full number — same patient, role-based reveal. The DEA-C01 trap is asking "Macie to lock the cabinets" (it cannot) or "DataZone to detect PII" (it is a catalog, not a scanner) — each tool has its job.

Analogy 2 — The Library With Index, Catalog, And Borrowing Records

Macie is the library's PII auditor: a robot that wanders the stacks reading book contents and flagging any book containing personal information that needs special handling. It does not move the books or restrict access — it just produces a report. The librarians (Lake Formation grants, KMS keys) act on the report. DataZone is the library's public-facing catalog: patrons browse subjects, see book descriptions and authors, see which librarian curates each section, and request a hold via the catalog UI which routes to the librarian for approval. The Glue Data Catalog is the back-of-house bibliographic record system used by librarians (technical metadata) — patrons never see it. CloudTrail management events log every time a new shelf is built or a librarian is hired; data events log every book check-out and return. S3 access logs are the turnstile counter at the library entrance. Redshift dynamic data masking is the reading-room rule that lets researchers see redacted documents while archivists with clearance see the unredacted versions of the same documents — same artifacts, role-based redaction. The DEA-C01 trap is sending the PII auditor (Macie) to design the public catalog (DataZone) — different jobs, different tools.

Analogy 3 — The Customs Border With Sniffer Dogs, Public Directory, And Surveillance Footage

Macie is the sniffer-dog team at the customs border: walking the imported cargo, finding restricted items, producing a report. It does not seize or redirect cargo — just identifies. The customs officers (Lake Formation, KMS) decide what to do with the findings. DataZone is the customs broker's directory: importers list what they have legally cleared, exporters browse, ask questions, and place orders, all through a polished interface that hides the back-office paperwork. CloudTrail management events are the master log of when the customs facility itself was built or modified; data events are the per-package log of every import processed. S3 access logs are the gate camera footage of every truck that entered the warehouse. Redshift dynamic data masking is the shipping manifest that shows redacted recipient addresses to the warehouse worker but full addresses to the driver. GDPR/HIPAA are the international trade laws that the entire system must comply with — Macie discovers what is in scope, Lake Formation controls access, CloudTrail proves compliance after the fact.

Common Exam Traps For Macie, DataZone, And Data Governance

Memorize all five.

Trap 1 — Macie Blocks PII Access

A scenario suggests using Macie to "prevent analysts from seeing credit card numbers." Wrong — Macie discovers and reports, never enforces. Right answer: Lake Formation column-level grants or Redshift DDM.

Trap 2 — DataZone Replaces Glue Data Catalog

A scenario suggests DataZone as the technical metadata store. Wrong — DataZone is the business catalog layered on top of Glue. Right answer: keep Glue for technical metadata, use DataZone for business discovery.

Trap 3 — Default CloudTrail Captures S3 Reads

A scenario asks for "audit log of who read sensitive S3 objects" and the answer choices include "default CloudTrail." Wrong — default CloudTrail captures only management events. Right answer: configure CloudTrail data events on the specific S3 bucket or prefix.

Trap 4 — S3 Access Logs For Compliance

A scenario asks for "GDPR-compliant audit trail" and includes S3 server access logs as the answer. Wrong-leaning — S3 access logs are simpler but lack IAM context and AWS-audit-ecosystem integration. Right answer for regulated data: CloudTrail data events with CloudTrail Lake retention.

Trap 5 — Macie Scans The Whole Data Lake Daily

A scenario describes Macie scanning a 5 PB data lake daily. Wrong — too expensive. Right answer: scan staging buckets where new data lands, schedule full-lake scans monthly or quarterly, use targeted jobs on suspect prefixes.

Key Numbers And Must-Memorize Facts

Amazon Macie

  • 150+ managed data identifiers (credit card, SSN, IBAN, AWS keys, names, addresses)
  • Custom data identifiers via regex + context keywords
  • Allow lists and suppression rules for false positives
  • Integrates with Security Hub and EventBridge
  • Bills per GB scanned and per bucket evaluated
  • Discovery only — does not mask, encrypt, or block

Amazon DataZone

  • Three concepts: Domain, Project, Asset
  • Subscription workflow with approval routing
  • ML-driven recommendations for descriptions and related assets
  • Lineage visualization across producers and consumers
  • Integrates with Glue Data Catalog (technical) and Lake Formation (access)
  • Per-user and per-asset pricing

CloudTrail

  • Management events: free, on by default, control-plane API calls
  • Data events: not on by default, per-million-events cost, data-plane API calls
  • Event selectors narrow scope for cost control
  • CloudTrail Lake: SQL-queryable, up to 7-year retention
  • Insights events for anomaly detection (paid feature)

Redshift Dynamic Data Masking

  • CREATE MASKING POLICY defines transformation
  • ATTACH applies to columns
  • Role-based reveal at query time
  • Native Redshift tables only (not Spectrum external tables)

S3 Access Logging

  • Server access logs: free, simpler, written to a target S3 bucket
  • CloudTrail data events: paid, integrates with Security Hub and EventBridge
  • Both can run simultaneously for defense in depth

Use Macie's managed data identifiers as the first scan, then add custom data identifiers for organization-specific PII patterns — this combination covers 90+ percent of real-world sensitive data. The 150+ managed identifiers cover credit cards, SSNs, IBANs, AWS access keys, IPv4/IPv6 addresses, names by country, mailing addresses, phone numbers, and many other categories with built-in false-positive reduction (Luhn check on credit cards, format validation on SSNs). Custom identifiers extend the coverage with regex patterns plus required context keywords — a regex matching EMP\d{6} plus the keyword "employee" within 50 characters defines an organization's employee ID. The DEA-C01 exam plants this pattern as the right answer for "how to detect both standard PII and custom organizational identifiers" — never pick "write a Lambda to scan files" or "use Glue Data Quality regex rules" when Macie's identifier framework is purpose-built for this. After detection, route findings to Security Hub or remediation Lambdas; remember Macie reports, it does not act.

Memorize the CloudTrail event types: Management Events (free, on-by-default, control-plane like CreateBucket), Data Events (paid, off-by-default, data-plane like GetObject), and Insights Events (paid, anomaly detection on management API calls). For DEA-C01 audit scenarios: regulated S3 data lake => enable Data Events on the specific buckets; infrastructure change tracking => Management Events sufficient (already on); detecting unusual API call patterns (potential compromise) => enable Insights Events. Cost optimization: use event selectors to narrow Data Events to specific prefixes and principals; ship Management Events to CloudTrail Lake for SQL-queryable retention; archive older events to S3 with lifecycle to Glacier for long-term compliance retention. The exam tests data-events-vs-management-events as a recurring trap — always ask "data plane or control plane?" when CloudTrail comes up.

DEA-C01 exam priority — Macie, DataZone, and Data Governance. This topic carries weight on the DEA-C01 exam. Master the trade-offs, decision boundaries, and the cost/performance triggers each AWS service exposes — the exam will test scenarios that hinge on knowing which service is the wrong answer, not just which is right.

FAQ — Macie, DataZone, And Data Governance Top Questions

Q1 — When should I use Amazon Macie versus AWS Glue Data Quality?

Use Macie to discover sensitive data (PII, PHI, credentials, API keys) in S3 — it answers "what sensitive data exists and where?" Use Glue Data Quality to enforce data quality rules (completeness, uniqueness, value ranges) on Glue ETL pipelines — it answers "does this data meet our quality contract?" The two tools serve different purposes and frequently complement each other in production: Macie identifies PII columns that need masking, Glue Data Quality validates that the masking transformations applied correctly. The DEA-C01 exam plants Macie as the right answer for "discover PII across hundreds of S3 buckets" and Glue Data Quality as the right answer for "enforce schema and value rules in the ETL pipeline." Never pick Macie for general data quality (it is PII-focused) or Glue Data Quality for PII discovery (it is rule-based, not ML-driven).

Q2 — How does Amazon DataZone differ from the AWS Glue Data Catalog?

The Glue Data Catalog is the technical metadata store — schemas, partition layouts, file formats, connection details — used by Glue ETL jobs, Athena, Redshift Spectrum, and EMR for query planning. DataZone is the business-facing catalog layered on top — datasets are published as discoverable assets with descriptions, owners, business glossary terms, lineage graphs, and a subscription workflow that handles access requests. Glue is for engineers; DataZone is for business consumers and analysts. The two complement each other — Glue catalogs technical assets, DataZone publishes them in a polished interface with self-service workflows. Use Glue alone for small teams with one or two producers; use Glue plus DataZone for data mesh architectures with many producing and consuming teams. The DEA-C01 exam plants this distinction directly with scenarios about "self-service business data discovery" (DataZone) versus "technical metadata for ETL pipelines" (Glue).

Q3 — How do I configure CloudTrail to log S3 object-level access for compliance?

Enable CloudTrail data events for the specific S3 buckets containing regulated data. In a CloudTrail trail, configure event selectors that include S3 data events (AWS::S3::Object) and specify the bucket ARNs (or prefixes within buckets) to capture. The trail then logs every GetObject, PutObject, and DeleteObject call with full IAM context: requesting principal, source IP, user agent, response code. Default CloudTrail (management events only) does NOT capture S3 reads, and this is the most-cited DEA-C01 audit trap. Cost: data events bill per million events, so use event selectors to narrow scope to regulated prefixes only. Ship to CloudTrail Lake for SQL-queryable retention up to 7 years, which supports the access-audit evidence that GDPR and HIPAA reviews typically ask for.

Q4 — What is the difference between S3 server access logs and CloudTrail S3 data events?

S3 server access logs are an older S3-native feature that writes a delimited text record to a target bucket for every S3 request — bucket-level visibility, simple, free (you pay only for storage). CloudTrail S3 data events are AWS-audit-ecosystem integrated, capturing requests with full IAM context, integrating with Security Hub, EventBridge, and CloudTrail Lake, billing per million events. For regulated data lakes that must satisfy GDPR, HIPAA, or SOX, CloudTrail data events are the right answer despite the cost — auditors expect CloudTrail integration. For non-regulated buckets where simple visibility suffices, S3 access logs are free and adequate. Production patterns often run both — S3 access logs for cheap baseline visibility on all buckets, CloudTrail data events on specific regulated prefixes for audit-grade integration.

Q5 — Can Amazon Macie automatically remediate findings by masking or deleting PII?

No. Macie discovers and reports; it does NOT mask, redact, encrypt, or delete data. To act on Macie findings, you build remediation around them: EventBridge rules trigger Lambda functions that quarantine bucket prefixes, send notifications to data owners via SNS, open tickets in a ticketing system, or invoke Step Functions workflows that orchestrate cleanup. Common remediation patterns: quarantine the offending S3 prefix by copying objects to an isolated bucket and removing them from the production path; apply Lake Formation column-level grants to hide the PII columns from analysts; encrypt the data with a customer-managed KMS key with restricted access. The DEA-C01 exam plants Macie's report-only role as the most common Macie trap, so always pair Macie with a downstream enforcement layer.
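The EventBridge-to-Lambda pattern above might look like the following sketch. It assumes the finding event carries the affected bucket and object key under `detail.resourcesAffected`, and it only plans the quarantine move rather than performing it. The quarantine bucket name and the function names are hypothetical.

```python
from typing import Any, Dict, Optional

# EventBridge rule pattern: Macie findings whose type starts with "SensitiveData:".
MACIE_FINDING_PATTERN = {
    "source": ["aws.macie"],
    "detail-type": ["Macie Finding"],
    "detail": {"type": [{"prefix": "SensitiveData:"}]},
}

QUARANTINE_BUCKET = "example-quarantine-bucket"  # hypothetical isolated bucket


def plan_quarantine(event: Dict[str, Any]) -> Optional[Dict[str, str]]:
    """Return a copy plan for the object named in a finding, or None to ignore it."""
    detail = event.get("detail", {})
    if not detail.get("type", "").startswith("SensitiveData:"):
        return None  # policy findings and unrelated events are skipped
    affected = detail.get("resourcesAffected", {})
    bucket = affected.get("s3Bucket", {}).get("name")
    key = affected.get("s3Object", {}).get("key")
    if not bucket or not key:
        return None
    return {
        "source_bucket": bucket,
        "source_key": key,
        "target_bucket": QUARANTINE_BUCKET,
        # Keep the original bucket in the key so the object can be traced back.
        "target_key": f"{bucket}/{key}",
    }


def handler(event: Dict[str, Any], context: Any) -> Dict[str, str]:
    """Lambda entry point: decide what to do with one Macie finding."""
    plan = plan_quarantine(event)
    if plan is None:
        return {"action": "ignored"}
    # A real function would now copy the object to the target location and
    # delete the source with boto3, then notify the data owner.
    return {"action": "quarantine", **plan}
```

Separating the planning step from the S3 calls keeps the decision logic testable without AWS access; the enforcement itself still has to be built and granted its own IAM permissions.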

Q6 — When should I use Redshift Dynamic Data Masking versus Lake Formation cell-level security?

Use Redshift Dynamic Data Masking when the masking applies to native Redshift tables (RA3 managed storage) and the consumer is Redshift SQL: CREATE MASKING POLICY defines the masking transformation, ATTACH MASKING POLICY binds it to a column for a user or role, and the mask is applied at query time with no separate copies of the table. Use Lake Formation cell-level security (column-level grants plus row-level data filters) when the restriction applies to S3-backed Glue tables queried by Athena, Redshift Spectrum, EMR, or Glue. Both deliver role-based fine-grained access; the choice is determined by where the data lives. For hybrid architectures with both Redshift native tables and Glue-cataloged S3 data, use both: DDM for warehouse-resident sensitive columns, Lake Formation for data lake sensitive columns. The DEA-C01 exam plants this with scenario detail about which engine queries the data.
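As a rough sketch of the two mechanisms side by side, the helpers below generate the Redshift DDM statements and the Lake Formation data cells filter payload. Table, column, role, database, and account values are hypothetical, the full-mask literal is only one possible masking expression, and identifiers are assumed to come from trusted configuration, not user input.

```python
from typing import Any, Dict, List


def ddm_statements(table: str, column: str, role: str) -> List[str]:
    """Redshift SQL that fully masks one VARCHAR column for one role."""
    policy = f"mask_{column}"
    return [
        f"CREATE MASKING POLICY {policy} "
        f"WITH ({column} VARCHAR(256)) USING ('XXXX'::TEXT);",
        f"ATTACH MASKING POLICY {policy} ON {table}({column}) "
        f"TO ROLE {role} PRIORITY 10;",
    ]


def lake_formation_cell_filter(
    catalog_id: str,
    database: str,
    table: str,
    excluded_columns: List[str],
    row_filter: str,
) -> Dict[str, Any]:
    """TableData payload for Lake Formation's CreateDataCellsFilter API."""
    return {
        "TableCatalogId": catalog_id,
        "DatabaseName": database,
        "TableName": table,
        "Name": f"{table}_pii_filter",
        "RowFilter": {"FilterExpression": row_filter},
        "ColumnWildcard": {"ExcludedColumnNames": excluded_columns},
    }


# Warehouse-resident column: statements to run in Redshift.
redshift_sql = ddm_statements("customers", "ssn", "analyst_role")

# Data lake table: payload for
# boto3.client("lakeformation").create_data_cells_filter(TableData=payload)
payload = lake_formation_cell_filter(
    "111122223333", "sales_lake", "customers", ["ssn"], "country = 'DE'"
)
```

The contrast is the design point: DDM rewrites the column value at query time inside Redshift, while the Lake Formation filter removes columns and rows from what the principal is allowed to read through integrated engines.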

Q7 — How do I demonstrate GDPR compliance for a data lake on AWS?

GDPR compliance for a data lake involves multiple layers: (1) Discovery — Macie scans S3 buckets to identify personal data, producing an inventory of where EU resident data lives. (2) Access control — Lake Formation column-level grants restrict who can see personal data; IAM policies and KMS key policies enforce least-privilege access. (3) Encryption — KMS customer-managed keys encrypt data at rest, TLS enforces encryption in transit. (4) Audit — CloudTrail data events log every access to regulated S3 buckets, retained in CloudTrail Lake for 7 years. (5) Right to erasure — design data models that allow surgical deletion of an individual's records (consider Iceberg tables for ACID delete operations on data lake S3). (6) Data subject access requests — design pipelines that can extract one individual's data on demand for portability. The DEA-C01 exam plants GDPR as a multi-tool architectural answer; no single AWS service provides GDPR compliance, but the combination of Macie, Lake Formation, KMS, and CloudTrail covers the technical controls.

Further Reading — Official AWS Documentation

The authoritative AWS sources are the Amazon Macie User Guide (managed identifiers, custom identifiers, scheduled jobs, findings), the Amazon DataZone User Guide (domains, projects, asset publishing, subscription workflows), the AWS CloudTrail User Guide (management vs data events, event selectors, CloudTrail Lake), the Redshift Database Developer Guide (audit logging, dynamic data masking), and the AWS Compliance Center (Artifact, Audit Manager, compliance program documentation). The AWS Big Data Blog publishes case studies on data governance at companies like Netflix and Capital One. The Skill Builder DEA-C01 Exam Prep Standard Course has dedicated modules for Domain 4 covering audit and privacy. The AWS Well-Architected Security Pillar covers governance and audit best practices. For deeper compliance content, the AWS GDPR Center and AWS HIPAA documentation pages provide service-specific guidance for regulated workloads, and the AWS Samples GitHub repository has end-to-end sample architectures for compliant data lakes.
