
CloudWatch, CloudTrail, and Data Pipeline Monitoring


DEA-C01 Domain 3 Task 3.3 CloudWatch: Glue DPU + Kinesis IteratorAge + Redshift duration metrics, metric vs log, Logs Insights for pipeline debug, threshold + anomaly + composite alarms, cross-service dashboards, CloudTrail Lake SQL audit, metric-vs-log + CW-vs-CT traps.


CloudWatch monitoring and dashboards for data pipelines are the observability foundation that every production AWS data engineering workload depends on, and on the DEA-C01 exam CloudWatch appears in roughly one out of every four Domain 3 questions. The trap pattern is consistent: candidates know CloudWatch exists and have used CloudWatch alarms, but they cannot tell apart scenarios that hinge on the metric-versus-log distinction or the CloudWatch-versus-CloudTrail boundary. Community study guides from VivekR's 30-day roadmap, Tutorials Dojo, and ExamCert.App all flag CloudWatch and CloudTrail depth as underestimated by DEA candidates — most prep material spends ten minutes on CloudWatch, and the exam asks half a dozen questions about it.

This guide is built to put CloudWatch monitoring and dashboards into operational muscle memory from the data engineer's perspective. It covers the four pillars of pipeline observability, the metric-versus-log distinction, key data-service metrics (Glue DPU usage, Kinesis IteratorAge, Redshift query duration), CloudWatch Logs and the Logs Insights query language, threshold, anomaly-detection, and composite alarms, dashboards spanning Kinesis, Glue, and Redshift, EventBridge alarm integration for automated remediation, the CloudWatch-versus-CloudTrail distinction, CloudTrail Lake SQL audit, AWS Config for compliance tracking, and the canonical metric-vs-log and CloudWatch-vs-CloudTrail exam traps.

Data Pipeline Observability — The Four Pillars

Production data pipelines need four kinds of telemetry, and CloudWatch provides three of them natively (the fourth, traces, is X-Ray's domain).

Pillar 1 — Metrics

Numerical values sampled over time — Glue job DPU consumption, Kinesis IteratorAge, Redshift query duration. Metrics answer "is the pipeline healthy" and are the primary alerting signal.

Pillar 2 — Logs

Append-only event streams — Glue driver logs, Lambda invocation logs, EMR step logs. Logs answer "what happened during this specific run" and are the primary debugging source.

Pillar 3 — Alarms

Threshold or anomaly conditions on metrics that trigger notifications or automated actions. Alarms answer "wake someone up when health degrades."

Pillar 4 — Traces (Out Of Scope For DEA)

Distributed traces tying together a request across services. Useful for troubleshooting cross-service latency. DEA-C01 does not deeply test X-Ray; AWS focuses CloudWatch testing on metrics, logs, and alarms.

CloudWatch Metrics — The Numerical Heart

CloudWatch metrics are numerical samples published either by AWS services automatically or by your applications via the PutMetricData API.

Standard vs High-Resolution Metrics

Standard metrics have 1-minute granularity; high-resolution metrics have 1-second granularity. AWS service metrics are mostly standard; custom application metrics can be either. High-resolution costs more but is necessary for sub-minute alerting.
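As a minimal sketch of publishing a high-resolution custom metric, the helper below builds the parameters for CloudWatch's PutMetricData API (the namespace, metric name, and dimension are hypothetical; you would pass the resulting dict to boto3's `cloudwatch.put_metric_data(**params)`):

```python
from datetime import datetime, timezone

def build_put_metric_data(pipeline_name: str, records_processed: float) -> dict:
    """Parameters for PutMetricData publishing one high-resolution sample."""
    return {
        "Namespace": "Custom/DataPipeline",  # custom namespace (not AWS/*)
        "MetricData": [{
            "MetricName": "RecordsProcessed",
            "Dimensions": [{"Name": "PipelineName", "Value": pipeline_name}],
            "Timestamp": datetime.now(timezone.utc),
            "Value": records_processed,
            "Unit": "Count",
            # 1 = high-resolution (1-second granularity); 60 = standard
            "StorageResolution": 1,
        }],
    }

params = build_put_metric_data("daily_etl", 12500.0)
```

Setting `StorageResolution` to 60 (or omitting it) would make this a standard-resolution metric at lower cost.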

Namespaces And Dimensions

Metrics are organized into namespaces (AWS/Glue, AWS/Kinesis, AWS/Redshift) and have dimensions (job name, stream name, cluster identifier). Dimensions are the filtering axes — "show me Glue DPU usage where job_name=daily_etl."

Key Metrics For Data Engineering

  • Glue: glue.driver.aggregate.numCompletedTasks, DPU usage by job, job duration, error rate.
  • Kinesis Data Streams: IncomingRecords, IncomingBytes, IteratorAgeMilliseconds (the consumer-lag metric and the most-watched Kinesis metric).
  • Kinesis Firehose: IncomingRecords, DeliveryToS3.Records, DeliveryToS3.DataFreshness.
  • Redshift: CPUUtilization, DatabaseConnections, WLMQueryDuration, QueriesCompletedPerSecond.
  • Step Functions: execution count, execution duration, failed executions.
  • Lambda: invocations, errors, duration, throttles, concurrent executions.

Metric Math

CloudWatch supports metric math expressions: (metric_a / metric_b) * 100 for derived values, RATE(metric_a) for per-second rates, ANOMALY_DETECTION_BAND(metric_a, 2) for anomaly bands. Used in dashboards and alarms for derived KPIs.
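To make the derived-value case concrete, here is a sketch of a metric-math query set that computes a Lambda error-rate percentage from two raw metrics; the IDs (m1, m2, e1) are arbitrary labels, and the list is the shape of the MetricDataQueries parameter used by dashboards, alarms, and `cloudwatch.get_metric_data`:

```python
def error_rate_queries(function_name: str) -> list:
    """Metric-math query set: error rate = (Errors / Invocations) * 100."""
    dim = [{"Name": "FunctionName", "Value": function_name}]

    def raw(qid: str, name: str) -> dict:
        return {
            "Id": qid,
            "MetricStat": {
                "Metric": {"Namespace": "AWS/Lambda",
                           "MetricName": name, "Dimensions": dim},
                "Period": 60,
                "Stat": "Sum",
            },
            "ReturnData": False,  # intermediate series, not plotted/returned
        }

    return [
        raw("m1", "Errors"),
        raw("m2", "Invocations"),
        # the derived expression is the only series returned
        {"Id": "e1", "Expression": "(m1 / m2) * 100",
         "Label": "ErrorRate%", "ReturnData": True},
    ]
```

The same expression list can back a dashboard widget or a metric-math alarm.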

CloudWatch Logs — Centralized Log Storage

CloudWatch Logs is the centralized log storage that AWS services and your applications write to.

Log Groups And Streams

A log group is the container, typically named after the service and resource (/aws-glue/jobs/output, /aws/lambda/my-function). Within a group, log streams are individual sources — one stream per Lambda execution environment, one per Glue worker, one per EMR application. Retention is set per log group (1 day to 10 years, or never expire).

Configuring Log Retention

By default, log groups created automatically by AWS services have indefinite retention — costing money forever for logs nobody reads. The recommended baseline: set 30-day retention on most log groups, 90 days for compliance-relevant logs, and 1 year for security audit logs. The DEA-C01 exam tests this with cost-control scenarios.
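The retention baseline above can be sketched as a small mapping applied per log group (the group names are hypothetical; each dict in the output is the kwargs for boto3's `logs.put_retention_policy`):

```python
# Hypothetical baseline: 30 days default, 90 for compliance, 1 year for audit.
RETENTION_BASELINE = {
    "/aws-glue/jobs/output": 30,
    "/aws/lambda/compliance-etl": 90,
    "/audit/security-events": 365,
}

def retention_calls(baseline: dict) -> list:
    """One put_retention_policy kwargs dict per log group."""
    return [
        {"logGroupName": group, "retentionInDays": days}
        for group, days in baseline.items()
    ]

calls = retention_calls(RETENTION_BASELINE)
```

In practice this runs on a schedule so that newly auto-created log groups do not silently accumulate indefinite-retention storage.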

Log Subscriptions And Forwarding

Subscriptions stream logs in near real-time to Lambda, Kinesis Data Streams, or Kinesis Firehose. Pattern: forward Glue or Lambda logs to a central log analytics destination via a subscription, often into Firehose to S3 for long-term archival.
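A sketch of the forwarding pattern, building the kwargs for `logs.put_subscription_filter` to ship ERROR-level events to a Firehose delivery stream that archives to S3 (the ARNs, filter name, and pattern are illustrative assumptions):

```python
def build_subscription_filter(log_group: str, firehose_arn: str,
                              role_arn: str) -> dict:
    """kwargs for put_subscription_filter: forward matching events to Firehose."""
    return {
        "logGroupName": log_group,
        "filterName": "errors-to-archive",   # hypothetical filter name
        "filterPattern": "ERROR",            # only matching events are forwarded
        "destinationArn": firehose_arn,      # Firehose delivery stream (to S3)
        "roleArn": role_arn,                 # IAM role CloudWatch Logs assumes
    }

f = build_subscription_filter(
    "/aws-glue/jobs/output",
    "arn:aws:firehose:us-east-1:111122223333:deliverystream/log-archive",
    "arn:aws:iam::111122223333:role/cwl-to-firehose",
)
```

An empty `filterPattern` would forward every event, which is the common choice for full archival.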

Live Tail

Live Tail is the streaming console view of incoming logs — useful for watching a Glue job during development. Limited to a small number of concurrent sessions per account.

CloudWatch Logs Insights — SQL-Like Query

Logs Insights is the query language that lets engineers search and aggregate CloudWatch Logs.

Query Syntax

fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

Commands: fields, filter, stats, sort, limit, parse, dedup. Output goes to a table. Stats commands aggregate: stats count(*) by bin(5m) for time-bucketed counts, stats avg(@duration) by serviceName for grouped averages.

Use Cases For Data Engineering

Find all Glue job errors in the last hour: filter @message like /Exception/ | stats count(*) by job_name. Find the slowest Lambda invocations: stats max(@duration) by @logStream | sort by max desc | limit 10. Track Redshift query duration outliers: parse @message "Query * completed in *ms" as queryId, ms | filter ms > 60000.
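Running such a query programmatically follows a start/poll pattern; the sketch below builds the parameters for `logs.start_query` (the log group name and query are illustrative, and the actual boto3 calls are shown as comments since they require AWS credentials):

```python
# Hypothetical error-count query, time-bucketed into 5-minute bins.
QUERY = (
    "filter @message like /Exception/ "
    "| stats count(*) as errors by bin(5m) "
    "| sort errors desc"
)

def build_start_query(log_group: str, start: int, end: int) -> dict:
    """kwargs for logs.start_query over one log group and a time window."""
    return {
        "logGroupName": log_group,
        "startTime": start,   # epoch seconds
        "endTime": end,
        "queryString": QUERY,
        "limit": 100,
    }

p = build_start_query("/aws-glue/jobs/output", 0, 3600)

# With a boto3 client (not executed here):
#   logs = boto3.client("logs")
#   qid = logs.start_query(**p)["queryId"]
#   results = logs.get_query_results(queryId=qid)  # poll until status != "Running"
```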

Cross-Log-Group Queries

A single Logs Insights query can span up to 50 log groups, useful for tracing a request across services (a Glue job that calls Lambda that writes to S3 — query all three log groups together).

CloudWatch Alarms — Threshold And Anomaly

Alarms are the primary alerting mechanism on CloudWatch metrics.

Threshold Alarms

A static threshold: "alarm when Glue job duration > 60 minutes for 1 datapoint." Simple, predictable, and a good fit for well-understood metrics. Threshold values must be set by the engineer; bad thresholds cause alert fatigue or missed signals.

Anomaly Detection Alarms

Anomaly detection learns the metric's normal pattern (including seasonality) over a 2-week window and alerts when the metric exceeds an N-standard-deviation band. Best for metrics with daily or weekly cycles where a static threshold cannot capture "normal" — for example, Kinesis incoming records that are normally high during business hours and low at night. Configured by selecting "Anomaly detection" in the alarm wizard and setting the standard-deviation multiplier (typically 2 or 3).
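A sketch of the resulting alarm configuration, as kwargs for `cloudwatch.put_metric_alarm` using an anomaly-detection band with a standard-deviation multiplier of 2 (the stream and alarm names are illustrative):

```python
def build_anomaly_alarm(stream_name: str) -> dict:
    """Anomaly-detection alarm on Kinesis IncomingRecords."""
    return {
        "AlarmName": f"{stream_name}-incoming-records-anomaly",
        # alarm when m1 rises above the band's upper bound for 3 periods
        "ComparisonOperator": "GreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        "ThresholdMetricId": "band",   # compare m1 against the band, not a number
        "Metrics": [
            {
                "Id": "m1",
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/Kinesis",
                        "MetricName": "IncomingRecords",
                        "Dimensions": [{"Name": "StreamName",
                                        "Value": stream_name}],
                    },
                    "Period": 300,
                    "Stat": "Sum",
                },
            },
            # the learned band, 2 standard deviations wide
            {"Id": "band", "Expression": "ANOMALY_DETECTION_BAND(m1, 2)"},
        ],
    }

alarm = build_anomaly_alarm("clickstream")
```

Swapping the expression for a plain number and using `Threshold` instead of `ThresholdMetricId` would turn this into a static-threshold alarm.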

Composite Alarms

A composite alarm combines multiple alarms with AND/OR logic — alarm only when the Glue job alarm AND the Redshift WLM alarm both trigger, suppressing one as long as the other is in OK state. Composite alarms reduce alert noise on cascading failures where a single root cause triggers many downstream symptom alarms.
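The AND logic above can be sketched as kwargs for `cloudwatch.put_composite_alarm`, where the rule is a boolean expression over child alarm names (names and the SNS topic ARN are illustrative assumptions):

```python
def build_composite_alarm(glue_alarm: str, redshift_alarm: str,
                          topic_arn: str) -> dict:
    """Composite alarm: page only when BOTH child alarms are in ALARM."""
    return {
        "AlarmName": "pipeline-cascading-failure",
        "AlarmRule": f'ALARM("{glue_alarm}") AND ALARM("{redshift_alarm}")',
        "AlarmActions": [topic_arn],  # fires only on the correlated failure
    }

c = build_composite_alarm(
    "glue-duration-high",
    "redshift-wlm-queue-deep",
    "arn:aws:sns:us-east-1:111122223333:ops-page",
)
```

OR, NOT, and parentheses are also allowed in the rule expression, so "any of these three, but not during the maintenance-window alarm" is expressible too.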

Alarm Actions

Alarms trigger SNS notifications, EventBridge events, Auto Scaling actions, or Systems Manager OpsItem creation. The EventBridge integration is the route to automated remediation — alarm fires, EventBridge rule routes to Step Functions or Lambda that performs the corrective action.

Use anomaly detection alarms on metrics with daily or weekly seasonality (Kinesis IncomingRecords, business-hours dashboard query rates) where static thresholds cannot capture normal behavior, and use static threshold alarms on metrics with deterministic acceptable ranges (Glue DPU usage exceeding job DPU configuration, Lambda errors above zero). The DEA-C01 exam tests this with scenarios where the metric pattern is described — "the team gets false alarms when traffic naturally spikes on Mondays" is the anomaly-detection answer; "the team needs to know when error count exceeds 5 per minute" is the static-threshold answer. Composite alarms cut alert noise on multi-symptom failures by requiring multiple conditions before paging — implement composite alarms for the always-noisy production environments where one root cause triggers ten alarms.

CloudWatch Dashboards — Operational Views

Dashboards are visual collections of widgets — line charts, number displays, log tables, alarm status — for at-a-glance operational visibility.

Building A Data Pipeline Dashboard

A typical pipeline dashboard combines: Kinesis IteratorAge across all streams (line chart), Glue job duration trend over 7 days (line chart), Redshift WLM query queue length (number display), Step Functions execution success rate (line chart with anomaly band), and a list of currently-firing alarms (alarm widget). The dashboard becomes the morning-standup view.
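Two of those widgets can be sketched as a dashboard body for `cloudwatch.put_dashboard`, which takes a JSON string (the stream name, region, and alarm ARN are illustrative assumptions):

```python
import json

def build_dashboard_body(stream_name: str, alarm_arn: str) -> str:
    """Minimal dashboard: IteratorAge line chart plus an alarm-status widget."""
    body = {
        "widgets": [
            {
                "type": "metric",
                "x": 0, "y": 0, "width": 12, "height": 6,
                "properties": {
                    "title": "Kinesis consumer lag",
                    "region": "us-east-1",
                    "metrics": [
                        ["AWS/Kinesis", "GetRecords.IteratorAgeMilliseconds",
                         "StreamName", stream_name],
                    ],
                    "stat": "Maximum",
                    "period": 60,
                },
            },
            {
                "type": "alarm",
                "x": 12, "y": 0, "width": 12, "height": 6,
                "properties": {"title": "Firing alarms", "alarms": [alarm_arn]},
            },
        ]
    }
    return json.dumps(body)  # DashboardBody must be a JSON string

body_json = build_dashboard_body(
    "clickstream",
    "arn:aws:cloudwatch:us-east-1:111122223333:alarm:iterator-age-high",
)
```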

Cross-Account Dashboards

Cross-account observability lets one monitoring account view metrics and dashboards from many source accounts via CloudWatch's cross-account observability feature. Useful for centralized SRE teams overseeing data engineering accounts spread across business units.

Dashboard Sharing And Permissions

Dashboards can be shared publicly via signed URLs (read-only, time-bounded) or via IAM permissions for in-account viewers. Public sharing is appropriate for status pages; sensitive operational data should remain IAM-gated.

EventBridge And Automated Remediation

CloudWatch alarms publish to EventBridge, enabling automated remediation.

Alarm-To-Action Pattern

Alarm fires => alarm state-change event published to the default EventBridge bus => EventBridge rule pattern-matches the event => target Lambda or Step Functions runs remediation. Example: a Kinesis IteratorAge alarm fires on consumer lag => a Lambda function raises the consumer's concurrency limit or scales out an ECS consumer service.
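The matching step can be sketched as an EventBridge event pattern for CloudWatch alarm state changes, wrapped as the kwargs for `events.put_rule` (the alarm name is an illustrative assumption):

```python
import json

def build_alarm_rule(alarm_name: str) -> dict:
    """kwargs for events.put_rule matching one alarm entering ALARM state."""
    pattern = {
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {
            "alarmName": [alarm_name],
            # match only transitions INTO ALARM, not back to OK
            "state": {"value": ["ALARM"]},
        },
    }
    return {
        "Name": f"{alarm_name}-remediation",
        "EventPattern": json.dumps(pattern),
        "State": "ENABLED",
    }

rule = build_alarm_rule("iterator-age-high")
```

A follow-up `events.put_targets` call would then attach the remediation Lambda or Step Functions state machine to this rule.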

Why Not Alarms Direct To Lambda

Direct alarm-to-Lambda integration exists but lacks the routing flexibility, multi-target support, and cross-account routing that EventBridge provides. The "alarms publish to EventBridge, EventBridge routes" pattern is the recommended one for scalable operations.

CloudWatch Agent — Custom Metrics From EC2 And EMR

The CloudWatch Agent is the on-host agent that publishes OS-level metrics (memory, disk, custom log files) from EC2, EMR cluster nodes, and on-premises servers. AWS service-level metrics are auto-published; OS-level metrics require the agent.

Why Engineers Need The Agent On EMR

EMR cluster nodes expose YARN ResourceManager and Spark UI metrics that the default AWS/EMR namespace does not capture in detail. Installing the CloudWatch Agent and configuring it to publish JMX metrics from Spark and YARN gives deep visibility into cluster health.

CloudWatch vs CloudTrail — The Critical Distinction

This is the highest-value distinction on DEA-C01 Domain 3. The exam plants confusion between the two.

CloudWatch

CloudWatch monitors metrics and logs — what is happening in the running services. Glue DPU usage, Kinesis IteratorAge, Lambda errors, log lines from your application code. CloudWatch tells you about service health and application behavior.

CloudTrail

CloudTrail monitors API calls — who did what to AWS resources. "User X called CreateGlueJob at time Y from IP Z" is CloudTrail. CloudTrail tells you about identity and access events, not service health.

CloudTrail Event Types

Management events: control-plane API calls (CreateBucket, RunInstances, CreateGlueJob). Data events: data-plane API calls (S3 GetObject/PutObject, Lambda Invoke, DynamoDB Query). Management events are free for the first copy in your account; data events are billed per-event and typically opt-in for specific resources because of volume.

Insights Events

CloudTrail Insights events flag unusual API call patterns — a sudden spike in IAM role creations, an unexpected pattern of S3 deletes. Useful for security anomaly detection.

CloudWatch and CloudTrail are not interchangeable. CloudWatch tracks metrics and application logs; CloudTrail tracks API calls. A scenario asking "who deleted the Glue job" is a CloudTrail question; a scenario asking "why is the Glue job slow" is a CloudWatch question. Engineers who try to find DELETE events in CloudWatch Logs will fail because CloudWatch Logs only contains what services choose to write there (which is application output, not API call audit). Engineers who try to find application errors in CloudTrail will fail because CloudTrail only records API calls (not Lambda function output or Glue job logs). The DEA-C01 exam plants this with phrases like "audit who modified the configuration" (CloudTrail) versus "monitor pipeline performance" (CloudWatch). Learn the boundary cold.

CloudTrail Lake — SQL-Queryable Audit Store

CloudTrail Lake is a managed audit store that retains CloudTrail events for up to 7 years and exposes them to SQL queries.

How It Works

A CloudTrail Lake event data store ingests events from CloudTrail (or from a CloudTrail trail), stores them in an immutable, searchable format, and exposes a SQL interface in the CloudTrail console. Query example:

SELECT eventName, userIdentity.arn
FROM event_data_store
WHERE eventName = 'DeleteBucket' AND eventTime > '2026-01-01'

When To Use CloudTrail Lake

Use Lake when long-term audit retention (years) and ad hoc SQL queries against audit history are needed for compliance or investigations. Skip Lake when basic CloudTrail event history (90 days in the console) is sufficient and S3 archival of trail logs covers long-term needs at lower cost.

Cost Model

Per-GB ingestion plus per-GB scanned by queries. Significantly more expensive than S3-archived CloudTrail logs queried by Athena, but with simpler operational experience and immutable storage guarantees.

AWS Config — Configuration Change Tracking

AWS Config records configuration changes to AWS resources over time and evaluates rules for compliance.

How Config Differs From CloudTrail

CloudTrail records the API call ("CreateBucket was called by user X"); Config records the resulting resource state ("the bucket has versioning disabled"). Config maintains a configuration timeline per resource — what the resource looked like at each point in time.

Config Rules

Built-in rules check for compliance — for example, "S3 buckets must have encryption enabled," "RDS instances must have backups enabled," "Glue jobs must have CloudWatch Logs enabled." Custom rules are Lambda-backed.

Use For Data Engineering

Config rules ensure data engineering resources stay compliant — encryption requirements, retention policies, network configurations. Combined with EventBridge, non-compliant resources trigger automated remediation.

Use AWS Config rules for continuous compliance monitoring of data engineering resources, paired with EventBridge for automated remediation of non-compliant resources. Common patterns: "S3 bucket without default encryption" rule auto-enables encryption via Lambda; "Glue job without CloudWatch Logs" rule auto-enables logging; "RDS without automated backups" rule alerts compliance team. Config provides the configuration history (what was the bucket setting yesterday) that CloudTrail does not (CloudTrail tells you which API call changed it but not the resulting state at each point in time). For DEA-C01 exam scenarios about compliance enforcement on data resources, Config rules are the answer.

Plain-Language Explanation: CloudWatch Monitoring And Dashboards

Three concrete analogies make CloudWatch's role and the CloudWatch-versus-CloudTrail distinction intuitive.

Analogy 1 — The Hospital Patient Vital Signs vs Visitor Logbook

Picture a hospital. Every patient room has a wall-mounted monitor showing live vital signs — heart rate, blood pressure, oxygen saturation — and these readings stream into a central nursing station for at-a-glance observation. That is CloudWatch metrics for your data pipelines: Glue DPU usage is the heart rate, Kinesis IteratorAge is the blood pressure, Redshift query duration is the oxygen saturation. The nursing station has alarms — "if heart rate exceeds 120, page the cardiologist" — exactly CloudWatch alarms with thresholds. Anomaly-detection alarms are the experienced nurses who know that a rising heart rate at 6am during morning rounds is normal but the same rise at 3am is concerning. Patient charts in the room — handwritten notes about what the doctor did and observed — are CloudWatch Logs: the application's narrative output. The dashboard at the nursing station that displays vital signs from every room together is the CloudWatch dashboard. Now picture a visitor logbook at the hospital entrance — name, time, who they visited. That is CloudTrail: who entered the building and what they came to do. If a patient's medication was changed, the chart (logs) shows what was changed and the vital signs (metrics) show whether the patient responded; the visitor log (CloudTrail) shows that the doctor who changed the medication signed in at 2pm. Looking at the visitor log to find why the patient's blood pressure dropped is the wrong tool — that is CloudWatch's job.

Analogy 2 — The Factory Production Line With Multiple Sensor Types

Imagine a factory with conveyor belts moving products through stages. The factory has temperature sensors, vibration sensors, and counter sensors at each station — these stream readings to a central control room where the operator watches dashboards. That is CloudWatch metrics. Each machine has a maintenance log where mechanics write down "replaced belt at 10am, observed unusual noise at 2pm" — that is CloudWatch Logs, written by the application as narrative. The control room has an alarm panel — red lights and buzzers when temperature exceeds threshold or counter falls below floor — that is CloudWatch Alarms. The factory's badge-access door records who entered the factory and which areas they accessed — name, time, areas visited. That is CloudTrail. If a machine breaks, the sensors (metrics) and the maintenance log (Logs) tell you what failed; the badge log (CloudTrail) tells you who was the last person to access the machine. Asking the badge log why temperature spiked is wrong; asking the sensors who entered the factory is wrong. Anomaly-detection alarms are the foreman who knows the second-shift run normally has higher vibration than the first shift; static-threshold alarms are the floor-level "stop the line if temperature exceeds 95C." Composite alarms are "do not page me when both the chiller alarm and the temperature alarm fire — they always fire together."

Analogy 3 — The Restaurant Kitchen Camera vs Reservation Book

Picture a busy restaurant kitchen with cameras pointed at each station, a temperature probe in every fridge, and a digital order ticket system. The cameras and probes stream into the manager's monitoring screen — that is CloudWatch metrics plus dashboards. The chef writes service notes throughout the night about what happened — "burner three flame went out, used burner four" — that is CloudWatch Logs. The restaurant has a reservation book and an employee time-clock — who came in and who entered the kitchen. That is CloudTrail. The reservation book does not tell you why the burner went out; the cameras and chef's notes do. The chef's notes do not tell you who entered the back office at 11pm; the time-clock and reservation book do. The fridge probe alarm fires when temperature exceeds threshold (CloudWatch threshold alarm); the experienced sous chef notices that the dishwasher's noise pattern is unusual (anomaly detection in person form). The DEA-C01 exam asks you to read a scenario, identify whether the question is about service health or about identity audit, and pick CloudWatch or CloudTrail accordingly.

Common Exam Traps For CloudWatch Monitoring

The DEA-C01 exam plants a stable set of traps. Memorize all five.

Trap 1 — CloudWatch Logs For API Audit

A scenario asks "find who deleted the IAM role last week." Wrong answer: search CloudWatch Logs. Right answer: CloudTrail event history.

Trap 2 — CloudTrail For Application Errors

A scenario asks "find why the Glue job failed last night." Wrong answer: CloudTrail. Right answer: CloudWatch Logs (Glue writes job output there).

Trap 3 — Static Threshold On Cyclical Metric

A scenario describes Kinesis IncomingRecords with daily peaks and the team setting a static threshold causing false alarms. Right answer: anomaly detection alarm.

Trap 4 — Indefinite Log Retention

A scenario describes growing CloudWatch Logs costs. Right answer: configure per-log-group retention (30/90 days for non-compliance logs, longer for audit logs). Archiving to S3 first is the trap answer here, though it becomes correct when compliance demands long retention at lower cost.

Trap 5 — Logs Insights For Cross-Account

A scenario needs querying logs across accounts. Logs Insights queries within an account; cross-account requires CloudWatch cross-account observability or central log aggregation via subscription filters into a central account.

CloudWatch monitors metrics (numerical values) and logs (application output); CloudTrail monitors API calls (who did what); Config tracks configuration state changes over time. Three distinct services for three distinct observability questions. This is the one paragraph to memorize for every observability question on DEA-C01. If the scenario word is "performance," "duration," "error rate," or "lag," answer CloudWatch metrics. If the scenario word is "log line," "exception," or "application output," answer CloudWatch Logs. If the scenario word is "who," "audit," "API," or "called," answer CloudTrail. If the scenario word is "compliance," "configuration drift," or "resource state over time," answer AWS Config. The boundaries are crisp; the exam tests them precisely.

Key Numbers And Must-Memorize CloudWatch Facts

Metrics

  • Standard granularity: 1 minute
  • High-resolution: 1 second
  • Custom metrics via PutMetricData
  • Namespaces: AWS/Glue, AWS/Kinesis, AWS/Redshift, etc.
  • Metric math: derived expressions in dashboards and alarms

Logs

  • Default retention: indefinite (set explicit retention)
  • Logs Insights: SQL-like query language
  • Cross-log-group queries: up to 50 groups
  • Subscriptions: real-time forwarding to Lambda, Kinesis, Firehose

Alarms

  • Threshold: static condition
  • Anomaly detection: 2-week training, N-standard-deviation band
  • Composite: AND/OR over multiple alarms
  • Actions: SNS, EventBridge, Auto Scaling, OpsItem

Dashboards

  • Cross-account observability supported
  • Public sharing via signed URLs (time-bounded, read-only)
  • IAM-gated for in-account access

CloudTrail

  • Management events: free for first copy, control-plane
  • Data events: opt-in, billed, data-plane (S3 object, Lambda invoke)
  • CloudTrail Lake: 7-year retention, SQL queries
  • Insights events: anomalous API patterns

AWS Config

  • Configuration timeline per resource
  • Rules: built-in or custom Lambda-backed
  • Auto-remediation via EventBridge to Systems Manager

DEA-C01 exam priority — CloudWatch Monitoring and Data Pipeline Dashboards. This topic carries real weight on the DEA-C01 exam. Master the trade-offs, decision boundaries, and the cost/performance triggers each AWS service exposes; the exam tests scenarios that hinge on knowing which service is the wrong answer, not just which is right.

Key facts to memorize. Pin the service-level limits, default behaviors, and SLA promises related to this topic. The exam often tests recall of specific defaults (durations, limits, retry behavior, free-tier boundaries) where memorizing the exact number is the difference between right and almost right.

FAQ — CloudWatch Monitoring And Dashboards Top Questions

Q1 — When should I use threshold alarms versus anomaly detection alarms?

Use threshold alarms when the metric has a deterministic acceptable range — "Lambda errors must be zero," "Glue DPU usage must not exceed configured DPU," "Redshift query duration must stay under 60 seconds for the dashboard query." Threshold alarms are simple, predictable, and cheap. Use anomaly detection when the metric has natural variation that a static threshold cannot capture — "Kinesis IncomingRecords spikes during business hours and drops at night, but a sudden midnight spike is anomalous," "weekly batch processing runs longer on Mondays due to weekend backlog." Anomaly detection learns the metric's pattern over a 2-week training window and alerts on N-standard-deviation deviations. The DEA-C01 exam tests this with scenarios describing the metric's pattern; the right alarm type follows from the pattern.

Q2 — How do I distinguish CloudWatch from CloudTrail in scenarios?

CloudWatch tracks metrics and logs — it answers "what is happening" or "what happened in the application." CloudTrail tracks API calls — it answers "who did what" with identity, time, source IP, and request parameters. If the scenario word is "performance," "error rate," "log line," "exception," or "application output," CloudWatch is the answer. If the scenario word is "who," "audit," "API call," "called," "deleted," or "modified," CloudTrail is the answer. AWS Config sits adjacent — Config tracks "what was the resource state at each point in time," answering compliance questions about configuration drift. The DEA-C01 exam plants this distinction in roughly half of Domain 3 observability questions; getting the boundary right is high-leverage exam prep.

Q3 — How do I control CloudWatch Logs cost?

CloudWatch Logs cost is dominated by ingestion volume and storage retention. Three controls reduce cost. First, configure per-log-group retention — by default log groups have indefinite retention; setting 30-day retention on most groups, 90 days for critical pipelines, and 1 year for audit logs cuts storage cost dramatically. Second, reduce log verbosity in application code — log at INFO level in production, not DEBUG. Third, for high-volume logs that must be retained long-term, use a subscription filter to forward to S3 via Kinesis Firehose (S3 storage is roughly 10x cheaper than CloudWatch Logs for the same data). The DEA-C01 exam tests this with growing-cost scenarios; the answer is retention plus log level plus archival.

Q4 — What metrics should I monitor on a Kinesis Data Streams pipeline?

The single most important metric is IteratorAgeMilliseconds per shard — the age of the oldest record in the shard not yet read by a consumer. Rising IteratorAge means consumers are lagging behind producers and the pipeline is falling behind real-time. Set both threshold and anomaly detection alarms on IteratorAge. Secondary metrics: IncomingRecords and IncomingBytes for ingestion volume (with anomaly detection), ReadProvisionedThroughputExceeded and WriteProvisionedThroughputExceeded for shard saturation, GetRecords.Latency for consumer-side performance. Build a dashboard that shows IteratorAge and ingestion volume across all streams in one view.

Q5 — What is CloudTrail Lake and when should I use it?

CloudTrail Lake is a managed event data store that ingests CloudTrail events, stores them in an immutable searchable format for up to 7 years, and exposes a SQL query interface in the CloudTrail console. Use Lake when long-term audit retention beyond CloudTrail's 90-day console history is required for compliance, when investigations need ad hoc SQL queries against historical events, and when operational simplicity is more valuable than the lower cost of S3-archived trail logs queried by Athena. Skip Lake when 90-day console history is sufficient or when the team already has a Glue + Athena pipeline querying archived trail S3 buckets at lower per-GB cost. The DEA-C01 exam tests this with compliance scenarios — long-term audit needs point at Lake.

Q6 — How do I build automated remediation when a CloudWatch alarm fires?

The recommended pattern is alarm-to-EventBridge-to-action. The alarm publishes its state-change event to the default EventBridge bus. An EventBridge rule pattern-matches the alarm event (by alarm name or namespace) and routes to a target — Lambda for code-based remediation, Step Functions for multi-step remediation, or Systems Manager Automation document for parameterized runbooks. Example: Kinesis IteratorAge alarm fires, EventBridge routes to Step Functions that scales out the consumer Lambda concurrency limit and posts to a Slack webhook. The DEA-C01 exam tests this pattern with operational-resilience scenarios; EventBridge is the routing hub.

Q7 — How does AWS Config relate to CloudTrail and CloudWatch?

AWS Config records the configuration state of AWS resources over time — what the bucket's encryption setting was at each timestamp, what the security group rules looked like last week. CloudTrail records the API calls that caused those changes — who called PutBucketEncryption with which parameters at which time. CloudWatch records the running-state metrics and application logs — was the bucket actually being used, were there errors. Together they answer the full triangle: what happened (CloudTrail), what was the resulting state (Config), what was the operational impact (CloudWatch). For compliance use cases, Config rules enforce desired state automatically; for security investigations, CloudTrail provides the audit trail; for operational issues, CloudWatch provides the runtime telemetry.

Further Reading — Official AWS Documentation For CloudWatch

The authoritative AWS sources are the CloudWatch User Guide (metrics, logs, alarms, dashboards), the CloudWatch Logs Insights query reference, the CloudTrail User Guide (event types, Lake), and the AWS Config Developer Guide. The AWS Operations Blog has multiple deep-dive posts on data pipeline observability patterns, anomaly detection setup, and EventBridge-driven remediation. The AWS Well-Architected Operational Excellence Pillar covers the observability-related practices in depth. The AWS Samples GitHub repository contains end-to-end dashboard templates for Glue, Kinesis, and Redshift monitoring. Finally, the Skill Builder DEA-C01 Exam Prep Standard Course has a dedicated module on observability that walks through the canonical CloudWatch-versus-CloudTrail boundary in scenario form.
