CloudWatch Alarms, EventBridge, and Automated Monitoring Actions

DOP-C02 deep dive on CloudWatch metric alarms, composite alarms, anomaly detection, alarm actions, and EventBridge rule patterns, schedules, archives, and cross-account event buses for event-driven DevOps automation.

The CloudWatch alarm and EventBridge integration is the trigger layer of every DevOps automation chain on AWS. An alarm fires; an EventBridge rule listens; a target (SNS, Lambda, Step Functions, SSM Automation, ECS task, API destination) reacts. DOP-C02 expects you to build these chains correctly under realistic constraints: multi-account, multi-Region, with bounded blast radius, with idempotent retries, and with deduplication so you do not page the on-call ten times for one outage. The exam tests the small but unforgiving details: composite alarms vs metric math vs anomaly detection, m-out-of-n evaluation periods, missing-data treatment, EventBridge pattern matching syntax (JSON documents with a matcher grammar of their own, not free-form JSON), the difference between the default event bus and custom buses, archive-and-replay for incident reproduction, the new EventBridge Scheduler vs scheduled rules, and cross-account event buses with the right resource policies.

This guide assumes you understand basic CloudWatch alarms (a threshold crossing on a metric) and EventBridge concepts (event source, rule, target). It focuses on the DOP-C02 implementation depth: composite alarm boolean logic, metric math expressions in alarm definitions, anomaly detection bands, alarm actions including EC2 recovery and Auto Scaling, the EventBridge event pattern grammar with content filters and prefix matching, EventBridge Scheduler one-time and recurring schedules, archive-and-replay for incident postmortems, dead-letter queues for failed targets, input transformers that reshape events into target-friendly payloads, schema registry for typed event handling, and cross-account event delivery. Domain 4.3 (Automate monitoring and event management) and Domain 5.1 (Manage event sources to process, notify, and take action) are the home of this material.

Why CloudWatch Alarms and EventBridge Matter on DOP-C02

The DOP-C02 exam is structurally biased toward event-driven automation. The introduction of Domain 5 (Incident and Event Response, 14 percent) was the largest C01-to-C02 expansion, and every Domain 5 scenario funnels through EventBridge as the event bus. CloudWatch alarms are the synthetic event source that turns metric thresholds into events; EventBridge rules are the routing fabric that fans those events out to remediation, notification, and ticketing systems. Together they answer the recurring exam question: "When this thing happens, how does the system respond automatically without paging a human at 3am?"

The alarm and EventBridge corner is also where DOP-C02 tests discrimination between adjacent services. Alarms vs metric math vs anomaly detection differ subtly. EventBridge rules vs EventBridge Scheduler vs CloudWatch Events (the legacy name) require you to know what is current. Composite alarms vs SNS topic combinations differ in subtle ways. The exam does not test the existence of these features — it tests picking the right one under stated constraints (cost, latency, multi-account boundary, deduplication need).

  • Metric alarm: a rule that watches a single metric (or a metric math expression) and transitions between OK, ALARM, and INSUFFICIENT_DATA states.
  • Composite alarm: a meta-alarm that watches the states of other alarms with boolean operators (AND, OR, NOT) — used to suppress noise and to implement multi-signal logic.
  • Anomaly detection alarm: a metric alarm with an ML-derived expected band; alarms when the metric leaves the band (above, below, or both directions).
  • Metric math: an expression syntax (m1+m2, RATE(m1), IF(m1>5, 1, 0)) that combines multiple metrics into one alarmable series.
  • Alarm action: a target invoked on state change: SNS topic, Auto Scaling action, EC2 stop/terminate/reboot/recover, Systems Manager OpsItem/Incident, Lambda (directly since late 2023, or via SNS).
  • EventBridge default event bus: the per-account bus that automatically receives AWS service events.
  • Custom event bus: a separately named bus for application events; the usual choice for isolating cross-account traffic from AWS service events (SaaS partner events arrive on dedicated partner event buses).
  • Event pattern: a JSON-like rule that matches against incoming events using fields, prefix, suffix, anything-but, numeric, and exists matchers.
  • Input transformer: a per-target template that reshapes the original event into a custom payload using JSONPath references.
  • EventBridge Scheduler: the dedicated scheduling service (launched 2022) for one-time and recurring schedules; supersedes scheduled rules at scale.
  • Archive and replay: a feature that captures events on a bus, stores them for a retention period, and re-emits them later for incident reproduction or backfill.
  • Reference: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html

Plain-Language Explanation: CloudWatch Alarms and EventBridge

The alarm-and-bus pattern is abstract, and the boolean composition rules trip up engineers who think of alarms as standalone triggers. Three different analogies help.

Analogy 1: The Smoke Detector and the Building Fire Panel

Every smoke detector in a building is a CloudWatch metric alarm: it watches one signal (smoke density) and fires when a threshold is crossed. The detector itself does not know what to do; it just outputs a state change. All detectors in a wing wire to a building fire panel, which is the composite alarm: it applies boolean logic ("if hallway AND stairwell both alarm, this is a real fire; if only the kitchen alarm fires, it might be burnt toast"). The fire panel does not page the fire department directly. It sends a signal to the building event bus (EventBridge default bus), where multiple subscribers listen. The fire department subscribes (an EventBridge rule with a Lambda target dispatching SMS), the elevator system subscribes (a rule that returns elevators to the ground floor), the HVAC subscribes (a rule that shuts off air handlers to slow smoke), and the building log subscribes (a rule that writes to S3 for the postmortem). One event, fan-out to many targets, each with its own retry and dead-letter queue. The EventBridge input transformer is the way the fire panel translates "Detector Z14, smoke level 380 ppm" into the fire department's expected payload format ("Address: 123 Main St; Zone: West Wing; Severity: 3"). Replay is the building's recording of past events, played back in the next drill to test the new dispatch software.

Analogy 2: The Industrial SCADA System

In a power plant, every sensor (turbine RPM, coolant temperature, voltage) feeds a PID controller — that is your CloudWatch metric alarm with metric math. The controller computes a derived signal (coolant_temp - ambient_temp) and trips when it exceeds tolerance. Multiple controllers cascade into a safety interlock matrix — that is your composite alarm with AND/OR/NOT logic that prevents a single false reading from triggering a shutdown. When a real interlock trips, it raises an event onto the plant's process bus (EventBridge custom event bus, segmented from the corporate IT bus). Subscribers include automated actions (close valve V12, start backup pump, page the on-shift supervisor) and slower processes (schedule a maintenance ticket, notify regulatory monitoring). The EventBridge Scheduler is the plant's preventive-maintenance calendar — every Tuesday 02:00, run the lubrication runbook on Pump A; every quarter, run the full vibration test. Archive and replay is the plant's black-box recorder that captures every process event for 90 days so investigators can replay the sequence after an incident.

Analogy 3: The Newsroom Wire Service

A wire service (Reuters, AP) is EventBridge itself. Reporters around the world (AWS services, custom applications, SaaS partners) push stories (events) into the wire (the bus). Every newsroom (subscriber Lambda, Step Function, ECS task, third-party API) registers rules describing what they care about: "anything tagged politics AND from Washington", "anything with severity critical". The wire delivers each matching story to each subscriber independently — fan-out, with each subscriber processing at their own pace. Input transformer is the wire's syndication tier that re-wraps the same story into the format each newsroom prefers (USA Today wants short paragraphs, the New York Times wants long-form). Cross-account event bus is the international syndication channel — Reuters delivers to Reuters Tokyo with a contractual permission. Archive and replay is the wire's morgue — the searchable archive of all past stories you can replay if a downstream newsroom missed a few hours.

For DOP-C02 stems centered on "alarm composition and noise suppression", reach for the fire panel building analogy. For stems centered on "industrial automation and scheduled runbooks", reach for the SCADA power plant analogy. For stems centered on "fan-out, multi-account event distribution, and replay", reach for the wire service newsroom analogy. Reference: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html

CloudWatch Metric Alarm Anatomy

A metric alarm is defined by these core parameters: a metric (or metric math expression), a period, a comparison operator (e.g., GreaterThanOrEqualToThreshold), a threshold, a number of evaluation periods (the n in m-of-n: how many of the most recent periods are examined), a datapoints-to-alarm count (the m in m-of-n; it defaults to n), and a missing-data treatment. The combination is more nuanced than it looks.

m-out-of-n evaluation

Default: all n evaluated periods must breach (m = n). The m-out-of-n mode lets you say "alarm if 3 of the last 5 periods breach", which is useful when transient spikes are normal but a sustained breach matters. Scenario: a CPUUtilization alarm at 70 percent that drives a scaling policy should ignore one-minute spikes; use 3 of 5 rather than 1 of 1.
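The m-out-of-n rule can be sketched in a few lines of Python. This is a simplified model written for illustration, not CloudWatch's actual evaluator: the function name `alarm_state` is hypothetical, the comparison is fixed to "greater than or equal", and missing data and INSUFFICIENT_DATA are ignored.

```python
def alarm_state(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Return "ALARM" when at least M of the last N datapoints breach.

    Simplified sketch: every period has a datapoint, and a breach means
    value >= threshold (GreaterThanOrEqualToThreshold).
    """
    window = datapoints[-evaluation_periods:]            # the last N periods
    breaching = sum(1 for value in window if value >= threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# One-minute CPU samples, oldest first (made-up numbers).
one_spike = [45, 92, 50, 48, 47]
sustained = [45, 92, 88, 48, 75]

spike_result = alarm_state(one_spike, 70, evaluation_periods=5, datapoints_to_alarm=3)
sustained_result = alarm_state(sustained, 70, evaluation_periods=5, datapoints_to_alarm=3)
```

With 3 of 5, the single spike leaves the alarm in OK while the sustained series trips it; with 1 of 1, only the most recent datapoint matters.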

Missing data treatment

Four options: notBreaching (treat missing as healthy), breaching (treat missing as alarming), ignore (do not change state), missing (transition to INSUFFICIENT_DATA). The default is missing. For deployment-safety alarms (e.g., "fail the deployment if no successful requests received"), you want breaching. For periodic batch jobs, you want notBreaching to avoid flapping.
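As a rough illustration of the deployment-safety case, the sketch below assembles the keyword arguments you might pass to the PutMetricAlarm API (for example through boto3's `put_metric_alarm`). The helper name, the `MyApp` namespace, and the `SuccessfulRequests` metric are hypothetical; only the parameter names and the four TreatMissingData values come from the API.

```python
VALID_TREATMENTS = {"missing", "notBreaching", "breaching", "ignore"}


def deployment_safety_alarm(alarm_name, treat_missing_data="breaching"):
    """Build PutMetricAlarm parameters for a 'no successful requests' alarm."""
    if treat_missing_data not in VALID_TREATMENTS:
        raise ValueError(f"unknown TreatMissingData value: {treat_missing_data!r}")
    return {
        "AlarmName": alarm_name,
        "Namespace": "MyApp",                # hypothetical custom namespace
        "MetricName": "SuccessfulRequests",  # hypothetical custom metric
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 3,
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        # No datapoints at all counts as a breach, so a silent service fails the deployment.
        "TreatMissingData": treat_missing_data,
    }


params = deployment_safety_alarm("deploy-no-successful-requests")
```

For a periodic batch job the same helper would be called with `treat_missing_data="notBreaching"` so quiet periods do not flap the alarm.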

Alarm actions

Actions fire on state-change transitions (OK→ALARM, ALARM→OK, ALARM→INSUFFICIENT_DATA, etc.). Built-in action types:

  • SNS topic notification — the most common; fans out to email, SMS, Lambda, SQS.
  • EC2 actions: stop, terminate, reboot, recover. Recover only works on instances using EBS-backed root volumes; it brings the instance up on different underlying hardware.
  • Auto Scaling action — fires a step or simple scaling policy.
  • Systems Manager OpsItem — creates a ticket-like artifact in OpsCenter for human review.
  • Systems Manager Incident — escalates to Incident Manager response plan.

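A common DOP-C02 trap: candidates pick "alarm directly invokes Lambda" without checking what the stem requires. Older study material says CloudWatch alarms cannot target Lambda at all; since late 2023 an alarm can invoke a Lambda function as a direct action, but that path offers none of EventBridge's rule-level retry, dead-letter queue, or input transformation controls. The classic paths are alarm → SNS topic → Lambda subscription, or alarm → EventBridge rule (via the alarm state-change event) → Lambda target. The latter is the modern preferred pattern because EventBridge supports retries, dead-letter queues, and input transformation. Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html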

Composite Alarms — Boolean Logic for Noise Suppression

A composite alarm watches multiple metric alarms and fires only when a boolean expression evaluates true. Syntax example: ALARM("WebTier-5xx") AND ALARM("WebTier-Latency-High") AND NOT ALARM("Maintenance-Window"). Use cases:

  • Suppress maintenance noise: an inhibitor alarm fires during scheduled deployments; the composite alarm uses NOT ALARM("Maintenance-Window") to mute downstream pages.
  • Multi-signal correlation: only page when both error rate AND latency degrade — a single signal might be noise.
  • Hierarchical alarm trees: a top-level "Application Health" composite rolls up dozens of leaf alarms.

Composite alarms can themselves be inputs to other composite alarms (up to a depth of 10). They have their own actions list, separate from the leaf alarms — leaf alarms typically have no actions; only the composite triggers paging. This is the right answer for "stop the on-call from being paged 30 times for one event".
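The rule syntax above can be generated rather than hand-typed. The following is a minimal sketch, assuming leaf alarms with the names used in this section already exist; `composite_rule` is a hypothetical helper and the SNS topic ARN is a placeholder. The resulting dictionary uses the parameter names of the PutCompositeAlarm API.

```python
def composite_rule(require_alarm, suppress_when_alarm=()):
    """Join alarm names into a composite AlarmRule string.

    Every name in require_alarm must be in ALARM; any name in
    suppress_when_alarm being in ALARM mutes the composite.
    """
    terms = [f'ALARM("{name}")' for name in require_alarm]
    terms += [f'NOT ALARM("{name}")' for name in suppress_when_alarm]
    return " AND ".join(terms)


rule = composite_rule(
    ["WebTier-5xx", "WebTier-Latency-High"],
    suppress_when_alarm=["Maintenance-Window"],
)

composite_params = {
    "AlarmName": "WebTier-Page-OnCall",
    "AlarmRule": rule,
    # Only the composite carries the paging action; leaf alarms stay silent.
    "AlarmActions": ["arn:aws:sns:us-east-1:111111111111:oncall-pager"],  # placeholder ARN
}
```

Keeping the paging action on the composite only is what turns many leaf state changes into a single page.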

Anomaly Detection Alarms

Anomaly detection wraps CloudWatch's built-in ML model around a metric to compute an expected band. The alarm fires when the metric leaves the band (above, below, or both). Configuration:

  • Direction: GreaterThanUpperThreshold, LessThanLowerThreshold, or LessThanLowerOrGreaterThanUpperThreshold.
  • Standard deviations: how wide the band is. 2 is the default; higher means fewer false positives.
  • Training period: the model trains on up to 14 days of history.

Use anomaly detection for metrics with seasonal patterns where a fixed threshold cannot capture normal — e.g., daily traffic shape, weekend dips. Do not use anomaly detection for metrics with a hard SLO (e.g., latency P99 < 500 ms); use a static threshold there because business contracts care about absolute values, not statistical normalcy.
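An anomaly detection alarm is expressed as a metric plus a band expression, with the alarm pointed at the band instead of a fixed threshold. The sketch below builds PutMetricAlarm parameters in that shape; the helper name, the namespace, and the metric name are assumptions for illustration.

```python
def anomaly_alarm_params(alarm_name, namespace, metric_name, band_width=2):
    """Build PutMetricAlarm parameters for a two-sided anomaly detection alarm."""
    return {
        "AlarmName": alarm_name,
        # Fire when the metric leaves the band in either direction.
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        # The threshold is the band expression below, not a number.
        "ThresholdMetricId": "band",
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {"Namespace": namespace, "MetricName": metric_name},
                    "Period": 300,
                    "Stat": "Average",
                },
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": "expected range",
                "ReturnData": True,
            },
        ],
    }


traffic_alarm = anomaly_alarm_params("Traffic-Shape-Anomaly", "MyApp", "RequestsPerMinute")
```

Raising `band_width` widens the band, which trades sensitivity for fewer false positives.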

Metric Math in Alarms

Metric math defines a derived series for alarming. Syntax examples:

  • (errors / requests) * 100 — error rate.
  • RATE(m1) — per-second rate of a counter.
  • FILL(m1, REPEAT) — fill missing data with last value (use carefully).
  • IF(m1 > 5, 1, 0) — conditional flag.
  • ANOMALY_DETECTION_BAND(m1, 2) — built-in anomaly bands inside a math expression.

Metric math alarms are billed at the underlying metric retrieval rate. They cannot include cross-Region or cross-account metrics in a single expression — for that, use cross-account observability and aggregate at the dashboard layer.
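To make the error-rate expression concrete, here is a sketch of how the two source metrics and the derived series might be laid out for PutMetricAlarm, next to the same arithmetic in plain Python. The load balancer dimension value and the 5 percent threshold are placeholders; the metric names are the standard Application Load Balancer ones.

```python
def alb_metric(metric_id, metric_name):
    """One input series; ReturnData False means it only feeds the expression."""
    return {
        "Id": metric_id,
        "ReturnData": False,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/ApplicationELB",
                "MetricName": metric_name,
                # Placeholder dimension value for illustration.
                "Dimensions": [{"Name": "LoadBalancer", "Value": "app/web/0123456789abcdef"}],
            },
            "Period": 60,
            "Stat": "Sum",
        },
    }


error_rate_alarm = {
    "AlarmName": "WebTier-ErrorRate",
    "ComparisonOperator": "GreaterThanThreshold",
    "Threshold": 5,
    "EvaluationPeriods": 5,
    "DatapointsToAlarm": 3,
    "Metrics": [
        alb_metric("errors", "HTTPCode_Target_5XX_Count"),
        alb_metric("requests", "RequestCount"),
        # Exactly one series returns data: the one the alarm evaluates.
        {"Id": "rate", "Expression": "(errors / requests) * 100", "ReturnData": True},
    ],
}


def error_rate_percent(errors, requests):
    """The same arithmetic as the expression, for a single period."""
    return (errors / requests) * 100
```

For example, 12 errors out of 200 requests in a period is a 6 percent error rate, which breaches the 5 percent threshold for that datapoint.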

EventBridge Event Patterns — The Real Grammar

Event patterns look like JSON but they are not arbitrary JSON matchers. They follow a specific grammar:

  • Exact match: { "source": ["aws.ec2"] } — array means "OR".
  • Prefix match: { "detail": { "instance-id": [{ "prefix": "i-abc" }] }} — only on string fields.
  • Suffix match: { "detail": { "filename": [{ "suffix": ".gz" }] }}.
  • Anything-but: { "source": [{ "anything-but": ["aws.health"] }] }.
  • Numeric: { "detail": { "cpu": [{ "numeric": [">", 80] }] }}.
  • Exists: { "detail": { "status": [{ "exists": true }] }}.
  • Wildcard (a newer addition): { "detail": { "filename": [{ "wildcard": "logs/*.gz" }] }}, where * matches any run of characters; regular expressions are not supported.

Common DOP-C02 trap: candidates write {"source": "aws.ec2"} (string instead of array). The grammar requires arrays. Another trap: they expect == or > operators directly; the grammar requires the numeric matcher object.

Memorize the matcher set: prefix, suffix, anything-but, numeric, exists, cidr (for IPs), equals-ignore-case, and wildcard (newer). Every value is wrapped in an array — even single values. Patterns match against the event JSON tree by exact path. Cross-pattern logic uses array-OR within a field; AND across fields is implicit. The pattern does not support a generic regex. For complex matching, transform-and-route via Lambda. Reference: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-event-patterns.html
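The grammar is easier to remember after seeing it evaluated. Below is a small Python emulation of a subset of the matchers (exact, prefix, suffix, anything-but, numeric, exists). It is an illustrative sketch, not EventBridge's matching engine: the function names are hypothetical, and cidr, equals-ignore-case, wildcard, and array-valued event fields are left out.

```python
import operator

_COMPARE = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
            "<=": operator.le, "=": operator.eq}


def _matches_value(matcher, value, present):
    """Check one array element of a pattern against one event field."""
    if not isinstance(matcher, dict):
        return present and value == matcher              # exact match
    if "exists" in matcher:
        return present == matcher["exists"]
    if not present:
        return False
    if "prefix" in matcher:
        return isinstance(value, str) and value.startswith(matcher["prefix"])
    if "suffix" in matcher:
        return isinstance(value, str) and value.endswith(matcher["suffix"])
    if "anything-but" in matcher:
        excluded = matcher["anything-but"]
        excluded = excluded if isinstance(excluded, list) else [excluded]
        return value not in excluded
    if "numeric" in matcher:
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        ops = matcher["numeric"]                         # e.g. [">", 0, "<=", 5]
        return all(_COMPARE[op](value, bound) for op, bound in zip(ops[::2], ops[1::2]))
    raise ValueError(f"unsupported matcher in this sketch: {matcher!r}")


def matches(pattern, event):
    """AND across fields, OR within each field's array."""
    for key, expected in pattern.items():
        present = key in event
        value = event.get(key)
        if isinstance(expected, dict):                   # descend into nested object
            if not matches(expected, value if isinstance(value, dict) else {}):
                return False
        elif isinstance(expected, list):
            if not any(_matches_value(m, value, present) for m in expected):
                return False
        else:
            raise ValueError(f"pattern values must be arrays: {key!r}: {expected!r}")
    return True


event = {"source": "aws.ec2", "detail": {"instance-id": "i-abc123", "cpu": 91}}
pattern = {
    "source": ["aws.ec2"],
    "detail": {"instance-id": [{"prefix": "i-abc"}], "cpu": [{"numeric": [">", 80]}]},
}
is_match = matches(pattern, event)
```

Passing `{"source": "aws.ec2"}` to this sketch raises a `ValueError`, mirroring the string-instead-of-array trap described above.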

EventBridge Scheduler vs Scheduled Rules

The legacy way to run scheduled tasks was an EventBridge rule with a cron or rate expression, targeting Lambda or SSM. In November 2022, AWS launched EventBridge Scheduler as a dedicated service. The differences matter on DOP-C02:

Capability | Scheduled Rule (legacy) | EventBridge Scheduler
Per-account schedule limit | 300 | 1 million+
Time zone support | UTC only | Any IANA time zone
One-time schedules | No | Yes
Flexible time windows | No | Yes (jitter)
Targets | EventBridge target list | Templated targets, 270+ services via SDK
Pricing | Free | Pay per invocation

For new work, always pick EventBridge Scheduler. The exam will offer "scheduled rule" answers as distractors; pick Scheduler when scale, time zones, or one-time schedules are mentioned.
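A recurring schedule with a local time zone can be sketched as the keyword arguments for the Scheduler CreateSchedule API (for example via boto3's `scheduler` client). The schedule name, time zone, role ARN, and instance ID below are placeholders chosen for illustration; the target ARN uses Scheduler's universal `aws-sdk` target format.

```python
import json


def stop_instances_schedule(name, instance_ids, timezone_name, role_arn):
    """Build CreateSchedule parameters that stop EC2 instances on weekdays at 19:00."""
    return {
        "Name": name,
        # Fields: minutes hours day-of-month month day-of-week year
        "ScheduleExpression": "cron(0 19 ? * MON-FRI *)",
        "ScheduleExpressionTimezone": timezone_name,     # an IANA zone name
        "FlexibleTimeWindow": {"Mode": "OFF"},           # run exactly on time
        "Target": {
            "Arn": "arn:aws:scheduler:::aws-sdk:ec2:stopInstances",
            "RoleArn": role_arn,                         # role Scheduler assumes
            "Input": json.dumps({"InstanceIds": instance_ids}),
        },
    }


schedule = stop_instances_schedule(
    "stop-dev-instances",
    ["i-0123456789abcdef0"],                             # placeholder instance ID
    "Europe/Berlin",
    "arn:aws:iam::111111111111:role/scheduler-stop-ec2",  # placeholder role
)
```

Because the time zone is part of the schedule, 19:00 stays 19:00 local time across daylight-saving changes, which a UTC-only scheduled rule cannot do without manual edits.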

Custom Event Buses and Cross-Account Delivery

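The default event bus automatically receives AWS service events in your account. For custom events from your applications or events from other accounts, the usual practice is to create a custom event bus so that traffic stays separate from service events; SaaS partner events arrive on partner event buses. Cross-account delivery requires: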

  1. The receiving custom bus has a resource-based policy allowing the source account to events:PutEvents.
  2. The source account creates a rule on its bus targeting the destination bus ARN, with an IAM role that EventBridge assumes to deliver.
  3. The destination bus has rules with targets in the receiving account.

This pattern is used heavily for centralized security and operations buses. The exam tests:

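  • A bus accepts events from another account only if its resource-based policy allows it; a dedicated custom bus is the usual choice over opening up the default bus.
  • Each PutEvents entry, cross-account included, is subject to the 256 KB event size limit.
  • Delivery is best-effort with retries; configure a dead-letter queue on the rule's target for failures.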
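The resource-based policy in step 1 can be sketched as a plain policy document. The account IDs, Region, and bus name below are placeholders, and the helper name is hypothetical; the structure is a standard IAM policy statement granting `events:PutEvents` on the receiving bus.

```python
import json


def cross_account_bus_policy(bus_arn, source_account_id):
    """Policy document letting one source account put events on the receiving bus."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": f"AllowPutEventsFrom{source_account_id}",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{source_account_id}:root"},
                "Action": "events:PutEvents",
                "Resource": bus_arn,
            }
        ],
    }


central_bus_arn = "arn:aws:events:us-east-1:222222222222:event-bus/central-ops"  # placeholder
policy = cross_account_bus_policy(central_bus_arn, "111111111111")
policy_json = json.dumps(policy, indent=2)   # the string form attached to the bus
```

In an organization with many source accounts, a condition on the organization ID is a common alternative to listing each account; that variation is not shown here.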

Archive and Replay

Archive captures all events that arrive at a bus, stores them in EventBridge for a retention period (1 day to indefinite), with optional pattern-based filtering. Replay re-emits archived events to the same bus over a chosen time window. Use cases:

  • Test a new rule against past traffic before deploying.
  • Reproduce an incident to debug downstream processing.
  • Backfill a Lambda after a target outage.

Replay does not include the original delivery — targets are re-invoked as if the events arrived live. Set up archives early because you cannot retroactively capture events that already passed through.
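The archive-then-replay flow can be outlined as two parameter sets, one for the CreateArchive API and one for StartReplay. All ARNs, names, and dates here are placeholders. `FilterArns` is the part of the replay destination that restricts which rules on the bus receive the replayed events.

```python
import json
from datetime import datetime, timedelta, timezone

BUS_ARN = "arn:aws:events:us-east-1:111111111111:event-bus/orders"          # placeholder
ARCHIVE_ARN = "arn:aws:events:us-east-1:111111111111:archive/orders-all"    # placeholder
TEST_RULE_ARN = "arn:aws:events:us-east-1:111111111111:rule/orders/fraud-v2-test"  # placeholder

# Step 1: start capturing events on the bus (only events from now on are stored).
archive_params = {
    "ArchiveName": "orders-all",
    "EventSourceArn": BUS_ARN,
    "EventPattern": json.dumps({"source": ["com.example.orders"]}),  # hypothetical source
    "RetentionDays": 30,
}

# Step 2: later, replay a 7-day window back onto the same bus.
window_end = datetime(2024, 1, 8, tzinfo=timezone.utc)   # example date
window_start = window_end - timedelta(days=7)

replay_params = {
    "ReplayName": "fraud-v2-backtest",
    "EventSourceArn": ARCHIVE_ARN,           # the archive to read from
    "EventStartTime": window_start,
    "EventEndTime": window_end,
    "Destination": {
        "Arn": BUS_ARN,                      # the bus the archive was created from
        "FilterArns": [TEST_RULE_ARN],       # only this rule sees replayed events
    },
}
```

Restricting the replay to one rule is what keeps existing production targets from being invoked a second time.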

Input Transformer — Reshape Before Delivery

The input transformer reshapes the matched event into a target-specific payload using JSONPath references. Two parts:

  • Input paths: a map from variable names to JSONPath expressions ({ "instance": "$.detail.instance-id", "state": "$.detail.state" }).
  • Input template: a string template using the variables ("Instance <instance> changed to <state>").

Use cases: convert AWS service events into Slack-friendly text via API destination, or shape an event into the schema an SSM Automation runbook expects.
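The two parts of an input transformer can be emulated locally to see what a target would receive. This is a rough sketch rather than EventBridge's implementation: the function names are hypothetical, and only simple `$.a.b` paths are supported.

```python
import re


def resolve(path, event):
    """Resolve a simple JSONPath of the form $.a.b.c against a dict."""
    if not path.startswith("$."):
        raise ValueError(f"unsupported path in this sketch: {path!r}")
    node = event
    for key in path[2:].split("."):
        node = node[key]
    return node


def transform(event, input_paths, input_template):
    """Fill <variable> placeholders in the template from the event."""
    values = {name: resolve(path, event) for name, path in input_paths.items()}
    return re.sub(r"<(\w+)>", lambda m: str(values[m.group(1)]), input_template)


event = {
    "source": "aws.ec2",
    "detail": {"instance-id": "i-0abc", "state": "stopped"},
}
input_paths = {"instance": "$.detail.instance-id", "state": "$.detail.state"}
message = transform(event, input_paths, "Instance <instance> changed to <state>")
```

Running a template against a captured sample event like this is a cheap way to catch a wrong path before the rule is deployed.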

Dead-Letter Queues and Retries

Each EventBridge target supports an optional dead-letter queue (an SQS queue) and a retry policy (maximum age of event, max retries). The default retry policy is 24 hours and 185 retries. Set a DLQ for any production target so failed deliveries are captured for postmortem instead of dropped silently.

A high-frequency DOP-C02 trap: a Lambda target throws an exception during deployment, the EventBridge rule retries 185 times over 24 hours, then drops the event. The team has no idea events were lost until customers report missing notifications. Always configure an SQS DLQ on every production target and alarm on ApproximateNumberOfMessagesVisible so dropped events trigger investigation. Reference: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-rule-dlq.html
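A target entry with an explicit retry policy and dead-letter queue might look like the sketch below, shaped as one element of the `Targets` list for the PutTargets API. The helper name and both ARNs are placeholders; the bounds checked here reflect the defaults quoted above (185 attempts, 24 hours), which are also the maximums.

```python
def target_with_dlq(target_id, target_arn, dlq_arn,
                    max_retries=185, max_age_seconds=86400):
    """Build one PutTargets entry with a retry policy and an SQS dead-letter queue."""
    if not 0 <= max_retries <= 185:
        raise ValueError("MaximumRetryAttempts must be between 0 and 185")
    if not 60 <= max_age_seconds <= 86400:
        raise ValueError("MaximumEventAgeInSeconds must be between 60 and 86400")
    return {
        "Id": target_id,
        "Arn": target_arn,
        "RetryPolicy": {
            "MaximumRetryAttempts": max_retries,
            "MaximumEventAgeInSeconds": max_age_seconds,
        },
        # Events that exhaust the retry policy land here instead of vanishing.
        "DeadLetterConfig": {"Arn": dlq_arn},
    }


notify_target = target_with_dlq(
    "notify-lambda",
    "arn:aws:lambda:us-east-1:111111111111:function:notify",   # placeholder
    "arn:aws:sqs:us-east-1:111111111111:notify-dlq",           # placeholder
    max_retries=8,
    max_age_seconds=3600,
)
```

Shortening the retry window, as in the usage above, surfaces a broken target in the DLQ within an hour rather than a day.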

Schema Registry

The schema registry stores JSON Schema or OpenAPI definitions for events. EventBridge auto-discovers schemas from incoming events on a bus (Schema Discovery) and lets you generate strongly typed code bindings (Java, Python, TypeScript, Go) for downstream consumers. Use this when you have many custom application events and want producer-consumer contract enforcement.

High-Frequency Exam Traps

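Trap 1: Lambda as a Direct Alarm Action

Older exam material treats direct Lambda as unavailable; alarms have supported Lambda as a direct action since late 2023. When the stem asks for retries, a dead-letter queue, or payload reshaping, the answer is still SNS → Lambda or alarm-state-change EventBridge rule → Lambda.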

Trap 2: EC2 Recover Requires EBS-Backed Root

Recover does not work on instance-store-backed instances. Distractor: "use the recover action on i3.xlarge with NVMe instance store" — wrong.

Trap 3: Composite Alarms Cannot Have Their Own Metric

A composite alarm only references other alarms. Cannot include a raw metric in the rule expression.

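Trap 4: Cross-Account Delivery Needs a Resource Policy

No bus accepts events from another account until its resource-based policy grants events:PutEvents to the sender. Use a custom bus with that policy to keep cross-account traffic separate from the AWS service events on the default bus.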

Trap 5: EventBridge Pattern Is Not JSON Matching

{"source": "aws.ec2"} (string) does not match. Must be {"source": ["aws.ec2"]} (array).

Trap 6: Scheduled Rule vs Scheduler at Scale

Per-account 300 scheduled-rule limit hits multi-tenant SaaS designs. Use EventBridge Scheduler for any volume above ~100 schedules.

Trap 7: Anomaly Detection Bands Train on Up to 14 Days of History

The model uses up to 14 days of history, and a metric with little history gets an unreliable band. Distractor: "create anomaly detection alarm immediately on a brand-new custom metric and rely on it for paging". The alarm can be created, but the band is not trustworthy until enough data accumulates.

Composite alarms use a rule string with ALARM("name"), OK("name"), INSUFFICIENT_DATA("name") functions and AND, OR, NOT boolean operators. EventBridge uses JSON patterns. They are unrelated grammars. The exam will mix them in distractors — pick the syntax that matches the service in the stem. Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Create_Composite_Alarm_How_To.html

DOP-C02 Exam Patterns and Worked Scenarios

Scenario 1: Suppress Pages During Deployment Window

Stem: "On-call engineers are paged 50 times during every deployment because alarms fire on instances cycling out." Right answer: composite alarm with NOT ALARM("Maintenance-Window") inhibitor. The maintenance window alarm is set manually or via deployment pipeline before/after the deployment.

Scenario 2: Auto-Stop Idle EC2 Instances Daily

Stem: "Stop development EC2 instances every weekday at 19:00 in user's local time." Right: EventBridge Scheduler with the user's IANA time zone, targeting the aws-sdk:ec2:stopInstances SDK target. Wrong: scheduled rule (UTC only).

Scenario 3: Multi-Region Aggregated Alarm

Stem: "Alarm when total error rate across us-east-1 and eu-west-1 exceeds 5%." Right: centralize both Regions' metrics first (for example, metric streams delivering into a central monitoring account), then run the alarm there with a metric math expression that aggregates the centralized data. A metric math expression in a single alarm cannot span Regions.

Scenario 4: Replay Past Events to Test New Logic

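Stem: "Team wants to test a new fraud-detection Lambda against last 7 days of order events without re-running the orders." Right: EventBridge archive on the orders bus, then a replay of the last 7 days. A replay goes back to the bus the archive was created from, so limit the replay to the rule that targets the new Lambda to keep production targets from being re-invoked.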

Scenario 5: Alarm on EC2 Recovery for Critical Instances

Stem: "An EC2 instance hosts a stateful database that cannot tolerate underlying hardware failure." Right: alarm on StatusCheckFailed_System (system status check, AWS infrastructure side) with the recover action; this rebuilds the instance on different hardware preserving instance ID, IP, and EBS volumes. Wrong: alarm on StatusCheckFailed_Instance (this is the instance OS side, recover does not help).

FAQ

Q1: When do I use a composite alarm vs metric math vs anomaly detection?

Use metric math when you need a derived numeric series with a fixed threshold (error rate = errors/requests). Use composite alarm when you need boolean logic across already-defined alarm states (suppress noise, correlate signals). Use anomaly detection when the metric has a seasonal pattern with no fixed threshold (daily traffic shape).

Q2: Should I use scheduled rules or EventBridge Scheduler?

For any new work, EventBridge Scheduler. It supports time zones, one-time schedules, flexible time windows, and 1M+ schedules per account. Scheduled rules remain for backward compatibility but are capped at 300 per account.

Q3: How do I target a Lambda with retries and a DLQ from a CloudWatch alarm?

Two patterns: (1) Alarm → SNS → Lambda (DLQ on the Lambda function itself, retry via SNS subscription's redrive policy). (2) Alarm state-change events flow to the default EventBridge bus; create a rule matching aws.cloudwatch source and CloudWatch Alarm State Change detail-type, target Lambda with rule-level DLQ and retry config. Pattern 2 is preferred for retry control.
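Pattern 2 hinges on one event pattern. The sketch below shows a pattern that selects transitions into ALARM, a trimmed sample of the alarm state-change event it is meant to match, and PutRule parameters that carry the pattern as a JSON string. The rule name and alarm name are placeholders, and the sample event is abbreviated to the fields the pattern touches.

```python
import json

alarm_pattern = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},   # only transitions into ALARM
}

# Abbreviated sample event (real events carry more fields).
sample_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {
        "alarmName": "WebTier-5xx",              # placeholder alarm name
        "state": {"value": "ALARM"},
        "previousState": {"value": "OK"},
    },
}

rule_params = {
    "Name": "alarm-to-remediation-lambda",       # placeholder rule name
    "EventPattern": json.dumps(alarm_pattern),   # the API takes the pattern as a string
    "State": "ENABLED",
}

# Hand check: each leaf of the pattern lists the value found in the sample event.
fires = (
    sample_event["source"] in alarm_pattern["source"]
    and sample_event["detail-type"] in alarm_pattern["detail-type"]
    and sample_event["detail"]["state"]["value"] in alarm_pattern["detail"]["state"]["value"]
)
```

The Lambda target, its retry policy, and its dead-letter queue are then attached to this rule, which is where pattern 2 gets its retry control.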

Q4: Can EventBridge events be delivered cross-Region?

Yes — the rule target can be an event bus ARN in another Region. The same resource policy and IAM role pattern as cross-account applies, plus inter-Region delivery is best-effort with extra latency. For high-volume cross-Region routing, prefer SQS or Kinesis.

Q5: Why does my new metric anomaly detection alarm sit in INSUFFICIENT_DATA forever?

Anomaly detection trains on up to 14 days of history, and the alarm cannot evaluate until the model has produced a band for the metric. A brand-new metric has little or no history, so the band is missing or unreliable at first. Use a static threshold for the first two weeks, then switch to anomaly detection.

Q6: What is the difference between an EventBridge rule on the default bus and on a custom bus?

The default bus receives AWS service events automatically. Custom buses receive only what you PutEvents to them. Use custom buses for application events, partner SaaS, and cross-account delivery. Rules can attach to either.

Q7: Can I use EventBridge to do real-time stream processing?

EventBridge is event routing, not stream processing. For high-throughput per-event processing with state, use Kinesis Data Streams or Kinesis Data Firehose. EventBridge is excellent for 0-50 events per second per rule with fan-out to many targets.

Cross-References

  • CloudWatch metrics and Logs Insights are the data layer for the alarms covered here; see cloudwatch-metrics-logs-insights.
  • EventBridge auto-remediation runbooks explain how the alarms-and-events chain ends in SSM Automation; see eventbridge-auto-remediation-runbooks.
  • Systems Manager Incident Manager explains how alarms escalate to on-call response plans; see systems-manager-incident-manager-health.
  • CloudTrail and Config dashboards explain audit-and-compliance event sources that share the same EventBridge fabric; see cloudtrail-config-audit-dashboards.
  • Deployment failure troubleshooting uses the same alarm-action chain to fail deployments and roll back; see deployment-failure-troubleshooting.
