EventBridge auto-remediation with SSM Automation runbooks is the core mechanism by which DOP-C02 expects you to close the loop between detection and fix. An event arrives — a Config rule flags a non-compliant resource, a CloudWatch alarm fires, AWS Health publishes a degraded-service notice, GuardDuty surfaces a finding, an EC2 status check fails. EventBridge routes the event to an SSM Automation document that performs an idempotent set of steps to restore the desired state. No human in the loop unless explicitly required. Every Domain 5 stem with the words "automatically", "without manual intervention", "self-healing", or "remediate" is asking you to design or pick this chain. Get the document type wrong, get the IAM role wrong, get the idempotency wrong, or fail to set the dead-letter queue, and the chain becomes the next outage instead of preventing one.
This guide assumes you understand basic EventBridge rules and basic SSM Automation. It focuses on the DOP-C02 implementation depth: SSM document schema version 0.3 for Automation, the AWS-owned runbook catalog (over 300 prebuilt documents like AWS-StopEC2Instance, AWS-RebootRdsInstance, AWS-DisableS3BucketPublicReadWrite), authoring custom YAML/JSON documents with aws:executeAwsApi, aws:executeScript, aws:branch, aws:approve, aws:waitForAwsResourceProperty steps, runbook inputs and outputs, assume-role semantics, Config-triggered remediation (manual vs auto, retry behavior), EventBridge → Lambda vs EventBridge → SSM Automation trade-offs, Step Functions as the orchestration choice when you need long-running workflows, idempotency patterns to safely retry, approvals via Change Manager and aws:approve, rollback patterns when remediation itself fails, and the dead-letter queue for events whose target failed to invoke. Domain 5.1 and 5.2 of the DOP-C02 exam guide cover exactly this material.
Why EventBridge Auto-Remediation Matters on DOP-C02
Domain 5 (Incident and Event Response) is not new in DOP-C02; DOP-C01 had a domain of the same name. Even so, multiple community study reports list it as one of the highest-pain domains because there is less third-party study material. The exam tests not the existence of the services but the wiring: which event source connects to which router, which connects to which executor, which connects to which target. Missing one hop or picking the wrong hop loses the question.
The exam style here is architectural-pattern recognition. A typical stem reads: "When an Amazon EC2 instance reports StatusCheckFailed_System, the team wants the instance automatically recovered if it has an EBS root, or the Auto Scaling Group instructed to terminate-and-replace if it has instance store. The remediation path must include human approval for production-tagged instances. Which architecture meets these requirements?" The wrong answer offers Lambda doing all the logic. The right answer is EventBridge rule on the alarm state-change event → SSM Automation document with aws:branch (EBS vs instance-store) → for production tag, an aws:approve step routed via SNS to the on-call engineer → conditional RebootInstances or TerminateInstances API call. The exam tests that you know aws:approve exists and aws:branch exists, and that putting all this in raw Lambda is operationally inferior because runbooks are visible, versioned, and centrally manageable.
- SSM Automation document (runbook): a YAML or JSON document defining a sequence of steps that the Systems Manager Automation service executes.
- Schema version: documents have versions; 0.3 is the Automation schema, 2.2 is the current Command (Run Command) schema, and State Manager associations need a Command document at 2.0 or later.
- AWS-owned runbook: an AWS-published document (prefix AWS-) like AWS-StopEC2Instance, AWS-PatchInstanceWithRollback. 300+ prebuilt; safe defaults; no need to write your own.
- Custom runbook: a document you author. Permissions, steps, inputs, and outputs all under your control.
- Step: an action within the document. Common types: aws:executeAwsApi, aws:executeScript (Python or PowerShell), aws:branch, aws:approve, aws:waitForAwsResourceProperty, aws:runCommand, aws:invokeLambdaFunction, aws:assertAwsResourceProperty.
- Automation IAM role: the role Systems Manager assumes to execute the document. Must be in the document's assumeRole field or passed at runtime.
- EventBridge rule: matches incoming events and dispatches to one or more targets, with optional input transformer, retries, and DLQ.
- Config remediation action: a Config-specific binding from a non-compliant rule finding to an SSM document, with automatic or manual mode.
- Step Functions: AWS's orchestration service for long-running, branching, retrying workflows; used when SSM Automation is not powerful enough.
- Change Manager: SSM's approval-and-tracking workflow built on top of Automation; required for production-changes governance.
- OpsItem: a ticket-like artifact in OpsCenter that runbooks can create or update for human follow-up.
- Reference: https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-automation.html
Plain-Language Explanation: EventBridge Auto-Remediation Runbooks
The auto-remediation loop is conceptually simple but mechanically intricate. Three analogies clarify the moving parts.
Analogy 1: The Smart Home Alarm System
Imagine a smart home with sensors everywhere — smoke, water leak, motion, door, window. Each sensor (a CloudWatch alarm, Config rule, GuardDuty finding) reports to a central hub (EventBridge) when a condition is met. The hub has a rule book (EventBridge rules with patterns) that says "if smoke AND not in cooking-window, run the Smoke Response routine; if water leak in basement, run the Water Shutoff routine; if door opens at 3am AND alarm is armed, run the Police Notification routine". Each routine is a sequence of pre-defined steps — turn off the gas valve, unlock the front door, send a notification, log the event. That sequence is an SSM Automation document. Some routines have a "wait for owner approval" step (aws:approve) before doing the truly disruptive action like cutting power. The owner approves via SMS link. If the owner doesn't approve in 10 minutes, the routine times out and falls back to a safer default (e.g., notify rather than cut power). This is exactly the EventBridge → Automation document → conditional approval pattern that DOP-C02 tests.
Analogy 2: The Hospital Code Blue Response
When a patient's heart rate drops below threshold, the cardiac monitor (CloudWatch alarm) raises a Code Blue (EventBridge event). The hospital intercom (EventBridge rule) broadcasts it to specific channels: the cardiology fellow on call, the rapid-response nurse, the central nursing station. Each respondent has a runbook in their pocket (SSM Automation document): "Step 1: confirm pulse. Step 2: start CPR. Step 3: call for crash cart. Step 4: administer medication X. Step 5: document in chart." The runbook is idempotent — even if two nurses arrive and both start step 1, no harm done because confirming pulse is safe to repeat. Step Functions is the equivalent for longer, multi-day care plans — admit-to-ICU, multi-day medication schedule, discharge-planning, follow-up. Code Blue is short and Automation handles it; ICU is long and Step Functions handles it. The chain ends with an OpsItem (incident chart in the hospital case-management system) that the morning team reviews to see what happened overnight.
Analogy 3: The Factory Production Line Self-Heal
A car-parts factory has hundreds of robotic arms. When a sensor reports an anomaly — coolant low, vibration high, motor overheat — the factory event bus dispatches the right repair routine. Common routines are pre-canned (the AWS-owned runbook catalog): "low coolant → top up to baseline level". Custom factory-specific routines are authored by the maintenance team (custom YAML documents). For routines that change the production schedule (e.g., shut down a line for an hour), the routine has an approval step routed to the shift supervisor via radio (SNS). The supervisor either approves or rejects within a defined timeout. If approved, the line shuts down and the Change Manager records the formal change for audit. If the routine itself fails (e.g., the top-up valve is stuck), it raises a higher-priority event that escalates to a human technician. This is the rollback-and-escalate pattern: remediation runbooks should themselves emit failure events that trigger fallback runbooks or human paging.
For DOP-C02 stems centered on "event-driven response with conditional approval", reach for the smart home alarm analogy. For stems centered on "incident chains with idempotency and ticketing", reach for the hospital code blue analogy. For stems centered on "factory-floor self-healing with formal change tracking", reach for the production-line self-heal analogy. Reference: https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/welcome.html
SSM Automation Document Anatomy
A document has top-level keys:
schemaVersion: '0.3'
description: Restart EC2 instance with health check
parameters:
  InstanceId:
    type: String
    description: Target EC2 instance ID
  AutomationAssumeRole:
    type: String
    description: ARN of the role Automation assumes to run the steps
assumeRole: '{{ AutomationAssumeRole }}'
mainSteps:
  # Stop the instance, wait until it reports stopped, then start it again.
  - name: stopInstance
    action: aws:executeAwsApi
    inputs:
      Service: ec2
      Api: StopInstances
      InstanceIds:
        - '{{ InstanceId }}'
  - name: waitStopped
    action: aws:waitForAwsResourceProperty
    inputs:
      Service: ec2
      Api: DescribeInstances
      InstanceIds:
        - '{{ InstanceId }}'
      PropertySelector: $.Reservations[0].Instances[0].State.Name
      DesiredValues:
        - stopped
  - name: startInstance
    action: aws:executeAwsApi
    inputs:
      Service: ec2
      Api: StartInstances
      InstanceIds:
        - '{{ InstanceId }}'
The document captures the entire procedure as versioned, reviewable code, executable from EventBridge, the console, the CLI, or via Config remediation.
Common step types
- aws:executeAwsApi: any AWS SDK call; the workhorse step.
- aws:executeScript: inline Python or PowerShell; use for logic the SDK does not handle directly.
- aws:branch: conditional next-step based on input or prior step output.
- aws:approve: pause and wait for human approval via SNS notification with approve/reject links.
- aws:waitForAwsResourceProperty: poll until a resource property reaches a desired value.
- aws:assertAwsResourceProperty: assertion that fails the doc if the property is not as expected.
- aws:runCommand: invoke an SSM Run Command document on EC2 instances.
- aws:invokeLambdaFunction: call a Lambda for complex custom logic.
- aws:createStack: deploy a CloudFormation stack as part of remediation.
- aws:changeInstanceState: shorthand for stop/start/terminate/reboot.
IAM and assume role
Documents declare an assumeRole placeholder. At runtime, you pass the role ARN. The role must have permissions for every API call the document makes. A common pattern: the EventBridge rule's target role assumes a separate Automation execution role scoped to the specific document and resource set.
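The two-role split described above can be sketched as plain policy documents. This is a minimal illustration under stated assumptions, not a vetted policy: the account ID, Region, role name, and the document name Custom-RestartInstance are hypothetical placeholders, and the execution role would additionally need whatever API permissions its steps call.

```python
import json

# Hypothetical placeholders for illustration only.
ACCOUNT_ID = "111122223333"
REGION = "us-east-1"
DOCUMENT_NAME = "Custom-RestartInstance"
AUTOMATION_ROLE_ARN = f"arn:aws:iam::{ACCOUNT_ID}:role/RemediationAutomationRole"


def automation_trust_policy() -> dict:
    """Trust policy that lets the Systems Manager service assume the execution role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ssm.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def eventbridge_target_policy() -> dict:
    """Permissions for the role EventBridge assumes to start the runbook."""
    document_arn = (
        f"arn:aws:ssm:{REGION}:{ACCOUNT_ID}:automation-definition/{DOCUMENT_NAME}:*"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Start only this one runbook.
            {
                "Effect": "Allow",
                "Action": "ssm:StartAutomationExecution",
                "Resource": document_arn,
            },
            # Allow handing the execution role to Systems Manager, and nothing else.
            {
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": AUTOMATION_ROLE_ARN,
                "Condition": {
                    "StringEquals": {"iam:PassedToService": "ssm.amazonaws.com"}
                },
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(automation_trust_policy(), indent=2))
    print(json.dumps(eventbridge_target_policy(), indent=2))
```

Keeping the target role limited to starting one document and passing one role is what makes the separation useful: a misconfigured rule cannot start arbitrary runbooks.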
The DOP-C02 exam will offer "write a Lambda function to remediate" or "author a custom SSM document" as options when an AWS-owned runbook already exists. The correct answer is almost always to use the AWS-owned runbook (AWS-DisableS3BucketPublicReadWrite, AWS-RebootRdsInstance, AWS-StopEC2Instance, AWS-PatchInstanceWithRollback, etc.) because they are tested and maintained by AWS. Author custom only when no AWS-owned document covers the case. Reference: https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-documents.html
EventBridge → SSM Automation Pattern
The canonical wiring:
- Event source publishes to EventBridge (CloudWatch Alarm state-change, Config Compliance Change, AWS Health, GuardDuty Finding, custom application event).
- EventBridge rule matches the event with a pattern, e.g., {"source": ["aws.config"], "detail-type": ["Config Rules Compliance Change"], "detail": {"newEvaluationResult": {"complianceType": ["NON_COMPLIANT"]}}}.
- Target: SSM Automation execution. Configure the document name, the document version, and an input transformer that maps event fields to document parameters (InstanceId from $.detail.resourceId).
- DLQ (dead-letter queue) on the rule target captures failed deliveries for postmortem.
- Retry policy on the target sets max age and max attempts.
The IAM role on the rule target needs ssm:StartAutomationExecution on the document ARN. The Automation document's role needs all the AWS API permissions for its steps.
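The wiring above can be sketched in Python as data plus a toy matcher. The target dictionary follows the shape that the EventBridge PutTargets API accepts; the ARNs passed in, the target Id, and the retry numbers are hypothetical. The matcher is a deliberate simplification (exact-value lists and nested objects only) and does not reproduce EventBridge's full pattern language.

```python
# Pattern from the text: non-compliant Config rule evaluations.
CONFIG_NONCOMPLIANT_PATTERN = {
    "source": ["aws.config"],
    "detail-type": ["Config Rules Compliance Change"],
    "detail": {"newEvaluationResult": {"complianceType": ["NON_COMPLIANT"]}},
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: lists mean 'any of these values', dicts recurse."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


def automation_target(document_arn: str, target_role_arn: str, dlq_arn: str) -> dict:
    """One rule target: runbook ARN, role, input transformer, retry policy, DLQ."""
    return {
        "Id": "remediate-with-ssm",  # hypothetical target ID
        "Arn": document_arn,  # automation-definition ARN of the runbook
        "RoleArn": target_role_arn,
        "InputTransformer": {
            # Pull the resource ID out of the event...
            "InputPathsMap": {"instance": "$.detail.resourceId"},
            # ...and place it in the runbook parameter (values are string lists).
            # EventBridge expands <instance>; this template is not valid JSON
            # until that substitution has happened.
            "InputTemplate": '{"InstanceId": [<instance>]}',
        },
        # Illustrative values, tighter than the service defaults.
        "RetryPolicy": {"MaximumRetryAttempts": 8, "MaximumEventAgeInSeconds": 3600},
        "DeadLetterConfig": {"Arn": dlq_arn},
    }
```

A dictionary like this would be one element of the Targets list in a put_targets call; building it separately makes it easy to assert in a unit test that no production target ships without a DeadLetterConfig.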
Config Rule Auto-Remediation
Config integrates more tightly than the generic EventBridge → SSM pattern. Per rule, configure a remediation action:
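- Document: the SSM Automation document name.
- Parameters: map non-compliant resource attributes to document inputs. The special parameter RESOURCE_ID is auto-populated.
- Mode: Automatic (run immediately on non-compliance) or Manual (require operator click).
- Retry: MaximumAutomaticAttempts (default 5) and RetryAttemptSeconds (default 60), the time window in which those attempts are counted.
- Concurrency: the percentage of non-compliant resources remediated in parallel (ConcurrentExecutionRatePercentage, default 10).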
Manual mode is the right choice when the remediation could affect production stability and you want a human ticket-and-review step. Automatic mode is right for low-risk, well-tested remediations like "remove public S3 ACL" or "enable EBS encryption on a new volume".
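As a rough sketch, the remediation settings listed above map onto one entry of the payload that Config's PutRemediationConfigurations API accepts. The helper name and the role ARN argument are hypothetical; S3BucketName and AutomationAssumeRole are the inputs this guide assumes AWS-DisableS3BucketPublicReadWrite takes, so confirm them against the runbook's own parameter list before use.

```python
def remediation_configuration(rule_name: str, role_arn: str, automatic: bool = True) -> dict:
    """Build one remediation configuration binding a Config rule to a runbook."""
    config = {
        "ConfigRuleName": rule_name,
        "TargetType": "SSM_DOCUMENT",
        "TargetId": "AWS-DisableS3BucketPublicReadWrite",
        "Parameters": {
            # Static value: the same execution role for every remediation.
            "AutomationAssumeRole": {"StaticValue": {"Values": [role_arn]}},
            # Dynamic value: Config substitutes the non-compliant resource's ID.
            "S3BucketName": {"ResourceValue": {"Value": "RESOURCE_ID"}},
        },
        "Automatic": automatic,
    }
    if automatic:
        # Retry limits only make sense when Config starts the runbook itself.
        config["MaximumAutomaticAttempts"] = 5
        config["RetryAttemptSeconds"] = 60
        config["ExecutionControls"] = {
            "SsmControls": {
                "ConcurrentExecutionRatePercentage": 10,
                "ErrorPercentage": 50,
            }
        }
    return config
```

Switching automatic to False yields the manual-mode variant: the binding still exists, but an operator has to trigger it.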
Step Functions for Long-Running Workflows
SSM Automation executions have a 12-hour maximum execution time when they run in the context of the calling user, and this guide treats 12 hours as the cutover point. For longer workflows (multi-day account onboarding, week-long phased migrations, complex saga patterns), use AWS Step Functions, which offers two workflow types:
- Standard workflows: up to 1 year, exactly-once execution, full audit history.
- Express workflows: high-throughput, at-least-once, up to 5 minutes.
Step Functions can call SSM Automation as a service integration (aws-sdk:ssm:startAutomationExecution), so the patterns compose: EventBridge → Step Functions → multiple SSM documents → notification.
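The composition can be sketched as an Amazon States Language definition built in Python. This is an illustrative outline only: the state names, the SNS topic ARN argument, and the input field instanceId are assumptions, and the small checker below validates nothing beyond dangling transitions.

```python
import json


def build_definition(topic_arn: str) -> dict:
    """State machine that starts one runbook and then publishes a notification."""
    return {
        "StartAt": "RunRemediation",
        "States": {
            "RunRemediation": {
                "Type": "Task",
                # AWS SDK service integration named in the text.
                "Resource": "arn:aws:states:::aws-sdk:ssm:startAutomationExecution",
                "Parameters": {
                    "DocumentName": "AWS-StopEC2Instance",
                    # Runbook parameters are lists of strings.
                    "Parameters": {"InstanceId.$": "States.Array($.instanceId)"},
                },
                "Retry": [
                    {
                        "ErrorEquals": ["States.TaskFailed"],
                        "IntervalSeconds": 5,
                        "MaxAttempts": 3,
                        "BackoffRate": 2.0,
                    }
                ],
                "Next": "Notify",
            },
            "Notify": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sns:publish",
                "Parameters": {
                    "TopicArn": topic_arn,
                    "Message.$": "$.AutomationExecutionId",
                },
                "End": True,
            },
        },
    }


def undefined_transitions(definition: dict) -> list:
    """Return StartAt/Next targets that do not name an existing state."""
    states = definition["States"]
    targets = [definition["StartAt"]]
    targets += [s["Next"] for s in states.values() if "Next" in s]
    return [t for t in targets if t not in states]


if __name__ == "__main__":
    print(json.dumps(build_definition("arn:aws:sns:us-east-1:111122223333:ops"), indent=2))
```

One design point to keep in mind: StartAutomationExecution returns as soon as the execution is accepted, so a workflow that must wait for the runbook to finish needs its own polling loop (for example, a Wait state followed by a status check).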
Idempotency Patterns
Every remediation document must be idempotent — running it twice should produce the same end state without error. Common patterns:
- Check-before-act: aws:assertAwsResourceProperty step to verify the current state; skip the action if already in desired state.
- Use idempotent APIs: PutBucketEncryption is idempotent; CreateBucket is not (errors on existing).
- Idempotency tokens: AWS APIs that accept a ClientRequestToken parameter (CloudFormation, SQS, etc.) enable safe retry.
- Tag-based guard: tag the resource on first remediation; subsequent runs check the tag and skip if already remediated.
Without idempotency, retries amplify failures: when an execution fails on step 5 of 10 and the runbook is started again (a Config retry, a duplicate event, an operator re-run), steps 1-4 run a second time, possibly creating duplicate resources.
A common DOP-C02 trap: a remediation Lambda increments a counter on each invocation; under EventBridge retry, the counter ends up too high. The fix is to make the Lambda idempotent: track processed event IDs in DynamoDB or use the event.id as a deduplication key. The same applies to SSM documents: a step-level retry (maxAttempts) repeats only the failed step, but a fresh execution replays every step from the start, so steps 1 and 2 must be safe to repeat. Reference: https://docs.aws.amazon.com/lambda/latest/dg/invocation-async.html
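The deduplication idea can be shown with a small, self-contained sketch. The DedupStore class is an in-memory stand-in for a DynamoDB table keyed on event ID (in DynamoDB the claim would typically be a conditional put that fails when the key already exists); the class and method names are hypothetical.

```python
class DedupStore:
    """In-memory stand-in for a table of already-processed event IDs."""

    def __init__(self) -> None:
        self._seen = set()

    def claim(self, event_id: str) -> bool:
        """Return True only the first time an event ID is offered."""
        if event_id in self._seen:
            return False
        self._seen.add(event_id)
        return True


class Remediator:
    """Handler whose side effect runs at most once per event ID."""

    def __init__(self, store: DedupStore) -> None:
        self.store = store
        self.remediations = 0

    def handle(self, event: dict) -> str:
        # EventBridge events carry a unique top-level "id" field.
        if not self.store.claim(event["id"]):
            return "skipped"
        self.remediations += 1  # the non-idempotent side effect
        return "remediated"


if __name__ == "__main__":
    handler = Remediator(DedupStore())
    event = {"id": "evt-1", "detail": {"resourceId": "i-0123456789abcdef0"}}
    # Simulate the same event being delivered three times.
    print([handler.handle(event) for _ in range(3)], handler.remediations)
```

The counter ends at 1 regardless of how many times the event is redelivered, which is the property the trap is probing for.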
Approvals — aws:approve and Change Manager
Two patterns for human-in-the-loop:
aws:approve step
In the document itself, an aws:approve step pauses execution and sends an SNS message with approve/reject links. Up to 10 approvers; configurable minimum approvals required (MinRequiredApprovals). The default timeout is 7 days; if the required approvals have not arrived by then, the step times out and the disruptive action does not run.
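A hedged sketch of such a step, expressed as the Python dictionary that would be serialized into the runbook's mainSteps. The step name, message text, and the ARNs supplied by the caller are hypothetical; the input keys (NotificationArn, Message, MinRequiredApprovals, Approvers) and timeoutSeconds are the property names as this guide understands them.

```python
def approve_step(topic_arn: str, approvers: list, min_approvals: int = 1,
                 timeout_seconds: int = 3600) -> dict:
    """Build an aws:approve step with an explicit timeout."""
    if not 1 <= len(approvers) <= 10:
        raise ValueError("aws:approve accepts between 1 and 10 approvers")
    if not 1 <= min_approvals <= len(approvers):
        raise ValueError("min_approvals must be between 1 and the approver count")
    return {
        "name": "approveRemediation",  # hypothetical step name
        "action": "aws:approve",
        # Override the 7-day default so a stale request cannot linger.
        "timeoutSeconds": timeout_seconds,
        # Stop the runbook if approval is rejected or times out.
        "onFailure": "Abort",
        "inputs": {
            "NotificationArn": topic_arn,
            "Message": "Approve remediation of the production resource?",
            "MinRequiredApprovals": min_approvals,
            "Approvers": list(approvers),
        },
    }
```

Validating the approver count in code keeps the mistake out of the runbook itself.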
Change Manager
Change Manager wraps the entire document execution in a change request with templates, approval workflows, and audit history. Use Change Manager when you need formal change-management compliance: SOC 2, ISO 27001, healthcare regulators. Change Manager templates define:
- Required approvers per change type.
- Allowed change windows (e.g., "production changes only Sunday 02:00-04:00").
- Required template fields and rationale documentation.
The DOP-C02 exam tests that you know aws:approve is a lighter-weight pattern (just an approval gate inside one runbook) and Change Manager is the heavier formal-governance pattern (change request, template, calendar enforcement).
OpsCenter Integration
When auto-remediation cannot fully resolve an issue, runbooks should create an OpsItem in OpsCenter for human review. OpsItems include:
- Source resource ARN.
- Severity.
- Related alarms, X-Ray traces, CloudWatch dashboards.
- Notes from the runbook execution.
- Suggested next-step runbooks.
OpsCenter becomes the central queue for "issues that need a human" — the right destination when the automation chain partially succeeded or when human judgment is required.
- Source: CloudWatch alarm, Config rule, AWS Health, GuardDuty, custom event.
- EventBridge rule: pattern matches the event; input transformer reshapes it.
- Target: SSM Automation execution (or Lambda or Step Functions).
- Target IAM role: assumed by EventBridge to call ssm:StartAutomationExecution on the document.
- Document IAM role: assumed by Automation service to call AWS APIs.
- DLQ: SQS queue capturing failed target deliveries.
- Retry policy: max age (default 24h), max attempts (default 185).
- Document idempotency: assert-before-act, idempotent APIs, tag-based guards.
- Approval: aws:approve for inline gate, Change Manager for formal change request.
- Fallback: OpsItem creation if automation cannot fully remediate. Reference: https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-automation.html
High-Frequency Exam Traps
Trap 1: Lambda vs SSM Automation
When the action is "make a series of AWS API calls in a fixed sequence", prefer SSM Automation. Lambda is right for arbitrary code, custom logic, or non-AWS API calls. The exam will offer Lambda as a distractor for simple multi-step API sequences; pick the runbook.
Trap 2: 12-Hour Automation Limit
If the workflow runs longer than 12 hours, you need Step Functions. SSM Automation will time out.
Trap 3: aws:approve Default Timeout
7 days. If the stem says "approve within 1 hour", set timeoutSeconds on the aws:approve step explicitly (3600 for one hour).
Trap 4: Config Auto-Remediation Concurrency
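Config throttles remediation by rate rather than by a fixed count: ConcurrentExecutionRatePercentage defaults to 10 percent of the rule's non-compliant resources. For mass remediation across many resources, raise the percentage, or remediate via EventBridge → SSM directly to bypass Config's execution controls.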
Trap 5: EventBridge Retry vs Lambda Async Retry
EventBridge has its own retry-and-DLQ on the rule target. Lambda async invocation also has its own retries and DLQ. They are layered — configure both.
Trap 6: Document Schema Version Mismatch
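Automation requires schemaVersion: '0.3'. Run Command documents use 2.2 (1.2 and 2.0 are older Command schemas). State Manager associations need a Command document at 2.0 or later. A document whose schema version does not match its type is rejected rather than executed.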
Trap 7: Cross-Account Remediation Requires Trust
To remediate resources in a member account from a central audit account, the audit account's Automation role assumes a role in each member account. The member account must have a trust policy. CloudFormation StackSets or Control Tower can deploy these trust roles uniformly.
A high-frequency DOP-C02 trap: a remediation chain silently drops events when the SSM Automation execution fails to start (e.g., role permission missing, document not found in the Region). Without a DLQ on the EventBridge target, the team has no record of dropped events. Always configure an SQS DLQ on production EventBridge targets, alarm on ApproximateNumberOfMessagesVisible > 0, and route DLQ messages to OpsCenter for human investigation. Reference: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-rule-dlq.html
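The alarm on the DLQ can be sketched as the parameter set for CloudWatch's PutMetricAlarm API. The alarm name, queue name, and SNS topic ARN are hypothetical; the period and missing-data handling are illustrative choices rather than recommendations from the source.

```python
def dlq_alarm_params(queue_name: str, alarm_topic_arn: str) -> dict:
    """Alarm that fires as soon as any message is visible in the DLQ."""
    return {
        "AlarmName": f"{queue_name}-not-empty",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        # An empty, idle queue should not look like a breach.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [alarm_topic_arn],
    }
```

With boto3 this dictionary would be passed as keyword arguments to the CloudWatch client's put_metric_alarm call.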
DOP-C02 Exam Patterns and Worked Scenarios
Scenario 1: Disable Public S3 Buckets Across Org
Stem: "An organization needs all S3 buckets that become public-read to be remediated within 5 minutes." Right: Config rule s3-bucket-public-read-prohibited deployed via conformance pack to every account; automatic remediation with AWS-DisableS3BucketPublicReadWrite. Wrong: Lambda polling all buckets every 5 minutes.
Scenario 2: Quarantine GuardDuty High-Severity Findings
Stem: "When GuardDuty finds a compromised EC2 instance, immediately move it to an isolation security group and create an investigation ticket." Right: EventBridge rule on GuardDuty finding with severity ≥ 7; target SSM Automation document that (1) modifies the security group, (2) creates an OpsItem with the finding details, (3) snapshots the instance for forensics. Check the Systems Manager Automation runbook reference for an AWS-owned containment runbook to start from before authoring a custom one.
Scenario 3: Auto-Restart RDS on Failure with Approval
Stem: "When RDS reports a failover-failed state, restart it; if it's a production instance, require approval first." Right: SSM document with aws:branch on a Production tag; production branch has aws:approve step routed to on-call SNS topic; non-production branch goes straight to RebootDBInstance.
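The branch in this scenario can be sketched as step definitions plus a tiny local simulator. The step names and the Environment and DBInstanceId parameters are hypothetical, and the simulator only understands the StringEquals operator; it is a reading aid, not a reimplementation of the Automation engine.

```python
# aws:branch step: production goes through approval, everything else reboots directly.
BRANCH_STEP = {
    "name": "chooseByEnvironment",
    "action": "aws:branch",
    "inputs": {
        "Choices": [
            {
                "NextStep": "approveProduction",
                "Variable": "{{ Environment }}",  # hypothetical runbook parameter
                "StringEquals": "production",
            }
        ],
        "Default": "rebootInstance",
    },
}

# Terminal step shared by both paths.
REBOOT_STEP = {
    "name": "rebootInstance",
    "action": "aws:executeAwsApi",
    "inputs": {
        "Service": "rds",
        "Api": "RebootDBInstance",
        "DBInstanceIdentifier": "{{ DBInstanceId }}",  # hypothetical parameter
    },
    "isEnd": True,
}


def choose_next(branch: dict, resolved_value: str) -> str:
    """Return the step the branch would jump to for an already-resolved variable."""
    for choice in branch["inputs"]["Choices"]:
        if choice.get("StringEquals") == resolved_value:
            return choice["NextStep"]
    return branch["inputs"]["Default"]


if __name__ == "__main__":
    for env in ("production", "staging"):
        print(env, "->", choose_next(BRANCH_STEP, env))
```

The approveProduction step itself would be an aws:approve step whose next step is rebootInstance, so both paths converge on the same API call.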
Scenario 4: Long-Running Multi-Account Onboarding
Stem: "Onboarding a new account takes 3 days: SCP attachment, CloudFormation StackSet deployment, IAM Identity Center provisioning, Config aggregator inclusion." Right: Step Functions standard workflow orchestrating the multi-day chain; SSM Automation handles individual short tasks.
Scenario 5: EBS Volume Encryption Drift
Stem: "Detect any unencrypted EBS volume and either snapshot-encrypt-restore or alert if it cannot be safely re-encrypted in place." Right: Config rule encrypted-volumes → automatic remediation AWSConfigRemediation-EncryptUnencryptedVolume (which snapshots, encrypts, swaps).
FAQ
Q1: Should I use SSM Automation, Lambda, or Step Functions for remediation?
Use SSM Automation for short (≤12 hour) AWS-API-driven workflows where AWS-owned runbooks are likely to exist. Use Lambda for arbitrary code or non-AWS API calls. Use Step Functions for long-running (>12 hour) or complex branching workflows. Step Functions can also orchestrate multiple SSM documents.
Q2: How do I make a remediation runbook idempotent?
Three patterns: (1) assert-before-act with aws:assertAwsResourceProperty to verify current state. (2) Use idempotent APIs (PutBucketEncryption, PutBucketPolicy). (3) Tag-based guards — first run tags the resource, subsequent runs skip if tag exists.
Q3: When does Config auto-remediation retry?
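Config auto-remediation retries within the limits set by MaximumAutomaticAttempts (default 5) and RetryAttemptSeconds (default 60, the time window in which those attempts are counted). Once the limit is reached, Config records a remediation exception for the resource and stops retrying; the resource remains NON_COMPLIANT, so escalate through a separate path such as an EventBridge rule on the failed Automation execution.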
Q4: How do I require approval before remediation?
Two options: (1) Inline aws:approve step in the SSM document (lightweight). (2) Wrap the document execution in Change Manager with a template requiring approver sign-off (formal change-management). Use Change Manager when audit/compliance cares about the change-request paper trail.
Q5: How do I remediate resources in a member account from a central account?
Deploy a trust role in every member account (via CloudFormation StackSets or Control Tower) allowing the central account's Automation role to assume it. The central execution then assumes that member-account role to run the runbook steps against the member account's resources.
Q6: What if my remediation runbook itself fails?
The SSM Automation execution emits failure events to EventBridge. Catch those with a follow-up rule that creates an OpsItem and pages the on-call engineer. Build remediation chains with explicit failure handling, not blind retry.
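A sketch of that follow-up rule and the OpsItem it would open. The detail-type string and the detail field names (ExecutionId, Definition, Status) reflect this guide's recollection of the Automation status-change event and should be verified against a real event before use; the Source label and severity are hypothetical choices.

```python
# Pattern for runbook executions that ended badly.
AUTOMATION_FAILURE_PATTERN = {
    "source": ["aws.ssm"],
    "detail-type": ["EC2 Automation Execution Status-change Notification"],
    "detail": {"Status": ["Failed", "TimedOut"]},
}


def ops_item_request(event: dict) -> dict:
    """Build CreateOpsItem parameters from an Automation status-change event."""
    detail = event["detail"]
    return {
        "Title": f"Remediation runbook {detail['Definition']} ended {detail['Status']}",
        "Description": (
            f"Automation execution {detail['ExecutionId']} did not complete. "
            "Review the failed step in the execution history."
        ),
        "Source": "auto-remediation",  # hypothetical source label
        "Severity": "2",
    }


if __name__ == "__main__":
    sample = {
        "source": "aws.ssm",
        "detail-type": "EC2 Automation Execution Status-change Notification",
        "detail": {
            "ExecutionId": "00000000-0000-0000-0000-000000000000",
            "Definition": "Custom-RestartInstance",
            "Status": "Failed",
        },
    }
    print(ops_item_request(sample)["Title"])
```

The returned dictionary is shaped for the SSM client's create_ops_item call; paging the on-call engineer would be a second target on the same rule.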
Q7: How do I debug a failed remediation?
Check the SSM Automation execution history for the failed step and its error message. Look at the EventBridge rule's DLQ for events that never even started. Look at CloudTrail for the StartAutomationExecution API call to see who invoked. Look at CloudWatch Logs for any aws:executeScript step output.
Cross-References
- CloudWatch alarms and EventBridge are the trigger layer for these remediation chains; see cloudwatch-alarms-eventbridge-integration.
- CloudTrail and Config dashboards provide the detection signals that fire remediation; see cloudtrail-config-audit-dashboards.
- Incident Manager and AWS Health handle higher-severity escalations beyond automated remediation; see systems-manager-incident-manager-health.
- Deployment failure troubleshooting uses the same EventBridge → SSM pattern for rollback automation; see deployment-failure-troubleshooting.
- CloudWatch metrics and Logs Insights provide observability into remediation execution; see cloudwatch-metrics-logs-insights.