Systems Manager Incident Manager and AWS Health Events

DOP-C02 deep dive on Systems Manager Incident Manager response plans, escalation, contacts, AWS Health Dashboard, Health API, organization-wide health events, EventBridge integration, and post-incident analysis automation.

Systems Manager Incident Manager and AWS Health are the human side of incident and event response on DOP-C02. Where EventBridge auto-remediation closes loops without humans, Incident Manager engages humans correctly when they're needed: paging the right on-call rotation, opening a shared chat channel, attaching the right runbook, escalating up the contact ladder if the first responder doesn't acknowledge, and capturing the timeline for post-incident review. AWS Health is the inbound feed for AWS-side events affecting your account: scheduled maintenance, service degradation in a Region, deprecation notices, EC2 instance retirements, certificate expirations. The two services compose: Health publishes the event, EventBridge routes it, and Incident Manager activates the response plan that pages your team and tracks the response.

This guide assumes you understand basic on-call concepts (rotation, escalation, paging) and basic Systems Manager. It focuses on the DOP-C02 implementation depth: Incident Manager response plans, engagement plans, escalation plans, contacts, chat channels (Slack and Microsoft Teams via AWS Chatbot), incident records with timeline and post-incident analysis, runbook attachment to response plans for guided execution, AWS Health Dashboard Personal Health (per-account events) vs Service Health (global service status), AWS Health Organizational View for cross-account aggregation, AWS Health API programmatic access (Business and Enterprise Support tiers), EventBridge integration for aws.health events, OpsCenter as the workhorse ticketing layer below the incident threshold, and the relationship between automated remediation, OpsItems, and full-blown incidents. Domain 5.1 (Manage event sources to process, notify, and take action) and Domain 5.3 (Troubleshoot system and application failures) are the home of this material.

Why Incident Manager and AWS Health Matter on DOP-C02

The DOP-C02 exam guide explicitly lists "AWS Health" alongside EventBridge and CloudTrail as the three event sources DevOps engineers must integrate. Multiple community study reports flag AWS Health and Incident Manager as underrepresented in third-party prep material — exactly the territory the exam likes to test. The exam style here is integration-architecture: how does an incoming Health event become a paging chain that wakes the right engineer with the right runbook, and how does the response get captured for post-incident review without manual paperwork?

A typical DOP-C02 stem reads: "AWS announces an EC2 instance retirement notice for 12 instances across 3 accounts. The platform team needs to be paged automatically, with a chat channel auto-created for coordination, and a Jira-equivalent ticket opened for tracking the migration." The wrong answers cluster around custom Lambda + custom paging integration. The right answer is AWS Health Organizational View as the aggregator, EventBridge organization-level rule matching aws.health EC2 Instance Retirement Scheduled events, target = Incident Manager response plan that engages the platform on-call engagement plan, opens a chat channel via AWS Chatbot, creates an OpsItem in OpsCenter, and (optionally) starts an SSM Automation runbook that drains and replaces the affected instances. The exam tests whether you know each of these components exists and how they wire together.

  • Incident Manager: AWS Systems Manager component for incident response — response plans, contacts, escalation, paging, chat channels, post-incident analysis.
  • Response plan: a configuration that defines what to do when an incident is created — title template, severity, engagements, runbook, chat channels.
  • Engagement plan: a sequence of a contact's channels (SMS, voice, email) tried in order with time delays.
  • Escalation plan: a multi-stage engagement with multiple contacts per stage and configurable progression rules.
  • Contact: a person or team identity in Incident Manager with one or more contact channels (phone, email, SMS).
  • Incident record: a created incident with timeline, engagements, runbook execution, and post-incident analysis.
  • Chat channel: an Incident Manager artifact that bridges the incident to a Slack or Microsoft Teams channel via AWS Chatbot.
  • Post-incident analysis (PIA): a structured review document attached to the incident, with sections for timeline, contributing factors, action items.
  • AWS Health: the service that publishes events about AWS infrastructure affecting your account.
  • Personal Health Dashboard (PHD): per-account health events visible to that account.
  • Service Health Dashboard (SHD): the public, AWS-wide service status page.
  • AWS Health Organizational View: aggregated health events across all accounts in an Organization, viewable from the management or delegated administrator account.
  • AWS Health API: programmatic access to events (Business and Enterprise Support only).
  • OpsCenter / OpsItem: a lightweight ticket-like artifact, lower severity than an incident.
  • Reference: https://docs.aws.amazon.com/incident-manager/latest/userguide/what-is-incident-manager.html

Plain-Language Explanation: Incident Manager and AWS Health

The on-call response domain is conceptually familiar but the AWS-specific wiring trips engineers up. Three analogies clarify the parts.

Analogy 1: The Hospital Code Team

Imagine a hospital's response to a patient deterioration. The cardiac monitor (CloudWatch alarm) trips a Code Blue (EventBridge event) which activates the Code Blue Team Response Plan (Incident Manager response plan). The plan dispatches via the hospital's paging system: first, the on-call resident (engagement plan stage 1, SMS); if no acknowledgment in 2 minutes, the cardiology fellow (stage 2, voice call); if still no response, the attending (stage 3, escalation). A Code Blue chat channel is auto-created (Incident Manager chat channel via AWS Chatbot), and the resuscitation runbook is pulled up on the team tablet (SSM Automation document attached to response plan). After the event resolves, the attending leads a morbidity-and-mortality conference documenting the timeline, contributing factors, and process-improvement actions (the post-incident analysis). AWS Health in this analogy is the hospital's infectious disease alert system — it tells you "there's a new variant circulating in the community" (a Region-wide service degradation) before any individual patient codes. Smart hospitals integrate the alert into the response plan: if a Code Blue happens during a community outbreak, the plan branches to add infectious-disease precautions automatically.

Analogy 2: The Fire Department Dispatch

When a smoke alarm trips (CloudWatch alarm), the building's fire panel calls 911 (EventBridge → Incident Manager). The dispatcher activates the Fire Response Plan which routes by location and severity: the closest engine company (engagement plan stage 1), with backup from a second house if it's a multi-alarm fire (escalation). On-scene command opens a dedicated radio channel (Incident Manager chat channel) and pulls up the building pre-plan runbook showing the floor plan, hazardous materials, hydrant locations, and shutoff valves (SSM Automation document with relevant context). After the fire is out, the captain writes the incident report including cause, timeline, response performance, and recommendations (post-incident analysis). AWS Health is the equivalent of the National Weather Service's red flag warning — "extreme fire danger today across the region". Fire departments adjust staffing and pre-position resources when these warnings arrive. AWS DevOps teams adjust monitoring and pre-position runbooks when an AWS Health "increased error rates in us-east-1" event arrives.

Analogy 3: The Airline Operations Center During Diversion

A flight has a mechanical problem mid-flight (a service alarm). The pilot calls dispatch (EventBridge), which activates the Diversion Response Plan. Dispatch engages: the on-duty operations director (stage 1), the maintenance hub at the diversion airport (stage 2), the passenger services team (stage 2 parallel), and escalates to the VP of operations if the diversion is to a foreign country requiring coordination with consulate (escalation stage 3). A dedicated bridge is opened — pilots, maintenance, ops, customer service all on one call (chat channel). The diversion runbook is opened: paperwork, customs, hotel arrangements, rebooking (SSM Automation runbook with mostly informational steps and some automated like booking notifications). After the flight is safely on the ground, the safety team writes the incident report for the regulator (post-incident analysis). AWS Health is the FAA NOTAMs feed — "Runway 27R closed at JFK from 14:00-18:00 UTC for repair". The airline's ops dashboards integrate NOTAMs and adjust scheduling automatically; AWS DevOps teams integrate Health events and adjust deployment windows automatically.

For DOP-C02 stems centered on "alarm fires and the right people get paged with the right runbook", reach for the hospital code team analogy. For stems centered on "multi-alarm escalation across teams", reach for the fire department analogy. For stems centered on "AWS infrastructure event triggering proactive response", reach for the airline ops center with NOTAMs analogy. Reference: https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/welcome.html

Incident Manager Response Plan Anatomy

A response plan is the central template. Components:

Title and severity

A free-text title template that supports incident-creation-time interpolation (Critical: API Latency High - {{IncidentId}}). Severity is 1-5 with 1 highest.

Engagements

Up to 5 engagements per plan, where each engagement is either:

  • A direct contact (a person), or
  • A contact group (a team), or
  • An escalation plan (a staged sequence).

When the incident starts, all engagements fire in parallel (each pages its own contact channels in sequence per its engagement plan).

Chat channel

Optionally attach an AWS Chatbot configuration. When the incident starts, Incident Manager auto-creates a Slack or Teams channel and posts incident updates, runbook progress, and team comments back to the timeline.

Runbook

Attach a single SSM Automation document. The runbook executes as part of incident creation; engineers can pause, resume, and modify steps from the incident console. Runbook output appears in the timeline.

Incident tags and template

Attached to created incidents for filtering, search, and downstream automation.

Trigger

Response plans trigger from:

  • CloudWatch alarms with the action arn:aws:ssm-incidents:::response-plan/....
  • EventBridge rules with target = response plan ARN.
  • Manual creation via console or API.
  • CodeDeploy deployment failures (auto-create incident option).
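The components and the alarm trigger above can be sketched as plain request dictionaries. This is a minimal sketch, not a definitive implementation: the account ID, plan name, SNS topic, runbook name, role ARN, and alarm thresholds are all hypothetical, and the five-engagement cap is taken from this guide rather than checked against current quotas. The dictionaries are shaped after the `ssm-incidents` CreateResponsePlan and CloudWatch PutMetricAlarm request parameters as I understand them; verify field names against the current API reference before use.

```python
def response_plan_arn(account_id: str, plan_name: str) -> str:
    """Build a response plan ARN (account ID present, Region segment empty)."""
    return f"arn:aws:ssm-incidents::{account_id}:response-plan/{plan_name}"


def build_response_plan_request(name, title, impact, engagement_arns,
                                chatbot_sns_arn, runbook_name, runbook_role_arn):
    """Assemble keyword arguments shaped for CreateResponsePlan."""
    if not 1 <= impact <= 5:
        raise ValueError("impact must be between 1 (highest) and 5")
    if len(engagement_arns) > 5:  # cap as stated in this guide (assumption)
        raise ValueError("at most 5 engagements per response plan")
    return {
        "name": name,
        "incidentTemplate": {"title": title, "impact": impact},
        "engagements": list(engagement_arns),          # all fire in parallel
        "chatChannel": {"chatbotSns": [chatbot_sns_arn]},
        "actions": [{                                   # one runbook per plan
            "ssmAutomation": {
                "documentName": runbook_name,
                "roleArn": runbook_role_arn,
            }
        }],
    }


def build_alarm_request(alarm_name: str, plan_arn: str) -> dict:
    """Assemble keyword arguments shaped for PutMetricAlarm.

    The metric and threshold are illustrative; the key point is that the
    response plan ARN goes into AlarmActions.
    """
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/ApiGateway",
        "MetricName": "5XXError",
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 3,
        "Threshold": 0.05,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [plan_arn],
    }
```

If boto3 is installed, these dictionaries are intended to be passed as `create_response_plan(**plan)` on the `ssm-incidents` client and `put_metric_alarm(**alarm)` on the `cloudwatch` client. Keeping the builders free of network calls makes them easy to unit test.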

Engagement Plans — The Paging Sequence

An engagement plan defines how a single contact is paged. An example sequence:

  1. SMS to phone (delay 0).
  2. Voice call to phone (delay 5 minutes).
  3. Email (delay 10 minutes).

Each step has a contact channel, an order, and a delay (in minutes from incident start). The engagement stops as soon as the contact acknowledges via any channel. Each contact has their own engagement plan.
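To make the paging sequence concrete, here is a small simulation of which channels fire before a contact acknowledges. It is a sketch of the behavior described above, not Incident Manager code; the `Step` type and `channels_paged` function are hypothetical names, and treating an acknowledgment at exactly a step's scheduled minute as stopping that step is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    channel: str
    delay_minutes: int  # minutes after the incident starts


def channels_paged(steps: List[Step], ack_minute: Optional[int] = None) -> List[str]:
    """Return the channels that fire before the contact acknowledges.

    ack_minute=None means the contact never acknowledges, so every step fires.
    """
    fired = []
    for step in sorted(steps, key=lambda s: s.delay_minutes):
        # Stop as soon as the acknowledgment precedes (or coincides with) a step.
        if ack_minute is not None and step.delay_minutes >= ack_minute:
            break
        fired.append(step.channel)
    return fired


# The example sequence from the text: SMS at 0, voice at 5, email at 10.
PLAN = [Step("sms", 0), Step("voice", 5), Step("email", 10)]
```

With this plan, a contact who acknowledges at minute 3 receives only the SMS, while one who acknowledges at minute 7 receives the SMS and the voice call.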

Escalation Plans — Cross-Contact Staging

An escalation plan strings multiple contacts together with progression rules. Stages run sequentially; within a stage, all contacts page in parallel. Stage progression:

  • All contacts acknowledged: engagement complete.
  • No acknowledgment after timeout: progress to next stage.
  • Some but not all contacts in a stage acknowledge: the plan can be configured so that an acknowledgment stops progression, or so that it continues to the next stage.

Use escalation plans for "primary on-call → secondary on-call → manager" patterns.

A common DOP-C02 confusion: candidates think the response plan's 5 engagements are sequential. They are parallel — all 5 fire when the incident starts. To get sequential staging across teams, wrap them in an escalation plan with explicit stages. The right pattern: response plan engagement = "Platform On-Call Escalation"; the escalation plan inside has stage 1 (primary), stage 2 (secondary, after 5 min), stage 3 (manager, after 10 min). Reference: https://docs.aws.amazon.com/incident-manager/latest/userguide/escalation.html
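The primary → secondary → manager pattern can be sketched as a staged plan structure. This is a hedged sketch: the dictionary follows the `Plan` parameter of the `ssm-contacts` CreateContact call (with `Type` set to `ESCALATION`) as I understand it, where `DurationInMinutes` is the wait before the next stage begins. The contact ARNs are hypothetical.

```python
from itertools import accumulate


def build_escalation_plan(stages):
    """stages: list of (contact_arns, minutes_to_wait_before_next_stage)."""
    return {
        "Stages": [
            {
                "DurationInMinutes": minutes,
                "Targets": [
                    # IsEssential=True: this contact's acknowledgment stops
                    # the plan from progressing to later stages.
                    {"ContactTargetInfo": {"ContactId": arn, "IsEssential": True}}
                    for arn in contact_arns
                ],
            }
            for contact_arns, minutes in stages
        ]
    }


def stage_start_minutes(plan):
    """Minute (from incident start) at which each stage begins paging."""
    durations = [stage["DurationInMinutes"] for stage in plan["Stages"]]
    return [0] + list(accumulate(durations))[:-1]


BASE = "arn:aws:ssm-contacts:us-east-1:111122223333:contact/"  # hypothetical
PLAN = build_escalation_plan([
    ([BASE + "primary"], 5),    # stage 1: primary on-call
    ([BASE + "secondary"], 5),  # stage 2: starts at minute 5
    ([BASE + "manager"], 0),    # stage 3: starts at minute 10, last stage
])
```

Here `stage_start_minutes(PLAN)` yields the 0, 5, and 10 minute marks from the pattern above. The response plan would then list this one escalation plan as its engagement rather than listing the three contacts directly.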

Chat Channel Integration via AWS Chatbot

AWS Chatbot bridges AWS services to Slack and Microsoft Teams. For Incident Manager:

  1. Configure Chatbot with the Slack/Teams workspace and target channel.
  2. Create the chat channel in Incident Manager referencing the Chatbot configuration.
  3. Attach the chat channel to a response plan.

When an incident is created, Chatbot can either post to an existing channel or auto-create a per-incident channel (configurable). The bot posts:

  • Incident open/close events.
  • Engagement acknowledgments.
  • Runbook step progress.
  • Comments added to the timeline.

Engineers can post comments back to the channel which Incident Manager records in the timeline.

AWS Health — The Inbound Event Feed

AWS Health publishes events in three categories:

Issue events

Real-time service issues affecting your resources. Examples:

  • Increased error rates in a Region.
  • Connectivity issues to a service.
  • Performance degradation.

Scheduled change events

Upcoming maintenance affecting your resources. Examples:

  • EC2 instance scheduled retirement.
  • RDS database upgrade window.
  • Certificate Manager auto-renewal failure.
  • IAM access key rotation reminder.

Account notifications

Account-specific notices: limit increases, billing alerts, abuse reports, security best-practice reminders.

Personal Health vs Service Health Dashboard

  • Personal Health Dashboard (PHD) is per-account; it only shows events affecting your specific resources. Free.
  • Service Health Dashboard (SHD) is the public AWS-wide status page. Not account-specific.

The DOP-C02 exam tests that Personal Health is the right source for account-specific automation; SHD is just informational.

Organizational View

AWS Health Organizational View aggregates events from every member account into the management or delegated administrator account. Enabled at the Organizations level; provides a single console for the platform team to see all impacted accounts.

AWS Health API

The Health API requires a Business, Enterprise On-Ramp, or Enterprise Support plan. It is used for programmatic event retrieval and filtering, and it is not available with Basic or Developer Support.

A common DOP-C02 trap: candidates design a custom Lambda polling the AWS Health API for proactive event handling, not realizing the customer is on the Developer Support plan where the API is not available. The fix: use EventBridge which surfaces Health events at all support tiers (free), and let EventBridge route to your handlers. The Health API is only needed for bulk historical retrieval, which most use cases don't require. Reference: https://docs.aws.amazon.com/health/latest/ug/health-api.html

EventBridge Integration for Health

AWS Health publishes events to the default EventBridge bus with source: aws.health. Common detail-types:

  • AWS Health Event (issues, scheduled changes, and account notifications).
  • AWS Health Abuse Event.

The detail includes service, eventTypeCode, eventTypeCategory, affectedEntities, and eventRegion.

Pattern examples:

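  • Match all EC2 scheduled changes, which include retirement notices: {"source": ["aws.health"], "detail": {"service": ["EC2"], "eventTypeCategory": ["scheduledChange"]}}. To match retirements only, also filter on "eventTypeCode": ["AWS_EC2_INSTANCE_RETIREMENT_SCHEDULED"].
  • Match all issue events, including Region-wide service issues: {"source": ["aws.health"], "detail": {"eventTypeCategory": ["issue"]}}.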

Targets typically include Incident Manager response plan, Lambda for custom logic, SNS for human notification, OpsCenter for ticket creation.
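To check a pattern against a sample event before deploying a rule, a tiny local matcher helps. The sketch below implements only a simplified subset of EventBridge matching (exact values in lists, nested objects) and ignores prefix, numeric, and other operators, so it is an approximation for reasoning rather than a substitute for the service. The sample event is trimmed and illustrative.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Return True if every field in the pattern matches the event.

    A pattern value is either a nested dict (recurse) or a list of
    allowed exact values, as in the patterns shown above.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


EC2_SCHEDULED = {
    "source": ["aws.health"],
    "detail": {"service": ["EC2"], "eventTypeCategory": ["scheduledChange"]},
}
ALL_ISSUES = {"source": ["aws.health"], "detail": {"eventTypeCategory": ["issue"]}}

# Trimmed, illustrative event; real events carry more fields.
RETIREMENT_EVENT = {
    "source": "aws.health",
    "detail-type": "AWS Health Event",
    "detail": {
        "service": "EC2",
        "eventTypeCategory": "scheduledChange",
        "eventTypeCode": "AWS_EC2_INSTANCE_RETIREMENT_SCHEDULED",
    },
}
```

The retirement event matches `EC2_SCHEDULED` but not `ALL_ISSUES`, which is the separation you want when retirements go to a planning queue and issues go to paging.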

OpsCenter — Tickets Below the Incident Threshold

OpsCenter holds OpsItems for issues that do not warrant full incident response. Examples:

  • Config rule non-compliance findings (after attempted auto-remediation).
  • Low-severity GuardDuty findings.
  • AWS Health scheduled change events for review.
  • Failed pipeline executions.

OpsItems have severity, status (Open/InProgress/Resolved), assignment, related resources, and runbook recommendations. The DOP-C02 exam pattern: automated remediation that fully succeeds → no OpsItem; partial success or human review needed → OpsItem; production-impacting urgent → Incident Manager incident.
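The three-way split in that exam pattern reduces to a small decision function. This is an illustrative sketch of the guide's rule of thumb; the `Outcome` enum, the `triage` function, and its boolean inputs are hypothetical names, not part of any AWS API.

```python
from enum import Enum


class Outcome(Enum):
    NONE = "no ticket"
    OPS_ITEM = "OpsItem"
    INCIDENT = "Incident Manager incident"


def triage(remediation_succeeded: bool,
           needs_human_review: bool,
           production_impacting: bool) -> Outcome:
    """Classify an event using the rule of thumb from the text."""
    # Urgent, production-impacting events always page, regardless of remediation.
    if production_impacting:
        return Outcome.INCIDENT
    # Partial success or a required review becomes a ticket without paging.
    if not remediation_succeeded or needs_human_review:
        return Outcome.OPS_ITEM
    # Fully remediated and nothing to review: no artifact needed.
    return Outcome.NONE
```

Checking production impact first is a deliberate ordering: a successful auto-remediation should not suppress paging for an outage that users are still experiencing.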

Post-Incident Analysis (PIA)

After an incident is resolved, Incident Manager generates a PIA template with:

  • Timeline (auto-populated from the incident timeline).
  • Contributing factors.
  • Resolution.
  • Lessons learned.
  • Action items with owners and due dates.

Action items can integrate with Jira, ServiceNow, or OpsCenter for tracking. PIAs are searchable across the team and become institutional knowledge.

Recap of the response plan anatomy:

  1. Title template with interpolation.
  2. Severity 1-5.
  3. Engagements (up to 5, parallel).
  4. Chat channel via AWS Chatbot (Slack/Teams).
  5. Runbook (single SSM Automation document attached).
  6. Tags for filtering and downstream automation.
  7. Incident template with default summary.

Triggers: CloudWatch alarm action ARN, EventBridge rule target ARN, manual API call, CodeDeploy auto-create option. Engagement plans run sequentially per contact (SMS → voice → email with delays). Escalation plans run stages sequentially across multiple contacts. Reference: https://docs.aws.amazon.com/incident-manager/latest/userguide/response-plans.html

High-Frequency Exam Traps

Trap 1: Engagements Are Parallel, Not Sequential

5 engagements on a response plan all fire at once. For sequential cross-team paging, use an escalation plan with explicit stages.

Trap 2: Personal Health vs Service Health

Personal Health is account-specific (PHD). Service Health is the public AWS-wide status page (SHD). For automation, use Personal Health via EventBridge.

Trap 3: Health API Support Tier Restriction

Health API requires Business or Enterprise Support. Health EventBridge integration works at all tiers including Basic.

Trap 4: One Runbook per Response Plan

A response plan attaches one SSM Automation document. To run multiple sequenced runbooks, build a parent runbook that calls children, or use Step Functions.

Trap 5: Incident Manager Is Regional

Incident Manager is a Regional service. For multi-Region readiness, configure a replication set so that response plans, contacts, and incidents replicate across the Regions you choose instead of being rebuilt in each Region.

Trap 6: Chat Channel Auto-Creation vs Reuse

The chat channel can be a fixed channel (everyone joins on incident start) or per-incident (auto-created and archived). Per-incident is cleaner for postmortem but requires Slack/Teams admin permissions for auto-create.

Trap 7: OpsCenter vs Incident Manager Severity

OpsItem is a lightweight ticket; Incident Manager incident is a full paging-and-runbook event. Misclassifying noise as incidents pages the on-call too often; misclassifying real incidents as OpsItems delays response.

Incident Manager itself is a regional service. If the primary Region is impaired, you may not be able to create incidents in that Region. The fix is the replication set — a list of Regions across which Incident Manager state is replicated. Configure 2-3 Regions for the replication set; create response plans only once and they replicate. This is the right answer for "what if the on-call paging system itself is in the Region that's down". Reference: https://docs.aws.amazon.com/incident-manager/latest/userguide/replication-set.html

DOP-C02 Exam Patterns and Worked Scenarios

Scenario 1: AWS Health-Driven On-Call Engagement

Stem: "AWS announces an EBS volume scheduled maintenance for production volumes; team needs paging and a Jira ticket auto-created." Right: EventBridge rule on aws.health source filtered to EBS scheduledChange, targets = (1) Incident Manager response plan engaging the storage on-call, (2) a Lambda function or an EventBridge API destination that creates the Jira ticket through the Jira REST API.

Scenario 2: Multi-Stage Escalation for P1 Outage

Stem: "When the API error rate exceeds 5%, page the primary on-call; if not acknowledged in 5 min, page the secondary; if not acknowledged in another 5 min, page the manager." Right: response plan with escalation plan stage 1 = primary, stage 2 = secondary (5 min), stage 3 = manager (10 min).

Scenario 3: Incident Auto-Created from CodeDeploy Failure

Stem: "Production deployment failure must immediately trigger an incident with the deployment runbook attached." Right: enable CodeDeploy Alarm integration with a CloudWatch alarm; the alarm action triggers the response plan whose attached runbook is AWS-RollbackDeployment or a custom rollback document.

Scenario 4: Cross-Account AWS Health Aggregation

Stem: "Platform team needs visibility into Health events across 50 member accounts." Right: enable AWS Health Organizational View at the management account; create EventBridge rules at the management/delegated admin account on the org-wide bus.

Scenario 5: Post-Incident Action Item Tracking

Stem: "After incident resolution, team needs structured PIA with action items tracked to closure." Right: use Incident Manager's built-in post-incident analysis template with action items synced to OpsCenter or Jira.

FAQ

Q1: When do I use OpsItem vs Incident Manager incident?

OpsItem for low/medium-severity issues that need human review but don't warrant paging. Examples: Config rule non-compliance (post auto-remediation), AWS Health scheduled change for future planning. Incident Manager incident for production-impacting events requiring immediate paging and runbook execution.

Q2: How do I integrate Incident Manager with PagerDuty or Opsgenie?

Two paths: (1) Use Incident Manager directly, since it includes full paging; PagerDuty is unnecessary unless you have requirements that only it covers, and consolidating vendors reduces cost. (2) Keep PagerDuty as the paging platform; use EventBridge → API destination → PagerDuty REST API for routing. The DOP-C02 exam prefers the AWS-native answer (Incident Manager) when no specific multi-vendor constraint is given.

Q3: Can I attach multiple runbooks to a response plan?

No — one document per plan. Workarounds: build a parent SSM Automation document that calls children sequentially with aws:executeAutomation, or use Step Functions as the orchestrator and attach a thin runbook that starts the state machine.
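The parent-runbook workaround can be sketched by generating the parent document's content. This is a minimal sketch assuming SSM Automation schema version 0.3 and the `aws:executeAutomation` action with a `DocumentName` input; the child document names are hypothetical, and a real parent would usually also pass `RuntimeParameters` and an assume role.

```python
import json


def build_parent_runbook(child_documents):
    """Return Automation document content that runs children in order.

    Steps in mainSteps execute sequentially by default, so list order
    is execution order.
    """
    return {
        "schemaVersion": "0.3",
        "description": "Parent runbook that calls child runbooks in sequence.",
        "mainSteps": [
            {
                "name": f"runChild{index}",
                "action": "aws:executeAutomation",
                "inputs": {"DocumentName": document_name},
            }
            for index, document_name in enumerate(child_documents, start=1)
        ],
    }


# Hypothetical child runbooks for illustration only.
PARENT = build_parent_runbook(["Hypothetical-DrainInstance",
                               "Hypothetical-ReplaceInstance"])
PARENT_JSON = json.dumps(PARENT, indent=2)
```

The resulting JSON would be registered as a single Automation document, and that one document is what the response plan attaches.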

Q4: What if my Region is impaired and I can't access Incident Manager there?

Use a replication set spanning 2-3 Regions. Incident Manager replicates response plans, contacts, and incidents across Regions. Create incidents in any Region of the set when the primary is impaired.
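On the client side, the failover decision amounts to picking the first healthy Region in the replication set. The sketch below is illustrative only: `pick_incident_region` is a hypothetical helper, and how you learn that a Region is impaired (health checks, failed API calls) is outside its scope.

```python
def pick_incident_region(replication_set, impaired_regions):
    """Return the first Region in the replication set that is not impaired.

    replication_set is ordered by preference, primary first.
    """
    impaired = set(impaired_regions)
    for region in replication_set:
        if region not in impaired:
            return region
    raise RuntimeError("every Region in the replication set is impaired")


REPLICATION_SET = ["us-east-1", "us-west-2", "eu-west-1"]  # example Regions
```

A caller would use the returned Region when constructing the client that starts the incident, so that paging still works while the primary Region is down.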

Q5: How does AWS Health Organizational View differ from per-account Personal Health?

Personal Health shows events affecting one account's resources. Organizational View aggregates Personal Health from every member account into the management or delegated admin account. Required for cross-account platform-team visibility.

Q6: Does AWS Health also publish account-level notifications, not just infrastructure events?

Yes. AWS Health publishes service-limit-approaching warnings, billing anomaly alerts, abuse notifications, and account-level security best-practice reminders. Filter by eventTypeCategory: accountNotification to route those separately from infrastructure events.

Q7: How do I prevent alert fatigue from noisy Health events?

Filter at the EventBridge rule level — match only specific services and eventTypeCategory: issue for paging-worthy events; route scheduledChange events to OpsCenter for review-without-paging; archive accountNotification events for monthly review.
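The same routing policy can be written as a function over the event's detail, which is useful inside a Lambda target or as a test oracle for your rules. This is a sketch under stated assumptions: the `PAGING_SERVICES` allow-list and the destination labels are hypothetical, and in practice most of this filtering belongs in the EventBridge patterns themselves.

```python
# Hypothetical allow-list of services whose issues are worth paging for.
PAGING_SERVICES = {"EC2", "RDS", "LAMBDA"}


def route_health_event(detail: dict) -> str:
    """Map a Health event detail to a destination label."""
    category = detail.get("eventTypeCategory")
    service = detail.get("service")
    if category == "issue" and service in PAGING_SERVICES:
        return "incident"   # page via the response plan
    if category == "scheduledChange":
        return "opsitem"    # review without paging
    if category == "accountNotification":
        return "archive"    # batch for monthly review
    return "ignore"         # everything else is dropped
```

Issues for services outside the allow-list fall through to `ignore` here; whether those should instead become OpsItems is a policy choice for your team.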

Cross-References

  • EventBridge auto-remediation runbooks cover the SSM Automation pattern that Incident Manager response plans attach to; see eventbridge-auto-remediation-runbooks.
  • CloudWatch alarms and EventBridge explain the alarm-to-incident triggering chain; see cloudwatch-alarms-eventbridge-integration.
  • CloudTrail and Config dashboards provide the audit trail that incident timelines reference; see cloudtrail-config-audit-dashboards.
  • Deployment failure troubleshooting uses Incident Manager when deployment failures escalate to user-impacting outage; see deployment-failure-troubleshooting.
  • CloudWatch metrics and Logs Insights provide the data behind incident timelines and PIA root-cause analysis; see cloudwatch-metrics-logs-insights.
