Deployment failure troubleshooting is the most-frequent real-world DevOps task and one of the most-tested scenarios on DOP-C02. Every Domain 5.3 stem is a variation: "the deployment failed, here are the symptoms, what is the root cause and what is the fix?" Multiple community study reports flag this domain as where many candidates lose points because the troubleshooting requires knowing the failure surface of every deployment service — CodePipeline, CodeBuild, CodeDeploy, CloudFormation, ECS, Lambda alias traffic shifting, Auto Scaling group deployments, Elastic Beanstalk — plus the rollback signals each service uses and the lifecycle hooks where things commonly break.
This guide assumes you understand basic CI/CD concepts (build → test → deploy) and basic CodePipeline structure. It focuses on the DOP-C02 implementation depth: CodePipeline failed-action diagnosis (transient vs permanent, retry strategy), CodeBuild buildspec.yml phase failures (install, pre_build, build, post_build, finally), CodeBuild reports and exit codes, CodeDeploy appspec.yml lifecycle hooks (BeforeInstall, AfterInstall, ApplicationStart, ValidateService, BeforeAllowTraffic, AfterAllowTraffic, BeforeBlockTraffic, AfterBlockTraffic), CodeDeploy alarm-based rollback, blue/green vs in-place failure modes, CloudFormation rollback configuration with monitoring period and alarms, CloudFormation hooks for pre-deployment validation, ECS deployment circuit breaker and rolling vs blue/green failure semantics, Lambda alias traffic shifting with pre/post traffic hooks, CloudWatch Synthetics canaries as deployment-validation tests, and the rollback-signal hierarchy (alarm trigger > exit code > timeout > manual). Domain 5.3 (Troubleshoot system and application failures) is the home of this material with significant overlap into Domain 1.4.
Why Deployment Failure Troubleshooting Matters on DOP-C02
DOP-C02 weights SDLC Automation at 22 percent (highest of any domain) and Incident Response at 14 percent. Together that is 36 percent of the exam, and the overlap between them is exactly deployment failures. The exam expects you to (1) identify the failed component, (2) find the right log or signal, (3) determine the root cause, (4) prescribe both the immediate fix and the long-term prevention. A typical stem reads: "A CodeDeploy in-place deployment to 50 EC2 instances has 12 instances showing 'BeforeInstall' lifecycle event timeouts, 8 instances at 'AllowTraffic' marked failed by the load balancer health check, and 30 instances pending. CodeDeploy reports the deployment as Failed. The team needs the immediate rollback action and the underlying fix." The wrong answer is "redeploy". The right answer triages each batch separately: stuck at BeforeInstall is usually agent or script issue (check /opt/codedeploy-agent/deployment-root/.../scripts/... logs); stuck at AllowTraffic is target group health check failure (verify health check path returns 200, security group allows ALB SG, container startup time fits the deregistration delay); pending instances will be unaffected because the deployment is in failed state. The exam tests this triage instinct.
Community study reports specifically flag buildspec.yml and appspec.yml syntax as must-memorize because the exam shows partial YAML and asks you to identify the broken phase or hook. They also flag deployment circuit breaker for ECS and alarm-based rollback for CodeDeploy as high-frequency topics introduced in DOP-C02 that older C01 prep material misses.
- CodePipeline action: a step in a stage; failure of one action fails the stage, optionally with retry.
- CodeBuild phase: a section of buildspec (
install,pre_build,build,post_build,finally); each phase has a single command list and exit code. - CodeBuild reports: test result reports (JUnit, NUnit, Cucumber, etc.) the build can produce and CodeBuild aggregates.
- CodeDeploy lifecycle hook: a named point during deployment where a hook script can run (BeforeInstall, AfterInstall, etc.); failure to complete a hook within the timeout fails the instance deployment.
- CodeDeploy alarm: a CloudWatch alarm associated with a deployment; if the alarm enters ALARM state during the deployment, it fails and rolls back.
- CodeDeploy rollback: the action of reverting to the previous successful revision; can be on alarm, on failure, or manual.
- CloudFormation rollback configuration: stack-level monitoring period and alarm list; if any alarm fires within the period after stack creation/update, the stack rolls back.
- CloudFormation hook: a service-side validation extension (Lambda or third-party) that gates resource creation/update with policy checks.
- ECS deployment circuit breaker: a service-level setting that automatically rolls back ECS deployments when a configured number of consecutive task failures occur.
- Lambda traffic shifting: routing a percentage of traffic to a new alias version (linear or canary), with optional pre/post traffic Lambda hooks.
- CloudWatch Synthetic canary: a scheduled scripted test (typically Puppeteer) that validates a service endpoint and emits metrics for alarming.
- Reference: https://docs.aws.amazon.com/codedeploy/latest/userguide/welcome.html
Plain-Language Explanation: Deployment Failure Troubleshooting
The deployment-failure surface is broad and the root-cause hierarchy can feel overwhelming. Three analogies reduce it to triage instinct.
Analogy 1: The Restaurant Kitchen Service Line During a Failed Plate
A new dish arrives at the pass and the expediter rejects it (the deployment fails the canary). Where did the failure come from? The triage walks the line backwards: was the plating wrong (CodeDeploy AfterInstall hook), or was the sauce off (BeforeInstall script that prepares the environment), or was the protein undercooked (the actual application binary), or was the ticket entered wrong (CodePipeline source action sending the wrong artifact), or was the prep wrong upstream (CodeBuild pre_build phase). Each hand-off has a clear handoff log; the expediter's job is to walk back through them in order. The alarm is the customer's first complaint reaching the manager (alarm-based rollback); the circuit breaker is the line cook saying "we've burned 3 in a row, stop sending tickets" (ECS deployment circuit breaker auto-rolling back). The canary is the manager tasting one plate from each batch before any goes out (CloudWatch Synthetic canary running before promoting traffic). The lifecycle hooks are the pass's standardized checks: every plate goes through wipe-the-rim, garnish-check, temperature-probe, photo-for-the-camera before it leaves — skip any check and the plate fails.
Analogy 2: The Aircraft Pre-Flight Checklist Cascade
Before a flight, the crew runs a multi-stage pre-flight checklist: maintenance sign-off (CodeBuild build phase), fueling (CodeBuild post_build artifact upload), boarding (CodeDeploy BeforeInstall), engine start (CodeDeploy AfterInstall), taxi (ApplicationStart), takeoff roll (AllowTraffic), and rotation (BeforeAllowTraffic / AfterAllowTraffic at the load-balancer cutover). At any stage, a check failure aborts and rolls back to the gate. The alarms are the cockpit warning lights — engine temperature high, hydraulic pressure low — wired to the rollback decision. The circuit breaker is the auto-throttle pulling power if vibration sensors detect anomaly. The canary is the chase plane that flies the route 5 minutes ahead and reports back. Synthetics canaries are exactly that — scripted flights that validate the route is safe before the production traffic flies it.
Analogy 3: The Surgical Operating Room Time-Out Protocol
A surgical team uses a time-out protocol at multiple points: before incision (BeforeInstall), after instruments are counted (AfterInstall), before closing (BeforeAllowTraffic, the patient is about to leave the OR for recovery), and a final closure check (AfterAllowTraffic, signing off the chart). Each time-out has a checklist; failing the checklist halts the procedure. CloudFormation rollback configuration is the post-op monitoring window — for the next N minutes after surgery, if the patient's vitals (CloudWatch alarms) deteriorate, we roll back to the pre-op state (revert the stack update). ECS circuit breaker is the OR's policy: if 3 surgeries in a row from the same team have complications, the team is automatically benched until a review (auto-rollback after N consecutive failures). The lifecycle hook scripts are the surgical checklists themselves — version-controlled, peer-reviewed documents that prevent drift between practitioner and protocol.
For DOP-C02 stems centered on "trace back through the deployment stages to find the failure", reach for the kitchen pass triage analogy. For stems centered on "checklist-driven multi-phase deployment with rollback gates", reach for the aircraft pre-flight analogy. For stems centered on "monitoring window and circuit breaker for safety", reach for the OR time-out protocol analogy. Reference: https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/welcome.html
CodePipeline Failure Diagnosis
Pipelines fail at the action level. The console shows the failed action with the failure message. Common patterns:
- Source action: source repository unreachable, branch deleted, IAM perm missing.
- Build action: CodeBuild project failed; click through to the build's logs.
- Test action: typically another CodeBuild with different buildspec; same diagnosis.
- Deploy action: CodeDeploy / CloudFormation / ECS / Lambda / Beanstalk failure; drill into the deployment service.
- Approval action: timeout (default 7 days) or explicit reject.
- Invoke action: Lambda / Step Functions error; check function logs.
Failed pipelines can be retried at the failed stage with the same input artifact, or restarted from source. For transient infrastructure issues (e.g., ECR throttling), retry is right. For code or config bugs, fix and re-source.
CodeBuild buildspec.yml Phase Failures
version: 0.2
phases:
install:
runtime-versions:
python: 3.11
commands:
- pip install -r requirements.txt
pre_build:
commands:
- aws ecr get-login-password | docker login ...
build:
commands:
- docker build -t myapp:$CODEBUILD_RESOLVED_SOURCE_VERSION .
post_build:
commands:
- docker push ...
artifacts:
files:
- imagedefinitions.json
reports:
unit-tests:
files:
- test-results.xml
file-format: JUNITXML
Each phase fails on first non-zero exit code. Common failures:
- install: missing runtime version, package install errors.
- pre_build: failed authentication to ECR or other registries.
- build: compilation error, test failure, Docker build failure.
- post_build: push failure, artifact upload failure.
The finally block runs regardless of prior phase outcome; use for cleanup. The artifacts section defines what to ship to the next pipeline action. Reports aggregate test results into the CodeBuild report group view.
buildspec.yml phases (in order): install, pre_build, build, post_build, finally. Use runtime-versions only in install. Each phase runs commands in order; first non-zero exit fails the phase.
appspec.yml hooks for EC2/on-prem in-place: ApplicationStop, BeforeInstall, AfterInstall, ApplicationStart, ValidateService. Plus for blue/green: BeforeAllowTraffic, AfterAllowTraffic. Plus for replacement environment: BeforeBlockTraffic, AfterBlockTraffic on the original environment.
appspec.yml for Lambda: BeforeAllowTraffic, AfterAllowTraffic only. For ECS: BeforeInstall, AfterInstall, AfterAllowTestTraffic, BeforeAllowTraffic, AfterAllowTraffic. Reference: https://docs.aws.amazon.com/codedeploy/latest/userguide/reference-appspec-file.html
CodeDeploy Lifecycle Hook Failures
A lifecycle event failure on any instance fails that instance's deployment. The deployment configuration determines whether one instance failure fails the whole deployment:
CodeDeployDefault.OneAtATime: one failure → next instances skip → deployment fails.CodeDeployDefault.HalfAtATime: tolerate up to 50% failures.CodeDeployDefault.AllAtOnce: highest risk; one failure fails all.
Hook script logs land at /opt/codedeploy-agent/deployment-root/<deployment-group>/<deployment-id>/logs/scripts.log on EC2. The CodeDeploy console links to the script output for each failed instance.
Common hook failures
- BeforeInstall timeout: hook script too slow; default timeout 30 min, configurable per hook in appspec.
- ApplicationStart fail: app fails to start; usually missing config, port conflict, or dependency.
- ValidateService fail: smoke test (curl localhost) returns non-200; this is the single most useful hook for catching broken deploys before traffic hits.
CodeDeploy alarm-based rollback
Configure the deployment group with a CloudWatch alarm list. During the deployment, if any alarm enters ALARM state, the deployment is automatically rolled back. Common alarms: ALB target group UnHealthyHostCount > 0, application 5xx rate above threshold, P99 latency above threshold.
CloudFormation Rollback Configuration
For CFN stack creation and updates:
- Rollback configuration: a list of CloudWatch alarms and a monitoring period (default 0, recommend 5-15 min).
- During the monitoring period, if any alarm enters ALARM state, the stack rolls back to the prior state.
- Stack failure options:
ROLLBACK(default),DO_NOTHING(preserve resources for debugging),DELETE(full teardown on create failure).
Common CFN failure causes:
- IAM role missing permissions for resource creation.
- Resource name conflict (e.g., S3 bucket name already taken).
- Quota limit (VPC limit, EIP limit).
- Custom Resource Lambda timeout.
- WaitCondition timeout.
The stack events tab shows the failure resource and reason. CFN hooks (separate feature) intercept resource operations for policy enforcement; failed hook = blocked resource = stack rollback.
Many DOP-C02 stems describe a stack that "deployed successfully but the application immediately broke". The fix is to add rollback configuration with a monitoring period of at least 5 minutes and alarms on application 5xx rate and target group unhealthy hosts. CFN waits the monitoring period after each create/update; if any alarm fires, automatic rollback. This is the AWS-native way to gate IaC deployments on application health, equivalent to CodeDeploy alarm-based rollback. Reference: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-rollback-triggers.html
ECS Deployment Failures and Circuit Breaker
ECS supports two deployment types:
- Rolling update (default): replaces old tasks with new tasks gradually based on
minimumHealthyPercentandmaximumPercent. - Blue/green via CodeDeploy: full new task set in parallel, then traffic shift.
The deployment circuit breaker (rolling update only) automatically rolls back when consecutive task failures exceed a threshold (default 3). Enable with deploymentCircuitBreaker: { enable: true, rollback: true }. Without the circuit breaker, a broken image will keep replacing healthy tasks one-by-one until the entire service is broken.
Common ECS deployment failures:
- Task fails to start: image pull failure (ECR auth, network), invalid task definition, IAM role missing.
- Task starts but fails health check: app crashes on start, port mismatch, dependency unreachable.
- CapacityProviderStrategy failure: no capacity in the chosen provider.
Drill in via:
- ECS service events tab (service-level failures).
- Task stopped reason (per-task failures).
- CloudWatch Logs (
awslogsdriver) for application logs. - Container Insights for deeper resource metrics.
Lambda Alias Traffic Shifting and Hooks
Lambda alias traffic shifting (Linear, Canary) requires CodeDeploy. Configuration:
- Linear: shift X% every Y minutes (e.g., LambdaLinear10PercentEvery1Minute).
- Canary: shift X% then wait Y, then 100% (e.g., LambdaCanary10Percent5Minutes).
- AllAtOnce: full cutover, no gradual shift.
Pre and post traffic hooks (Lambda functions) run before and after the shift. Use the pre-traffic hook to validate the new version (synthetic invocations), and the post-traffic hook for cleanup. Hook function returns success/failure to CodeDeploy via PutLifecycleEventHookExecutionStatus API.
CloudWatch alarm during the shift triggers automatic rollback (alias points back at prior version).
Auto Scaling Group Deployment Failures
ASG deployments fail when:
- Lifecycle hook timeout: instance launch hook (Pending:Wait) doesn't get heartbeat or complete signal.
- Health check fail: ELB health check or EC2 status check fails before instance is in service.
- Bootstrap failure: cfn-init or user data script error.
CodeDeploy + ASG: the deployment configuration must respect ASG scaling events; rolling deployments through new instances are common. Avoid in-place deployments on ASGs because new instances spawning during deployment may not get the new code.
CloudWatch Synthetics — Pre- and Post-Deploy Canaries
Synthetics canaries are scripted browser tests (Puppeteer/Selenium) running on a schedule. Canaries emit metrics (success/failure, latency, broken-link counts) that you alarm on. Use canaries:
- Pre-deploy validation: run against the staging environment.
- Post-deploy verification: run against production after deploy.
- Continuous synthetic monitoring: detect regressions between deploys.
The canary's SuccessPercent metric in CloudWatch is the standard gate — alarm if below 95% and trigger CodeDeploy rollback or CloudFormation rollback configuration.
A common DOP-C02 trap: in-place CodeDeploy on an ASG where the ASG scales out during the deployment. New instances launched mid-deployment are not part of the deployment group's instance list and never receive the new code. The deployment reports success but the fleet is in an inconsistent state. The right pattern is blue/green deployment which deploys to a new replacement ASG with the new launch template, then shifts traffic. Or use rolling deployment with launch template versioning so new instances always come up with the latest. Reference: https://docs.aws.amazon.com/codedeploy/latest/userguide/welcome.html
High-Frequency Exam Traps
Trap 1: Lifecycle Hook Order
EC2 in-place: ApplicationStop → DownloadBundle → BeforeInstall → Install → AfterInstall → ApplicationStart → ValidateService. The exam shows partial sequences and asks for the missing hook.
Trap 2: CodeBuild Phase runtime-versions Only in install
runtime-versions is only valid in the install phase. Putting it elsewhere fails the build.
Trap 3: CloudFormation Rollback Disabled with DO_NOTHING
Setting OnFailure: DO_NOTHING on stack creation is for debugging; it leaves resources in failed state. Not for production.
Trap 4: ECS Circuit Breaker Only for Rolling Updates
Blue/green ECS via CodeDeploy uses CodeDeploy alarm rollback, not the ECS circuit breaker. They are different mechanisms.
Trap 5: Lambda Pre-Traffic Hook Must Call PutLifecycleEventHookExecutionStatus
The hook function must explicitly call the API to report success or failure. Just returning normally is not enough.
Trap 6: Synthetic Canary Permissions
Canaries run as Lambda functions; the canary IAM role needs S3 write for screenshots, CloudWatch Logs write, and any application-side permissions if testing authenticated endpoints.
Trap 7: CodePipeline Cross-Region Action Requires Cross-Region Replication Buckets
To run actions in different Regions, you must configure cross-Region artifact buckets. Pipeline silently fails without them on cross-Region deploy actions.
The DOP-C02 exam treats "deployment without alarm-based rollback" as a wrong answer in any production scenario. Whether it is CodeDeploy (alarm list on deployment group), CloudFormation (rollback configuration), or Lambda alias shifting, the right answer always wires alarms on application-health metrics (5xx rate, P99 latency, target unhealthy count) into the rollback signal. The exam's "best practice" answers always include this layer. Reference: https://docs.aws.amazon.com/codedeploy/latest/userguide/monitoring-create-alarms.html
DOP-C02 Exam Patterns and Worked Scenarios
Scenario 1: CodeDeploy In-Place Failure on ASG with New Instances
Stem: "In-place deployment to ASG of 20 instances; ASG scaled to 25 mid-deployment; deployment reports success but new instances run old code." Right: switch to blue/green deployment with new replacement ASG; or convert to launch template versioning so all new instances boot with new code.
Scenario 2: Buildspec Failing in pre_build with ECR Auth
Stem: "CodeBuild fails in pre_build with denied: User: arn:... is not authorized to perform: ecr:GetAuthorizationToken." Right: add ecr:GetAuthorizationToken and ecr:BatchCheckLayerAvailability permissions to the CodeBuild service role; verify image is in the same Region as the build.
Scenario 3: Lambda Alias Canary Rollback
Stem: "Lambda canary deployment shifted 10% to new version; user reports increase in 500 errors." Right: configure CodeDeploy with a CloudWatch alarm on the function's Errors metric; alarm fire automatically reverts the alias to prior version.
Scenario 4: ECS Service Stuck Replacing Tasks
Stem: "ECS rolling deployment keeps killing healthy old tasks and starting new ones that fail health check; service is degrading." Right: enable deployment circuit breaker with rollback: true; rolls back automatically after 3 consecutive task failures.
Scenario 5: CloudFormation Stack Update Rolled Back
Stem: "CFN stack update fails halfway; team needs to know which resource failed and why." Right: review stack events tab for the failed resource and message; for resource creation issues, check IAM role permissions; for custom resource Lambda issues, check the function's CloudWatch Logs.
FAQ
Q1: How do I find the actual error message for a failed CodePipeline?
Click into the failed action's "Details" link. For CodeBuild, view the build logs (CloudWatch Logs). For CodeDeploy, view per-instance logs (/opt/codedeploy-agent/...). For CloudFormation, view stack events. For ECS, view service events and task stopped reasons.
Q2: What's the difference between CodeDeploy in-place and blue/green failure modes?
In-place failures leave the original deployment group in a partially-updated state; rollback re-deploys the previous revision over the failed instances. Blue/green failures simply terminate the new (green) environment; the original (blue) was never touched and continues serving traffic. Blue/green is safer at higher cost.
Q3: How do I prevent a bad image from blowing away my ECS service?
Enable the deployment circuit breaker with rollback: true for rolling updates. For more safety, use CodeDeploy blue/green for ECS which keeps both task sets running until validation passes.
Q4: How does Lambda traffic shifting roll back automatically?
Configure a CodeDeploy deployment with an alarm list. During the shift, if any alarm enters ALARM, CodeDeploy reverts the alias's routing config to point back at the prior version. The post-traffic hook is not run.
Q5: How do I troubleshoot a Custom Resource Lambda timeout in CloudFormation?
Check the Lambda function's CloudWatch Logs for the actual error. Common causes: VPC subnet without NAT gateway (Lambda cannot reach the AWS API endpoint to send the response), function timeout shorter than the API call duration, missing IAM permissions. Always send the response signal in a try/finally block to avoid CFN waiting until its 1-hour timeout.
Q6: Should I use CloudWatch Synthetics or ALB health checks for deploy validation?
Use both. ALB health check validates basic reachability (the target is up, port is open, HTTP returns OK). CloudWatch Synthetics validates end-to-end user journey (login works, checkout flow completes, third-party integrations respond). Synthetics catches breakages that ALB cannot see.
Q7: How do I avoid in-place deployment risks on ASGs?
Switch to blue/green deployment which spawns a new replacement ASG, validates, then shifts traffic. Or use immutable deployments via launch template version updates, where new instances always boot with new code and old instances are gradually retired.
Cross-References
- EventBridge auto-remediation runbooks can wrap the rollback steps as documented procedures; see
eventbridge-auto-remediation-runbooks. - CloudWatch alarms and EventBridge provide the alarm signals that drive rollback; see
cloudwatch-alarms-eventbridge-integration. - Incident Manager and Health escalate severe deployment failures to the on-call response plan; see
systems-manager-incident-manager-health. - CloudWatch metrics and Logs Insights are the diagnosis tools for finding the failure root cause; see
cloudwatch-metrics-logs-insights. - CloudTrail and Config dashboards reveal who initiated the deployment and what configuration changed; see
cloudtrail-config-audit-dashboards.