Disaster Recovery RTO/RPO Strategies and AWS Backup - DOP-C02 DevOps Engineer Study Notes

Disaster recovery on DOP-C02 reduces to two numbers - Recovery Time Objective (RTO) and Recovery Point Objective (RPO) - and four canonical strategies that meet them at different cost tiers. The exam tests when backup-and-restore suffices, when pilot light is the right pick, when warm standby is mandatory, and when multi-site active-active is the only acceptable answer. Domain 3.3 ("automated recovery processes to meet RTO and RPO requirements") is one of the highest-scored Domain 3 sections because the trade-offs are concrete and the wrong pick shows up clearly in the answer choices.

This guide assumes you know what RTO and RPO mean and that AWS regions are independent failure domains. The DOP-C02 focus: strategy selection by RTO/RPO numbers, AWS Backup with cross-region and cross-account copy, vault lock for ransomware protection, RDS Multi-AZ vs cross-region read replicas, Aurora Global Database, DynamoDB Global Tables, S3 Cross-Region Replication (CRR) and Same-Region Replication (SRR), Route 53 ARC for failover orchestration, DR game day testing, and the constraints that move a workload from one strategy to the next.

Why DR Strategies Matter on DOP-C02

DOP-C02 explicitly lists "Disaster recovery concepts (for example, RTO, RPO)" and "Backup and recovery strategies (for example, pilot light, warm standby)" as Domain 3.3 knowledge requirements. Community pass reports cite RTO/RPO matching as one of the most consistent question types: the stem provides RTO=4 hours, RPO=1 hour, then asks which strategy fits. Pick wrong and you land on either an over-engineered (more expensive) or an under-engineered (insufficient) answer.

The exam also tests precise mechanics: AWS Backup's vault lock has compliance mode and governance mode with different override semantics; cross-region copies can be encrypted under a different KMS key in each region; Aurora Global Database typically has sub-second replication lag and supports failover to a secondary region in 1-2 minutes; DynamoDB Global Tables uses last-writer-wins conflict resolution. Knowing these exactly is the difference between selecting the right strategy and accepting a plausible-looking distractor.

RTO (Recovery Time Objective): maximum acceptable time from disaster to fully restored service.
RPO (Recovery Point Objective): maximum acceptable data loss measured in time (e.g., 5 minutes of writes lost).
Backup-and-Restore: snapshots in another region; restore on demand. Highest RTO/RPO, lowest cost.
Pilot Light: minimal core (DB replicas, baseline IaC) running in DR region; rest provisioned at recovery time.
Warm Standby: scaled-down full stack running in DR region; scale up at recovery time.
Multi-Site Active-Active: full production capacity in both regions, both serving traffic.
AWS Backup: managed backup service that orchestrates snapshots across services with policies, vaults, lifecycle, and cross-region/cross-account copy.
Backup vault: a logical container for backups with KMS encryption and access policies.
Vault lock (compliance/governance mode): WORM (Write Once Read Many) protection on a vault preventing deletion.
AWS Backup plan: a policy assigning backup frequency, retention, and copy targets to resources via tags or resource selection.
RDS Multi-AZ: synchronous standby in a second AZ for HA within a region.
Aurora Global Database: multi-region Aurora cluster with sub-second replication and 1-2 minute cross-region failover.
DynamoDB Global Tables: multi-region active-active DynamoDB with last-writer-wins.
S3 Cross-Region Replication (CRR): asynchronous replication of S3 objects to another region.
Same-Region Replication (SRR): replication within a region (compliance, log aggregation).
Reference: https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-workloads-on-aws.html
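
To make the two definitions concrete, here is a minimal sketch that measures the achieved RTO and RPO for one incident and compares them with the targets. The timeline values and function names are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta


def achieved_rpo(disaster_at: datetime, last_recovery_point_at: datetime) -> timedelta:
    """Data-loss window: writes made after the last recovery point are lost."""
    return disaster_at - last_recovery_point_at


def achieved_rto(disaster_at: datetime, service_restored_at: datetime) -> timedelta:
    """Downtime window: from the disaster until service is fully restored."""
    return service_restored_at - disaster_at


def meets_objectives(rto_actual: timedelta, rpo_actual: timedelta,
                     rto_target: timedelta, rpo_target: timedelta) -> bool:
    """Both objectives must hold; meeting only one is still a miss."""
    return rto_actual <= rto_target and rpo_actual <= rpo_target


# Hypothetical incident timeline.
last_backup = datetime(2024, 1, 1, 2, 0)
disaster = datetime(2024, 1, 1, 2, 45)
restored = datetime(2024, 1, 1, 5, 15)

rpo = achieved_rpo(disaster, last_backup)  # 45 minutes of writes lost
rto = achieved_rto(disaster, restored)     # 2 hours 30 minutes of downtime
ok = meets_objectives(rto, rpo, timedelta(hours=4), timedelta(hours=1))
```

Against a target of RTO=4 hours and RPO=1 hour, this incident is within both objectives; tighten either target below the measured value and the check fails.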

Plain-Language Explanation: DR Strategies and Backup

DR strategies map cleanly to physical-world contingency plans. Three angles cover the strategy hierarchy, AWS Backup, and the data-replication primitives.

Analogy 1: The Hospital Surgical Backup Tiers

A hospital plans for OR failures with four tiers of redundancy. Backup-and-Restore is storing surgical equipment in a sealed off-site warehouse - if the OR is destroyed, equipment is shipped in (RTO measured in hours/days) and any patient data not yet synced to off-site is lost (RPO measured in hours). Cheap to maintain, slow to recover.

Pilot Light is a second OR with the core machines (anesthesia, monitors) plugged in and warm but no surgical staff scheduled - patient records replicate continuously (RPO seconds-to-minutes), but recovery requires calling in the team (RTO 30 minutes). Mid-cost, mid-recovery.

Warm Standby is a fully staffed but scaled-down second OR - skeleton crew on site, equipment warm. Recovery is "scale up the team" - 10 minutes RTO, seconds RPO. Higher cost, faster recovery.

Multi-Site Active-Active is two fully equipped, fully staffed ORs both seeing patients continuously - if one OR fails, the other absorbs the load. RTO near zero, RPO near zero (records sync continuously, though cross-region replication on AWS is asynchronous, so not strictly zero). Highest cost.

AWS Backup is the hospital's centralized records-management department - it knows what equipment exists, schedules off-site shipping, enforces records retention rules (vault lock - cannot be destroyed even by the hospital director for compliance reasons), and tests recovery quarterly.

Aurora Global Database is the patient-record system replicated to two campuses with sub-second sync - either campus can read locally; only the primary campus accepts writes; failover takes 1-2 minutes.

DynamoDB Global Tables is the bidirectional-sync patient list across both campuses - both campuses can write; last-writer-wins for conflicts.

Analogy 2: The Restaurant Chain Disaster Plan

A regional restaurant chain plans for store-level disasters. Backup-and-Restore is off-site storage of recipes, menus, financial records - if a store burns down, ship records to a new location, hire staff, equip the kitchen (RTO weeks). RPO = "since the last weekly backup".

Pilot Light is a second store with kitchen equipment plugged in but no staff - recipes sync nightly (low RPO), staff hired and trained at recovery time.

Warm Standby is a fully staffed but lower-volume second store in the same city - takes the hit during the primary's downtime by extending hours and accepting all the primary's customers.

Multi-Site Active-Active is two equally-busy stores in opposite ends of town - either can absorb the other's load.

AWS Backup vault lock is the off-site records storage's tamper-proof safe - even the chain CEO cannot delete records before the retention period; required for SEC compliance.

Analogy 3: The Power Utility Contingency Tiers

A utility plans for substation failures. Backup-and-Restore is stocked spare transformers in a regional warehouse - shipping and installing takes 24-72 hours after a substation fire.

Pilot Light is a redundant cold-standby substation - transformers are connected but de-energized. Energizing and load-transferring takes hours.

Warm Standby is a hot-spare substation operating at minimal load - already synchronized and ready to take a portion of demand within minutes.

Multi-Site Active-Active is two substations both serving 50 percent of the load - if one fails, the other smoothly absorbs all 100 percent within seconds.

S3 Cross-Region Replication is the utility's metered-data shipped continuously to a backup data center - the backup is always within seconds of the source.

RDS Multi-AZ is the redundant power-grid synchronous tie line - synchronous replication between two transformers in the same substation; the failover happens in seconds without data loss but only protects against AZ-level (substation-level) failure, not regional grid failure.

For the four-tier DR strategy hierarchy, the hospital surgical backup tiers map most cleanly. For AWS Backup vault lock semantics, the restaurant chain's tamper-proof off-site safe captures the WORM model. For RDS Multi-AZ vs cross-region replicas, the utility substation tie-line analogy clarifies the AZ-vs-region scope. Reference: https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-workloads-on-aws.html

The Four Canonical DR Strategies

AWS's DR whitepaper defines four strategies with different RTO/RPO and cost profiles.

Backup and Restore

RTO: hours to days (provision new infrastructure, restore data, validate).
RPO: hours (typically last full or incremental backup).
Cost: lowest - only paying for backup storage in the DR region.
Use cases: tier-3 internal apps, dev/test environments, long-archive compliance copies.

Mechanism: AWS Backup with cross-region copy of EBS snapshots, RDS snapshots, DynamoDB on-demand backups, EFS, FSx, S3.

Pilot Light

RTO: 10s of minutes to a few hours.
RPO: minutes (continuous replication of core data).
Cost: low - core infrastructure (DB replicas, baseline IAM, networking) running 24/7; compute provisioned at recovery.
Use cases: workloads where RTO of 30-60 minutes is acceptable.

Mechanism: cross-region read replica for RDS, DynamoDB Global Tables (active-passive), CloudFormation/CDK templates ready to run for compute, AMI replication, ECR image replication.

Warm Standby

RTO: minutes to 10s of minutes.
RPO: seconds to minutes.
Cost: moderate - full stack running at scaled-down capacity in the DR region.
Use cases: critical apps with downtime tolerance under 30 minutes.

Mechanism: full ASG and ECS services running at minimum size in DR region, RDS cross-region replica or Aurora Global Database, ALB and Route 53 records pre-configured. Recovery flips routing and scales up.

Multi-Site Active-Active

RTO: seconds.
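RPO: near-zero. Cross-region replication (Aurora Global Database, DynamoDB Global Tables, S3 CRR) is asynchronous, so a small window of in-flight writes can still be lost; RPO of zero applies only to synchronous replication within a region (e.g., RDS Multi-AZ).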
Cost: highest - full production capacity in both regions.
Use cases: tier-1 apps with strict business continuity requirements.

Mechanism: Aurora Global Database write-forwarding or DynamoDB Global Tables, traffic via Route 53 latency-based routing, both regions handle live traffic. Failure of one region is absorbed by the other automatically.

Strategy Selection Matrix

RTO Required	RPO Required	Likely Strategy
Hours-Days	Hours	Backup-and-Restore
30-60 min	Minutes	Pilot Light
< 30 min	Seconds	Warm Standby
< 1 min	Seconds-Zero	Multi-Site Active-Active

The exam frequently tempts with an over-engineered answer. If the requirement is "RTO 1 hour, RPO 15 minutes", the right answer is Pilot Light - not Warm Standby (over-spec) or Backup-and-Restore (under-spec). Always read the RTO/RPO numbers carefully and pick the lowest-cost strategy that satisfies both. Reference: https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-workloads-on-aws.html
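
The "lowest-cost strategy that satisfies both" rule can be sketched as a walk through the tiers from cheapest to most expensive. The numeric floors below are illustrative assumptions loosely derived from the matrix above, not official AWS figures, and `pick_strategy` is a hypothetical helper.

```python
from typing import NamedTuple


class Strategy(NamedTuple):
    name: str
    best_rto_s: int  # fastest recovery this tier is assumed to reach, in seconds
    best_rpo_s: int  # smallest data-loss window this tier is assumed to reach


# Ordered cheapest first. The floors are assumptions for illustration only.
STRATEGIES = [
    Strategy("Backup-and-Restore", 2 * 3600, 3600),
    Strategy("Pilot Light", 30 * 60, 60),
    Strategy("Warm Standby", 60, 1),
    Strategy("Multi-Site Active-Active", 0, 0),
]


def pick_strategy(rto_s: int, rpo_s: int) -> str:
    """Return the cheapest strategy whose floors fit inside both targets."""
    if rto_s < 0 or rpo_s < 0:
        raise ValueError("targets must be non-negative")
    for strategy in STRATEGIES:
        if strategy.best_rto_s <= rto_s and strategy.best_rpo_s <= rpo_s:
            return strategy.name
    # Not reached: the last tier has floors of zero and always matches.
    raise AssertionError("no strategy matched")


choice = pick_strategy(rto_s=3600, rpo_s=15 * 60)  # RTO 1 hour, RPO 15 minutes
```

Because the list is ordered by cost, the first match is the answer: Backup-and-Restore is rejected for being too slow, and the loop stops at Pilot Light before it ever considers Warm Standby.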

AWS Backup

AWS Backup orchestrates protection across many AWS services with consistent policies and reporting.

Supported Resources

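EBS volumes, RDS, Aurora, DynamoDB, EFS, FSx (Windows File Server, Lustre, NetApp ONTAP, OpenZFS), Storage Gateway volumes, S3, EC2 instances (instance + EBS), VMware virtual machines (on premises and in VMware Cloud on AWS), Neptune, DocumentDB, Redshift, Timestream, SAP HANA on EC2. The list grows over time, so check the current documentation before the exam.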

Backup Plan, Vault, Selection

Backup plan: a policy with one or more rules. Each rule has schedule, lifecycle (move to cold storage, expire), copy actions to other vaults/regions/accounts.
Backup vault: KMS-encrypted container; access policy controls who can read/restore/delete.
Resource selection: by tag (e.g., Backup=daily) or explicit ARN.
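
As a rough sketch of how a plan, a lifecycle, and a tag-based selection fit together, the helpers below build the request bodies that boto3's `create_backup_plan` and `create_backup_selection` accept. The helper names, plan and vault names, schedule, and retention values are assumptions for illustration; the AWS calls are left as comments so the snippet runs without credentials.

```python
def build_backup_plan(plan_name: str, vault_name: str, dr_vault_arn: str,
                      cold_after_days: int = 30, delete_after_days: int = 365) -> dict:
    """Build a one-rule plan: daily backup, lifecycle, and one copy action."""
    # AWS Backup keeps cold-storage recovery points for at least 90 days,
    # so deletion must come at least 90 days after the cold transition.
    if delete_after_days < cold_after_days + 90:
        raise ValueError("DeleteAfterDays must be >= MoveToColdStorageAfterDays + 90")
    lifecycle = {
        "MoveToColdStorageAfterDays": cold_after_days,
        "DeleteAfterDays": delete_after_days,
    }
    return {
        "BackupPlanName": plan_name,
        "Rules": [{
            "RuleName": "daily",
            "TargetBackupVaultName": vault_name,
            "ScheduleExpression": "cron(0 5 ? * * *)",  # 05:00 UTC every day
            "Lifecycle": lifecycle,
            "CopyActions": [{
                "DestinationBackupVaultArn": dr_vault_arn,
                "Lifecycle": dict(lifecycle),
            }],
        }],
    }


def build_tag_selection(selection_name: str, role_arn: str,
                        tag_key: str = "Backup", tag_value: str = "daily") -> dict:
    """Select every supported resource carrying the given tag."""
    return {
        "SelectionName": selection_name,
        "IamRoleArn": role_arn,
        "ListOfTags": [{
            "ConditionType": "STRINGEQUALS",
            "ConditionKey": tag_key,
            "ConditionValue": tag_value,
        }],
    }


plan = build_backup_plan(
    "prod-daily", "primary-vault",
    "arn:aws:backup:us-west-2:111122223333:backup-vault:dr-vault",  # placeholder ARN
)
selection = build_tag_selection(
    "tagged-daily", "arn:aws:iam::111122223333:role/BackupServiceRole",  # placeholder ARN
)

# With credentials configured, the dictionaries would be submitted like this:
# import boto3
# backup = boto3.client("backup")
# plan_id = backup.create_backup_plan(BackupPlan=plan)["BackupPlanId"]
# backup.create_backup_selection(BackupPlanId=plan_id, BackupSelection=selection)
```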

Cross-Region and Cross-Account Copy

A backup rule can include CopyActions:

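Copy to another vault in the same account but a different region (DR).
Copy to another account, in the same or a different region (account isolation - if the source account is compromised, backups in the destination account survive).

Cross-account copy requires both accounts to be in the same AWS Organizations organization with cross-account backup enabled, plus an access policy on the destination vault that allows the source account to copy in (backup:CopyIntoBackupVault).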
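
A destination-vault access policy for cross-account copy might look like the sketch below. The account ID and vault name are placeholders, `cross_account_copy_policy` is a hypothetical helper, and the `put_backup_vault_access_policy` call is shown only as a comment.

```python
import json


def cross_account_copy_policy(source_account_id: str) -> str:
    """Return a vault access policy letting one source account copy backups in."""
    if not (source_account_id.isdigit() and len(source_account_id) == 12):
        raise ValueError("AWS account IDs are 12 digits")
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowCopyFromSourceAccount",
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{source_account_id}:root"},
            "Action": "backup:CopyIntoBackupVault",
            "Resource": "*",
        }],
    }
    return json.dumps(policy, indent=2)


policy_json = cross_account_copy_policy("111122223333")  # placeholder account ID

# Applied in the DESTINATION account, on the vault that receives the copies:
# import boto3
# boto3.client("backup").put_backup_vault_access_policy(
#     BackupVaultName="dr-vault", Policy=policy_json)
```

Granting only the copy-in action keeps the source account from reading or deleting recovery points in the destination vault, which is the point of the isolation.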

Vault Lock

Backup vault lock writes a WORM policy to the vault:

Governance mode: the vault lock can be removed by IAM principals with sufficient permission (backup:DeleteBackupVaultLockConfiguration). Useful for accidental-deletion protection without compliance binding.
Compliance mode: once the cooling-off period ends, the lock cannot be changed or removed by anyone, including the root user and AWS. Commonly used to meet WORM requirements such as SEC 17a-4(f), CFTC, and FINRA rules.

Compliance mode starts with a cooling-off (grace) period, set by ChangeableForDays with a minimum of 3 days, during which the lock can still be removed; after that, the lock is immutable.

Once compliance mode is locked (after the grace period), nobody can delete recovery points in that vault before their retention period expires - not the AWS account root, not even AWS Support. The exam tests this as a ransomware mitigation: a compromised root account cannot delete the backups. Reference: https://docs.aws.amazon.com/aws-backup/latest/devguide/vault-lock.html
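
The difference between the two modes comes down to one request parameter. The sketch below builds the arguments for boto3's `put_backup_vault_lock_configuration`; the helper names and retention values are assumptions, and the AWS call is left as a comment.

```python
from typing import Optional


def vault_lock_request(vault_name: str, min_retention_days: int, max_retention_days: int,
                       changeable_for_days: Optional[int] = None) -> dict:
    """Build vault lock arguments; ChangeableForDays selects compliance mode."""
    if min_retention_days > max_retention_days:
        raise ValueError("MinRetentionDays cannot exceed MaxRetentionDays")
    request = {
        "BackupVaultName": vault_name,
        "MinRetentionDays": min_retention_days,
        "MaxRetentionDays": max_retention_days,
    }
    if changeable_for_days is not None:
        # The cooling-off period must be at least 3 days.
        if changeable_for_days < 3:
            raise ValueError("ChangeableForDays must be at least 3")
        request["ChangeableForDays"] = changeable_for_days
    return request


def lock_mode(request: dict) -> str:
    """Omitting ChangeableForDays leaves the lock in governance mode."""
    return "compliance" if "ChangeableForDays" in request else "governance"


governance = vault_lock_request("audit-vault", 35, 365)
compliance = vault_lock_request("audit-vault", 35, 365, changeable_for_days=3)

# import boto3
# boto3.client("backup").put_backup_vault_lock_configuration(**compliance)
```

Test a compliance lock in a throwaway vault first: once the cooling-off period passes, the retention bounds in the request cannot be undone.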

Backup Audit Manager

AWS Backup Audit Manager validates that backups are happening as expected per controls (e.g., "RDS instances tagged production are backed up daily, retained 35 days, copied to us-west-2"). Reports compliance per resource. Useful for compliance reporting like SOC 2 or PCI evidence.

Service-Specific DR Mechanisms

Beyond AWS Backup, individual services offer DR features.

RDS

Multi-AZ: synchronous standby in a second AZ. Failover automatic, ~60-120 seconds. RPO=0 within region.
Cross-region read replicas: asynchronous replication. Promote replica for DR; takes minutes.
Automated snapshots + cross-region snapshot copy: backup-and-restore.

Aurora

Aurora Multi-AZ: 6 copies across 3 AZs, automatic failover ~30 seconds.
Aurora Global Database: secondary region cluster with sub-second replication lag (storage-level), failover to secondary in 1-2 minutes via managed planned failover or up to 5 minutes for unplanned.
Aurora Backtrack (MySQL only): rewind to a point in time without restoring from snapshot.

Aurora Global Database with write forwarding lets the secondary region accept writes, which are forwarded to the primary. This approximates an active-active pattern at the cost of higher write latency; there is still only one writer cluster.

DynamoDB

Point-in-Time Recovery (PITR): continuous backups for the last 35 days; restore to any second.
On-demand backups: explicit snapshots, retained indefinitely.
Global Tables: multi-region active-active with last-writer-wins. RPO ~1 second.
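
Last-writer-wins is easy to state and easy to underestimate. The following is a simplified model of the idea, not DynamoDB's actual implementation: two regions accept conflicting writes for the same key, and the later timestamp silently replaces the earlier one. The `Write` type, key format, and timestamps are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Write:
    region: str
    key: str
    value: str
    timestamp: float  # seconds; the later write wins


def resolve_last_writer_wins(writes: Iterable[Write]) -> Dict[str, Write]:
    """Return the converged item per key: the write with the latest timestamp."""
    winners: Dict[str, Write] = {}
    for write in writes:
        current = winners.get(write.key)
        if current is None or write.timestamp > current.timestamp:
            winners[write.key] = write
    return winners


# Two users claim the same username in different regions within the
# replication window. Both writes succeed locally.
writes = [
    Write("us-east-1", "username#alice", "user-123", 100.0),
    Write("eu-west-1", "username#alice", "user-456", 100.4),
]
converged = resolve_last_writer_wins(writes)
winner = converged["username#alice"]
```

After convergence both regions hold `user-456`, and the first claim is gone without any error returned to its writer. That is why uniqueness checks that depend on cross-region consistency need a different design, such as routing all writes for a key to one home region.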

S3

Cross-Region Replication (CRR): async replication to a bucket in another region. Multiple destinations supported. With S3 Replication Time Control (RTC) enabled, replication is designed to complete for 99.99 percent of objects within 15 minutes and is backed by an SLA.
Same-Region Replication (SRR): in-region replication for log aggregation, data sovereignty, account isolation.
Versioning + MFA Delete: protects against accidental and ransomware deletion.

Enabling CRR/SRR replicates only new objects after the rule was created. To replicate existing objects, you need S3 Batch Replication (a managed batch operation) or one-time CLI sync. Many candidates assume "enable replication, all objects copy" - it does not. Reference: https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html
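
The trap can be pictured as a split on upload time. This sketch is a simplified model with hypothetical object keys and dates: anything uploaded before the rule existed needs a backfill, such as S3 Batch Replication.

```python
from datetime import datetime
from typing import Dict, List, Tuple


def split_for_replication(objects: Dict[str, datetime],
                          rule_created_at: datetime) -> Tuple[List[str], List[str]]:
    """Split keys into (replicated by the live rule, needing a one-time backfill)."""
    live: List[str] = []
    backfill: List[str] = []
    for key, uploaded_at in sorted(objects.items()):
        if uploaded_at >= rule_created_at:
            live.append(key)      # new object: picked up by the replication rule
        else:
            backfill.append(key)  # pre-existing object: not replicated by default
    return live, backfill


rule_created = datetime(2024, 3, 1)
bucket = {
    "logs/2024-02-27.gz": datetime(2024, 2, 27),
    "logs/2024-02-28.gz": datetime(2024, 2, 28),
    "logs/2024-03-02.gz": datetime(2024, 3, 2),
}
live, backfill = split_for_replication(bucket, rule_created)
```

Here only the March object replicates on its own; the two February objects stay in the source bucket alone until a backfill job copies them.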

Route 53 ARC for DR Orchestration

Application Recovery Controller adds:

Readiness checks: continuously verify backup region is actually ready (capacity, configuration, replication lag) before failover.
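Routing controls: explicit On/Off switches, surfaced as Route 53 health checks whose state the operator sets directly instead of relying on automated endpoint probing.
Cluster: five redundant Regional endpoints, so routing-control state can still be changed when some endpoints or regions are impaired.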

ARC is the standard answer for "operator-controlled multi-region failover with safety rules".
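
Because a cluster exposes several endpoints, failover tooling usually tries them in turn instead of depending on one. The sketch below shows that retry loop with the AWS call injected as a function, so it runs without credentials. The `set_routing_control` helper and the endpoint URLs are hypothetical; the commented boto3 call shows how `update_routing_control_state` would plug in.

```python
from typing import Callable, Iterable, List


def set_routing_control(endpoints: Iterable[str],
                        update: Callable[[str], None]) -> str:
    """Try each cluster endpoint in turn; return the one that accepted the change."""
    errors: List[str] = []
    for endpoint in endpoints:
        try:
            update(endpoint)
            return endpoint
        except Exception as exc:  # broad on purpose: any failure means try the next one
            errors.append(f"{endpoint}: {exc}")
    raise RuntimeError("all cluster endpoints failed: " + "; ".join(errors))


# With boto3, `update` could look like this (region_name must match the
# region of the endpoint being called):
# def update(endpoint: str) -> None:
#     client = boto3.client("route53-recovery-cluster",
#                           endpoint_url=endpoint, region_name="us-west-2")
#     client.update_routing_control_state(
#         RoutingControlArn=routing_control_arn, RoutingControlState="On")

attempted: List[str] = []


def fake_update(endpoint: str) -> None:
    """Stand-in for the AWS call: the first two endpoints are 'down'."""
    attempted.append(endpoint)
    if endpoint != "https://endpoint-c.example":
        raise ConnectionError("endpoint unreachable")


used = set_routing_control(
    ["https://endpoint-a.example", "https://endpoint-b.example",
     "https://endpoint-c.example"],
    fake_update,
)
```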

DR Testing

DR plans degrade without testing. The exam expects:

Quarterly game days: simulated regional failure. Run the runbook end-to-end.
Restore tests: recover a backup to a sandbox account and validate.
Chaos engineering: AWS Fault Injection Service (FIS) injects faults (terminate instances, throttle networks).
Documented runbooks: stored as Systems Manager Automation runbooks (SSM documents) or in a wiki, with named operators and escalation paths.

The exam pattern is "the team built warm standby 18 months ago and never tested - what risk does this introduce?" The answer is "the runbook may have drifted from reality - configuration changes, IAM policy updates, replication lag growth that nobody noticed". DR plans need a recurring test cadence to remain valid. Reference: https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-workloads-on-aws.html

Common Pitfalls (Frequently Tested Traps)

Picking warm standby for tier-3 apps with hours of acceptable RTO: over-engineered, wasteful. Backup-and-restore suffices.
Picking backup-and-restore for tier-1 apps with minutes of RTO: under-engineered, breaks the SLA.
Forgetting that S3 Replication does not replicate existing objects: use S3 Batch Replication for one-time backfill.
Treating RDS Multi-AZ as DR: Multi-AZ is HA within a region; for region-level DR, you need cross-region replicas or Aurora Global Database.
Skipping cross-account copy for ransomware protection: same-account backups are vulnerable if the account is compromised.
Vault lock compliance mode without due diligence: irreversible once locked; misconfiguration is permanent.
Assuming Aurora Global Database failover is instant: 1-2 minutes for managed failover, up to 5 minutes for unplanned.

FAQ

Q1: How do I pick between RDS cross-region read replica and Aurora Global Database?

Aurora Global Database offers sub-second storage-level replication and 1-2 minute managed failover; RDS cross-region read replicas use the database engine's native asynchronous replication, with lag ranging from about a second to minutes, and require manual promotion. For tier-1 RTO/RPO, Aurora Global Database wins. For tier-2, RDS read replicas are simpler and cheaper.

Q2: Can AWS Backup back up resources I do not have IAM access to?

No, AWS Backup uses an IAM service role to access resources; the role must have permissions to read/snapshot/copy the resources. For cross-account backup, the destination account must accept the source account's copy via vault access policy.

Q3: What is the difference between vault lock governance and compliance mode?

Governance mode: the lock can be removed by IAM principals with the backup:DeleteBackupVaultLockConfiguration permission. Useful for accidental-deletion protection. Compliance mode: after the cooling-off period (minimum 3 days) the lock cannot be removed by anyone, including root, and recovery points cannot be deleted until their retention expires. Typically chosen by regulated industries.

Q4: Can DynamoDB Global Tables be used as the primary storage for active-active?

Yes - that is its design. Both regions accept reads and writes; conflicts are resolved with last-writer-wins (highest timestamp wins). Be aware: any business logic relying on strong cross-region consistency (e.g., uniqueness constraints) does not hold in Global Tables.

Q5: How do I test a multi-region failover safely?

Run game days in a non-production environment first. For production, shift a small percentage of traffic to the secondary (for example, with Route 53 weighted records), monitor for issues, then expand; ARC routing controls are On/Off switches, so use them for the full cutover, not for percentage shifts. Schedule game days during business hours with leadership awareness; the value of finding gaps in working hours far exceeds the perceived risk of intentional disruption.

Q6: What happens to RDS automated backups if the source region becomes unavailable?

Automated backups are stored in the same region as the DB instance. If the region is unavailable, the backups are unavailable too. For DR, configure cross-region snapshot copy or use AWS Backup's cross-region copy. Alternatively, Aurora Global Database has its data replicated to the secondary region's storage independent of any backup.

Q7: How do I budget for DR while meeting the RTO/RPO?

The systematic approach: list each workload's RTO/RPO based on business impact analysis; map each to the cheapest strategy that meets both targets; sum the cost. Tier-1 workloads (often 5-10 percent of the portfolio) drive most of the DR budget; tier-3 workloads (often 80 percent) cost very little because backup-and-restore is sufficient.
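
A back-of-the-envelope version of that exercise is sketched below. The cost units are made-up relative placeholders, not AWS prices, and the workload names and portfolio mix are hypothetical; the point is only how a small tier-1 slice can dominate the total.

```python
from typing import Dict

# Relative monthly cost units per strategy: placeholders, NOT AWS prices.
COST_UNITS = {
    "Backup-and-Restore": 1,
    "Pilot Light": 5,
    "Warm Standby": 20,
    "Multi-Site Active-Active": 100,
}


def dr_budget(portfolio: Dict[str, str]) -> Dict[str, int]:
    """Sum cost units per strategy for a workload -> strategy mapping."""
    totals: Dict[str, int] = {}
    for strategy in portfolio.values():
        totals[strategy] = totals.get(strategy, 0) + COST_UNITS[strategy]
    return totals


# Hypothetical portfolio: 1 tier-1, 1 tier-2, and 8 tier-3 workloads.
portfolio = {"payments": "Multi-Site Active-Active", "orders": "Warm Standby"}
portfolio.update({f"internal-{n}": "Backup-and-Restore" for n in range(8)})

totals = dr_budget(portfolio)
total_cost = sum(totals.values())
tier1_share = totals["Multi-Site Active-Active"] / total_cost
```

With these placeholder units, one workload out of ten accounts for roughly three quarters of the budget, which is the shape the answer above describes.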

Wrap-Up

Disaster recovery on DOP-C02 is a composition of four canonical strategies (backup-restore, pilot light, warm standby, multi-site active-active) selected to meet RTO and RPO at minimum cost. AWS Backup is the orchestrator with cross-region and cross-account copy plus vault lock for ransomware protection. Service-specific mechanisms (RDS Multi-AZ, Aurora Global Database, DynamoDB Global Tables, S3 CRR) provide the data-plane primitives. Route 53 ARC orchestrates safe, operator-controlled multi-region failovers. Memorise the strategy-to-RTO/RPO mapping, the vault lock compliance-mode irreversibility, the S3 Replication "no existing objects by default" trap, and the testing cadence requirement. With those, DR scenarios resolve cleanly to recognition.
