Skarp

Chapter 4 of 9

Designing Resilient Architectures: High Availability, Backup, and Disaster Recovery

Systems fail in all the creative ways the exam writers can imagine—your job is to keep the lights on. This module demystifies multi‑AZ vs multi‑Region, RTO/RPO, and backup strategies so you can instantly match each scenario to the right resiliency pattern.

15 min read

Step 1: What Do We Mean by Resilience, HA, FT, and DR?

Three Related Ideas

You will repeatedly see three ideas: High availability (HA), Fault tolerance (FT), and Disaster recovery (DR). Distinguishing them makes exam scenarios much easier.

High Availability (HA)

High availability aims to keep applications up most of the time and is measured as an availability percentage (for example, 99.9%). It usually relies on multi-AZ deployments, load balancers, and health checks in a single Region.
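
To make an availability percentage concrete, it helps to convert it into a yearly downtime budget. The sketch below is plain arithmetic under the assumption of a 365-day year; the function name `allowed_downtime_minutes` is illustrative and not part of any AWS tooling.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> about {minutes / 60:.1f} hours of downtime per year")
```

Each extra "nine" cuts the downtime budget by a factor of ten, which is why higher availability targets tend to demand more redundancy.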

Fault Tolerance (FT)

Fault tolerance means the system keeps working with no visible interruption when components fail. It uses extra redundancy so any single failure is invisible to users.

Disaster Recovery (DR)

Disaster recovery prepares for large-scale events like Region outages or major data loss. It is measured by RTO and RPO and uses backups, replication, and alternate environments.

How They Fit Together

HA/FT try to avoid downtime during normal failures; DR is about coming back after rare disasters. In questions, ask: is this about better uptime now, or about recovering from a big disaster?

Step 2: RTO and RPO – The Two Numbers That Drive Everything

What Are RTO and RPO?

DR choices are driven by two numbers: RTO and RPO. Most exam questions are asking: given these, which DR pattern should you choose?

Recovery Time Objective (RTO)

RTO is the maximum acceptable downtime after a disaster. Example: RTO = 4 hours means you must restore service within 4 hours.

Recovery Point Objective (RPO)

RPO is the maximum acceptable data loss, measured in time. Example: RPO = 15 minutes means you can lose at most 15 minutes of changes.
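
One way to reason about these two numbers is to compare them with what a design can actually deliver. The sketch below assumes a simple periodic backup scheme, where the worst-case data loss equals the full interval between backups, and a restore time you have estimated yourself. The function names are hypothetical.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """With periodic backups, worst-case data loss is one full backup interval."""
    return backup_interval <= rpo


def meets_rto(estimated_restore_time: timedelta, rto: timedelta) -> bool:
    """The estimated time to restore service must fit inside the RTO."""
    return estimated_restore_time <= rto


if __name__ == "__main__":
    # Backups every 15 minutes satisfy an RPO of 15 minutes; hourly backups do not.
    print(meets_rpo(timedelta(minutes=15), timedelta(minutes=15)))
    print(meets_rpo(timedelta(hours=1), timedelta(minutes=15)))
    # A 3-hour restore fits an RTO of 4 hours.
    print(meets_rto(timedelta(hours=3), timedelta(hours=4)))
```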

Low vs High RTO/RPO

RTO and RPO measured in hours → backup and restore. RTO under about an hour and RPO in minutes → pilot light or warm standby. RTO ≈ 0 and RPO ≈ 0 → multi-site active-active.

Spotting RTO/RPO in Questions

Look for phrases like "up to 4 hours of downtime" (RTO) or "at most 5 minutes of data loss" (RPO). Map these to DR patterns later.

Step 3: Multi-AZ vs Multi-Region – Which Problem Are You Solving?

Availability Zones

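Each AZ consists of one or more discrete data centers within a Region, with independent power and networking. AZs in a Region are connected by fast, low-latency links and are designed to fail independently. Multi-AZ designs spread resources across AZs for high availability.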

Multi-AZ Patterns

Examples: RDS Multi-AZ with synchronous replication and automatic failover, or EC2 Auto Scaling groups across 2–3 AZs behind an ALB.

Regions and Multi-Region

Regions are geographically separated. Multi-Region means deploying in 2+ Regions, usually for disaster recovery, lower latency, or compliance.

Multi-Region Examples

Examples: S3 Cross-Region Replication, DynamoDB global tables, or Route 53 failover routing between primary and DR Regions.

Choosing AZ vs Region

If the concern is AZ failure → multi-AZ. If the concern is Region failure or global latency → multi-Region. "If primary Region is unavailable" almost always means multi-Region.

Step 4: Visualizing HA vs DR – A Simple Web App

Baseline 3-Tier App

Imagine: Users → ALB → EC2 app tier → RDS database, all in one Region. We will use this to contrast high availability vs disaster recovery.

High Availability in One Region

App tier: EC2 in AZ A and B behind an ALB. DB: RDS Multi-AZ with primary in AZ A and standby in AZ C. If AZ A fails, the ALB sends traffic to the instances in AZ B and RDS fails over to the standby in AZ C.

What This Gives You

This multi-AZ setup gives high availability and some fault tolerance inside a single Region. It does not protect against Region-wide failures.
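
A rough way to see why spreading across AZs helps is the standard redundancy arithmetic: copies in parallel fail only if all of them fail, while tiers in series all have to work. The sketch below assumes failures are independent and uses made-up availability figures (99% per copy); these are not AWS SLA numbers.

```python
def parallel(availability: float, copies: int) -> float:
    """Availability of N redundant copies, assuming independent failures."""
    return 1 - (1 - availability) ** copies


def serial(*tiers: float) -> float:
    """Availability of tiers that must all be up at the same time."""
    result = 1.0
    for tier in tiers:
        result *= tier
    return result


if __name__ == "__main__":
    single_az = serial(0.99, 0.99)  # one app copy and one DB copy
    multi_az = serial(parallel(0.99, 2), parallel(0.99, 2))  # two copies of each
    print(f"single AZ: {single_az:.4%}")
    print(f"multi-AZ:  {multi_az:.4%}")
```

A Region-wide event breaks the independence assumption for every copy at once, which is why this arithmetic says nothing about Region failures.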

Adding a DR Region

Add `us-west-2` as the DR Region: S3 Cross-Region Replication, an RDS cross-Region read replica or a database restored from copied backups, and a smaller copy of the app stack (pilot light or warm standby).

Failover With Route 53

Use Route 53 health checks and failover routing. If `us-east-1` is unhealthy, Route 53 directs users to the DR stack in `us-west-2`.
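
As a sketch of what failover routing looks like in configuration terms, the code below builds a pair of PRIMARY and SECONDARY record sets in the shape accepted as `ChangeBatch` by the Route 53 `ChangeResourceRecordSets` API (for example, boto3's `change_resource_record_sets`). It only constructs dictionaries and makes no AWS call. The domain name, load balancer hostnames, and health check ID are hypothetical placeholders, and CNAME records are used for simplicity where a real setup might prefer alias records.

```python
from typing import Optional


def failover_record(name: str, role: str, target: str,
                    health_check_id: Optional[str] = None) -> dict:
    """Build one failover record-set change (role is PRIMARY or SECONDARY)."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "CNAME",
        "SetIdentifier": f"{role.lower()}-endpoint",
        "Failover": role,
        "TTL": 60,  # short TTL so clients pick up a failover quickly
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


# Hypothetical names: replace with your own domain, endpoints, and health check.
change_batch = {
    "Comment": "Failover from us-east-1 to us-west-2",
    "Changes": [
        failover_record("app.example.com", "PRIMARY",
                        "primary-alb.us-east-1.example.com",
                        health_check_id="hypothetical-health-check-id"),
        failover_record("app.example.com", "SECONDARY",
                        "dr-alb.us-west-2.example.com"),
    ],
}

if __name__ == "__main__":
    for change in change_batch["Changes"]:
        rrset = change["ResourceRecordSet"]
        print(rrset["Failover"], "->", rrset["ResourceRecords"][0]["Value"])
```

The health check is attached to the primary record so that Route 53 can stop answering with it when the primary Region is unhealthy.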

Step 5: DR Patterns – From Cheapest to Most Resilient

Four DR Patterns

There are four classic DR patterns: Backup and restore, Pilot light, Warm standby, and Multi-site active-active, ordered from cheapest/slowest to most expensive/fastest.

Backup and Restore

RTO: hours–days, RPO: hours. Store backups in S3, S3 Glacier, or AWS Backup vaults. No infrastructure runs in the DR Region; you rebuild from backups when needed. Good when cost matters more than uptime.

Pilot Light

RTO: tens of minutes–hours, RPO: minutes–hours. Minimal environment and replicated data in DR Region. In a disaster, scale up compute and supporting services.

Warm Standby

RTO: minutes, RPO: seconds–minutes. A scaled-down but fully functional copy runs in DR Region with continuous replication. Scale up during a disaster.

Multi-Site Active-Active

RTO ≈ 0, RPO ≈ 0. Workloads run in multiple Regions simultaneously, with near real-time data replication. Used for mission-critical systems needing almost no downtime.
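
The ordering from cheapest to most resilient suggests a simple selection rule: walk the patterns in cost order and stop at the first one that can meet both targets. The sketch below encodes that rule. The numeric thresholds are illustrative assumptions chosen to match the rough ranges above; they are not official AWS figures, and real workloads need their own measurements.

```python
from typing import NamedTuple


class Pattern(NamedTuple):
    name: str
    min_rto_minutes: float  # assumed fastest recovery this pattern typically achieves
    min_rpo_minutes: float  # assumed smallest data-loss window it typically achieves


# Ordered cheapest first. Thresholds are illustrative assumptions only.
PATTERNS = [
    Pattern("backup and restore", 240, 60),
    Pattern("pilot light", 45, 10),
    Pattern("warm standby", 5, 1),
    Pattern("multi-site active-active", 0, 0),
]


def cheapest_pattern(rto_minutes: float, rpo_minutes: float) -> str:
    """Return the cheapest pattern whose assumed limits fit the targets."""
    if rto_minutes < 0 or rpo_minutes < 0:
        raise ValueError("RTO and RPO must be non-negative")
    for pattern in PATTERNS:
        if (rto_minutes >= pattern.min_rto_minutes
                and rpo_minutes >= pattern.min_rpo_minutes):
            return pattern.name
    return PATTERNS[-1].name  # unreachable in practice; active-active accepts any target


if __name__ == "__main__":
    print(cheapest_pattern(24 * 60, 12 * 60))
    print(cheapest_pattern(60, 15))
    print(cheapest_pattern(30, 5))
    print(cheapest_pattern(0, 0))
```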

Step 6: Match RTO/RPO to DR Patterns

Use this thought exercise to solidify how RTO/RPO map to DR strategies. For each scenario, pause and decide the best-fit DR pattern before reading the guidance.

Scenario A

  • An internal reporting tool.
  • Acceptable downtime: 24 hours.
  • Acceptable data loss: 12 hours.
  • Budget: very limited.

Your choice?

Guidance: This fits backup and restore. You can run in a single Region, take backups to S3/Glacier at least every 12 hours so the data-loss target is met, and rebuild the environment from those backups if needed.

---

Scenario B

  • Customer-facing mobile app.
  • Acceptable downtime: 1 hour.
  • Acceptable data loss: 15 minutes.
  • Budget: moderate but not huge.

Your choice?

Guidance: This suggests pilot light or a small warm standby. A minimal environment in the DR Region plus frequent replication meets these RTO/RPO targets at reasonable cost.

---

Scenario C

  • Online trading platform.
  • Acceptable downtime: effectively 0.
  • Acceptable data loss: effectively 0.
  • Global users.

Your choice?

Guidance: This needs multi-site active-active. You run in multiple Regions, use global data stores, and route traffic across Regions.

When you practice questions, explicitly write the RTO/RPO in the margin and then pick the pattern that fits.

Step 7: Backup Strategies and Common AWS Building Blocks

AWS Backup

AWS Backup lets you centrally manage backups for services like EBS, RDS, DynamoDB, EFS, and FSx, including cross-Region and cross-account backups for stronger DR.
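
As a sketch of what a cross-Region policy might look like, the code below builds a backup plan document in the shape accepted by the AWS Backup `CreateBackupPlan` API (the `BackupPlan` argument of boto3's `create_backup_plan`). It only constructs a dictionary and calls no AWS service. The plan name, vault names, account ID, schedule, and retention period are hypothetical values chosen for illustration.

```python
def cross_region_backup_plan(plan_name: str, source_vault: str,
                             dr_vault_arn: str, retain_days: int = 35) -> dict:
    """Build a daily backup rule that also copies each backup to a DR vault."""
    return {
        "BackupPlanName": plan_name,
        "Rules": [
            {
                "RuleName": "daily-with-dr-copy",
                "TargetBackupVaultName": source_vault,
                "ScheduleExpression": "cron(0 5 ? * * *)",  # daily at 05:00 UTC
                "Lifecycle": {"DeleteAfterDays": retain_days},
                "CopyActions": [
                    {
                        # Vault in another Region (and possibly another account)
                        "DestinationBackupVaultArn": dr_vault_arn,
                        "Lifecycle": {"DeleteAfterDays": retain_days},
                    }
                ],
            }
        ],
    }


if __name__ == "__main__":
    plan = cross_region_backup_plan(
        "example-dr-plan",
        "primary-vault",
        "arn:aws:backup:us-west-2:111122223333:backup-vault:dr-vault",
    )
    rule = plan["Rules"][0]
    print(rule["RuleName"], "->", rule["CopyActions"][0]["DestinationBackupVaultArn"])
```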

S3 for Backups

Use S3 versioning to keep object history and Object Lock for WORM (write once, read many) protection. Lifecycle policies can move older backups to cheaper S3 Glacier storage classes.
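
A lifecycle policy of this kind can be expressed as a small configuration document. The sketch below builds one in the shape accepted by the S3 `PutBucketLifecycleConfiguration` API (the `LifecycleConfiguration` argument of boto3's `put_bucket_lifecycle_configuration`) without calling AWS. The rule ID, prefix, and day counts are assumptions picked for illustration.

```python
import json


def backup_lifecycle_rules(prefix: str, glacier_after_days: int,
                           expire_noncurrent_after_days: int) -> dict:
    """Move backups under a prefix to Glacier and expire old noncurrent versions."""
    return {
        "Rules": [
            {
                "ID": "archive-old-backups",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"}
                ],
                # Only meaningful on a versioned bucket
                "NoncurrentVersionExpiration": {
                    "NoncurrentDays": expire_noncurrent_after_days
                },
            }
        ]
    }


if __name__ == "__main__":
    config = backup_lifecycle_rules("backups/", 30, 365)
    print(json.dumps(config, indent=2))
```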

Database Backup Features

RDS offers automated backups and manual snapshots; Aurora offers continuous backups and Aurora Global Database for cross-Region replication; DynamoDB provides on-demand backups and point-in-time recovery.

Backups vs Replication

Backups are point-in-time snapshots, great for logical errors. Replication is continuous and good for low RPO but can also replicate corruption. Robust designs use both.
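
The trade-off can be illustrated with a toy simulation that uses plain dictionaries as stand-ins for data stores. It is not a model of any specific AWS service; the record names and the `replicate` helper are made up for the example.

```python
import copy


def replicate(source: dict) -> dict:
    """Continuous replication: the replica always mirrors the current source."""
    return copy.deepcopy(source)


primary = {"order-1": "paid", "order-2": "shipped"}

# A point-in-time backup is taken now and never changes afterwards.
backup = copy.deepcopy(primary)

# New data arrives after the backup; replication picks it up right away.
primary["order-3"] = "new"
replica = replicate(primary)

# A logical error (for example, a buggy deploy) deletes a record.
del primary["order-2"]
replica = replicate(primary)  # the deletion is replicated too

if __name__ == "__main__":
    print("replica has order-2:", "order-2" in replica)  # lost on the replica as well
    print("backup has order-2: ", "order-2" in backup)   # recoverable from the backup
    print("backup has order-3: ", "order-3" in backup)   # written after the backup
```

The replica keeps the newest data but inherits the mistake, while the backup undoes the mistake but misses writes made after it was taken.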

Recognizing Exam Cues

"Accidental deletion" → backups or versioning. "Centralize backups" → AWS Backup. "Protect from deletion by attackers" → S3 Object Lock or cross-account backups.

Step 8: Quick Check – Multi-AZ vs Multi-Region

Answer this question to test your understanding of multi-AZ vs multi-Region.

A company runs a critical web application in a single AWS Region. They want the application to remain available if one Availability Zone fails, but they are not concerned about Region-wide disasters. Which change best meets this requirement with minimal cost?

  1. Add a second Region with a warm standby environment and Route 53 failover routing.
  2. Deploy EC2 instances in multiple AZs behind an Application Load Balancer and enable RDS Multi-AZ.
  3. Enable S3 Cross-Region Replication for all application assets.
  4. Create daily backups with AWS Backup and store them in another Region.

Answer: 2) Deploy EC2 instances in multiple AZs behind an Application Load Balancer and enable RDS Multi-AZ.

The requirement is to survive an **AZ failure**, not a Region failure. The correct pattern is **multi-AZ**: EC2 instances across AZs behind an ALB and RDS Multi-AZ. Multi-Region and cross-Region backups address Region disasters, which the question explicitly deprioritizes.

Step 9: Quick Check – RTO/RPO to DR Pattern

Now connect RTO/RPO directly to a DR pattern.

An application has an RTO of 30 minutes and an RPO of 5 minutes. Management wants to minimize cost while still meeting these objectives. Which DR strategy is the best fit?

  1. Backup and restore
  2. Pilot light
  3. Warm standby
  4. Multi-site active-active

Answer: 3) Warm standby

RTO 30 minutes and RPO 5 minutes require relatively fast recovery and low data loss. **Warm standby** keeps a scaled-down but fully functional environment running in the DR Region with continuous replication, balancing cost and speed. Pilot light is usually slower; active-active is faster but more expensive.

Step 10: Flashcard Review – Key Terms

Flip through these flashcards to reinforce the core vocabulary.

High availability (HA)
Designing systems to remain operational for a very high percentage of time, often using redundancy within a Region (for example, multi-AZ, load balancing, health checks).
Fault tolerance (FT)
The ability of a system to continue operating without interruption when one or more components fail, often via N+1 redundancy and self-healing designs.
Disaster recovery (DR)
Processes and architectures that allow a system to be restored after a major event such as a Region outage or large-scale data loss, guided by RTO and RPO.
Recovery Time Objective (RTO)
The maximum acceptable amount of time that an application can be unavailable after a disruption before it must be fully restored.
Recovery Point Objective (RPO)
The maximum acceptable amount of data loss measured in time, indicating how far back in time you may need to recover data.
Multi-AZ
Deploying resources across multiple Availability Zones in a single Region to improve high availability and fault tolerance against AZ failures.
Multi-Region
Deploying workloads and data across two or more AWS Regions, typically for disaster recovery, global latency, or compliance requirements.
Backup and restore (DR pattern)
A DR approach where data and configurations are regularly backed up and the environment is rebuilt from those backups after a disaster; lowest cost and slowest recovery.
Pilot light (DR pattern)
A DR approach where a minimal version of the environment, especially critical data stores, is always running in the DR Region and scaled up during a disaster.
Warm standby (DR pattern)
A DR approach where a scaled-down but fully functional copy of the production environment runs in the DR Region and is scaled up during a disaster.
Multi-site active-active (DR pattern)
A DR approach where workloads run actively in multiple Regions at the same time, providing near-zero RTO and RPO.
AWS Backup
A managed service that centralizes and automates backups across AWS services, supporting cross-Region and cross-account backup policies.

Key Terms

Region
A geographically separate AWS area consisting of multiple Availability Zones, used for isolation, latency optimization, and compliance.
Multi-AZ
An architecture that deploys resources across multiple Availability Zones in a single Region to improve availability and resilience.
AWS Backup
An AWS service for centralized, automated backup management across supported AWS resources.
Pilot light
A DR strategy where a minimal, critical subset of the environment is always running in the DR Region and scaled up when needed.
Multi-Region
An architecture that deploys resources and data in multiple AWS Regions, often for disaster recovery or global performance.
Warm standby
A DR strategy where a smaller but fully functional copy of the production environment runs in the DR Region, ready to scale.
Backup and restore
A DR strategy based on periodic backups and reconstructing the environment from those backups after a disaster.
Fault tolerance (FT)
The ability of a system to continue operating correctly even when some components fail, often with no visible impact to users.
Availability Zone (AZ)
An isolated location within an AWS Region, with independent power and networking, designed to fail independently from other AZs.
Disaster recovery (DR)
Strategies and processes to restore systems and data after major disruptive events, such as Region outages or catastrophic data loss.
High availability (HA)
Designing systems to remain operational for a very high percentage of time, typically using redundancy and failover within a Region.
Cross-Region replication
Automatically copying data from one AWS Region to another, for example with S3 Cross-Region Replication or DynamoDB global tables.
Multi-site active-active
A DR strategy where multiple Regions serve traffic simultaneously, providing near-zero downtime and data loss.
Recovery Time Objective (RTO)
The maximum acceptable duration of application downtime after a disruption before full service must be restored.
Recovery Point Objective (RPO)
The maximum acceptable data loss measured in time, indicating how recent the recovered data must be.
