Chapter 9 of 26
Designing Multi-AZ and Multi-Region Resilient Architectures
High availability and disaster recovery are at the heart of Domain 2. This module shows how to spread risk across Availability Zones and Regions using core services like EC2, RDS, S3, and Route 53.
Big Picture: Resilience, AZs, and Regions
Reliability in Context
This module lives in the Reliability pillar: "The reliability pillar encompasses the ability of a workload to perform its intended function correctly and consistently when it’s expected to."
Regions and AZs
An AWS Region is a geographic area (like us-east-1). Each Region has multiple Availability Zones. Each AZ consists of one or more discrete data centers, physically separated from the other AZs so that one AZ can fail without taking down the others.
Fault Domains
A fault domain is a group of resources that can fail together, such as a rack or an AZ. Designing for resilience means spreading your workload across independent fault domains.
Multi-AZ vs Multi-Region
Multi-AZ protects against losing a single AZ inside a Region. Multi-Region protects against losing an entire Region and can also help meet latency, cross-border, and compliance requirements.
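To make the value of independent fault domains concrete, here is a minimal sketch of the standard parallel-availability formula. It assumes that domains fail independently and that any single surviving domain can carry the full load; the 99.9% per-domain figure is a hypothetical input, not an AWS commitment.

```python
def parallel_availability(per_domain: float, domains: int) -> float:
    """Availability when the workload survives as long as one domain is up.

    Assumes fault domains fail independently, which real AZs and Regions
    only approximate.
    """
    if not 0.0 <= per_domain <= 1.0:
        raise ValueError("per_domain must be between 0 and 1")
    if domains < 1:
        raise ValueError("domains must be at least 1")
    # The workload is down only when every domain is down at once.
    return 1.0 - (1.0 - per_domain) ** domains


# Hypothetical 99.9% availability per fault domain.
for count in (1, 2, 3):
    print(count, f"{parallel_availability(0.999, count):.9f}")
```

Each added domain multiplies the chance of a total outage by the per-domain failure probability, which is why the second AZ buys far more than the third.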
RPO, RTO, and Disaster Recovery Patterns
RPO and RTO
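RPO is how much data loss (measured in time) is acceptable. RTO is how long the system can be down. Exam scenarios often hide these values in business language, for example "we can afford to lose the last 15 minutes of orders" (RPO) or "the site must be back within an hour" (RTO).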
Backup and Restore
Backup and restore uses periodic backups to another AZ or Region. It is the cheapest option, with RPO and RTO often measured in hours.
Pilot Light
Pilot light keeps a minimal version of your core systems running in a DR Region. You scale out other components during a disaster.
Warm Standby and Multi-site
Warm standby runs a smaller but complete stack in a second Region. Multi-site runs full production in multiple Regions with near-zero RPO/RTO.
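The four patterns can be read as a ladder from cheapest and slowest to most expensive and fastest. The sketch below maps an RPO/RTO pair to the cheapest pattern that plausibly meets it. The numeric thresholds are illustrative assumptions chosen for this example, not AWS-defined limits, and `Objectives` and `suggest_dr_pattern` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objectives:
    rpo_minutes: float  # acceptable data loss, in minutes
    rto_minutes: float  # acceptable downtime, in minutes


def suggest_dr_pattern(obj: Objectives) -> str:
    """Pick the cheapest pattern whose typical range covers both objectives.

    Thresholds are assumptions for illustration only.
    """
    tightest = min(obj.rpo_minutes, obj.rto_minutes)
    if tightest <= 1:
        return "multi-site (active-active)"
    if tightest <= 10:
        return "warm standby"
    if tightest <= 60:
        return "pilot light"
    return "backup and restore"


print(suggest_dr_pattern(Objectives(rpo_minutes=240, rto_minutes=480)))
print(suggest_dr_pattern(Objectives(rpo_minutes=0, rto_minutes=5)))
```

Using the tighter of the two objectives is a deliberate simplification: in practice RPO drives the replication choice and RTO drives how much capacity is kept running, so borderline cases deserve a closer look.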
Designing Multi-AZ Architectures with EC2 and Load Balancing
Spreading Compute Across AZs
Run EC2 instances in at least two AZs and front them with an ALB or NLB that has subnets in multiple AZs. Use Auto Scaling across those AZs.
Eliminating Single-AZ Dependencies
Avoid single-AZ NAT gateways or load balancers. Use separate NAT gateways per AZ and multi-AZ ALBs/NLBs to keep outbound and inbound traffic working.
How It Looks
Visualize us-east-1 with two AZs. An ALB spans both. EC2 instances in an Auto Scaling group run in private subnets in each AZ. If one AZ fails, the other keeps serving.
Exam Angle
Any design that claims AZ-level resilience but places compute or the load balancer in only one AZ is incorrect for requirements like "tolerate loss of a single AZ without downtime".
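The checks described in this section can be sketched as a small review function. It works on a plain description of where things are placed rather than calling AWS, and the `Topology` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Topology:
    instance_azs: list[str]     # AZ of each EC2 instance
    lb_azs: list[str]           # AZs where the load balancer has subnets
    nat_gateway_azs: list[str]  # AZs that contain a NAT gateway


def single_az_risks(t: Topology) -> list[str]:
    """Return human-readable findings for single-AZ dependencies."""
    risks = []
    compute_azs = set(t.instance_azs)
    if len(compute_azs) < 2:
        risks.append("compute runs in fewer than two AZs")
    if len(set(t.lb_azs)) < 2:
        risks.append("load balancer has subnets in fewer than two AZs")
    # Instances in an AZ without its own NAT gateway depend on another AZ
    # for outbound traffic.
    missing_nat = compute_azs - set(t.nat_gateway_azs)
    if missing_nat:
        risks.append("no NAT gateway in: " + ", ".join(sorted(missing_nat)))
    return risks


design = Topology(
    instance_azs=["us-east-1a", "us-east-1b"],
    lb_azs=["us-east-1a", "us-east-1b"],
    nat_gateway_azs=["us-east-1a"],
)
print(single_az_risks(design))
```

An empty result does not prove a design is resilient; it only means none of these three specific single-AZ dependencies was found.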
RDS Multi-AZ, Read Replicas, and Resiliency
RDS Multi-AZ
RDS Multi-AZ gives you a synchronous standby in another AZ. It is for high availability and automatic failover, not for read scaling.
RDS Read Replicas
Read replicas use asynchronous replication to scale reads. They can live in the same Region or another Region, but they can lag behind the primary, so promoting one may lose the most recent writes.
Failover Behavior
With Multi-AZ, RDS automatically promotes the standby and keeps the same endpoint. With read replicas, promotion is manual and may lose recent writes.
Exam Traps
If the question is about availability and automatic failover, choose Multi-AZ. If it is about scaling reads or cross-Region DR with some lag, choose read replicas.
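The difference between synchronous and asynchronous replication is easiest to see in a toy model. The sketch below is not how RDS is implemented; it only assumes that a synchronous standby receives every write before it is acknowledged, while an asynchronous replica trails by a hypothetical number of writes.

```python
from dataclasses import dataclass, field


@dataclass
class Database:
    log: list[str] = field(default_factory=list)


def run_workload(writes: list[str], async_lag: int):
    """Apply writes and return (primary, sync_standby, async_replica)."""
    primary, standby, replica = Database(), Database(), Database()
    for w in writes:
        primary.log.append(w)
        standby.log.append(w)  # synchronous: copied before the write is acknowledged
    # Asynchronous: the replica has not yet received the last `async_lag` writes.
    replica.log = primary.log[: max(0, len(primary.log) - async_lag)]
    return primary, standby, replica


def lost_writes(primary: Database, promoted: Database) -> list[str]:
    """Writes acknowledged by the primary but absent from the promoted copy."""
    return primary.log[len(promoted.log):]


primary, standby, replica = run_workload(["w1", "w2", "w3", "w4", "w5"], async_lag=2)
print("failover to standby loses:", lost_writes(primary, standby))
print("promoting replica loses:", lost_writes(primary, replica))
```

In this model the standby loses nothing while the replica loses whatever was still in flight, which is the trade-off behind choosing Multi-AZ for availability and read replicas for scaling or cross-Region DR.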
S3 Durability, Availability, and Replication (2026 View)
Durability vs Availability
S3 Standard is designed for 99.999999999% (11 nines) durability across multiple AZs and 99.99% availability. Durability is about not losing data; availability is about being able to access it at a given moment.
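A quick calculation shows how different the two figures are in practice. This sketch treats the durability figure as an annual per-object loss probability and the availability figure as a fraction of the year, which is a simplifying reading of the design targets; the object count is a hypothetical input.

```python
from fractions import Fraction

MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability_percent: str) -> float:
    """Minutes per year a service may be unreachable at this availability."""
    unavailable = 1 - Fraction(availability_percent) / 100
    return float(unavailable * MINUTES_PER_YEAR)


def expected_objects_lost_per_year(durability_percent: str, objects: int) -> float:
    """Expected number of objects lost per year at this durability."""
    loss_probability = 1 - Fraction(durability_percent) / 100
    return float(loss_probability * objects)


# Percentages are passed as strings so Fraction can represent them exactly.
print(downtime_minutes_per_year("99.99"))
print(expected_objects_lost_per_year("99.999999999", 10_000_000))
```

Under these assumptions, 99.99% availability allows roughly 53 minutes of inaccessibility a year, while 11 nines of durability means that with ten million objects you would expect to lose about one object every ten thousand years.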
S3 and AZ Failures
Inside a Region, S3 automatically stores your data across multiple AZs, so it already tolerates the loss of a single AZ without extra work.
S3 Replication Options
Use S3 Same-Region Replication (SRR) for copies within a Region, and Cross-Region Replication (CRR) for geo-redundancy and DR across Regions.
Compliance and Replication
For data residency, avoid CRR to other jurisdictions. For off-site backup requirements, CRR to another Region is a common solution.
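As a rough sketch, the snippet below builds a replication configuration in the dictionary shape that boto3's `put_bucket_replication` accepts, and refuses destinations outside an allowed set of Regions. The role ARN, bucket ARN, rule ID, and helper names are placeholders, and the exact rule fields should be checked against the current S3 documentation. Versioning must be enabled on both buckets before replication can be turned on.

```python
def check_residency(destination_region: str, allowed_regions: set[str]) -> None:
    """Raise if replication would move data outside the allowed Regions."""
    if destination_region not in allowed_regions:
        raise ValueError(
            f"{destination_region} is outside the allowed Regions: "
            f"{sorted(allowed_regions)}"
        )


def build_crr_configuration(role_arn: str, destination_bucket_arn: str) -> dict:
    """Build a single-rule configuration that replicates every object."""
    return {
        "Role": role_arn,
        "Rules": [
            {
                "ID": "replicate-all",       # placeholder rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},    # empty prefix matches all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": destination_bucket_arn},
            }
        ],
    }


check_residency("eu-central-1", {"eu-west-1", "eu-central-1"})
config = build_crr_configuration(
    role_arn="arn:aws:iam::111122223333:role/example-replication-role",
    destination_bucket_arn="arn:aws:s3:::example-dr-bucket",
)
# With boto3 installed and credentials configured, this could be applied with:
# boto3.client("s3").put_bucket_replication(
#     Bucket="example-source-bucket", ReplicationConfiguration=config)
print(config["Rules"][0]["Destination"]["Bucket"])
```

Running the residency check before building the configuration keeps the compliance decision explicit instead of leaving it to whoever types the destination bucket name.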
Worked Example: Mapping RPO/RTO to Multi-AZ vs Multi-Region
EduCast Requirements
EduCast must survive an AZ failure, tolerate 30 minutes of data loss and 2 hours of downtime in a Region disaster, and keep data within the EU.
Within-Region Design
Use EC2 Auto Scaling and an ALB across 3 AZs, RDS Multi-AZ, and S3 Standard. This covers AZ-level high availability.
Cross-Region DR Design
Create an RDS cross-Region read replica and S3 CRR to eu-central-1, plus a minimal or small duplicate stack there (pilot light or warm standby).
Why It Works
Asynchronous replication keeps the DR copy within the 30-minute RPO as long as replica lag is monitored, promoting the replica and scaling up the DR stack fit within the 2-hour RTO, and all Regions are in the EU for data residency.
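The reasoning above can be written as a simple checklist. In this sketch the primary Region (eu-west-1), the replica lag, and the duration of each recovery step are all hypothetical numbers chosen for illustration; only the 30-minute RPO, the 2-hour RTO, and the eu-central-1 DR Region come from the scenario. An explicit allow-list is used rather than matching the `eu-` prefix, because some `eu-` Regions, such as eu-west-2 (London), are outside the EU.

```python
from dataclasses import dataclass


@dataclass
class DrPlan:
    regions: list[str]
    worst_replica_lag_minutes: float
    recovery_steps_minutes: dict[str, float]


def evaluate(plan: DrPlan, rpo_minutes: float, rto_minutes: float,
             allowed_regions: set[str]) -> dict[str, bool]:
    """Check a plan against residency, RPO, and RTO requirements."""
    total_recovery = sum(plan.recovery_steps_minutes.values())
    return {
        "residency": set(plan.regions) <= allowed_regions,
        "rpo": plan.worst_replica_lag_minutes <= rpo_minutes,
        "rto": total_recovery <= rto_minutes,
    }


educast = DrPlan(
    regions=["eu-west-1", "eu-central-1"],  # primary Region is an assumption
    worst_replica_lag_minutes=5,            # hypothetical observed lag
    recovery_steps_minutes={                # hypothetical runbook timings
        "promote replica": 15,
        "scale out": 45,
        "DNS failover": 5,
        "validate": 30,
    },
)
result = evaluate(educast, rpo_minutes=30, rto_minutes=120,
                  allowed_regions={"eu-west-1", "eu-central-1"})
print(result)
```

Summing the runbook steps makes the RTO claim testable: if a DR drill shows any step taking longer than planned, the same check shows whether the 2-hour budget still holds.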
Multi-Region Architectures and Route 53 Routing Policies
Why Route 53 Matters
In multi-Region designs, Route 53 controls where users connect. It can direct traffic based on health, latency, geography, or weights.
Failover and Latency Routing
Failover routing sends traffic to a primary endpoint and switches to a secondary if health checks fail. Latency routing sends users to the lowest-latency Region.
Geo and Weighted Routing
Geolocation routing directs users by location, often for compliance. Weighted routing splits traffic by percentage, handy for gradual cutovers.
Exam Clues
Automatic DR Region DNS switch → failover routing. Lowest latency → latency-based. Gradual traffic shift → weighted. Region-by-region rules → geolocation.
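For the failover case, a primary/secondary record pair might look like the following sketch. It builds the `ChangeBatch` dictionary that boto3's `change_resource_record_sets` expects without calling AWS. The domain name, IP addresses (taken from documentation ranges), set identifiers, and health check ID are placeholders.

```python
from typing import Optional


def failover_record(name: str, role: str, ip: str, set_id: str,
                    health_check_id: Optional[str] = None, ttl: int = 60) -> dict:
    """Build one UPSERT change for a failover A record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_id,
        "Failover": role,
        "TTL": ttl,  # a short TTL lets clients pick up the switch sooner
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def failover_change_batch(name: str, primary_ip: str, secondary_ip: str,
                          primary_health_check_id: str) -> dict:
    """Pair a health-checked primary record with a secondary record."""
    return {
        "Comment": "Failover pair for DR",
        "Changes": [
            failover_record(name, "PRIMARY", primary_ip, "primary-region",
                            primary_health_check_id),
            failover_record(name, "SECONDARY", secondary_ip, "dr-region"),
        ],
    }


batch = failover_change_batch(
    "app.example.com.", "192.0.2.10", "198.51.100.10",
    primary_health_check_id="hypothetical-health-check-id",
)
# With boto3 and a real hosted zone, this could be applied with:
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="<your zone id>", ChangeBatch=batch)
print([c["ResourceRecordSet"]["Failover"] for c in batch["Changes"]])
```

The other policies use the same record shape with a different routing field (for example `Weight` for weighted or `Region` for latency-based records) in place of `Failover`.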
Design Challenge: Pick the Right DR Pattern
Use this thought exercise to practice mapping requirements to DR patterns and Multi-AZ vs multi-Region.
Scenario A
A payment processing API:
- Cannot lose any confirmed transactions.
- Must recover from a Region failure in under 5 minutes.
- Compliance requires data to stay in one country.
Which pattern fits best?
- Backup and restore
- Pilot light
- Warm standby
- Multi-site (active-active)
Do you need Multi-AZ, multi-Region, or both?
Pause and decide before reading the guidance below.
Guidance
- RPO = 0 (no data loss), RTO under 5 minutes. Backup/restore and pilot light are too slow and too lossy.
- Warm standby might be tight; you would need very aggressive automation.
- Multi-site (active-active) across two Regions, each spanning multiple AZs, with synchronous or near-synchronous replication, is the most realistic fit.
- You still use Multi-AZ inside each Region for AZ failures.
- Data residency constraint means you choose two Regions within the same country (where available) or reconsider requirements.
Scenario B (self-check)
An internal reporting tool:
- Can lose up to 4 hours of data.
- Can be down for 1 business day after a disaster.
- Budget is tight.
Which pattern is likely enough? What AWS services would you use?
(Think: backup and restore with S3, AWS Backup, and occasional cross-Region copies.)
Quick Check 1: Multi-AZ and RDS
Test your understanding of Multi-AZ vs read replicas.
Your company runs a critical OLTP database on Amazon RDS for MySQL in us-east-1. The CTO says: "We must survive the loss of a single AZ with automatic failover and no code changes. Read scaling is not a concern right now." What is the MOST appropriate configuration?
- A) Create a cross-Region read replica in another Region and promote it during an AZ failure.
- B) Enable RDS Multi-AZ on the instance so AWS maintains a synchronous standby in another AZ.
- C) Create two read replicas in different AZs and configure the application to fail over between them.
- D) Move the database to a single large EC2 instance with EBS Multi-Attach volumes across AZs.
Answer: B) Enable RDS Multi-AZ on the instance so AWS maintains a synchronous standby in another AZ.
RDS Multi-AZ is designed for high availability within a Region. It maintains a synchronous standby in another AZ and performs automatic failover without code changes. Cross-Region read replicas are for DR and read scaling with manual promotion and possible data loss. Read replicas generally do not provide automatic failover, and EBS Multi-Attach does not span AZs.
Quick Check 2: S3 and DR Patterns
Check your understanding of S3 durability and DR.
You store daily database backups in an S3 bucket in ap-southeast-1. Management is worried about a full-Region outage and wants off-site backups with minimal changes. RPO of 24 hours and RTO of 24 hours are acceptable. What is the MOST cost-effective solution?
- A) Enable S3 Cross-Region Replication to another Region and keep using the current backup process.
- B) Migrate all backups to S3 One Zone-IA in the same Region to save costs.
- C) Move backups to EBS snapshots in the same Region and copy them manually when needed.
- D) Set up an RDS Multi-AZ configuration and stop taking S3 backups.
Answer: A) Enable S3 Cross-Region Replication to another Region and keep using the current backup process.
You already have daily backups on S3. To protect against a Region failure, enabling S3 Cross-Region Replication to a bucket in another Region adds geo-redundancy with minimal change. One Zone-IA is single-AZ and not suitable for Region DR. EBS snapshots in the same Region do not solve Region failure, and Multi-AZ only protects within the Region and does not replace backups.
Key Term Review
Review these core terms to reinforce the concepts before moving on.
- Availability Zone (AZ)
- One or more discrete data centers with redundant power, networking, and connectivity in an AWS Region. AZs are physically separated to limit correlated failures and are key fault domains for high availability designs.
- Region
- A physical geographic area that contains multiple Availability Zones. Multi-Region designs protect against full-Region failures and help meet latency and regulatory requirements.
- RPO (Recovery Point Objective)
- The maximum acceptable amount of data loss measured in time. It answers: How far back in time can our data be rolled back after a disaster?
- RTO (Recovery Time Objective)
- The maximum acceptable time a system can be offline after a disaster before it must be restored to service.
- RDS Multi-AZ
- An RDS configuration that maintains a synchronous standby in another AZ within the same Region for automatic failover and high availability. It is not used for read scaling.
- RDS Read Replica
- An asynchronously replicated copy of an RDS database used primarily for read scaling and sometimes for DR. Can be in the same or a different Region and may lag behind the primary.
- S3 Cross-Region Replication (CRR)
- An S3 feature that automatically replicates objects from a source bucket in one Region to a destination bucket in another Region, used for geo-redundancy, DR, and compliance.
- Backup and Restore DR Pattern
- A low-cost DR strategy where data and configuration are regularly backed up and restored to rebuild the environment after a disaster. RPO and RTO are usually measured in hours.
- Warm Standby DR Pattern
- A DR strategy where a scaled-down but fully functional copy of the environment runs in another Region. During a disaster, it is scaled up and traffic is shifted. Provides lower RPO/RTO than backup and restore or pilot light.
- Multi-site (Active-Active) DR Pattern
- A DR strategy where full production capacity runs in two or more Regions simultaneously. Provides near-zero RPO and RTO but at higher cost and complexity.
- Route 53 Failover Routing
- A Route 53 policy where DNS traffic is sent to a primary endpoint and automatically switched to a secondary endpoint when health checks detect a failure.
Key Terms
- Region
- A geographic area that contains multiple Availability Zones. Multi-Region designs mitigate Region-level failures and address latency and regulatory needs.
- Pilot light
- A DR pattern where a minimal version of the environment (typically core databases and services) runs in the DR Region and is scaled out during a disaster.
- Fault domain
- A group of resources that can fail together, such as a rack, an Availability Zone, or a data center.
- RDS Multi-AZ
- An Amazon RDS deployment option that keeps a synchronous standby in another AZ within the same Region for automatic failover and high availability.
- Warm standby
- A DR pattern where a scaled-down but fully functional copy of the entire stack runs in another Region, ready to be scaled up during a disaster.
- RDS read replica
- An asynchronously replicated copy of an RDS database instance, used for read scaling and sometimes DR, which can be in the same or a different Region.
- Availability Zone (AZ)
- One or more discrete data centers with redundant power, networking, and connectivity in an AWS Region, used as independent fault domains.
- Route 53 failover routing
- A Route 53 routing policy that directs traffic to a primary endpoint and automatically fails over to a secondary endpoint when health checks fail.
- Multi-site (active-active)
- A DR pattern where full production runs in multiple Regions simultaneously, providing very low RPO and RTO.
- RTO (Recovery Time Objective)
- The maximum acceptable time a system can be offline after a failure before it must be restored.
- RPO (Recovery Point Objective)
- The maximum acceptable amount of data loss measured in time after a failure.
- S3 Cross-Region Replication (CRR)
- An S3 feature that automatically copies objects from a bucket in one Region to a bucket in another Region for geo-redundancy and DR.