
Chapter 9 of 26

Designing Multi-AZ and Multi-Region Resilient Architectures

High availability and disaster recovery are at the heart of Domain 2. This module shows how to spread risk across Availability Zones and Regions using core services like EC2, RDS, S3, and Route 53.


Big Picture: Resilience, AZs, and Regions

Reliability in Context

This module lives in the Reliability pillar: "The reliability pillar encompasses the ability of a workload to perform its intended function correctly and consistently when it’s expected to."

Regions and AZs

An AWS Region is a geographic area (like us-east-1). Each Region has multiple Availability Zones, each made up of one or more discrete data centers and designed so one AZ can fail without taking down the others.

Fault Domains

A fault domain is a group of resources that can fail together, such as a rack or an AZ. Designing for resilience means spreading your workload across independent fault domains.

Multi-AZ vs Multi-Region

Multi-AZ protects against losing a single AZ inside a Region. Multi-Region protects against losing an entire Region and can also address cross-border latency and compliance needs.

RPO, RTO, and Disaster Recovery Patterns

RPO and RTO

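RPO is how much data loss (measured in time) is acceptable. RTO is how long the system can be down. Exam scenarios often state these values in business language rather than naming them: "we can lose at most an hour of orders" is an RPO of one hour, and "the site must be back within four hours" is an RTO of four hours.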

Backup and Restore

Backup and restore uses periodic backups to another AZ or Region. It is the cheapest option, with RPO and RTO often measured in hours.

Pilot Light

Pilot light keeps a minimal version of your core systems running in a DR Region. You scale out other components during a disaster.

Warm Standby and Multi-site

Warm standby runs a smaller but complete stack in a second Region. Multi-site runs full production in multiple Regions with near-zero RPO/RTO.
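The four patterns can be read as a cost ladder: pick the cheapest one whose typical RPO and RTO still fit the requirement. The sketch below illustrates that reasoning only. The minute values in PATTERNS are illustrative assumptions, not AWS figures, and the names DrPattern and cheapest_pattern are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrPattern:
    name: str
    typical_rpo_minutes: float  # assumed worst-case data loss
    typical_rto_minutes: float  # assumed worst-case downtime


# Ordered cheapest first. All numbers are illustrative assumptions.
PATTERNS = [
    DrPattern("backup and restore", 240, 480),
    DrPattern("pilot light", 15, 120),
    DrPattern("warm standby", 5, 30),
    DrPattern("multi-site (active-active)", 0, 1),
]


def cheapest_pattern(rpo_minutes, rto_minutes):
    """Return the cheapest pattern whose assumed RPO and RTO fit the targets."""
    for pattern in PATTERNS:
        if (pattern.typical_rpo_minutes <= rpo_minutes
                and pattern.typical_rto_minutes <= rto_minutes):
            return pattern.name
    return None  # the targets are tighter than any pattern modeled here


print(cheapest_pattern(rpo_minutes=30, rto_minutes=120))  # pilot light
```

In practice the thresholds depend on the workload, so treat the table as a starting point for discussion rather than a rule.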

Designing Multi-AZ Architectures with EC2 and Load Balancing

Spreading Compute Across AZs

Run EC2 instances in at least two AZs and front them with an ALB or NLB that has subnets in multiple AZs. Use Auto Scaling across those AZs.

Eliminating Single-AZ Dependencies

Avoid single-AZ NAT gateways or load balancers. Use separate NAT gateways per AZ and multi-AZ ALBs/NLBs to keep outbound and inbound traffic working.

How It Looks

Visualize us-east-1 with two AZs. An ALB spans both. EC2 instances in an Auto Scaling group run in private subnets in each AZ. If one AZ fails, the other keeps serving.
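The picture above can be checked with a small model. This is a minimal sketch, assuming the load balancer can route to healthy instances in any surviving AZ (cross-zone load balancing); the function names are hypothetical and no AWS API is called.

```python
def surviving_capacity(instance_azs, failed_az):
    """Fraction of instances still running after one AZ fails."""
    if not instance_azs:
        return 0.0
    survivors = [az for az in instance_azs if az != failed_az]
    return len(survivors) / len(instance_azs)


def tolerates_single_az_loss(instance_azs, lb_azs):
    """True if both compute and the load balancer span at least two AZs.

    With only one AZ on either side, losing that AZ takes the tier down.
    """
    return len(set(instance_azs)) >= 2 and len(set(lb_azs)) >= 2


instances = ["us-east-1a", "us-east-1a", "us-east-1b", "us-east-1b"]
alb = ["us-east-1a", "us-east-1b"]

print(tolerates_single_az_loss(instances, alb))     # True
print(surviving_capacity(instances, "us-east-1a"))  # 0.5
```

Note that surviving an AZ failure is not the same as keeping full capacity: here half the fleet is gone until Auto Scaling replaces it in the remaining AZ.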

Exam Angle

Any design that claims AZ-level resilience but uses only one AZ for compute or LB is incorrect for requirements like "tolerate loss of a single AZ without downtime".

RDS Multi-AZ, Read Replicas, and Resiliency

RDS Multi-AZ

RDS Multi-AZ gives you a synchronous standby in another AZ. It is for high availability and automatic failover, not for read scaling.

RDS Read Replicas

Read replicas use asynchronous replication to scale reads. They can live in the same Region or another Region, but they can lag and risk some data loss.

Failover Behavior

With Multi-AZ, RDS automatically promotes the standby and keeps the same endpoint. With read replicas, promotion is manual and may lose recent writes.

Exam Traps

If the question is about availability and automatic failover, choose Multi-AZ. If it is about scaling reads or cross-Region DR with some lag, choose read replicas.
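The difference between synchronous and asynchronous replication can be made concrete with a toy model. The sketch below assumes a fixed replication lag in seconds, which is a simplification (real replica lag varies with load); lost_writes is a hypothetical helper, not an RDS API.

```python
def lost_writes(write_times, failure_time, replication_lag):
    """Writes committed on the primary but not yet on the replica at failure.

    A write made at time t reaches the replica at t + replication_lag.
    replication_lag = 0 models a synchronous standby.
    """
    return [
        t for t in write_times
        if t <= failure_time and t + replication_lag > failure_time
    ]


writes = [0, 10, 20, 30, 40, 50, 58]  # seconds

print(lost_writes(writes, failure_time=60, replication_lag=0))   # []
print(lost_writes(writes, failure_time=60, replication_lag=15))  # [50, 58]
```

With zero lag nothing is lost, which matches the Multi-AZ standby. With a 15-second lag, the writes from the last 15 seconds before the failure are missing after promotion, which is the data loss risk of a read replica.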

S3 Durability, Availability, and Replication (2026 View)

Durability vs Availability

S3 Standard is designed for 99.999999999% (11 nines) durability, with data stored across multiple AZs, and for 99.99% availability. Durability is about not losing data; availability is about being able to access it at a given moment.
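A quick calculation shows how different the two numbers are in practice. This sketch treats 11 nines as an annual per-object loss probability of 10^-11 and uses a 365-day year; both function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def max_downtime_minutes_per_year(availability_percent):
    """Downtime budget implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def expected_objects_lost_per_year(object_count, nines):
    """Expected annual object loss for a durability of the given nines."""
    return object_count * 10 ** (-nines)


downtime = max_downtime_minutes_per_year(99.99)
loss = expected_objects_lost_per_year(10_000_000, 11)

print(f"{downtime:.2f} minutes of downtime per year")      # 52.56
print(f"one object lost every {1 / loss:,.0f} years")      # 10,000
```

So 99.99% availability still allows about 53 minutes of inaccessibility per year, while 11 nines durability means that with ten million objects you would expect to lose one roughly every 10,000 years.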

S3 and AZ Failures

Inside a Region, S3 automatically stores your data across multiple AZs, so it already tolerates the loss of a single AZ without extra work.

S3 Replication Options

Use S3 Same-Region Replication (SRR) for copies within a Region, and Cross-Region Replication (CRR) for geo-redundancy and DR across Regions.

Compliance and Replication

For data residency, avoid CRR to other jurisdictions. For off-site backup requirements, CRR to another Region is a common solution.
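The residency rule above can be expressed as a guard that runs before any replication is configured. This is a sketch under stated assumptions: the allowed Region set, bucket ARN, and role ARN are placeholders, and the dictionary follows the shape of the ReplicationConfiguration argument to boto3's put_bucket_replication as commonly documented, so check it against the current API reference before use. Versioning must be enabled on both buckets for replication to work.

```python
def replication_type(source_region, dest_region, allowed_regions):
    """Return "SRR" or "CRR", refusing destinations outside allowed Regions."""
    if dest_region not in allowed_regions:
        raise ValueError(f"{dest_region} violates the data residency policy")
    return "SRR" if dest_region == source_region else "CRR"


def replication_configuration(role_arn, dest_bucket_arn):
    """Build a one-rule configuration that replicates every object."""
    return {
        "Role": role_arn,
        "Rules": [
            {
                "ID": "replicate-everything",
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {},  # empty filter applies the rule to the whole bucket
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": dest_bucket_arn},
            }
        ],
    }


# Hypothetical policy: data may only live in these two Regions.
allowed = {"eu-west-1", "eu-central-1"}

print(replication_type("eu-west-1", "eu-central-1", allowed))  # CRR
config = replication_configuration(
    "arn:aws:iam::123456789012:role/example-replication-role",  # placeholder
    "arn:aws:s3:::example-dr-bucket",                           # placeholder
)
```

The resulting config would be passed as ReplicationConfiguration alongside the source bucket name; the guard simply ensures a non-compliant destination is rejected before that call is made.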

Worked Example: Mapping RPO/RTO to Multi-AZ vs Multi-Region

EduCast Requirements

EduCast must survive an AZ failure, tolerate 30 minutes of data loss and 2 hours of downtime in a Region disaster, and keep data within the EU.

Within-Region Design

Use EC2 Auto Scaling and an ALB across 3 AZs, RDS Multi-AZ, and S3 Standard. This covers AZ-level high availability.

Cross-Region DR Design

Create an RDS cross-Region read replica and S3 CRR to eu-central-1, plus a minimal or small duplicate stack there (pilot light or warm standby).

Why It Works

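Asynchronous replication normally lags by far less than 30 minutes, so it can meet the 30-minute RPO as long as replica lag is monitored. Promoting the replica, scaling up the DR stack, and switching traffic fit within the 2-hour RTO. All Regions used are in the EU, which satisfies data residency.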
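The same check can be written as arithmetic: estimated RPO is the replication lag, and estimated RTO is the sum of the recovery steps. Every number below is an invented planning estimate for illustration, not a measurement or an AWS guarantee, and the class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DrEstimate:
    replication_lag_minutes: float
    recovery_steps_minutes: dict = field(default_factory=dict)

    @property
    def rto_minutes(self):
        """Total recovery time, assuming the steps run one after another."""
        return sum(self.recovery_steps_minutes.values())


def meets_objectives(estimate, rpo_target_minutes, rto_target_minutes):
    """Compare an estimate against the RPO and RTO targets."""
    return {
        "rpo_ok": estimate.replication_lag_minutes <= rpo_target_minutes,
        "rto_ok": estimate.rto_minutes <= rto_target_minutes,
    }


# Assumed figures for an EduCast-style design.
educast = DrEstimate(
    replication_lag_minutes=5,
    recovery_steps_minutes={
        "detect and declare disaster": 15,
        "promote cross-Region read replica": 15,
        "scale out DR stack": 45,
        "switch DNS": 10,
    },
)

print(educast.rto_minutes)                                   # 85
print(meets_objectives(educast, rpo_target_minutes=30,
                       rto_target_minutes=120))
```

Writing the steps out this way also shows where the RTO budget goes, which helps decide whether pilot light is enough or warm standby is needed to shorten the scale-out step.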

Multi-Region Architectures and Route 53 Routing Policies

Why Route 53 Matters

In multi-Region designs, Route 53 controls where users connect. It can direct traffic based on health, latency, geography, or weights.

Failover and Latency Routing

Failover routing sends traffic to a primary endpoint and switches to a secondary if health checks fail. Latency routing sends users to the lowest-latency Region.

Geo and Weighted Routing

Geolocation routing directs users by location, often for compliance. Weighted routing splits traffic by percentage, handy for gradual cutovers.

Exam Clues

Automatic DR Region DNS switch → failover routing. Lowest latency → latency-based. Gradual traffic shift → weighted. Region-by-region rules → geolocation.
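The decision logic behind three of these policies is simple enough to model directly. The sketch below is a toy simulation of how each policy picks an answer, not Route 53 itself: the endpoint names are placeholders and the functions are hypothetical.

```python
import random


def failover_answer(primary, secondary, primary_healthy):
    """Failover routing: use the primary unless its health check fails."""
    return primary if primary_healthy else secondary


def latency_answer(latency_ms_by_region):
    """Latency routing: pick the Region with the lowest measured latency."""
    return min(latency_ms_by_region, key=latency_ms_by_region.get)


def weighted_answer(weights, rng=random):
    """Weighted routing: pick an endpoint in proportion to its weight."""
    endpoints = list(weights)
    return rng.choices(endpoints, weights=[weights[e] for e in endpoints], k=1)[0]


print(failover_answer("eu-west-1", "eu-central-1", primary_healthy=False))
# eu-central-1
print(latency_answer({"eu-west-1": 40, "us-east-1": 110}))
# eu-west-1

# Gradual cutover: roughly 10% of lookups go to the new stack.
rng = random.Random(0)
sample = [weighted_answer({"old": 90, "new": 10}, rng) for _ in range(1000)]
print(sample.count("new") / len(sample))
```

Geolocation routing is omitted here because it is a straight lookup from the user's location to a configured endpoint.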

Design Challenge: Pick the Right DR Pattern

Use this thought exercise to practice mapping requirements to DR patterns and Multi-AZ vs multi-Region.

Scenario A

A payment processing API:

  • Cannot lose any confirmed transactions.
  • Must recover from a Region failure in under 5 minutes.
  • Compliance requires data to stay in one country.

  1. Which pattern fits best?
      • Backup and restore
      • Pilot light
      • Warm standby
      • Multi-site (active-active)
  2. Do you need Multi-AZ, multi-Region, or both?

Pause and decide before reading the guidance below.

Guidance

  • RPO = 0 (no data loss), RTO = 5 minutes. Backup/restore and pilot light are too slow and too lossy.
  • Warm standby might be tight; you would need very aggressive automation.
  • Multi-site (active-active) across two Regions, each spanning multiple AZs, with synchronous or near-synchronous replication, is the most realistic.
  • You still use Multi-AZ inside each Region for AZ failures.
  • Data residency constraint means you choose two Regions within the same country (where available) or reconsider requirements.

Scenario B (self-check)

An internal reporting tool:

  • Can lose up to 4 hours of data.
  • Can be down for 1 business day after a disaster.
  • Budget is tight.

Which pattern is likely enough? What AWS services would you use?

(Think: backup and restore with S3, AWS Backup, and occasional cross-Region copies.)

Quick Check 1: Multi-AZ and RDS

Test your understanding of Multi-AZ vs read replicas.

Your company runs a critical OLTP database on Amazon RDS for MySQL in us-east-1. The CTO says: "We must survive the loss of a single AZ with automatic failover and no code changes. Read scaling is not a concern right now." What is the MOST appropriate configuration?

  A) Create a cross-Region read replica in another Region and promote it during an AZ failure.
  B) Enable RDS Multi-AZ on the instance so AWS maintains a synchronous standby in another AZ.
  C) Create two read replicas in different AZs and configure the application to fail over between them.
  D) Move the database to a single large EC2 instance with EBS Multi-Attach volumes across AZs.

Answer: B) Enable RDS Multi-AZ on the instance so AWS maintains a synchronous standby in another AZ.

RDS Multi-AZ is designed for high availability within a Region. It maintains a synchronous standby in another AZ and performs automatic failover without code changes. Cross-Region read replicas are for DR and read scaling with manual promotion and possible data loss. Read replicas generally do not provide automatic failover, and EBS Multi-Attach does not span AZs.

Quick Check 2: S3 and DR Patterns

Check your understanding of S3 durability and DR.

You store daily database backups in an S3 bucket in ap-southeast-1. Management is worried about a full-Region outage and wants off-site backups with minimal changes. RPO of 24 hours and RTO of 24 hours are acceptable. What is the MOST cost-effective solution?

  A) Enable S3 Cross-Region Replication to another Region and keep using the current backup process.
  B) Migrate all backups to S3 One Zone-IA in the same Region to save costs.
  C) Move backups to EBS snapshots in the same Region and copy them manually when needed.
  D) Set up an RDS Multi-AZ configuration and stop taking S3 backups.

Answer: A) Enable S3 Cross-Region Replication to another Region and keep using the current backup process.

You already have daily backups on S3. To protect against a Region failure, enabling S3 Cross-Region Replication to a bucket in another Region adds geo-redundancy with minimal change. One Zone-IA is single-AZ and not suitable for Region DR. EBS snapshots in the same Region do not solve Region failure, and Multi-AZ only protects within the Region and does not replace backups.

Key Term Review

Flip through these flashcards to reinforce core concepts before moving on.

Availability Zone (AZ)
One or more discrete data centers with redundant power, networking, and connectivity in an AWS Region. AZs are physically separated to limit correlated failures and are key fault domains for high availability designs.
Region
A physical geographic area that contains multiple Availability Zones. Multi-Region designs protect against full-Region failures and help meet latency and regulatory requirements.
RPO (Recovery Point Objective)
The maximum acceptable amount of data loss measured in time. It answers: How far back in time can our data be rolled back after a disaster?
RTO (Recovery Time Objective)
The maximum acceptable time a system can be offline after a disaster before it must be restored to service.
RDS Multi-AZ
An RDS configuration that maintains a synchronous standby in another AZ within the same Region for automatic failover and high availability. It is not used for read scaling.
RDS Read Replica
An asynchronously replicated copy of an RDS database used primarily for read scaling and sometimes for DR. Can be in the same or a different Region and may lag behind the primary.
S3 Cross-Region Replication (CRR)
An S3 feature that automatically replicates objects from a source bucket in one Region to a destination bucket in another Region, used for geo-redundancy, DR, and compliance.
Backup and Restore DR Pattern
A low-cost DR strategy where data and configuration are regularly backed up and restored to rebuild the environment after a disaster. RPO and RTO are usually measured in hours.
Warm Standby DR Pattern
A DR strategy where a scaled-down but fully functional copy of the environment runs in another Region. During a disaster, it is scaled up and traffic is shifted. Provides lower RPO/RTO than backup and restore or pilot light.
Multi-site (Active-Active) DR Pattern
A DR strategy where full production capacity runs in two or more Regions simultaneously. Provides near-zero RPO and RTO but at higher cost and complexity.
Route 53 Failover Routing
A Route 53 policy where DNS traffic is sent to a primary endpoint and automatically switched to a secondary endpoint when health checks detect a failure.

Key Terms

Region
A geographic area that contains multiple Availability Zones. Multi-Region designs mitigate Region-level failures and address latency and regulatory needs.
Pilot light
A DR pattern where a minimal version of the environment (typically core databases and services) runs in the DR Region and is scaled out during a disaster.
Fault domain
A group of resources that can fail together, such as a rack, an Availability Zone, or a data center.
RDS Multi-AZ
An Amazon RDS deployment option that keeps a synchronous standby in another AZ within the same Region for automatic failover and high availability.
Warm standby
A DR pattern where a scaled-down but fully functional copy of the entire stack runs in another Region, ready to be scaled up during a disaster.
RDS read replica
An asynchronously replicated copy of an RDS database instance, used for read scaling and sometimes DR, which can be in the same or a different Region.
Availability Zone (AZ)
One or more discrete data centers with redundant power, networking, and connectivity in an AWS Region, used as independent fault domains.
Route 53 failover routing
A Route 53 routing policy that directs traffic to a primary endpoint and automatically fails over to a secondary endpoint when health checks fail.
Multi-site (active-active)
A DR pattern where full production runs in multiple Regions simultaneously, providing very low RPO and RTO.
RTO (Recovery Time Objective)
The maximum acceptable time a system can be offline after a failure before it must be restored.
RPO (Recovery Point Objective)
The maximum acceptable amount of data loss measured in time after a failure.
S3 Cross-Region Replication (CRR)
An S3 feature that automatically copies objects from a bucket in one Region to a bucket in another Region for geo-redundancy and DR.
