Chapter 10 of 26

Designing for Resilience: High Availability, Fault Tolerance, and the Reliability Pillar

When the question mentions SLAs, RTO, or RPO, you are in resilience territory; practice mapping these requirements to concrete AWS design patterns.

Reliability Pillar and Resilience: Why This Matters

Reliability in the Well-Architected Framework

The AWS Well-Architected Framework has six pillars: 1) Operational excellence, 2) Security, 3) Reliability, 4) Performance efficiency, 5) Cost optimization, 6) Sustainability.

Canonical Reliability Definition

Know this phrase: "The reliability pillar encompasses the ability of a workload to perform its intended function correctly and consistently when it’s expected to. This includes the ability to operate and test the workload through its total lifecycle."

Exam-Relevant Questions

For SAA-style questions, reliability means: can the workload keep running when something fails, can it recover fast enough, and does the design match the SLA or availability target?

From Business Terms to AWS Patterns

You will translate SLAs, RTO, and RPO into AWS design choices: Multi-AZ vs Multi-Region, active-active vs active-passive, and backup/restore vs pilot light vs warm standby.

High Availability vs Fault Tolerance (and Durability)

High Availability

High availability means the system is up most of the time (for example, 99.9% uptime). It may have brief outages, but it recovers quickly via redundancy and automated failover.
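
To make an availability percentage concrete, it helps to convert it into a yearly downtime budget. The sketch below is plain arithmetic; the function name is illustrative and the calculation assumes a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target}% -> {hours:.2f} hours/year ({hours * 60:.0f} minutes)")
```

At 99.9% the budget is roughly 8.76 hours per year; each extra nine cuts it by a factor of ten, which is why higher targets push designs toward more redundancy.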

Fault Tolerance

Fault tolerance means the system keeps operating without interruption even when components fail. Failures are masked from users, requiring real-time redundancy and no single point of failure.

Durability

Durability is about how likely your data is to still exist and be correct after failures. S3 Standard is designed for 99.999999999% (11 nines) durability, which is about not losing data, not about how quickly or reliably you can access it.

Exam Traps

Multi-AZ RDS is highly available but not fully fault tolerant, because failover causes a brief outage. S3 is highly durable, but storing data in S3 does not by itself make an application highly available. A single-AZ EC2+EBS setup is neither HA nor FT.

Availability Zones, Regions, and Failure Domains

Regions vs AZs

A Region is a geographic area like us-east-1. Within a Region, each Availability Zone consists of one or more discrete data centers with independent power and networking, and the AZs are connected to each other by low-latency links.

Failure Domains

Design around failure domains: AZs isolate data center failures; Regions isolate large-scale or geographic failures and support compliance and latency needs.

Multi-AZ vs Multi-Region

Multi-AZ protects against AZ failure inside a Region. Multi-Region protects against Region-wide issues and can improve global latency.
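
A rough way to see why spreading across failure domains helps is the textbook availability arithmetic: components in series multiply, while redundant copies only fail together. The sketch below assumes failures are independent, which is an idealization, and the 99% figure is an illustrative number, not an AWS SLA.

```python
from math import prod


def serial_availability(components: list[float]) -> float:
    """Availability of a chain where every component must be up."""
    return prod(components)


def redundant_availability(single: float, copies: int) -> float:
    """Availability of N independent copies where one surviving copy is enough."""
    return 1 - (1 - single) ** copies


one_az = 0.99  # assumed availability of a single-AZ deployment
print(f"1 AZ:  {redundant_availability(one_az, 1):.6f}")
print(f"2 AZs: {redundant_availability(one_az, 2):.6f}")
print(f"3 AZs: {redundant_availability(one_az, 3):.6f}")
```

The same formula applies one level up: treating each Region as a copy shows why Multi-Region raises the ceiling further, at the cost of running more infrastructure.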

Common AZ/Region Traps

All instances in one AZ behind an ALB are not Multi-AZ. EBS volumes are AZ-scoped. RDS Multi-AZ stays within one Region; Multi-Region needs extra features such as cross-Region read replicas or Aurora Global Database.
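
The first trap is easy to check mechanically: count how many distinct AZs the instances actually run in. This is a minimal sketch with hypothetical instance IDs; in a real account the mapping would come from an inventory query rather than a hard-coded dictionary.

```python
from collections import Counter


def az_spread(instance_azs: dict[str, str]) -> Counter:
    """Count instances per Availability Zone."""
    return Counter(instance_azs.values())


def is_multi_az(instance_azs: dict[str, str], minimum_azs: int = 2) -> bool:
    """True if the instances span at least `minimum_azs` distinct AZs."""
    return len(az_spread(instance_azs)) >= minimum_azs


# Hypothetical instance IDs mapped to the AZ each one runs in.
all_in_one_az = {"i-aaa": "us-east-1a", "i-bbb": "us-east-1a"}
spread = {"i-aaa": "us-east-1a", "i-bbb": "us-east-1b"}

print(is_multi_az(all_in_one_az))  # False
print(is_multi_az(spread))  # True
```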

Designing a Simple Highly Available Web App (Multi-AZ)

Scenario Requirements

Stateless web app, high availability in one Region, low cost, RTO under 5 minutes, near-zero RPO. We want a practical Multi-AZ design.

Network and Compute

Use one VPC with public subnets for ALBs and private subnets for EC2 and RDS, spread across at least two AZs. An Auto Scaling group runs EC2 instances in multiple AZs.

Load Balancing and Database

Place an Application Load Balancer in front, targeting the Multi-AZ Auto Scaling group. Use RDS Multi-AZ for the database so AWS maintains a synchronous standby in another AZ.

Failure Behavior

If an instance fails, the Auto Scaling group replaces it and the ALB routes only to healthy targets. If an AZ fails, the instances in the remaining AZ keep serving traffic, and RDS promotes the standby in that AZ after a brief failover (typically one to two minutes).
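
The failure behavior can be sketched as a toy simulation. This is not how an ALB is implemented; it only models the rule that traffic goes to healthy targets, using made-up instance IDs and a hypothetical `fail_az` helper.

```python
from dataclasses import dataclass


@dataclass
class Target:
    instance_id: str
    az: str
    healthy: bool = True


def routable_targets(targets: list[Target]) -> list[Target]:
    """Mimic a load balancer: only healthy targets receive traffic."""
    return [t for t in targets if t.healthy]


def fail_az(targets: list[Target], az: str) -> None:
    """Mark every target in the given AZ as unhealthy (simulated AZ outage)."""
    for t in targets:
        if t.az == az:
            t.healthy = False


fleet = [
    Target("i-1", "us-east-1a"),
    Target("i-2", "us-east-1a"),
    Target("i-3", "us-east-1b"),
    Target("i-4", "us-east-1b"),
]
fail_az(fleet, "us-east-1a")
survivors = routable_targets(fleet)
print([t.instance_id for t in survivors])  # ['i-3', 'i-4']
```

Note that the surviving AZ carries the full load until replacement instances launch, so capacity planning should account for losing one AZ's share of the fleet.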

Multi-Region Architectures: When and How

When to Use Multi-Region

Use Multi-Region for extreme availability SLAs, global low latency, or regulatory data residency. It protects against a full-Region outage.

Active-Passive Multi-Region

Primary Region serves traffic; secondary Region is warm or cold standby. Route 53 health checks and DNS failover shift traffic if the primary fails.
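
As an illustration of DNS failover, the sketch below builds a pair of failover records in the dictionary shape that the Route 53 ChangeResourceRecordSets API accepts. It only constructs the request body and makes no AWS call. The domain, IP addresses, and health check ID are placeholders, and a real ALB-backed setup would more likely use alias records than plain A records.

```python
def failover_record(name, ip_address, role, health_check_id=None, ttl=60):
    """Build one failover record change for a ChangeResourceRecordSets request."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{role.lower()}-region",  # must be unique per record
        "Failover": role,
        "TTL": ttl,  # a short TTL lets clients pick up a failover sooner
        "ResourceRecords": [{"Value": ip_address}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


change_batch = {
    "Comment": "Active-passive failover (illustrative values)",
    "Changes": [
        # Placeholder health check ID; a real one is issued by Route 53.
        failover_record("app.example.com.", "192.0.2.10", "PRIMARY",
                        health_check_id="hc-primary-placeholder"),
        failover_record("app.example.com.", "198.51.100.20", "SECONDARY"),
    ],
}
print(change_batch["Changes"][0]["ResourceRecordSet"]["Failover"])
```

With boto3, a batch like this would be passed as the `ChangeBatch` argument of `change_resource_record_sets` together with a hosted zone ID. The primary record carries the health check, so Route 53 answers with the secondary only when the primary is reported unhealthy.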

Active-Active Multi-Region

Multiple Regions serve users simultaneously. Route 53 latency-based or geolocation routing directs clients to a Region; the data layer uses services like DynamoDB global tables or Aurora Global Database.

Trade-offs and Exam Hint

Multi-Region improves availability and can reduce latency for global users, but it adds cost and complexity. Unless Region failure or global users are explicit, Multi-AZ single-Region is usually correct.

RTO, RPO, and Mapping to AWS Patterns

RTO and RPO Basics

RTO is the maximum time the system can be down before it must be restored. RPO is the maximum amount of data loss, measured in time, that is acceptable. Both are usually given in minutes or hours.
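
One way to reason about these two numbers: with periodic backups, the worst case is a failure just before the next backup, so the worst-case data loss equals the backup interval. The sketch below encodes that rule of thumb; the function names and example intervals are assumptions for illustration.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """A failure just before the next backup loses one full interval of data."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True if the worst-case data loss stays within the RPO."""
    return worst_case_data_loss(backup_interval) <= rpo


def meets_rto(estimated_recovery: timedelta, rto: timedelta) -> bool:
    """True if the estimated time to restore service stays within the RTO."""
    return estimated_recovery <= rto


rpo = timedelta(minutes=15)
print(meets_rpo(timedelta(hours=24), rpo))   # nightly backups: False
print(meets_rpo(timedelta(minutes=5), rpo))  # 5-minute log shipping: True
print(meets_rto(timedelta(minutes=45), timedelta(hours=1)))  # True
```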

Backup and Restore vs Pilot Light

Backup and restore relies on snapshots and backups and rebuilds the environment after a disaster, so it has the highest RTO and RPO (typically hours). Pilot light keeps minimal critical components always running in a DR Region for faster recovery.

Warm Standby and Multi-Site

Warm standby runs a scaled-down copy of prod in another Region, offering low RTO/RPO. Multi-site (active-active) runs full capacity in multiple Regions with near-zero RTO.

Pattern Selection by RTO/RPO

Hours/hours → backup and restore. Tens of minutes/minutes → pilot light or warm standby. Near-zero RTO and very low RPO → active-active Multi-Region.
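
The mapping above can be written as a small decision function. The exact cutoffs (one hour for "hours", one minute for "near-zero") are assumptions chosen for this sketch, not official thresholds, and the function only applies once cross-Region disaster recovery is in scope.

```python
from datetime import timedelta

NEAR_ZERO = timedelta(minutes=1)  # assumed cutoff for "near-zero" RTO
HOURS = timedelta(hours=1)        # assumed cutoff for "hours"


def suggest_dr_pattern(rto: timedelta, rpo: timedelta) -> str:
    """Suggest a DR pattern from RTO/RPO, using the stricter of the two."""
    if rto <= NEAR_ZERO:
        return "active-active Multi-Region"
    if min(rto, rpo) >= HOURS:
        return "backup and restore"
    return "pilot light or warm standby"


print(suggest_dr_pattern(timedelta(hours=8), timedelta(hours=24)))
print(suggest_dr_pattern(timedelta(minutes=15), timedelta(minutes=5)))
print(suggest_dr_pattern(timedelta(seconds=30), timedelta(seconds=5)))
```

Choosing between pilot light and warm standby inside the middle band is mostly a cost question: warm standby recovers faster because the application tier is already running, but it costs more to keep on.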

Thought Exercise: Matching RTO/RPO to Designs

Work through these scenarios and decide which AWS pattern best fits. Think like the exam: eliminate options that are obviously too weak or too expensive for the stated goals.

  1. Scenario A
  • A reporting system runs nightly batch jobs.
  • Business says: "If it is down for up to 8 hours, that's fine. We can tolerate losing a day of reports."
  • Question: Which DR strategy is sufficient?
  • Your reasoning: Note the very relaxed RTO and RPO.
  2. Scenario B
  • A financial trading dashboard must be restored within 15 minutes, and losing more than 5 minutes of data is unacceptable.
  • Users are in one Region. Cost matters, but availability is more important.
  • Question: Which DR pattern makes sense, and would you use Multi-Region?
  3. Scenario C
  • A global SaaS application with customers worldwide.
  • SLA promises 99.99% availability and explicitly mentions resilience to Region failure.
  • Question: What high-level architecture would you propose?

Pause and write down your answers before checking the guidance below.

Suggested answers (do not peek too early):

  • Scenario A: Backup and restore is enough. RTO 8 hours, RPO 24 hours is very tolerant.
  • Scenario B: Warm standby or a well-designed pilot light. RTO 15 minutes and RPO 5 minutes are tight enough to justify pre-provisioned resources and continuous replication. Start with Multi-AZ in the Region where the users are; place the standby in a second Region only if Region failure is a concern.
  • Scenario C: Active-active Multi-Region with Route 53 latency-based routing and a global data layer (for example, DynamoDB global tables or Aurora Global Database). Pilot light or warm standby is usually not enough to guarantee 99.99% plus Region resilience.

Reliability Pillar Best Practices on AWS

Reliability Definition Recap

Reliability is the ability of a workload to perform its intended function correctly and consistently when expected, including operating and testing through its lifecycle.

Foundations and Architecture

Use multiple AZs, understand quotas, and prefer managed services. Architect for automatic recovery, loose coupling, and stateless compute with managed data stores.

Change Management and Testing

Use infrastructure as code to recreate environments, test recovery procedures, and apply safe deployment strategies like blue/green or canary releases.

Monitoring and Automation

Rely on CloudWatch and CloudTrail for visibility and automate remediation using Systems Manager, Lambda, and EventBridge to improve resilience.
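
As a small illustration of automated remediation, the sketch below shows a Lambda-style handler that inspects an EventBridge event for an EC2 instance state change and decides what to do. The decision logic and return values are hypothetical; a real handler would call AWS APIs or notify an on-call channel instead of returning a string.

```python
def choose_remediation(event: dict) -> str:
    """Decide on an action for an EC2 state-change event (illustrative policy)."""
    detail = event.get("detail", {})
    state = detail.get("state")
    instance_id = detail.get("instance-id", "unknown")
    if event.get("source") == "aws.ec2" and state in ("stopped", "terminated"):
        return f"alert: {instance_id} is {state}; confirm Auto Scaling replaced it"
    return "no action"


def handler(event, context):
    """Lambda entry point: log and return the chosen action."""
    action = choose_remediation(event)
    print(action)
    return {"action": action}


# Sample event in the shape EventBridge delivers for EC2 state changes.
sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "terminated"},
}
result = handler(sample_event, None)
```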

Quick Check: HA vs Fault Tolerance vs Durability

Test your understanding of the core distinctions.

Which option best describes a fault-tolerant design in AWS?

  1. An application deployed in a single AZ with frequent EBS snapshots to S3.
  2. A Multi-AZ RDS instance that fails over to a standby in another AZ, causing a brief outage.
  3. An application running in multiple AZs behind an ALB, where loss of an AZ does not interrupt service because other AZs handle traffic seamlessly.
  4. A static website hosted on S3 with 99.999999999% durability for objects.

Answer: 3) An application running in multiple AZs behind an ALB, where loss of an AZ does not interrupt service because other AZs handle traffic seamlessly.

Fault tolerance means the system continues operating without interruption when components fail. Option 3 describes a design where multiple AZs can handle traffic seamlessly if one AZ fails. Option 2 is high availability (brief outage during RDS failover). Option 1 has durability but no HA or FT. Option 4 describes durability, not fault tolerance.

Quiz: Choosing a DR Strategy from RTO/RPO

Apply RTO/RPO reasoning to a scenario.

A healthcare records system must recover within 30 minutes after a Region-level disaster, with an RPO of 5 minutes. Cost is important but secondary to resilience. Which strategy is the BEST fit?

  1. Backup and restore using daily EBS and RDS snapshots stored in S3 in another Region.
  2. Pilot light in a second Region with continuously replicated database and minimal app servers, scaled up during disaster.
  3. Warm standby in a second Region with full-size infrastructure always running at the same capacity as primary.
  4. Single-Region Multi-AZ with RDS Multi-AZ and Auto Scaling across AZs.

Answer: 2) Pilot light in a second Region with continuously replicated database and minimal app servers, scaled up during disaster.

RTO 30 minutes and RPO 5 minutes for a Region-level disaster require pre-provisioned, replicated resources in another Region. A pilot light with continuous DB replication and minimal app servers that can be scaled up fits well and is more cost-effective than full warm standby. Backup and restore is too slow. Single-Region Multi-AZ does not protect against Region failure.

Key Term Flashcards: Resilience Essentials

Use these flashcards to reinforce the core vocabulary you will see in SAA-style questions.

High availability (HA)
Design approach that keeps a system operational for the maximum possible time, usually via redundancy and automated failover. Brief outages may occur during failover, but recovery is fast.
Fault tolerance (FT)
Ability of a system to continue operating without interruption when one or more components fail. Failures are masked from users through real-time redundancy and no single point of failure.
Durability
Likelihood that data remains intact and correct over time despite failures. Often expressed with many "nines" (for example, S3 Standard durability of 99.999999999%). It is about not losing data, not about availability.
RTO (Recovery Time Objective)
Maximum acceptable time that a system can be unavailable after a failure before it must be restored to operation.
RPO (Recovery Point Objective)
Maximum acceptable amount of data loss measured in time. It defines how far back in time data may be lost due to a failure.
Multi-AZ architecture
An AWS design that distributes resources (for example, EC2 instances, RDS) across multiple Availability Zones within a Region to protect against AZ-level failures.
Multi-Region architecture
Design where resources are deployed in more than one AWS Region, typically for Region-level resilience, global latency reduction, or regulatory requirements.
Backup and restore (DR pattern)
Disaster recovery strategy where data is regularly backed up and a new environment is created from backups after a disaster. Typically has higher RTO and RPO.
Pilot light (DR pattern)
Strategy where a minimal, critical version of the system runs in the DR Region at all times, allowing faster scale-up during a disaster and lower RTO/RPO than pure backup and restore.
Warm standby (DR pattern)
Disaster recovery approach where a scaled-down but fully functional copy of the production environment runs in another Region, ready to scale out quickly during a disaster.
Active-active Multi-Region
Architecture where multiple Regions serve production traffic simultaneously, often using Route 53 latency-based or geolocation routing and globally replicated data stores.

Design Walkthrough: Picking the Right AWS Pattern

Let’s simulate an exam-style design question. Read carefully and decide on a design before checking the guided reasoning.

Scenario:

  • A learning platform hosts video courses and quizzes.
  • Users are mainly in North America, with some in Europe.
  • SLA: 99.9% availability.
  • RTO: 1 hour for major failures.
  • RPO: 15 minutes.
  • Budget is limited; they want to avoid unnecessary complexity.

Your task: Propose an AWS architecture at a high level.

Pause and think: Do you need Multi-Region active-active? Pilot light? Or will Multi-AZ suffice?

Guided reasoning:

  1. SLA 99.9% and RTO 1 hour are moderate. Region failure resilience is not explicitly required. Multi-Region active-active is probably overkill.
  2. RPO 15 minutes suggests frequent backups or near-real-time replication, but not necessarily global multi-master databases.
  3. Users are mostly in one geography. A single Region with CloudFront for global edge delivery is usually enough.

Reasonable design:

  • Use a single Region with Multi-AZ: ALB + Auto Scaling group of EC2 or Fargate tasks across at least two AZs.
  • Store videos and static content in S3, fronted by CloudFront for caching in North America and Europe.
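  • Use RDS Multi-AZ (or Aurora with Multi-AZ) for quiz and user data; enable automated backups so that point-in-time recovery is available. RDS typically uploads transaction logs every 5 minutes, which keeps the latest restorable point well inside the 15-minute RPO.
  • Take regular EBS and RDS snapshots for additional protection, and copy them to another Region if compliance requires it.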

This design aligns with the stated RTO/RPO, meets 99.9% availability, leverages Multi-AZ, and avoids the complexity and cost of full Multi-Region active-active.

Key Terms

Region
A physical geographic area where AWS clusters data centers; each Region consists of multiple isolated Availability Zones.
Multi-AZ
Architecture that distributes resources across multiple Availability Zones within a Region to increase resilience to AZ-level failures.
Durability
Likelihood that data remains intact and correct over time despite failures; often expressed with many 'nines' of durability.
Pilot light
Disaster recovery pattern where a minimal, critical version of the system runs in the DR Region, enabling faster scale-up during a disaster.
Multi-Region
Architecture that deploys resources in more than one AWS Region for Region-level resilience, global latency reduction, or compliance.
Warm standby
Disaster recovery pattern where a scaled-down but fully functional copy of the production environment runs in another Region.
Backup and restore
Disaster recovery pattern where data is backed up and a new environment is created from backups after a disaster; typically higher RTO/RPO.
Reliability pillar
The reliability pillar encompasses the ability of a workload to perform its intended function correctly and consistently when it’s expected to. This includes the ability to operate and test the workload through its total lifecycle.
Fault tolerance (FT)
Ability of a system to keep operating without interruption when components fail, typically using real-time redundancy and no single point of failure.
Availability Zone (AZ)
One or more discrete data centers within an AWS Region, with independent power, cooling, and networking, connected by low-latency links.
High availability (HA)
Design that keeps a system running for the maximum possible time, usually using redundancy and automated failover. Short outages may occur during failover.
Active-active Multi-Region
Pattern where multiple Regions serve production traffic simultaneously, often using global routing and replicated data stores.
RTO (Recovery Time Objective)
Maximum acceptable time a system can be unavailable after a failure before it must be restored.
AWS Well-Architected Framework
The AWS Well-Architected Framework provides a consistent set of best practices for customers and partners to evaluate architectures, and a set of questions you can use to evaluate how well an architecture is aligned to AWS best practices.
RPO (Recovery Point Objective)
Maximum acceptable amount of data loss, measured in time, that can occur due to a failure.
