Skarp

Chapter 24 of 26

Reliability and Performance Efficiency Pillars in Practice

Reliability and performance efficiency are where your architecture either delights or frustrates users. This module applies these two pillars to refine designs and resolve trade-offs that often appear in multi-answer exam questions.

27 min read

Framing the Pillars: Reliability and Performance Efficiency

Two Pillars, One Goal

Reliability and performance efficiency decide whether your workload delights or frustrates users. In exams, they appear as design trade-offs and multi-answer questions.

Canonical Definitions

Reliability pillar: "The reliability pillar encompasses the ability of a workload to perform its intended function correctly and consistently when it’s expected to. This includes the ability to operate and test the workload through its total lifecycle."

Performance efficiency pillar: "The performance efficiency pillar focuses on the efficient use of computing resources to meet requirements and maintain that efficiency as demand changes and technologies evolve."

How They Show Up on the Exam

Expect questions on multi-AZ vs multi-Region, sync vs async flows, EC2 vs Fargate vs Lambda, caching, DB engines, and storage classes, often with multiple correct design improvements.

Guiding Questions

Keep asking: "If this component fails or slows, what happens to users?" and "Am I using the most efficient option that still meets requirements?" This mindset guides choices in this module.

Reliability in AWS: From Single AZ to Multi-Region

Fault Domains

Reliability starts with fault domains: instance/node, Availability Zone, and Region. Each is a scope where failures can occur and must be mitigated.

Multi-AZ Basics

Multi-AZ means running redundant resources in at least two AZs in one Region. Examples: RDS Multi-AZ, Auto Scaling groups across AZs, ALB with targets in multiple AZs.
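A rough back-of-the-envelope sketch shows why a second AZ helps. It assumes AZ failures are independent and that any one healthy copy can serve all traffic; the 99.5% figure is a made-up illustration, not an AWS SLA, and `combined_availability` is a hypothetical helper.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant deployments.

    Assumes failures are independent and that one healthy copy is enough
    to serve users. Real AZ failures are not perfectly independent, so
    treat the result as an optimistic estimate.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The workload is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} AZ(s): {combined_availability(0.995, n):.6f}")
```

Under these assumptions, each extra AZ multiplies the chance of a full outage by the single-AZ failure probability, which is why two AZs is the common exam default.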

When Multi-Region?

Multi-Region duplicates critical parts of the workload in a second Region. Use it for very low RPO/RTO or to protect against Region-wide disruptions.

Exam Cues

Data residency in one country? Usually single Region, multi-AZ. Need protection from Region failure? Multi-Region. Single-AZ ALB targets? Likely a reliability anti-pattern.

Reliability Goals

Across all designs, reliability emphasizes redundancy, automated recovery, and the ability to operate and test the workload through its lifecycle.

Designing for Failure and Graceful Degradation

Design for Failure

Assume components will fail. Reliability comes from architectures that still behave acceptably when pieces break or slow down.

Loose Coupling

Use SQS, SNS, EventBridge, or Kinesis between components so one side can buffer work if the other is unavailable, instead of crashing.
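The buffering idea can be sketched with an in-memory queue. `InMemoryQueue` is a hypothetical stand-in for a managed queue such as SQS, not the SQS API; the point is only that the producer keeps succeeding while the consumer is away.

```python
from collections import deque


class InMemoryQueue:
    """Hypothetical stand-in for a managed queue such as SQS."""

    def __init__(self):
        self._messages = deque()

    def send(self, body):
        self._messages.append(body)

    def receive(self):
        # Return the oldest message, or None when the queue is empty.
        return self._messages.popleft() if self._messages else None

    def __len__(self):
        return len(self._messages)


def accept_order(queue, order_id):
    """Producer: succeeds as long as the queue accepts the message."""
    queue.send({"order_id": order_id})
    return "accepted"


def drain(queue, handler):
    """Consumer: process everything that accumulated while it was away."""
    processed = []
    while (message := queue.receive()) is not None:
        processed.append(handler(message))
    return processed
```

If the consumer is down, `accept_order` still returns "accepted" and the backlog simply grows until `drain` runs again.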

Graceful Degradation

When a dependency fails, offer a reduced-quality experience instead of a hard error, such as falling back to cached or static content.

Key Patterns

Use idempotent operations, timeouts, retries with exponential backoff, and throttling or load shedding to keep the system stable under stress.
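A minimal sketch of retries with exponential backoff and jitter follows. `TransientError`, the default delays, and the injectable `sleep` and `rng` parameters are assumptions made for illustration. Retrying like this is only safe when the operation is idempotent.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def backoff_delays(attempts, base=0.1, cap=5.0):
    """Upper bound for each wait: base * 2**n, capped at `cap` seconds."""
    return [min(cap, base * (2 ** n)) for n in range(attempts)]


def call_with_retries(operation, max_attempts=4, base=0.1, cap=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call `operation` until it succeeds or attempts run out.

    Each wait is a random fraction of the capped exponential bound
    ("full jitter"), so many clients do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            bound = min(cap, base * (2 ** attempt))
            sleep(rng() * bound)
```

Passing `sleep` and `rng` as parameters is a design choice that keeps the function easy to test without real waiting.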

Exam Signal

Look for answer options that remove single points of failure, introduce queues, add safe retries, or provide fallback behaviors.

Auto Scaling and Health-Driven Recovery

Automated Recovery

Reliability improves when the system detects unhealthy resources and replaces them automatically, without waiting for humans.

Health Checks + Load Balancers

ALB and NLB perform health checks and stop routing traffic to unhealthy instances or IPs, isolating failures from users.
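Load balancer health checks typically flip a target's state only after several consecutive results. The sketch below simulates that behavior; the class name and threshold values are illustrative assumptions, not AWS defaults or the load balancer's real implementation.

```python
class TargetHealth:
    """Tracks one target using consecutive-result thresholds."""

    def __init__(self, healthy_threshold=3, unhealthy_threshold=2):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive results that disagree with the state

    def record(self, check_passed):
        """Record one health check result and return the current state."""
        if check_passed == self.healthy:
            self._streak = 0  # result agrees with the state: reset the streak
        else:
            self._streak += 1
            needed = (self.healthy_threshold if check_passed
                      else self.unhealthy_threshold)
            if self._streak >= needed:
                self.healthy = check_passed
                self._streak = 0
        return self.healthy


def routable(targets):
    """Names of targets that should currently receive traffic."""
    return [name for name, target in targets.items() if target.healthy]
```

One failed check does not remove a target; it takes `unhealthy_threshold` failures in a row, which avoids flapping on a single slow response.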

Auto Scaling Groups

EC2 Auto Scaling groups maintain desired capacity and replace failed instances. They can also scale based on metrics like CPU or request count.
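The "maintain desired capacity" behavior can be pictured as a reconcile step. This is a simplified simulation with hypothetical names and data shapes, not the EC2 Auto Scaling API; it only handles replacement and scale-out, not scale-in.

```python
def reconcile(instances, desired, azs, new_id):
    """Drop unhealthy instances, then launch replacements until `desired`.

    Each replacement goes to the AZ that currently has the fewest
    instances, which keeps the fleet balanced across AZs.
    """
    fleet = [inst for inst in instances if inst["healthy"]]
    while len(fleet) < desired:
        per_az = {az: sum(1 for inst in fleet if inst["az"] == az)
                  for az in azs}
        target_az = min(azs, key=lambda az: per_az[az])
        fleet.append({"id": new_id(), "az": target_az, "healthy": True})
    return fleet
```

For example, with one healthy instance in AZ "a", one unhealthy instance in AZ "b", and a desired capacity of 2, the sketch discards the unhealthy instance and launches its replacement in "b".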

Service Auto Scaling

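Application Auto Scaling scales services such as ECS and DynamoDB. For Lambda, reserved concurrency sets aside part of the account's concurrency pool for a critical function so other functions cannot starve it; the same setting also caps how far that function can scale.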

Exam Patterns

Designs with multi-AZ ASG behind an ALB are more reliable than single instances. Manual replacement of instances is a reliability red flag.

Performance Efficiency Across Compute, Storage, DB, and Network

Matching Resources to Workloads

Performance efficiency means picking compute, storage, DB, and networking options that match workload behavior and can adapt as demand changes.

Compute Choices

Pick EC2, containers, or Lambda based on control needs and traffic patterns. Then choose the right instance family and size, plus Auto Scaling.

Storage Optimization

Use S3 storage classes for access patterns and match EBS types (gp3, io2, st1) to IOPS or throughput needs for efficient performance.

Database and Caching

Select RDS, Aurora, DynamoDB, or ElastiCache based on data model and scale. Use read replicas, Aurora readers, or DAX to offload reads.

Network and Edge

Use CloudFront as a CDN, ALB or NLB for load balancing, and optionally Global Accelerator to improve global performance and routing.

Worked Scenario: Multi-AZ, Auto Scaling, and Caching

Scenario Setup

Web app on a single EC2 in us-east-1a with local MySQL. Users see slow performance at peak and downtime during maintenance. You must improve reliability and performance.

Identify Issues

Single instance and AZ are single points of failure. Local MySQL is hard to scale and maintain. No Auto Scaling leads to overload at peak.

Reliability Fixes

Move MySQL to RDS MySQL Multi-AZ. Put EC2 instances in a multi-AZ Auto Scaling group behind an Application Load Balancer.

Performance Fixes

Right-size instances and use scaling policies. Add ElastiCache for Redis to cache hot data and reduce database load.
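One common way to use the cache is the cache-aside pattern, sketched below. `TTLCache` is a hypothetical in-memory stand-in for Redis, and `get_product` and `load_from_db` are illustrative names, not part of the scenario.

```python
import time


class TTLCache:
    """In-memory stand-in for a cache such as Redis (hypothetical helper)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self._clock() + self._ttl)


def get_product(product_id, cache, load_from_db):
    """Cache-aside read: try the cache, fall back to the DB, then populate."""
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_from_db(product_id)
    cache.set(key, value)
    return value
```

Hot items are served from memory until the TTL expires, so repeated reads of the same product reach the database only once per TTL window.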

Choosing Answers

When limited to two choices, prioritize removing single points of failure and using managed services: ASG + ALB, and RDS Multi-AZ.

Thought Exercise: Graceful Degradation Design

Consider this design and mentally refactor it.

Current situation:

  • A mobile app calls an API hosted on API Gateway + Lambda.
  • The Lambda function calls three downstream services synchronously:
     1. A recommendations microservice (ECS service in a private subnet).
     2. A user profile service (DynamoDB table).
     3. A third-party payment gateway.
  • If any call fails or times out, the Lambda function returns HTTP 500.

Your task: Design for failure and graceful degradation.

Step-by-step reflection (pause and answer each before reading the next):

  1. Which calls are critical for the user action to succeed?
     • Payment is usually critical for checkout.
     • Recommendations are often non-critical.
  2. How could you degrade gracefully if recommendations fail?
     • Return a default list from S3 or cached data from ElastiCache.
     • Or skip recommendations entirely but still return the main page.
  3. How could you improve reliability of the profile service?
     • Use DynamoDB with on-demand capacity or auto scaling to avoid throttling.
     • Use retries with exponential backoff and short timeouts.
  4. How could you handle payment gateway issues?
     • Use a queue (SQS) for payment processing and show a “processing payment” status.
     • Implement idempotent payment operations to avoid double charges.
  5. What would you change in the Lambda logic?
     • Separate critical and non-critical failures.
     • Implement circuit breakers and fallbacks for recommendations.

Write down (or say out loud) a short description of your improved flow. Compare it against the patterns from earlier steps: queues, timeouts, retries, and fallbacks.
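For comparison, here is one possible shape for the improved Lambda logic. It is a sketch under stated assumptions: the downstream calls are passed in as plain functions, `CircuitBreaker` is a deliberately simple illustration, and `DEFAULT_RECOMMENDATIONS` is a made-up static fallback.

```python
class CircuitBreaker:
    """Opens after `max_failures` consecutive failures.

    Simplified: a production breaker would also half-open after a
    cool-down period to probe whether the dependency has recovered.
    """

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def is_open(self):
        return self.failures >= self.max_failures

    def call(self, func, fallback):
        if self.is_open:
            return fallback()  # skip the failing dependency entirely
        try:
            result = func()
        except Exception:
            self.failures += 1
            return fallback()
        self.failures = 0
        return result


DEFAULT_RECOMMENDATIONS = ["bestseller-1", "bestseller-2"]  # hypothetical


def handle_checkout(fetch_profile, fetch_recommendations,
                    enqueue_payment, breaker):
    """Critical failures propagate; non-critical ones degrade."""
    profile = fetch_profile()  # critical: errors surface to the caller
    recommendations = breaker.call(
        fetch_recommendations, lambda: DEFAULT_RECOMMENDATIONS)
    # Payment is queued rather than awaited, so a slow gateway does not
    # block the response; the consumer must process it idempotently.
    payment_ref = enqueue_payment(profile["user_id"])
    return {
        "status": 202,
        "payment": {"state": "processing", "ref": payment_ref},
        "recommendations": recommendations,
    }
```

A recommendations outage now yields a default list instead of HTTP 500, while a profile failure still fails the request because checkout cannot proceed without it.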

Capacity Planning, Right-Sizing, and Feedback Loops

Capacity vs Right-Sizing

Capacity planning estimates what you need. Right-sizing continuously adjusts instance types, storage classes, and throughput as you observe real usage.

Right-Sizing Compute

Start with a reasonable instance, watch CloudWatch metrics, then adjust size and use Auto Scaling with target tracking instead of permanent over-provisioning.
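The intuition behind target tracking can be shown with simple proportional arithmetic. This is a simplification for study purposes, not the exact algorithm the service uses; the function name and parameters are hypothetical.

```python
import math


def target_tracking_capacity(current, metric, target, min_size, max_size):
    """Estimate the capacity that would bring `metric` back to `target`.

    Assumes load spreads evenly, so the per-instance metric scales
    inversely with the number of instances.
    """
    if target <= 0:
        raise ValueError("target must be positive")
    wanted = math.ceil(current * metric / target)
    # Never go outside the group's configured bounds.
    return max(min_size, min(max_size, wanted))


if __name__ == "__main__":
    # 4 instances at 80% average CPU with a 50% target
    print(target_tracking_capacity(4, 80.0, 50.0, min_size=2, max_size=10))
```

With 4 instances at 80% CPU and a 50% target, the estimate is ceil(4 × 80 / 50) = 7 instances, which is the kind of adjustment permanent over-provisioning tries to avoid needing.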

Right-Sizing Storage & DB

Use S3 lifecycle rules to move cold data, tune EBS volume types and sizes, and choose DynamoDB on-demand or provisioned with auto scaling.
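As an illustration of a lifecycle rule, the sketch below builds a configuration dictionary in the shape accepted by the S3 lifecycle API. The helper name, the prefix, and the day counts are assumptions chosen for the example; pick transitions that match your real access patterns.

```python
def lifecycle_rule(prefix, ia_after_days, glacier_after_days):
    """Build a lifecycle configuration that tiers objects under `prefix`.

    Objects move to STANDARD_IA first and to GLACIER later, so the
    second threshold must be larger than the first.
    """
    if glacier_after_days <= ia_after_days:
        raise ValueError("glacier_after_days must exceed ia_after_days")
    return {
        "Rules": [
            {
                "ID": f"tier-{prefix.strip('/') or 'all'}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


if __name__ == "__main__":
    print(lifecycle_rule("logs/", 30, 90))
```

With boto3, a dictionary like this could be passed as `LifecycleConfiguration` to the S3 client's `put_bucket_lifecycle_configuration` call; that step is left out so the sketch runs without AWS credentials.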

Feedback Loops

Create feedback loops with CloudWatch dashboards and alarms, plus tracing tools, then feed insights into design changes like caching or queues.

Exam Alignment

Look for options that add monitoring and iterative tuning, rather than fixed capacity guesses. These support both reliability and performance efficiency.

Quiz 1: Reliability Patterns

Check your understanding of reliability concepts.

An application runs on EC2 instances behind an Application Load Balancer in a single Availability Zone. The database is on Amazon RDS with Multi-AZ enabled. Users occasionally experience downtime when the AZ has issues. What change MOST directly improves the application's reliability?

  A) Enable cross-zone load balancing on the ALB.
  B) Move the EC2 instances into an Auto Scaling group spanning multiple Availability Zones.
  C) Increase the size of the EC2 instances to handle more load.
  D) Add an ElastiCache cluster for frequently accessed data.

Answer: B) Move the EC2 instances into an Auto Scaling group spanning multiple Availability Zones.

The main reliability issue is that all EC2 instances are in a single AZ, so an AZ disruption causes downtime. Moving the EC2 instances into an Auto Scaling group across multiple AZs removes that single point of failure. Cross-zone load balancing, which is already enabled by default on an ALB, only affects how requests are distributed across registered targets; it does not fix the single-AZ placement. Larger instances and caching can improve performance, not AZ-level reliability.

Quiz 2: Performance Efficiency Choices

Apply performance efficiency to a design decision.

A video processing application runs on a fleet of EC2 instances that are idle most of the day but process thousands of short jobs during unpredictable spikes. The operations team struggles with over-provisioning and underutilization. Which change BEST aligns with the performance efficiency pillar?

  A) Replace EC2 instances with AWS Lambda functions triggered by S3 events when new videos are uploaded.
  B) Increase EC2 instance sizes so they can process spikes faster.
  C) Schedule EC2 instances to shut down at night using a cron job.
  D) Move the application to larger EBS volumes to improve disk throughput.

Answer: A) Replace EC2 instances with AWS Lambda functions triggered by S3 events when new videos are uploaded.

Lambda is a serverless, event-driven compute option that scales automatically with demand and avoids paying for idle capacity. This directly supports the performance efficiency pillar by efficiently using compute resources as demand changes. Larger instances and volumes may improve performance but worsen utilization. Scheduling shutdowns helps cost but not responsiveness to unpredictable spikes.
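A minimal sketch of such a function is shown below. The event shape follows the S3 notification format, while `process_video` is a hypothetical placeholder for the real work. This approach fits the "short jobs" in the question; a single Lambda invocation is capped at 15 minutes, so long transcodes would need a different compute option.

```python
from urllib.parse import unquote_plus


def process_video(bucket, key):
    """Placeholder for the real processing step (hypothetical)."""
    return f"processed s3://{bucket}/{key}"


def handler(event, context):
    """Lambda entry point: handle every object in the S3 notification."""
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        results.append(process_video(bucket, key))
    return {"processed": len(results), "results": results}
```

Each upload triggers its own invocation, so capacity follows the spike automatically and nothing runs while the bucket is quiet.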

Key Term Flashcards: Reliability and Performance Efficiency

Flip through these cards to reinforce core definitions and patterns.

Reliability pillar (canonical definition)
The reliability pillar encompasses the ability of a workload to perform its intended function correctly and consistently when it’s expected to. This includes the ability to operate and test the workload through its total lifecycle.
Performance efficiency pillar (canonical definition)
The performance efficiency pillar focuses on the efficient use of computing resources to meet requirements and maintain that efficiency as demand changes and technologies evolve.
Multi-AZ vs Multi-Region (exam cue)
Multi-AZ: protect against AZ failures within one Region, common default for high availability. Multi-Region: protect against Region failures or serve global users; used for strict RPO/RTO or global latency needs.
Graceful degradation
A design approach where, if a component or dependency fails, the system continues operating with reduced functionality or quality instead of a full outage.
Auto Scaling + health checks (reliability role)
Health checks detect unhealthy instances; Auto Scaling groups replace them automatically and keep instances spread across AZs, maintaining desired capacity without manual intervention.
Right-sizing
The ongoing process of adjusting resource types, sizes, and capacity (compute, storage, database) based on observed metrics to meet requirements efficiently.
Loose coupling
Architectural style where components interact via asynchronous or well-defined interfaces (queues, topics, APIs), reducing dependencies and improving fault tolerance.
Caching for performance efficiency
Using services like CloudFront, ElastiCache, or DynamoDB DAX to store frequently accessed data closer to users or applications, reducing latency and load on backends.

Key Terms

Caching
Storing frequently accessed data in a fast storage layer (such as ElastiCache or CloudFront edge locations) to reduce latency and backend load.
Multi-AZ
An architecture pattern where resources (such as EC2 instances or RDS DB instances) are deployed across multiple Availability Zones within a single Region to improve availability.
Health check
A mechanism used by load balancers and other services to determine whether a resource is functioning correctly and should receive traffic.
Multi-Region
An architecture pattern where workloads are deployed across multiple AWS Regions to improve resilience to Region-level failures or to serve global users with lower latency.
Right-sizing
The ongoing process of adjusting resource types, sizes, and capacity to match workload requirements based on observed metrics, improving both performance and efficiency.
Loose coupling
An architectural style where system components depend on each other as little as possible, typically via asynchronous messaging or well-defined APIs, improving reliability and scalability.
Reliability pillar
The reliability pillar encompasses the ability of a workload to perform its intended function correctly and consistently when it’s expected to. This includes the ability to operate and test the workload through its total lifecycle.
Graceful degradation
A design approach where a system continues to operate with reduced functionality or quality when some components fail, instead of experiencing a complete outage.
Auto Scaling group (ASG)
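A logical group of EC2 instances managed by Amazon EC2 Auto Scaling, which launches, terminates, and replaces instances to maintain desired capacity and meet demand. Other scalable targets, such as ECS services and DynamoDB tables, are scaled by the separate Application Auto Scaling service.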
Performance efficiency pillar
The performance efficiency pillar focuses on the efficient use of computing resources to meet requirements and maintain that efficiency as demand changes and technologies evolve.
