Chapter 12 of 26
Resilient Compute Architectures with Amazon EC2 and AWS Auto Scaling
Turn raw compute into resilient fleets by combining EC2, Auto Scaling, and load balancing to survive failures and traffic spikes automatically.
From Single EC2 Instance to Resilient Fleets
Why Single EC2 Is Fragile
A single EC2 instance is a single point of failure. If it crashes, receives a bad patch, or sits in an AZ that has an outage, your app goes down and your SLA is at risk.
Reliability Pillar Link
AWS Well-Architected’s Reliability pillar focuses on workloads performing correctly and consistently. For compute, that means treating instances as replaceable members of a fleet, not as hand-tended pets.
Core Resilient Pattern
Standard resilient compute stack: EC2 instances across multiple AZs, managed by an Auto Scaling group, behind an Elastic Load Balancer, with health checks.
Think in RTO/RPO
As you learn each feature, ask: how does this reduce downtime (recovery time objective, RTO) or data loss (recovery point objective, RPO) when an instance or AZ fails, or when traffic suddenly spikes?
Amazon EC2 Building Blocks for Resilience
EC2 Types and Families
Instance families: t (burstable), m (general purpose), c (compute optimized), r (memory optimized), p (GPU accelerated). Resilient fleets may mix instance types or combine Spot with On-Demand capacity.
Region, AZ, Subnet
Regions are geographic; AZs are isolated locations within a Region; each subnet lives in exactly one AZ. Subnet choice fixes the AZ of an instance.
Multi-AZ for HA
Exam hint: “highly available in a single Region” almost always implies Multi-AZ EC2 plus a load balancer, not just bigger instances in one AZ.
Stateless App Servers
Keep app servers stateless: move sessions and data to RDS, DynamoDB, ElastiCache, or S3. Then Auto Scaling can safely kill and replace instances.
Multi-AZ EC2 Architectures: The Core Resilience Pattern
What Multi-AZ Means
Multi-AZ EC2 architectures spread instances across at least two Availability Zones in one Region, reducing risk from an AZ-level failure.
Typical Web Tier Layout
Public subnets with an ALB in multiple AZs; private subnets in the same AZs; an Auto Scaling group launching EC2 into those private subnets.
Multi-AZ vs Multi-Region
Multi-AZ handles AZ failures with low complexity; Multi-Region addresses Region failures but is more complex and used for stricter SLAs.
Exam Gotchas
An ALB that spans multiple AZs but has targets in only one AZ is not truly Multi-AZ. Ensure your ASG uses subnets in at least two AZs.
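Because each subnet lives in exactly one AZ, the Multi-AZ check above reduces to counting the distinct AZs behind an ASG's subnets. The sketch below illustrates that idea in plain Python; the subnet IDs, AZ names, and function names are hypothetical and no AWS API is called.

```python
def azs_covered(subnet_to_az, subnet_ids):
    """Return the set of AZs reached by the given subnets."""
    return {subnet_to_az[s] for s in subnet_ids}


def is_multi_az(subnet_to_az, asg_subnet_ids):
    """True when the ASG's subnets span at least two AZs."""
    return len(azs_covered(subnet_to_az, asg_subnet_ids)) >= 2


# Hypothetical subnet-to-AZ mapping: each subnet maps to exactly one AZ.
SUBNET_TO_AZ = {
    "subnet-private-a": "us-east-1a",
    "subnet-private-a2": "us-east-1a",
    "subnet-private-b": "us-east-1b",
}

# Two subnets in two AZs: survives the loss of either AZ.
good = is_multi_az(SUBNET_TO_AZ, ["subnet-private-a", "subnet-private-b"])
# Two subnets in the same AZ: still a single-AZ deployment.
bad = is_multi_az(SUBNET_TO_AZ, ["subnet-private-a", "subnet-private-a2"])
```

Note that the second case uses two subnets yet fails the check, which is the trap the exam gotcha describes: the number of subnets does not matter, only the number of AZs they cover.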
AWS Auto Scaling Groups: Concepts and Lifecycle
What an ASG Manages
An Auto Scaling group manages a fleet of EC2 instances: min, desired, and max counts, plus which subnets/AZs they run in.
Launch Templates
Launch templates define AMI, instance type, security groups, and user data. The ASG uses them to create new instances consistently.
Instance Lifecycle
The ASG launches an instance, which boots, runs its user data, registers with the load balancer, and starts passing health checks. If the instance later fails its health checks, the ASG terminates and replaces it.
Health Checks for Self-Healing
Combine EC2 and ELB health checks so the ASG can detect both infrastructure and application failures and replace bad instances.
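The self-healing behavior can be pictured as a reconcile loop: drop unhealthy instances, then launch replacements until desired capacity is restored. The following is a simplified in-memory model of that idea, not the actual service logic; the instance records, AZ names, and the `reconcile` function are assumptions made for illustration.

```python
import itertools


def reconcile(instances, desired, azs, new_ids):
    """One pass: discard unhealthy instances, then add replacements
    in the least-populated AZ until desired capacity is met."""
    healthy = [i for i in instances if i["healthy"]]
    while len(healthy) < desired:
        counts = {az: sum(1 for i in healthy if i["az"] == az) for az in azs}
        target_az = min(azs, key=lambda az: counts[az])
        healthy.append({"id": next(new_ids), "az": target_az, "healthy": True})
    return healthy


# Hypothetical fleet: the instance in az-b has failed its health check.
fleet = [
    {"id": "i-aaa", "az": "az-a", "healthy": True},
    {"id": "i-bbb", "az": "az-b", "healthy": False},
]
new_ids = (f"i-new{n}" for n in itertools.count(1))
fleet = reconcile(fleet, desired=2, azs=["az-a", "az-b"], new_ids=new_ids)
```

In this model the replacement lands in `az-b`, so the fleet is back at two healthy instances spread over two AZs. Whether a failure is noticed at all depends on which health checks feed the loop, which is why the card recommends combining EC2 and ELB checks.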
Scaling Policies: Keeping Capacity in Sync With Load
Why Scaling Policies Matter
Scaling policies let your ASG adjust instance count with demand. Scaling out protects performance under load, scaling in controls cost, and the group never drops below its minimum size.
Target Tracking
Set a target metric (like 50% CPU); Auto Scaling adjusts capacity to keep the metric near that value. It is usually the easiest choice.
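To build intuition for target tracking, it can help to work the arithmetic by hand. The sketch below assumes a simple proportional model (capacity scales with the ratio of the observed metric to the target) clamped to the group's min and max; this is an approximation for study purposes, not the service's exact algorithm, and the function name is hypothetical.

```python
import math


def target_tracking_capacity(current_capacity, metric_value, target_value,
                             min_size, max_size):
    """Estimate the capacity that would bring the average metric
    back to the target, clamped to the group's min and max."""
    proposed = math.ceil(current_capacity * metric_value / target_value)
    return max(min_size, min(max_size, proposed))


# 4 instances averaging 75% CPU against a 50% target: scale out.
scale_out = target_tracking_capacity(4, 75, 50, min_size=2, max_size=6)
# 4 instances averaging 20% CPU: scale in, but not below the minimum.
scale_in = target_tracking_capacity(4, 20, 50, min_size=2, max_size=6)
# A very large spike is still capped at the maximum size.
capped = target_tracking_capacity(4, 100, 50, min_size=2, max_size=6)
```

Under this model the three cases yield 6, 2, and 6 instances respectively, which shows how the min and max bounds limit what any policy can do.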
Step and Scheduled Scaling
Step scaling applies different capacity adjustments depending on how far a metric has breached a threshold; scheduled scaling changes capacity at fixed times for predictable patterns.
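A step policy is essentially a lookup table from metric ranges to capacity changes. The sketch below models that table in plain Python; the thresholds, adjustments, and names are illustrative assumptions rather than recommended values or the service's own data format.

```python
# Hypothetical steps: (lower bound inclusive, upper bound exclusive, instances to add).
STEPS = [
    (60.0, 75.0, 1),   # mild breach: add 1 instance
    (75.0, 90.0, 2),   # larger breach: add 2 instances
    (90.0, None, 4),   # severe breach: add 4 instances (no upper bound)
]


def step_adjustment(metric_value, steps):
    """Return the capacity change for the step containing metric_value,
    or 0 when the metric is below every step."""
    for lower, upper, change in steps:
        if metric_value >= lower and (upper is None or metric_value < upper):
            return change
    return 0


adjustments = [step_adjustment(v, STEPS) for v in (50, 65, 75, 95)]
```

The bigger the breach, the bigger the response, which is the main difference from a single-threshold rule.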
Choosing the Right Policy
Predictable traffic → scheduled. Maintain a metric level → target tracking. Complex rules per threshold → step scaling.
Elastic Load Balancing: ALB vs NLB for Resilience
Role of ELB
Elastic Load Balancing spreads traffic across instances and stops sending requests to unhealthy ones, boosting availability.
ALB vs NLB
ALB is Layer 7 for HTTP/HTTPS with smart routing. NLB is Layer 4 for ultra-high performance, static IPs, and non-HTTP protocols.
Health Checks and Draining
Health checks detect bad targets. Deregistration delay lets connections drain before removing an instance from service.
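Load balancer health checks typically require several consecutive failures before marking a target unhealthy, and several consecutive successes before marking it healthy again. The sketch below models that behavior as a small state machine; the threshold values and the `track_health` function are assumptions for illustration, not defaults taken from this lesson.

```python
def track_health(results, healthy_threshold=3, unhealthy_threshold=2,
                 start_healthy=True):
    """Return the target's health state after each probe result.
    A state flips only after enough consecutive contrary results."""
    healthy = start_healthy
    streak = 0  # consecutive probes that contradict the current state
    states = []
    for passed in results:
        if passed == healthy:
            streak = 0
        else:
            streak += 1
            needed = unhealthy_threshold if healthy else healthy_threshold
            if streak >= needed:
                healthy = passed
                streak = 0
        states.append(healthy)
    return states


# True = probe passed, False = probe failed.
probes = [True, False, True, False, False, True, True, True]
states = track_health(probes)
```

A single failed probe does not remove the target; two in a row do, and it takes three consecutive passes to bring it back. Thresholds like these keep one slow response from ejecting a healthy instance.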
Exam Clues
HTTP routing and WAF → ALB. Need static IPs or raw TCP/UDP → NLB. Always pair with Multi-AZ targets for resilience.
Design Walkthrough: Resilient Web Tier With EC2, ALB, and Auto Scaling
Scenario Requirements
Single-Region web app, must survive an AZ failure, handle spiky traffic, and avoid overprovisioning. Classic exam-style setup.
Network Layout
Create a VPC with two public subnets for the ALB and two private subnets for EC2, placing one public and one private subnet in each of two AZs.
ALB and Health Checks
Deploy an internet-facing ALB in both public subnets, with HTTP/HTTPS listeners and a `/health` path for target health checks.
ASG Configuration
Use a launch template, choose the two private subnets, set min=2, desired=2, max=6, attach the ALB target group, and enable ELB health checks.
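The ASG settings from this walkthrough can be written down as a request payload. The sketch below builds a dictionary whose keys follow the parameter names of boto3's `create_auto_scaling_group` call as I understand them; the group name, template name, subnet IDs, target group ARN, and the 300-second grace period are all hypothetical placeholders, and nothing is sent to AWS.

```python
def build_asg_request(name, launch_template_name, private_subnet_ids,
                      target_group_arn):
    """Assemble ASG settings for the walkthrough: min=2, desired=2, max=6,
    two private subnets, ALB target group attached, ELB health checks on."""
    return {
        "AutoScalingGroupName": name,
        "LaunchTemplate": {
            "LaunchTemplateName": launch_template_name,
            "Version": "$Latest",
        },
        "MinSize": 2,
        "DesiredCapacity": 2,
        "MaxSize": 6,
        # Subnet IDs are passed as one comma-separated string.
        "VPCZoneIdentifier": ",".join(private_subnet_ids),
        "TargetGroupARNs": [target_group_arn],
        "HealthCheckType": "ELB",
        "HealthCheckGracePeriod": 300,  # assumed boot time allowance, in seconds
    }


# All identifiers below are hypothetical placeholders.
request = build_asg_request(
    name="web-asg",
    launch_template_name="web-template",
    private_subnet_ids=["subnet-private-a", "subnet-private-b"],
    target_group_arn="arn:aws:elasticloadbalancing:REGION:ACCOUNT:targetgroup/web/PLACEHOLDER",
)
```

With credentials and real identifiers in place, a payload like this could be passed as `boto3.client("autoscaling").create_auto_scaling_group(**request)`. Setting min=2 across two AZs is what lets the tier keep serving traffic if one AZ fails, while max=6 leaves headroom for spikes.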
Thought Exercise: Mapping RTO/RPO to EC2 Resilience Patterns
How to Use This Exercise
For each scenario, decide: single instance, Multi-AZ, or Multi-Region, and which EC2, ASG, and ELB features are essential.
Scenario A: Internal App
Internal finance app, business hours only, RTO 4h, RPO 24h. Think: how much Multi-AZ or Auto Scaling is really needed?
Scenario B: E-commerce
Public e-commerce with strict SLA, RTO 15 minutes for AZ failure. Likely needs robust Multi-AZ ALB+ASG and strong database HA.
Scenario C: Batch Jobs
Nightly batch processing with retry tolerance. Focus more on scalable fleets and job retries than strict uptime.
Resilience Patterns for Web, Application, and Batch Workloads
Web Frontend Pattern
Internet-facing ALB, Multi-AZ ASG of stateless EC2 instances, session data offloaded, scaling based on CPU or request count.
Application/API Tier
Internal ALB in private subnets, ASG across AZs, possibly NLB for non-HTTP protocols, often consumed by other services.
Batch Processing Pattern
Workers in an ASG consume from SQS or streams, scale based on queue depth, heavily use Spot with retries for resilience.
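Scaling on queue depth usually means sizing the fleet so that each worker has an acceptable backlog. The sketch below shows that calculation under an assumed per-worker backlog figure; the function name and numbers are hypothetical.

```python
import math


def workers_for_backlog(queue_depth, msgs_per_worker, min_size, max_size):
    """Size the worker fleet so each worker holds at most msgs_per_worker
    queued messages, within the group's min and max."""
    needed = math.ceil(queue_depth / msgs_per_worker)
    return max(min_size, min(max_size, needed))


idle = workers_for_backlog(0, 100, min_size=0, max_size=20)      # empty queue
busy = workers_for_backlog(950, 100, min_size=0, max_size=20)    # moderate backlog
flood = workers_for_backlog(5000, 100, min_size=0, max_size=20)  # capped at max
```

An empty queue lets the group scale in to zero, which suits Spot-heavy batch fleets: interrupted jobs return to the queue and are retried by whichever workers remain.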
Mapping to Pillars
Reliability: Multi-AZ and self-healing. Performance efficiency: right metrics for scaling. Cost optimization: Spot and scaling in.
Quiz 1: Auto Scaling and ELB Basics
Check your understanding of Auto Scaling groups and load balancers in resilient architectures.
You are designing a highly available web application in a single AWS Region. Which combination best ensures that the application remains available if one Availability Zone fails and that capacity adjusts automatically to traffic spikes?
- A single EC2 instance with an Elastic IP and a larger instance size
- An Auto Scaling group spanning subnets in two AZs, behind an Application Load Balancer with health checks
- Two EC2 instances in the same subnet with a Network Load Balancer and no Auto Scaling
- An Auto Scaling group in one AZ with a Classic Load Balancer and manual scaling
Answer: B) An Auto Scaling group spanning subnets in two AZs, behind an Application Load Balancer with health checks
The correct answer is the Auto Scaling group spanning subnets in two AZs, behind an ALB with health checks. This provides Multi-AZ high availability, self-healing, and automatic scaling. A single instance, instances in one subnet/AZ, or manual scaling do not meet the resilience and elasticity requirements.
Quiz 2: Choosing Scaling Policies and Patterns
Test how well you can map requirements to scaling policies and workload patterns.
A marketing site has predictable traffic peaks every weekday from 09:00 to 11:00 and low traffic at night. The business wants to minimize cost while keeping performance acceptable. Which Auto Scaling configuration is the BEST fit?
- Target tracking scaling on CPU utilization only
- Step scaling policies based on ALB request count
- Scheduled scaling to increase desired capacity before 09:00 and decrease it after 11:00
- No Auto Scaling; use a single large instance sized for peak load
Answer: C) Scheduled scaling to increase desired capacity before 09:00 and decrease it after 11:00
Scheduled scaling is ideal when traffic patterns are predictable. You can increase capacity before the known peak window and reduce it afterwards to save cost. Target tracking and step scaling react to metrics but do not exploit the known schedule as efficiently. A single large instance wastes resources and is not resilient.
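The scheduled scaling answer can be sketched as two recurring actions, one before the peak and one after it. The dictionary keys below follow boto3's `put_scheduled_update_group_action` parameter names as I understand them; the action names, capacities, the 15-minute lead time, and the function itself are assumptions for illustration, and nothing is sent to AWS.

```python
def weekday_peak_schedule(asg_name, peak_capacity, base_capacity,
                          time_zone="UTC"):
    """Build two scheduled actions: scale out shortly before 09:00 and
    scale in shortly after 11:00, Monday to Friday."""
    scale_out = {
        "AutoScalingGroupName": asg_name,
        "ScheduledActionName": "weekday-peak-start",
        "Recurrence": "45 8 * * 1-5",   # cron: 08:45, Monday to Friday
        "DesiredCapacity": peak_capacity,
        "TimeZone": time_zone,
    }
    scale_in = {
        "AutoScalingGroupName": asg_name,
        "ScheduledActionName": "weekday-peak-end",
        "Recurrence": "15 11 * * 1-5",  # cron: 11:15, Monday to Friday
        "DesiredCapacity": base_capacity,
        "TimeZone": time_zone,
    }
    return [scale_out, scale_in]


# Hypothetical group name and capacities.
actions = weekday_peak_schedule("marketing-asg", peak_capacity=6, base_capacity=2)
```

Scaling out a little before 09:00 gives new instances time to boot and pass health checks before the peak arrives, which a purely reactive policy cannot do.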
Key Term Review: EC2 Resilience and Scaling
Review these key concepts and terms for resilient EC2 architectures.
- Auto Scaling group (ASG): A logical group of EC2 instances, managed by Amazon EC2 Auto Scaling, that maintains a specified minimum, desired, and maximum capacity, replaces unhealthy instances, and optionally scales capacity automatically based on policies.
- Multi-AZ EC2 architecture: An EC2 deployment pattern where instances are distributed across at least two Availability Zones in a Region to improve availability and fault tolerance.
- Application Load Balancer (ALB): A Layer 7 load balancer that distributes HTTP/HTTPS and gRPC traffic, supports advanced routing (host/path-based), and integrates with target groups and health checks.
- Target tracking scaling policy: An Auto Scaling policy type where you define a target value for a metric (such as CPU utilization), and the ASG automatically adjusts capacity to keep the metric near that value.
- Scheduled scaling: An Auto Scaling feature that changes the minimum, maximum, or desired capacity of an Auto Scaling group at specific times based on a schedule.
- Stateless application server: An EC2-based application component that does not store user session or critical data locally, allowing instances to be freely terminated and replaced without data loss.
- Health check (ELB): A periodic test performed by a load balancer to determine whether a registered target is healthy and should receive traffic.
- Deregistration delay (connection draining): A load balancer setting that defines how long the load balancer waits before it finishes deregistering a target, so that in-flight requests on existing connections can complete.
- Spot Instance (in resilient fleets): A discounted EC2 capacity type that can be interrupted by AWS; often used in Auto Scaling groups for batch or fault-tolerant workloads to reduce cost.
- Network Load Balancer (NLB): A Layer 4 load balancer designed for extreme performance and low latency, supporting TCP, UDP, and TLS traffic with static IP addresses.