Skarp

Chapter 13 of 26

Resilient Data Architectures: Amazon S3, Amazon RDS, and Backup Strategies

Data is often the most critical asset; design S3 and RDS configurations that survive failures, protect against corruption, and meet strict availability targets.

Big Picture: Data Resilience on AWS

Why Data Resilience Matters

Compute can be recreated quickly; data often cannot. This module focuses on keeping your data safe and available using Amazon S3, Amazon RDS, and solid backup and DR strategies.

Link to Well-Architected

We lean heavily on the Reliability and Security pillars. Reliability is about workloads working correctly and consistently; Security is about protecting data, systems, and assets.

RPO and RTO

Every feature you learn maps to Recovery Point Objective (how much data you can lose) and Recovery Time Objective (how long you can be down). Knowing these helps you choose the right design.
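
To make the two objectives concrete, here is a minimal Python sketch that measures what one incident actually cost and compares it to the targets. The names `RecoveryResult` and `evaluate_recovery` and the three timestamps are illustrative assumptions, not part of any AWS API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryResult:
    achieved_rpo: timedelta  # writes made in this window were lost
    achieved_rto: timedelta  # the service was unavailable for this long
    meets_rpo: bool
    meets_rto: bool


def evaluate_recovery(last_good_copy: datetime, failure: datetime,
                      restored: datetime, rpo_target: timedelta,
                      rto_target: timedelta) -> RecoveryResult:
    """Compare one incident against the RPO and RTO targets."""
    achieved_rpo = failure - last_good_copy   # how much data was lost
    achieved_rto = restored - failure         # how long we were down
    return RecoveryResult(
        achieved_rpo,
        achieved_rto,
        achieved_rpo <= rpo_target,
        achieved_rto <= rto_target,
    )
```

If the last usable copy was taken 10 minutes before the failure and service returned 1 hour after it, a 15-minute RPO target is met while a 30-minute RTO target is missed.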

Exam Mindset

For each feature, ask: What failure does this protect against? What does it not protect against? Wrong answers on the exam often rely on blurring these boundaries.

Amazon S3 Durability, Availability, and Storage Classes

Durability vs Availability

Durability is about not losing bits over time (S3 is designed for 99.999999999%, or 11 nines). Availability is about being able to reach data right now (S3 Standard is designed for 99.99% availability).
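
The difference in scale between the two figures is easier to see with a little arithmetic. The sketch below assumes the availability figure applies over a 365-day year and treats 11 nines as an annual per-object figure; the function names are illustrative.

```python
from fractions import Fraction

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def max_downtime_minutes(availability_percent: str) -> float:
    """Minutes per year a service may be unreachable at this availability."""
    unavailable = 1 - Fraction(availability_percent) / 100
    return float(unavailable * MINUTES_PER_YEAR)


def expected_objects_lost_per_year(object_count: int, nines: int) -> float:
    """Expected yearly object loss at a durability of `nines` nines."""
    return float(object_count * Fraction(1, 10 ** nines))
```

At 99.99% availability the budget is about 52.6 minutes of unavailability per year. At 11 nines of durability, 10 million stored objects give an expected loss of 0.0001 objects per year, or one object every 10,000 years on average.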

Core S3 Classes

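S3 Standard is multi-AZ, designed for 11 nines of durability and 99.99% availability. Standard-IA is also multi-AZ with the same durability, but it is designed for 99.9% availability and charges per retrieval, so it suits infrequently accessed data. One Zone-IA is cheaper but stores data in a single AZ and is designed for 99.5% availability.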

Intelligent-Tiering and Glacier

S3 Intelligent-Tiering moves objects between frequent and infrequent tiers automatically. Glacier classes are for archive, trading fast access for very low storage cost.

Common Exam Traps

Cheaper S3 classes mostly impact availability, retrieval time, or AZ resilience, not durability. One Zone-IA is not multi-AZ and is a poor choice for mission-critical data.

Designing Resilient S3: Versioning, MFA Delete, and Protection Features

Durability vs Human Error

S3 protects against hardware failure, not bad writes or deletes. If your app overwrites data with garbage, S3 happily stores the garbage unless you enable extra protection.

Versioning and MFA Delete

Versioning keeps old versions so you can recover from deletes or overwrites. MFA Delete requires a second factor for two destructive actions: permanently deleting an object version and changing the bucket's versioning state (for example, suspending versioning).
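
As a rough sketch of how this looks in practice, the Python below turns on versioning and then "undeletes" an object by removing its delete marker. It assumes `s3` is a boto3 S3 client (for example `boto3.client("s3")`); the helper names are hypothetical, pagination is ignored, and with MFA Delete enabled the final call would also need an MFA token.

```python
def enable_versioning(s3, bucket: str) -> None:
    """Turn on versioning so overwrites and deletes keep older versions."""
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )


def undelete(s3, bucket: str, key: str) -> bool:
    """Remove the current delete marker for `key`, if there is one.

    In a versioned bucket a plain delete only adds a delete marker.
    Deleting that marker by its version ID makes the previous version
    current again. Returns True if a marker was removed.
    """
    listing = s3.list_object_versions(Bucket=bucket, Prefix=key)
    for marker in listing.get("DeleteMarkers", []):
        if marker["Key"] == key and marker["IsLatest"]:
            s3.delete_object(
                Bucket=bucket, Key=key, VersionId=marker["VersionId"]
            )
            return True
    return False
```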

S3 Object Lock

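Object Lock gives WORM (write once, read many) protection: object versions cannot be deleted or overwritten until their retention period ends. In governance mode, users with special permissions can still override the lock; in compliance mode nobody can, including the root user, which makes it the fit for strict regulations.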
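
A minimal sketch of setting a default retention rule follows. It assumes `s3` is a boto3 S3 client and that the bucket already has versioning and Object Lock enabled; the helper name `set_default_retention` is hypothetical.

```python
def set_default_retention(s3, bucket: str, mode: str, years: int) -> None:
    """Apply a default Object Lock retention to new object versions."""
    if mode not in ("GOVERNANCE", "COMPLIANCE"):
        raise ValueError("mode must be GOVERNANCE or COMPLIANCE")
    s3.put_object_lock_configuration(
        Bucket=bucket,
        ObjectLockConfiguration={
            "ObjectLockEnabled": "Enabled",
            "Rule": {"DefaultRetention": {"Mode": mode, "Years": years}},
        },
    )
```

Choosing `COMPLIANCE` is a one-way door for the locked versions: the retention cannot be shortened or removed until it expires, so it is worth testing with `GOVERNANCE` first.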

S3 Replication for DR

S3 Replication (Same-Region or Cross-Region) asynchronously copies objects and versions to another bucket; both buckets must have versioning enabled. For Region-level disasters, Cross-Region Replication is a key building block.

Example: S3 Design for a Critical Log Archive

Scenario Setup

A fintech startup must store 7 years of audit logs, protect them from tampering, and survive a Region outage. How should they design S3?

Primary Bucket Design

Use S3 Standard in eu-west-1 with versioning and Object Lock compliance mode. Add lifecycle rules to move older data to Glacier Deep Archive to save cost.
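
The lifecycle part of this design could look roughly like the sketch below. It assumes `s3` is a boto3 S3 client; the helper name, the rule ID, and the choice of how many days to wait are assumptions for illustration, since the scenario does not state them.

```python
def archive_after(s3, bucket: str, days: int) -> None:
    """Move every object to Glacier Deep Archive `days` after creation.

    Note: this call replaces any lifecycle configuration already on
    the bucket, so existing rules would need to be included too.
    """
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={
            "Rules": [{
                "ID": "archive-old-audit-logs",   # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},          # empty prefix = all objects
                "Transitions": [
                    {"Days": days, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }]
        },
    )
```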

Cross-Region Replication

Enable CRR to a bucket in eu-central-1, replicating all versions with Object Lock enabled there too. This provides Region-level disaster resilience.
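
A hedged sketch of the replication rule is shown below. It assumes `s3` is a boto3 S3 client for the source Region, that both buckets already have versioning enabled, and that an IAM role S3 can assume for replication exists; the helper name, rule ID, and ARNs passed in are hypothetical.

```python
def enable_crr(s3, source_bucket: str, dest_bucket_arn: str,
               role_arn: str) -> None:
    """Replicate all new objects in `source_bucket` to another bucket."""
    s3.put_bucket_replication(
        Bucket=source_bucket,
        ReplicationConfiguration={
            "Role": role_arn,
            "Rules": [{
                "ID": "replicate-all",             # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},          # empty prefix = all objects
                # Do not copy delete markers, so a delete in the source
                # does not hide the object in the replica.
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": dest_bucket_arn},
            }],
        },
    )
```

Replication applies to objects written after the rule is in place; objects that already exist are not copied by this rule alone.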

Resilience Outcomes

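RPO is low (replication is asynchronous, so the newest objects may not have reached the replica yet), RTO is low because the replica bucket is already online, and versioning plus Object Lock protect against deletion and tampering. This pattern is a classic exam answer.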

Amazon RDS High Availability: Multi-AZ vs Read Replicas

Two Different Tools

RDS Multi-AZ and read replicas are not interchangeable. Multi-AZ is for high availability; read replicas are for read scaling and sometimes DR.

RDS Multi-AZ

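Multi-AZ keeps a synchronous standby in another AZ. In the classic Multi-AZ instance deployment, the standby does not serve reads. If the primary fails, RDS promotes the standby and repoints the instance endpoint's DNS record at it, so the application keeps the same connection string.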

RDS Read Replicas

Read replicas use asynchronous replication. They offload read traffic and can be promoted for DR, but promotion is a manual or scripted step: RDS does not fail over to a read replica automatically the way it does with a Multi-AZ standby.

Exam Mapping

Think: Multi-AZ = HA in one Region; Read replicas = read scaling + cross-Region DR. Multi-AZ is not for read scaling; read replicas are not automatic HA.

Example: RDS HA and Scaling Design for an E-commerce App

Scenario Overview

An e-commerce app in us-east-1 needs no downtime during business hours, must handle read spikes, and wants a plan for a long us-east-1 outage.

HA with Multi-AZ

Use RDS Multi-AZ for the primary instance. The app uses one endpoint; RDS promotes the standby on AZ failure, giving low RTO within the Region.
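
For an instance that already exists, the change could be sketched as below. It assumes `rds` is a boto3 RDS client; the helper names and the instance identifier passed in are hypothetical.

```python
def enable_multi_az(rds, instance_id: str) -> None:
    """Convert an existing RDS instance to a Multi-AZ deployment."""
    rds.modify_db_instance(
        DBInstanceIdentifier=instance_id,
        MultiAZ=True,
        ApplyImmediately=True,  # otherwise waits for the maintenance window
    )


def writer_endpoint(rds, instance_id: str) -> str:
    """Return the single DNS name the application should connect to."""
    response = rds.describe_db_instances(DBInstanceIdentifier=instance_id)
    return response["DBInstances"][0]["Endpoint"]["Address"]
```

The application should store the endpoint name rather than an IP address, because the name is what keeps working after a failover.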

Read Scaling with Replicas

Add two read replicas in us-east-1. Route reporting and analytics queries to them to offload reads from the primary database.
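
Creating the replicas might look like this sketch, which assumes `rds` is a boto3 RDS client in us-east-1 and that automated backups are enabled on the source (RDS requires this for replicas). The helper name and the replica naming scheme are assumptions.

```python
def add_read_replicas(rds, source_id: str, count: int) -> list[str]:
    """Create `count` same-Region read replicas and return their IDs."""
    replica_ids = []
    for n in range(1, count + 1):
        replica_id = f"{source_id}-replica-{n}"  # hypothetical naming scheme
        rds.create_db_instance_read_replica(
            DBInstanceIdentifier=replica_id,
            SourceDBInstanceIdentifier=source_id,
        )
        replica_ids.append(replica_id)
    return replica_ids
```

Each replica gets its own endpoint, so the reporting and analytics tools need to be pointed at those endpoints explicitly.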

Cross-Region DR

Create a cross-Region read replica in us-west-2. In a regional disaster, promote it and repoint the app. Expect some data loss due to async replication.
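
The two halves of this plan can be sketched as follows. The code assumes `rds_dr` is a boto3 RDS client created for the DR Region (us-west-2 here) and that the source is identified by its ARN; the helper names are hypothetical, and encrypted sources would need extra parameters such as a KMS key in the DR Region.

```python
def create_cross_region_replica(rds_dr, source_arn: str,
                                replica_id: str) -> None:
    """Create a read replica in the DR Region from a source in another Region."""
    rds_dr.create_db_instance_read_replica(
        DBInstanceIdentifier=replica_id,
        SourceDBInstanceIdentifier=source_arn,  # ARN is required cross-Region
    )


def promote_for_dr(rds_dr, replica_id: str, backup_days: int = 7) -> None:
    """Turn the replica into a standalone, writable instance.

    Promotion is one-way: replication from the old primary stops, and
    any writes that had not yet replicated are lost.
    """
    rds_dr.promote_read_replica(
        DBInstanceIdentifier=replica_id,
        BackupRetentionPeriod=backup_days,  # turn on backups for the new primary
    )
```

After promotion the application still has to be repointed at the new instance's endpoint, for example through a DNS change.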

Backups, Snapshots, and Point-in-Time Recovery for RDS

Why Backups Still Matter

Multi-AZ and replicas help when infrastructure fails, but they happily replicate bad data. Backups are your safety net for corruption and mistakes.

Automated Backups

Set a retention period of 1 to 35 days to enable automated backups (0 disables them). RDS keeps daily snapshots and transaction logs so you can restore to any point within the window, typically up to the last five minutes.
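
As a small sketch, assuming `rds` is a boto3 RDS client and using a hypothetical helper name, the retention period can be set like this:

```python
def set_backup_retention(rds, instance_id: str, days: int) -> None:
    """Enable automated backups by setting a non-zero retention period."""
    if not 1 <= days <= 35:
        raise ValueError("retention must be between 1 and 35 days")
    rds.modify_db_instance(
        DBInstanceIdentifier=instance_id,
        BackupRetentionPeriod=days,
        ApplyImmediately=True,
    )
```

The retention period bounds how far back a point-in-time restore can reach, so it should be at least as long as the time it might take to notice a corruption.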

Manual Snapshots

You trigger manual snapshots, often before big changes. They persist until deleted and can be copied to other Regions for DR.
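
A rough sketch of both steps follows. It assumes `rds` is a boto3 RDS client in the source Region and `rds_dr` is one in the DR Region; the helper names and the snapshot naming scheme are hypothetical.

```python
def snapshot_before_change(rds, instance_id: str, label: str) -> str:
    """Take a manual snapshot, for example just before a schema migration."""
    snapshot_id = f"{instance_id}-{label}"  # hypothetical naming scheme
    rds.create_db_snapshot(
        DBSnapshotIdentifier=snapshot_id,
        DBInstanceIdentifier=instance_id,
    )
    return snapshot_id


def copy_snapshot_to_dr(rds_dr, source_snapshot_arn: str,
                        source_region: str, target_id: str) -> None:
    """Copy a snapshot into the DR Region (call with the DR Region client)."""
    rds_dr.copy_db_snapshot(
        SourceDBSnapshotIdentifier=source_snapshot_arn,
        TargetDBSnapshotIdentifier=target_id,
        SourceRegion=source_region,
    )
```

The copy is requested from the destination Region, which is why the second function takes the DR client and refers to the source snapshot by ARN.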

Point-in-Time Recovery

PITR lets you create a new instance at an exact time using automated backups. It does not roll back the existing DB; it creates a fresh one.
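
A minimal sketch of a point-in-time restore is shown below, assuming `rds` is a boto3 RDS client; the helper name and identifiers are hypothetical. Note the separate target identifier, which reflects that the restore produces a new instance.

```python
from datetime import datetime
from typing import Optional


def restore_to_time(rds, source_id: str, target_id: str,
                    restore_time: Optional[datetime] = None) -> dict:
    """Restore `source_id` into a NEW instance named `target_id`.

    With no `restore_time`, restore to the latest restorable time.
    Returns the parameters that were sent, for logging.
    """
    params = {
        "SourceDBInstanceIdentifier": source_id,
        "TargetDBInstanceIdentifier": target_id,
    }
    if restore_time is None:
        params["UseLatestRestorableTime"] = True
    else:
        params["RestoreTime"] = restore_time  # should be a UTC timestamp
    rds.restore_db_instance_to_point_in_time(**params)
    return params
```

To recover from a bad delete, you would pick a time just before it happened, restore, verify the data on the new instance, and then repoint the application or copy the missing rows back.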

Disaster Recovery Patterns on AWS

Four Named DR Patterns

Know these by name: backup and restore, pilot light, warm standby, and multi-site active-active. The exam expects you to match scenarios to them.

Backup and Restore

The cheapest pattern. Only backups exist in the DR Region. In a disaster, you rebuild the infrastructure and restore the data, leading to high RPO and high RTO.

Pilot Light and Warm Standby

Pilot light keeps only core pieces running; warm standby runs a full but smaller copy. Both give faster recovery than pure backup and restore.

Multi-Site Active-Active

Multiple Regions serve traffic all the time. This has near-zero RPO and RTO but is the most expensive and complex to operate.

Thought Exercise: Matching Requirements to DR Patterns

Work through these short scenarios and decide which DR pattern and key AWS features you would use. There is no single correct answer here, but compare your reasoning to the hints.

  1. Small SaaS startup
  • Budget is tight.
  • They can tolerate up to 24 hours of downtime in a regional disaster.
  • Data loss of up to 12 hours is acceptable.
  • What DR pattern? Which S3/RDS features?
  • Hint: This lines up with backup and restore. Think RDS snapshots copied cross-Region, plus S3 Cross-Region Replication or periodic backup jobs.
  2. News website
  • Must remain online during breaking news.
  • Global audience; latency matters.
  • Data (articles, comments) should not be lost.
  • What DR pattern? How would S3 and RDS be used?
  • Hint: Likely multi-site active-active. S3 with cross-Region replication, RDS with cross-Region read replicas or a global database solution, plus Route 53 for global routing.
  3. Back-office reporting system
  • Used only during business hours.
  • Outage of a few hours is acceptable, but they want quick, predictable recovery.
  • Data loss should be under 15 minutes.
  • What DR pattern? Which RDS settings?
  • Hint: This sounds like warm standby or pilot light with frequent replication, RDS cross-Region read replica promoted in disaster, and S3 cross-Region replication for input files.

Pause and actually decide your answers before reading the hints. Being able to justify why you chose a pattern is key for exam scenario questions.

Quiz 1: S3 and RDS Resilience Basics

Check your understanding of core distinctions.

Which combination best addresses a requirement to protect an RDS database from AZ failure AND from accidental data deletion with the smallest operational effort?

  A) Enable Multi-AZ on RDS and rely on that alone
  B) Enable Multi-AZ on RDS and automated backups with sufficient retention
  C) Create multiple read replicas in the same AZ and disable automated backups
  D) Take manual snapshots weekly and store application exports in S3 Standard-IA

Answer: B) Enable Multi-AZ on RDS and automated backups with sufficient retention

Multi-AZ protects against AZ/instance failure but not against logical deletion or corruption, which is replicated to the standby. Automated backups with retention enable point-in-time recovery to before the deletion. Read replicas and manual weekly snapshots alone leave large RPO gaps.

Quiz 2: DR Patterns and S3 Features

Another quick check, now on DR patterns and S3.

A company needs to ensure audit logs cannot be modified or deleted for 5 years and wants to survive a Region outage. Which is the MOST appropriate design?

  A) Store logs in S3 One Zone-IA with bucket versioning and daily lifecycle transitions to S3 Glacier Instant Retrieval
  B) Store logs in S3 Standard with Object Lock in compliance mode and Cross-Region Replication to another Region with Object Lock enabled
  C) Store logs in EBS volumes attached to EC2 instances and take daily EBS snapshots to another Region
  D) Store logs in S3 Glacier Deep Archive only, without versioning, and rely on S3 durability

Answer: B) Store logs in S3 Standard with Object Lock in compliance mode and Cross-Region Replication to another Region with Object Lock enabled

Object Lock in compliance mode enforces WORM retention. Using S3 Standard plus Cross-Region Replication to another Region with Object Lock provides both tamper resistance and regional resilience. One Zone-IA is single-AZ; EBS is not ideal for long-term log archive; Glacier-only without versioning or Object Lock does not protect against early deletion.

Key Term Review: S3, RDS, and DR

Flip these cards mentally and ensure you can recall each definition or distinction without looking.

Durability (in the context of S3)
The probability that data is not lost over time. S3 Standard provides 99.999999999% (11 9s) durability by redundantly storing objects across multiple devices in multiple AZs.
Availability (in the context of S3)
The percentage of time that data is accessible on demand. S3 Standard provides 99.99% availability in a given year.
RDS Multi-AZ deployment
An RDS configuration where AWS maintains a synchronous standby in another AZ for high availability and automatic failover. It is not used for read scaling.
RDS read replica
An asynchronously replicated copy of an RDS database used for read scaling and as a building block for disaster recovery. Failover to it is not automatic in the same way as Multi-AZ.
Point-in-time recovery (RDS)
The ability to restore a new RDS instance to an exact time within the automated backup retention window using snapshots and transaction logs.
Backup and restore DR pattern
A low-cost DR strategy where only backups are stored in the DR Region. Infrastructure is recreated and data restored after a disaster, leading to high RPO and RTO.
Pilot light DR pattern
A DR strategy where a minimal, core set of services is always running in the DR Region. In a disaster you scale up and deploy the rest, giving moderate RPO and RTO.
Warm standby DR pattern
A DR strategy where a scaled-down but fully functional copy of the production environment runs in the DR Region, ready to be scaled up during a disaster.
Multi-site (active-active) DR pattern
A DR strategy where multiple Regions run full production workloads simultaneously, sharing traffic. It offers near-zero RPO and RTO but at higher cost and complexity.
S3 Object Lock
A feature that enforces write-once-read-many (WORM) protection on S3 object versions using governance or compliance mode, preventing deletion or overwrite until retention expires.

Key Terms

RPO
Recovery Point Objective, the maximum acceptable age of data recovered from backup storage after an outage.
RTO
Recovery Time Objective, the targeted duration of time within which a system must be restored after an outage.
Amazon S3
An object storage service that offers high durability, scalability, and a range of storage classes for different access patterns and cost profiles.
Amazon RDS
Amazon Relational Database Service, a managed service for relational databases such as MySQL, PostgreSQL, MariaDB, Oracle, and SQL Server.
Durability
The probability that data remains intact and is not lost over time, even in the face of hardware failures.
Pilot light
A DR pattern where a minimal version of the environment is always running in the DR Region, to be scaled up during a disaster.
Availability
The proportion of time a system or service is operational and able to serve requests.
RDS Multi-AZ
An RDS deployment option that maintains a synchronous standby in a different Availability Zone for high availability and automatic failover.
Warm standby
A DR pattern where a fully functional but lower-capacity copy of the production environment runs in the DR Region, ready to be scaled up quickly.
S3 versioning
An S3 feature that keeps multiple variants of an object in the same bucket, enabling recovery from accidental deletes or overwrites.
S3 Object Lock
An S3 feature that enforces write-once-read-many semantics by preventing deletion or overwrite of object versions during a retention period.
RDS read replica
An asynchronously replicated RDS instance used primarily for read scaling and as a disaster recovery option when promoted.
Backup and restore
A DR strategy where data is backed up and later restored to new infrastructure after a disaster, typically with high RPO and RTO.
Manual snapshot (RDS)
A user-initiated backup of an RDS instance that persists until explicitly deleted and can be copied across Regions.
Disaster recovery (DR)
The set of policies, tools, and procedures to enable the recovery or continuation of vital technology infrastructure and systems after a disaster.
Automated backups (RDS)
RDS-managed daily snapshots and transaction logs that enable point-in-time recovery within a configured retention window.
Multi-site active-active
A DR pattern where multiple Regions run full production workloads at the same time, sharing traffic for very low RPO and RTO.
Point-in-time recovery (PITR)
Restoring a database to an exact time within a backup retention window, using base backups and transaction logs.
Recovery Time Objective (RTO)
The maximum acceptable duration of time that a system can be unavailable after a failure or disaster.
Cross-Region Replication (CRR)
An S3 feature that automatically copies objects from a bucket in one AWS Region to a bucket in another Region.
Recovery Point Objective (RPO)
The maximum acceptable amount of data loss measured in time (for example, 15 minutes of transactions).
