Chapter 24 of 26
Sustainability and Operational Excellence Considerations in AWS Architectures
Beyond passing the exam, modern designs must account for sustainability and operations. This chapter shows how those ideas intersect with cost, performance, and resilience in realistic scenarios.
Positioning Sustainability and Operational Excellence in AWS
Well-Architected Context
The AWS Well-Architected Framework has six pillars: Operational excellence, Security, Reliability, Performance efficiency, Cost optimization, Sustainability.
Canonical Sustainability Definition
You must know: "The sustainability pillar focuses on minimizing the environmental impacts of running cloud workloads by maximizing utilization and minimizing the resources required, and by reducing the energy required to deliver business value."
Why It Matters for SAA
Exam scenarios rarely say "sustainability" directly. Instead, they describe idle resources, overprovisioning, or manual operations and expect you to choose options that improve cost, efficiency, and environmental impact together.
Pillars Working Together
Right-sizing, automation, and efficient performance usually help cost optimization, performance efficiency, operational excellence, and sustainability without sacrificing reliability.
Operational Excellence Pillar: Why Ops Quality Drives Sustainability
Operational Excellence Focus
Operational excellence is about running workloads effectively using automation, clear procedures, and continuous improvement, so changes are safe, fast, and repeatable.
IaC and Sustainability
With CloudFormation or CDK, you can spin environments up and down easily, standardize efficient defaults, and avoid long-lived, idle dev/test resources.
Observability for Right-Sizing
CloudWatch metrics reveal underused instances, and X-Ray traces reveal inefficient code paths. Together they enable right-sizing that reduces cost and environmental impact.
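As a rough illustration of turning utilization metrics into right-sizing candidates, the sketch below flags instances whose average and peak CPU are both low. The thresholds, instance IDs, and sample values are hypothetical; in practice the samples would come from CloudWatch, and memory, network, and I/O deserve the same check before downsizing.

```python
from statistics import mean


def find_rightsizing_candidates(cpu_by_instance, avg_threshold=20.0, peak_threshold=50.0):
    """Return instance IDs whose average AND peak CPU are below the thresholds."""
    candidates = []
    for instance_id, cpu_samples in cpu_by_instance.items():
        if not cpu_samples:
            continue  # no data: do not guess, investigate separately
        if mean(cpu_samples) < avg_threshold and max(cpu_samples) < peak_threshold:
            candidates.append(instance_id)
    return sorted(candidates)


# Hypothetical CPU utilization samples in percent (for example, hourly averages).
samples = {
    "i-web-1": [8, 10, 12, 9, 40],
    "i-web-2": [55, 60, 70, 65, 80],
    "i-batch-1": [],
}

print(find_rightsizing_candidates(samples))
```

Checking the peak as well as the average avoids shrinking an instance that is idle most of the time but genuinely busy during spikes.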
Runbooks and Continuous Improvement
Runbooks plus regular reviews of metrics and cost data make cleanup and optimization routine, helping sustainability become an ongoing practice, not a one-off.
Linking Sustainability, Cost Optimization, and Performance Efficiency
Three Interlocking Pillars
Cost optimization, performance efficiency, and sustainability all push you toward using just enough resources, well utilized, to meet business requirements.
Maximizing Utilization
Higher utilization of fewer instances is cheaper and greener. Auto Scaling and right-sizing keep utilization healthy without hurting performance.
Minimizing Resources Required
Efficient code and managed services like Lambda, Fargate, and RDS reduce the raw infrastructure you need to run a workload.
Energy per Business Value
Think in terms of cost or energy per request or transaction. Designs that do more work per instance-hour improve both cost and sustainability.
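One way to make "cost per request" concrete is a small calculation like the sketch below. The hourly price and request volume are made-up numbers for illustration, not AWS pricing.

```python
def cost_per_million_requests(instance_count, hourly_price, requests_per_hour):
    """Fleet cost attributed to one million requests at the given hourly traffic."""
    if requests_per_hour <= 0:
        raise ValueError("requests_per_hour must be positive")
    fleet_cost_per_hour = instance_count * hourly_price
    return fleet_cost_per_hour / requests_per_hour * 1_000_000


# Hypothetical figures: same traffic served by four instances, then by two.
before = cost_per_million_requests(instance_count=4, hourly_price=0.10, requests_per_hour=200_000)
after = cost_per_million_requests(instance_count=2, hourly_price=0.10, requests_per_hour=200_000)

print(f"before: {before:.2f} per million requests")
print(f"after:  {after:.2f} per million requests")
```

Serving the same traffic with half the instance-hours halves the cost per request, and the same ratio is a reasonable proxy for energy per request when the instance type is unchanged.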
Compute Design Choices: EC2, Auto Scaling, and Serverless
Baseline Scenario
A web app runs on four m5.large EC2 instances behind an ALB. CPU averages 10%, with spikes to 40% during a weekly sale. This is overprovisioned and wasteful.
Option 1: Right-Size + Auto Scaling
Switch to smaller, efficient instances (for example, t4g.medium, which is Graviton-based and therefore requires that the application runs on Arm) and use an Auto Scaling group with a target tracking policy to keep average CPU around 50–60%.
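The sketch below shows two things under stated assumptions. The first function builds a dictionary in the shape that the EC2 Auto Scaling `PutScalingPolicy` API accepts for `TargetTrackingConfiguration`. The second is a simplified proportional estimate of how many instances a CPU target implies; it is not the service's actual algorithm, it assumes equally sized instances, and the minimum of 2 is an assumed floor for multi-AZ resilience.

```python
import math


def target_tracking_config(target_cpu_percent):
    """Build a TargetTrackingConfiguration dict for an average-CPU target."""
    return {
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": float(target_cpu_percent),
    }


def estimate_desired_capacity(current_instances, current_cpu, target_cpu,
                              min_size=2, max_size=8):
    """Simplified estimate: scale the fleet in proportion to CPU versus target."""
    raw = math.ceil(current_instances * current_cpu / target_cpu)
    return max(min_size, min(max_size, raw))


# Baseline scenario: four instances, 10% average CPU, 40% during the weekly sale.
print(target_tracking_config(55))
print("quiet week:", estimate_desired_capacity(4, current_cpu=10, target_cpu=55))
print("weekly sale:", estimate_desired_capacity(4, current_cpu=40, target_cpu=55))
```

Under these assumptions the fleet would sit at the minimum size most of the week and grow only for the sale, rather than running four instances all the time.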
Option 2: ECS on Fargate
Containerize the app and run it on Fargate. You pay per vCPU-second and GB-second, and AWS manages the underlying servers for efficiency.
Option 3: Partial Serverless
Serve static content from S3 + CloudFront and move bursty background tasks to Lambda, eliminating always-on capacity for intermittent work.
Exam Cues
Phrases like "spiky traffic", "low CPU utilization", or "ops team patching servers" usually hint at Auto Scaling, managed services, or serverless as better, greener options.
Data and Storage: RDS, S3 Storage Classes, and Lifecycle Policies
RDS Scenario
An app uses RDS PostgreSQL with a db.m5.large and 2 TB gp3. Only the last 30 days are queried often; older data is rarely accessed but must be retained for 7 years.
Right-Sizing the Database
Use CloudWatch and Performance Insights to check utilization. Downsize or switch to Aurora Serverless v2 if the workload is variable and often underutilized.
Tiering Cold Data to S3
Move old, rarely accessed data from RDS to S3 (for example, Parquet) and query it with Athena, reducing database storage and compute load.
S3 Storage Classes and Lifecycle
Keep hot data in S3 Standard, move infrequently accessed data to Standard-IA, then use lifecycle rules to transition older data to Glacier storage classes and delete it once the retention period ends.
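A lifecycle policy of this kind could be sketched as the dictionary below, which follows the shape of the S3 `PutBucketLifecycleConfiguration` request. The prefix, rule ID, and day counts are illustrative assumptions, and 7 years is approximated as 7 × 365 days, ignoring leap days.

```python
def build_lifecycle_rule(prefix, ia_after_days=30, glacier_after_days=90,
                         retention_days=7 * 365):
    """Build one lifecycle rule: Standard -> Standard-IA -> Glacier -> expire."""
    if not ia_after_days < glacier_after_days < retention_days:
        raise ValueError("expected ia_after_days < glacier_after_days < retention_days")
    return {
        "ID": f"tier-and-expire-{prefix.strip('/')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
        # Delete objects once the assumed 7-year retention period has passed.
        "Expiration": {"Days": retention_days},
    }


# "archive/" is a hypothetical prefix for data exported from the database.
lifecycle_configuration = {"Rules": [build_lifecycle_rule("archive/")]}
print(lifecycle_configuration)
```

With boto3 installed and credentials configured, a dictionary like this could be passed as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`; that call is left out here so the sketch runs without AWS access.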
Exam Cues for Storage
Phrases like "rarely accessed", "must retain", or "reduce DB costs" hint at S3 tiering, lifecycle policies, and right-sizing or serverless database options.
Maximizing Utilization and Minimizing Resources: Concrete Techniques
Right-Sizing and Scheduling
Use Compute Optimizer and CloudWatch to shrink overprovisioned instances, and schedule dev/test resources to stop outside working hours.
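To see how much a working-hours schedule can save, here is a minimal sketch. The 08:00 to 18:00, Monday to Friday window is an assumption, and the function only decides whether an instance should be running; actually starting and stopping it would be done by a scheduler of your choice.

```python
from datetime import datetime


def should_be_running(now, start_hour=8, end_hour=18):
    """True on weekdays from start_hour (inclusive) to end_hour (exclusive)."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and start_hour <= now.hour < end_hour


def weekly_hours_saved(start_hour=8, end_hour=18):
    """Hours per week an instance is stopped compared with running 24/7."""
    return 7 * 24 - 5 * (end_hour - start_hour)


print(should_be_running(datetime(2024, 1, 8, 10)))   # a Monday morning
print(should_be_running(datetime(2024, 1, 6, 10)))   # a Saturday morning
print(weekly_hours_saved(), "of 168 hours saved per week")
```

With the assumed schedule a dev/test instance runs 50 of 168 hours each week, so roughly 70% of its instance-hours disappear.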
Auto Scaling as Default
Apply Auto Scaling to EC2, DynamoDB, ECS, and Aurora where possible, keeping capacity tightly aligned to actual demand.
Serverless and Event-Driven
Lambda, Fargate, EventBridge, and SQS let compute run only when needed. Prefer events over polling to cut constant background work.
Efficient Data Access
Use caching and targeted queries to avoid unnecessary reads and transfers. Efficient data access means fewer I/O operations and less energy.
Managed Service Consolidation
Managed services like RDS, DynamoDB, and S3 run on infrastructure that AWS operates at high utilization, which is usually more efficient than many small, underused EC2 instances you manage yourself.
Operational Practices That Support Sustainable, Cost-Aware Designs
Tagging and Cost Allocation
Tag resources by environment, app, and owner so you can find idle or orphaned resources and hold teams accountable for their usage.
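A tag audit can be as simple as the sketch below. The required tag keys mirror the ones named above; the resource IDs and the inventory format are hypothetical stand-ins for whatever your inventory source returns.

```python
REQUIRED_TAGS = ("environment", "app", "owner")


def missing_tags(resources, required=REQUIRED_TAGS):
    """Map each resource ID to the required tag keys it lacks or leaves empty."""
    report = {}
    for resource in resources:
        tags = resource.get("tags", {})
        absent = [key for key in required if not tags.get(key)]
        if absent:
            report[resource["id"]] = absent
    return report


# Hypothetical inventory, e.g. assembled from a resource listing.
inventory = [
    {"id": "i-0001", "tags": {"environment": "dev", "app": "shop", "owner": "team-a"}},
    {"id": "vol-0002", "tags": {"environment": "dev"}},
    {"id": "snap-0003"},
]

print(missing_tags(inventory))
```

Resources that show up in this report are the ones nobody can be held accountable for, which makes them the first place to look for idle or orphaned capacity.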
Reviews and Game Days
Periodic metric and cost reviews, plus game days, reveal inefficiencies and validate that your automation and scaling behave as expected.
Automated Cleanup
Use Lambda or Step Functions to clean up temporary environments, demo stacks, and old snapshots after a set time, avoiding resource sprawl.
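The selection logic behind such a cleanup job might look like the sketch below. The 14-day limit, the `keep` flag, and the resource IDs are assumptions for illustration; a real job would list resources through the AWS APIs and delete only after this kind of filter has run.

```python
from datetime import datetime, timedelta, timezone


def select_expired(resources, now, max_age_days=14):
    """Return IDs of resources older than max_age_days that are not marked to keep."""
    cutoff = now - timedelta(days=max_age_days)
    return [
        r["id"]
        for r in resources
        if r["created"] < cutoff and not r.get("keep", False)
    ]


# Hypothetical snapshot inventory with timezone-aware creation times.
now = datetime(2024, 6, 30, tzinfo=timezone.utc)
snapshots = [
    {"id": "snap-old", "created": datetime(2024, 6, 1, tzinfo=timezone.utc)},
    {"id": "snap-new", "created": datetime(2024, 6, 25, tzinfo=timezone.utc)},
    {"id": "snap-audit", "created": datetime(2024, 1, 1, tzinfo=timezone.utc), "keep": True},
]

print(select_expired(snapshots, now))
```

Keeping the selection separate from the deletion makes it easy to run the job in a report-only mode first and to exempt anything under a retention requirement.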
Standardized Configurations
Systems Manager, AMIs, and container images keep configurations consistent and efficient, preventing drift to less sustainable setups.
Business-Level Metrics
Metrics like cost per request or CPU-seconds per transaction help you see when new features increase resource usage and need optimization.
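A business-level metric becomes actionable once it is compared between releases. The sketch below uses invented numbers and an assumed 10% tolerance to flag a release that spends noticeably more CPU per transaction than its baseline.

```python
def cpu_seconds_per_transaction(cpu_seconds, transactions):
    """CPU time consumed per completed transaction."""
    if transactions <= 0:
        raise ValueError("transactions must be positive")
    return cpu_seconds / transactions


def has_regressed(baseline, current, tolerance=0.10):
    """True if the current value exceeds the baseline by more than the tolerance."""
    return current > baseline * (1 + tolerance)


# Hypothetical measurements for two releases over comparable periods.
baseline = cpu_seconds_per_transaction(cpu_seconds=1200, transactions=10_000)
current = cpu_seconds_per_transaction(cpu_seconds=1800, transactions=12_000)

print(has_regressed(baseline, current))
```

Total CPU use alone would be misleading here because traffic also grew; normalizing by transactions shows that the new release does more work per unit of business value.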
Thought Exercise: Spot the Sustainable, Well-Operated Option
Work through this scenario mentally and decide which design best aligns sustainability, cost, performance, and operational excellence.
Scenario: A startup runs a multi-tier web app. Traffic is low most of the time but spikes heavily when they run social media campaigns. They currently use:
- 3 t3.medium EC2 instances in a single Availability Zone for the web tier
- 1 m5.large RDS MySQL instance
- Static assets served directly from the web servers
They complain about unexpected costs and occasional downtime during spikes.
Consider four possible improvements:
- Option A: Upgrade all EC2 instances to m5.large and the database to m5.xlarge to handle spikes.
- Option B: Put the EC2 instances in an Auto Scaling group across two AZs, fronted by an ALB; add S3 + CloudFront for static assets; keep the same instance sizes.
- Option C: Migrate the app to AWS Lambda and API Gateway immediately, leaving the database and static assets unchanged.
- Option D: Keep the current architecture but purchase 1-year Standard Reserved Instances for all EC2 and RDS instances.
Your task:
- Decide which single option best balances sustainability, cost, performance, and ops quality today.
- Explain to yourself why each of the other options is weaker from a sustainability and operational excellence perspective.
When you are ready, compare your reasoning to the guidance in the next slide.
Quiz 1: Canonical Definition and Pillar Relationships
Test your recall of the sustainability pillar definition and how it relates to other pillars.
Which option best matches the canonical definition of the sustainability pillar and its relationship to other pillars?
- The sustainability pillar focuses on minimizing the environmental impacts of running cloud workloads by maximizing utilization and minimizing the resources required, and by reducing the energy required to deliver business value. It often aligns with cost optimization and performance efficiency by encouraging right-sizing and reducing idle resources.
- The sustainability pillar focuses on ensuring workloads are always available in multiple Regions and AZs, regardless of cost, so that business value is delivered even during disasters.
- The sustainability pillar focuses on encrypting all data at rest and in transit to minimize the environmental impact of data breaches.
- The sustainability pillar focuses on maximizing performance by provisioning extra capacity so that resource utilization stays below 20% at all times.
Answer: A) The sustainability pillar focuses on minimizing the environmental impacts of running cloud workloads by maximizing utilization and minimizing the resources required, and by reducing the energy required to deliver business value. It often aligns with cost optimization and performance efficiency by encouraging right-sizing and reducing idle resources.
Option A states the canonical definition verbatim and correctly notes that sustainability aligns with cost optimization and performance efficiency by promoting right-sizing, higher utilization, and reduced idle resources. The other options confuse sustainability with reliability, security, or overprovisioned performance.
Quiz 2: Scenario-Based Design Choice
Apply what you have learned to a short scenario.
An analytics workload runs batch jobs nightly on a fleet of EC2 instances that are manually started and stopped by an engineer. Jobs sometimes run late, and instances are occasionally left running all weekend. Which change best improves sustainability, cost optimization, and operational excellence without hurting performance?
- Increase the instance sizes so jobs finish faster, then rely on engineers to remember to stop them.
- Move the workload to an Auto Scaling group with scheduled scaling or use AWS Batch to manage the fleet and job queue.
- Purchase 3-year Standard Reserved Instances for the current EC2 fleet to reduce cost.
- Run the jobs on a single very large EC2 instance that is always on, to simplify operations.
Answer: B) Move the workload to an Auto Scaling group with scheduled scaling or use AWS Batch to manage the fleet and job queue.
Using an Auto Scaling group with schedules, or AWS Batch, automates instance lifecycle and scaling. This reduces idle time, improves utilization, lowers cost, and enhances operational excellence. Increasing sizes or buying RIs preserves manual, error-prone operations and often increases idle capacity; a single always-on large instance is simple but wasteful.
Key Term and Concept Review
Flip through these cards to reinforce core definitions and relationships among pillars.
- AWS Well-Architected Framework pillars (list all six in order)
- Operational excellence, Security, Reliability, Performance efficiency, Cost optimization, Sustainability
- Sustainability pillar (canonical definition)
- The sustainability pillar focuses on minimizing the environmental impacts of running cloud workloads by maximizing utilization and minimizing the resources required, and by reducing the energy required to deliver business value.
- Cost optimization pillar (canonical definition)
- The cost optimization pillar includes the continual process of refinement and improvement of a system over its entire lifecycle to build and operate cost-aware systems that achieve business outcomes and minimize costs.
- Performance efficiency pillar (canonical definition)
- The performance efficiency pillar focuses on the efficient use of computing resources to meet requirements and maintain that efficiency as demand changes and technologies evolve.
- How does Auto Scaling support sustainability?
- Auto Scaling adjusts capacity to match demand, reducing idle resources and increasing utilization. This lowers cost and energy use while maintaining performance.
- Example of an operational practice that improves sustainability
- Using Infrastructure as Code and scheduled automation to spin up dev/test environments during work hours and tear them down afterward, avoiding long-lived idle resources.
- Exam cue: "CPU utilization is consistently below 10%" – what should you think?
- Right-sizing or moving to managed/serverless options. Low utilization suggests overprovisioning; reducing instance size or using Auto Scaling improves cost and sustainability.
- Why are serverless services often more sustainable?
- They run only when needed and share underlying infrastructure across many customers, allowing AWS to keep utilization high and reduce waste compared to many idle dedicated instances.
- How do S3 lifecycle policies relate to sustainability?
- They automatically transition data to colder, more efficient storage classes and eventually delete it when no longer needed, minimizing storage resources and energy use.
- What operational excellence practice helps prevent orphaned resources?
- Consistent tagging combined with automated cleanup scripts or workflows that identify and remove unused resources like unattached EBS volumes or stale snapshots.
Key Terms
- Serverless
- A cloud-native model where you build and run applications without managing servers. AWS automatically provisions, scales, and manages infrastructure (for example, AWS Lambda, Fargate).
- Auto Scaling
- A set of AWS capabilities (such as EC2 Auto Scaling and Application Auto Scaling) that automatically adjusts capacity to maintain steady, predictable performance at the lowest possible cost.
- Right-sizing
- The process of matching resource types and sizes (such as EC2 or RDS instances) to the actual usage and performance requirements, avoiding overprovisioning or underprovisioning.
- S3 Lifecycle policy
- Configuration rules on an S3 bucket that automatically transition objects between storage classes or expire (delete) them after a specified time.
- Sustainability pillar
- The sustainability pillar focuses on minimizing the environmental impacts of running cloud workloads by maximizing utilization and minimizing the resources required, and by reducing the energy required to deliver business value.
- Cost optimization pillar
- The cost optimization pillar includes the continual process of refinement and improvement of a system over its entire lifecycle to build and operate cost-aware systems that achieve business outcomes and minimize costs.
- Infrastructure as Code (IaC)
- The practice of defining and provisioning infrastructure using machine-readable configuration files or code (for example, AWS CloudFormation, AWS CDK), enabling repeatable, automated deployments.
- Operational excellence pillar
- A Well-Architected pillar that focuses on running and monitoring systems to deliver business value, and continually improving processes and procedures through automation, observability, and feedback.
- Performance efficiency pillar
- The performance efficiency pillar focuses on the efficient use of computing resources to meet requirements and maintain that efficiency as demand changes and technologies evolve.
- AWS Well-Architected Framework
- The AWS Well-Architected Framework provides a consistent set of best practices for customers and partners to evaluate architectures, and a set of questions you can use to evaluate how well an architecture is aligned to AWS best practices.