Chapter 25 of 26
Integrating Sustainability and Operational Excellence into AWS Architectures
While not separate exam domains, sustainability and operational excellence are increasingly relevant in real-world designs. This module shows how to weave these considerations into architectures without losing focus on core exam objectives.
Big Picture: Why Sustainability and Operational Excellence Matter
Why This Module
Sustainability and operational excellence now shape how modern AWS workloads are designed and run. They appear in exam scenarios even when not named explicitly.
Link to Other Pillars
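You have already seen Security, Reliability, Performance Efficiency, and Cost Optimization. Sustainability was added as the sixth pillar in 2021. Operational Excellence, one of the original pillars, ties them together through the processes, automation, and observability used to run every workload.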
Focus of This Module
We will: 1) define the Sustainability pillar, 2) spot patterns that reduce environmental impact, and 3) use automation and observability to support operational excellence.
Exam Angle
On the exam, the best answer often meets requirements while minimizing idle resources, using managed/serverless services, and automating operations to reduce waste.
The Sustainability Pillar: Canonical Definition and Scope
Canonical Definition
AWS: "The sustainability pillar focuses on minimizing the environmental impacts of running cloud workloads by maximizing utilization and minimizing the resources required, and by reducing the energy required to deliver business value."
What This Really Means
You aim to deliver the same or more business value with fewer resources and less energy: avoid idle capacity, avoid unnecessary work, and reduce data bloat.
Utilization and Waste
High utilization without harming performance or reliability is good. Idle EC2 instances, over-sized databases, or unused storage are both costly and less sustainable.
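To make "idle capacity" concrete, here is a minimal Python sketch that flags instances whose average CPU stays below a threshold. The `InstanceUsage` type, the 10% cutoff, and the instance IDs are hypothetical. In practice the samples would come from the CloudWatch `CPUUtilization` metric, and AWS Compute Optimizer offers managed right-sizing recommendations.

```python
from dataclasses import dataclass
from statistics import mean

# Illustrative cutoff (an assumption, not an AWS-defined value): an instance
# whose mean CPU over the window is below this percentage counts as idle.
IDLE_CPU_THRESHOLD = 10.0


@dataclass
class InstanceUsage:
    instance_id: str          # hypothetical IDs below; real ones look like i-0abc...
    cpu_samples: list[float]  # e.g. hourly average CPUUtilization values, in %


def find_idle(instances: list[InstanceUsage],
              threshold: float = IDLE_CPU_THRESHOLD) -> list[str]:
    """Return the IDs of instances whose mean CPU is below the threshold."""
    return [
        inst.instance_id
        for inst in instances
        if inst.cpu_samples and mean(inst.cpu_samples) < threshold
    ]


fleet = [
    InstanceUsage("i-web-1", [55.0, 62.0, 48.0]),
    InstanceUsage("i-batch-1", [2.0, 1.5, 3.0]),
    InstanceUsage("i-report-1", [0.5, 0.4, 0.6]),
]
print(find_idle(fleet))  # ['i-batch-1', 'i-report-1']
```

Flagged instances are candidates for downsizing, scheduling, or replacement with event-driven services, not for automatic termination.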
Patterns to Watch For
Right-sizing, serverless and managed services, auto-scaling, and data lifecycle policies are all sustainability-friendly patterns that also often save money.
Operational Excellence: What It Is and Why It Supports All Pillars
What Is Operational Excellence
Operational excellence is about how you organize, run, and evolve workloads: clear processes, automation, observability, and continuous improvement.
Key Practices
Use runbooks, automate repetitive tasks, define infrastructure as code, and instrument workloads with logs, metrics, and traces.
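As a small illustration of "infrastructure as code plus instrumentation", the sketch below builds a minimal CloudFormation template containing one CloudWatch alarm and prints it as JSON. The Auto Scaling group name `web-asg` and the 80% threshold are hypothetical; the resource type and property names follow the `AWS::CloudWatch::Alarm` resource.

```python
import json


def cpu_alarm_template(asg_name: str, threshold: float = 80.0) -> dict:
    """Build a minimal CloudFormation template with a single CPU alarm."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative CPU alarm defined as code",
        "Resources": {
            "HighCpuAlarm": {
                "Type": "AWS::CloudWatch::Alarm",
                "Properties": {
                    "AlarmDescription": "Average CPU above threshold for 10 minutes",
                    "Namespace": "AWS/EC2",
                    "MetricName": "CPUUtilization",
                    # Aggregate the metric across the whole Auto Scaling group.
                    "Dimensions": [{"Name": "AutoScalingGroupName", "Value": asg_name}],
                    "Statistic": "Average",
                    "Period": 300,            # seconds per datapoint
                    "EvaluationPeriods": 2,   # two 5-minute periods = 10 minutes
                    "Threshold": threshold,
                    "ComparisonOperator": "GreaterThanThreshold",
                },
            }
        },
    }


template = cpu_alarm_template("web-asg")
print(json.dumps(template, indent=2))
```

Saved to a file, a template like this can be deployed with `aws cloudformation deploy --template-file <file> --stack-name <name>`, so every environment gets the same alarm from version-controlled code.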
Supports All Pillars
Automation and observability improve security, reliability, performance efficiency, cost optimization, and sustainability by reducing errors and waste.
Exam Signals
Prefer answers that use infrastructure as code, built-in monitoring and alarms, automated scaling and backups, and documented operational procedures.
Designing for Efficient Compute Utilization
Scenario Setup
A retail web app with variable traffic needs high availability, security, cost-effectiveness, and good performance. How should we design its compute layer?
Option A: Fixed EC2 Fleet
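Two large EC2 instances in an Auto Scaling group (min 2, max 4) running 24/7, with deployments done manually over SSH. At least two large instances run even at the lowest traffic, so the fleet is over-provisioned, and manual deployments are error-prone.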
Option B: Fargate + CloudFront
Use ECS on Fargate behind an ALB, auto scale tasks, serve static content via S3 + CloudFront, and deploy via CI/CD pipelines.
Why B Is Better
B scales with demand, improves utilization, simplifies patching and deployments, and uses caching to reduce repeated compute and network load.
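The utilization gain from scaling with demand can be sketched with simple arithmetic. The Python below uses a simplified proportional model (an assumption, not the exact algorithm ECS Service Auto Scaling uses): it computes the tasks needed per time block and compares the total with a fleet held at peak size all day. The demand profile and per-task throughput are hypothetical, and `min_tasks=2` keeps two tasks running for high availability.

```python
import math


def desired_tasks(demand_rps: float, rps_per_task: float,
                  min_tasks: int = 2, max_tasks: int = 20) -> int:
    """Tasks needed to serve demand at the target load, clamped to [min, max]."""
    needed = math.ceil(demand_rps / rps_per_task)
    return max(min_tasks, min(max_tasks, needed))


# Hypothetical demand (requests/second) for six 4-hour blocks of one day.
demand = [20, 10, 150, 400, 350, 80]
RPS_PER_TASK = 100  # assumed throughput of one task at the target utilization

tasks = [desired_tasks(d, RPS_PER_TASK) for d in demand]
fixed_fleet = max(tasks) * len(demand)  # capacity held at peak all day

print(tasks)  # [2, 2, 2, 4, 4, 2]
print(f"scaled: {sum(tasks)} task-blocks, fixed at peak: {fixed_fleet} task-blocks")
```

Under these assumptions the scaled service provisions 16 task-blocks instead of 24, a third less capacity for the same traffic.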
Data and Storage: Lifecycle, Tiering, and Deletion
Why Data Management Matters
Unmanaged data grows endlessly, driving up storage, backup, and compute costs. Sustainable designs actively manage data through its lifecycle.
S3 Lifecycle Policies
Transition infrequently accessed objects to the S3 Standard-IA or S3 Glacier storage classes (or let S3 Intelligent-Tiering move them automatically), and expire data automatically after defined retention periods.
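A lifecycle policy is just configuration. This Python sketch builds a rule in the shape accepted by the `LifecycleConfiguration` argument of boto3's `put_bucket_lifecycle_configuration`. The prefix and day counts are illustrative assumptions; `GLACIER` is the lifecycle name for S3 Glacier Flexible Retrieval.

```python
import json


def log_lifecycle(prefix: str, ia_days: int = 30, glacier_days: int = 90,
                  expire_days: int = 365) -> dict:
    """Lifecycle rule: Standard -> Standard-IA -> Glacier Flexible Retrieval -> delete."""
    if ia_days < 30:
        # S3 does not transition objects to Standard-IA until they are 30 days old.
        raise ValueError("ia_days must be at least 30")
    if not ia_days < glacier_days < expire_days:
        raise ValueError("expected ia_days < glacier_days < expire_days")
    return {
        "Rules": [
            {
                "ID": "tier-then-expire-" + prefix.strip("/").replace("/", "-"),
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_days, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": expire_days},  # delete after retention ends
            }
        ]
    }


config = log_lifecycle("logs/app/")
print(json.dumps(config, indent=2))
```

With boto3 installed, the result could be applied as `boto3.client("s3").put_bucket_lifecycle_configuration(Bucket="my-log-bucket", LifecycleConfiguration=config)`, where the bucket name is hypothetical.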
Right-Sizing Databases
Choose the smallest RDS or Aurora instance class that meets measured needs, or use Aurora Serverless v2 for variable workloads so capacity scales with load instead of sitting idle.
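Right-sizing reduces to "pick the smallest class that covers measured peak plus headroom". The Python sketch below assumes a small illustrative catalog of memory-optimized Graviton classes and a 20% headroom factor; check current specs and availability for your engine and Region. Peak values would come from RDS CloudWatch metrics such as `CPUUtilization` and `FreeableMemory`.

```python
from typing import NamedTuple


class DbClass(NamedTuple):
    name: str
    vcpu: int
    memory_gib: int


# Illustrative catalog, not an exhaustive or authoritative list.
CATALOG = [
    DbClass("db.r6g.large", 2, 16),
    DbClass("db.r6g.xlarge", 4, 32),
    DbClass("db.r6g.2xlarge", 8, 64),
    DbClass("db.r6g.4xlarge", 16, 128),
]


def smallest_fit(peak_vcpu: float, peak_memory_gib: float,
                 headroom: float = 0.2) -> DbClass:
    """Return the smallest catalog class covering peak usage plus headroom."""
    need_cpu = peak_vcpu * (1 + headroom)
    need_mem = peak_memory_gib * (1 + headroom)
    for cls in sorted(CATALOG, key=lambda c: (c.vcpu, c.memory_gib)):
        if cls.vcpu >= need_cpu and cls.memory_gib >= need_mem:
            return cls
    raise ValueError("no class fits; consider scaling out or Aurora Serverless v2")


print(smallest_fit(peak_vcpu=2.5, peak_memory_gib=20).name)  # db.r6g.xlarge
```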
Retention, Compression, Partitioning
Set explicit log/backup retention, use AWS Backup lifecycles, and for analytics use compressed columnar formats and partitions to scan less data.
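Why compression, columnar formats, and partitions cut work: a query engine such as Athena reads only the partitions and columns it needs. This back-of-envelope Python sketch assumes data is spread evenly across partitions and columns (real data rarely is), and the sizes and compression ratio are hypothetical.

```python
def estimated_scan_gb(raw_gb: float, compression_ratio: float,
                      partitions_total: int, partitions_read: int,
                      columns_total: int, columns_read: int) -> float:
    """Rough bytes-scanned estimate under a uniform-distribution assumption."""
    stored_gb = raw_gb / compression_ratio                  # compressed columnar size
    partition_share = partitions_read / partitions_total    # partition pruning
    column_share = columns_read / columns_total             # column projection
    return stored_gb * partition_share * column_share


# Hypothetical: 2,000 GB raw, 4x compression, daily partitions for one year,
# a query over the last 7 days that touches 3 of 30 columns.
full_scan = 2000.0  # uncompressed row format with no partitions: read everything
pruned = estimated_scan_gb(2000.0, 4.0, 365, 7, 30, 3)
print(f"full scan: {full_scan:.0f} GB, partitioned Parquet: {pruned:.2f} GB")
```

Under these assumptions the query reads under 1 GB instead of 2,000 GB, which means less compute, less energy, and, with per-data-scanned pricing, lower cost.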
Automation and Observability as Enablers of Excellence
Role of Automation
Automation via IaC, CI/CD, and Systems Manager reduces manual work and errors, and makes it easy to roll out consistent changes across environments.
Automated Operations
Use Systems Manager Automation, Patch Manager, and Lambda triggered by events to handle patching, backups, and simple remediation automatically.
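Event-driven remediation can be sketched as a Lambda handler that receives an EventBridge "CloudWatch Alarm State Change" event and decides which runbook to start. The alarm-to-runbook mapping is hypothetical (`AWS-RestartEC2Instance` is an AWS-managed Automation runbook), and the sketch only returns its decision; the actual Systems Manager call is shown as a comment.

```python
# Hypothetical mapping from alarm name to an SSM Automation runbook.
RUNBOOKS = {"web-instance-unhealthy": "AWS-RestartEC2Instance"}


def handler(event: dict, context=None) -> dict:
    """Decide what to do with a 'CloudWatch Alarm State Change' event."""
    if event.get("detail-type") != "CloudWatch Alarm State Change":
        return {"action": "ignore", "reason": "unexpected event type"}
    detail = event.get("detail", {})
    if detail.get("state", {}).get("value") != "ALARM":
        return {"action": "none", "reason": "alarm is not in ALARM state"}
    runbook = RUNBOOKS.get(detail.get("alarmName", ""))
    if runbook is None:
        # No automated fix is known: hand off to people via the on-call process.
        return {"action": "escalate", "reason": "no runbook mapped"}
    # A real function would start the runbook here, for example:
    #   boto3.client("ssm").start_automation_execution(
    #       DocumentName=runbook, Parameters={"InstanceId": [instance_id]})
    # where instance_id is read from the alarm's metric dimensions.
    return {"action": "start_automation", "document": runbook}


sample_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {"alarmName": "web-instance-unhealthy", "state": {"value": "ALARM"}},
}
print(handler(sample_event))
```

Keeping the decision logic separate from the AWS call makes it easy to test, and the escalate branch keeps unknown failures visible instead of silently retried.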
Observability Basics
CloudWatch metrics, logs, and alarms plus X-Ray traces and centralized log analysis give you visibility into health, performance, and usage.
Why It Matters
With observability, you can find idle resources, detect issues early, and measure optimization impact, directly supporting all pillars including sustainability.
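Alarms are how observability turns into action. CloudWatch alarms can require M breaching datapoints out of the last N evaluation periods before firing, which filters out one-off spikes. This simplified Python model ignores missing-data handling, and the query runtimes and the 300-second threshold are hypothetical.

```python
def should_alarm(datapoints: list[float], threshold: float,
                 evaluation_periods: int, datapoints_to_alarm: int) -> bool:
    """Fire if at least M of the last N datapoints exceed the threshold."""
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in window if value > threshold)
    return breaching >= datapoints_to_alarm


# Hypothetical per-query runtimes in seconds; 300 s is the acceptable limit.
spiky = [120, 150, 320, 310, 140]      # two slow runs out of five
degrading = [120, 320, 310, 140, 330]  # three slow runs out of five

print(should_alarm(spiky, 300, evaluation_periods=5, datapoints_to_alarm=3))      # False
print(should_alarm(degrading, 300, evaluation_periods=5, datapoints_to_alarm=3))  # True
```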
Thought Exercise: Balancing Sustainability, Cost, and Performance
Work through this scenario mentally and pick a design direction. There is no single perfect answer, but some options align better with sustainability and operational excellence.
Scenario:
You are designing an internal reporting system for a medium-sized company. Requirements:
- Analysts run heavy SQL queries on sales data a few times per day.
- Data volume: about 2 TB, growing slowly.
- Queries are not latency-critical; a few minutes is acceptable.
- The team wants to minimize ongoing operational effort.
You are comparing three approaches:
- Option 1: Always-on RDS for PostgreSQL on a large instance, with data loaded from operational databases every hour.
- Option 2: An Amazon Redshift cluster sized for peak load, running 24/7, with nightly batch loads.
- Option 3: An S3 data lake storing data as compressed Parquet, queried via Amazon Athena, with AWS Glue jobs to transform it.
Questions to reflect on:
- Which option best minimizes idle capacity while still meeting requirements?
- Which option reduces the need for manual operations (patching, scaling, backups)?
- How would you add observability to ensure queries remain within acceptable performance?
Pause for a moment and decide which option you would recommend and why. Then compare your reasoning with the worked scenario that follows.
Worked Scenario: Choosing a Sustainable Analytics Architecture
Reviewing Option 1: RDS
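RDS is familiar and automates backups and patching within maintenance windows, but the instance runs 24/7 even though heavy queries run only a few times per day. Scaling requires manually changing the instance class, so you pay for idle capacity and carry more operational work.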
Reviewing Option 2: Redshift
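Redshift is powerful for analytics, but a provisioned cluster sized for peak and running 24/7 sits largely idle between query runs. You still manage cluster sizing and maintenance, which invites over-provisioning. (Pausing the cluster on a schedule or using Redshift Serverless would narrow this gap, but neither is part of this option.)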
Reviewing Option 3: S3 + Athena + Glue
Serverless, pay-per-use querying and ETL on S3 data. No clusters to manage, and you can use compressed columnar formats to scan less data.
Why Option 3 Wins Here
Option 3 minimizes idle capacity, reduces operations overhead, and aligns with sustainability and cost goals while meeting relaxed latency requirements.
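The idle-capacity argument is easy to quantify. Assuming (hypothetically) four heavy query sessions per day of about ten minutes each, an always-on database or cluster does useful work only a small fraction of the day:

```python
def busy_fraction(runs_per_day: int, minutes_per_run: float) -> float:
    """Share of a 24-hour day during which always-on capacity does useful work."""
    return runs_per_day * minutes_per_run / (24 * 60)


# Hypothetical workload: 4 query sessions per day, ~10 minutes each.
frac = busy_fraction(4, 10)
print(f"busy {frac:.1%} of the day, idle {1 - frac:.1%}")  # busy 2.8% of the day, idle 97.2%
```

With pay-per-query services such as Athena, that idle share is not provisioned at all.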
Quick Check: Sustainability Pillar and Resource Utilization
Test your understanding of the Sustainability pillar and how it relates to resource utilization.
Which design choice most clearly aligns with the Sustainability pillar for a batch processing workload that runs once per hour?
- A) Provision a large EC2 instance that runs 24/7 and triggers a cron job every hour.
- B) Use AWS Lambda with an EventBridge rule to trigger processing, ensuring functions run only when needed.
- C) Run the workload on a fixed-size Amazon EMR cluster that you manually start once and keep running.
- D) Deploy the workload on an over-provisioned RDS instance to ensure there is always spare capacity.
Answer: B) Use AWS Lambda with an EventBridge rule to trigger processing, ensuring functions run only when needed.
The Sustainability pillar focuses on minimizing environmental impacts by maximizing utilization and minimizing the resources required. Using AWS Lambda with EventBridge means compute runs only when needed, with no idle capacity between runs. A 24/7 EC2 instance or an always-on EMR cluster sits idle between hourly runs, and an over-provisioned RDS instance is a database rather than a batch compute platform, so its spare capacity is pure waste.
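Option B can itself be expressed as infrastructure as code. The Python sketch below builds a CloudFormation fragment with an `AWS::Events::Rule` on a `rate(1 hour)` schedule plus the `AWS::Lambda::Permission` that lets EventBridge invoke the function. The function ARN and logical IDs are hypothetical, and the Lambda function itself is assumed to be defined elsewhere.

```python
import json


def hourly_trigger_template(function_arn: str) -> dict:
    """CloudFormation fragment: an hourly EventBridge rule that invokes a Lambda."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "HourlyBatchRule": {
                "Type": "AWS::Events::Rule",
                "Properties": {
                    "ScheduleExpression": "rate(1 hour)",
                    "State": "ENABLED",
                    "Targets": [{"Arn": function_arn, "Id": "BatchFunction"}],
                },
            },
            # Without this permission, EventBridge cannot invoke the function.
            "AllowEventBridgeInvoke": {
                "Type": "AWS::Lambda::Permission",
                "Properties": {
                    "Action": "lambda:InvokeFunction",
                    "FunctionName": function_arn,
                    "Principal": "events.amazonaws.com",
                    "SourceArn": {"Fn::GetAtt": ["HourlyBatchRule", "Arn"]},
                },
            },
        },
    }


ARN = "arn:aws:lambda:us-east-1:123456789012:function:hourly-batch"  # hypothetical
print(json.dumps(hourly_trigger_template(ARN), indent=2))
```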
Quick Check: Operational Excellence and Observability
Test how well you can recognize operational excellence patterns in AWS designs.
A team wants to improve operational excellence for their microservices on AWS. Which combination of actions best supports this goal?
- A) Use manual deployments through the console and rely on instance-level logs only.
- B) Adopt CloudFormation for infrastructure, set up CloudWatch metrics and alarms, and automate deployments via CodePipeline.
- C) Increase EC2 instance sizes to reduce the chance of CPU throttling and avoid the need for monitoring.
- D) Disable detailed monitoring to reduce CloudWatch costs and focus only on application logs.
Answer: B) Adopt CloudFormation for infrastructure, set up CloudWatch metrics and alarms, and automate deployments via CodePipeline.
Operational excellence emphasizes automation, observability, and repeatable processes. Using CloudFormation, CloudWatch metrics and alarms, and automated deployments via CodePipeline directly supports this. Manual deployments, avoiding monitoring, or just over-provisioning compute do not align with operational excellence.
Key Term Review: Sustainability and Operational Excellence
Use these flashcards to reinforce core definitions and patterns.
- AWS Well-Architected Framework
- The AWS Well-Architected Framework provides a consistent set of best practices for customers and partners to evaluate architectures, and a set of questions you can use to evaluate how well an architecture is aligned to AWS best practices.
- Sustainability pillar (definition)
- The sustainability pillar focuses on minimizing the environmental impacts of running cloud workloads by maximizing utilization and minimizing the resources required, and by reducing the energy required to deliver business value.
- Performance efficiency pillar (definition)
- The performance efficiency pillar focuses on the efficient use of computing resources to meet requirements and maintain that efficiency as demand changes and technologies evolve.
- Cost optimization pillar (definition)
- The cost optimization pillar includes the continual process of refinement and improvement of a system over its entire lifecycle to build and operate cost-aware systems that achieve business outcomes and minimize costs.
- Operational excellence pillar
- One of the six AWS Well-Architected pillars, focused on how you organize, run, and evolve workloads using clear processes, automation, observability, and continuous improvement; its practices support all other pillars.
- Infrastructure as Code (IaC)
- The practice of defining and provisioning infrastructure using machine-readable templates or code (for example, AWS CloudFormation, AWS CDK), enabling repeatable, automated deployments.
- Serverless sustainability advantage
- Services like AWS Lambda and Amazon Athena consume compute only while handling requests or queries, and AWS Fargate runs right-sized tasks that scale automatically, reducing idle capacity and improving utilization.
- S3 Lifecycle policy
- Configuration on an S3 bucket or prefix that automatically transitions objects between storage classes or deletes them after specified periods, supporting cost and sustainability goals.
- Observability
- The ability to understand the internal state of a system through telemetry such as logs, metrics, and traces, enabling detection of issues and optimization opportunities.
- Right-sizing
- Adjusting resource types and sizes (for example, EC2 instance types, database classes) to match actual workload needs, avoiding both over-provisioning and under-provisioning.
Common Exam Patterns and Traps
Over-Provisioning Trap
Watch for answers that use very large, always-on instances "to be safe". Prefer auto scaling or serverless when they meet requirements.
Data Hoarding Trap
Keeping all logs or backups forever in S3 Standard is rarely best. Look for lifecycle policies and explicit retention periods.
Manual Ops Trap
Manual checks and ad-hoc changes are red flags. Prefer designs with CloudWatch alarms, Systems Manager, and infrastructure as code.
Telemetry Trap
Disabling monitoring to save a little money usually backfires. Sufficient observability is needed to optimize cost and sustainability long term.
Key Terms
- Serverless
- A cloud execution model where the cloud provider automatically manages infrastructure provisioning, scaling, and maintenance, and you are billed based on actual usage rather than pre-allocated capacity.
- Right-sizing
- Adjusting resource configurations (such as EC2 instance types, database classes, and storage tiers) to closely match actual workload requirements, reducing both over-provisioning and under-provisioning.
- Observability
- The ability to understand the internal state of a system based on the data it produces, such as logs, metrics, and traces, enabling effective monitoring, troubleshooting, and optimization.
- S3 Lifecycle policy
- A configuration on an Amazon S3 bucket or prefix that automatically transitions objects between storage classes or expires (deletes) them after specified periods, supporting cost and sustainability goals.
- Infrastructure as Code (IaC)
- The practice of managing and provisioning infrastructure using machine-readable definition files or code, such as AWS CloudFormation or AWS CDK, enabling consistent, automated deployments.