Chapter 20 of 26

Applying the Cost Optimization Pillar of the AWS Well-Architected Framework

Cost optimization isn’t just about cutting spend; it’s about aligning cost with value over the workload lifecycle. This module ties concrete AWS savings tactics back to the cost optimization pillar of the AWS Well-Architected Framework.

Cost Optimization in the Well-Architected Framework

The Well-Architected Context

The AWS Well-Architected Framework provides a consistent set of best practices, plus a set of questions for evaluating how well an architecture aligns with them.

Cost Optimization Pillar: Canonical Definition

The cost optimization pillar is defined as: "The cost optimization pillar includes the continual process of refinement and improvement of a system over its entire lifecycle to build and operate cost-aware systems that achieve business outcomes and minimize costs."

Three Embedded Ideas

Key ideas: 1) Continual process over the workload lifecycle, 2) Cost-aware systems that surface cost to teams, 3) Focus on business outcomes, not just the lowest bill.

Link to Prior Modules

You already saw concrete tactics: EC2 purchase options, right-sizing, Auto Scaling, RDS and caching, and network design. Here we connect those to Well-Architected language and exam-style trade-offs.

Workload Lifecycle and Continuous Optimization

Why Lifecycle Matters

Cost optimization is not a one-time event. The pillar explicitly covers the entire lifecycle: from prototype to production and eventual retirement of a workload.

1. Experiment / Prototype

Early stage: prioritize speed and learning. Use On-Demand capacity, managed services, and simple architectures. Avoid heavy long-term commitments for workloads that might change or be discarded.

2. Pilot / Pre-Production

As patterns emerge, add tags, budgets, and monitoring. Use CloudWatch, Cost Explorer, and Compute Optimizer to start measuring utilization and identifying right-sizing options.

3. Production / Scale-Out

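Once usage stabilizes, introduce Savings Plans, Reserved Instances (RIs), refined Auto Scaling, and storage class tuning. Commitments carry less risk at this stage because observed usage backs the baseline, so this is typically where the largest savings are realized.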

4. Optimization and Retirement

Continuously remove waste: idle EBS volumes, unused Elastic IPs, idle load balancers, old snapshots. Exam scenarios often reward answers that automate this cleanup.
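As one way to picture automated cleanup, here is a minimal sketch that flags unattached EBS volumes. It assumes you already have a list of volume records shaped like the `Volumes` entries in an EC2 `describe_volumes` response (`VolumeId`, `State`, `CreateTime`); the function name, the 30-day grace period, and the sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_unattached_volumes(volumes, min_age_days, now):
    """Return IDs of volumes that are unattached and older than min_age_days."""
    cutoff = now - timedelta(days=min_age_days)
    return [
        v["VolumeId"]
        for v in volumes
        # "available" means the volume is not attached to any instance.
        if v["State"] == "available" and v["CreateTime"] < cutoff
    ]


# Hypothetical sample records for illustration only.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
sample_volumes = [
    {"VolumeId": "vol-aaa", "State": "in-use",
     "CreateTime": datetime(2023, 1, 1, tzinfo=timezone.utc)},
    {"VolumeId": "vol-bbb", "State": "available",
     "CreateTime": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"VolumeId": "vol-ccc", "State": "available",
     "CreateTime": datetime(2024, 5, 30, tzinfo=timezone.utc)},
]

cleanup_candidates = find_unattached_volumes(sample_volumes, 30, now)
print(cleanup_candidates)
```

The grace period keeps freshly detached volumes out of the list, so a cleanup job built on this idea would snapshot or delete only volumes that have clearly been abandoned.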

Monitoring, Visibility, and Cost-Aware Culture

Visibility Enables Optimization

You cannot optimize what you cannot see. Cost optimization assumes you have visibility into usage and ownership of resources across your AWS accounts.

Cost Allocation Tags

Use tags like `Environment`, `Owner`, `Application`. Activate them as cost allocation tags so Cost Explorer can group and report spend by these dimensions.
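Tag-based reporting only works if resources actually carry the tags. A minimal sketch of a compliance check follows; the inventory dictionary and resource IDs are hypothetical stand-ins for whatever tag data you export from your accounts.

```python
REQUIRED_TAG_KEYS = ("Environment", "Owner", "Application")


def missing_tags(resource_tags, required=REQUIRED_TAG_KEYS):
    """Return the required tag keys that are absent or blank on a resource."""
    present = {key for key, value in resource_tags.items() if value.strip()}
    return [key for key in required if key not in present]


# Hypothetical inventory: resource ID -> tags.
inventory = {
    "i-web-1": {"Environment": "prod", "Owner": "team-a", "Application": "shop"},
    "i-batch-7": {"Environment": "dev", "Owner": ""},
    "vol-logs": {},
}

# Keep only resources that are missing at least one required tag.
untagged_report = {}
for resource_id, tags in inventory.items():
    gaps = missing_tags(tags)
    if gaps:
        untagged_report[resource_id] = gaps

print(untagged_report)
```

Spend on resources that show up in this report cannot be grouped by the missing dimension in Cost Explorer, which is why tag hygiene is usually checked before building dashboards.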

Budgets vs Cost Explorer

AWS Budgets: define thresholds and send alerts when cost/usage exceeds them. AWS Cost Explorer: explore and visualize historical cost and usage trends.

Anomaly Detection and Optimizer

Cost Anomaly Detection finds unusual spend spikes. AWS Compute Optimizer uses metrics to recommend right-sizing EC2, EBS, Lambda, and some other resources.

Cost-Aware Culture

Tagging, dashboards, and alerts create a cost-aware culture, where teams see the impact of their designs and can iterate toward better cost-performance trade-offs.

Right-Sizing and Auto Scaling: A Concrete Scenario

Scenario Setup

You run a web app on 4 `m5.large` instances behind an ALB. Auto Scaling desired capacity is fixed at 4. CPU is ~12% by day and ~3% at night. Costs are higher than expected.

Step 1: Measure

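Use CloudWatch to confirm the low utilization. Then check AWS Compute Optimizer; in this scenario, assume it recommends a smaller instance type such as `t3.medium`. The gap between daytime and overnight CPU also points to scaling in at night, which is an Auto Scaling policy decision rather than an instance-type recommendation.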
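The measurement step can be sketched as a small summary over CPU datapoints. The records below mimic the `Average` and `Maximum` fields of CloudWatch metric statistics; the 20% threshold and the sample values are assumptions chosen for illustration, not an AWS-defined rule.

```python
from statistics import mean


def summarize_cpu(datapoints, low_threshold=20.0):
    """Summarize CPU datapoints and flag sustained low utilization."""
    average = mean(p["Average"] for p in datapoints)
    peak = max(p["Maximum"] for p in datapoints)
    # Flag only when even the peak stays under the threshold, so short
    # bursts are not hidden behind a low average.
    return {"average": average, "peak": peak, "underutilized": peak < low_threshold}


# Hypothetical datapoints: roughly 12% CPU by day, 3% at night.
cpu_datapoints = [
    {"Average": 12.0, "Maximum": 18.0},
    {"Average": 11.0, "Maximum": 16.0},
    {"Average": 13.0, "Maximum": 17.0},
    {"Average": 3.0, "Maximum": 5.0},
    {"Average": 3.0, "Maximum": 4.0},
    {"Average": 3.0, "Maximum": 6.0},
]

cpu_summary = summarize_cpu(cpu_datapoints)
print(cpu_summary)
```

Checking the peak as well as the average matters because a right-sizing decision based on averages alone can hurt performance during bursts.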

Step 2: Apply Pillars

Cost optimization: match capacity to demand. Performance efficiency: ensure new instance type still meets needs. Reliability: keep at least 2 instances across AZs.

Step 3: Implement

Update the launch template to `t3.medium`. Configure an Auto Scaling group with min=2, max=6, and dynamic or scheduled scaling policies to reduce capacity overnight.
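To get a feel for the size of the change, here is a rough cost comparison. The hourly prices are illustrative placeholders (check current pricing for your Region), and the sketch assumes the group runs 4 instances for 12 hours and the minimum of 2 for the other 12; real dynamic scaling would vary.

```python
HOURS_PER_MONTH = 730  # common approximation for monthly estimates


def monthly_cost(hourly_price, schedule):
    """Estimate monthly cost from (instance_count, hours_per_day) pairs."""
    if sum(hours for _, hours in schedule) != 24:
        raise ValueError("schedule must cover exactly 24 hours")
    average_instances = sum(count * hours for count, hours in schedule) / 24
    return hourly_price * average_instances * HOURS_PER_MONTH


# Illustrative On-Demand prices in USD per hour (assumptions, not quotes).
M5_LARGE_PRICE = 0.096
T3_MEDIUM_PRICE = 0.0416

cost_before = monthly_cost(M5_LARGE_PRICE, [(4, 24)])
cost_after = monthly_cost(T3_MEDIUM_PRICE, [(4, 12), (2, 12)])
savings_percent = (cost_before - cost_after) / cost_before * 100

print(f"before: ${cost_before:.2f}, after: ${cost_after:.2f}, "
      f"savings: {savings_percent:.1f}%")
```

Under these assumptions most of the saving comes from the smaller instance type, with scheduled scale-in adding to it. Neither requires a long-term commitment.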

Exam Pattern

When you see "low utilization and fixed capacity," think right-sizing plus Auto Scaling, not just buying long-term commitments for over-sized instances.

Design Trade-Offs: Cost vs Performance, Reliability, and Security

Pillars in Tension

Cost optimization must be balanced with performance, reliability, and security. The goal is not "cheapest" but "best value" for stated requirements.

Performance Efficiency and Reliability

Performance efficiency focuses on using computing resources efficiently as demand and tech evolve. Reliability is about the workload performing correctly and consistently when expected.

Trade-Off: Single-AZ vs Multi-AZ

Single-AZ RDS is cheaper. But when high availability and low RTO are requirements, Multi-AZ is the correct choice even at higher cost.

Trade-Off: Caching and Storage Classes

You might add ElastiCache to reduce database load (and possibly allow a smaller DB instance), or move rarely accessed S3 data to cheaper storage classes such as the S3 Glacier classes. Each change trades cost against latency and complexity.
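A storage class transition is typically expressed as an S3 lifecycle rule. The sketch below only builds the rule as a dictionary in the shape accepted by the S3 lifecycle configuration API (for example, the `LifecycleConfiguration` argument of boto3's `put_bucket_lifecycle_configuration`); the function name, rule ID, prefix, and day counts are hypothetical.

```python
def build_lifecycle_config(rule_id, prefix, ia_after_days, glacier_after_days):
    """Build a lifecycle configuration that tiers objects down over time."""
    # S3 requires objects to be at least 30 days old before moving to
    # STANDARD_IA, and the archive step must come later than that.
    if not 30 <= ia_after_days < glacier_after_days:
        raise ValueError("need 30 <= ia_after_days < glacier_after_days")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


lifecycle_config = build_lifecycle_config("tier-old-reports", "reports/", 30, 180)
print(lifecycle_config)
```

The trade-off is visible in the parameters: shorter day counts save more on storage but increase the chance that a user waits on, or pays a retrieval fee for, archived data.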

Security is Non-Negotiable

Under the shared responsibility model, you must configure security in the cloud correctly, even if it costs more. Do not remove encryption or logging just to save money in exam scenarios.

Sustainability and Cost: Aligning Environmental and Financial Efficiency

Sustainability Pillar: Definition

The sustainability pillar "focuses on minimizing the environmental impacts of running cloud workloads by maximizing utilization and minimizing the resources required, and by reducing the energy required to deliver business value."

Overlap with Cost Optimization

High utilization and minimal resource waste usually mean lower cost and lower environmental impact. Underutilized instances waste both money and energy.

Examples that Help Both

Serverless (Lambda, Fargate), right-sizing, Auto Scaling, and S3 lifecycle policies reduce idle capacity and unnecessary storage, saving cost and energy.

Lifecycle Perspective

Refactoring to more efficient architectures may cost effort now but often pays off over the workload lifecycle in both sustainability and cost terms.

Exam Signal

Options that remove idle resources, reduce data transfer, or move to efficient managed services usually support both cost optimization and sustainability.

Thought Exercise: Picking the Right Cost Strategy by Stage

Work through these short scenarios mentally. The goal is to map lifecycle stage to the most appropriate cost strategy, not just the cheapest.

Scenario A: New Analytics Prototype

  • A data science team is experimenting with a new recommendation model. They are unsure if it will go to production. They need to run irregular, compute-heavy jobs on EC2 and use a temporary RDS database.

Questions to consider:

  1. Would you recommend 3-year Reserved Instances, 1-year Savings Plans, or On-Demand for the EC2 jobs? Why?
  2. How aggressively would you right-size RDS now?

Pause and answer before reading the guidance.

Guidance:

  1. On-Demand is usually best for highly uncertain, short-lived prototypes. Committing to 3-year RIs conflicts with the lifecycle principle.
  2. Basic right-sizing is fine, but do not over-invest in optimization. Simplicity and speed of change matter more at this stage.
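The reasoning behind the first point can be checked with simple break-even arithmetic. This sketch uses a simplified model in which a commitment is billed for every hour at a discounted rate while On-Demand is billed only for hours actually used; the 40% discount and the utilization figures are hypothetical.

```python
def breakeven_utilization(discount):
    """Fraction of hours a resource must run before a commitment pays off.

    On-Demand cost is utilization * price; commitment cost is
    (1 - discount) * price for every hour, used or not.
    """
    return 1.0 - discount


def cheaper_option(expected_utilization, discount):
    """Pick the cheaper purchase option under the simplified model."""
    if expected_utilization > breakeven_utilization(discount):
        return "commitment"
    return "on-demand"


# Hypothetical figures: irregular prototype jobs vs. a steady service.
prototype_choice = cheaper_option(expected_utilization=0.15, discount=0.40)
steady_choice = cheaper_option(expected_utilization=0.90, discount=0.40)

print(prototype_choice, steady_choice)
```

With a 40% discount the commitment only wins above 60% utilization, which irregular prototype jobs are unlikely to reach, and that is before accounting for the risk that the prototype is discarded.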

Scenario B: Stable Production API

  • A customer-facing API has run for 18 months. Traffic patterns are very predictable: weekday peaks, quiet weekends. It uses EC2 Auto Scaling and RDS Multi-AZ.

Questions to consider:

  1. What cost mechanisms make sense now (RIs, Savings Plans, Spot Instances, right-sizing)?
  2. How might you use monitoring data to refine the design?

Guidance:

  1. Now that usage is stable, long-term Savings Plans or RIs for baseline capacity make sense. You can also consider Spot Instances for non-critical background jobs.
  2. Use CloudWatch and Cost Explorer to fine-tune instance sizes, Auto Scaling thresholds, and storage classes, aligning capacity closely with observed demand.
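"Commit for the baseline" can be made concrete with a small search over commitment levels. This is a simplified sketch: it models the commitment as a number of always-billed instances (real Savings Plans are expressed as a dollar-per-hour commitment), normalizes the On-Demand rate to 1.0, and uses a hypothetical weekly demand profile and 30% discount.

```python
def total_cost(demand, committed, on_demand_rate, discount):
    """Cost of covering hourly demand with a fixed commitment plus On-Demand."""
    committed_rate = on_demand_rate * (1 - discount)
    cost = 0.0
    for needed in demand:
        cost += committed * committed_rate                   # billed even if idle
        cost += max(0, needed - committed) * on_demand_rate  # overflow capacity
    return cost


def best_commitment(demand, on_demand_rate, discount):
    """Return the commitment level with the lowest total cost."""
    levels = range(0, max(demand) + 1)
    return min(levels, key=lambda c: total_cost(demand, c, on_demand_rate, discount))


# Hypothetical week: weekday peaks (8 instances for 10 hours, 3 otherwise),
# quiet weekends (2 instances all day).
weekly_demand = ([8] * 10 + [3] * 14) * 5 + [2] * 48

baseline = best_commitment(weekly_demand, on_demand_rate=1.0, discount=0.30)
print(baseline, round(total_cost(weekly_demand, baseline, 1.0, 0.30), 1))
```

In this profile the cheapest level sits near the off-peak weekday demand, not the peak: committing to peak capacity would pay for idle hours every night and weekend.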

Quick Check: Lifecycle and Commitments

Test your understanding of how lifecycle stage affects cost decisions.

A startup is launching a brand-new mobile app on AWS. They expect traffic to change rapidly as they iterate on features and do not yet know typical usage patterns. Which cost strategy best aligns with the cost optimization pillar definition?

  1. Purchase 3-year All Upfront Reserved Instances for all EC2 capacity to minimize hourly rates.
  2. Use On-Demand instances initially, monitor usage with CloudWatch and Cost Explorer, and consider Savings Plans once patterns stabilize.
  3. Use only Spot Instances for all workloads, including the production API, to minimize total spend.
  4. Immediately migrate all compute to AWS Lambda and Fargate, regardless of architecture fit, because serverless is always cheaper.

Answer: Option 2. Use On-Demand instances initially, monitor usage with CloudWatch and Cost Explorer, and consider Savings Plans once patterns stabilize.

The cost optimization pillar emphasizes a continual process over the workload lifecycle. For a brand-new, uncertain workload, it is better to use On-Demand, gather data, and then commit (for example, with Savings Plans) once usage stabilizes. Long-term RIs on day one (option 1) are risky, Spot for all production workloads (option 3) can violate reliability requirements, and serverless is not always the right fit (option 4).

Quick Check: Tools for Cost Awareness

Identify the right AWS tool for a given cost optimization need.

Your finance team wants to receive an email if monthly AWS spending for the "Marketing" tagged resources exceeds $10,000. Which AWS feature is the BEST fit?

  1. Create a report in AWS Cost Explorer and download it monthly.
  2. Configure an AWS Budget with a cost filter on the "Marketing" cost allocation tag and set an email alert at $10,000.
  3. Use AWS Cost Anomaly Detection to create a detector for all services.
  4. Enable AWS Compute Optimizer for the account and review right-sizing recommendations.

Answer: Option 2. Configure an AWS Budget with a cost filter on the "Marketing" cost allocation tag and set an email alert at $10,000.

AWS Budgets is designed for setting cost or usage thresholds and sending alerts when they are exceeded. You can filter by cost allocation tag, like "Marketing". Cost Explorer is for analysis, not alerts (option 1). Cost Anomaly Detection looks for unusual patterns, not fixed thresholds (option 3). Compute Optimizer is for right-sizing recommendations, not budget alerts (option 4).
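For readers who want to see what such a budget might look like in code, here is a sketch that only builds the request as a dictionary. The layout follows the AWS Budgets `CreateBudget` request as exposed by boto3's `create_budget`; treat the exact field names and the `user:Key$Value` tag filter format as something to verify against the current documentation. The tag key `Team`, the account ID, the budget name, and the email address are all hypothetical.

```python
def build_budget_request(account_id, name, limit_usd, tag_key, tag_value, email):
    """Build keyword arguments for a monthly cost budget with one email alert."""
    return {
        "AccountId": account_id,
        "Budget": {
            "BudgetName": name,
            "BudgetLimit": {"Amount": str(limit_usd), "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
            # Restrict the budget to resources carrying the given tag.
            "CostFilters": {"TagKeyValue": [f"user:{tag_key}${tag_value}"]},
        },
        "NotificationsWithSubscribers": [
            {
                "Notification": {
                    "NotificationType": "ACTUAL",
                    "ComparisonOperator": "GREATER_THAN",
                    "Threshold": 100.0,  # percent of the budget limit
                    "ThresholdType": "PERCENTAGE",
                },
                "Subscribers": [{"SubscriptionType": "EMAIL", "Address": email}],
            }
        ],
    }


budget_request = build_budget_request(
    account_id="111122223333",
    name="marketing-monthly",
    limit_usd=10000,
    tag_key="Team",
    tag_value="Marketing",
    email="finance@example.com",
)
print(budget_request["Budget"]["CostFilters"])
```

Assuming boto3 is installed and credentials are configured, the dictionary could be passed as `boto3.client("budgets").create_budget(**budget_request)`. The tag must already be activated as a cost allocation tag for the filter to match any spend.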

Key Term Review: Cost and Related Pillars

Use these flashcards to reinforce the canonical pillar definitions and a few core concepts.

AWS Well-Architected Framework
The AWS Well-Architected Framework provides a consistent set of best practices for customers and partners to evaluate architectures, and a set of questions you can use to evaluate how well an architecture is aligned to AWS best practices.
Cost optimization pillar (definition)
The cost optimization pillar includes the continual process of refinement and improvement of a system over its entire lifecycle to build and operate cost-aware systems that achieve business outcomes and minimize costs.
Performance efficiency pillar (definition)
The performance efficiency pillar focuses on the efficient use of computing resources to meet requirements and maintain that efficiency as demand changes and technologies evolve.
Reliability pillar (definition)
The reliability pillar encompasses the ability of a workload to perform its intended function correctly and consistently when it’s expected to. This includes the ability to operate and test the workload through its total lifecycle.
Sustainability pillar (definition)
The sustainability pillar focuses on minimizing the environmental impacts of running cloud workloads by maximizing utilization and minimizing the resources required, and by reducing the energy required to deliver business value.
Shared responsibility model (definition)
The AWS shared responsibility model describes how AWS is responsible for security of the cloud, while customers are responsible for security in the cloud, including the configuration of their services and data.
AWS Budgets vs AWS Cost Explorer
AWS Budgets: define cost/usage thresholds and send alerts when exceeded. AWS Cost Explorer: analyze and visualize historical cost and usage; no threshold-based alerts.
AWS Compute Optimizer
A service that analyzes usage metrics (for example, from CloudWatch) to recommend right-sizing for EC2, EBS, Lambda, and some other resources, helping to reduce cost while maintaining performance.
Cost allocation tags
Tags that you activate in the billing console so that AWS can use them to organize and report cost data by tag key and value (for example, Environment, Owner, Application).

Mini Case Study: Balancing Cost, Reliability, and Sustainability

Consider this mini case and choose an approach you would defend in an exam scenario.

Case:

A company runs an internal reporting application. Requirements:

  • Must be available during business hours in one region; occasional short outages are acceptable.
  • Reports are generated in batch overnight and interactively during the day.
  • Data is stored in Amazon S3 and queried using Amazon Athena.
  • There is a small API layer currently running on always-on EC2 instances in one AZ.
  • Management wants to reduce cost and environmental impact without hurting the user experience.

Think through these questions:

  1. Would you move the API layer to AWS Lambda behind Amazon API Gateway, or just right-size the EC2 instances and add Auto Scaling?
  2. How could you further reduce cost and environmental impact in S3 and Athena?

Reflect, then compare to the guidance below.

Possible answer path:

  1. Moving the low-traffic API to Lambda + API Gateway can eliminate idle EC2 time, aligning with both cost optimization and sustainability (scale to zero when idle). If latency and cold starts are acceptable for this internal tool, this is a strong choice.
  2. For S3, enable lifecycle rules (for example, to Intelligent-Tiering or infrequent access classes) for older reports, and compress data to reduce scanned bytes in Athena. Partition data by date or department so Athena queries scan less data, lowering cost and energy used per query.

On the exam, the best answer would clearly state how the chosen design meets availability, cost, and sustainability requirements together, not just one of them.
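The Athena part of the answer can be backed with a rough estimate, since Athena charges by the amount of data scanned. The sketch below assumes a price of $5 per TB scanned (check current pricing for your Region), evenly sized daily partitions, and a hypothetical compression ratio; it ignores per-query minimums.

```python
PRICE_PER_TB_USD = 5.0  # assumed price per TB scanned


def query_cost(dataset_tb, partitions_total, partitions_read, compression_ratio):
    """Estimate the cost of one query.

    compression_ratio is compressed size divided by raw size (1.0 = none).
    Assumes partitions are the same size and pruning skips the rest.
    """
    scanned_fraction = partitions_read / partitions_total
    scanned_tb = dataset_tb * scanned_fraction * compression_ratio
    return scanned_tb * PRICE_PER_TB_USD


# Hypothetical dataset: 10 TB raw, 100 daily partitions.
full_scan_cost = query_cost(10.0, 100, 100, 1.0)   # no partition pruning, no compression
optimized_cost = query_cost(10.0, 100, 5, 0.25)    # 5 days read, 4:1 compression

print(f"full scan: ${full_scan_cost:.2f}, optimized: ${optimized_cost:.3f}")
```

Less data scanned also means less compute per query, which is the link between the cost and sustainability arguments in this case.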

Key Terms

AWS Budgets
Service to set custom cost and usage budgets and receive alerts when thresholds are exceeded.
Auto Scaling
AWS capability that automatically adjusts compute capacity (for example, EC2 instances) based on demand according to defined policies.
Right-sizing
Adjusting resource types and sizes to better match actual usage, reducing waste while maintaining performance.
AWS Cost Explorer
Tool to visualize and analyze historical AWS cost and usage data.
Reliability pillar
Encompasses the ability of a workload to perform its intended function correctly and consistently when it’s expected to, including throughout its lifecycle.
Cost allocation tags
Tags activated in billing so that AWS can use them to categorize and report costs by tag key and value.
AWS Compute Optimizer
Service that recommends optimal AWS resources for workloads to reduce costs and improve performance based on usage metrics.
Sustainability pillar
Focuses on minimizing the environmental impacts of running cloud workloads by maximizing utilization, minimizing resources, and reducing energy required for business value.
Cost optimization pillar
Includes the continual process of refinement and improvement of a system over its entire lifecycle to build and operate cost-aware systems that achieve business outcomes and minimize costs.
Shared responsibility model
Describes how AWS is responsible for security of the cloud, while customers are responsible for security in the cloud, including service and data configuration.
Performance efficiency pillar
Focuses on the efficient use of computing resources to meet requirements and maintain that efficiency as demand changes and technologies evolve.
AWS Well-Architected Framework
Provides a consistent set of best practices and questions to evaluate how well an architecture aligns to AWS best practices.
