Chapter 22 of 26

End-to-End Design Workshop: Data and Analytics Workload on AWS

Not all workloads are user-facing web apps. This module walks through the design of a secure, high-performing, and cost-optimized data and analytics pipeline, reinforcing ingestion, storage, and processing patterns.

Scenario and End-to-End Big Picture

Scenario Overview

You will design an end-to-end data and analytics workload on AWS for a retail company needing batch and near-real-time analytics on sales, clickstream, and inventory data.

Data Lake Style

We will use a data lake-style architecture with Amazon S3 as the core storage layer, then add ingestion, processing, and analytics services on top.

Well-Architected Lens

We will constantly align decisions with the Security, Performance efficiency, and Cost optimization pillars of the AWS Well-Architected Framework.

Workshop Roadmap

You will sketch the architecture, design S3 layers, choose ingestion patterns, select processing and query services, apply encryption and KMS, and tune cost vs performance.

Designing the S3 Data Lake Core

Why S3 for Analytics

Amazon S3 is designed for 99.999999999% (11 nines) durability, scales to virtually any data volume, and keeps storage separate from compute, making it the standard backbone for analytics data lakes.

Data Zones Concept

Design logical zones: raw (landing), cleansed (curated), and analytics (presentation), each with a clear purpose and increasing data quality.

Bucket and Prefix Design

Use one or a few buckets with prefixes like `raw/`, `cleansed/`, and `analytics/`, or split by environment such as `retail-dev-data` and `retail-prod-data`.
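
As a minimal sketch of the prefix layout described above, the helper below builds object keys from a zone, a dataset name, and a date. The function name `object_key`, the dataset name `sales`, and the file name are hypothetical; only the zone names come from the text. The date portion uses Hive-style `year=/month=/day=` folders, which is one common convention rather than a requirement.

```python
from datetime import date

# Zone names taken from the text; everything else here is illustrative.
ZONES = ("raw", "cleansed", "analytics")


def object_key(zone: str, dataset: str, day: date, filename: str) -> str:
    """Build an S3 object key under a zone prefix with date-based folders."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return (
        f"{zone}/{dataset}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )


print(object_key("raw", "sales", date(2026, 5, 28), "pos-store-17.csv"))
# prints raw/sales/year=2026/month=05/day=28/pos-store-17.csv
```

Keeping the zone as the first path segment lets bucket policies and lifecycle rules target a whole zone with a single prefix.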

Securing S3

Secure S3 using bucket policies, IAM, S3 Block Public Access, and S3 Object Ownership set to Bucket owner enforced, which disables ACLs.
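
The two bucket-level settings named above can be expressed as plain configuration dictionaries. The sketch below only builds the arguments; it assumes they would later be passed to the boto3 S3 client operations `put_public_access_block` and `put_bucket_ownership_controls`. The helper name `s3_hardening_calls` and the bucket name are illustrative.

```python
import json

# S3 Block Public Access with all four switches turned on.
PUBLIC_ACCESS_BLOCK = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}

# Object Ownership set to "Bucket owner enforced", which disables ACLs.
OWNERSHIP_CONTROLS = {"Rules": [{"ObjectOwnership": "BucketOwnerEnforced"}]}


def s3_hardening_calls(bucket):
    """Return (operation name, keyword arguments) pairs for an S3 client."""
    return [
        ("put_public_access_block",
         {"Bucket": bucket,
          "PublicAccessBlockConfiguration": PUBLIC_ACCESS_BLOCK}),
        ("put_bucket_ownership_controls",
         {"Bucket": bucket, "OwnershipControls": OWNERSHIP_CONTROLS}),
    ]


for operation, kwargs in s3_hardening_calls("retail-prod-data"):
    print(operation, json.dumps(kwargs))
```

With boto3 installed and credentials configured, each pair could be applied with `getattr(boto3.client("s3"), operation)(**kwargs)`.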

Secure Ingestion Patterns: Batch and Streaming

Batch Ingestion

For batch data like nightly sales files, use AWS DataSync, AWS Transfer Family, or pre-signed S3 URLs, always secured with TLS and strict bucket policies.

Streaming Ingestion

For continuous data like clickstream, use Kinesis Data Streams or MSK, and deliver to S3 via Kinesis Data Firehose (renamed Amazon Data Firehose in 2024) or MSK Connect.

Retail Example Flow

In our scenario, web apps send click events to Kinesis Data Streams; Firehose transforms and writes them into the S3 raw/clickstream prefix.
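
The transform step in this flow is typically a Lambda function that Firehose invokes with a batch of base64-encoded records. The sketch below follows that record contract (`recordId`, `result`, `data`); the click event fields `user_id`, `page`, and `ts` are assumptions made up for illustration, not fields defined by this scenario.

```python
import base64
import json


def lambda_handler(event, context):
    """Normalize click events; mark unparseable records as failed."""
    output = []
    for record in event["records"]:
        try:
            click = json.loads(base64.b64decode(record["data"]))
            # Hypothetical event fields, kept to a minimal cleaned shape.
            cleaned = {
                "user_id": click["user_id"],
                "page": click["page"].lower(),
                "ts": click["ts"],
            }
            # Newline-delimited JSON keeps objects in S3 easy to query.
            payload = (json.dumps(cleaned) + "\n").encode("utf-8")
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(payload).decode("utf-8"),
            })
        except (KeyError, TypeError, ValueError, AttributeError):
            # Return the original payload so Firehose can route it
            # to its error output for later inspection.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Every incoming `recordId` must appear exactly once in the response, which is why the failure branch still emits a record instead of skipping it.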

Security Controls

Protect ingestion with VPC endpoints, least-privilege IAM roles for services, and enforced HTTPS. Keep S3 buckets private; in exam scenarios, an option that requires public S3 access is rarely the right answer.

Processing and Transformation Choices

ETL Service Options

Use AWS Glue for serverless ETL, EMR for highly customizable big data, DataBrew for visual prep, and Lambda for lightweight event-driven transforms.

Batch ETL Pattern

Raw files in S3 trigger a workflow that starts a Glue job, which reads raw data, cleans it, and writes Parquet into the cleansed zone with partitions.
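
To make the cleaning step concrete, here is a plain-Python stand-in for the logic such a job might apply. It is not a Glue job: a real one would usually run on Spark and write Parquet, while this sketch only shows deduplication, normalization, and grouping by partition on a tiny in-memory CSV. The column names and sample rows are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw sales extract with a duplicate row, messy values,
# and a row that is missing its amount.
RAW_CSV = """order_id,store,amount,order_date
1001,S01,19.99,2026-05-28
1001,S01,19.99,2026-05-28
1002,s02, 5.00 ,2026-05-28
1003,S01,,2026-05-29
"""


def clean_and_partition(raw_text):
    """Drop duplicates and incomplete rows, then group rows by partition."""
    seen = set()
    partitions = defaultdict(list)
    for row in csv.DictReader(io.StringIO(raw_text)):
        amount = row["amount"].strip()
        if not amount or row["order_id"] in seen:
            continue  # skip rows with no amount and repeated order ids
        seen.add(row["order_id"])
        partitions[f"order_date={row['order_date']}"].append({
            "order_id": int(row["order_id"]),
            "store": row["store"].strip().upper(),
            "amount": float(amount),
        })
    return dict(partitions)


for partition, rows in clean_and_partition(RAW_CSV).items():
    print(partition, rows)
```

Each key in the result corresponds to one partition folder under the cleansed zone.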

Near-Real-Time Processing

For streaming transformations, use Kinesis Data Analytics (now Amazon Managed Service for Apache Flink) or Lambda consumers on Kinesis to enrich events and store them in S3 or other data stores.

Common Exam Trap

Avoid picking Lambda for massive terabyte-scale ETL: its 15-minute execution limit rules out long-running transformations, so Glue or EMR is more appropriate.

Query and Analytics Layers: Athena, Redshift, EMR

Query Service Options

Use Athena for serverless SQL on S3, Redshift for a managed data warehouse, and EMR for open-source engines and highly customized analytics.

Retail Scenario Design

Glue Data Catalog stores metadata; Athena queries the S3 analytics zone for ad-hoc and BI; Redshift may host curated marts for fast, complex dashboards.

Performance and Cost Tuning

Optimize Athena with Parquet, compression, and partitioning; choose Redshift RA3 nodes to scale compute and managed storage independently, and use Redshift Spectrum to query S3 data in place.
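
Because Athena bills by data scanned, the effect of these optimizations can be estimated with simple arithmetic. The sketch below assumes a price of 5 USD per TB scanned (check current pricing for your Region) and uses made-up ratios for compression, column selection, and partition pruning.

```python
PRICE_PER_TB_USD = 5.00  # assumption; verify against current Athena pricing


def query_cost_usd(tb_scanned):
    """Estimate the cost of one query from the data it scans."""
    return tb_scanned * PRICE_PER_TB_USD


# Hypothetical workload numbers, chosen only to illustrate the effect.
csv_table_tb = 10.0          # full CSV table, scanned in its entirety
parquet_size_ratio = 0.25    # compressed Parquet at a quarter of CSV size
columns_fraction = 0.2       # query reads 2 of 10 similar-sized columns
partition_fraction = 1 / 30  # query touches 1 of 30 daily partitions

csv_cost = query_cost_usd(csv_table_tb)
optimized_tb = (csv_table_tb * parquet_size_ratio
                * columns_fraction * partition_fraction)
optimized_cost = query_cost_usd(optimized_tb)

print(f"CSV full scan:      ${csv_cost:.2f}")
print(f"Parquet + pruning:  ${optimized_cost:.2f}")
```

The three factors multiply, which is why combining a columnar format with partitioning tends to matter more than either change alone.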

Exam Selection Clues

Look for phrases like 'no infrastructure, query S3 directly' for Athena, and 'enterprise data warehouse, many joins, high concurrency' for Redshift.

Protecting Data with AWS KMS and Encryption In Transit

Shared Responsibility Reminder

Under the AWS shared responsibility model, AWS secures the cloud; you configure encryption settings and protect your data in the cloud.

Encryption at Rest with KMS

Use S3 SSE-KMS with AWS managed or customer managed keys, and enable KMS encryption for Kinesis, Firehose, Glue, and Redshift resources.

Encryption in Transit

Require TLS (HTTPS) for all access, enforce it in S3 bucket policies by denying requests where aws:SecureTransport is false, and use VPC endpoints so traffic to AWS services stays off the public internet.
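
One common way to enforce TLS on a bucket is a policy statement that denies any request made without it. The sketch below builds that policy as a dictionary; the function name `tls_only_policy` and the bucket name are illustrative.

```python
import json


def tls_only_policy(bucket):
    """Build a bucket policy that denies all non-HTTPS requests."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            # Cover both bucket-level and object-level operations.
            "Resource": [bucket_arn, f"{bucket_arn}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }],
    }


print(json.dumps(tls_only_policy("retail-prod-data"), indent=2))
```

An explicit Deny overrides any Allow elsewhere, so this statement applies even to principals that otherwise have full access to the bucket.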

Choosing Key Types

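Pick customer managed KMS keys when you need control over the key policy and rotation, the ability to disable a key to revoke access, or cross-account sharing. AWS managed keys cannot be shared across accounts, and their key policies cannot be edited.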
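
The fine-grained control of a customer managed key lives in its key policy. The sketch below builds one with two statements: the usual statement that lets IAM policies in the account govern the key, and a statement granting specific roles the actions needed for SSE-KMS reads and writes. The account ID and role names are placeholders, not values from this scenario.

```python
import json

ACCOUNT_ID = "111122223333"  # placeholder account ID


def key_policy(account_id, user_role_arns):
    """Build a key policy that limits key usage to the given roles."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "EnableIamPolicies",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
                "Action": "kms:*",
                "Resource": "*",
            },
            {
                "Sid": "AllowUseOfKey",
                "Effect": "Allow",
                "Principal": {"AWS": sorted(user_role_arns)},
                # Decrypt covers reads; GenerateDataKey covers writes.
                "Action": ["kms:Decrypt", "kms:GenerateDataKey",
                           "kms:DescribeKey"],
                "Resource": "*",
            },
        ],
    }


prod_roles = [
    f"arn:aws:iam::{ACCOUNT_ID}:role/prod-glue-etl",      # hypothetical
    f"arn:aws:iam::{ACCOUNT_ID}:role/prod-athena-query",  # hypothetical
]
print(json.dumps(key_policy(ACCOUNT_ID, prod_roles), indent=2))
```

Creating one such key per environment, each listing only that environment's roles, is one way to keep dev principals from decrypting prod data.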

Cost and Performance Trade-offs: Storage Classes, Partitioning, Lifecycle

Cost Optimization Lens

The cost optimization pillar emphasizes refining systems over time to stay cost-aware while still meeting business outcomes.

Using Storage Classes

Use S3 Standard for hot data, Standard-IA for warm data, and Glacier classes for cold, rarely accessed historical data.

Lifecycle and Retention

Apply lifecycle policies so data ages from Standard to IA to Glacier, and optionally expires when it is no longer needed.
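
A lifecycle rule of this kind can be written as a dictionary in the shape S3 lifecycle configurations use. The day thresholds below are arbitrary examples, and the validation reflects S3's 30-day minimums for Standard-IA as I understand them; treat both as assumptions to check against your retention requirements.

```python
import json


def lifecycle_rule(prefix, to_ia_days, to_glacier_days, expire_days):
    """Build one rule: Standard -> Standard-IA -> Glacier -> expire."""
    # Objects must be at least 30 days old before moving to Standard-IA
    # and should stay there at least 30 days before moving on.
    if to_ia_days < 30 or to_glacier_days < to_ia_days + 30:
        raise ValueError("transitions must respect the 30-day minimums")
    if expire_days <= to_glacier_days:
        raise ValueError("expiration must come after the last transition")
    return {
        "ID": f"age-out-{prefix.strip('/').replace('/', '-')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": to_ia_days, "StorageClass": "STANDARD_IA"},
            {"Days": to_glacier_days, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": expire_days},
    }


# Hypothetical thresholds: IA after 90 days, Glacier after 1 year,
# deletion after 5 years.
config = {"Rules": [lifecycle_rule("raw/sales/", 90, 365, 1825)]}
print(json.dumps(config, indent=2))
```

Scoping each rule with a prefix filter lets the raw, cleansed, and analytics zones age on different schedules within one bucket.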

Partitioning Trade-offs

Partition by date and sometimes region to reduce scanned data; avoid too many tiny partitions that increase query overhead.
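
The trade-off shows up quickly when you count partitions. The sketch below compares daily and hourly partitioning for a hypothetical 2 TB dataset spanning 730 days and 5 regions; all of these figures are invented for illustration.

```python
def partition_stats(total_gb, days, regions, hourly=False):
    """Return (partition count, average partition size in MB)."""
    count = days * regions * (24 if hourly else 1)
    avg_mb = total_gb * 1024 / count
    return count, avg_mb


# Hypothetical dataset: 2000 GB over 730 days and 5 regions.
daily = partition_stats(2000, 730, 5)
hourly = partition_stats(2000, 730, 5, hourly=True)

print(f"daily:  {daily[0]} partitions, about {daily[1]:.0f} MB each")
print(f"hourly: {hourly[0]} partitions, about {hourly[1]:.0f} MB each")
```

Hourly partitioning multiplies the partition count by 24 and shrinks each one to a few tens of megabytes, which is the "too many tiny partitions" case the text warns about.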

Worked Design Walkthrough: Retail Analytics Pipeline

Ingestion Choices

Clickstream data flows through Kinesis Data Streams and Firehose into S3; POS files arrive via AWS Transfer Family over SFTP into the raw zone.

Storage and Lifecycle

All data lands in a single S3 bucket with raw, cleansed, and analytics prefixes, encrypted with KMS and managed via lifecycle rules.

Transformation Flow

S3 events, routed through Amazon EventBridge, start a Step Functions workflow, which launches Glue jobs to clean, deduplicate, and convert data to Parquet in partitioned cleansed prefixes.
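
A minimal state machine for this flow might contain a single task that runs a Glue job and waits for it to finish. The sketch below builds the Amazon States Language definition as a dictionary. It assumes the execution input is an S3 "Object Created" event delivered through Amazon EventBridge, so the object key is at `$.detail.object.key`; the job name, state name, argument name, and retry values are hypothetical.

```python
import json


def build_definition(job_name):
    """Build a one-step state machine that runs a Glue job to completion."""
    return {
        "Comment": "Run the cleansing job for a newly arrived raw object",
        "StartAt": "CleanRawData",
        "States": {
            "CleanRawData": {
                "Type": "Task",
                # The .sync suffix makes the state wait for the job run.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {
                    "JobName": job_name,
                    "Arguments": {
                        # A key ending in .$ is resolved from the input.
                        "--source_key.$": "$.detail.object.key",
                    },
                },
                "Retry": [{
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 60,
                    "MaxAttempts": 2,
                    "BackoffRate": 2.0,
                }],
                "End": True,
            }
        },
    }


print(json.dumps(build_definition("clean-sales-job"), indent=2))
```

Putting the retry policy in the workflow rather than in the job keeps failure handling visible in one place, and further states (for example a deduplication job) can be chained after this one.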

Analytics Consumption

Glue Data Catalog defines tables; Athena queries S3 for ad-hoc and BI, while Redshift hosts curated marts for fast executive dashboards.

Design Trade-off Thought Exercise

Storage Class Scenario

Decide which S3 storage classes to use for 0–6 months, 6–24 months, and 2–5 years of sales data, and how to express that in lifecycle rules.

Optimizing Athena

Your Athena queries over CSV are slow and costly. Think of three concrete changes to data format, layout, or configuration to improve them.

KMS Strategy

Plan a KMS key strategy that separates dev and prod decryption rights and define how Glue, Athena, and Redshift roles use those keys.

Connect to Pillars

For each decision, identify which Well-Architected pillars you are improving: Security, Performance efficiency, and/or Cost optimization.

Quiz: S3 and Ingestion Patterns

Check your understanding of S3 design and ingestion choices.

A company needs to ingest high-volume clickstream data from a web app into S3 in near real time. They want minimal operations overhead, automatic batching and compression, and server-side encryption with a customer managed KMS key. Which combination best fits these requirements?

  1. Web app writes JSON files directly to S3 over HTTPS using pre-signed URLs with SSE-S3
  2. Web app sends events to Amazon Kinesis Data Streams; an Amazon Kinesis Data Firehose delivery stream delivers to S3 with SSE-KMS using a customer managed key
  3. Web app publishes messages to Amazon SQS; an AWS Lambda function polls SQS and writes to S3 with client-side encryption
  4. Web app writes directly into an Amazon RDS database; a nightly AWS Glue job exports the data to S3

Answer: 2) Web app sends events to Amazon Kinesis Data Streams; an Amazon Kinesis Data Firehose delivery stream delivers to S3 with SSE-KMS using a customer managed key

Kinesis Data Firehose is designed for near-real-time delivery into S3 with automatic batching, compression, and integration with SSE-KMS using customer managed keys. Direct S3 uploads lack built-in batching/compression, SQS+Lambda adds more operational code, and RDS+nightly export is not near real time.

Quiz: Analytics and Cost Optimization

Test yourself on query layer selection and cost/performance trade-offs.

You store 50 TB of Parquet data in S3 for analytics. Analysts run ad-hoc SQL queries a few times per day. They do not want to manage servers. You need to minimize cost and still get good performance. Which is the MOST appropriate primary query service?

  1. Run a large Amazon EMR cluster with Hive and keep it running for quick access
  2. Use Amazon Athena with AWS Glue Data Catalog and optimize partitions
  3. Load all data into an Amazon RDS PostgreSQL instance and query from there
  4. Create a large Amazon Redshift cluster and copy all S3 data into Redshift tables

Answer: 2) Use Amazon Athena with AWS Glue Data Catalog and optimize partitions

Athena is serverless and charges per TB scanned, ideal for intermittent ad-hoc queries over S3 data. EMR and Redshift require managing clusters and incur ongoing costs; RDS is not ideal for 50 TB of analytical, columnar data.

Key Term Review: Data and Analytics on AWS

Flip these cards mentally to reinforce core concepts and exam language.

Data lake on AWS
A centralized repository, typically built on Amazon S3, that allows you to store all structured and unstructured data at any scale and analyze it using a variety of services like Athena, EMR, Redshift Spectrum, and Glue.
Raw, Cleansed, Analytics zones
Logical layers in an S3 data lake: raw (as-ingested, immutable), cleansed (validated, standardized), and analytics (optimized for querying, often Parquet and partitioned).
Amazon Kinesis Data Firehose
A fully managed service for reliably loading streaming data into destinations like S3, Redshift, and OpenSearch, with built-in batching, compression, and optional transformation.
Amazon Athena
A serverless interactive query service that lets you analyze data directly in Amazon S3 using standard SQL, paying only for the data scanned.
AWS Glue Data Catalog
A centralized metadata repository that stores table definitions, schema, and location information for data in S3 and other sources, used by services like Athena and EMR.
Partitioning (e.g., year/month/day)
Organizing data into directory-like segments (such as `year=2026/month=05/day=28`) so query engines can read only relevant subsets, reducing data scanned and improving performance.
SSE-KMS vs SSE-S3
SSE-KMS uses AWS KMS keys (often customer managed) for server-side encryption with fine-grained control and audit; SSE-S3 uses S3-managed keys with less control but simpler configuration.
Lifecycle policy
An S3 configuration that automatically transitions objects between storage classes and optionally expires them based on object age.
Performance efficiency pillar
The performance efficiency pillar focuses on the efficient use of computing resources to meet requirements and maintain that efficiency as demand changes and technologies evolve.
Cost optimization pillar
The cost optimization pillar includes the continual process of refinement and improvement of a system over its entire lifecycle to build and operate cost-aware systems that achieve business outcomes and minimize costs.

Key Terms

AWS Glue
A serverless data integration service that makes it easy to discover, prepare, move, and integrate data for analytics, machine learning, and application development.
Amazon S3
An object storage service offering high durability, scalability, and a variety of storage classes, commonly used as the core of data lake architectures.
Data lake
A centralized repository, usually on Amazon S3, that stores all structured and unstructured data at any scale and allows multiple analytics tools to access it.
Partitioning
The practice of organizing data into separate segments (such as by date or region) to improve query performance and reduce the amount of data scanned.
Amazon Athena
A serverless interactive query service that lets you analyze data directly in Amazon S3 using standard SQL.
Amazon Redshift
A fully managed, petabyte-scale cloud data warehouse service.
S3 Lifecycle policy
A set of rules that define how Amazon S3 manages objects during their lifetime, including transitions between storage classes and expiration.
Amazon Kinesis Data Streams
A service for capturing and processing real-time streaming data at scale.
Amazon Kinesis Data Firehose
A fully managed service that delivers real-time streaming data to destinations like S3, Redshift, and OpenSearch with built-in buffering and optional transformation.
AWS Key Management Service (KMS)
A managed service that makes it easy to create and control cryptographic keys used to encrypt data across AWS services.
