A data ingestion approach where data is collected over a period of time (minutes, hours, days) and processed together, typically using services like AWS Glue, EMR, or scheduled jobs over S3 data.

A continuous data ingestion approach where events are processed with low latency (seconds or less), commonly using Kinesis Data Streams, Kinesis Data Firehose, or Amazon MSK.

A scalable, real-time streaming service that uses shards to provide ordered, replayable streams and supports multiple custom consumer applications.

A serverless data integration service based on Apache Spark that simplifies discovering, preparing, and combining data for analytics, machine learning, and application development.

AWS Database Migration Service, a managed service for migrating and replicating databases to AWS with minimal downtime, including change data capture.

A serverless data integration and ETL service based on Apache Spark that discovers, prepares, and transforms data for analytics and machine learning.

A centralized repository that allows you to store all structured and unstructured data at any scale, often built on Amazon S3 in AWS architectures.

High-Performing Data Ingestion and Transformation Pipelines — AWS Solutions Architect Associate (SAA-C03): Complete Exam-Ready Masterclass

Q: You have 5 TB of historical CSV data in S3 and receive 50 GB of new data daily. You must convert it to Parquet, join with reference data, and compute aggregates. The job can run hourly, and cost and maintainability are important. Which service is the best primary engine for this ETL workload?

AWS Glue Spark jobs reading from and writing to S3. AWS Glue is a serverless Spark-based ETL service that is well-suited for large-scale transformations over S3 data, including format conversion, joins, and aggregations. It scales automatically and integrates with the AWS Glue Data Catalog. Lambda is not ideal for 5 TB plus daily 50 GB joins, EC2 scripts add management overhead, and RDS is not optimized for this scale of analytical ETL.

Big Picture: What Is a High-Performing Ingestion Pipeline?

What Is a Data Pipeline?

A data ingestion and transformation pipeline is the path data follows from producers into storage and analytics systems, plus all the transformations applied along the way.

Exam-Relevant Goals

On the exam, you must recognize AWS building blocks that make pipelines fast, reliable, and right-sized, without unnecessary complexity or cost.

Typical Producers and Consumers

Producers: apps, services, devices, on-prem DBs. Consumers: S3 data lakes, Redshift, DynamoDB/RDS, and streaming analytics like Kinesis Data Analytics.

Core AWS Ingestion Services

Key services: Kinesis Data Streams/Firehose, MSK, SQS, AWS IoT Core, AWS DMS, plus Glue, Lambda, EMR, and Redshift for transformations.

Common 4-Step Pattern

Pipelines usually: 1) Collect, 2) Buffer, 3) Transform, 4) Store and serve. We will tie these to Performance efficiency, Reliability, and Cost optimization.

Batch vs Streaming: Choosing the Right Ingestion Mode

Batch Ingestion

Batch collects data over time and processes it together. Use S3 + Glue/EMR/Batch when latency is relaxed and simplicity and cost-efficiency matter.

Streaming Ingestion

Streaming processes data continuously with low latency using Kinesis, MSK, or IoT Core. It suits real-time dashboards, alerts, or fraud detection.

Micro-Batching

Micro-batching groups streaming data into short windows (e.g., 1–5 minutes), common with Kinesis Data Firehose delivering to S3 or Redshift.

Performance Tradeoffs

Batch is cheaper and simpler but higher latency. Streaming is lower latency but more complex. Micro-batch balances cost and near-real-time needs.

Typical Exam Trap

If a scenario says near-real-time but only needs 5-minute updates, Firehose-based micro-batching is usually preferred over a full streaming analytics stack.

Core Ingestion Patterns on AWS

Pattern 1: Logs to Data Lake

App or ALB/CloudFront logs go to Kinesis Data Firehose, then to S3. Later, Glue, EMR, or Redshift run analytics on this durable, cheap storage.

Pattern 2: Real-Time Operational Events

Apps send events to Kinesis Data Streams. Lambda or Kinesis Data Analytics consume and write results to S3, DynamoDB, or Redshift.

Pattern 3: Database CDC

AWS DMS reads change logs from on-prem or RDS databases and streams changes into Kinesis, S3, or Redshift for near-real-time analytics.

Pattern 4: IoT Telemetry

Devices talk to AWS IoT Core. Rules forward messages to Firehose, Streams, or Lambda, which then store data in S3 or DynamoDB.

Key Design Tips

Decouple with Kinesis/SQS, use S3 as a landing zone, and pick Firehose when you want managed delivery instead of custom stream consumers.

Designing High-Throughput Streaming Ingestion

Kinesis Shards and Throughput

Kinesis Data Streams use shards. Each shard supports about 1 MB/s writes and 2 MB/s reads. You scale by adding shards.

Avoiding Hot Shards

Choose a high-cardinality partition key to spread events across shards. Monitor throughput exceeded metrics to catch hot spots.

Enhanced Fan-Out

Use enhanced fan-out when multiple consumers need high throughput without sharing the same read capacity on a stream.

Firehose Buffering

Kinesis Data Firehose buffers by size or time, then writes to S3/Redshift/OpenSearch. Smaller buffers lower latency but create more small files.

When to Use Which

Streams or MSK fit when you need custom consumers and replay. Firehose is best for managed delivery into S3/Redshift with minimal operations.

Transformation Workflows: Lambda, Glue, EMR, and Redshift

Lambda for Lightweight ETL

AWS Lambda is ideal for small, event-driven transformations such as filtering, enrichment, and format changes on streaming or S3-triggered data.

Glue for Large Batch Jobs

AWS Glue runs serverless Spark jobs over S3 data. It suits large batch or micro-batch ETL, especially when using Parquet and partition pruning.

EMR for Complex Pipelines

Amazon EMR gives you managed clusters for Spark, Hadoop, and more, with deep control for complex or custom big data workflows.

Redshift as a Transform Engine

Redshift can transform data via COPY and SQL, often in a pattern where raw S3 data is loaded into staging then into modeled analytics tables.

Common Exam Trap

Do not choose Lambda for huge joins or long-running ETL. For heavy workloads over large datasets, Glue or EMR is usually the correct choice.

End-to-End Example: Clickstream Analytics Pipeline on AWS

Scenario Overview

An e-commerce site wants near-real-time clickstream insights plus daily aggregated reports. We design a pipeline to meet both needs efficiently.

Step 1: Ingestion via Kinesis

Apps send events to API Gateway and Lambda, which writes to Kinesis Data Streams using a session ID partition key to spread load.

Step 2: Real-Time Enrichment

A Lambda consumer enriches events with DynamoDB user data, then sends them to Kinesis Data Firehose, which buffers and writes Parquet to S3.

Step 3: Hourly Aggregation

An hourly AWS Glue job processes S3 Parquet files to build per-user and per-page aggregates, writing them to partitioned S3 prefixes.

Step 4: Analytics Consumption

Athena queries aggregates in S3, and Redshift loads curated data for BI. The design balances low latency with cost-efficient batch ETL.

Integrating Ingestion with Storage and Compute Layers

Storage Options

S3 is the core data lake; Redshift is the data warehouse; DynamoDB and RDS/Aurora serve low-latency operational data.

Pattern: Streaming to S3

Common flow: Kinesis or Firehose to S3, then Glue/EMR for ETL, then Athena or Redshift for analytics over raw and curated data.

Pattern: Streaming to Operational Stores

Kinesis Data Streams feed Lambda, which updates DynamoDB or RDS to maintain real-time views used by applications.

Data Lake plus Warehouse

Use S3 as the system of record and Redshift for curated analytics, loading via COPY or querying S3 directly with Redshift Spectrum.

Performance Efficiency Focus

Align storage and compute to minimize data movement and scan size, and scale compute independently from storage when possible.

Design Exercise: Pick the Right Pattern

Use this thought exercise to practice choosing ingestion and transformation patterns.

Scenario A

A mobile game sends gameplay events that must appear on a live leaderboard within 2 seconds. A nightly batch job also computes detailed player statistics.

Questions to answer mentally:

Which ingestion service(s) would you choose and why?
What would handle the real-time leaderboard updates?
How would you implement the nightly batch job?

One reasonable design

Ingestion: Kinesis Data Streams to capture gameplay events.
Real-time: Lambda consumer reads from Kinesis, updates DynamoDB leaderboard table (partitioned by leaderboard ID, sort by score or timestamp).
Batch: Raw events are also delivered to S3 (via another consumer or Firehose). A nightly Glue job aggregates stats from S3 into a curated S3 prefix or Redshift.

Scenario B

A financial system needs to replicate an on-prem Oracle database into AWS for near-real-time analytics, with minimal impact on the source database.

Questions:

Which service best handles ongoing replication?
Where would you land the replicated data first?
Which analytics layer would you add?

One reasonable design

Replication: AWS DMS with ongoing replication from Oracle.
Landing zone: S3 for full and incremental data.
Analytics: Redshift or Athena over S3, with periodic ETL via Glue.

As you go through scenarios like these, explicitly think: latency requirement, data volume, operational effort, and which AWS services map cleanly to those needs.

Quiz 1: Ingestion Mode and Service Choice

Check your understanding of ingestion modes and AWS services.

A startup wants to collect application logs and load them into S3 for analysis. They need data available in S3 within 3–5 minutes, do not want to manage consumer applications, and expect traffic spikes. Which option best fits these requirements while keeping operations simple?

Send logs directly from the app to an S3 bucket using the AWS SDK
Send logs to Kinesis Data Streams and run a custom EC2-based consumer to write to S3
Send logs to Kinesis Data Firehose configured to deliver to S3 with a small buffer interval
Write logs to an EBS volume and run a nightly batch job to upload them to S3

Show Answer

Answer: C) Send logs to Kinesis Data Firehose configured to deliver to S3 with a small buffer interval

Kinesis Data Firehose is a fully managed delivery service that buffers data and writes it to S3 without requiring you to build and manage consumer applications. Configuring a small buffer interval gives 3–5 minute latency while handling traffic spikes. Direct S3 writes from apps are harder to buffer and retry, Streams + EC2 adds operational overhead, and nightly batch does not meet the 3–5 minute requirement.

Quiz 2: Transformation Engine Selection

Test your ability to choose the right transformation service.

You have 5 TB of historical CSV data in S3 and receive 50 GB of new data daily. You must convert it to Parquet, join with reference data, and compute aggregates. The job can run hourly, and cost and maintainability are important. Which service is the best primary engine for this ETL workload?

AWS Lambda functions triggered for each new file
AWS Glue Spark jobs reading from and writing to S3
A fleet of EC2 instances running custom Python scripts on attached EBS volumes
Amazon RDS for PostgreSQL with periodic bulk imports and SQL transforms

Show Answer

Answer: B) AWS Glue Spark jobs reading from and writing to S3

AWS Glue is a serverless Spark-based ETL service that is well-suited for large-scale transformations over S3 data, including format conversion, joins, and aggregations. It scales automatically and integrates with the AWS Glue Data Catalog. Lambda is not ideal for 5 TB plus daily 50 GB joins, EC2 scripts add management overhead, and RDS is not optimized for this scale of analytical ETL.

Key Terms Review

Flip these cards mentally to reinforce core concepts and services.

Batch ingestion: A data ingestion approach where data is collected over a period of time (minutes, hours, days) and processed together, typically using services like AWS Glue, EMR, or scheduled jobs over S3 data.
Streaming ingestion: A continuous data ingestion approach where events are processed with low latency (seconds or less), commonly using Kinesis Data Streams, Kinesis Data Firehose, or Amazon MSK.
Kinesis Data Streams: A scalable, real-time streaming service that uses shards to provide ordered, replayable streams and supports multiple custom consumer applications.
Kinesis Data Firehose: A fully managed service that reliably loads streaming data into destinations such as Amazon S3, Amazon Redshift, and Amazon OpenSearch Service, handling buffering, scaling, and retries for you.
AWS Glue: A serverless data integration service based on Apache Spark that simplifies discovering, preparing, and combining data for analytics, machine learning, and application development.
AWS DMS (Database Migration Service): A managed service that helps migrate and replicate databases to AWS with minimal downtime, including ongoing change data capture from source databases.
Data lake on Amazon S3: A centralized repository on S3 that stores all structured and unstructured data at any scale, often in columnar formats like Parquet, for analytics using services such as Athena, Glue, EMR, and Redshift Spectrum.
Micro-batching: A hybrid approach where streaming data is grouped into small time windows (for example, every 1–5 minutes) before transformation or storage, balancing latency and cost.
Enhanced fan-out (Kinesis): A Kinesis Data Streams feature that provides each consumer with dedicated throughput, reducing read contention and improving performance for multiple consumers.
Redshift Spectrum: A feature of Amazon Redshift that allows you to run SQL queries directly against data in S3 without loading it into Redshift tables, useful for extending a data warehouse with a data lake.

Key Terms

AWS DMS: AWS Database Migration Service, a managed service for migrating and replicating databases to AWS with minimal downtime, including change data capture.
AWS Glue: A serverless data integration and ETL service based on Apache Spark that discovers, prepares, and transforms data for analytics and machine learning.
Data lake: A centralized repository that allows you to store all structured and unstructured data at any scale, often built on Amazon S3 in AWS architectures.
Amazon EMR: A managed big data service that makes it easy to run frameworks like Apache Spark, Hadoop, and Presto on scalable clusters.
Micro-batching: Grouping streaming data into small, fixed time windows before processing, combining benefits of streaming and batch.
Batch ingestion: A data ingestion approach where data is collected over a period of time and processed together in bulk jobs, often with relaxed latency requirements.
Redshift Spectrum: A feature of Amazon Redshift that enables querying data directly in S3 using SQL, without loading it into Redshift tables.
Streaming ingestion: A continuous ingestion approach where events are processed as they arrive with low latency using streaming services like Kinesis Data Streams or Amazon MSK.
Kinesis Data Streams: A scalable, real-time streaming service that uses shards to provide ordered, replayable event streams with support for multiple consumer applications.
Kinesis Data Firehose: A fully managed service that loads streaming data into destinations such as S3, Redshift, and OpenSearch, handling buffering, scaling, and retries.

Big Picture: What Is a High-Performing Ingestion Pipeline?

What Is a Data Pipeline?

Exam-Relevant Goals

Typical Producers and Consumers

Core AWS Ingestion Services

Common 4-Step Pattern

Batch vs Streaming: Choosing the Right Ingestion Mode

Batch Ingestion

Streaming Ingestion

Micro-Batching

Performance Tradeoffs

Typical Exam Trap

Core Ingestion Patterns on AWS

Pattern 1: Logs to Data Lake

Pattern 2: Real-Time Operational Events

Pattern 3: Database CDC

Pattern 4: IoT Telemetry

Key Design Tips

Designing High-Throughput Streaming Ingestion

Kinesis Shards and Throughput

Avoiding Hot Shards

Enhanced Fan-Out

Firehose Buffering

When to Use Which

Transformation Workflows: Lambda, Glue, EMR, and Redshift

Lambda for Lightweight ETL

Glue for Large Batch Jobs

EMR for Complex Pipelines

Redshift as a Transform Engine

Common Exam Trap

End-to-End Example: Clickstream Analytics Pipeline on AWS

Scenario Overview

Step 1: Ingestion via Kinesis

Step 2: Real-Time Enrichment

Step 3: Hourly Aggregation

Step 4: Analytics Consumption

Integrating Ingestion with Storage and Compute Layers

Storage Options

Pattern: Streaming to S3

Pattern: Streaming to Operational Stores

Data Lake plus Warehouse

Performance Efficiency Focus

Design Exercise: Pick the Right Pattern

Quiz 1: Ingestion Mode and Service Choice

Quiz 2: Transformation Engine Selection

Key Terms Review

Key Terms

Finished reading?