Chapter 16 of 26

Data Ingestion and Transformation Patterns for Analytics and Streaming

Modern workloads often need to ingest and process data at scale. This module surveys high-performing ingestion and transformation patterns that appear in associate-level exam scenarios.

Module Overview: Why Ingestion Patterns Matter

Where This Fits

Here we focus on the front of the data lifecycle: how data enters AWS and is transformed for analytics and streaming workloads.

Exam Relevance

Expect questions about log processing, real-time dashboards, and integrating apps with analytics. You must choose appropriate services and patterns.

Core Learning Goals

You will learn to distinguish batch vs streaming, design S3-based ETL, sketch Kinesis + Lambda flows, and address security and performance.

Key Services

We emphasize S3, AWS Glue, AWS Lambda, Kinesis Data Streams, Kinesis Data Firehose, plus helpers like SQS, API Gateway, and CloudFront.

Well-Architected Lens

Ingestion designs must align with the Security, Reliability, and Performance Efficiency pillars of the AWS Well-Architected Framework.

Batch vs Streaming Ingestion: Core Concepts

Batch Ingestion

Batch ingestion collects data over time and delivers it in chunks. The data usually lands in S3 and is processed by scheduled jobs (Glue, EMR, or Lambda). Latency ranges from minutes to hours.

Streaming Ingestion

Streaming handles a continuous flow of small records via Kinesis Data Streams or Firehose, with consumers like Lambda for near real-time processing.

When to Use Batch

Use batch for nightly exports, partner CSVs, and historical reporting where up-to-the-second freshness is not required.

When to Use Streaming

Use streaming for clickstreams, IoT, fraud detection, and dashboards that must update within seconds.

Hybrid Patterns

Many designs stream for real-time needs but also persist to S3 for long-term analytics and cost-effective storage.

S3-Based Batch Ingestion: Pattern and Architecture

S3 as Landing Zone

S3 is your data lake landing zone. Apps, partners, and on-prem systems all upload files into S3 buckets for batch analytics.

Organizing Data

Use prefixes like `raw/year=2026/month=05/day=28/` to partition data. This speeds up Athena and Redshift Spectrum queries.
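As a minimal sketch of this layout, the helper below builds a Hive-style `year=/month=/day=` prefix from a date. The function names and the `zone` parameter are illustrative assumptions, not part of any AWS API.

```python
from datetime import date


def partition_prefix(day: date, zone: str = "raw") -> str:
    """Build a Hive-style year=/month=/day= prefix for one day of data."""
    # Zero-padding keeps prefixes sortable and matches the example layout.
    return f"{zone}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"


def object_key(day: date, filename: str, zone: str = "raw") -> str:
    """Full S3 object key for a file that belongs to the given day."""
    return partition_prefix(day, zone) + filename


print(object_key(date(2026, 5, 28), "orders.csv"))
# prints raw/year=2026/month=05/day=28/orders.csv
```

Because the partition values are encoded in the key itself, a query that filters on year, month, and day only needs to read objects under the matching prefix.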

Triggering Workflows

S3 Event Notifications on ObjectCreated events can invoke Lambda, send a message to SQS, or publish to EventBridge to kick off ETL jobs.
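A Lambda function wired to these notifications might start like the sketch below. It assumes the standard S3 notification payload (a `Records` list with `eventName`, bucket name, and URL-encoded object key); the helper name `extract_new_objects` and the return value of `handler` are illustrative choices.

```python
from urllib.parse import unquote_plus


def extract_new_objects(event: dict) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs for every ObjectCreated record in the event."""
    objects = []
    for record in event.get("Records", []):
        # Event names look like "ObjectCreated:Put"; skip anything else.
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded (spaces become "+", "=" becomes "%3D").
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


def handler(event, context):
    """Lambda entry point: log each new object, then hand off to ETL."""
    new_objects = extract_new_objects(event)
    for bucket, key in new_objects:
        # A real pipeline would validate the file or start a Glue job here.
        print(f"new object: s3://{bucket}/{key}")
    return {"processed": len(new_objects)}
```

Decoding the key matters for partitioned prefixes, because the `=` characters in `year=2026` are percent-encoded in the notification.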

Transform with Glue

AWS Glue jobs convert raw CSV/JSON into columnar formats (Parquet/ORC) and write curated data back to S3.

Downstream Analytics

Services like Athena, Redshift, and OpenSearch consume the processed data for reporting, BI, and search.

AWS Glue and Lambda for Transformation (High-Level)

Role of AWS Glue

AWS Glue is a serverless ETL service, ideal for large batch transformations, format conversion, and schema management via the Glue Data Catalog.

Role of AWS Lambda

AWS Lambda runs short, event-driven functions, perfect for lightweight transformations, validation, and routing during ingestion.

Batch Pattern

In S3 workflows, Lambda can validate new files; Glue then performs heavy ETL, writing optimized Parquet data back to S3.

Streaming Pattern

In streaming, Lambda commonly consumes Kinesis streams to apply near real-time business logic per batch of events.

Common Exam Trap

Avoid choosing Lambda for huge, long-running ETL over terabytes. Glue or EMR is more appropriate for that scale.

Kinesis Data Streams vs Kinesis Data Firehose

Kinesis Data Streams

KDS gives you a stream made of shards that you size yourself, with support for multiple consumers and replay within a retention window (24 hours by default, extendable up to 365 days) for custom processing.

Kinesis Data Firehose

Firehose (renamed Amazon Data Firehose in 2024) is a fully managed delivery service that buffers incoming records and delivers them to S3, OpenSearch, or third-party destinations with minimal configuration.

Control vs Simplicity

Use KDS when you need fine control and multiple consumers; use Firehose when you mainly need to land data into a destination.

Transformation Options

KDS relies on consumers (like Lambda) for logic. Firehose can call Lambda for light transforms and compress data before writing.
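To make the Firehose option concrete, here is a hedged sketch of a transformation Lambda. It follows the Firehose record contract (each input record has a `recordId` and base64 `data`; each output record echoes the `recordId` with a `result` of `Ok`, `Dropped`, or `ProcessingFailed`). The filtering rule on a `level` field and the added `processed` flag are invented for illustration.

```python
import base64
import json


def handler(event, context):
    """Firehose transform: drop DEBUG logs, tag the rest, emit JSON lines."""
    output = []
    for record in event["records"]:
        record_id = record["recordId"]
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            if not isinstance(payload, dict):
                raise ValueError("expected a JSON object")
        except ValueError:
            # Undecodable records are returned unchanged and marked as failed.
            output.append({"recordId": record_id, "result": "ProcessingFailed",
                           "data": record["data"]})
            continue
        if payload.get("level") == "DEBUG":  # hypothetical filtering rule
            output.append({"recordId": record_id, "result": "Dropped",
                           "data": record["data"]})
            continue
        payload["processed"] = True  # hypothetical enrichment
        line = json.dumps(payload) + "\n"  # newline-delimited JSON for S3
        output.append({"recordId": record_id, "result": "Ok",
                       "data": base64.b64encode(line.encode("utf-8")).decode("ascii")})
    return {"records": output}
```

Every input record must appear in the response with the same `recordId`, which is why dropped and failed records are still returned.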

Exam Clues

“Custom processing, replay, multiple apps” → KDS. “Easily deliver logs/events into S3/OpenSearch” → Firehose.

Streaming Architecture: Kinesis + Lambda + S3

Scenario

An e-commerce app streams click events for real-time dashboards while keeping all history for later analytics.

Step 1: Produce Events

Front-end or backend services send JSON click events to Kinesis Data Streams using the AWS SDK.
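A producer call could look roughly like the sketch below. `put_record` with `StreamName`, `Data`, and `PartitionKey` is the real Kinesis API operation exposed by the AWS SDK for Python; the stream name, the event fields, and the `DryRunClient` stand-in (used so the example runs without AWS access) are assumptions.

```python
import json


def send_click_event(kinesis_client, stream_name: str, event: dict) -> dict:
    """Send one click event, partitioned by user so a user's events stay ordered."""
    return kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["user_id"]),
    )


class DryRunClient:
    """Stand-in for boto3.client("kinesis"); records calls instead of sending."""

    def __init__(self):
        self.calls = []

    def put_record(self, **kwargs):
        self.calls.append(kwargs)
        return {"ShardId": "shardId-000000000000",
                "SequenceNumber": str(len(self.calls))}


client = DryRunClient()  # in a real producer: boto3.client("kinesis")
send_click_event(client, "clickstream", {"user_id": 42, "page": "/checkout"})
print(client.calls[0]["PartitionKey"])  # prints 42
```

High-volume producers would normally batch events with `put_records` rather than sending them one at a time.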

Step 2: KDS Ingestion

KDS shards partition events by a key (like user ID) and retain them for a set period to enable replay.
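The routing can be illustrated with a small sketch. Kinesis maps each partition key to a 128-bit integer with MD5 and assigns the record to the shard whose hash key range contains that value. The code below assumes the shards split the hash space evenly, which is a simplification that need not hold after resharding; `shard_for` is an illustrative helper, not an SDK call.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 digests are 128 bits wide


def hash_key(partition_key: str) -> int:
    """Map a partition key to a 128-bit integer using MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def shard_for(partition_key: str, shard_count: int) -> int:
    """Index of the shard owning this key, assuming evenly split hash ranges."""
    range_size = HASH_SPACE // shard_count
    # Clamp so values in the leftover tail of the space land in the last shard.
    return min(hash_key(partition_key) // range_size, shard_count - 1)


for user in ("user-1", "user-2", "user-3"):
    print(user, "-> shard", shard_for(user, shard_count=4))
```

The same key always hashes to the same shard, which is what preserves per-user ordering. It also means a single very active key can overload one shard.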

Step 3: Lambda Consumers

Lambda consumes records from KDS, enriches them, and updates a low-latency store such as DynamoDB or ElastiCache.

Step 4–5: Archive + Analyze

Lambda (or another consumer, such as Firehose reading from the stream) writes events to S3. Glue crawlers catalog the data so Athena or Redshift can query it later.

Secure, High-Performance Ingestion Endpoints

Security Pillar Focus

Protect ingestion with TLS, encryption at rest, IAM-based access control, and VPC endpoints to keep traffic inside AWS.

Performance Efficiency

Use multipart S3 uploads, scale Kinesis shards, and keep ingestion in the same Region as producers to reduce latency.
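For multipart uploads, the part size has to respect S3's limits: at most 10,000 parts per upload, and each part between 5 MiB and 5 GiB (the last part may be smaller). The sketch below picks a part size under those limits; the 64 MiB default and the function names are assumptions made for illustration.

```python
MIB = 1024 ** 2
MIN_PART = 5 * MIB          # smallest allowed part (except the last one)
MAX_PART = 5 * 1024 * MIB   # largest allowed part (5 GiB)
MAX_PARTS = 10_000          # maximum number of parts per upload


def choose_part_size(object_size: int, preferred: int = 64 * MIB) -> int:
    """Pick a part size that keeps the upload within S3's multipart limits."""
    # Smallest part size that fits the object into MAX_PARTS parts (ceiling division).
    needed = (object_size + MAX_PARTS - 1) // MAX_PARTS
    part_size = max(preferred, MIN_PART, needed)
    if part_size > MAX_PART:
        raise ValueError("object is too large for one multipart upload")
    return part_size


def part_count(object_size: int, part_size: int) -> int:
    """Number of parts needed to upload the object at the given part size."""
    return max(1, (object_size + part_size - 1) // part_size)


size = 10 * 1024 ** 3  # a 10 GiB partner file
part = choose_part_size(size)
print(part // MIB, "MiB parts,", part_count(size, part), "parts")
# prints 64 MiB parts, 160 parts
```

Parts can be uploaded in parallel and retried individually, which is where the throughput gain for large files comes from.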

Network Design

Leverage Gateway VPC endpoints for S3 and Interface endpoints for Kinesis so private subnets can ingest without public IPs.
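One way to enforce the private path is a bucket policy that denies any request not arriving through a specific VPC endpoint, using the `aws:SourceVpce` condition key. The sketch below only builds the policy document; the bucket name and endpoint ID used in the example are placeholders.

```python
import json


def vpce_only_policy(bucket: str, vpce_id: str) -> dict:
    """Bucket policy that denies S3 access unless it comes via one VPC endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnlessFromVpcEndpoint",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                # Requests that do not arrive through this endpoint are denied.
                "Condition": {"StringNotEquals": {"aws:SourceVpce": vpce_id}},
            }
        ],
    }


# Placeholder names; substitute your own bucket and endpoint ID.
print(json.dumps(vpce_only_policy("example-landing", "vpce-0123456789abcdef0"), indent=2))
```

A blanket deny like this also blocks access from outside the VPC, including the console, so it is usually scoped to the ingestion prefix or paired with exceptions for administrative roles.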

Global Producers

Put CloudFront in front of S3 or API Gateway so that users around the world connect to a nearby edge location, which terminates TLS close to them and reduces latency.

Common Exam Mistake

Avoid public S3 buckets as ingestion endpoints. Use pre-signed URLs or an authenticated API layer instead.
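The pre-signed URL approach can be sketched as follows. `generate_presigned_url` with `Params` and `ExpiresIn` is the real method on the S3 client in the AWS SDK for Python; the 15-minute expiry, the bucket and key names, and the `DryRunS3` stand-in (which returns a placeholder URL, not a real signature) are assumptions so the example runs without AWS credentials.

```python
def presigned_upload_url(s3_client, bucket: str, key: str,
                         expires_seconds: int = 900) -> str:
    """Ask S3 for a short-lived URL that allows a PUT of exactly one object."""
    return s3_client.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_seconds,
    )


class DryRunS3:
    """Stand-in for boto3.client("s3"); returns a placeholder, unsigned URL."""

    def generate_presigned_url(self, client_method, Params=None, ExpiresIn=3600):
        self.last_call = (client_method, Params, ExpiresIn)
        return (f"https://{Params['Bucket']}.s3.example.invalid/"
                f"{Params['Key']}?expires={ExpiresIn}")


s3 = DryRunS3()  # in a real service: boto3.client("s3")
url = presigned_upload_url(s3, "example-landing",
                           "raw/year=2026/month=05/day=28/partner.csv")
print(url)
```

The backend that issues the URL authenticates the caller first, so the bucket itself never needs a public policy.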

Design Exercise: Pick the Ingestion Pattern

Work through these thought exercises. For each scenario, decide if you would use batch (S3-based) or streaming (Kinesis-based) ingestion, and which transformation tool is the best fit.

  1. Mobile game telemetry
  • Millions of small events per minute: scores, moves, errors.
  • Need a near real-time dashboard for game balancing and cheat detection.
  • Also need to store all events for weekly machine learning training.
  • Question: Would you choose Kinesis Data Streams or Firehose (or both)? Where do you store history? Would Lambda or Glue do the heavy lifting?
  2. Monthly billing reports from partners
  • Once per month, each partner sends one large CSV file (5–10 GB) containing all their transactions.
  • You generate monthly invoices and ad hoc financial reports.
  • Question: Is streaming necessary here? What S3 pattern and ETL tool would you use? How would you organize S3 prefixes?
  3. IoT temperature sensors in factories
  • Thousands of sensors sending readings every few seconds.
  • You must trigger alerts within 5 seconds if temperature exceeds a threshold, and keep full history for compliance.
  • Question: What streaming service is appropriate? How do you implement low-latency alerts and long-term storage?

Pause and sketch simple architectures for each: list producers → ingestion service → transformation → storage → analytics. Then compare your answers with the patterns from previous steps.

Quiz 1: Batch vs Streaming and Kinesis Choices

Test your understanding of ingestion patterns and Kinesis services.

A company needs to ingest application logs from thousands of EC2 instances and deliver them to Amazon S3 and Amazon OpenSearch Service with minimal operational overhead. They do not need to replay data once it has been delivered. Which service is the BEST fit for the ingestion layer?

  A) Amazon Kinesis Data Streams with custom consumer applications
  B) Amazon Kinesis Data Firehose
  C) AWS Lambda writing directly to Amazon S3 and OpenSearch
  D) Amazon SQS with EC2-based workers

Answer: B) Amazon Kinesis Data Firehose

Kinesis Data Firehose is designed to ingest streaming data and deliver it to destinations like S3 and OpenSearch with minimal management. It auto-scales, buffers, optionally transforms, and delivers data. Kinesis Data Streams requires you to manage shards and consumers. Lambda alone is not a streaming ingestion service. SQS is a message queue, not optimized for continuous log delivery to multiple analytics destinations.

Quiz 2: Transformation and Security

Check your understanding of transformation tools and secure ingestion.

You ingest large CSV files (hundreds of GB per day) into Amazon S3 and must convert them to partitioned Parquet for Athena queries. The job can run hourly and may take more than 15 minutes. Which combination is MOST appropriate?

  A) Use AWS Lambda triggered by S3 events to read the CSV files and write Parquet back to S3.
  B) Use AWS Glue Jobs scheduled hourly to read from S3 and write partitioned Parquet to another S3 prefix.
  C) Use Amazon Kinesis Data Streams to read the CSV files and write Parquet to S3.
  D) Use Amazon SQS with EC2 workers to convert CSV to Parquet.

Answer: B) Use AWS Glue Jobs scheduled hourly to read from S3 and write partitioned Parquet to another S3 prefix.

AWS Glue Jobs are designed for large-scale, long-running ETL over data in S3 and integrate well with the Glue Data Catalog for Athena. Lambda has a 15-minute max runtime and is not ideal for heavy ETL over hundreds of GB. Kinesis Data Streams is for streaming records, not bulk file conversion. SQS + EC2 could work but requires more management; Glue is the serverless, exam-preferred approach.

Key Term Flashcards: Ingestion and Streaming

Flip through these flashcards to reinforce key terms and patterns.

Batch ingestion
A pattern where data is collected over a period of time and ingested in chunks (e.g., hourly or daily) into a landing zone such as Amazon S3 for later processing.
Streaming ingestion
A pattern where data is continuously ingested as a flow of small records, typically using services like Amazon Kinesis Data Streams or Kinesis Data Firehose for near real-time processing.
Amazon Kinesis Data Streams
A managed streaming platform that uses shards to store records for a configurable retention period and supports multiple consumers and replay for custom real-time processing.
Amazon Kinesis Data Firehose
A fully managed service that reliably ingests and automatically delivers streaming data to destinations such as Amazon S3 and Amazon OpenSearch Service, with optional lightweight Lambda-based transformations.
AWS Glue
A serverless ETL service used for large-scale batch transformations, schema management via the Glue Data Catalog, and converting raw data into analytics-ready formats like Parquet.
AWS Lambda in ingestion pipelines
A serverless compute service used for short-lived, event-driven tasks such as validating records, enriching streaming events, or routing data in response to S3 or Kinesis triggers.
S3 data lake landing zone
An Amazon S3 bucket or set of prefixes where raw data from multiple producers is initially stored, often partitioned by time or other keys, before transformation and analytics.
S3 Event Notifications
A feature that triggers actions (such as invoking Lambda, sending to SQS, or publishing to EventBridge) when events like object creation occur in an S3 bucket.
VPC endpoint for S3
A Gateway VPC endpoint that allows resources in a VPC to privately access Amazon S3 without using public IP addresses or traversing the public internet.
Replay capability
The ability of a streaming system such as Kinesis Data Streams to retain and re-read historical records within a configured retention window for reprocessing or debugging.

Key Terms

Replay
The capability to read historical records from a stream again within its retention window, useful for reprocessing or recovery.
AWS Glue
A serverless ETL service used for large-scale batch transformations, schema management via the Glue Data Catalog, and converting data into analytics-ready formats.
AWS Lambda
A serverless compute service that runs short-lived functions in response to events from services like S3 and Kinesis, often used for lightweight transformations.
S3 data lake
An architecture where Amazon S3 acts as the central, durable, and cost-effective storage layer for raw and processed analytical data.
VPC endpoint
A private connection that allows resources in a VPC to access supported AWS services (such as S3 or Kinesis) without using public IPs or the public internet.
Batch ingestion
A pattern where data is collected over a period of time and ingested in bulk into a landing zone such as Amazon S3 for later processing.
Streaming ingestion
A pattern where data is continuously ingested as a flow of small records using services such as Amazon Kinesis Data Streams or Kinesis Data Firehose.
S3 Event Notifications
An S3 feature that triggers targets such as Lambda, SQS, or EventBridge when objects are created, removed, or modified in a bucket.
Amazon Kinesis Data Streams
A managed streaming platform that uses shards to store records for a configurable retention period and supports multiple consumers and replay.
Amazon Kinesis Data Firehose
A fully managed delivery service that ingests streaming data and automatically delivers it to destinations like Amazon S3 and Amazon OpenSearch Service.
