Chapter 16 of 26
Data Ingestion and Transformation Patterns for Analytics and Streaming
Modern workloads often need to ingest and process data at scale. This module surveys high-performing ingestion and transformation patterns that appear in associate-level exam scenarios.
Module Overview: Why Ingestion Patterns Matter
Where This Fits
Here we focus on the front of the data lifecycle: how data enters AWS and is transformed for analytics and streaming workloads.
Exam Relevance
Expect questions about log processing, real-time dashboards, and integrating apps with analytics. You must choose appropriate services and patterns.
Core Learning Goals
You will learn to distinguish batch vs streaming, design S3-based ETL, sketch Kinesis + Lambda flows, and address security and performance.
Key Services
We emphasize S3, AWS Glue, AWS Lambda, Kinesis Data Streams, Kinesis Data Firehose, plus helpers like SQS, API Gateway, and CloudFront.
Well-Architected Lens
Ingestion designs should align with the Security, Reliability, and Performance Efficiency pillars of the AWS Well-Architected Framework.
Batch vs Streaming Ingestion: Core Concepts
Batch Ingestion
Batch ingestion collects data over time and sends it in chunks. The data usually lands in S3, where scheduled jobs (Glue, EMR, or Lambda) process it. Latency ranges from minutes to hours.
Streaming Ingestion
Streaming handles a continuous flow of small records via Kinesis Data Streams or Firehose, with consumers like Lambda for near real-time processing.
When to Use Batch
Use batch for nightly exports, partner CSVs, and historical reporting where up-to-the-second freshness is not required.
When to Use Streaming
Use streaming for clickstreams, IoT, fraud detection, and dashboards that must update within seconds.
Hybrid Patterns
Many designs stream for real-time needs but also persist to S3 for long-term analytics and cost-effective storage.
S3-Based Batch Ingestion: Pattern and Architecture
S3 as Landing Zone
S3 is your data lake landing zone. Apps, partners, and on-prem systems all upload files into S3 buckets for batch analytics.
Organizing Data
Use prefixes like `raw/year=2026/month=05/day=28/` to partition data. Athena and Redshift Spectrum can then skip partitions a query does not need, which reduces the data scanned and speeds up queries.
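A minimal sketch of building that prefix in code, assuming date-based partitions only. The function names and the `orders.csv` filename are illustrative, not part of any AWS API.

```python
from datetime import date


def partition_prefix(day: date, zone: str = "raw") -> str:
    """Build a Hive-style, date-partitioned S3 key prefix."""
    return f"{zone}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"


def object_key(day: date, filename: str, zone: str = "raw") -> str:
    """Full object key for a file landing in that day's partition."""
    return partition_prefix(day, zone) + filename


# Hypothetical file name, used only to show the resulting key.
print(object_key(date(2026, 5, 28), "orders.csv"))
# prints: raw/year=2026/month=05/day=28/orders.csv
```

Zero-padding the month and day keeps prefixes sorted lexicographically, which makes listings and partition filters easier to reason about.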
Triggering Workflows
S3 Event Notifications on ObjectCreated events can invoke Lambda, send a message to SQS or SNS, or publish to EventBridge to kick off ETL jobs.
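As a rough sketch, a Lambda function on the receiving end of such a notification might pull out the bucket and key like this. The handler name, bucket name, and sample event are assumptions for illustration; only the `Records` / `s3` / `bucket` / `object` layout follows the notification message format.

```python
from urllib.parse import unquote_plus


def created_objects_handler(event, context=None):
    """Return (bucket, key) pairs for each object-created record."""
    created = []
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue  # ignore deletes and other event types
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded in the notification (spaces become '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        created.append((bucket, key))
    # A real function would start a Glue job or enqueue work here.
    return created


# Hypothetical event, trimmed to the fields the handler reads.
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-landing-bucket"},
                "object": {"key": "raw/year%3D2026/month%3D05/day%3D28/partner+a.csv"},
            },
        }
    ]
}

print(created_objects_handler(sample_event))
```

Decoding the key matters because partition-style prefixes contain `=` characters, which appear percent-encoded in the event.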
Transform with Glue
AWS Glue jobs convert raw CSV/JSON into columnar formats (Parquet/ORC) and write curated data back to S3.
Downstream Analytics
Services like Athena, Redshift, and OpenSearch consume the processed data for reporting, BI, and search.
AWS Glue and Lambda for Transformation (High-Level)
Role of AWS Glue
AWS Glue is a serverless ETL service, ideal for large batch transformations, format conversion, and schema management via the Glue Data Catalog.
Role of AWS Lambda
AWS Lambda runs short, event-driven functions, perfect for lightweight transformations, validation, and routing during ingestion.
Batch Pattern
In S3 workflows, Lambda can validate new files; Glue then performs heavy ETL, writing optimized Parquet data back to S3.
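The validation step can be as small as a header check before the heavy ETL runs. This is a sketch under assumptions: the required column names are a made-up schema, and the file body is passed in as a string rather than read from S3.

```python
import csv
import io

# Hypothetical schema for illustration; a real pipeline defines its own.
REQUIRED_COLUMNS = {"transaction_id", "partner_id", "amount"}


def missing_columns(body: str, required=REQUIRED_COLUMNS) -> set:
    """Return the required columns that are absent from the CSV header row."""
    reader = csv.reader(io.StringIO(body))
    header = next(reader, [])  # empty file -> empty header
    present = {name.strip().lower() for name in header}
    return set(required) - present


good = "transaction_id,partner_id,amount\n1,p1,9.99\n"
bad = "transaction_id,amount\n1,9.99\n"

print(missing_columns(good))  # prints: set()
print(missing_columns(bad))   # prints: {'partner_id'}
```

A file that fails the check could be moved to a quarantine prefix so the Glue job only ever sees well-formed input.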
Streaming Pattern
In streaming, Lambda commonly consumes Kinesis streams to apply near real-time business logic per batch of events.
Common Exam Trap
Avoid choosing Lambda for huge, long-running ETL over terabytes. Glue or EMR is more appropriate for that scale.
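The reasoning behind this trap can be written down as a simple rule of thumb. The 15-minute figure is Lambda's maximum timeout; the 1 GB size threshold is an assumption chosen for illustration, not an AWS limit.

```python
LAMBDA_MAX_RUNTIME_MINUTES = 15  # Lambda's maximum timeout
LAMBDA_COMFORTABLE_GB = 1.0      # assumed rule of thumb, not an AWS limit


def pick_transform_tool(expected_minutes: float, input_gb: float) -> str:
    """Suggest a transformation tool from rough job size estimates."""
    too_long = expected_minutes >= LAMBDA_MAX_RUNTIME_MINUTES
    too_big = input_gb > LAMBDA_COMFORTABLE_GB
    return "Glue or EMR" if (too_long or too_big) else "Lambda"


print(pick_transform_tool(expected_minutes=1, input_gb=0.01))  # prints: Lambda
print(pick_transform_tool(expected_minutes=40, input_gb=300))  # prints: Glue or EMR
```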
Kinesis Data Streams vs Kinesis Data Firehose
Kinesis Data Streams
KDS gives you a configurable stream with shards, multiple consumers, and replay over a retention window (24 hours by default, extendable up to 365 days) for custom processing.
Kinesis Data Firehose
Firehose (renamed Amazon Data Firehose in 2024) is a fully managed delivery service that buffers data and delivers it to S3, OpenSearch, or third-party destinations with minimal configuration.
Control vs Simplicity
Use KDS when you need fine control and multiple consumers; use Firehose when you mainly need to land data into a destination.
Transformation Options
KDS relies on consumers (like Lambda) for logic. Firehose can call Lambda for light transforms and compress data before writing.
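A Firehose transformation Lambda receives a batch of base64-encoded records and must return each one with its `recordId` and a `result` of `Ok`, `Dropped`, or `ProcessingFailed`. The sketch below follows that contract; the `level` field and the rule of dropping DEBUG logs are assumptions made up for the example.

```python
import base64
import json


def transform_handler(event, context=None):
    """Light per-record transform for a Firehose delivery stream."""
    output = []
    for record in event["records"]:
        record_id = record["recordId"]
        try:
            payload = json.loads(base64.b64decode(record["data"]))
        except ValueError:
            # Undecodable input: hand the record back marked as failed.
            output.append({"recordId": record_id, "result": "ProcessingFailed",
                           "data": record["data"]})
            continue
        if payload.get("level") == "DEBUG":  # hypothetical filter rule
            output.append({"recordId": record_id, "result": "Dropped",
                           "data": record["data"]})
            continue
        payload["processed"] = True
        # Newline-delimit records so downstream tools can split them.
        line = (json.dumps(payload) + "\n").encode("utf-8")
        output.append({"recordId": record_id, "result": "Ok",
                       "data": base64.b64encode(line).decode("ascii")})
    return {"records": output}


def _encode(obj) -> str:
    """Helper for building sample input."""
    return base64.b64encode(json.dumps(obj).encode("utf-8")).decode("ascii")


sample = {"records": [
    {"recordId": "1", "data": _encode({"level": "INFO", "msg": "checkout"})},
    {"recordId": "2", "data": _encode({"level": "DEBUG", "msg": "noise"})},
]}
print([r["result"] for r in transform_handler(sample)["records"]])
# prints: ['Ok', 'Dropped']
```

Every input record must appear in the response exactly once, which is why dropped and failed records are still returned rather than skipped.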
Exam Clues
“Custom processing, replay, multiple apps” → KDS. “Easily deliver logs/events into S3/OpenSearch” → Firehose.
Streaming Architecture: Kinesis + Lambda + S3
Scenario
An e-commerce app streams click events for real-time dashboards while keeping all history for later analytics.
Step 1: Produce Events
Front-end or backend services send JSON click events to Kinesis Data Streams using the AWS SDK.
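A sketch of what a producer prepares for each event. `StreamName`, `Data`, and `PartitionKey` are the parameters of the Kinesis `PutRecord` call; the stream name, event fields, and helper function are assumptions for illustration. The actual SDK call is left as a comment so the snippet runs without AWS credentials.

```python
import json


def build_put_record_args(stream_name: str, event: dict) -> dict:
    """Assemble the arguments for a Kinesis PutRecord call."""
    return {
        "StreamName": stream_name,
        "Data": json.dumps(event, separators=(",", ":")).encode("utf-8"),
        # Using the user ID as partition key keeps one user's events on one shard.
        "PartitionKey": str(event["user_id"]),
    }


# Hypothetical stream name and click event.
args = build_put_record_args(
    "clickstream", {"user_id": 42, "page": "/cart", "action": "click"}
)
print(args["PartitionKey"], len(args["Data"]))

# With boto3 installed and credentials configured, the send would be:
#   boto3.client("kinesis").put_record(**args)
```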
Step 2: KDS Ingestion
KDS uses each record's partition key (like user ID) to assign it to a shard, so events with the same key stay in order on the same shard. Records are retained for a set period to enable replay.
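Kinesis maps a partition key to a shard by taking the MD5 hash of the key as a 128-bit integer and finding the shard whose hash key range contains it. The sketch below imitates that idea, assuming the shards split the hash space evenly, which is a simplification; real shard ranges can be uneven after splits and merges.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard that would own this partition key."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    range_size = HASH_SPACE // shard_count  # assumes evenly split ranges
    return min(hash_key // range_size, shard_count - 1)


# The same key always lands on the same shard.
for user in ("user-1", "user-2", "user-1"):
    print(user, "-> shard", shard_for_key(user, shard_count=4))
```

This is also why a single very busy key can create a hot shard: adding shards does not spread one key's traffic.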
Step 3: Lambda Consumers
Lambda consumes records from KDS, enriches them, and updates a low-latency store such as DynamoDB or ElastiCache.
Step 4–5: Archive + Analyze
Lambda (or another consumer) writes events to S3. Glue crawlers catalog the data so Athena or Redshift can query it later.
Secure, High-Performance Ingestion Endpoints
Security Pillar Focus
Protect ingestion with TLS, encryption at rest, IAM-based access control, and VPC endpoints to keep traffic inside AWS.
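One common way to enforce TLS on an S3 landing bucket is a bucket policy that denies any request where `aws:SecureTransport` is false. The sketch below only builds and prints the policy document; the bucket name is hypothetical, and attaching the policy is outside the snippet.

```python
import json


def tls_only_bucket_policy(bucket: str) -> dict:
    """Bucket policy that rejects requests made over plain HTTP."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


# Hypothetical bucket name for illustration.
policy = tls_only_bucket_policy("example-landing-bucket")
print(json.dumps(policy, indent=2))
```

An explicit Deny overrides any Allow, so this statement holds even if another policy grants broader access.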
Performance Efficiency
Use multipart uploads for large S3 objects, scale Kinesis shards to match write throughput, and keep ingestion in the same Region as producers to reduce latency.
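Shard scaling comes down to simple arithmetic in provisioned mode, where each shard accepts up to 1 MB per second or 1,000 records per second of writes. The sketch below sizes a stream for writes only and assumes evenly distributed partition keys; read throughput and hot keys would need separate consideration.

```python
import math

WRITE_BYTES_PER_SHARD = 1024 * 1024  # 1 MB/s write limit per shard
WRITE_RECORDS_PER_SHARD = 1000       # 1,000 records/s write limit per shard


def shards_needed(records_per_second: float, avg_record_bytes: float) -> int:
    """Estimate the shard count required for a given write load."""
    by_count = math.ceil(records_per_second / WRITE_RECORDS_PER_SHARD)
    by_bytes = math.ceil(
        records_per_second * avg_record_bytes / WRITE_BYTES_PER_SHARD
    )
    # Whichever limit is hit first determines the shard count.
    return max(1, by_count, by_bytes)


print(shards_needed(5_000, 500))   # prints: 5 (record-count bound)
print(shards_needed(100, 50_000))  # prints: 5 (byte-throughput bound)
```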
Network Design
Leverage Gateway VPC endpoints for S3 and Interface endpoints for Kinesis so private subnets can ingest without public IPs.
Global Producers
Front S3 or API Gateway with CloudFront so users around the world connect to a nearby edge location, which terminates TLS close to them and reduces latency.
Common Exam Mistake
Avoid public S3 buckets as ingestion endpoints. Use pre-signed URLs or an authenticated API layer instead.
Design Exercise: Pick the Ingestion Pattern
Work through these thought exercises. For each scenario, decide if you would use batch (S3-based) or streaming (Kinesis-based) ingestion, and which transformation tool is the best fit.
- Mobile game telemetry
  - Millions of small events per minute: scores, moves, errors.
  - Need a near real-time dashboard for game balancing and cheat detection.
  - Also need to store all events for weekly machine learning training.
  - Question: Would you choose Kinesis Data Streams or Firehose (or both)? Where do you store history? Would Lambda or Glue do the heavy lifting?
- Monthly billing reports from partners
  - Once per month, each partner sends one large CSV file (5–10 GB) containing all their transactions.
  - You generate monthly invoices and ad hoc financial reports.
  - Question: Is streaming necessary here? What S3 pattern and ETL tool would you use? How would you organize S3 prefixes?
- IoT temperature sensors in factories
  - Thousands of sensors sending readings every few seconds.
  - You must trigger alerts within 5 seconds if temperature exceeds a threshold, and keep full history for compliance.
  - Question: What streaming service is appropriate? How do you implement low-latency alerts and long-term storage?
Pause and sketch simple architectures for each: list producers → ingestion service → transformation → storage → analytics. Then compare your answers with the patterns from previous steps.
Quiz 1: Batch vs Streaming and Kinesis Choices
Test your understanding of ingestion patterns and Kinesis services.
A company needs to ingest application logs from thousands of EC2 instances and deliver them to Amazon S3 and Amazon OpenSearch Service with minimal operational overhead. They do not need to replay data once it has been delivered. Which service is the BEST fit for the ingestion layer?
- A) Amazon Kinesis Data Streams with custom consumer applications
- B) Amazon Kinesis Data Firehose
- C) AWS Lambda writing directly to Amazon S3 and OpenSearch
- D) Amazon SQS with EC2-based workers
Answer: B) Amazon Kinesis Data Firehose
Kinesis Data Firehose is designed to ingest streaming data and deliver it to destinations like S3 and OpenSearch with minimal management. It scales automatically, buffers, optionally transforms, and delivers data. Kinesis Data Streams requires you to build and operate consumers and, in provisioned mode, to manage shard capacity. Lambda alone is not a streaming ingestion service. SQS is a message queue and is not optimized for continuous log delivery to multiple analytics destinations.
Quiz 2: Transformation and Security
Check your understanding of transformation tools and secure ingestion.
You ingest large CSV files (hundreds of GB per day) into Amazon S3 and must convert them to partitioned Parquet for Athena queries. The job can run hourly and may take more than 15 minutes. Which combination is MOST appropriate?
- A) Use AWS Lambda triggered by S3 events to read the CSV files and write Parquet back to S3.
- B) Use AWS Glue Jobs scheduled hourly to read from S3 and write partitioned Parquet to another S3 prefix.
- C) Use Amazon Kinesis Data Streams to read the CSV files and write Parquet to S3.
- D) Use Amazon SQS with EC2 workers to convert CSV to Parquet.
Answer: B) Use AWS Glue Jobs scheduled hourly to read from S3 and write partitioned Parquet to another S3 prefix.
AWS Glue Jobs are designed for large-scale, long-running ETL over data in S3 and integrate well with the Glue Data Catalog for Athena. Lambda has a 15-minute max runtime and is not ideal for heavy ETL over hundreds of GB. Kinesis Data Streams is for streaming records, not bulk file conversion. SQS + EC2 could work but requires more management; Glue is the serverless, exam-preferred approach.
Key Term Flashcards: Ingestion and Streaming
Flip through these flashcards to reinforce key terms and patterns.
- Batch ingestion
  - A pattern where data is collected over a period of time and ingested in chunks (e.g., hourly or daily) into a landing zone such as Amazon S3 for later processing.
- Streaming ingestion
  - A pattern where data is continuously ingested as a flow of small records, typically using services like Amazon Kinesis Data Streams or Kinesis Data Firehose for near real-time processing.
- Amazon Kinesis Data Streams
  - A managed streaming platform that uses shards to store records for a configurable retention period and supports multiple consumers and replay for custom real-time processing.
- Amazon Kinesis Data Firehose
  - A fully managed service that reliably ingests and automatically delivers streaming data to destinations such as Amazon S3 and Amazon OpenSearch Service, with optional lightweight Lambda-based transformations.
- AWS Glue
  - A serverless ETL service used for large-scale batch transformations, schema management via the Glue Data Catalog, and converting raw data into analytics-ready formats like Parquet.
- AWS Lambda in ingestion pipelines
  - A serverless compute service used for short-lived, event-driven tasks such as validating records, enriching streaming events, or routing data in response to S3 or Kinesis triggers.
- S3 data lake landing zone
  - An Amazon S3 bucket or set of prefixes where raw data from multiple producers is initially stored, often partitioned by time or other keys, before transformation and analytics.
- S3 Event Notifications
  - A feature that triggers actions (such as invoking Lambda, sending to SQS, or publishing to EventBridge) when events like object creation occur in an S3 bucket.
- VPC endpoint for S3
  - A Gateway VPC endpoint that allows resources in a VPC to privately access Amazon S3 without using public IP addresses or traversing the public internet.
- Replay capability
  - The ability of a streaming system such as Kinesis Data Streams to retain and re-read historical records within a configured retention window for reprocessing or debugging.
Key Terms
- Replay
  - The capability to read historical records from a stream again within its retention window, useful for reprocessing or recovery.
- AWS Glue
  - A serverless ETL service used for large-scale batch transformations, schema management via the Glue Data Catalog, and converting data into analytics-ready formats.
- AWS Lambda
  - A serverless compute service that runs short-lived functions in response to events from services like S3 and Kinesis, often used for lightweight transformations.
- S3 data lake
  - An architecture where Amazon S3 acts as the central, durable, and cost-effective storage layer for raw and processed analytical data.
- VPC endpoint
  - A private connection that allows resources in a VPC to access supported AWS services (such as S3 or Kinesis) without using public IPs or the public internet.
- Batch ingestion
  - A pattern where data is collected over a period of time and ingested in bulk into a landing zone such as Amazon S3 for later processing.
- Streaming ingestion
  - A pattern where data is continuously ingested as a flow of small records using services such as Amazon Kinesis Data Streams or Kinesis Data Firehose.
- S3 Event Notifications
  - An S3 feature that triggers targets such as Lambda, SQS, or EventBridge when events such as object creation or removal occur in a bucket.
- Amazon Kinesis Data Streams
  - A managed streaming platform that uses shards to store records for a configurable retention period and supports multiple consumers and replay.
- Amazon Kinesis Data Firehose
  - A fully managed delivery service that ingests streaming data and automatically delivers it to destinations like Amazon S3 and Amazon OpenSearch Service.