Chapter 15 of 20
Analytics, AI, and Machine Learning Services on AWS
Go beyond core infrastructure to see how AWS helps organizations analyze data and build intelligent applications with AI and machine learning.
Big Picture: Where Analytics and ML Fit in AWS
From Infrastructure to Intelligence
You previously learned how to run infrastructure on AWS. Now we move up a layer: how organizations use AWS to analyze data and build intelligent applications.
Three Conceptual Layers
Think of AWS as: 1) Foundation: EC2, S3, RDS, VPC, IAM, IaC. 2) Data & Analytics: services to store, move, transform, and analyze data. 3) AI & ML: services that learn patterns and make predictions.
Exam-Relevant Skills
For CLF-C02, you must recognize analytics categories, map basic ML concepts to AWS services, and connect them to storage and databases you already know.
Simple Data-to-Intelligence Flow
Mental model: Store data (S3, DBs) → Collect/move (Kinesis, Glue) → Analyze (Athena, Redshift, EMR, QuickSight) → Learn from data (SageMaker, AI services like Bedrock, Comprehend, Rekognition).
Core Categories of AWS Analytics Services
Why Categories Matter
CLF-C02 focuses on recognizing which type of analytics service fits a scenario, not deep configs. Learn categories: data lake, warehouse, ETL, streaming, BI.
Data Lakes and Query-on-S3
S3 often stores the data lake. Amazon Athena lets you run SQL directly on S3 data without servers, paying per query and data scanned.
Data Warehousing with Redshift
Amazon Redshift is a fully managed data warehouse for large-scale analytical queries on structured, curated data (e.g., finance, sales).
Big Data Processing and ETL
AWS Glue handles serverless ETL and a central Data Catalog. Amazon EMR runs big data frameworks like Spark and Hadoop on managed clusters.
Streaming and BI
Amazon Kinesis processes real-time streams like logs or IoT. Amazon QuickSight provides serverless BI dashboards and visualizations for decision-makers.
Example Walkthrough: Building a Simple Analytics Stack
Scenario: E-commerce Analytics
An online store wants to understand customer behavior and site performance using AWS. Let’s follow their data from source to insights.
Step 1: Data Sources and Storage
Web servers on EC2 write logs to S3 and streams to Kinesis. Orders live in RDS or DynamoDB. S3 becomes the central data lake.
Step 2: Integration and Preparation
AWS Glue crawlers infer schemas for S3 data. Glue ETL jobs clean and transform raw logs into structured tables in efficient formats.
Step 3: Analytics and Dashboards
Analysts use Athena for ad hoc SQL on S3 and Redshift for heavy analytical queries. QuickSight builds dashboards from both sources.
Step 4: Ready for Machine Learning
The same curated data in S3 and Redshift can power SageMaker models for recommendations, churn prediction, or anomaly detection.
Machine Learning vs Traditional Analytics
Traditional Analytics vs ML
Traditional analytics runs explicit rules or SQL you write. Machine learning learns patterns from data instead of you coding all the rules.
How ML Learns
You provide input features and labels; an algorithm finds patterns that map inputs to outputs, then uses them to predict for new cases.
Key ML Concepts for CLF-C02
Supervised learning uses labeled examples, unsupervised finds patterns in unlabeled data, and inference is using a trained model to predict.
Two AWS Approaches
AI services are pre-built APIs for tasks like vision or language. Amazon SageMaker is a full ML platform to build, train, and deploy custom models.
Common Exam Trap
No ML expertise and fast time-to-value → AI services. Data scientists, custom models, or special algorithms → SageMaker.
Key AWS AI and ML Services (2026 Landscape)
SageMaker: The ML Platform
Amazon SageMaker is the end-to-end platform for building, training, and deploying custom ML models, mainly used by data scientists.
Generative AI: Bedrock and Amazon Q
Amazon Bedrock exposes foundation models for text, images, and more via API. Amazon Q is a generative AI assistant for AWS and business data.
Language and Text AI Services
Comprehend handles NLP like sentiment and entities. Translate converts languages, Transcribe turns speech to text, and Polly turns text to speech.
Vision and Document AI
Rekognition analyzes images and videos. Textract extracts text, forms, and tables from scanned documents and PDFs.
Recommendations and Forecasting
Amazon Personalize builds recommendation systems; Amazon Forecast provides time-series demand or usage forecasts from historical data.
Thought Exercise: Match Scenarios to Services
How to Approach Scenarios
For each scenario, first decide: batch vs streaming, structured vs unstructured data, ad hoc queries vs dashboards, pre-built AI vs custom ML.
Scenario 1: Real-Time Clickstream
Near real-time traffic spikes → streaming analytics → think Amazon Kinesis Data Streams plus processing (Kinesis Data Analytics or Lambda).
Scenario 2: Logs on S3
CloudTrail logs already in S3 and ad hoc queries → serverless query-on-S3 → Amazon Athena is a strong fit.
Scenario 3: Recommendations
Retailer wants "customers who bought X also bought Y" with no ML team → AI service for personalization → Amazon Personalize.
Scenario 4: Document Summaries
Summarizing and querying long PDFs → generative AI on documents → Amazon Bedrock models, possibly with Textract or Amazon Q.
Connecting Analytics and ML to Core AWS Services
Storage Feeds Analytics and ML
S3 is the usual data lake backing Athena, EMR, Glue, and Redshift Spectrum. RDS, Aurora, and DynamoDB act as sources or sinks for pipelines.
Compute and Orchestration
Lambda and Step Functions orchestrate data and ML workflows. EC2 or containers can run custom analytics or ML services when needed.
Networking and Security Context
Analytics and ML services integrate with your VPC. Remember the AWS shared responsibility model still applies to data and access.
Identity and Access
IAM roles let services like Glue, Redshift, and SageMaker securely access S3, Kinesis, and databases without embedding credentials.
Key Exam Idea
Analytics and ML sit on top of your existing AWS foundation. They reuse Regions, AZs, VPCs, IAM, and encryption practices you already learned.
Quick Check: Analytics Service Categories
Test your ability to match scenarios to the right high-level analytics service.
A company stores terabytes of application logs in Amazon S3 and wants analysts to run occasional SQL queries on this data without managing any servers. Which AWS service is the BEST fit?
- Amazon Redshift
- Amazon Athena
- Amazon EMR
- AWS Glue
Show Answer
Answer: B) Amazon Athena
Amazon Athena is designed for serverless, ad hoc SQL queries directly against data stored in S3. Redshift is a data warehouse that requires provisioning a cluster. EMR runs big data frameworks on managed clusters. Glue is mainly for ETL and cataloging, not direct interactive querying by analysts.
Quick Check: AI vs ML on AWS
Test your understanding of when to use pre-built AI services versus SageMaker.
An insurance company wants to automatically extract fields like policy number, customer name, and total amount from scanned claim forms. They have no ML team and want the fastest path to value. Which service is the BEST fit?
- Amazon SageMaker
- Amazon Textract
- Amazon Comprehend
- Amazon Bedrock
Show Answer
Answer: B) Amazon Textract
Amazon Textract is a pre-built AI service for extracting text, forms, and tables from scanned documents. SageMaker is for building custom models. Comprehend is for natural language understanding on text, not document layout extraction. Bedrock is for foundation models and generative AI, not specifically for structured form extraction.
Key Terms Review
Use these flashcards to reinforce the main services and concepts.
- Amazon Athena
- Serverless interactive query service that lets you use standard SQL to analyze data directly in Amazon S3, paying per query and data scanned.
- Amazon Redshift
- Fully managed, petabyte-scale data warehouse service optimized for complex analytical queries over structured and semi-structured data.
- AWS Glue
- Serverless data integration service used for ETL (extract, transform, load) and maintaining a central Data Catalog of datasets.
- Amazon Kinesis
- Family of services (Data Streams, Data Firehose, Data Analytics) for collecting, processing, and analyzing real-time streaming data.
- Amazon QuickSight
- Serverless business intelligence (BI) service for interactive dashboards, visualizations, and basic ML-powered insights.
- Amazon SageMaker
- End-to-end machine learning platform on AWS to build, train, and deploy custom ML models at scale.
- Amazon Bedrock
- Fully managed service that provides access to foundation models from AWS and partners via API for generative AI tasks like text and image generation.
- Amazon Comprehend
- Natural language processing (NLP) service for sentiment analysis, key phrase extraction, entity recognition, and topic modeling.
- Amazon Rekognition
- AI service for image and video analysis, including object detection, face detection, and unsafe content detection.
- Amazon Personalize
- AI service for building real-time personalized recommendations without requiring ML expertise.
Typical Business Use Cases and Exam Patterns
Use Case: Log Analytics
Logs in S3 → Glue for catalog → Athena for SQL → QuickSight for dashboards. Real-time alerts add Kinesis or CloudWatch Logs plus Lambda.
Use Case: Customer 360
CRM + web + transactions into S3 data lake, Glue ETL, Redshift for curated analytics, QuickSight dashboards, then Personalize or Forecast on top.
Use Case: Document Processing
Textract extracts fields from invoices or forms; Comprehend can enrich text; results go to S3 or databases for reporting and search.
Use Case: Generative AI Assistants
Amazon Bedrock and Amazon Q can power chatbots, code helpers, and document summarization across your internal knowledge base.
Exam Pattern Reminders
Real time → Kinesis; SQL on S3 → Athena; big curated BI → Redshift + QuickSight; no ML skills → AI services; data scientists → SageMaker.
Key Terms
- ETL
- Extract, Transform, Load: a process to move data from source systems, clean and reshape it, and load it into a target store for analytics.
- Data lake
- A centralized repository that allows you to store all your structured and unstructured data at any scale, typically on Amazon S3, and run different types of analytics on it.
- Inference
- The process of using a trained machine learning model to make predictions or generate outputs on new, unseen data.
- AI service
- A pre-built, managed AWS service that exposes a specific artificial intelligence capability via API, such as translation, image recognition, or text analysis.
- Data warehouse
- A centralized store of structured and semi-structured data optimized for fast analytical queries and reporting, such as Amazon Redshift.
- Streaming data
- Data that is generated continuously from sources like application logs, clickstreams, or IoT sensors and processed in near real time.
- Foundation model
- A large, pre-trained machine learning model (often used in generative AI) that can be adapted to many downstream tasks such as text or image generation.
- Machine learning
- A field of artificial intelligence where systems learn patterns from data to make predictions or decisions without being explicitly programmed with all rules.
- Serverless analytics
- Analytics services where you do not manage servers or clusters; the provider automatically provisions and scales resources, and you pay per use (e.g., Athena, Glue).
- Business intelligence (BI)
- Tools and processes that transform raw data into meaningful visualizations and dashboards to support business decision-making.