Chapter 15 of 20

Analytics, AI, and Machine Learning Services on AWS

Go beyond core infrastructure to see how AWS helps organizations analyze data and build intelligent applications with AI and machine learning.

Big Picture: Where Analytics and ML Fit in AWS

From Infrastructure to Intelligence

You previously learned how to run infrastructure on AWS. Now we move up a layer: how organizations use AWS to analyze data and build intelligent applications.

Three Conceptual Layers

Think of AWS as three layers:

  1. Foundation: EC2, S3, RDS, VPC, IAM, IaC.
  2. Data & Analytics: services to store, move, transform, and analyze data.
  3. AI & ML: services that learn patterns and make predictions.

Exam-Relevant Skills

For CLF-C02, you must recognize analytics categories, map basic ML concepts to AWS services, and connect them to storage and databases you already know.

Simple Data-to-Intelligence Flow

Mental model: Store data (S3, DBs) → Collect/move (Kinesis, Glue) → Analyze (Athena, Redshift, EMR, QuickSight) → Learn from data (SageMaker, AI services like Bedrock, Comprehend, Rekognition).

Core Categories of AWS Analytics Services

Why Categories Matter

CLF-C02 focuses on recognizing which type of analytics service fits a scenario, not deep configs. Learn categories: data lake, warehouse, ETL, streaming, BI.

Data Lakes and Query-on-S3

S3 often stores the data lake. Amazon Athena lets you run SQL directly on S3 data without managing servers; you pay per query, based on the amount of data scanned.
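
To make the pay-per-scan idea concrete, here is a rough cost sketch in Python. The price of 5 USD per TB scanned is an assumption for illustration only (check current pricing for your Region), and `estimate_athena_cost` is a hypothetical helper, not an AWS API.

```python
# Rough Athena cost model: billing follows bytes scanned, not query runtime.
# ASSUMPTIONS: 5.0 USD per TB scanned and 1 TB = 1024**4 bytes.
# Verify both against current AWS pricing before relying on the numbers.
TB = 1024 ** 4


def estimate_athena_cost(bytes_scanned: int, usd_per_tb: float = 5.0) -> float:
    """Return the estimated cost in USD for a query scanning this many bytes."""
    return bytes_scanned / TB * usd_per_tb


# Compare scanning 2 TB of raw logs with scanning roughly 10% of that,
# as might happen after converting to a columnar format and selecting fewer columns.
full_scan = estimate_athena_cost(2 * TB)
pruned_scan = estimate_athena_cost(2 * TB // 10)
print(f"full: ${full_scan:.2f}, pruned: ${pruned_scan:.2f}")
```

The takeaway for exam scenarios: scanning less data (through partitioning, compression, or columnar formats) lowers the cost of each query.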

Data Warehousing with Redshift

Amazon Redshift is a fully managed data warehouse for large-scale analytical queries on structured, curated data (e.g., finance, sales).

Big Data Processing and ETL

AWS Glue handles serverless ETL and a central Data Catalog. Amazon EMR runs big data frameworks like Spark and Hadoop on managed clusters.

Streaming and BI

Amazon Kinesis processes real-time streams like logs or IoT. Amazon QuickSight provides serverless BI dashboards and visualizations for decision-makers.

Example Walkthrough: Building a Simple Analytics Stack

Scenario: E-commerce Analytics

An online store wants to understand customer behavior and site performance using AWS. Let’s follow their data from source to insights.

Step 1: Data Sources and Storage

Web servers on EC2 write logs to S3 and stream events to Kinesis. Orders live in RDS or DynamoDB. S3 becomes the central data lake.

Step 2: Integration and Preparation

AWS Glue crawlers infer schemas for S3 data. Glue ETL jobs clean and transform raw logs into structured tables in efficient formats.
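
The "clean and transform" step can be pictured as turning raw text lines into typed records. The sketch below is plain Python, not the Glue API, and the space-delimited log format is a hypothetical one chosen for illustration.

```python
from datetime import datetime


def parse_log_line(line: str) -> dict:
    """Turn one space-delimited log line into a structured, typed record."""
    timestamp, method, path, status, latency_ms = line.split()
    return {
        "timestamp": datetime.fromisoformat(timestamp),
        "method": method,
        "path": path,
        "status": int(status),
        "latency_ms": float(latency_ms),
    }


def clean(lines):
    """Yield structured records, skipping blank or malformed lines."""
    for line in lines:
        try:
            yield parse_log_line(line)
        except ValueError:
            continue  # wrong number of fields or an unparsable value


# Hypothetical raw log lines, including one that cannot be parsed.
raw = [
    "2025-01-15T10:00:00 GET /cart 200 35.2",
    "garbage line",
    "2025-01-15T10:00:01 POST /checkout 500 120.0",
]
records = list(clean(raw))
print(len(records), "clean records")
```

A real Glue job would do the same kind of work at scale and typically write the result to S3 in a columnar format.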

Step 3: Analytics and Dashboards

Analysts use Athena for ad hoc SQL on S3 and Redshift for heavy analytical queries. QuickSight builds dashboards from both sources.

Step 4: Ready for Machine Learning

The same curated data in S3 and Redshift can power SageMaker models for recommendations, churn prediction, or anomaly detection.

Machine Learning vs Traditional Analytics

Traditional Analytics vs ML

Traditional analytics runs explicit rules or SQL you write. Machine learning learns patterns from data instead of you coding all the rules.

How ML Learns

In supervised learning, you provide input features and labels; an algorithm finds patterns that map inputs to outputs, then uses them to predict outcomes for new cases.

Key ML Concepts for CLF-C02

Supervised learning uses labeled examples, unsupervised finds patterns in unlabeled data, and inference is using a trained model to predict.
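
A toy example may help separate training from inference. This is a deliberately simple sketch in plain Python, not how SageMaker or any AWS service works internally. The data is made up, and the "model" is just a cut-off learned from labeled examples, assuming churned customers tend to have higher inactivity values.

```python
def train_threshold(features, labels):
    """Training: learn a cut-off as the midpoint between the two class means."""
    positives = [x for x, y in zip(features, labels) if y]
    negatives = [x for x, y in zip(features, labels) if not y]
    mean_pos = sum(positives) / len(positives)
    mean_neg = sum(negatives) / len(negatives)
    return (mean_pos + mean_neg) / 2


def predict(threshold, x):
    """Inference: apply the learned cut-off to a new, unseen value."""
    return x > threshold


# Labeled examples (supervised learning): days inactive -> did the customer churn?
days_inactive = [3, 7, 10, 45, 60, 90]
churned = [False, False, False, True, True, True]

threshold = train_threshold(days_inactive, churned)
print("learned threshold:", round(threshold, 1))
print("50 days inactive -> churn?", predict(threshold, 50))
print("5 days inactive -> churn?", predict(threshold, 5))
```

Nobody wrote the rule "more than N days means churn"; the cut-off came from the data, which is the core difference from traditional analytics.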

Two AWS Approaches

AI services are pre-built APIs for tasks like vision or language. Amazon SageMaker is a full ML platform to build, train, and deploy custom models.

Common Exam Trap

No ML expertise and fast time-to-value → AI services. Data scientists, custom models, or special algorithms → SageMaker.

Key AWS AI and ML Services (2026 Landscape)

SageMaker: The ML Platform

Amazon SageMaker is the end-to-end platform for building, training, and deploying custom ML models, mainly used by data scientists.

Generative AI: Bedrock and Amazon Q

Amazon Bedrock exposes foundation models for text, images, and more via API. Amazon Q is a generative AI assistant for AWS and business data.

Language and Text AI Services

Comprehend handles NLP like sentiment and entities. Translate converts languages, Transcribe turns speech to text, and Polly turns text to speech.

Vision and Document AI

Rekognition analyzes images and videos. Textract extracts text, forms, and tables from scanned documents and PDFs.
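
As a rough illustration of how little setup a pre-built AI service needs, the sketch below builds the request parameters for Textract's AnalyzeDocument operation. The bucket and object names are hypothetical, `build_analyze_document_args` is an illustrative helper, and nothing is sent to AWS here.

```python
def build_analyze_document_args(bucket: str, key: str) -> dict:
    """Build keyword arguments shaped for Textract's AnalyzeDocument operation."""
    return {
        # Point Textract at a document already stored in S3.
        "Document": {"S3Object": {"Bucket": bucket, "Name": key}},
        # Ask for form key-value pairs and tables, not just raw text.
        "FeatureTypes": ["FORMS", "TABLES"],
    }


# Hypothetical bucket and key, for illustration only.
args = build_analyze_document_args("claims-demo-bucket", "scans/claim-001.png")
print(args)
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("textract").analyze_document(**args)
```

There is no model to train or host; the request names the document and the kinds of structure to extract.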

Recommendations and Forecasting

Amazon Personalize builds recommendation systems; Amazon Forecast provides time-series demand or usage forecasts from historical data.

Thought Exercise: Match Scenarios to Services

How to Approach Scenarios

For each scenario, first decide: batch vs streaming, structured vs unstructured data, ad hoc queries vs dashboards, pre-built AI vs custom ML.

Scenario 1: Real-Time Clickstream

Near real-time traffic spikes → streaming analytics → think Amazon Kinesis Data Streams plus processing (Amazon Managed Service for Apache Flink, formerly Kinesis Data Analytics, or Lambda).
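
On the producer side, a clickstream event is just a small record pushed into a stream. The sketch below builds the parameters for the Kinesis PutRecord operation. The stream name and event fields are hypothetical, and the code does not call AWS.

```python
import json


def build_put_record_args(stream_name: str, event: dict) -> dict:
    """Build keyword arguments shaped for the Kinesis PutRecord operation."""
    return {
        "StreamName": stream_name,
        # Kinesis carries opaque bytes, so the event is serialized to JSON first.
        "Data": json.dumps(event).encode("utf-8"),
        # Records sharing a partition key are routed to the same shard.
        "PartitionKey": event["session_id"],
    }


# Hypothetical stream name and event, for illustration only.
args = build_put_record_args(
    "clickstream-demo", {"session_id": "s-123", "page": "/cart"}
)
print(args["PartitionKey"], len(args["Data"]), "bytes")
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("kinesis").put_record(**args)
```

Using the session ID as the partition key is one possible design choice: it keeps a visitor's events together on one shard.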

Scenario 2: Logs on S3

CloudTrail logs already in S3 and ad hoc queries → serverless query-on-S3 → Amazon Athena is a strong fit.

Scenario 3: Recommendations

Retailer wants "customers who bought X also bought Y" with no ML team → AI service for personalization → Amazon Personalize.

Scenario 4: Document Summaries

Summarizing and querying long PDFs → generative AI on documents → Amazon Bedrock models, possibly with Textract or Amazon Q.

Connecting Analytics and ML to Core AWS Services

Storage Feeds Analytics and ML

S3 is the usual data lake backing Athena, EMR, Glue, and Redshift Spectrum. RDS, Aurora, and DynamoDB act as sources or sinks for pipelines.

Compute and Orchestration

Lambda and Step Functions orchestrate data and ML workflows. EC2 or containers can run custom analytics or ML services when needed.

Networking and Security Context

Many analytics and ML services can run inside, or connect privately to, your VPC. The AWS shared responsibility model still applies: you remain responsible for your data and for who can access it.

Identity and Access

IAM roles let services like Glue, Redshift, and SageMaker securely access S3, Kinesis, and databases without embedding credentials.
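
What makes this work is the role's trust policy, which names the service allowed to assume the role. Below is a minimal sketch of such a policy document built in Python. `service_trust_policy` is a hypothetical helper, and a real role would also need a permissions policy granting access to specific resources.

```python
import json


def service_trust_policy(service_principal: str) -> dict:
    """Return a trust policy that lets one AWS service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": service_principal},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Trust policy for a role that AWS Glue can assume.
glue_trust = service_trust_policy("glue.amazonaws.com")
print(json.dumps(glue_trust, indent=2))
```

The service receives temporary credentials when it assumes the role, so no access keys need to be stored in job scripts or configuration.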

Key Exam Idea

Analytics and ML sit on top of your existing AWS foundation. They reuse Regions, AZs, VPCs, IAM, and encryption practices you already learned.

Quick Check: Analytics Service Categories

Test your ability to match scenarios to the right high-level analytics service.

A company stores terabytes of application logs in Amazon S3 and wants analysts to run occasional SQL queries on this data without managing any servers. Which AWS service is the BEST fit?

  1. Amazon Redshift
  2. Amazon Athena
  3. Amazon EMR
  4. AWS Glue

Answer: 2) Amazon Athena

Amazon Athena is designed for serverless, ad hoc SQL queries directly against data stored in S3. Redshift is a data warehouse built for heavy, recurring analytics on curated data, so it is more than occasional queries need, and its provisioned option requires managing a cluster. EMR runs big data frameworks on managed clusters. Glue is mainly for ETL and cataloging, not direct interactive querying by analysts.

Quick Check: AI vs ML on AWS

Test your understanding of when to use pre-built AI services versus SageMaker.

An insurance company wants to automatically extract fields like policy number, customer name, and total amount from scanned claim forms. They have no ML team and want the fastest path to value. Which service is the BEST fit?

  1. Amazon SageMaker
  2. Amazon Textract
  3. Amazon Comprehend
  4. Amazon Bedrock

Answer: 2) Amazon Textract

Amazon Textract is a pre-built AI service for extracting text, forms, and tables from scanned documents. SageMaker is for building custom models. Comprehend is for natural language understanding on text, not document layout extraction. Bedrock is for foundation models and generative AI, not specifically for structured form extraction.

Key Terms Review

Use these flashcards to reinforce the main services and concepts.

Amazon Athena
Serverless interactive query service that lets you use standard SQL to analyze data directly in Amazon S3; you pay per query, based on the amount of data scanned.
Amazon Redshift
Fully managed, petabyte-scale data warehouse service optimized for complex analytical queries over structured and semi-structured data.
AWS Glue
Serverless data integration service used for ETL (extract, transform, load) and maintaining a central Data Catalog of datasets.
Amazon Kinesis
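Family of streaming services for collecting, processing, and analyzing real-time data: Kinesis Data Streams and Kinesis Video Streams, alongside the related Amazon Data Firehose (formerly Kinesis Data Firehose) and Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics).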
Amazon QuickSight
Serverless business intelligence (BI) service for interactive dashboards, visualizations, and basic ML-powered insights.
Amazon SageMaker
End-to-end machine learning platform on AWS to build, train, and deploy custom ML models at scale.
Amazon Bedrock
Fully managed service that provides access to foundation models from AWS and partners via API for generative AI tasks like text and image generation.
Amazon Comprehend
Natural language processing (NLP) service for sentiment analysis, key phrase extraction, entity recognition, and topic modeling.
Amazon Rekognition
AI service for image and video analysis, including object detection, face detection, and unsafe content detection.
Amazon Personalize
AI service for building real-time personalized recommendations without requiring ML expertise.

Typical Business Use Cases and Exam Patterns

Use Case: Log Analytics

Logs in S3 → Glue for catalog → Athena for SQL → QuickSight for dashboards. Real-time alerts add Kinesis or CloudWatch Logs plus Lambda.

Use Case: Customer 360

CRM + web + transactions into S3 data lake, Glue ETL, Redshift for curated analytics, QuickSight dashboards, then Personalize or Forecast on top.

Use Case: Document Processing

Textract extracts fields from invoices or forms; Comprehend can enrich text; results go to S3 or databases for reporting and search.

Use Case: Generative AI Assistants

Amazon Bedrock and Amazon Q can power chatbots, code helpers, and document summarization across your internal knowledge base.

Exam Pattern Reminders

Real time → Kinesis; SQL on S3 → Athena; big curated BI → Redshift + QuickSight; no ML skills → AI services; data scientists → SageMaker.
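
The reminders above amount to a small lookup table, which can be sketched in Python as a study aid. The keys are informal labels invented for this example, not official exam terminology.

```python
# Informal scenario labels (hypothetical) mapped to the service to think of first.
EXAM_PATTERNS = {
    "real time": "Amazon Kinesis",
    "sql on s3": "Amazon Athena",
    "big curated bi": "Amazon Redshift + Amazon QuickSight",
    "no ml skills": "Pre-built AI services",
    "data scientists": "Amazon SageMaker",
}


def suggest_service(pattern: str) -> str:
    """Return the first service to consider for an informal scenario label."""
    return EXAM_PATTERNS.get(
        pattern.strip().lower(), "No direct match; re-read the scenario"
    )


print(suggest_service("SQL on S3"))
print(suggest_service("Real time"))
```

Real exam questions rarely use these exact words, so treat the table as a memory aid and not as a rule engine.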

Key Terms

ETL
Extract, Transform, Load: a process to move data from source systems, clean and reshape it, and load it into a target store for analytics.
Data lake
A centralized repository that allows you to store all your structured and unstructured data at any scale, typically on Amazon S3, and run different types of analytics on it.
Inference
The process of using a trained machine learning model to make predictions or generate outputs on new, unseen data.
AI service
A pre-built, managed AWS service that exposes a specific artificial intelligence capability via API, such as translation, image recognition, or text analysis.
Data warehouse
A centralized store of structured and semi-structured data optimized for fast analytical queries and reporting, such as Amazon Redshift.
Streaming data
Data that is generated continuously from sources like application logs, clickstreams, or IoT sensors and processed in near real time.
Foundation model
A large, pre-trained machine learning model (often used in generative AI) that can be adapted to many downstream tasks such as text or image generation.
Machine learning
A field of artificial intelligence where systems learn patterns from data to make predictions or decisions without being explicitly programmed with all rules.
Serverless analytics
Analytics services where you do not manage servers or clusters; the provider automatically provisions and scales resources, and you pay per use (e.g., Athena, Glue).
Business intelligence (BI)
Tools and processes that transform raw data into meaningful visualizations and dashboards to support business decision-making.
