Chapter 5 of 9
Probability and Uncertainty: Thinking in Chances
Move beyond gut feeling by learning to think in terms of chances, events, and likelihoods, building the intuition that underpins modern predictive models.
From Gut Feeling to Chances
Living With Uncertainty
We constantly face uncertainty: Will it rain? Will the bus be late? We usually answer with vague words like "probably" or "maybe".
Why Probability?
Probability turns vague words into clearer, numeric chances. It helps us think more consistently about uncertain events.
What You Will Learn
You will learn to express probabilities as numbers between 0 and 1, to tell outcomes from events and independent events from dependent ones, and to see how randomness and sampling connect to data.
Link to Earlier Modules
You already know data types and descriptive statistics. Now we add a new layer: thinking about what might happen next, and how confident we are.
Basic Probability: Outcomes, Events, and the 0–1 Scale
Outcomes
An outcome is a single result of a process. Coin toss: Heads or Tails. Die roll: 1, 2, 3, 4, 5, or 6.
Events
An event is a set of outcomes we care about. Example: "even number" when rolling a die is {2, 4, 6}.
Probability as a Number
Probability measures how likely an event is. It is always between 0 (impossible) and 1 (certain).
Percentages
We often use percentages: 0 → 0%, 0.5 → 50%, 1 → 100%. A fair coin has 50% chance of Heads and 50% of Tails.
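The counting behind these numbers can be sketched in a few lines of Python. This is a minimal illustration that assumes every outcome is equally likely (a fair die); the helper name `probability` is made up for this example.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die
outcomes = {1, 2, 3, 4, 5, 6}

def probability(event, outcomes):
    """P(event) = favourable outcomes / all outcomes (equally likely case only)."""
    favourable = event & outcomes  # ignore anything outside the sample space
    return Fraction(len(favourable), len(outcomes))

even = {2, 4, 6}
p_even = probability(even, outcomes)

print(p_even)                       # 1/2
print(float(p_even))                # 0.5
print(f"{float(p_even):.0%}")       # 50%
print(probability({7}, outcomes))   # 0 (impossible event)
```

The same value appears three ways: as a fraction, as a number between 0 and 1, and as a percentage.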
Real-World Probabilities
In practice we rarely know exact probabilities. We estimate them from data or use them as degrees of belief, like a 30% chance of rain.
Your Intuition: Translate Words to Numbers
Try this quick thought exercise. There are no perfectly right answers; the goal is to connect your intuition to numbers.
For each statement, write down (mentally or on paper) a probability between 0 and 1, or a percentage.
- "It might rain later today."
  - What probability would you assign? (For example 0.2, 0.5, 0.8?)
- "My phone battery will last until tonight."
  - Based on your experience today, what chance would you give it?
- "A random student in my class has watched at least one full season of a streaming series this month."
  - What is your guess as a percentage?
- "If I randomly pick a day of the week, it is a weekend."
  - Now this one you can calculate: out of 7 days, 2 are weekend days.
  - So: `P(weekend) = 2/7 ≈ 0.286 ≈ 28.6%`.
Reflect:
- Which of your answers were guesses based on belief?
- Which were calculations based on counting outcomes?
This difference (belief vs counting) is important. In data science, we often start with beliefs, then update them using data.
Probability as Long-Run Frequency
Long-Run Frequency Idea
Probability can be seen as the long-run frequency of an event when we repeat a random process many times.
Coin Toss Example
Toss a fair coin many times. As the number of tosses grows, the fraction of Heads tends to get closer to 0.5.
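If you want to watch this happen, here is a small simulation sketch. It assumes a simulated fair coin (a pseudo-random number below 0.5 counts as Heads); the function name `fraction_of_heads` and the seed are arbitrary choices for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

def fraction_of_heads(num_tosses):
    """Toss a simulated fair coin num_tosses times and return the share of Heads."""
    heads = sum(rng.random() < 0.5 for _ in range(num_tosses))
    return heads / num_tosses

# Small runs can land far from 0.5; large runs usually land close to it.
for n in (10, 100, 1_000, 100_000):
    print(n, fraction_of_heads(n))
```

The exact numbers depend on the seed, but the fraction for 100,000 tosses should sit much closer to 0.5 than the fraction for 10 tosses typically does.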
From Data to Probability
Website example: 320 purchases out of 10,000 visits gives an observed frequency of 3.2%. We use this as an estimated probability.
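The arithmetic in the website example is just a count divided by a total. A minimal sketch, where `estimate_probability` is a hypothetical helper name:

```python
def estimate_probability(successes, trials):
    """Observed frequency, used as an estimate of the underlying probability."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return successes / trials

p_purchase = estimate_probability(320, 10_000)
print(p_purchase)            # 0.032
print(f"{p_purchase:.1%}")   # 3.2%
```

This is an estimate from one batch of visits, not the exact probability; another 10,000 visits would likely give a slightly different number.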
Link to Predictive Models
Modern predictive systems often estimate probabilities from large datasets, then use them to make predictions about future events.
Independent vs Dependent Events (Concept Only)
Independent Events
Events are independent if knowing one happened does not change the chance of the other. Example: two separate fair coin tosses.
Dependent Events
Events are dependent if knowing one happened changes the chance of the other. Example: drawing cards without replacement.
Everyday Dependence
Weather: morning rain and afternoon rain are often dependent. If it rains in the morning, afternoon rain becomes more likely.
Why It Matters
In data science, variables are often dependent. Some models assume independence; others handle complex relationships.
Key Idea
Remember: independent = no effect on chance; dependent = changes the chance when you know one event happened.
Classify Events: Independent or Dependent?
Decide whether each pair of events is independent or dependent. Reason it out in your own words.
- Two dice
  - Event A: "Die 1 shows a 6".
  - Event B: "Die 2 shows a 6".
  - Are A and B independent or dependent?
- Same class
  - Event A: "Student 1 in your class passes the exam".
  - Event B: "Student 2 in your class passes the exam".
  - Think about shared study conditions, teaching quality, etc.
- Drawing marbles without replacement
  - A bag has 3 red and 3 blue marbles.
  - Event A: "First marble drawn is red".
  - Event B: "Second marble drawn is red".
- Drawing marbles with replacement
  - Same bag, but after each draw you put the marble back and mix.
  - Event A: "First marble drawn is red".
  - Event B: "Second marble drawn is red".
Pause and decide for each.
Suggested answers:
- Two dice: independent (one die does not affect the other).
- Same class: dependent (if the course is easy or hard, it affects many students together).
- Without replacement: dependent (the first draw changes the bag).
- With replacement: independent (the bag resets each time).
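The two marble cases can be checked by listing every equally likely pair of draws and counting. The sketch below is one way to do that; the function name `second_red_probs` is an assumption made for this example, and the marbles are indexed so each one counts as a distinct object.

```python
from fractions import Fraction
from itertools import permutations, product

bag = ["R", "R", "R", "B", "B", "B"]  # 3 red, 3 blue
positions = range(len(bag))           # index marbles so each one is distinct

def second_red_probs(draws):
    """Return (P(B), P(B given A)) from equally likely (first, second) draws."""
    draws = list(draws)
    second_red = [d for d in draws if bag[d[1]] == "R"]
    first_red = [d for d in draws if bag[d[0]] == "R"]
    both_red = [d for d in first_red if bag[d[1]] == "R"]
    p_b = Fraction(len(second_red), len(draws))
    p_b_given_a = Fraction(len(both_red), len(first_red))
    return p_b, p_b_given_a

# Without replacement: the same marble cannot be drawn twice
without = second_red_probs(permutations(positions, 2))
# With replacement: the first marble goes back, so repeats are allowed
with_repl = second_red_probs(product(positions, repeat=2))

print("without replacement:", without)
print("with replacement:   ", with_repl)
```

Without replacement, the chance that the second marble is red is 1/2 overall but drops to 2/5 once you know the first was red, so the events are dependent. With replacement, both values are 1/2, so knowing the first draw changes nothing.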
Randomness, Uncertainty, and Sampling
Randomness
Random processes have unpredictable individual outcomes but stable long-run patterns, like coin tosses or die rolls.
Uncertainty
Uncertainty is our lack of full knowledge. Even non-random systems can be modeled with probability when we cannot observe everything.
What Is a Sample?
A sample is a subset of a larger population, like surveying 200 students out of 20,000 at a university.
Sampling Variability
Different random samples from the same population give slightly different results. This natural variation is sampling variability.
Why Probability Matters
Because samples vary, our estimates are uncertain. Probability provides a language to describe and manage this uncertainty.
Simulating Sampling Variability (Optional, Python)
If you know a bit of Python, you can see sampling variability in action.
This code:
- Simulates a population of 100,000 people
- Each person has a 30% chance of liking a new app
- Draws many random samples of size 200
- Shows how the sample proportion changes from sample to sample
```python
import numpy as np

# Set a seed so results are reproducible
np.random.seed(42)

# 1. Create a population: 1 = likes app, 0 = does not
population_size = 100000
true_prob_like = 0.30
population = np.random.binomial(1, true_prob_like, size=population_size)

# 2. Function to take a random sample and compute proportion who like the app
def sample_proportion(population, sample_size=200):
    # Pick distinct people (no one is surveyed twice)
    sample_indices = np.random.choice(len(population), size=sample_size, replace=False)
    sample = population[sample_indices]
    return sample.mean()

# 3. Take many samples and store their proportions
num_samples = 20
proportions = [sample_proportion(population) for _ in range(num_samples)]

print("True probability of liking the app:", true_prob_like)
print("Sample proportions (each from 200 people):")
print(np.round(proportions, 3))
print("Average of sample proportions:", np.mean(proportions))
```
What you should notice:
- Each sample proportion is close to 0.30, but not exactly.
- This is sampling variability.
- As you increase `sample_size`, the sample proportions tend to get closer to the true probability.
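To see the effect of `sample_size` directly, here is a rough sketch that uses only the standard library. It is a simplification: instead of sampling from a fixed population of 100,000, each simulated person independently likes the app with the same 30% chance. The sample sizes, the 200 repeats, and the seed are arbitrary choices.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable
TRUE_PROB = 0.30        # same 30% chance as in the example above

def sample_proportion(sample_size):
    """Proportion of simulated people in one sample who like the app."""
    likes = sum(rng.random() < TRUE_PROB for _ in range(sample_size))
    return likes / sample_size

# For each sample size, draw 200 samples and measure how spread out
# the sample proportions are (population standard deviation).
spread = {}
for size in (50, 200, 2000):
    props = [sample_proportion(size) for _ in range(200)]
    spread[size] = statistics.pstdev(props)
    print(size, round(spread[size], 4))
```

The spread should shrink as the sample size grows: larger samples still vary, but they vary less around 0.30.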
Check Understanding: Probability and Events
Answer this question to check your understanding of basic probability.
You roll a fair six-sided die once. Which statement is correct?
- A) The probability of getting a 7 is 1/7 because there are 7 possible integers.
- B) The probability of getting an even number is 3/6 because there are three even outcomes.
- C) The probability of getting a 3 changes if you already rolled a 3 earlier today.
- D) The probability of getting a 1 is 0 because that is very unlikely.
Answer: B) The probability of getting an even number is 3/6 because there are three even outcomes.
A fair die has outcomes {1,2,3,4,5,6}. The event "even" is {2,4,6}, which has 3 outcomes out of 6, so the probability is 3/6 = 1/2. Getting a 7 is impossible, earlier rolls today do not affect a new roll, and unlikely events can still have non-zero probability.
Check Understanding: Independence and Sampling
Answer this question about sampling variability.
A university surveys two different random samples of 200 students each about whether they have a part-time job. In sample 1, 40% say yes. In sample 2, 46% say yes. Which is the best interpretation?
- A) The survey is useless because the results are not exactly the same.
- B) This difference is expected due to sampling variability, even if the true proportion is fixed.
- C) The university must have changed its policy between the two samples.
- D) It is impossible for random samples to give different results if they are honest.
Answer: B) This difference is expected due to sampling variability, even if the true proportion is fixed.
Different random samples from the same population naturally give slightly different results. This is called sampling variability. It does not mean the survey is useless or that a policy changed.
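A quick simulation can show how ordinary a 6-point gap is. The sketch below assumes a true proportion of 0.43, a hypothetical value that the question does not give, and repeats the "two surveys of 200 students" experiment many times. It also treats each student's answer as an independent draw, which is a simplification.

```python
import random

rng = random.Random(7)       # fixed seed so the run is repeatable
ASSUMED_TRUE_PROP = 0.43     # hypothetical value, not given in the question
SAMPLE_SIZE = 200
NUM_REPEATS = 2_000
GAP_IN_STUDENTS = 12         # 12 out of 200 = 6 percentage points

def survey_yes_count():
    """Number of 'yes' answers in one simulated survey of SAMPLE_SIZE students."""
    return sum(rng.random() < ASSUMED_TRUE_PROP for _ in range(SAMPLE_SIZE))

# Count how often two surveys differ by at least 6 percentage points
big_gaps = sum(
    abs(survey_yes_count() - survey_yes_count()) >= GAP_IN_STUDENTS
    for _ in range(NUM_REPEATS)
)
share_big_gaps = big_gaps / NUM_REPEATS
print(share_big_gaps)
```

Under these assumptions, a gap of 6 points or more shows up in a sizeable share of the repeats, even though the true proportion never changes.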
Review Key Terms
Review the main ideas from this module with these terms and definitions.
- Outcome
  - A single possible result of a process, like "Heads" in a coin toss or "4" in a die roll.
- Event
  - A set of outcomes we care about, such as "even number" when rolling a die (outcomes {2, 4, 6}).
- Probability
  - A number between 0 and 1 (or 0% to 100%) that measures how likely an event is to happen.
- Independent events
  - Events where knowing one happened does not change the chance of the other (for example, two separate fair coin tosses).
- Dependent events
  - Events where knowing one happened changes the chance of the other (for example, drawing cards without replacement).
- Randomness
  - A property of processes where individual outcomes are unpredictable, but long-run patterns are stable.
- Sample
  - A subset of a larger population, often selected randomly to learn about the whole group.
- Sampling variability
  - The natural differences in results that appear when we take different random samples from the same population.
Key Terms
- Event
  - A collection of one or more outcomes that we are interested in, like "rolling an even number".
- Sample
  - A subset of individuals or observations taken from a larger population for analysis.
- Outcome
  - A single possible result of a random process, such as "Heads" in a coin toss.
- Randomness
  - The feature of a process where individual outcomes cannot be predicted with certainty, though overall patterns may be stable.
- Probability
  - A number between 0 and 1 (or 0% to 100%) that describes how likely an event is to occur.
- Dependent events
  - Two events where the occurrence of one changes the probability of the other.
- Independent events
  - Two events where the occurrence of one does not affect the probability of the other.
- Sampling variability
  - The fact that statistics (like averages or proportions) computed from different random samples of the same population will differ from each other.