Descriptive Statistics: Summarizing Data at a Glance — Data Science Foundations: From Raw Data to Insight

Why Descriptive Statistics Matter

What Are Descriptive Statistics?

Descriptive statistics are tools that help you quickly summarize and understand a dataset before any complex analysis or machine learning.

Why They Matter

They answer questions like: What is a typical value? How spread out are the values? Are there any extreme values (outliers)?

Three Main Groups

We group them into: measures of central tendency (mean, median, mode), measures of spread (range, variance, standard deviation, IQR), and visual summaries (histograms, boxplots).

Our Focus

We will use small datasets and do manual, conceptual calculations. Descriptive stats give a quick snapshot that guides deeper analysis.

A Running Example Dataset

Quiz Score Dataset

We will use one dataset: quiz scores of 7 students (out of 10). The scores are: 4, 6, 7, 7, 8, 9, 10.

Why One Dataset?

Using the same numbers lets you see how different descriptive statistics (mean, median, spread) relate to each other.

What You Will Do

You will find typical values (mean, median, mode), measure spread (range, variance, SD, IQR), and imagine the distribution (histogram, boxplot).

Central Tendency: Mean, Median, Mode

Central Tendency

Central tendency describes where the center of the data is. The main measures are mean, median, and mode.

Mean (Average)

Add all values and divide by how many there are. Our scores sum to 51 and the count is 7, so the mean is 51/7 ≈ 7.29.

Median (Middle)

Sort the data and pick the middle value. Sorted: 4, 6, 7, 7, 8, 9, 10. The 4th value is 7, so the median is 7.

Mode (Most Frequent)

The mode is the value that appears most often. Only 7 appears twice, so the mode is 7.
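
As a quick check of the three calculations above, here is a minimal Python sketch using the standard library's `statistics` module. The variable names are our own choices for illustration and are not part of the lesson.

```python
import statistics

# Quiz scores from the running example (out of 10).
scores = [4, 6, 7, 7, 8, 9, 10]

mean = statistics.mean(scores)        # sum of the values divided by the count
median = statistics.median(scores)    # middle value of the sorted data
modes = statistics.multimode(scores)  # list of every most-frequent value

print(f"mean   = {mean:.2f}")
print(f"median = {median}")
print(f"modes  = {modes}")
```

`multimode` returns a list, which matches the idea that a dataset can have more than one mode.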

Mean vs Median

Mean uses all values and is sensitive to extremes. Median and mode are more robust when data are skewed or have outliers.

Try It: Central Tendency in a Small Dataset

Work with this new mini dataset of daily study hours for 5 days:

`1, 2, 2, 4, 11`

1) Sort the data (if needed). Write the sorted list.
2) Find the mean: add all 5 numbers, then divide by 5.
3) Find the median: since there are 5 values (an odd number), the median is the 3rd value in the sorted list.
4) Find the mode: which value appears most often?

Then think:

Which measure (mean or median) better represents a "typical" study day here, and why?
Hint: One day is very different from the others.

Pause and actually do the calculations before moving on. You can check yourself:

If your mean is much higher than most of the values, ask: is one extreme value pulling it up?
If your median seems closer to where most days fall, that is a sign the median might be more representative.
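
Once you have worked the numbers by hand, a short script like the one below can confirm them. It is only a checking aid, written as a sketch with variable names chosen for illustration.

```python
import statistics

# Daily study hours from the exercise.
hours = [1, 2, 2, 4, 11]

sorted_hours = sorted(hours)
mean = sum(sorted_hours) / len(sorted_hours)  # add all values, divide by the count
median = statistics.median(sorted_hours)      # 3rd value of the 5 sorted values
modes = statistics.multimode(sorted_hours)    # most frequent value(s)

print("sorted:", sorted_hours)
print("mean:  ", mean)
print("median:", median)
print("modes: ", modes)
```

Compare the printed mean and median with where most of the five days fall before deciding which one is more representative.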

Spread: Range, Variance, and Standard Deviation

Why Measure Spread?

Two groups can share the same average but have very different variability. Measures of spread tell you how tightly values cluster around the center.

Range

Range = max − min. For scores 4, 6, 7, 7, 8, 9, 10, max = 10, min = 4, so range = 6. Simple but sensitive to extremes.

Variance Concept

Variance is the average squared distance from the mean. It uses all data and reflects how far values typically lie from the mean.

Variance Steps

1) Subtract the mean from each value. 2) Square each deviation and add them (the sum is about 23.43). 3) Divide by the number of values (here, 7) to get variance ≈ 23.43/7 ≈ 3.35.

Standard Deviation

Standard deviation is the square root of variance. For our data, SD ≈ √3.35 ≈ 1.83: scores typically lie about 1.8 points from the mean.
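
The three variance steps and the square root can be traced in code. The sketch below follows the lesson's divide-by-n approach; the variable names are illustrative assumptions.

```python
import math

scores = [4, 6, 7, 7, 8, 9, 10]
n = len(scores)
mean = sum(scores) / n

# Step 1: deviation of each value from the mean.
deviations = [x - mean for x in scores]

# Step 2: square each deviation and add them up.
sum_sq = sum(d ** 2 for d in deviations)

# Step 3: divide by the number of values (n) to get the variance.
variance = sum_sq / n

# Standard deviation is the square root of the variance.
sd = math.sqrt(variance)

print(f"sum of squared deviations = {sum_sq:.2f}")
print(f"variance = {variance:.2f}")
print(f"standard deviation = {sd:.2f}")
```

Printing the intermediate sum makes it easy to see which step a hand calculation went wrong in.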

n vs n−1

To keep things simple, we divided by n. In practice, when the data are a sample used to estimate a population's variance, analysts usually divide by n−1 instead (the sample variance).
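
Python's `statistics` module exposes both conventions, so the difference is easy to see on the quiz scores. This is a small sketch for comparison, not a recommendation of one over the other.

```python
import statistics

scores = [4, 6, 7, 7, 8, 9, 10]

pop_var = statistics.pvariance(scores)    # divides by n
sample_var = statistics.variance(scores)  # divides by n - 1

print(f"divide by n:     {pop_var:.2f}")
print(f"divide by n - 1: {sample_var:.2f}")
```

Dividing by the smaller number n−1 always gives a slightly larger variance, and the gap shrinks as the dataset grows.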

Quartiles and the Interquartile Range (IQR)

Quartiles

Quartiles split ordered data into four equal parts: Q1 (25%), Q2 (median, 50%), Q3 (75%). They describe spread around the median.

Finding Quartiles

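Sorted scores: 4, 6, 7, 7, 8, 9, 10. Median (Q2) = 7, the 4th value. Leave the median out and split the rest into two halves. Lower half: 4, 6, 7, so Q1 = 6. Upper half: 8, 9, 10, so Q3 = 9. Other conventions (for example, interpolation) can give slightly different quartiles.

Interquartile Range (IQR)

IQR = Q3 − Q1. Here IQR = 9 − 6 = 3. The middle 50% of scores lie in a band 3 points wide.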
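
Here is a minimal sketch of the median-of-halves method for quartiles. It assumes that when the count is odd, the median is left out of both halves, so the upper half of the quiz scores is 8, 9, 10. The function name `quartiles_by_halves` is hypothetical, and software packages often use interpolation-based conventions that may report slightly different quartiles for the same data.

```python
import statistics

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    Assumption: with an odd count, the median itself is excluded
    from both the lower and the upper half.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]               # values before the median position
    upper = data[len(data) - half:]   # values after the median position
    return (
        statistics.median(lower),
        statistics.median(data),
        statistics.median(upper),
    )

scores = [4, 6, 7, 7, 8, 9, 10]
q1, q2, q3 = quartiles_by_halves(scores)
iqr = q3 - q1

print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {iqr}")
```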

Why Use IQR?

IQR ignores the lowest and highest 25% of values, so it is robust to outliers and skew. Median + IQR work well for skewed data.

Distributions, Histograms, and Boxplots (Conceptually)

What Is a Distribution?

A distribution describes how often different values occur. Picture each data point as a dot on a number line and see where dots pile up.

Histogram Concept

A histogram groups values into bins and shows how many fall in each bin. Taller bars mean more data in that range, revealing peaks and skew.
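
To make the binning idea concrete, the sketch below draws a text histogram of the quiz scores. The bin width of 2 is an assumption chosen for illustration; a different width would give a different picture.

```python
from collections import Counter

scores = [4, 6, 7, 7, 8, 9, 10]
bin_width = 2  # assumed bin width, chosen only for this example

# Map each score to the lower edge of its bin, e.g. 7 -> 6 for the bin 6-7.
counts = Counter((s // bin_width) * bin_width for s in scores)

# Print one row per bin; each '#' stands for one score in that bin.
for start in range(min(counts), max(counts) + bin_width, bin_width):
    label = f"{start}-{start + bin_width - 1}"
    print(f"{label:>5} | {'#' * counts[start]}")
```

The longest row of `#` marks the peak of the distribution, which is the same information a taller bar conveys in a drawn histogram.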

Boxplot Concept

A boxplot shows a box from Q1 to Q3, a line at the median, whiskers toward min and max, and sometimes dots for outliers.

Our Quiz Boxplot

For our scores: min 4, Q1 6, median 7, Q3 9, max 10 (Q3 is the median of the upper half 8, 9, 10). The box spans 6–9 with a median line at 7 and whiskers toward 4 and 10.
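
The five numbers a boxplot is built from can also be computed with the standard library. The sketch below uses `statistics.quantiles` with its default method; for this particular dataset that gives the same quartiles as taking the medians of the halves 4, 6, 7 and 8, 9, 10, though the two approaches do not agree for every dataset.

```python
import statistics

scores = [4, 6, 7, 7, 8, 9, 10]

# With n=4, quantiles returns the three cut points Q1, Q2 (median), Q3.
q1, median, q3 = statistics.quantiles(scores, n=4)

# Five-number summary: min, Q1, median, Q3, max.
five_number = (min(scores), q1, median, q3, max(scores))

print("min, Q1, median, Q3, max =", five_number)
print("box width (IQR) =", q3 - q1)
```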

Shape vs Summary

Numerical summaries give quick numbers; histograms and boxplots reveal the overall shape, skew, and outliers of the data.

Check Your Understanding: Choosing the Right Statistic

Answer this question to test your understanding of when to use mean vs median.

A small company has monthly salaries (in $1000s): 3, 3, 3, 4, 4, 60 (the CEO). Which measure better represents a "typical" employee salary?

A) Mean salary, because it uses all values and is around (3+3+3+4+4+60)/6 ≈ 12.8
B) Median salary, because it is not pulled up by the CEO's very high salary
C) Mode salary, because it only considers the most frequent value and ignores others

Answer: B) Median salary, because it is not pulled up by the CEO's very high salary

The CEO's salary is an extreme outlier that makes the mean much higher than most employees' pay. The median sits in the middle of the ordered data and is less affected by this outlier, so it better represents a typical employee salary.
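
A short sketch makes the effect of the outlier visible by computing the mean and median with and without the CEO's salary. The variable names are illustrative.

```python
import statistics

# Monthly salaries in $1000s; the last value (60) is the CEO.
salaries = [3, 3, 3, 4, 4, 60]
without_ceo = salaries[:-1]

mean_all = statistics.mean(salaries)
median_all = statistics.median(salaries)
mean_rest = statistics.mean(without_ceo)
median_rest = statistics.median(without_ceo)

print(f"with CEO:    mean = {mean_all:.2f}, median = {median_all}")
print(f"without CEO: mean = {mean_rest:.2f}, median = {median_rest}")
```

Removing one value moves the mean by several thousand dollars, while the median barely changes.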

Apply It: Summarizing a Tiny Dataset

You are given the ages (in years) of 6 people in a study group:

`18, 19, 19, 20, 35, 36`

1) Sort the data (if needed) and write them down.
2) Find the mean age.
3) Find the median age (even number of values: average the 3rd and 4th values).
4) Find the range.
5) Find Q1 and Q3 (hint: split into lower half and upper half after finding the median, then find medians of each half).
6) Compute the IQR.

Then reflect:

Which seems more representative of a "typical" age here: mean or median?
Are the ages tightly clustered or spread out?

Try to reason before checking with any calculator. Focus on the process of summarizing, not perfect arithmetic.
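
After working through the steps by hand, you can compare your results with the sketch below. It follows the hint in the exercise: because the count is even, the sorted data split cleanly into a lower half and an upper half. The dictionary keys are our own labels.

```python
import statistics

ages = sorted([18, 19, 19, 20, 35, 36])
half = len(ages) // 2  # even count, so each half holds 3 values

summary = {
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),     # average of the 3rd and 4th values
    "range": ages[-1] - ages[0],           # max minus min
    "q1": statistics.median(ages[:half]),  # median of the lower half
    "q3": statistics.median(ages[half:]),  # median of the upper half
}
summary["iqr"] = summary["q3"] - summary["q1"]

for name, value in summary.items():
    print(f"{name:>6}: {value}")
```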

Review Key Terms

Use these flashcards to review the core concepts from this module.

Mean: The arithmetic average: add all values and divide by the number of values. Sensitive to extreme values (outliers).
Median: The middle value when data are sorted. If there is an even number of values, average the two middle ones. Robust to outliers.
Mode: The value that appears most frequently in the data. There can be more than one mode, or none if all values are equally frequent.
Range: A simple measure of spread: maximum value minus minimum value. Very sensitive to extreme values.
Variance: The average of the squared deviations from the mean. Measures how spread out values are around the mean.
Standard Deviation: The square root of the variance. Expresses typical distance from the mean in the same units as the data.
Quartiles (Q1, Q2, Q3): Values that split ordered data into four equal parts: Q1 at 25%, Q2 at 50% (median), Q3 at 75%.
Interquartile Range (IQR): The distance between Q3 and Q1 (IQR = Q3 − Q1). Describes the spread of the middle 50% of the data and is robust to outliers.
Distribution: A description of how often different values occur in a dataset. Can be shown with histograms, boxplots, or dot plots.
Histogram: A plot that groups data into bins and shows how many values fall into each bin, revealing the shape of the distribution.
Boxplot: A visual summary using median, quartiles, and whiskers (and sometimes outliers) to show the distribution of data.

Key Terms

Mean: The arithmetic average of a set of numbers, found by adding them and dividing by the count.
Mode: The most frequently occurring value in a dataset.
Range: The difference between the maximum and minimum values in a dataset.
Median: The middle value of ordered data; if there are two middle values, their average.
Boxplot: A compact graphical summary of a distribution using median, quartiles, and whiskers to show spread and potential outliers.
Outlier: A data point that is unusually far from most other values in the dataset.
Variance: A measure of spread that is the average squared distance of each value from the mean.
Histogram: A bar-like plot that shows how many data points fall within each of several value ranges (bins).
Quartiles: Values that divide ordered data into four equal parts: Q1 (25%), Q2 (50%, median), Q3 (75%).
Distribution: How values in a dataset are spread out or clustered, often described by its shape, center, and spread.
Standard deviation: The square root of the variance, representing a typical distance from the mean in the same units as the data.
Interquartile range (IQR): The difference between the third and first quartiles (Q3 − Q1), showing spread of the middle 50% of data.