Chapter 3 of 9
Knowing Your Data: Types, Structure, and Quality
Peek inside real datasets to see how numbers, categories, text, and timestamps are represented—and why understanding structure and quality is essential before any analysis or modeling.
Step 1: Why Knowing Your Data Comes First
Why This Matters
Before building any model, you must understand what your data actually looks like. This step zooms in on the raw material of data science: the data itself.
What You Will Learn
You will see different data structures, learn common data types, and spot basic data quality issues like missing values and outliers.
Key Idea
Think of this as reading the ingredients label on your dataset. Good analysis starts with knowing your data; skipping this can lead to very wrong conclusions.
Step 2: Structured, Semi-Structured, and Unstructured Data
Three Structure Types
Data can be structured, semi-structured, or unstructured. Recognizing which you have guides your cleaning and analysis steps.
Structured Data
Structured data lives in tables with rows and columns. Examples: gradebooks, sales tables. Each column has a clear type like number or date.
Semi-Structured Data
Semi-structured data has some organization but no fixed table layout. Common formats are JSON and XML, where different records can have different fields.
Unstructured Data
Unstructured data has no simple row-column form: free text, images, audio. It usually must be transformed into more structured features before modeling.
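The three structure types can be sketched with plain Python objects. This is a minimal illustration, not a definitive test for structure: the records, field names, and the `field_sets` helper are all made up for this example.

```python
import json
from collections import Counter

# Structured: every record has the same fields with consistent types.
structured_rows = [
    {"student": "Ana", "score": 91},
    {"student": "Ben", "score": 78},
]

# Semi-structured: JSON records share a key-value style but not identical fields.
semi_structured = json.loads(
    '[{"user": "ana", "event": "login"},'
    ' {"user": "ben", "event": "error", "error_code": 500}]'
)

# Unstructured: free text with no fields at all.
unstructured_text = "The delivery was late but the product is great, really great."


def field_sets(records):
    """Return the distinct combinations of field names found in the records."""
    return {tuple(sorted(record.keys())) for record in records}


# One combination suggests a fixed schema; more than one means fields vary.
print(len(field_sets(structured_rows)))
print(len(field_sets(semi_structured)))

# Turning text into a simple structured feature: word counts.
words = unstructured_text.lower().replace(",", "").replace(".", "").split()
word_counts = Counter(words)
print(word_counts["great"])
```

Counting words is only one of many ways to derive features from text; it is used here because it needs nothing beyond the standard library.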
Step 3: Spot the Structure (Real-World Examples)
Example 1: Bank Spreadsheet
Transactions with columns like amount and timestamp in a spreadsheet are structured data: neat rows and columns, clear types.
Example 2: Server Logs
Server logs in JSON where some records have extra fields are semi-structured: same key-value style, but not identical columns.
Example 3: Emails
Customer support emails are unstructured: each message is free text, with no fixed set of fields.
Example 4: Mixed Reviews
Product reviews stored as JSON with stars and product_id plus a text review are mixed: structured fields plus unstructured text together.
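The mixed case in Example 4 might look like the sketch below. The record contents and the `review` field name are assumptions made for illustration; only `stars` and `product_id` come from the example description.

```python
import json

# Hypothetical review record; the values and the "review" key are assumed.
raw = '{"product_id": "P-100", "stars": 4, "review": "Works well, but shipping was slow."}'
record = json.loads(raw)

# Structured part: fields with a fixed meaning and a simple type.
structured_part = {key: record[key] for key in ("product_id", "stars")}

# Unstructured part: free text that needs extra processing before modeling.
text_part = record["review"]

# One very simple derived feature from the text: its length in words.
review_word_count = len(text_part.split())

print(structured_part)
print(review_word_count)
```

The structured fields can go straight into a table, while the text has to be turned into features (here, just a word count) before most models can use it.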
Step 4: Common Variable Types (How Columns Differ)
Why Variable Types Matter
Each column in a dataset behaves differently. Knowing its variable type tells you what math, charts, and models make sense.
Numeric: Continuous vs Discrete
Continuous variables can take any value within a range (height, price). Discrete variables take separate values, typically integer counts (number of purchases).
Categorical: Nominal vs Ordinal
Nominal categories have no order (country, payment method). Ordinal categories have a natural order (low, medium, high).
Datetime and Text
Datetime variables store dates and times and allow durations. Text variables are free-form strings like reviews and usually need extra processing.
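As a rough sketch, here is how these variable types could be declared in pandas. The column names and values are invented for this example, and the chosen dtypes are one reasonable option rather than the only one.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [19.99, 5.50, 12.00],                # numeric, continuous
    "n_purchases": [1, 3, 2],                     # numeric, discrete
    "payment_method": ["card", "cash", "card"],   # categorical, nominal
    "priority": ["low", "high", "medium"],        # categorical, ordinal
    "ordered_at": ["2026-05-01", "2026-05-03", "2026-05-10"],  # datetime
})

# Nominal: categories without an order.
df["payment_method"] = df["payment_method"].astype("category")

# Ordinal: categories with an explicit order, so sorting and min/max work.
df["priority"] = pd.Categorical(
    df["priority"], categories=["low", "medium", "high"], ordered=True
)

# Datetime: parsing the strings enables durations.
df["ordered_at"] = pd.to_datetime(df["ordered_at"], format="%Y-%m-%d")
span_days = (df["ordered_at"].max() - df["ordered_at"].min()).days

print(df.dtypes)
print(df["priority"].min())  # smallest level under the declared order
print(span_days)
```

Declaring the order for `priority` matters: without it, pandas would sort the labels alphabetically, which puts "high" before "low".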
Step 5: Classify These Variables
For each column below, decide which type it is. Think before checking the answers.
- `age_years` in a health survey: values like 18, 19, 20, ...
- `satisfaction_score` from 1 to 5 (where 1 = very unhappy, 5 = very happy)
- `country` of residence
- `order_timestamp` in an e-commerce dataset
- `review_text` describing a product
Suggested answers:
- `age_years`: Numeric, discrete (integer count of years; often treated as continuous in practice).
- `satisfaction_score`: Usually ordinal categorical (ordered levels 1–5). Some analyses also treat it as numeric.
- `country`: Categorical, nominal (no natural order).
- `order_timestamp`: Datetime.
- `review_text`: Text (unstructured).
Step 6: Basic Data Quality Issues
Messy Data Is Normal
Real datasets are messy. Before analysis, you must detect issues like missing values, outliers, inconsistencies, duplicates, and type errors.
Missing Values and Outliers
Missing values are absent entries; outliers are values far from most others. Both can break models or bias results if ignored.
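One common way to flag both issues is to count the missing entries and then apply the 1.5 × IQR rule to the rest. The sketch below uses made-up amounts and only the standard library; the IQR rule is a convention chosen for this example, not the only definition of an outlier.

```python
import statistics

# Illustrative amounts; None marks a missing value.
amounts = [49.99, 79.99, None, 62.50, 55.00, 71.25, 58.40, 66.10, 120000.0]

# Separate missing entries from present ones before computing anything.
present = [a for a in amounts if a is not None]
n_missing = len(amounts) - len(present)

# Quartiles of the present values (statistics.quantiles needs Python 3.8+).
q1, _, q3 = statistics.quantiles(present, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged for review, not deleted automatically.
outliers = [a for a in present if a < lower or a > upper]

print(n_missing)
print(outliers)
```

A flagged value still needs a human decision: it may be a typo, or it may be a genuine rare event that should stay in the data.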
Inconsistencies and Duplicates
Inconsistencies are the same concept recorded differently. Duplicates are repeated records that can double-count events.
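Both problems can be handled in a few lines of pandas. This is a minimal sketch: the `canonical` mapping is hypothetical and would have to be built from the spellings that actually occur in your data.

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 3, 4],
    "country": ["USA", "United States", "U.S.", "U.S.", "usa"],
})

# Hypothetical mapping from normalized spellings to one canonical label.
canonical = {"usa": "USA", "united states": "USA", "u.s.": "USA"}

# Normalize whitespace and case first, then map to the canonical label.
# Spellings missing from the mapping become NaN, which makes them easy to find.
normalized = orders["country"].str.strip().str.lower()
orders["country_clean"] = normalized.map(canonical)

# Keep the first row for each order_id so repeated orders are counted once.
deduplicated = orders.drop_duplicates(subset="order_id", keep="first")

print(orders["country"].nunique())
print(orders["country_clean"].nunique())
print(len(deduplicated))
```

Keeping the first row is an assumption that the repeated rows are true copies; if they differ in other columns, they need a closer look before anything is dropped.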
Type Errors
Type errors happen when a column has the wrong data type in software, like numbers stored as text, blocking correct calculations.
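The effect of numbers stored as text, and one possible fix, can be sketched as follows. The values are invented, and converting unparseable entries to missing values with `errors="coerce"` is a choice made for this example.

```python
import pandas as pd

# Numbers that arrived as text, including one entry that is not a number.
raw_amounts = pd.Series(["49.99", "79.99", "n/a", "120"])

# While the values are text, "+" concatenates instead of adding.
joined = raw_amounts.iloc[0] + raw_amounts.iloc[1]
print(joined)

# errors="coerce" turns unparseable entries into NaN instead of raising an error.
amounts = pd.to_numeric(raw_amounts, errors="coerce")

print(amounts.dtype)
print(amounts.isna().sum())
print(amounts.sum())  # NaN is skipped by default
```

After conversion, it is worth checking how many entries became missing, since each one was a value the parser could not read.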
Step 7: A Messy Mini-Dataset
A Tiny Orders Table
Consider a small orders table with columns like order_date, amount, and country. It contains several realistic issues to spot.
Visible Problems
You can see inconsistent date formats, a missing amount, a duplicate order_id, a huge outlier amount, and a suspicious negative amount.
More Subtle Issues
There is also an invalid date with month 13 and inconsistent country values like USA, United States, U.S., and usa.
Lesson
This is typical of real data. Inspecting and cleaning such problems is a crucial early step before any analysis or modeling.
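As one illustration of such an inspection, the sketch below sorts date strings into "ok", "wrong format", and "invalid date". The `classify_date` helper is hypothetical, and it assumes that `YYYY-MM-DD` is the expected format for the column.

```python
from datetime import datetime

# Date strings showing the kinds of problems described above.
order_dates = ["2026-05-01", "01/05/2026", "2026-05-03", "2026-05-03", "2026-13-01"]


def classify_date(text, expected_format="%Y-%m-%d"):
    """Label a date string as 'ok', 'wrong format', or 'invalid date'."""
    try:
        datetime.strptime(text, expected_format)
        return "ok"
    except ValueError:
        # Three dash-separated numbers: the shape is right, the calendar value is not.
        parts = text.split("-")
        if len(parts) == 3 and all(p.isdigit() for p in parts):
            return "invalid date"
        return "wrong format"


labels = [classify_date(d) for d in order_dates]
print(labels)
```

Separating the two failure modes is useful because they call for different fixes: a wrong format can often be re-parsed, while month 13 has to be corrected at the source or treated as missing.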
Step 8: Peeking at Data with Code (Python + pandas)
Here is a short Python example using `pandas`, a popular data analysis library. It shows how to quickly inspect structure and quality.
```python
import pandas as pd
# Create a small DataFrame similar to the example
data = {
    "order_id": [1, 2, 3, 3, 4],  # order_id 3 appears twice
    "user_id": [101, 102, 102, 102, 103],
    "order_date": [
        "2026-05-01",
        "01/05/2026",  # different date format
        "2026-05-03",
        "2026-05-03",
        "2026-13-01",  # invalid (month 13)
    ],
    "amount": [49.99, 79.99, None, 120000, -10],
    "country": ["USA", "United States", "U.S.", "USA", "usa"],
}
df = pd.DataFrame(data)
print("\nHead of data:")
print(df.head())
print("\nColumn types:")
print(df.dtypes)
print("\nMissing values per column:")
print(df.isna().sum())
print("\nBasic statistics for numeric columns:")
print(df.describe())
print("\nUnique country values:")
print(df["country"].unique())
```
What this does:
- `df.dtypes` shows data types (before conversion, text and date-like columns typically show as `object`, or as a string dtype in newer pandas versions).
- `df.isna().sum()` counts missing values.
- `df.describe()` gives min, max, mean, and more for numeric columns (helps spot outliers).
- `df["country"].unique()` reveals inconsistent categories.
You do not need to memorize this code now. The important idea is: tools like pandas make it easy to quickly scan structure and quality issues.
Step 9: Quick Knowledge Check
Answer this question to check your understanding of variable types and data quality.
You have a column `delivery_time_days` that measures how many days it took for an order to arrive (0, 1, 2, 3, ...). You also see a few values like 999. Which statement is most accurate?
- A) It is a numeric, discrete variable, and values like 999 are likely outliers or data errors.
- B) It is a categorical, nominal variable, and 999 is just another valid category.
- C) It is unstructured text data, and 999 means the value is missing.
Answer: A) It is a numeric, discrete variable, and values like 999 are likely outliers or data errors.
Delivery time in days is a count, so it is numeric and discrete. Very large values like 999, compared with typical delivery times, are likely outliers or coding errors (for example, using 999 as a placeholder for missing).
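The effect of such a placeholder on a simple statistic can be sketched like this. The values are invented, and treating 999 as a code for "unknown" is an assumption that should be confirmed against the dataset's documentation.

```python
import pandas as pd

delivery_time_days = pd.Series([1, 2, 3, 2, 999, 4, 999])

# Assumption for illustration: 999 is a placeholder code, not a real delivery time.
SENTINEL = 999

# mask() replaces the entries where the condition is True with NaN.
cleaned = delivery_time_days.mask(delivery_time_days == SENTINEL)

print(delivery_time_days.mean())  # inflated by the placeholder values
print(cleaned.mean())             # mean of the real delivery times only
print(cleaned.isna().sum())       # how many entries were placeholders
```

Two placeholder values are enough to move the mean from about 2 days to several hundred, which is why such codes need to be converted to missing values before any summary is computed.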
Step 10: Review Key Terms
Use these flashcards to review the main concepts from this module.
- Structured data: Data organized in tables with rows and columns, where each column has a defined type (for example, a spreadsheet of sales with price and date).
- Semi-structured data: Data with some organization (like JSON or XML) but not a fixed table schema; different records can have different fields.
- Unstructured data: Data without a predefined tabular structure, such as free text, images, or audio, which must be transformed before most analyses.
- Continuous variable: A numeric variable that can take any value within a range (for example, height, temperature, price).
- Discrete variable: A numeric variable that takes separate values, often counts or integers (for example, number of purchases).
- Categorical variable: A variable that represents groups or labels, not amounts; can be nominal (unordered) or ordinal (ordered).
- Ordinal variable: A categorical variable with a clear order (for example, low/medium/high), where the gaps between levels are not necessarily equal.
- Datetime variable: A variable that stores dates, times, or both, allowing operations like sorting by time and computing durations.
- Missing value: An expected data point that is absent, often shown as NA, null, blank, or a special placeholder code.
- Outlier: A value that is very far from most others; may represent a rare event or a data error.
- Inconsistency: The same concept recorded in different ways (for example, "USA", "United States", "U.S."), which can break grouping and counting.
- Duplicate record: The same row or event stored more than once in a dataset, leading to double-counting.
Key Terms
- outlier: A data point that is very different from most others in the dataset, potentially indicating an error or rare event.
- duplicate: A record that appears more than once in a dataset, which can distort counts and statistics.
- text data: Unstructured strings of characters, such as comments or reviews, that often require extra processing before analysis.
- type error: A mismatch between the stored data and its intended type, such as numbers stored as text.
- inconsistency: A situation where the same information is recorded in different ways, making analysis harder.
- missing value: An entry in a dataset where a value was expected but is not present.
- structured data: Data stored in a clear row-and-column format, such as spreadsheets or relational databases, where each column has a defined type.
- nominal variable: A categorical variable with no inherent order among its categories, such as colors or countries.
- numeric variable: A variable that represents quantities or counts and can be used in arithmetic operations.
- ordinal variable: A categorical variable whose categories have a natural order, such as ratings from 1 to 5 or small/medium/large.
- datetime variable: A variable that stores dates, times, or both, enabling time-based operations like sorting and duration calculation.
- discrete variable: A numeric variable that takes separate, often integer, values, such as number of items bought.
- unstructured data: Data without a predefined structure, such as text documents, images, or audio recordings.
- continuous variable: A numeric variable that can take many possible values within a range, such as height or temperature.
- categorical variable: A variable that represents categories or labels rather than amounts, such as country or payment method.
- semi-structured data: Data that has some structure (like key-value pairs in JSON) but does not fit neatly into a fixed table schema.