Knowing Your Data: Types, Structure, and Quality — Data Science Foundations: From Raw Data to Insight

Step 1: Why Knowing Your Data Comes First

Why This Matters

Before building any model, you must understand what your data actually looks like. This step zooms in on the raw material of data science: the data itself.

What You Will Learn

You will see different data structures, learn common data types, and spot basic data quality issues like missing values and outliers.

Key Idea

Think of this as reading the ingredients label on your dataset. Good analysis starts with knowing your data; skipping this can lead to very wrong conclusions.

Step 2: Structured, Semi-Structured, and Unstructured Data

Three Structure Types

Data can be structured, semi-structured, or unstructured. Recognizing which you have guides your cleaning and analysis steps.

Structured Data

Structured data lives in tables with rows and columns. Examples: gradebooks, sales tables. Each column has a clear type like number or date.

Semi-Structured Data

Semi-structured data has some organization but not a fixed table. Common formats are JSON or XML, where different records can have different fields.
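To make "different records can have different fields" concrete, here is a minimal sketch using only Python's standard library. The log records and field names (`user`, `status`, `error`, `latency_ms`) are hypothetical; the idea is simply to collect every field that appears and fill the gaps with `None` so the records line up as table-like rows.

```python
import json

# Hypothetical JSON log records; field names are illustrative only.
raw_records = [
    '{"user": "a", "status": 200}',
    '{"user": "b", "status": 404, "error": "not found"}',
    '{"user": "c", "latency_ms": 35}',
]

records = [json.loads(line) for line in raw_records]

# Collect every field seen in any record, preserving first-seen order.
columns = []
for record in records:
    for key in record:
        if key not in columns:
            columns.append(key)

# Build table-like rows; fields a record lacks become None.
rows = [[record.get(col) for col in columns] for record in records]

print(columns)
for row in rows:
    print(row)
```

The `None` entries that appear after flattening are one common source of missing values in tables built from semi-structured data.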

Unstructured Data

Unstructured data has no simple row-column form: free text, images, audio. It usually must be transformed into more structured features before modeling.
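As a rough illustration of "transforming into more structured features", the sketch below turns free-text strings into a few numeric and boolean columns. The review texts and the chosen features are assumptions made for this example; real projects typically use richer text processing.

```python
# Hypothetical free-text reviews; the chosen features are illustrative only.
reviews = [
    "Great product, arrived quickly!",
    "Terrible. Broke after two days and support never replied.",
]

def text_features(text):
    """Turn one free-text string into a few simple structured features."""
    words = text.split()
    return {
        "n_chars": len(text),
        "n_words": len(words),
        "has_exclamation": "!" in text,
    }

features = [text_features(r) for r in reviews]
for row in features:
    print(row)
```

Each review now corresponds to one row with the same fields, which is the shape most analysis tools expect.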

Step 3: Spot the Structure (Real-World Examples)

Example 1: Bank Spreadsheet

Transactions with columns like amount and timestamp in a spreadsheet are structured data: neat rows and columns, clear types.

Example 2: Server Logs

Server logs in JSON where some records have extra fields are semi-structured: same key-value style, but not identical columns.

Example 3: Emails

Customer support emails are unstructured: each message is free text, with no fixed set of fields.

Example 4: Mixed Reviews

Product reviews stored as JSON with stars and product_id plus a text review are mixed: structured fields plus unstructured text together.

Step 4: Common Variable Types (How Columns Differ)

Why Variable Types Matter

Each column in a dataset behaves differently. Knowing its variable type tells you what math, charts, and models make sense.

Numeric: Continuous vs Discrete

Continuous variables can take any value within a range (height, price). Discrete variables take separate values, usually counts or integers (number of purchases).

Categorical: Nominal vs Ordinal

Nominal categories have no order (country, payment method). Ordinal categories have a natural order (low, medium, high).
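One practical consequence of ordinal data is that the order has to be stated explicitly, because software will otherwise sort labels alphabetically. The sketch below assumes a hypothetical `risk_levels` column and a hand-written rank mapping.

```python
# Hypothetical ordinal levels; the order is defined explicitly, not alphabetically.
LEVEL_ORDER = {"low": 0, "medium": 1, "high": 2}

risk_levels = ["high", "low", "medium", "low"]

# Alphabetical sorting ignores the real order ("high" < "low" < "medium").
alphabetical = sorted(risk_levels)

# Sorting by the explicit rank respects the ordinal meaning.
by_rank = sorted(risk_levels, key=LEVEL_ORDER.get)

print(alphabetical)  # ['high', 'low', 'low', 'medium']
print(by_rank)       # ['low', 'low', 'medium', 'high']
```

For a nominal variable such as country, no such mapping exists, so only counting and grouping are meaningful.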

Datetime and Text

Datetime variables store dates and times, so you can sort by time and compute durations. Text variables are free-form strings like reviews and usually need extra processing.
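As a small sketch of what "computing durations" looks like, the example below subtracts two timestamps with Python's standard `datetime` module. The timestamps are made up for illustration.

```python
from datetime import datetime

# Hypothetical order and delivery timestamps in ISO 8601 format.
ordered_at = datetime.fromisoformat("2026-05-01T09:30:00")
delivered_at = datetime.fromisoformat("2026-05-04T15:00:00")

# Subtracting two datetimes yields a timedelta (a duration).
duration = delivered_at - ordered_at

print(duration.days)                    # whole days elapsed
print(duration.total_seconds() / 3600)  # total hours elapsed
```

This only works once the values are stored as real datetimes; the same subtraction on date strings would fail.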

Step 5: Classify These Variables

For each column below, decide which type it is. Think before checking the answers.

`age_years` in a health survey: values like 18, 19, 20, ...
`satisfaction_score` from 1 to 5 (where 1 = very unhappy, 5 = very happy)
`country` of residence
`order_timestamp` in an e-commerce dataset
`review_text` describing a product

Scroll down for suggested answers.

Suggested answers:

`age_years`: Numeric, discrete (integer count of years; often treated like continuous in practice)
`satisfaction_score`: Usually ordinal categorical (ordered levels 1–5). Some analyses also treat it as numeric.
`country`: Categorical, nominal (no natural order).
`order_timestamp`: Datetime.
`review_text`: Text (unstructured).

Step 6: Basic Data Quality Issues

Messy Data Is Normal

Real datasets are messy. Before analysis, you must detect issues like missing values, outliers, inconsistencies, duplicates, and type errors.

Missing Values and Outliers

Missing values are absent entries; outliers are values far from most others. Both can break models or bias results if ignored.
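The sketch below counts missing entries and flags outliers with the 1.5 × IQR rule, which is one common rule of thumb and not the only way to define an outlier. The `amounts` list is hypothetical.

```python
import statistics

# Hypothetical order amounts; None marks a missing entry.
amounts = [49.99, 79.99, None, 120000, -10, 35.0, 60.0, 55.5]

n_missing = sum(1 for a in amounts if a is None)
present = [a for a in amounts if a is not None]

# Quartiles of the non-missing values (Q1, median, Q3).
q1, _median, q3 = statistics.quantiles(present, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged for review, not deleted automatically.
outliers = [a for a in present if a < lower or a > upper]

print(n_missing)
print(outliers)
```

Note that the negative amount is not flagged by this rule, so a separate domain check (for example, requiring amounts to be non-negative) may still be needed.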

Inconsistencies and Duplicates

Inconsistencies are the same concept recorded differently. Duplicates are repeated records that can double-count events.
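A minimal sketch of both checks, using only the standard library. The alias mapping `COUNTRY_ALIASES` and the helper `normalize_country` are hypothetical names introduced for this example; in practice the mapping has to be built by inspecting the actual values.

```python
from collections import Counter

# Hypothetical mapping from raw spellings to one canonical label.
COUNTRY_ALIASES = {
    "usa": "USA",
    "united states": "USA",
    "u.s.": "USA",
}

def normalize_country(raw):
    """Map known spelling variants to a canonical label; leave others unchanged."""
    key = raw.strip().lower()
    return COUNTRY_ALIASES.get(key, raw.strip())

countries = ["USA", "United States", "U.S.", "USA", "usa"]
order_ids = [1, 2, 3, 3, 4]

normalized = [normalize_country(c) for c in countries]
print(Counter(countries))   # five rows split across four spellings
print(Counter(normalized))  # all five rows grouped under "USA"

# Any order_id that occurs more than once is a candidate duplicate.
duplicate_ids = [oid for oid, n in Counter(order_ids).items() if n > 1]
print(duplicate_ids)
```

Before normalization, a count by country would report four different groups for what is really one country.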

Type Errors

Type errors happen when a column has the wrong data type in software, like numbers stored as text, blocking correct calculations.
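The following sketch shows why numbers stored as text block calculations, and one cautious way to convert them. The helper `to_number` and the sample values are assumptions for illustration.

```python
def to_number(value):
    """Convert a text value to float, or return None if it is not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

# Hypothetical column where numbers were stored as text.
amounts_as_text = ["49.99", "79.99", "", "120000", "N/A"]

# Adding strings concatenates them instead of summing.
concatenated = "49.99" + "79.99"
print(concatenated)  # '49.9979.99'

# After conversion, non-numeric entries become None and the rest can be summed.
amounts = [to_number(v) for v in amounts_as_text]
total = sum(a for a in amounts if a is not None)
print(amounts)
print(total)
```

Entries that fail to convert become explicit missing values, which is usually safer than letting them silently break a calculation.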

Step 7: A Messy Mini-Dataset

A Tiny Orders Table

Consider a small orders table with columns like order_date, amount, and country. It contains several realistic issues to spot.

Visible Problems

You can see inconsistent date formats, a missing amount, a duplicate order_id, a huge outlier amount, and a suspicious negative amount.

More Subtle Issues

There is also an invalid date with month 13 and inconsistent country values like USA, United States, U.S., and usa.
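A date like month 13 can be caught by actually parsing the value rather than trusting its appearance. The sketch below tries two formats, which are assumptions about how the dates were written; in particular, reading `01/05/2026` as day/month/year is a guess that would need confirming with whoever produced the data.

```python
from datetime import datetime

# Formats assumed for this sketch; real data may need others.
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y"]

def parse_date(text):
    """Try each known format; return a date, or None if none match."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None

raw_dates = ["2026-05-01", "01/05/2026", "2026-05-03", "2026-13-01"]
parsed = [parse_date(d) for d in raw_dates]
print(parsed)
```

The entry with month 13 comes back as `None`, turning an invalid date into a visible missing value that can be investigated.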

Lesson

This is typical of real data. Inspecting and cleaning such problems is a crucial early step before any analysis or modeling.

Step 8: Peeking at Data with Code (Python + pandas)

Here is a short Python example using `pandas`, a popular data analysis library. It shows how to quickly inspect structure and quality.

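```python
import pandas as pd

# Create a small DataFrame similar to the example
data = {
    "order_id": [1, 2, 3, 3, 4],
    "user_id": [101, 102, 102, 102, 103],
    "order_date": [
        "2026-05-01",
        "01/05/2026",
        "2026-05-03",
        "2026-05-03",  # assumed: the source lists only four dates; repeated here for the duplicated order_id 3
        "2026-13-01",  # invalid (month 13)
    ],
    "amount": [49.99, 79.99, None, 120000, -10],
    "country": ["USA", "United States", "U.S.", "USA", "usa"],
}

df = pd.DataFrame(data)

# First rows, to see what the table looks like
print("\nHead of data:")
print(df.head())

# Data type pandas assigned to each column
print("\nColumn types:")
print(df.dtypes)

# Number of missing entries in each column
print("\nMissing values per column:")
print(df.isna().sum())

# Summary statistics (count, mean, min, max, ...) for numeric columns
print("\nBasic statistics for numeric columns:")
print(df.describe())

# Distinct spellings in the country column
print("\nUnique country values:")
print(df["country"].unique())
```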

What this does:

`df.dtypes` shows data types (you may see `object` for text/date-like columns before conversion).
`df.isna().sum()` counts missing values.
`df.describe()` gives min, max, mean, and more for numeric columns (helps spot outliers).
`df["country"].unique()` reveals inconsistent categories.

You do not need to memorize this code now. The important idea is: tools like pandas make it easy to quickly scan structure and quality issues.
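As a possible next step after inspection, the sketch below shows how a few of these issues might be handled in pandas. It builds its own small, hypothetical DataFrame; the single date format and the alias mapping are assumptions, not a recommended recipe for every dataset.

```python
import pandas as pd

# Small illustrative frame; values mirror the kinds of issues described above.
df = pd.DataFrame({
    "order_id": [1, 2, 3, 3],
    "order_date": ["2026-05-01", "2026-05-02", "2026-05-03", "2026-13-01"],
    "amount": ["49.99", "79.99", "n/a", "-10"],
    "country": ["USA", "United States", "U.S.", "usa"],
})

# Unparseable entries become NaT / NaN instead of raising an error.
df["order_date"] = pd.to_datetime(df["order_date"], format="%Y-%m-%d", errors="coerce")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Flag every row whose order_id occurs more than once.
df["is_duplicate_id"] = df.duplicated(subset="order_id", keep=False)

# Collapse spelling variants into one label (the mapping is an assumption;
# values not in the mapping would become NaN).
aliases = {"usa": "USA", "united states": "USA", "u.s.": "USA"}
df["country"] = df["country"].str.strip().str.lower().map(aliases)

print(df)
print(df.isna().sum())
```

Coercing bad values to missing keeps the pipeline running, but the rows it affects should still be reviewed rather than silently dropped.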

Step 9: Quick Knowledge Check

Answer this question to check your understanding of variable types and data quality.

You have a column `delivery_time_days` that measures how many days it took for an order to arrive (0, 1, 2, 3, ...). You also see a few values like 999. Which statement is most accurate?

A) It is a numeric, discrete variable, and values like 999 are likely outliers or data errors.
B) It is a categorical, nominal variable, and 999 is just another valid category.
C) It is unstructured text data, and 999 means the value is missing.

Answer: A) It is a numeric, discrete variable, and values like 999 are likely outliers or data errors.

Delivery time in days is a count, so it is numeric and discrete. Very large values like 999, compared with typical delivery times, are likely outliers or coding errors (for example, using 999 as a placeholder for missing).
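To see how much a placeholder code can distort a summary, the sketch below compares the mean with and without it. The data are hypothetical, and treating 999 as "missing" is an assumption that should be confirmed against the dataset's documentation.

```python
# Hypothetical delivery times; 999 is assumed here to be a placeholder code.
SENTINEL = 999
delivery_time_days = [2, 3, 1, 999, 4, 999, 2]

# Replace the placeholder with None so it cannot distort statistics.
cleaned = [None if d == SENTINEL else d for d in delivery_time_days]
valid = [d for d in cleaned if d is not None]

mean_raw = sum(delivery_time_days) / len(delivery_time_days)
mean_clean = sum(valid) / len(valid)

print(round(mean_raw, 1))  # mean with the placeholder values included
print(mean_clean)          # mean of the genuine delivery times only
```

Two placeholder values out of seven are enough to push the average from a few days to several hundred.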

Step 10: Review Key Terms

Use these flashcards to review the main concepts from this module.

Structured data: Data organized in tables with rows and columns, where each column has a defined type (for example, a spreadsheet of sales with price and date).
Semi-structured data: Data with some organization (like JSON or XML) but not a fixed table schema; different records can have different fields.
Unstructured data: Data without a predefined tabular structure, such as free text, images, or audio, which must be transformed before most analyses.
Continuous variable: A numeric variable that can take any value within a range (for example, height, temperature, price).
Discrete variable: A numeric variable that takes separate values, often counts or integers (for example, number of purchases).
Categorical variable: A variable that represents groups or labels, not amounts; can be nominal (unordered) or ordinal (ordered).
Ordinal variable: A categorical variable with a clear order (for example, low/medium/high), where the gaps between levels are not necessarily equal.
Datetime variable: A variable that stores dates, times, or both, allowing operations like sorting by time and computing durations.
Missing value: An expected data point that is absent, often shown as NA, null, blank, or a special placeholder code.
Outlier: A value that is very far from most others; may represent a rare event or a data error.
Inconsistency: The same concept recorded in different ways (for example, "USA", "United States", "U.S."), which can break grouping and counting.
Duplicate record: The same row or event stored more than once in a dataset, leading to double-counting.

Key Terms

outlier: A data point that is very different from most others in the dataset, potentially indicating an error or rare event.
duplicate: A record that appears more than once in a dataset, which can distort counts and statistics.
text data: Unstructured strings of characters, such as comments or reviews, that often require extra processing before analysis.
type error: A mismatch between the stored data and its intended type, such as numbers stored as text.
inconsistency: A situation where the same information is recorded in different ways, making analysis harder.
missing value: An entry in a dataset where a value was expected but is not present.
structured data: Data stored in a clear row-and-column format, such as spreadsheets or relational databases, where each column has a defined type.
nominal variable: A categorical variable with no inherent order among its categories, such as colors or countries.
numeric variable: A variable that represents quantities or counts and can be used in arithmetic operations.
ordinal variable: A categorical variable whose categories have a natural order, such as ratings from 1 to 5 or small/medium/large.
datetime variable: A variable that stores dates, times, or both, enabling time-based operations like sorting and duration calculation.
discrete variable: A numeric variable that takes separate, often integer, values, such as number of items bought.
unstructured data: Data without a predefined structure, such as text documents, images, or audio recordings.
continuous variable: A numeric variable that can take any value within a range, such as height or temperature.
categorical variable: A variable that represents categories or labels rather than amounts, such as country or payment method.
semi-structured data: Data that has some structure (like key-value pairs in JSON) but does not fit neatly into a fixed table schema.