From Real-World Questions to Data Problems — Data Science Foundations: From Raw Data to Insight

From mess to model: what this module is about

Messy real-world questions

In real life, questions are messy: "Are our customers happy?" "Is this drug effective?" "How can we use AI to improve our app?" These are important but too vague for data analysis or machine learning.

Why translation is needed

To use data well, we must translate messy questions into clear data problems. That means turning fuzzy ideas into precise goals that data and models can actually work on.

Link to earlier modules

You have seen how to think in chances and uncertainty, and how analytics differs from machine learning and AI. Now we connect those ideas to practice: framing real questions as data problems.

Your new skills

You will learn to spot fuzzy questions, rewrite them as precise goals, and identify four pieces: what we do (task), the target we care about, the features we use, and how we judge success (metric).

Step 1: Clarify the real-world goal

Start with the real goal

Before thinking about data, ask: "If this project works, what changes in the real world?" Focus on the outcome, not the tool or algorithm.

Common real-world goals

Typical goals: increase something (revenue, scores), reduce something (churn, waiting time), understand something (user types), or choose between options (which ad or treatment).

Business example

Messy: "Can we use AI to boost sales?" Clearer: "We want more visitors to our website to actually buy something." That is a concrete real-world change.

Research example

Messy: "Is this new teaching method good?" Clearer: "We want students to score higher on the final exam without increasing study time."

Mini-checklist

Ask: Who cares about this question? What do they want more or less of? What decision will this analysis support? If you cannot answer, you are not ready for data.

Step 2: Choose the type of data problem

Pick the task type

After clarifying the goal, choose the kind of data task: prediction, classification, estimation, segmentation, ranking, or causal comparison. This links your question to the right tools.

Prediction and classification

Prediction asks: given what we know now, what is likely to happen? Classification is the special case where the answer is a category, such as spam vs. not spam.

Estimation and segmentation

Estimation asks how big an effect or number is. Segmentation (or clustering) groups similar cases, like finding user types or customer segments.

Ranking and causal questions

Ranking decides the order of options, like recommendations. Causal comparison asks if changing X causes Y to change, often using experiments or careful study design.

Example: boosting sales

Goal: more visitors should buy. Task: predict which visitors are likely to buy, so we can, for example, show them special offers or support.

Step 3: Define the target in plain language

What is a target?

The target (or outcome, or label) is the thing you are trying to predict or explain. It is the main result your data problem focuses on.

Good targets

Good targets are specific, observable, and time-bound. You should be able to say exactly how and when they are measured for each person or case.

Target examples

Examples: Did this user buy within 7 days? What was the student's final exam score? Did the patient return to the hospital within 30 days?
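
As a rough sketch of the first example, a time-bound target can be written as a small rule. This assumes each user has a first-visit timestamp and a list of purchase timestamps; the function name, variable names, and dates below are hypothetical and only illustrate the idea.

```python
from datetime import datetime, timedelta

def bought_within_window(first_visit, purchase_times, window_days=7):
    """Return True if any purchase falls within window_days of the first visit."""
    deadline = first_visit + timedelta(days=window_days)
    return any(first_visit <= t <= deadline for t in purchase_times)

# Hypothetical users who all first visited at the same moment.
visit = datetime(2024, 3, 1, 10, 0)

label_a = bought_within_window(visit, [datetime(2024, 3, 5, 9, 30)])   # bought on day 4
label_b = bought_within_window(visit, [datetime(2024, 3, 20, 9, 30)])  # bought on day 19
label_c = bought_within_window(visit, [])                              # never bought

print(label_a, label_b, label_c)
```

Writing the rule down this way forces you to state the start of the window, its length, and what counts as the event, which is exactly what "specific, observable, and time-bound" asks for.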

From vague to concrete

Instead of "Are our customers happy?" use concrete targets like: Did they rate us 4 or 5 stars in the last month, or did they stay subscribed for at least 6 months?

Target and chance

Once the target is clear, you can talk about the chance it happens, like the chance a user buys. This is where your probability thinking becomes practical.

Step 4: Define features (inputs) and constraints

What are features?

Features are the pieces of information you use to predict or explain the target. They are the inputs to your analysis or model.

Before vs after

Think of features as what you know before the outcome, and the target as what you learn after. For prediction, only use information available at decision time.

Feature examples

Predicting purchases: device type, pages viewed, time on site. Predicting exam scores: homework scores, attendance. Predicting readmission: age, diagnosis, lab results.

Constraints on features

Check timing (is it known in time?), cost (is it expensive to collect?), and ethics and law (is it acceptable and legal to use, given privacy rules such as GDPR in the EU?).

Good vs bad feature

Target: buy within 7 days. Good feature: pages viewed in first session, because it is known at decision time. Bad feature: total amount spent in the next 30 days, because it is only known after the decision and already reveals the outcome.
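
One way to make the "before vs after" check concrete is to record when each candidate feature becomes known and compare it with the decision time. This is a minimal sketch; the feature names and timestamps are made up for illustration.

```python
from datetime import datetime

def is_known_in_time(known_at, decision_time):
    """A feature is usable only if its value exists at or before decision time."""
    return known_at <= decision_time

# Assumption: we decide whether to show an offer at the end of the first session.
decision_time = datetime(2024, 3, 1, 10, 30)

# Hypothetical features mapped to the moment their value becomes known.
candidate_features = {
    "pages_viewed_first_session": datetime(2024, 3, 1, 10, 30),
    "total_spent_next_30_days": datetime(2024, 3, 31, 10, 30),
}

allowed = [name for name, known_at in candidate_features.items()
           if is_known_in_time(known_at, decision_time)]
rejected = [name for name in candidate_features if name not in allowed]

print("allowed:", allowed)
print("rejected:", rejected)
```

The same check does not cover cost, legal, or ethical constraints; those still need a human review.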

Step 5: Choose an evaluation metric in simple terms

What is a metric?

The evaluation metric is a rule that gives your solution a score. It says how well your predictions or analysis match reality and your real-world goal.

Simple metrics

Basic metrics include accuracy (the fraction of predictions that are correct), precision (the fraction of predicted positives that are truly positive), recall (the fraction of actual positives that we catch), and average error for numeric predictions.
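
The definitions above can be computed by hand from a list of actual and predicted values. The sketch below uses plain Python and small made-up lists (1 = positive, 0 = negative); the function names are illustrative, not from any particular library.

```python
def classification_metrics(actual, predicted):
    """Compute accuracy, precision, and recall from 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # caught positives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false alarms
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # missed positives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # correct negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

def mean_absolute_error(actual, predicted):
    """Average size of the error for numeric predictions."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up example: 3 real positives, the model flags 3 cases and gets 2 right.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
scores = classification_metrics(actual, predicted)
print(scores)  # accuracy 0.75, precision and recall both 2/3

# Made-up exam scores: errors of 3, 3, and 0 points.
exam_error = mean_absolute_error([70, 80, 90], [73, 77, 90])
print(exam_error)  # 2.0
```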

Business or impact metrics

Beyond technical scores, we care about impact: extra profit, cost saved, time saved, or health outcomes. These link metrics back to the real-world goal.

Churn example

Goal: reduce churn. Task: predict who will churn. Target: did the customer cancel within the next 30 days? Metric: we might focus on recall, so we catch as many likely churners as possible.

Dual tracking

In practice, teams track both technical metrics (accuracy, recall) and impact metrics (money or outcomes). Together they show if the model is both good and useful.

Worked example: Turning a messy question into a data problem

Messy question

Messy: "Can we use AI to improve our university's dropout problem?" This is vague. We need to turn it into a clear data problem with a target, features, and metrics.

Goal and task

Real goal: fewer students drop out before finishing. Data task: predict which students are at high risk of dropping out so the university can offer support.

Target definition

Target: Did the student drop out within the next academic year? Yes or No. It is specific, observable, and tied to a clear time window.

Features and constraints

Features: high school grades, first-semester GPA, attendance, failed courses, financial aid. Constraints: only use data known in time and avoid unethical or restricted features.

Metrics and decisions

Metrics: recall and precision for at-risk students, plus change in dropout rate. Decisions: advisors contact high-risk students and invite them to support programs.

Final framed problem

We now have a framed problem: predict dropout risk using allowed features, evaluate with recall, precision, and dropout reduction, under timing and ethics constraints.
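
If it helps, the framed problem can be written down as a small record so that nothing is forgotten. This is only a sketch: the `FramedProblem` class and its field names are hypothetical, and the values are copied from the dropout example above.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FramedProblem:
    goal: str
    task: str
    target: str
    features: List[str] = field(default_factory=list)
    metrics: List[str] = field(default_factory=list)
    constraints: List[str] = field(default_factory=list)

    def missing_parts(self) -> List[str]:
        """Return the names of any parts that are still empty."""
        return [name for name, value in vars(self).items() if not value]

dropout_problem = FramedProblem(
    goal="Fewer students drop out before finishing",
    task="Predict which students are at high risk of dropping out",
    target="Did the student drop out within the next academic year? (Yes/No)",
    features=["high school grades", "first-semester GPA", "attendance",
              "failed courses", "financial aid"],
    metrics=["recall", "precision", "change in dropout rate"],
    constraints=["only data known in time", "no unethical or restricted features"],
)
print(dropout_problem.missing_parts())  # [] because every part is filled in

# A messy question that has not been framed yet leaves most parts empty.
vague_problem = FramedProblem(goal="Make public transport better", task="", target="")
print(vague_problem.missing_parts())
```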

Your turn: Frame a data problem

Try this thought exercise. There is no single perfect answer; focus on being clear and concrete.

Scenario:

A city government says: "We want to use data and AI to make our public transport better."

Clarify the real-world goal

Write down one specific, measurable goal. For example, "reduce average bus waiting time" or "increase on-time arrivals".

Choose the type of data problem

Is your goal mainly about prediction, classification, estimation, segmentation, ranking, causal comparison, or something else?

Define a target

Describe a clear target in one sentence, including the time window. For example, "Was the bus more than 5 minutes late on this trip?" (Yes/No).

List 3–5 possible features

Only include information that could be known before the bus trip starts (for prediction) or at the time of decision.

Pick a simple metric

State how you would judge success. For example, "percentage of trips arriving on time" or "average waiting time in minutes".

Write your answers in a notebook or text editor. If you like, try framing a second version with a different goal (for example, "increase passenger satisfaction") and notice how the target, features, and metric change.

Common framing pitfalls: leaky targets and vagueness

Pitfall 1: Leakage

Leakage happens when features use information that would not be available at decision time, often from the future. It makes models look great in tests but fail in real life.

Leakage example

Predicting readmission using 'number of follow-up visits after discharge' is leakage, because that number is not yet known at discharge, when the prediction has to be made.

Pitfall 2: Vague targets

If your target is fuzzy, like 'engagement' with no definition, people on the team will interpret it differently. You must agree on a precise meaning, such as time on site or number of comments.

Pitfall 3: Bad metrics

A metric can be misleading. When the event is rare, a model can reach high accuracy while missing most of the cases that matter. Metrics must match the real-world goal.
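
A quick calculation shows how this happens. The numbers below are invented for illustration: 1,000 patients, of whom 20 are readmitted, and a "model" that simply predicts that nobody is readmitted.

```python
# Hypothetical numbers, chosen only to illustrate the pitfall.
total_patients = 1000
readmitted = 20

# A model that always predicts "not readmitted":
true_positives = 0                            # it never flags anyone
false_negatives = readmitted                  # so it misses every readmission
true_negatives = total_patients - readmitted  # everyone else is "correct"
false_positives = 0

accuracy = (true_positives + true_negatives) / total_patients
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.98
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

The accuracy looks excellent, yet the model catches none of the patients the hospital cares about, which is why recall (or an impact metric) is a better fit for this goal.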

Pitfall 4: Ignoring constraints

Using illegal, unethical, or too costly features can block deployment. Modern privacy laws, like the GDPR in Europe, restrict using some personal data.

Habits to avoid pitfalls

Ask if each feature is known in time, define and share your target clearly, align metrics with stakeholder goals, and check your data use against current laws and ethics.

Quick check: Targets, features, and metrics

Test your understanding of the core ideas.

You want to help a hospital reduce 30-day readmissions. Which option best describes a good target, a feature, and a metric for this data problem?

A) Target: patient satisfaction score; Feature: number of readmissions in the next 30 days; Metric: model accuracy
B) Target: whether a patient is readmitted within 30 days (Yes/No); Feature: lab results at discharge; Metric: percentage of high-risk patients correctly identified
C) Target: hospital revenue; Feature: patient age; Metric: number of patients in the hospital

Answer: B) Target: whether a patient is readmitted within 30 days (Yes/No); Feature: lab results at discharge; Metric: percentage of high-risk patients correctly identified

Option B is best. The target is clear and time-bound (readmitted within 30 days). The feature (lab results at discharge) is known at decision time. The metric (percentage of high-risk patients correctly identified) matches the goal of catching readmissions. Option A uses future information as a feature (leakage). Option C's target and metric do not match the goal of reducing readmissions.

Review key terms

Use these flashcards to review the main concepts from this module.

Real-world goal: A concrete change you want to see in the real world, such as fewer dropouts, more sales, or lower waiting times. It comes before any data or models.
Data task type: The general kind of problem you are solving with data, such as prediction, classification, estimation, segmentation, ranking, or causal comparison.
Target (outcome, label): The specific thing you are trying to predict or explain, which is observable and usually tied to a clear time window.
Features (inputs): Pieces of information known before the outcome, used to predict or explain the target, subject to timing, cost, legal, and ethical constraints.
Evaluation metric: A rule for scoring how well your solution works, such as accuracy, recall, average error, or a business impact measure.
Data leakage: A framing error where features include information that would not be available at decision time, often leading to unrealistically good performance in testing.
Vague target: A poorly defined outcome, such as undefined 'engagement' or 'happiness', that makes it hard to design, test, or interpret a data solution.

Key Terms

Recall: Of all actual positive cases, the fraction that the model correctly identifies as positive.
Precision: Of the cases predicted as positive, the fraction that are actually positive.
Data leakage: Using information in training or features that would not be available when making real predictions, often leading to misleadingly high performance.
Data task type: The broad category of problem you are solving with data, like prediction, classification, estimation, segmentation, ranking, or causal comparison.
Real-world goal: A specific, practical change you want to achieve outside the computer, such as reducing churn or improving exam scores.
Causal comparison: A type of analysis that aims to determine whether changing one factor (like a treatment or policy) causes a change in an outcome.
Evaluation metric: A numeric measure of how well a model or analysis performs with respect to the target and the real-world goal.
Features (inputs): The data you use to predict or explain the target, which must be available at decision time and acceptable to use under legal and ethical rules.
Target (outcome, label): The variable or event you want to predict or explain; it should be clearly defined, observable, and often time-bound.
Segmentation (clustering): Grouping similar individuals or items based on their features, without predefined labels, to understand types or segments.