The Data Science Workflow: From Question to Value — Data Science Foundations: From Raw Data to Insight

1. From Vague Question to Clear Problem

From Vague to Clear

Data science rarely starts as a clean math problem. It often begins with a vague question like "Are our users happy?" or "How can we make more money?" The first job is to turn this into a clear, answerable problem.

Three Levels of a Question

1. Business question: big picture, e.g., "How can we reduce customer churn?"
2. Analytical question: how data can help, e.g., "Can we predict who will churn?"
3. Data question: what we actually do, e.g., "Can we train a weekly churn model?"

Work With Stakeholders

Data scientists work with business stakeholders to define success, constraints, and metrics. Ask: What counts as a win? What limits do we have (budget, time, privacy)? Which metric will we try to move?

Common Failure Modes

Projects fail when questions are too vague ("Use AI to improve the app"), when there is no clear success metric, or when stakeholders do not agree on the goal from the start.

Quick Checklist

For a good problem definition, answer: Who is this for? What decision will this help? What metric will we improve, and roughly by how much? By when do we need a first result?

2. Practice: Sharpen the Question

Try this thought exercise.

You are a data scientist at a food delivery app.

A manager says: "Use data to make customers happier."

Rewrite this as a business question.
Then as an analytical question.
Then as a data question.

Write your answers (mentally, on paper, or in a text editor), then compare with these possible answers:

Business question:

"How can we reduce the number of customers who stop ordering after a bad delivery experience?"

Analytical question:

"Can we identify customers who had a bad experience and are at high risk of not ordering again in the next 30 days?"

Data question:

"Using order history, delivery delays, and complaint logs, can we build a weekly score for each customer that estimates their risk of not ordering again?"

3. Understanding Stakeholders and Value

Why Stakeholders Matter

A data science project only creates value if it helps someone make a better decision or powers a useful product. So you must understand who cares about the result and how they will use it.

Typical Stakeholders

Stakeholders include business leaders, product managers, engineers, legal/privacy teams, and end users. Each group has different goals: revenue, UX, reliability, compliance, or usability.

Key Early Questions

Ask: Who will use the result? How often? What decision will change because of it? These answers shape everything else: accuracy needs, speed, and allowed data.

Example: Churn Prediction

In a churn project, marketing wants a list of customers to target, finance wants better revenue forecasts, and engineering must integrate the model. Understanding these different needs, and where they align, keeps the project useful.

4. Data Discovery and Collection

Data Discovery

After defining the problem, you ask: What data do we have, and what data do we need? This is the data discovery or data collection stage of the workflow.

Where Data Comes From

You explore internal databases, logs, and transactions, request access from data engineers, collect new data (like surveys), or use external datasets and APIs, respecting licenses and privacy rules.

Check Relevance and Legality

For each source, ask: Is it relevant? Is it available for the users and time we care about? Is it legal and ethical to use, for example under GDPR and similar privacy laws?

Example Sources for Churn

For churn, you might use sign-up and cancel dates, subscription plans, app usage logs, support tickets, and payment failures. Together they help explain why users leave.

Common Data Problems

You may find that key data does not exist yet, is hard to access, poorly documented, or biased. Here you often work closely with data engineers and IT/security teams.

5. Example: Small Data Discovery in Python

Exploring a Dataset

Here is a small Python example using pandas to explore a churn-like dataset. Focus less on syntax and more on the questions we ask about the data.

Key Questions in Code

We ask: What columns do we have? How many rows? Any missing values? What are basic summaries? How does sessions per week differ between churned and active users?
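
The code for this step is not reproduced in this text, so the following is only a minimal sketch of what such checks might look like. The tiny inline dataset and its column names (`user_id`, `plan`, `sessions_per_week`, `churned`) are made up for illustration; in practice you would load a real table, for example with `pd.read_csv`.

```python
import pandas as pd

# Hypothetical churn-like dataset, built inline so the example is self-contained.
df = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 4, 5, 6],
        "plan": ["basic", "pro", "basic", None, "pro", "basic"],
        "sessions_per_week": [5.0, 1.0, 4.0, 0.5, None, 6.0],
        "churned": [0, 1, 0, 1, 1, 0],
    }
)

# What columns do we have, and how many rows?
print(df.columns.tolist())
n_rows, n_cols = df.shape
print(n_rows, "rows,", n_cols, "columns")

# Any missing values? Count them per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Basic summaries of the numeric columns.
print(df.describe())

# How does sessions per week differ between churned and active users?
# mean() skips missing values by default.
mean_sessions = df.groupby("churned")["sessions_per_week"].mean()
print(mean_sessions)
```

Each call maps to one of the questions above. None of them changes the data; they only describe it.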

Why This Matters

These quick checks are part of data discovery and initial exploration. They help you see structure, quality, and rough patterns before deeper modeling.

6. Cleaning, Exploring, and Feature Building

Messy to Usable Data

Real-world data is messy. The longest stage is often cleaning, exploring, and turning raw data into useful features that models and humans can work with.

Data Cleaning

Cleaning means fixing invalid values, handling missing data, and standardizing formats like dates and categories so they are consistent and usable.
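
As a rough illustration of these three kinds of fixes, here is a small sketch in plain Python. The records, the two date formats, and the rule that a negative session count is invalid are all assumptions made for this example, not part of any real dataset.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical raw records with the kinds of problems described above.
raw_rows = [
    {"user_id": "1", "signup_date": "2024-01-15", "plan": " Pro ", "sessions_per_week": "4"},
    {"user_id": "2", "signup_date": "15/02/2024", "plan": "BASIC", "sessions_per_week": ""},
    {"user_id": "3", "signup_date": "2024-03-01", "plan": "basic", "sessions_per_week": "-2"},
]

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # assumed formats for this example


def parse_date(text: str) -> Optional[date]:
    """Standardize dates: try each known format, return None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def parse_sessions(text: str) -> Optional[float]:
    """Treat blanks and negative counts as missing instead of guessing a value."""
    if text.strip() == "":
        return None
    try:
        value = float(text)
    except ValueError:
        return None
    return value if value >= 0 else None


def clean_row(row: dict) -> dict:
    return {
        "user_id": int(row["user_id"]),
        "signup_date": parse_date(row["signup_date"]),
        "plan": row["plan"].strip().lower(),  # standardize category labels
        "sessions_per_week": parse_sessions(row["sessions_per_week"]),
    }


clean_rows = [clean_row(row) for row in raw_rows]
for row in clean_rows:
    print(row)
```

Marking bad values as missing, rather than silently replacing them, keeps the decision about how to fill them visible for the later exploration step.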

Exploratory Data Analysis

EDA uses plots and summaries to understand distributions and relationships, and to check for issues like data leakage or strange outliers.

Feature Engineering

Feature engineering turns raw logs into meaningful variables, like sessions_per_week, failed_payments_last_3_months, or has_recent_complaint for a churn model.
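
One way such features could be derived from an event log is sketched below. The log format, the event names, and the window lengths (4 weeks, 90 days, 14 days) are assumptions chosen for illustration. Events dated after the scoring day are skipped so that the features do not leak future information.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical raw event log: (user_id, event_type, event_date).
events = [
    (1, "session", date(2024, 5, 1)),
    (1, "session", date(2024, 5, 3)),
    (1, "session", date(2024, 5, 20)),
    (1, "payment_failed", date(2024, 4, 10)),
    (2, "session", date(2024, 5, 27)),
    (2, "complaint", date(2024, 5, 25)),
    (2, "payment_failed", date(2023, 12, 1)),  # too old to count
]

as_of = date(2024, 5, 28)              # the day the features are computed
session_window_weeks = 4               # assumed window for sessions per week
payment_window = timedelta(days=90)    # approximates "last 3 months"
complaint_window = timedelta(days=14)  # assumed meaning of "recent"

session_counts = defaultdict(int)
failed_payments = defaultdict(int)
recent_complaint = defaultdict(bool)

for user_id, event_type, event_date in events:
    age = as_of - event_date
    if age < timedelta(0):
        continue  # ignore events from the future to avoid leakage
    if event_type == "session" and age <= timedelta(weeks=session_window_weeks):
        session_counts[user_id] += 1
    elif event_type == "payment_failed" and age <= payment_window:
        failed_payments[user_id] += 1
    elif event_type == "complaint" and age <= complaint_window:
        recent_complaint[user_id] = True

# One row of features per user.
user_ids = sorted({user_id for user_id, _, _ in events})
features = {
    user_id: {
        "sessions_per_week": session_counts.get(user_id, 0) / session_window_weeks,
        "failed_payments_last_3_months": failed_payments.get(user_id, 0),
        "has_recent_complaint": recent_complaint.get(user_id, False),
    }
    for user_id in user_ids
}

for user_id, row in features.items():
    print(user_id, row)
```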

Collaboration and Impact

Data scientists work with domain experts and data engineers here. Clean data and good features often matter more than using the most complex model.

7. Quick Check: Cleaning vs Modeling

Test your understanding of where projects often fail.

A team jumps straight into training a complex neural network on a messy dataset with many missing values and unknown column meanings. What is the main risk?

A) They will violate data privacy laws automatically.
B) The model may be inaccurate or misleading because the data is not well understood or cleaned.
C) Complex models always fix data quality issues, so there is no real risk.

Answer: B) The model may be inaccurate or misleading because the data is not well understood or cleaned.

Complex models cannot fix bad data. Without proper cleaning and understanding of each column, the model may learn patterns that are wrong or unstable, leading to misleading results.

8. Modeling, Evaluation, and the Iterative Loop

When Modeling Happens

Modeling usually comes after you define the problem and prepare the data. You often start with simple baseline models before trying anything complex.

Evaluating Models

You pick metrics that match the business goal, use train/validation/test splits to detect overfitting, and check performance across user groups to catch unfair bias.
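
The sketch below shows one possible shape of this evaluation loop. Everything in it is an assumption for illustration: the data is synthetic, the "model" is a simple threshold rule standing in for a real one, and the 60/20/20 split is just a common choice.

```python
import random

rng = random.Random(0)  # fixed seed so the split is reproducible

# Synthetic, made-up data: (sessions_per_week, plan, churned).
# Low-usage users are made more likely to churn so there is a pattern to learn.
rows = []
for _ in range(300):
    sessions = rng.uniform(0, 10)
    plan = rng.choice(["basic", "pro"])
    churn_probability = 0.8 if sessions < 3 else 0.1
    churned = 1 if rng.random() < churn_probability else 0
    rows.append((sessions, plan, churned))

# Train/validation/test split (60/20/20) after shuffling.
rng.shuffle(rows)
train, validation, test = rows[:180], rows[180:240], rows[240:]


def accuracy(predict, data):
    return sum(predict(s) == y for s, _, y in data) / len(data)


def recall(predict, data):
    """Share of actual churners that the model flags."""
    churners = [s for s, _, y in data if y == 1]
    if not churners:
        return None  # undefined when there are no positives
    return sum(predict(s) == 1 for s in churners) / len(churners)


# Baseline: always predict the majority class seen in training.
majority_class = 1 if 2 * sum(y for _, _, y in train) > len(train) else 0


def baseline(sessions):
    return majority_class


def make_rule(threshold):
    def rule(sessions):
        return 1 if sessions < threshold else 0
    return rule


# "Training": pick the threshold that fits the training set best.
candidate_thresholds = [1, 2, 3, 4, 5]
best_threshold = max(candidate_thresholds, key=lambda t: accuracy(make_rule(t), train))
model = make_rule(best_threshold)

# Validation: decide whether the model is worth keeping over the baseline.
print("validation accuracy, baseline:", accuracy(baseline, validation))
print("validation accuracy, model:   ", accuracy(model, validation))

# Test: a final check on data not used for any earlier decision.
test_accuracy = accuracy(model, test)
test_recall = recall(model, test)
print("test accuracy:", test_accuracy, "test recall:", test_recall)

# Check performance per user group to spot gaps between groups.
recall_by_plan = {
    plan: recall(model, [row for row in test if row[1] == plan])
    for plan in ("basic", "pro")
}
print(recall_by_plan)
```

A large gap between training and test performance would suggest overfitting, and a large gap between groups would be a reason to look more closely at fairness.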

Iteration, Not a Line

If results are poor, you often loop back: improve features, fix data, or even adjust the problem definition. Data science is a cycle, not a strict one-way pipeline.

Stakeholders in This Stage

Data scientists lead modeling; product managers judge if metric gains matter; legal and ethics teams may review fairness and compliance under modern AI regulations.

9. Deployment, Monitoring, and Feedback

From Model to Real Use

A model or analysis only creates value when it is used. Deployment means putting it into production systems or dashboards so people can act on it.

Monitoring in Practice

After deployment, teams monitor technical health (errors, speed), model health (performance over time), and business impact (are key metrics improving?).
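
A very simple version of these three checks could look like the sketch below. The `DailySnapshot` fields and all threshold values are hypothetical; real limits would be agreed with the stakeholders who own each metric.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DailySnapshot:
    """One day of hypothetical monitoring numbers for a deployed churn model."""
    error_rate: float      # share of scoring requests that failed
    p95_latency_ms: float  # 95th percentile response time
    recall: float          # model recall on users whose outcome is now known
    churn_rate: float      # business metric the project is meant to move


# Assumed thresholds for illustration only.
MAX_ERROR_RATE = 0.01
MAX_P95_LATENCY_MS = 500.0
MIN_RECALL = 0.70
MAX_CHURN_RATE = 0.05


def check_snapshot(snapshot: DailySnapshot) -> List[str]:
    """Return one alert per failed check; an empty list means all checks passed."""
    alerts = []
    if snapshot.error_rate > MAX_ERROR_RATE:
        alerts.append("technical: error rate above limit")
    if snapshot.p95_latency_ms > MAX_P95_LATENCY_MS:
        alerts.append("technical: responses too slow")
    if snapshot.recall < MIN_RECALL:
        alerts.append("model: recall below target, consider retraining")
    if snapshot.churn_rate > MAX_CHURN_RATE:
        alerts.append("business: churn rate above target")
    return alerts


healthy = DailySnapshot(error_rate=0.002, p95_latency_ms=180.0, recall=0.82, churn_rate=0.04)
degraded = DailySnapshot(error_rate=0.002, p95_latency_ms=190.0, recall=0.61, churn_rate=0.06)

print(check_snapshot(healthy))
print(check_snapshot(degraded))
```

In the second snapshot the system is technically healthy, yet the model and business checks both fail. This is why technical monitoring alone is not enough.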

Feedback Loops

User and stakeholder feedback helps reveal if the model stays useful or causes side effects. Teams retrain or redesign when needed, creating a continuous loop.

Example: Churn in Production

A daily job scores users for churn risk. Marketing uses a dashboard to target offers. Teams then track whether churn rates actually go down after campaigns.

Lifecycle and Governance

By 2026, many sectors must document and monitor important models for compliance and ethics. The focus is on governing models across their full lifecycle.

10. Map the Workflow for a Simple Scenario

Imagine you work for a university and get this request:

"Use data to improve first-year student success."

Try to outline the workflow steps yourself:

Problem definition: How would you rewrite this as a clear question?
Stakeholders: Who cares about the result (students, advisors, admin)?
Data discovery: What data might you use (grades, attendance, surveys)?
Cleaning/EDA/features: What issues or features might appear?
Modeling/evaluation: What could you try to predict or explain?
Deployment/feedback: How would advisors or students actually use the result?

Write short bullet answers. Then compare with this possible framing:

Problem: "Can we identify students at high risk of failing 3 or more courses in their first year so advisors can reach out early?"
Stakeholders: students, academic advisors, department heads.
Data: high school grades, first-semester grades, attendance, LMS logins.
Features: average grade, missed classes, late assignments, login frequency.
Modeling: predict risk of failing 3+ courses; evaluate with recall on at-risk students.
Deployment: a dashboard for advisors updated weekly, plus regular reviews to ensure fairness across different student groups.
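
To connect the modeling and deployment lines of this framing, the sketch below ranks students by risk score and measures recall among the students advisors can actually contact. The scores, outcomes, and advisor capacity are invented for illustration. The outcome column would only be known for a past cohort, which is what makes this kind of check possible.

```python
# Hypothetical scores from a past cohort:
# (student_id, risk_score, actually_failed_3_plus_courses).
students = [
    ("s01", 0.91, True),
    ("s02", 0.15, False),
    ("s03", 0.78, True),
    ("s04", 0.66, False),
    ("s05", 0.42, True),
    ("s06", 0.08, False),
    ("s07", 0.55, False),
    ("s08", 0.30, False),
]

advisor_capacity = 3  # assumed number of students advisors can contact per week

# The dashboard shows the highest-risk students first.
ranked = sorted(students, key=lambda s: s[1], reverse=True)
outreach_list = ranked[:advisor_capacity]

# Recall on at-risk students: how many of those who failed were on the list?
at_risk_total = sum(1 for _, _, failed in students if failed)
at_risk_reached = sum(1 for _, _, failed in outreach_list if failed)
recall_at_capacity = at_risk_reached / at_risk_total

print([student_id for student_id, _, _ in outreach_list])
print(f"reached {at_risk_reached} of {at_risk_total} at-risk students")
```

Here the list reaches two of the three at-risk students. The one it misses (s05) is the kind of case worth reviewing when improving features or checking fairness across student groups.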

11. Review Key Terms

Review the core ideas from the workflow.

Problem definition: The stage where a vague idea (like "use AI to improve the app") is turned into a clear, answerable question with a defined goal, metric, and constraints.
Stakeholder: Any person or group affected by a data project or using its results, such as business leaders, product managers, engineers, legal teams, or end users.
Data discovery: The process of finding, understanding, and getting access to relevant data sources for a project, including checking availability, quality, and legal constraints.
Data cleaning: Fixing, removing, or standardizing incorrect, missing, or inconsistent data so that analyses and models are reliable.
Exploratory data analysis (EDA): Early analysis using summaries and visualizations to understand distributions, relationships, and potential problems in the data.
Feature engineering: Creating meaningful input variables (features) from raw data, such as turning raw logs into "sessions per week" or "failed payments last 3 months".
Model evaluation: Measuring how well a model performs using metrics and test data, and checking for overfitting and unfair bias.
Deployment: Putting a model or analysis into real use, for example via an API, batch job, or dashboard, so that people or systems can act on it.
Monitoring: Ongoing tracking of technical performance, model quality, and business impact after deployment, to detect issues and trigger updates.
Iterative workflow: The idea that data science projects loop back through earlier stages (data, features, problem) instead of following a one-way, linear pipeline.

Key Terms

Modeling: Using algorithms to learn patterns from data and make predictions or classifications.
Deployment: Releasing a model or analysis into a real environment where it can be used by systems or people.
Monitoring: Continuously checking the technical, model, and business performance after deployment.
Stakeholder: Anyone affected by or using the results of a data project, including business, product, engineering, legal, and end users.
Data cleaning: Fixing or removing incorrect, missing, or inconsistent data to make it reliable for analysis and modeling.
Data question: A concrete, technical question about what to do with data, like "Can we train a model that outputs churn probabilities weekly?"
Data discovery: Finding and understanding relevant data sources, and checking their availability, quality, and legal constraints.
Model evaluation: Assessing model performance using metrics and test data, and checking fairness and robustness.
Business question: A high-level question about goals or problems, such as "How can we reduce customer churn?"
Iterative workflow: A cyclical way of working where you often return to earlier steps (data, features, problem) based on what you learn later.
Analytical question: A question about how data and analysis can address the business question, such as "Can we predict which customers are likely to churn soon?"
Feature engineering: Transforming raw data into meaningful variables (features) used by models.
Exploratory data analysis (EDA): Early investigation of data using summaries and plots to understand patterns and spot problems.