VM-LEARNING /class.xii ·track.ai ·ch-b2 session: 2026_27
PART B ▪ UNIT 2
02
Data Science Methodology
An Analytic Approach to the Capstone Project
Data Science Methodology is a process with a prescribed sequence of iterative steps that data scientists follow to approach a problem and find a solution. It builds the capacity to handle and interpret data, and it organises an AI project systematically, saving both time and cost.

Introduction — John B. Rollins's 10-Step Methodology

The methodology presented here was put forward by John B. Rollins, a Data Scientist at IBM Analytics. It consists of 10 steps grouped into 5 modules, each containing two stages.

🔹 The Five Modules
  1. From Problem to Approach
  2. From Requirements to Collection
  3. From Understanding to Preparation
  4. From Modelling to Evaluation
  5. From Deployment to Feedback
🔹 Key Concepts You'll Learn
  1. Introduction to Data Science Methodology
  2. Steps for Data Science Methodology
  3. Model Validation Techniques
  4. Model Performance — Evaluation Metrics

Prerequisite: Class XI AI concepts + familiarity with Capstone Projects and their objectives.

Learning Outcome 1: Integrate Data Science Methodology steps into the Capstone Project

1.1 Module 1 — From Problem to Approach

❓ Step 1 · Business Understanding

"What is the problem that you are trying to solve?"

Capstone Scenario — Mr Pavan Sankar visits a food festival and cannot identify cuisines of unfamiliar dishes (Bubble & Squeak · Phan Pyat · Jadoh). The problem: "Given the ingredients / image of a dish, predict its cuisine."

🧭 Step 2 · Analytic Approach

"How can you use the data to answer the question?"

The team picks the most appropriate approach by asking:

  1. How much / how many? → Regression.
  2. Which category does the data belong to? → Classification.
  3. Can the data be grouped? → Clustering.
  4. Any unusual pattern? → Anomaly Detection.
  5. Which option should we give the customer? → Recommendation.
🔹 Four Main Types of Data Analytics
| Analytics | Focus | Purpose | Example |
|---|---|---|---|
| Descriptive | Summarise past data (what has happened) | Identify patterns, trends, anomalies in past data (mean/median/mode, range/variance/SD; graphs/charts) | Average marks in an exam · last year's sales |
| Diagnostic | Why something happened | Root-cause analysis, hypothesis testing, correlation | Why did sales drop? Poor customer service? Low quality? |
| Predictive | What will happen (future) | Forecast future behaviour: regression, classification, clustering | Forecast next month's sales or demand |
| Prescriptive | What should we do | Recommend actions: optimisation, simulation, decision analysis | Right strategy to increase festival-season sales |
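Descriptive analytics is the easiest type to try yourself. The sketch below computes the summary statistics named in the table (mean, median, mode, range, standard deviation) for a hypothetical list of exam marks; the marks are illustrative, not from the chapter.

```python
# Descriptive analytics on a hypothetical list of exam marks.
import statistics

marks = [72, 85, 60, 85, 78, 90, 66, 85, 74, 81]

print("Mean:  ", statistics.mean(marks))            # central tendency
print("Median:", statistics.median(marks))          # middle value when sorted
print("Mode:  ", statistics.mode(marks))            # most frequent value
print("Range: ", max(marks) - min(marks))           # spread of the data
print("StDev: ", round(statistics.stdev(marks), 2)) # sample standard deviation
```

Running this summarises what *has* happened in the data, which is exactly the scope of descriptive analytics; the other three types build on such summaries.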

1.2 Module 2 — From Requirements to Collection

📋 Step 3 · Data Requirements

"What are the data requirements?"

Requirements are driven by the analytic approach chosen. Using the 5W1H method (Who, What, When, Where, Why, How), identify what data is needed, in what format, from which sources, and how it will be prepared.

🔹 Three Types of Data

  1. Structured: organised in rows and columns (spreadsheets, relational databases).
  2. Unstructured: no fixed format (text, images, audio, video).
  3. Semi-structured: partially organised with tags or markers (JSON, XML, e-mail).

📥 Step 4 · Data Collection

"What occurs during data collection?" — a systematic process of gathering observations or measurements. After this step, the team decides if more or less data is required.

🔹 Primary Data Source

Original, raw, unprocessed data collected firsthand: direct observation, experimentation, surveys, interviews. Examples — marketing campaigns, feedback forms, IoT sensor data.

🔹 Secondary Data Source

Already-stored data ready to reuse: books, journals, websites, internal transactional databases. Methods: social-media tracking, web scraping, satellite data tracking, smart forms.

Popular online sources: data.gov · World Bank Open Data · UNICEF · Open Data Network · Kaggle · WHO · Google.

1.3 Module 3 — From Understanding to Preparation

🔎 Step 5 · Data Understanding

"Is the data collected representative of the problem to be solved?"

🧹 Step 6 · Data Preparation

"What additional work is required to manipulate and work with the data?" This is the most time-consuming step.

🔹 Feature Engineering
Feature Engineering — the process of selecting, modifying or creating new features (variables) from raw data to improve the performance of machine-learning models.
To predict house prices from raw data (area, bedrooms, year built), create new features:
Age of house = Current year − Year built.
Price per sq ft = Price / Area.
These derived features help the model make more accurate predictions.
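The house-price example above can be sketched in pandas. The column names (`Area`, `YearBuilt`, `Price`) and the sample values are illustrative assumptions, not data from the chapter:

```python
# Feature engineering sketch for the house-price example:
# derive AgeOfHouse and PricePerSqFt from the raw columns.
import pandas as pd

df = pd.DataFrame({
    "Area":      [1200, 1500, 900],        # sq ft (illustrative values)
    "YearBuilt": [2005, 2012, 1998],
    "Price":     [6_000_000, 8_250_000, 4_050_000],
})

current_year = 2026
df["AgeOfHouse"]   = current_year - df["YearBuilt"]  # Age = current year - year built
df["PricePerSqFt"] = df["Price"] / df["Area"]        # Price / Area

print(df)
```

A model trained on `AgeOfHouse` and `PricePerSqFt` alongside the raw columns often captures price patterns that the raw year and area values alone would hide.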

1.4 Module 4 — From Modelling to Evaluation

🤖 Step 7 · AI Modelling

"In what way can the data be visualised to get to the required answer?" Modelling is iterative — may loop back to Data Preparation. Data scientists test multiple algorithms to find the best fit.

📊 Step 8 · Evaluation

"Does the model really answer the initial question or does it need adjustment?" Use test data to measure accuracy, precision, recall, F1 score. Two phases:

  1. Diagnostic Measures — ensure the model works as intended. A decision tree can evaluate a predictive model; a testing set with known outcomes can refine a descriptive model.
  2. Statistical Significance Test — verifies the model accurately processes and interprets data. Avoids second-guessing when the answer is revealed.

1.5 Module 5 — From Deployment to Feedback

🚀 Step 9 · Deployment

"How does the solution reach the hands of the user?" Once evaluated, the trained model is made available to users — sometimes first to a limited group or test environment to build confidence, then rolled out fully. Deployment often involves additional internal teams, skills and technology.

💬 Step 10 · Feedback

"Is the problem solved? Has the question been satisfactorily answered?" Feedback is gathered from users and stakeholders once the solution is in use, and it drives further refinement of the model.

The process from Modelling → Feedback is highly iterative. Each step sets the stage for the next, making the methodology cyclical and ensuring refinement at every stage.

Learning Outcome 2: Identify the best way to represent a solution to a problem

2.1 Solution Representation — What Fits the Problem?

The analytic approach chosen in Step 2 dictates how the solution is represented:

| Problem Type | Best Representation |
|---|---|
| Predict a continuous number | Regression line / scatter plot |
| Classify into categories | Decision tree · confusion matrix · classification report |
| Group similar data | Clusters shown on scatter plot; dendrogram (hierarchical clustering) |
| Detect unusual patterns | Box plot with outliers · anomaly-flag table |
| Recommend next-best action | Ranked list · decision matrix |
🔹 Represent with Stakeholders in Mind

Choose the representation that suits the audience: detailed metrics for the technical team, summary visuals for business sponsors.

Learning Outcome 3: Understand the importance of validating machine learning models

3.1 What Is Model Validation?

Model Validation — the step conducted after Model Training, where the effectiveness of the trained model is assessed using a testing dataset. It offers a systematic way to measure accuracy and reliability and shows how well the model generalises to new, unseen data.
🔹 Benefits of Model Validation

It helps detect overfitting and underfitting, builds confidence in the model's reliability, and guides the choice between candidate models.

3.2 Common Validation Techniques

  1. Train-Test Split
  2. K-Fold Cross Validation
  3. Leave-One-Out Cross Validation
  4. Time-Series Cross Validation

3.3 Train-Test Split

A technique for evaluating any supervised learning algorithm — works for both classification and regression.

🔹 Configuring the Split

Main parameter: the relative size of the train and test sets, expressed as a fraction between 0 and 1 (e.g. test_size = 0.2 reserves 20% of the data for testing).

🔹 Common Split Percentages

  • 80% train : 20% test
  • 70% train : 30% test
  • 67% train : 33% test

There is no single optimal split — pick one that meets the project's objectives.
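As a minimal sketch, an 80:20 split with scikit-learn's train_test_split looks like this; X and y are toy arrays, and random_state simply makes the shuffle reproducible:

```python
# An 80:20 train-test split on 10 toy samples.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(-1, 1)   # 10 samples, 1 feature
y = np.arange(10)                  # 10 target values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)

print(len(X_train), len(X_test))   # 8 2
```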

3.4 K-Fold Cross Validation

Splits data into k folds (subsets). The model is trained on some folds and tested on the remainder — the process is repeated so every fold is used as a holdout exactly once.

With k = 5, each fold = 20% of data. Experiment 1 holds out fold 1, trains on folds 2–5. Experiment 2 holds out fold 2 … Experiment 5 holds out fold 5. 100% of the data is used as holdout at some point.
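The five experiments above can be sketched with scikit-learn's KFold on 10 toy samples; each of the 5 iterations holds out a different 20% fold:

```python
# 5-fold cross-validation: each experiment holds out a different fold.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)                  # 10 toy samples
kf = KFold(n_splits=5)

for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Experiment {i}: holdout fold = {test_idx}")
```

After all five experiments, every sample has appeared in a holdout fold exactly once, which is what makes the resulting performance estimate more reliable than a single split.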

3.5 Train-Test Split vs Cross-Validation

| Train-Test Split | Cross-Validation |
|---|---|
| Normally applied on large datasets. | Normally applied on small datasets. |
| Divides data into one train set and one test set. | Divides data into k folds and rotates through them. |
| Clear demarcation between train and test data. | Every data point serves in both training and testing at some stage. |
Learning Outcome 4: Use key evaluation metrics for various machine learning tasks

4.1 Why Evaluation Metrics?

Evaluation metrics help assess the performance of a trained model on the test set and enable comparison between models to pick the best one. Classification and regression answer different kinds of questions, so each task has its own set of metrics.

4.2 Evaluation Metrics for Classification

📊 1. Confusion Matrix

Confusion Matrix — an N × N table summarising predictions against actual outcomes (N = number of classes). For a binary problem, N = 2 → a 2 × 2 matrix.
|  | Predicted: Yes | Predicted: No |
|---|---|---|
| Actual: Yes | TP (True Positive) | FN (False Negative) |
| Actual: No | FP (False Positive) | TN (True Negative) |
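The same 2 × 2 table can be computed with scikit-learn's confusion_matrix on toy labels. One caution: scikit-learn sorts classes as [0, 1], so the returned matrix reads [[TN, FP], [FN, TP]], whereas the table above lists the "Yes" row first:

```python
# Computing a 2x2 confusion matrix with scikit-learn on toy labels.
from sklearn.metrics import confusion_matrix

y_actual    = [1, 1, 1, 0, 0, 0, 1, 0]
y_predicted = [1, 0, 1, 0, 1, 0, 1, 0]

cm = confusion_matrix(y_actual, y_predicted)
tn, fp, fn, tp = cm.ravel()        # sklearn's row order: [[TN, FP], [FN, TP]]
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")   # TP=3 FN=1 FP=1 TN=3
```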

🎯 2. Precision

What proportion of predicted positives is truly positive?

Precision = TP / (TP + FP). Should be as high as possible.

🔁 3. Recall

What proportion of actual positives is correctly classified?

Recall = TP / (TP + FN).

⚖️ 4. F1 Score

The F1 score is the harmonic mean of precision and recall. A good F1 score means both low false positives and low false negatives: the model correctly identifies real positives without raising false alarms. F1 = 1 is perfect; F1 = 0 is a total failure.

F1 = 2 × (Precision × Recall) / (Precision + Recall)

✅ 5. Accuracy

Accuracy = (TP + TN) / (TP + FP + FN + TN) = number of correct predictions / total predictions.

4.3 Evaluation Metrics for Regression

📐 1. MAE — Mean Absolute Error

The average of the absolute differences between predicted and actual values. MAE = 0 means perfect predictions.

📐 2. MSE — Mean Square Error

The average of squared distances between the target and predicted values. Most commonly used metric for regression. Penalises larger errors more.

📐 3. RMSE — Root Mean Square Error

Standard deviation of the residuals (prediction errors). Preferred over MSE because it's in the same units as the target variable — easier to interpret.

4.4 Practical 1 — MSE & RMSE in MS Excel

Predicted vs Actual values for 10 data points:

| Predicted | Actual | Residual (Actual − Predicted) | Squared Residual |
|---|---|---|---|
| 14 | 17 | 3 | 9 |
| 19 | 18 | -1 | 1 |
| 17 | 18 | 1 | 1 |
| 13 | 15 | 2 | 4 |
| 12 | 18 | 6 | 36 |
| 7 | 11 | 4 | 16 |
| 24 | 20 | -4 | 16 |
| 23 | 18 | -5 | 25 |
| 17 | 13 | -4 | 16 |
| 18 | 19 | 1 | 1 |

Sum of Squared Residuals = 125
MSE = 125 / 10 = 12.5
RMSE = √12.5 ≈ 3.54
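The Excel result can be cross-checked in a few lines of Python using the same 10 pairs:

```python
# Cross-checking the Excel calculation: MSE and RMSE for the 10 pairs.
import math

predicted = [14, 19, 17, 13, 12, 7, 24, 23, 17, 18]
actual    = [17, 18, 18, 15, 18, 11, 20, 18, 13, 19]

sq_residuals = [(a - p) ** 2 for p, a in zip(predicted, actual)]
mse  = sum(sq_residuals) / len(sq_residuals)
rmse = math.sqrt(mse)

print("Sum of squared residuals:", sum(sq_residuals))  # 125
print("MSE: ", mse)                                    # 12.5
print("RMSE:", round(rmse, 2))                         # 3.54
```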

4.5 Practical 2 — Classification Metrics from a Confusion Matrix

Given: TP = 35, TN = 50, FP = 10, FN = 5.
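Working through these counts with the formulas from Section 4.2:

```python
# Precision, Recall, F1 and Accuracy from the given confusion-matrix counts.
tp, tn, fp, fn = 35, 50, 10, 5

precision = tp / (tp + fp)                         # 35 / 45
recall    = tp / (tp + fn)                         # 35 / 40
f1        = 2 * precision * recall / (precision + recall)
accuracy  = (tp + tn) / (tp + tn + fp + fn)        # 85 / 100

print(f"Precision: {precision:.3f}")   # 0.778
print(f"Recall:    {recall:.3f}")      # 0.875
print(f"F1 score:  {f1:.3f}")          # 0.824
print(f"Accuracy:  {accuracy:.2f}")    # 0.85
```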

4.6 Practical 3 — Python Code to Evaluate a Model

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('Salary_Data.csv', sep=',')
print(df.head())
print(df.shape)           # (30, 2)
print(df.isnull().sum())  # no nulls

# Data Preparation
X = np.array(df['YearsExperience']).reshape(-1, 1)
Y = np.array(df['Salary']).reshape(-1, 1)
print(X.shape, Y.shape)

# Train-Test Split
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, shuffle=True, random_state=10)

# Fit the model
model = LinearRegression()
model.fit(X_train, Y_train)

# Score (R² on train and test sets)
print("Train R²:", model.score(X_train, Y_train))
print("Test  R²:", model.score(X_test, Y_test))

# Mean Squared Error on the test set
y_pred = model.predict(X_test)
print("Mean Squared Error:", mean_squared_error(Y_test, y_pred))

Capstone practical set: (a) compute MSE & RMSE in MS Excel for the 10-pair table, (b) compute Precision / Recall / F1 / Accuracy from the given confusion matrix, (c) on Salary_Data.csv, build a Linear Regression, do 80:20 split, and report training & testing R² and MSE.
Check Your Progress — quick MCQ pointers:
  • Hardest stage in Data Science Methodology → Business Understanding (a).
  • Business Sponsors define the problem from a Business perspective.
  • Data Modelling focuses on Predictive or Descriptive models.
  • "No optimal split percentage" is True; the most common split is 80:20 (train : test).
  • train_test_split is imported from sklearn.model_selection.
  • Cross-validation is preferred with small data and gives a more reliable measure — but it does not take short time (it takes longer).
  • Identifying data content, format, sources → Step 3 Data Requirements.
  • "Edge" is not an online data source (UNICEF, WHO, Google are).
  • Historical data with known outcomes = Training set.
  • Data set to evaluate the fit model = Test set.
  • In test_size=0.2, the training size is 0.8 (80 %).
  • In k-fold CV, k = number of subsets / folds.
  • MSE is Mean Squared Error (not Median) · used for regression · penalises large errors.

Quick Revision — Key Points to Remember

  • Data Science Methodology = iterative framework for building AI solutions; proposed by John B. Rollins (IBM Analytics).
  • 10 steps · 5 modules — each module has 2 stages.
  • Module 1 — Problem to Approach: Business Understanding (5W1H, DT Framework, Problem Scoping) → Analytic Approach (Regression / Classification / Clustering / Anomaly / Recommendation).
  • 4 types of analytics: Descriptive · Diagnostic · Predictive · Prescriptive.
  • Module 2 — Requirements to Collection: Data Requirements (types, format, source, prep) → Data Collection (Primary + Secondary sources — data.gov, World Bank, UNICEF, Kaggle, WHO, Google).
  • 3 data types: Structured · Unstructured · Semi-structured.
  • Module 3 — Understanding to Preparation: Data Understanding (descriptive stats, histograms) → Data Preparation (clean, combine, transform + Feature Engineering). Prep is the most time-consuming step.
  • Module 4 — Modelling to Evaluation: Descriptive vs Predictive modelling → Evaluation (Diagnostic Measures + Statistical Significance Test).
  • Module 5 — Deployment to Feedback: Deployment (test environment → full rollout) → Feedback (iterative refinement).
  • Model Validation = post-training check using a test set; prevents overfitting / underfitting.
  • 4 validation techniques: Train-Test Split · K-Fold CV · Leave-One-Out CV · Time-Series CV.
  • Common splits: 80:20 · 70:30 · 67:33 — no single optimal ratio.
  • K-Fold CV — k rotating holdout folds; every row serves as validation once; more reliable but slower.
  • Train-Test vs CV: TT for large datasets (clear split); CV for small datasets (rotating folds).
  • Classification metrics: Confusion Matrix (TP/TN/FP/FN), Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = 2·P·R/(P+R), Accuracy = (TP+TN)/Total.
  • Regression metrics: MAE · MSE (penalises large errors) · RMSE (same units as target).
  • Worked example: 10 predicted/actual pairs → Sum of Sq Residuals = 125 → MSE = 12.5 · RMSE = 3.54.
  • Confusion-matrix example: TP 35 · TN 50 · FP 10 · FN 5 → Precision 77.8 % · Recall 87.5 % · F1 82.3 % · Accuracy 85 %.
  • Python pipeline: pandas → numpy → sklearn.model_selection.train_test_split → sklearn.linear_model.LinearRegression → sklearn.metrics.mean_squared_error.
🧠 Practice Quiz — test yourself on this chapter