VM-LEARNING /class.xii ·track.ai ·ch-b2 session: 2026_27
PART B ▪ UNIT 2
Data Science Methodology
An Analytic Approach to the Capstone Project
Data Science Methodology is a process with a prescribed sequence of iterative steps that data scientists follow to approach a problem and find a solution. It builds the capacity to handle and comprehend data, and helps organise an AI project systematically so that time and cost are not wasted.

Introduction — John B. Rollins's 10-Step Methodology

The methodology presented here was put forward by John B. Rollins, a Data Scientist at IBM Analytics. It consists of 10 steps grouped into 5 modules, each containing two stages.

The Five Modules
  1. From Problem to Approach
  2. From Requirements to Collection
  3. From Understanding to Preparation
  4. From Modelling to Evaluation
  5. From Deployment to Feedback
Key Concepts You'll Learn
  1. Introduction to Data Science Methodology
  2. Steps for Data Science Methodology
  3. Model Validation Techniques
  4. Model Performance — Evaluation Metrics

Prerequisite: Class XI AI concepts + familiarity with Capstone Projects and their objectives.

Learning Outcome 1: Integrate Data Science Methodology steps into the Capstone Project

1.1 Module 1 — From Problem to Approach

Step 1 · Business Understanding

"What is the problem that you are trying to solve?"

Capstone Scenario — Mr Pavan Sankar visits a food festival and cannot identify cuisines of unfamiliar dishes (Bubble & Squeak · Phan Pyat · Jadoh). The problem: "Given the ingredients / image of a dish, predict its cuisine."

Step 2 · Analytic Approach

"How can you use the data to answer the question?"

The team picks the most appropriate approach by asking:

  1. How much / how many? → Regression.
  2. Which category does the data belong to? → Classification.
  3. Can the data be grouped? → Clustering.
  4. Any unusual pattern? → Anomaly Detection.
  5. Which option should we give the customer? → Recommendation.
Four Main Types of Data Analytics
| Analytics | Focus | Purpose | Example |
|---|---|---|---|
| Descriptive | Summarise past data: what has happened | Identify patterns, trends, anomalies in past data (mean/median/mode, range/variance/SD; graphs/charts) | Average marks in an exam · last year's sales |
| Diagnostic | Why something happened | Root-cause analysis, hypothesis testing, correlation | Why did sales drop? Poor customer service? Low quality? |
| Predictive | What will happen (future) | Forecast future behaviour: regression, classification, clustering | Forecast next month's sales or demand |
| Prescriptive | What should we do | Recommend actions: optimisation, simulation, decision analysis | Right strategy to increase festival-season sales |
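
The descriptive measures named in the table (mean, median, mode, range, variance, SD) can be computed with Python's standard `statistics` module. The sketch below is only an illustration: the list of exam marks is made up and does not come from the chapter.

```python
import statistics

# Hypothetical exam marks (illustrative values only)
marks = [72, 85, 85, 90, 64, 78, 85, 91]

mean_mark = statistics.mean(marks)        # average
median_mark = statistics.median(marks)    # middle value of the sorted list
mode_mark = statistics.mode(marks)        # most frequent value
mark_range = max(marks) - min(marks)      # spread between extremes
variance = statistics.pvariance(marks)    # population variance
std_dev = statistics.pstdev(marks)        # population standard deviation

print("Mean:", mean_mark)
print("Median:", median_mark)
print("Mode:", mode_mark)
print("Range:", mark_range)
print("Variance:", round(variance, 2))
print("SD:", round(std_dev, 2))
```

These summaries describe what has already happened; they do not explain why (diagnostic) or forecast anything (predictive).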

1.2 Module 2 — From Requirements to Collection

Step 3 · Data Requirements

"What are the data requirements?"

Requirements are driven by the analytic approach chosen. Using the 5W1H method, identify the data content, formats and sources needed.

Three Types of Data

Structured · Unstructured · Semi-structured.

Step 4 · Data Collection

"What occurs during data collection?" — a systematic process of gathering observations or measurements. After this step, the team decides if more or less data is required.

Primary Data Source

Original, raw, unprocessed data collected firsthand: direct observation, experimentation, surveys, interviews. Examples — marketing campaigns, feedback forms, IoT sensor data.

Secondary Data Source

Already-stored data ready to reuse: books, journals, websites, internal transactional databases. Methods: social-media tracking, web scraping, satellite data tracking, smart forms.

Popular online sources: data.gov · World Bank Open Data · UNICEF · Open Data Network · Kaggle · WHO · Google.

1.3 Module 3 — From Understanding to Preparation

Step 5 · Data Understanding

"Is the data collected representative of the problem to be solved?"

Step 6 · Data Preparation

"What additional work is required to manipulate and work with the data?" This is the most time-consuming step.

Feature Engineering
Feature Engineering — the process of selecting, modifying or creating new features (variables) from raw data to improve the performance of machine-learning models.
To predict house prices from raw data (area, bedrooms, year built), create new features:
Age of house = Current year − Year built.
Price per sq ft = Price / Area.
These derived features help the model make more accurate predictions.
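
The two derived features above can be sketched in a few lines of plain Python. The records, the field names (`area_sqft`, `year_built`, `price`) and the reference year are assumptions made for illustration only.

```python
# Hypothetical raw records (illustrative values only)
houses = [
    {"area_sqft": 1200, "bedrooms": 3, "year_built": 2005, "price": 6_000_000},
    {"area_sqft": 800, "bedrooms": 2, "year_built": 2015, "price": 3_600_000},
]


def add_derived_features(house, current_year):
    """Return a copy of the record with the two derived features added."""
    enriched = dict(house)
    # Age of house = Current year - Year built
    enriched["house_age"] = current_year - house["year_built"]
    # Price per sq ft = Price / Area
    # Note: this feature uses the price itself, so it suits analysis of
    # known sales; it is not available for a house whose price is unknown.
    enriched["price_per_sqft"] = house["price"] / house["area_sqft"]
    return enriched


CURRENT_YEAR = 2025  # assumed reference year for the example
engineered = [add_derived_features(h, CURRENT_YEAR) for h in houses]

for row in engineered:
    print(row["house_age"], row["price_per_sqft"])
```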

1.4 Module 4 — From Modelling to Evaluation

Step 7 · AI Modelling

"In what way can the data be visualised to get to the required answer?" Modelling is iterative and may loop back to Data Preparation. Data scientists test multiple algorithms to find the best fit.

Step 8 · Evaluation

"Does the model really answer the initial question or does it need adjustment?" Use test data to measure accuracy, precision, recall, F1 score. Two phases:

  1. Diagnostic Measures — ensure the model works as intended. A decision tree can evaluate a predictive model; a testing set with known outcomes can refine a descriptive model.
  2. Statistical Significance Test — verifies the model accurately processes and interprets data. Avoids second-guessing when the answer is revealed.

1.5 Module 5 — From Deployment to Feedback

Step 9 · Deployment

"How does the solution reach the hands of the user?" Once evaluated, the trained model is made available to users — sometimes first to a limited group or test environment to build confidence, then rolled out fully. Deployment often involves additional internal teams, skills and technology.

Step 10 · Feedback

"Is the problem solved? Has the question been satisfactorily answered?" Feedback comes from:

The process from Modelling → Feedback is highly iterative. Each step sets the stage for the next, making the methodology cyclical and ensuring refinement at every stage.

Learning Outcome 2: Identify the best way to represent a solution to a problem

2.1 Solution Representation — What Fits the Problem?

The analytic approach chosen in Step 2 dictates how the solution is represented:

| Problem Type | Best Representation |
|---|---|
| Predict a continuous number | Regression line / scatter plot |
| Classify into categories | Decision tree · confusion matrix · classification report |
| Group similar data | Clusters shown on scatter plot; dendrogram (hierarchical clustering) |
| Detect unusual patterns | Box plot with outliers · anomaly-flag table |
| Recommend next-best action | Ranked list · decision matrix |
Represent with Stakeholders in Mind
Learning Outcome 3: Understand the importance of validating machine learning models

3.1 What Is Model Validation?

Model Validation — the step conducted after Model Training, where the effectiveness of the trained model is assessed using a testing dataset. It offers a systematic way to measure accuracy and reliability and shows how well the model generalises to new, unseen data.
Benefits of Model Validation

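3.2 Common Validation Techniques

Four techniques are commonly used: Train-Test Split · K-Fold Cross Validation · Leave-One-Out Cross Validation · Time-Series Cross Validation.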

3.3 Train-Test Split

A technique for evaluating any supervised learning algorithm — works for both classification and regression.

Configuring the Split

Main parameter: the size of train vs test, expressed as a ratio between 0 and 1.

Common Split Percentages

Typical train : test splits are 80:20, 70:30 and 67:33.

There is no single optimal split — pick one that meets the project's objectives.
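
A minimal sketch of configuring the split with scikit-learn's `train_test_split`. The ten-sample data set is invented for the example; `test_size=0.2` corresponds to the 80:20 split, and `random_state` is set only so the shuffle is repeatable.

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with one feature each
X = [[i] for i in range(10)]
y = [2 * i for i in range(10)]

# test_size=0.2 keeps 20 % of the rows for testing, so 80 % is used for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=10)

print(len(X_train), len(X_test))  # 8 2
```

Changing `test_size` to 0.3 or 0.33 gives the 70:30 and 67:33 splits in the same way.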

3.4 K-Fold Cross Validation

Splits the data into k folds (subsets). In each round the model is trained on k − 1 folds and tested on the remaining fold; the process is repeated k times so that every fold is used as the holdout exactly once.

With k = 5, each fold = 20% of data. Experiment 1 holds out fold 1, trains on folds 2–5. Experiment 2 holds out fold 2 … Experiment 5 holds out fold 5. 100% of the data is used as holdout at some point.
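
The rotation described above can be sketched in plain Python. The helper `k_fold_indices` is a hypothetical name written for this illustration (scikit-learn provides an equivalent in `sklearn.model_selection.KFold`); it only produces row indices and does not train a model.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, holdout_indices) for each of the k experiments."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        holdout = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, holdout
        start += size


# 10 rows and k = 5: each holdout fold is 20 % of the data
folds = list(k_fold_indices(10, 5))
for number, (train, holdout) in enumerate(folds, start=1):
    print(f"Experiment {number}: holdout={holdout} train={train}")
```

In a real project, a model would be fitted on the `train` rows and scored on the `holdout` rows in each experiment, and the k scores averaged.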

3.5 Train-Test Split vs Cross-Validation

| Train-Test Split | Cross-Validation |
|---|---|
| Normally applied on large datasets. | Normally applied on small datasets. |
| Divides data into train set + test set. | Divides into k folds; rotates through them. |
| Clear demarcation between train and test. | Every data point may be in either training or testing at some stage. |
Learning Outcome 4: Use key evaluation metrics for various machine learning tasks

4.1 Why Evaluation Metrics?

Evaluation metrics help assess the performance of a trained model on the test set. They enable comparison between models to pick the best one.

Classification models predict categories while regression models predict continuous values, so each task has its own set of metrics.

4.2 Evaluation Metrics for Classification

1. Confusion Matrix

Confusion Matrix — an N × N table summarising predictions against actual outcomes (N = number of classes). For a binary problem, N = 2 → a 2 × 2 matrix.
| | Predicted: Yes | Predicted: No |
|---|---|---|
| Actual: Yes | TP (True Positive) | FN (False Negative) |
| Actual: No | FP (False Positive) | TN (True Negative) |
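
The four cells of a binary confusion matrix can be counted directly from paired actual and predicted labels. The sketch below uses made-up "yes"/"no" labels, and `confusion_counts` is a hypothetical helper name chosen for this example.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count TP, FN, FP and TN for a binary problem."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            counts["TP"] += 1      # actual yes, predicted yes
        elif a == positive:
            counts["FN"] += 1      # actual yes, predicted no
        elif p == positive:
            counts["FP"] += 1      # actual no, predicted yes
        else:
            counts["TN"] += 1      # actual no, predicted no
    return counts


# Hypothetical labels (illustrative values only)
actual = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "yes", "no", "yes", "no"]

counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 3, 'FN': 1, 'FP': 1, 'TN': 3}
```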

2. Precision

What proportion of predicted positives is truly positive?

Precision = TP / (TP + FP). Should be as high as possible.

3. Recall

What proportion of actual positives is correctly classified?

Recall = TP / (TP + FN).

4. F1 Score

The harmonic mean of Precision and Recall. A good F1 score means low false positives and low false negatives: you correctly identify real threats and are not disturbed by false alarms. F1 = 1 is perfect; F1 = 0 is a total failure.

F1 = 2 × (Precision × Recall) / (Precision + Recall)

5. Accuracy

Accuracy = (TP + TN) / (TP + FP + FN + TN) = number of correct predictions / total predictions.

4.3 Evaluation Metrics for Regression

1. MAE — Mean Absolute Error

The average of the absolute differences between predicted and actual values. MAE = 0 means perfect predictions.

2. MSE — Mean Square Error

The average of squared distances between the target and predicted values. Most commonly used metric for regression. Penalises larger errors more.

3. RMSE — Root Mean Square Error

The square root of MSE, which can be read as the standard deviation of the residuals (prediction errors). Often preferred over MSE because it is in the same units as the target variable, so it is easier to interpret.

4.4 Practical 1 — MSE & RMSE in MS Excel

Predicted vs Actual values for 10 data points:

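| Predicted | Actual | Residual (Actual − Predicted) | Squared Residual |
|---|---|---|---|
| 14 | 17 | 3 | 9 |
| 19 | 18 | -1 | 1 |
| 17 | 18 | 1 | 1 |
| 13 | 15 | 2 | 4 |
| 12 | 18 | 6 | 36 |
| 7 | 11 | 4 | 16 |
| 24 | 20 | -4 | 16 |
| 23 | 18 | -5 | 25 |
| 17 | 13 | -4 | 16 |
| 18 | 19 | 1 | 1 |
| Sum of Squared Residuals | | | 125 |
| MSE = 125 / 10 | | | 12.5 |
| RMSE = √12.5 | | | 3.54 |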
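
The same calculation can be cross-checked in plain Python. This sketch reuses the ten predicted/actual pairs from the table; MAE is added as an extra illustration and is not part of the Excel practical.

```python
import math

# The ten pairs from the table
predicted = [14, 19, 17, 13, 12, 7, 24, 23, 17, 18]
actual = [17, 18, 18, 15, 18, 11, 20, 18, 13, 19]

# Residual = Actual - Predicted
residuals = [a - p for a, p in zip(actual, predicted)]

sum_squared = sum(r ** 2 for r in residuals)
mse = sum_squared / len(residuals)                        # mean of squared residuals
rmse = math.sqrt(mse)                                     # back in the target's units
mae = sum(abs(r) for r in residuals) / len(residuals)     # mean of absolute residuals

print("Sum of squared residuals:", sum_squared)  # 125
print("MSE:", mse)                               # 12.5
print("RMSE:", round(rmse, 2))                   # 3.54
print("MAE:", mae)                               # 3.1
```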

4.5 Practical 2 — Classification Metrics from a Confusion Matrix

Given: TP = 35, TN = 50, FP = 10, FN = 5.
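
A short sketch that applies the formulas from section 4.2 to these four counts. Only the given values are used; the variable names are chosen for the example.

```python
# Counts given in the practical
TP, TN, FP, FN = 35, 50, 10, 5

precision = TP / (TP + FP)                            # 35 / 45
recall = TP / (TP + FN)                               # 35 / 40
f1 = 2 * (precision * recall) / (precision + recall)
accuracy = (TP + TN) / (TP + FP + FN + TN)            # 85 / 100

print(f"Precision: {precision:.1%}")  # 77.8%
print(f"Recall:    {recall:.1%}")     # 87.5%
print(f"F1 score:  {f1:.1%}")         # 82.4%
print(f"Accuracy:  {accuracy:.1%}")   # 85.0%
```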

4.6 Practical 3 — Python Code to Evaluate a Model

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Load the data set: years of experience vs salary
df = pd.read_csv('Salary_Data.csv', sep=',')
print(df.head())
print(df.shape)           # (30, 2)
print(df.isnull().sum())  # no nulls

# Data Preparation: reshape to 2-D column arrays, as scikit-learn expects
X = np.array(df['YearsExperience']).reshape(-1, 1)
Y = np.array(df['Salary']).reshape(-1, 1)
print(X.shape, Y.shape)

# Train-Test Split: 80 % train, 20 % test; random_state makes the shuffle repeatable
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, shuffle=True, random_state=10)

# Fit the model on the training set only
model = LinearRegression()
model.fit(X_train, Y_train)

# Score: R² on the training and testing data
print("Train R²:", model.score(X_train, Y_train))
print("Test  R²:", model.score(X_test, Y_test))

# Mean Squared Error on the training and testing data
print("Train MSE:", mean_squared_error(Y_train, model.predict(X_train)))
Y_pred = model.predict(X_test)
print("Test  MSE:", mean_squared_error(Y_test, Y_pred))

Capstone practical set: (a) compute MSE & RMSE in MS Excel for the 10-pair table, (b) compute Precision / Recall / F1 / Accuracy from the given confusion matrix, (c) on Salary_Data.csv, build a Linear Regression, do 80:20 split, and report training & testing R² and MSE.
Check Your Progress — quick MCQ pointers:
  • Hardest stage in Data Science Methodology → Business Understanding.
  • Business Sponsors define the problem from a Business perspective.
  • Data Modelling focuses on Predictive or Descriptive models.
  • "No optimal split percentage" is True; the most common split is 80:20 (train : test), sometimes written as 20:80 (test : train).
  • train_test_split is imported from sklearn.model_selection.
  • Cross-validation is preferred with small data and gives a more reliable measure, but it is not quick: it takes longer to run than a single train-test split.
  • Identifying data content, format, sources → Step 3 Data Requirements.
  • "Edge" is not an online data source (UNICEF, WHO, Google are).
  • Historical data with known outcomes = Training set.
  • Data set to evaluate the fit model = Test set.
  • In test_size=0.2, the training size is 0.8 (80 %).
  • In k-fold CV, k = number of subsets / folds.
  • MSE is Mean Squared Error (not Median) · used for regression · penalises large errors.

Quick Revision — Key Points to Remember

  • Data Science Methodology = iterative framework for building AI solutions; proposed by John B. Rollins (IBM Analytics).
  • 10 steps · 5 modules — each module has 2 stages.
  • Module 1 — Problem to Approach: Business Understanding (5W1H, DT Framework, Problem Scoping) → Analytic Approach (Regression / Classification / Clustering / Anomaly / Recommendation).
  • 4 types of analytics: Descriptive · Diagnostic · Predictive · Prescriptive.
  • Module 2 — Requirements to Collection: Data Requirements (types, format, source, prep) → Data Collection (Primary + Secondary sources — data.gov, World Bank, UNICEF, Kaggle, WHO, Google).
  • 3 data types: Structured · Unstructured · Semi-structured.
  • Module 3 — Understanding to Preparation: Data Understanding (descriptive stats, histograms) → Data Preparation (clean, combine, transform + Feature Engineering). Prep is the most time-consuming step.
  • Module 4 — Modelling to Evaluation: Descriptive vs Predictive modelling → Evaluation (Diagnostic Measures + Statistical Significance Test).
  • Module 5 — Deployment to Feedback: Deployment (test environment → full rollout) → Feedback (iterative refinement).
  • Model Validation = post-training check using a test set; prevents overfitting / underfitting.
  • 4 validation techniques: Train-Test Split · K-Fold CV · Leave-One-Out CV · Time-Series CV.
  • Common splits: 80:20 · 70:30 · 67:33 — no single optimal ratio.
  • K-Fold CV — k rotating holdout folds; every row serves as validation once; more reliable but slower.
  • Train-Test vs CV: TT for large datasets (clear split); CV for small datasets (rotating folds).
  • Classification metrics: Confusion Matrix (TP/TN/FP/FN), Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = 2·P·R/(P+R), Accuracy = (TP+TN)/Total.
  • Regression metrics: MAE · MSE (penalises large errors) · RMSE (same units as target).
  • Worked example: 10 predicted/actual pairs → Sum of Sq Residuals = 125 → MSE = 12.5 · RMSE = 3.54.
  • Confusion-matrix example: TP 35 · TN 50 · FP 10 · FN 5 → Precision 77.8 % · Recall 87.5 % · F1 82.4 % · Accuracy 85 %.
  • Python pipeline: pandas → numpy → sklearn.model_selection.train_test_split → sklearn.linear_model.LinearRegression → sklearn.metrics.mean_squared_error.
©2026 VM Technologies · Vivek Maheshwari (MCA)