VM-LEARNING /class.xii ·track.ai ·ch-b2 session: 2026_27
PART B ▪ UNIT 2
02
Data Science Methodology
An Analytic Approach to the Capstone Project
Data Science Methodology is a process with a prescribed sequence of iterative steps that data scientists follow to approach a problem and find a solution. It builds the capacity to handle and interpret data, and it organises an AI project systematically, saving both time and cost.

Introduction — John B. Rollins's 10-Step Methodology

The methodology presented here was put forward by John B. Rollins, a Data Scientist at IBM Analytics. It consists of 10 steps grouped into 5 modules, each containing two stages.

🔹 The Five Modules
  1. From Problem to Approach
  2. From Requirements to Collection
  3. From Understanding to Preparation
  4. From Modelling to Evaluation
  5. From Deployment to Feedback
🔹 Key Concepts You'll Learn
  1. Introduction to Data Science Methodology
  2. Steps for Data Science Methodology
  3. Model Validation Techniques
  4. Model Performance — Evaluation Metrics

Prerequisite: Class XI AI concepts + familiarity with Capstone Projects and their objectives.

Learning Outcome 1: Integrate Data Science Methodology steps into the Capstone Project

1.1 Module 1 — From Problem to Approach

❓ Step 1 · Business Understanding

"What is the problem that you are trying to solve?"

Capstone Scenario — Mr Pavan Sankar visits a food festival and cannot identify cuisines of unfamiliar dishes (Bubble & Squeak · Phan Pyat · Jadoh). The problem: "Given the ingredients / image of a dish, predict its cuisine."

🧭 Step 2 · Analytic Approach

"How can you use the data to answer the question?"

The team picks the most appropriate approach by asking:

  1. How much / how many? → Regression.
  2. Which category does the data belong to? → Classification.
  3. Can the data be grouped? → Clustering.
  4. Any unusual pattern? → Anomaly Detection.
  5. Which option should we give the customer? → Recommendation.
🔹 Four Main Types of Data Analytics
| Analytics | Focus | Purpose | Example |
|---|---|---|---|
| Descriptive | Summarise past data (what has happened) | Identify patterns, trends, anomalies in past data (mean/median/mode, range/variance/SD; graphs/charts) | Average marks in an exam · last year's sales |
| Diagnostic | Why something happened | Root-cause analysis, hypothesis testing, correlation | Why did sales drop? Poor customer service? Low quality? |
| Predictive | What will happen (future) | Forecast future behaviour: regression, classification, clustering | Forecast next month's sales or demand |
| Prescriptive | What should we do | Recommend actions: optimisation, simulation, decision analysis | Right strategy to increase festival-season sales |
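Descriptive analytics is the easiest type to try yourself. The sketch below computes the summary statistics named in the table (mean, median, mode, range, standard deviation) for a hypothetical list of exam marks; the marks are illustrative, not from the chapter.

```python
# Descriptive analytics on a hypothetical list of exam marks.
import statistics

marks = [72, 85, 60, 85, 78, 90, 66, 85, 74, 81]

print("Mean:  ", statistics.mean(marks))            # central tendency
print("Median:", statistics.median(marks))          # middle value when sorted
print("Mode:  ", statistics.mode(marks))            # most frequent value
print("Range: ", max(marks) - min(marks))           # spread of the data
print("StDev: ", round(statistics.stdev(marks), 2)) # sample standard deviation
```

Running this summarises what *has* happened in the data, which is exactly the scope of descriptive analytics; the other three types build on such summaries.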

1.2 Module 2 — From Requirements to Collection

📋 Step 3 · Data Requirements

"What are the data requirements?"

Requirements are driven by the analytic approach chosen. Using the 5W1H method (Who, What, When, Where, Why, How), identify what data is needed, in what format, from which sources, and how it will be prepared.

🔹 Three Types of Data

  1. Structured: organised in rows and columns (spreadsheets, relational databases).
  2. Unstructured: no fixed format (text, images, audio, video).
  3. Semi-structured: partially organised with tags or markers (JSON, XML, e-mail).

📥 Step 4 · Data Collection

"What occurs during data collection?" — a systematic process of gathering observations or measurements. After this step, the team decides if more or less data is required.

🔹 Primary Data Source

Original, raw, unprocessed data collected firsthand: direct observation, experimentation, surveys, interviews. Examples — marketing campaigns, feedback forms, IoT sensor data.

🔹 Secondary Data Source

Already-stored data ready to reuse: books, journals, websites, internal transactional databases. Methods: social-media tracking, web scraping, satellite data tracking, smart forms.

Popular online sources: data.gov · World Bank Open Data · UNICEF · Open Data Network · Kaggle · WHO · Google.

1.3 Module 3 — From Understanding to Preparation

🔎 Step 5 · Data Understanding

"Is the data collected representative of the problem to be solved?"

🧹 Step 6 · Data Preparation

"What additional work is required to manipulate and work with the data?" This is the most time-consuming step.

🔹 Feature Engineering
Feature Engineering — the process of selecting, modifying or creating new features (variables) from raw data to improve the performance of machine-learning models.
To predict house prices from raw data (area, bedrooms, year built), create new features:
Age of house = Current year − Year built.
Price per sq ft = Price / Area.
These derived features help the model make more accurate predictions.
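The house-price example above can be sketched in pandas. The column names (`Area`, `YearBuilt`, `Price`) and the sample values are illustrative assumptions, not data from the chapter:

```python
# Feature engineering sketch for the house-price example:
# derive AgeOfHouse and PricePerSqFt from the raw columns.
import pandas as pd

df = pd.DataFrame({
    "Area":      [1200, 1500, 900],        # sq ft (illustrative values)
    "YearBuilt": [2005, 2012, 1998],
    "Price":     [6_000_000, 8_250_000, 4_050_000],
})

current_year = 2026
df["AgeOfHouse"]   = current_year - df["YearBuilt"]  # Age = current year - year built
df["PricePerSqFt"] = df["Price"] / df["Area"]        # Price / Area

print(df)
```

A model trained on `AgeOfHouse` and `PricePerSqFt` alongside the raw columns often captures price patterns that the raw year and area values alone would hide.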

1.4 Module 4 — From Modelling to Evaluation

🤖 Step 7 · AI Modelling

"In what way can the data be visualised to get to the required answer?" Modelling is iterative — may loop back to Data Preparation. Data scientists test multiple algorithms to find the best fit.

📊 Step 8 · Evaluation

"Does the model really answer the initial question or does it need adjustment?" Use test data to measure accuracy, precision, recall, F1 score. Two phases:

  1. Diagnostic Measures — ensure the model works as intended. A decision tree can evaluate a predictive model; a testing set with known outcomes can refine a descriptive model.
  2. Statistical Significance Test — verifies the model accurately processes and interprets data. Avoids second-guessing when the answer is revealed.

1.5 Module 5 — From Deployment to Feedback

🚀 Step 9 · Deployment

"How does the solution reach the hands of the user?" Once evaluated, the trained model is made available to users — sometimes first to a limited group or test environment to build confidence, then rolled out fully. Deployment often involves additional internal teams, skills and technology.

💬 Step 10 · Feedback

"Is the problem solved? Has the question been satisfactorily answered?" Feedback is gathered from users and stakeholders once the solution is in use, and it drives further refinement of the model.

The process from Modelling → Feedback is highly iterative. Each step sets the stage for the next, making the methodology cyclical and ensuring refinement at every stage.

Learning Outcome 2: Identify the best way to represent a solution to a problem

2.1 Solution Representation — What Fits the Problem?

The analytic approach chosen in Step 2 dictates how the solution is represented:

| Problem Type | Best Representation |
|---|---|
| Predict a continuous number | Regression line / scatter plot |
| Classify into categories | Decision tree · confusion matrix · classification report |
| Group similar data | Clusters shown on scatter plot; dendrogram (hierarchical clustering) |
| Detect unusual patterns | Box plot with outliers · anomaly-flag table |
| Recommend next-best action | Ranked list · decision matrix |
🔹 Represent with Stakeholders in Mind

Choose the representation that suits the audience: detailed metrics for the technical team, summary visuals for business sponsors.

Learning Outcome 3: Understand the importance of validating machine learning models

3.1 What Is Model Validation?

Model Validation — the step conducted after Model Training, where the effectiveness of the trained model is assessed using a testing dataset. It offers a systematic way to measure accuracy and reliability and shows how well the model generalises to new, unseen data.
🔹 Benefits of Model Validation

It helps detect overfitting and underfitting, builds confidence in the model's reliability, and guides the choice between candidate models.

3.2 Common Validation Techniques

  1. Train-Test Split
  2. K-Fold Cross Validation
  3. Leave-One-Out Cross Validation
  4. Time-Series Cross Validation

3.3 Train-Test Split

A technique for evaluating any supervised learning algorithm — works for both classification and regression.

🔹 Configuring the Split

Main parameter: the relative size of the train and test sets, expressed as a fraction between 0 and 1 (e.g. test_size = 0.2 reserves 20% of the data for testing).

🔹 Common Split Percentages

  • 80% train : 20% test
  • 70% train : 30% test
  • 67% train : 33% test

There is no single optimal split — pick one that meets the project's objectives.
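As a minimal sketch, an 80:20 split with scikit-learn's train_test_split looks like this; X and y are toy arrays, and random_state simply makes the shuffle reproducible:

```python
# An 80:20 train-test split on 10 toy samples.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(-1, 1)   # 10 samples, 1 feature
y = np.arange(10)                  # 10 target values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)

print(len(X_train), len(X_test))   # 8 2
```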

3.4 K-Fold Cross Validation

Splits data into k folds (subsets). The model is trained on some folds and tested on the remainder — the process is repeated so every fold is used as a holdout exactly once.

With k = 5, each fold = 20% of data. Experiment 1 holds out fold 1, trains on folds 2–5. Experiment 2 holds out fold 2 … Experiment 5 holds out fold 5. 100% of the data is used as holdout at some point.
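The five experiments above can be sketched with scikit-learn's KFold on 10 toy samples; each of the 5 iterations holds out a different 20% fold:

```python
# 5-fold cross-validation: each experiment holds out a different fold.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)                  # 10 toy samples
kf = KFold(n_splits=5)

for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Experiment {i}: holdout fold = {test_idx}")
```

After all five experiments, every sample has appeared in a holdout fold exactly once, which is what makes the resulting performance estimate more reliable than a single split.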

3.5 Train-Test Split vs Cross-Validation

| Train-Test Split | Cross-Validation |
|---|---|
| Normally applied on large datasets. | Normally applied on small datasets. |
| Divides data into one train set and one test set. | Divides data into k folds and rotates through them. |
| Clear demarcation between train and test data. | Every data point serves in both training and testing at some stage. |
Learning Outcome 4: Use key evaluation metrics for various machine learning tasks

4.1 Why Evaluation Metrics?

Evaluation metrics help assess the performance of a trained model on the test set and enable comparison between models to pick the best one. Classification and regression answer different kinds of questions, so each task has its own set of metrics.

4.2 Evaluation Metrics for Classification

📊 1. Confusion Matrix

Confusion Matrix — an N × N table summarising predictions against actual outcomes (N = number of classes). For a binary problem, N = 2 → a 2 × 2 matrix.
|  | Predicted: Yes | Predicted: No |
|---|---|---|
| Actual: Yes | TP (True Positive) | FN (False Negative) |
| Actual: No | FP (False Positive) | TN (True Negative) |
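The same 2 × 2 table can be computed with scikit-learn's confusion_matrix on toy labels. One caution: scikit-learn sorts classes as [0, 1], so the returned matrix reads [[TN, FP], [FN, TP]], whereas the table above lists the "Yes" row first:

```python
# Computing a 2x2 confusion matrix with scikit-learn on toy labels.
from sklearn.metrics import confusion_matrix

y_actual    = [1, 1, 1, 0, 0, 0, 1, 0]
y_predicted = [1, 0, 1, 0, 1, 0, 1, 0]

cm = confusion_matrix(y_actual, y_predicted)
tn, fp, fn, tp = cm.ravel()        # sklearn's row order: [[TN, FP], [FN, TP]]
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")   # TP=3 FN=1 FP=1 TN=3
```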

🎯 2. Precision

What proportion of predicted positives is truly positive?

Precision = TP / (TP + FP). Should be as high as possible.

🔁 3. Recall

What proportion of actual positives is correctly classified?

Recall = TP / (TP + FN).

⚖️ 4. F1 Score

The F1 score is the harmonic mean of precision and recall. A good F1 score means both low false positives and low false negatives: the model correctly identifies real positives without raising false alarms. F1 = 1 is perfect; F1 = 0 is a total failure.

F1 = 2 × (Precision × Recall) / (Precision + Recall)

✅ 5. Accuracy

Accuracy = (TP + TN) / (TP + FP + FN + TN) = number of correct predictions / total predictions.

4.3 Evaluation Metrics for Regression

📐 1. MAE — Mean Absolute Error

The average of the absolute differences between predicted and actual values. MAE = 0 means perfect predictions.

📐 2. MSE — Mean Square Error

The average of squared distances between the target and predicted values. Most commonly used metric for regression. Penalises larger errors more.

📐 3. RMSE — Root Mean Square Error

Standard deviation of the residuals (prediction errors). Preferred over MSE because it's in the same units as the target variable — easier to interpret.

4.4 Practical 1 — MSE & RMSE in MS Excel

Predicted vs Actual values for 10 data points:

| Predicted | Actual | Residual (Actual − Predicted) | Squared Residual |
|---|---|---|---|
| 14 | 17 | 3 | 9 |
| 19 | 18 | -1 | 1 |
| 17 | 18 | 1 | 1 |
| 13 | 15 | 2 | 4 |
| 12 | 18 | 6 | 36 |
| 7 | 11 | 4 | 16 |
| 24 | 20 | -4 | 16 |
| 23 | 18 | -5 | 25 |
| 17 | 13 | -4 | 16 |
| 18 | 19 | 1 | 1 |

Sum of Squared Residuals = 125
MSE = 125 / 10 = 12.5
RMSE = √12.5 ≈ 3.54
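The Excel result can be cross-checked in a few lines of Python using the same 10 pairs:

```python
# Cross-checking the Excel calculation: MSE and RMSE for the 10 pairs.
import math

predicted = [14, 19, 17, 13, 12, 7, 24, 23, 17, 18]
actual    = [17, 18, 18, 15, 18, 11, 20, 18, 13, 19]

sq_residuals = [(a - p) ** 2 for p, a in zip(predicted, actual)]
mse  = sum(sq_residuals) / len(sq_residuals)
rmse = math.sqrt(mse)

print("Sum of squared residuals:", sum(sq_residuals))  # 125
print("MSE: ", mse)                                    # 12.5
print("RMSE:", round(rmse, 2))                         # 3.54
```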

4.5 Practical 2 — Classification Metrics from a Confusion Matrix

Given: TP = 35, TN = 50, FP = 10, FN = 5.
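Working through these counts with the formulas from Section 4.2:

```python
# Precision, Recall, F1 and Accuracy from the given confusion-matrix counts.
tp, tn, fp, fn = 35, 50, 10, 5

precision = tp / (tp + fp)                         # 35 / 45
recall    = tp / (tp + fn)                         # 35 / 40
f1        = 2 * precision * recall / (precision + recall)
accuracy  = (tp + tn) / (tp + tn + fp + fn)        # 85 / 100

print(f"Precision: {precision:.3f}")   # 0.778
print(f"Recall:    {recall:.3f}")      # 0.875
print(f"F1 score:  {f1:.3f}")          # 0.824
print(f"Accuracy:  {accuracy:.2f}")    # 0.85
```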

4.6 Practical 3 — Python Code to Evaluate a Model

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('Salary_Data.csv', sep=',')
print(df.head())
print(df.shape)           # (30, 2)
print(df.isnull().sum())  # no nulls

# Data Preparation
X = np.array(df['YearsExperience']).reshape(-1, 1)
Y = np.array(df['Salary']).reshape(-1, 1)
print(X.shape, Y.shape)

# Train-Test Split
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, shuffle=True, random_state=10)

# Fit the model
model = LinearRegression()
model.fit(X_train, Y_train)

# Score (R² on train and test sets)
print("Train R²:", model.score(X_train, Y_train))
print("Test  R²:", model.score(X_test, Y_test))

# Mean Squared Error on the test set
y_pred = model.predict(X_test)
print("Mean Squared Error:", mean_squared_error(Y_test, y_pred))

Capstone practical set: (a) compute MSE & RMSE in MS Excel for the 10-pair table, (b) compute Precision / Recall / F1 / Accuracy from the given confusion matrix, (c) on Salary_Data.csv, build a Linear Regression, do 80:20 split, and report training & testing R² and MSE.
Check Your Progress — quick MCQ pointers:
  • Hardest stage in Data Science Methodology → Business Understanding (a).
  • Business Sponsors define the problem from a Business perspective.
  • Data Modelling focuses on Predictive or Descriptive models.
  • "No optimal split percentage" is True; the most common split is 80:20 (train : test).
  • train_test_split is imported from sklearn.model_selection.
  • Cross-validation is preferred with small data and gives a more reliable measure — but it does not take short time (it takes longer).
  • Identifying data content, format, sources → Step 3 Data Requirements.
  • "Edge" is not an online data source (UNICEF, WHO, Google are).
  • Historical data with known outcomes = Training set.
  • Data set to evaluate the fit model = Test set.
  • In test_size=0.2, the training size is 0.8 (80 %).
  • In k-fold CV, k = number of subsets / folds.
  • MSE is Mean Squared Error (not Median) · used for regression · penalises large errors.

Quick Revision — Key Points to Remember

  • Data Science Methodology = iterative framework for building AI solutions; proposed by John B. Rollins (IBM Analytics).
  • 10 steps · 5 modules — each module has 2 stages.
  • Module 1 — Problem to Approach: Business Understanding (5W1H, DT Framework, Problem Scoping) → Analytic Approach (Regression / Classification / Clustering / Anomaly / Recommendation).
  • 4 types of analytics: Descriptive · Diagnostic · Predictive · Prescriptive.
  • Module 2 — Requirements to Collection: Data Requirements (types, format, source, prep) → Data Collection (Primary + Secondary sources — data.gov, World Bank, UNICEF, Kaggle, WHO, Google).
  • 3 data types: Structured · Unstructured · Semi-structured.
  • Module 3 — Understanding to Preparation: Data Understanding (descriptive stats, histograms) → Data Preparation (clean, combine, transform + Feature Engineering). Prep is the most time-consuming step.
  • Module 4 — Modelling to Evaluation: Descriptive vs Predictive modelling → Evaluation (Diagnostic Measures + Statistical Significance Test).
  • Module 5 — Deployment to Feedback: Deployment (test environment → full rollout) → Feedback (iterative refinement).
  • Model Validation = post-training check using a test set; prevents overfitting / underfitting.
  • 4 validation techniques: Train-Test Split · K-Fold CV · Leave-One-Out CV · Time-Series CV.
  • Common splits: 80:20 · 70:30 · 67:33 — no single optimal ratio.
  • K-Fold CV — k rotating holdout folds; every row serves as validation once; more reliable but slower.
  • Train-Test vs CV: TT for large datasets (clear split); CV for small datasets (rotating folds).
  • Classification metrics: Confusion Matrix (TP/TN/FP/FN), Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = 2·P·R/(P+R), Accuracy = (TP+TN)/Total.
  • Regression metrics: MAE · MSE (penalises large errors) · RMSE (same units as target).
  • Worked example: 10 predicted/actual pairs → Sum of Sq Residuals = 125 → MSE = 12.5 · RMSE = 3.54.
  • Confusion-matrix example: TP 35 · TN 50 · FP 10 · FN 5 → Precision 77.8 % · Recall 87.5 % · F1 82.3 % · Accuracy 85 %.
  • Python pipeline: pandas → numpy → sklearn.model_selection.train_test_split → sklearn.linear_model.LinearRegression → sklearn.metrics.mean_squared_error.
🧠 Practice Quiz — test yourself on this chapter