Introduction — John B. Rollins's 10-Step Methodology
The methodology presented here was put forward by John B. Rollins, a Data Scientist at IBM Analytics. It consists of 10 steps grouped into 5 modules, each containing two stages.
🔹 The Five Modules
- From Problem to Approach
- From Requirements to Collection
- From Understanding to Preparation
- From Modelling to Evaluation
- From Deployment to Feedback
🔹 Key Concepts You'll Learn
- Introduction to Data Science Methodology
- Steps for Data Science Methodology
- Model Validation Techniques
- Model Performance — Evaluation Metrics
Prerequisite: Class XI AI concepts + familiarity with Capstone Projects and their objectives.
1.1 Module 1 — From Problem to Approach
❓ Step 1 · Business Understanding
"What is the problem that you are trying to solve?"
- Understand the customer's problem by asking questions and discussing with stakeholders.
- Define objectives that support the customer's goal; this stage is also called Problem Scoping.
- Use the 5W1H Problem Canvas (Who, What, When, Where, Why, How) to deeply understand the issue.
- Apply the Design Thinking (DT) Framework.
🧭 Step 2 · Analytic Approach
"How can you use the data to answer the question?"
The team picks the most appropriate approach by asking:
- How much / how many? → Regression.
- Which category does the data belong to? → Classification.
- Can the data be grouped? → Clustering.
- Any unusual pattern? → Anomaly Detection.
- Which option should we give the customer? → Recommendation.
🔹 Four Main Types of Data Analytics
| Analytics | Focus | Purpose | Example |
|---|---|---|---|
| Descriptive | Summarise past data — what has happened | Identify patterns, trends, anomalies in past data (mean/median/mode, range/variance/SD; graphs/charts) | Average marks in an exam · last year's sales |
| Diagnostic | Why something happened | Root-cause analysis, hypothesis testing, correlation | Why did sales drop? Poor customer service? Low quality? |
| Predictive | What will happen (future) | Forecast future behaviour — regression, classification, clustering | Forecast next month's sales or demand |
| Prescriptive | What should we do | Recommend actions — optimisation, simulation, decision analysis | Right strategy to increase festival-season sales |
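The summary statistics listed under Descriptive analytics (mean, median, mode, range, standard deviation) can be sketched with Python's standard `statistics` module; the exam marks below are made up for illustration:

```python
import statistics

# Hypothetical exam marks for a descriptive-analytics summary
marks = [72, 85, 60, 85, 90, 55, 78, 85, 66, 74]

print("Mean:  ", statistics.mean(marks))        # central tendency
print("Median:", statistics.median(marks))      # middle value
print("Mode:  ", statistics.mode(marks))        # most frequent value
print("Range: ", max(marks) - min(marks))       # spread
print("SD:    ", round(statistics.stdev(marks), 2))  # sample standard deviation
```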
1.2 Module 2 — From Requirements to Collection
📋 Step 3 · Data Requirements
"What are the data requirements?"
Requirements are driven by the analytic approach chosen. Using the 5W1H method, identify:
- Types of data required — numbers, words, images.
- Structure — table, text file, database.
- Sources of data.
- Any cleaning / organisation steps needed.
🔹 Three Types of Data
- Structured data — organised in tables (customer databases).
- Unstructured data — no predefined structure (social media posts, images).
- Semi-structured data — some organisation (emails, XML files).
📥 Step 4 · Data Collection
"What occurs during data collection?" — a systematic process of gathering observations or measurements. After this step, the team decides if more or less data is required.
🔹 Primary Data Source
Original, raw, unprocessed data collected firsthand: direct observation, experimentation, surveys, interviews. Examples — marketing campaigns, feedback forms, IoT sensor data.
🔹 Secondary Data Source
Already-stored data ready to reuse: books, journals, websites, internal transactional databases. Methods: social-media tracking, web scraping, satellite data tracking, smart forms.
Popular online sources: data.gov · World Bank Open Data · UNICEF · Open Data Network · Kaggle · WHO · Google.
1.3 Module 3 — From Understanding to Preparation
🔎 Step 5 · Data Understanding
"Is the data collected representative of the problem to be solved?"
- Evaluate relevance, comprehensiveness and suitability for the problem.
- Apply descriptive statistics (univariate analysis, pairwise correlation).
- Use visualisation (histograms) to assess content, quality and initial insights.
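In practice, these Data Understanding checks are often done with pandas. A minimal sketch on an invented two-column dataset (the column names and values are illustrative, not prescribed by the chapter):

```python
import pandas as pd

# Hypothetical dataset: years of experience vs salary (invented values)
df = pd.DataFrame({
    "YearsExperience": [1.1, 2.0, 3.2, 4.5, 5.3, 6.8, 8.2, 9.5],
    "Salary": [39343, 43525, 54445, 61111, 83088, 91738, 113812, 116969],
})

print(df.describe())   # univariate summary statistics per column
print(df.corr())       # pairwise correlation between columns
# df.hist(column="Salary")  # histogram for a visual quality check (needs matplotlib)
```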
🧹 Step 6 · Data Preparation
"What additional work is required to manipulate and work with the data?" This is the most time-consuming step.
- Cleaning — handle invalid/missing values, remove duplicates, assign suitable formats.
- Combine data from multiple sources (archives, tables, platforms).
- Transform data into meaningful input variables.
🔹 Feature Engineering
• Age of house = Current year − Year built.
• Price per sq ft = Price / Area.
These derived features help the model make more accurate predictions.
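The two derived features above can be sketched in pandas; the house listings below are invented, and the current year is an assumption for the example:

```python
import pandas as pd

# Hypothetical house listings (columns and values invented for illustration)
houses = pd.DataFrame({
    "year_built": [1995, 2008, 2015],
    "price":      [250000, 420000, 530000],
    "area_sqft":  [1250, 2000, 2120],
})

CURRENT_YEAR = 2024  # assumption for this example

# Age of house = Current year - Year built
houses["age"] = CURRENT_YEAR - houses["year_built"]
# Price per sq ft = Price / Area
houses["price_per_sqft"] = houses["price"] / houses["area_sqft"]

print(houses)
```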
1.4 Module 4 — From Modelling to Evaluation
🤖 Step 7 · AI Modelling
"In what way can the data be visualised to get to the required answer?" Modelling is iterative — may loop back to Data Preparation. Data scientists test multiple algorithms to find the best fit.
- Descriptive Modelling — summarises & understands data without predicting. Uses Summary Statistics (mean, median, mode, SD, variance, range, percentiles/quartiles) and visualisations (bar charts, histograms, pie charts, box plots, scatter plots).
- Predictive Modelling — uses historical data to predict future outcomes. Techniques: regression, classification, time-series forecasting. Needs a training set (historical data with known outcomes) that acts as a gauge to calibrate the model.
📊 Step 8 · Evaluation
"Does the model really answer the initial question or does it need adjustment?" Use test data to measure accuracy, precision, recall, F1 score. Two phases:
- Diagnostic Measures — ensure the model works as intended. A decision tree can evaluate a predictive model; a testing set with known outcomes can refine a descriptive model.
- Statistical Significance Test — verifies the model accurately processes and interprets data. Avoids second-guessing when the answer is revealed.
1.5 Module 5 — From Deployment to Feedback
🚀 Step 9 · Deployment
"How does the solution reach the hands of the user?" Once evaluated, the trained model is made available to users — sometimes first to a limited group or test environment to build confidence, then rolled out fully. Deployment often involves additional internal teams, skills and technology.
💬 Step 10 · Feedback
"Is the problem solved? Has the question been satisfactorily answered?" Feedback comes from:
- Results of deployment.
- User and client feedback on model performance.
- Observations from how the model works in the live environment.
The process from Modelling → Feedback is highly iterative. Each step sets the stage for the next, making the methodology cyclical and ensuring refinement at every stage.
2.1 Solution Representation — What Fits the Problem?
The analytic approach chosen in Step 2 dictates how the solution is represented:
| Problem Type | Best Representation |
|---|---|
| Predict a continuous number | Regression line / scatter plot |
| Classify into categories | Decision tree · confusion matrix · classification report |
| Group similar data | Clusters shown on scatter plot; dendrogram (hierarchical clustering) |
| Detect unusual patterns | Box plot with outliers · anomaly-flag table |
| Recommend next-best action | Ranked list · decision matrix |
🔹 Represent with Stakeholders in Mind
- Business users → dashboards, simple charts.
- Technical team → metrics, code and model parameters.
- End users → one-line recommendations or alerts.
3.1 What Is Model Validation?
🔹 Benefits of Model Validation
- Enhances model quality.
- Reduces risk of errors.
- Prevents overfitting and underfitting.
3.2 Common Validation Techniques
- Train-Test Split
- K-Fold Cross Validation
- Leave-One-Out Cross Validation (LOOCV)
- Time-Series Cross Validation
3.3 Train-Test Split
A technique for evaluating any supervised learning algorithm — works for both classification and regression.
- Train dataset → used to fit the machine-learning model.
- Test dataset → used to evaluate predictions against expected values.
- Estimates model performance on new data.
🔹 Configuring the Split
Main parameter: the relative size of the train and test sets, expressed as a fraction between 0 and 1.
- Train 67 % → test size 0.33 → Train 0.67 / Test 0.33.
- Consider: computational cost of training & evaluating · representation of each set.
🔹 Common Split Percentages
- 80 : 20 (most common)
- 70 : 30
- 67 : 33
There is no single optimal split — pick one that meets the project's objectives.
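A split like 67:33 can be sketched with scikit-learn's `train_test_split`; the toy data below is invented for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)   # 20 toy samples, one feature
y = np.arange(20)                  # toy targets

# test_size=0.33 gives roughly a 67:33 train:test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, shuffle=True, random_state=42)

print(len(X_train), len(X_test))   # sizes of the two sets
```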
3.4 K-Fold Cross Validation
Splits data into k folds (subsets). The model is trained on some folds and tested on the remainder — the process is repeated so every fold is used as a holdout exactly once.
- Gives a more accurate measure of model quality — especially when many modelling decisions are made.
- Takes more time, since the model is trained and evaluated once per fold.
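K-fold cross validation can be sketched with scikit-learn's `KFold` and `cross_val_score`; the perfectly linear toy data below is invented so each fold scores near-perfectly:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

X = np.arange(20, dtype=float).reshape(-1, 1)
y = 3 * X.ravel() + 2            # simple linear relationship (toy data)

# 5 folds: each fold is used as the holdout set exactly once
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=kf)  # one R² per fold

print(scores)
print(scores.mean())             # average score across the 5 folds
```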
3.5 Train-Test Split vs Cross-Validation
| Train-Test Split | Cross-Validation |
|---|---|
| Normally applied on large datasets. | Normally applied on small datasets. |
| Divides data into train set + test set. | Divides into k folds; rotates through them. |
| Clear demarcation between train and test. | Every data point is used for testing exactly once and for training in the remaining folds. |
4.1 Why Evaluation Metrics?
Evaluation metrics help assess the performance of a trained model on the test set. They enable comparison between models to pick the best one.
- Classification problems have a finite set of target classes.
- Regression problems have a continuous target variable.
Hence, different metrics for each task.
4.2 Evaluation Metrics for Classification
📊 1. Confusion Matrix
| | Predicted: Yes | Predicted: No |
|---|---|---|
| Actual: Yes | TP — True Positive | FN — False Negative |
| Actual: No | FP — False Positive | TN — True Negative |
- True Positives (TP) — model predicted Yes, real was Yes.
- True Negatives (TN) — model predicted No, real was No.
- False Positives (FP) — model predicted Yes, but actually No.
- False Negatives (FN) — model predicted No, but actually Yes.
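The four cells can be computed with scikit-learn's `confusion_matrix`; the toy labels below are invented (1 = Yes, 0 = No):

```python
from sklearn.metrics import confusion_matrix

# Toy labels, invented for illustration: 1 = Yes, 0 = No
actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
```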
🎯 2. Precision
What proportion of predicted positives is truly positive?
Precision = TP / (TP + FP). Should be as high as possible.
🔁 3. Recall
What proportion of actual positives is correctly classified?
Recall = TP / (TP + FN).
⚖️ 4. F1 Score
The F1 score is the harmonic mean of precision and recall:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
A high F1 means both low false positives and low false negatives: real positives are caught without raising false alarms. F1 = 1 is perfect; F1 = 0 is a total failure.
✅ 5. Accuracy
Accuracy = (TP + TN) / (TP + FP + FN + TN) = number of correct predictions / total predictions.
4.3 Evaluation Metrics for Regression
📐 1. MAE — Mean Absolute Error
Average of the absolute differences between predictions and actual values. MAE = 0 means perfect predictions.
📐 2. MSE — Mean Square Error
The average of squared distances between the target and predicted values. Most commonly used metric for regression. Penalises larger errors more.
📐 3. RMSE — Root Mean Square Error
Standard deviation of the residuals (prediction errors). Preferred over MSE because it's in the same units as the target variable — easier to interpret.
4.4 Practical 1 — MSE & RMSE in MS Excel
Predicted vs Actual values for 10 data points:
| Predicted | Actual | Residual (Actual − Predicted) | Squared Residual |
|---|---|---|---|
| 14 | 17 | 3 | 9 |
| 19 | 18 | -1 | 1 |
| 17 | 18 | 1 | 1 |
| 13 | 15 | 2 | 4 |
| 12 | 18 | 6 | 36 |
| 7 | 11 | 4 | 16 |
| 24 | 20 | -4 | 16 |
| 23 | 18 | -5 | 25 |
| 17 | 13 | -4 | 16 |
| 18 | 19 | 1 | 1 |
| | | Sum of Squared Residuals | 125 |
| | | MSE = 125 / 10 | 12.5 |
| | | RMSE = √12.5 | 3.54 |
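The same worked example can be checked in a few lines of NumPy, using the predicted and actual values from the table above:

```python
import numpy as np

# Values taken from the worked example in the table
predicted = np.array([14, 19, 17, 13, 12, 7, 24, 23, 17, 18])
actual    = np.array([17, 18, 18, 15, 18, 11, 20, 18, 13, 19])

residuals = actual - predicted          # Residual = Actual - Predicted
mse = np.mean(residuals ** 2)           # mean of squared residuals
rmse = np.sqrt(mse)                     # square root of MSE

print("MSE: ", mse)
print("RMSE:", round(rmse, 2))
```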
4.5 Practical 2 — Classification Metrics from a Confusion Matrix
Given: TP = 35, TN = 50, FP = 10, FN = 5.
- Precision = 35 / (35 + 10) = 35/45 = 77.8 %.
- Recall = 35 / (35 + 5) = 35/40 = 87.5 %.
- F1 Score = 2 × (0.778 × 0.875) / (0.778 + 0.875) ≈ 82.4 %.
- Accuracy = (35 + 50) / (35 + 10 + 5 + 50) = 85/100 = 85 %.
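The same metrics can be verified with plain Python arithmetic, using the given counts:

```python
# Counts from the practical: TP = 35, TN = 50, FP = 10, FN = 5
tp, tn, fp, fn = 35, 50, 10, 5

precision = tp / (tp + fp)                          # TP / (TP + FP)
recall    = tp / (tp + fn)                          # TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)
accuracy  = (tp + tn) / (tp + tn + fp + fn)

print(f"Precision: {precision:.1%}")
print(f"Recall:    {recall:.1%}")
print(f"F1 score:  {f1:.1%}")
print(f"Accuracy:  {accuracy:.1%}")
```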
4.6 Practical 3 — Python Code to Evaluate a Model
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load and inspect the data
df = pd.read_csv('Salary_Data.csv', sep=',')
print(df.head())
print(df.shape)           # (30, 2)
print(df.isnull().sum())  # no nulls

# Data Preparation
X = np.array(df['YearsExperience']).reshape(-1, 1)
Y = np.array(df['Salary']).reshape(-1, 1)
print(X.shape, Y.shape)

# Train-Test Split
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, shuffle=True, random_state=10)

# Fit the model
model = LinearRegression()
model.fit(X_train, Y_train)

# Score
print("Train R²:", model.score(X_train, Y_train))
print("Test R²:", model.score(X_test, Y_test))

# Mean Squared Error
Y_pred = model.predict(X_test)
print("Mean Squared Error:", mean_squared_error(Y_test, Y_pred))
```
- Hardest stage in Data Science Methodology → Business Understanding.
- Business Sponsors define the problem from a Business perspective.
- Data Modelling focuses on Predictive or Descriptive models.
- "No optimal split percentage" is True; the most common split is 80:20 (train : test).
- train_test_split is imported from sklearn.model_selection.
- Cross-validation is preferred with small data and gives a more reliable measure — but it does not take short time (it takes longer).
- Identifying data content, format, sources → Step 3 Data Requirements.
- "Edge" is not an online data source (UNICEF, WHO, Google are).
- Historical data with known outcomes = Training set.
- Data set to evaluate the fit model = Test set.
- In test_size=0.2, the training size is 0.8 (80 %).
- In k-fold CV, k = number of subsets / folds.
- MSE is Mean Squared Error (not Median) · used for regression · penalises large errors.
Quick Revision — Key Points to Remember
- Data Science Methodology = iterative framework for building AI solutions; proposed by John B. Rollins (IBM Analytics).
- 10 steps · 5 modules — each module has 2 stages.
- Module 1 — Problem to Approach: Business Understanding (5W1H, DT Framework, Problem Scoping) → Analytic Approach (Regression / Classification / Clustering / Anomaly / Recommendation).
- 4 types of analytics: Descriptive · Diagnostic · Predictive · Prescriptive.
- Module 2 — Requirements to Collection: Data Requirements (types, format, source, prep) → Data Collection (Primary + Secondary sources — data.gov, World Bank, UNICEF, Kaggle, WHO, Google).
- 3 data types: Structured · Unstructured · Semi-structured.
- Module 3 — Understanding to Preparation: Data Understanding (descriptive stats, histograms) → Data Preparation (clean, combine, transform + Feature Engineering). Prep is the most time-consuming step.
- Module 4 — Modelling to Evaluation: Descriptive vs Predictive modelling → Evaluation (Diagnostic Measures + Statistical Significance Test).
- Module 5 — Deployment to Feedback: Deployment (test environment → full rollout) → Feedback (iterative refinement).
- Model Validation = post-training check using a test set; prevents overfitting / underfitting.
- 4 validation techniques: Train-Test Split · K-Fold CV · Leave-One-Out CV · Time-Series CV.
- Common splits: 80:20 · 70:30 · 67:33 — no single optimal ratio.
- K-Fold CV — k rotating holdout folds; every row serves as validation once; more reliable but slower.
- Train-Test vs CV: TT for large datasets (clear split); CV for small datasets (rotating folds).
- Classification metrics: Confusion Matrix (TP/TN/FP/FN), Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = 2·P·R/(P+R), Accuracy = (TP+TN)/Total.
- Regression metrics: MAE · MSE (penalises large errors) · RMSE (same units as target).
- Worked example: 10 predicted/actual pairs → Sum of Sq Residuals = 125 → MSE = 12.5 · RMSE = 3.54.
- Confusion-matrix example: TP 35 · TN 50 · FP 10 · FN 5 → Precision 77.8 % · Recall 87.5 % · F1 82.4 % · Accuracy 85 %.
- Python pipeline: pandas → numpy → sklearn.model_selection.train_test_split → sklearn.linear_model.LinearRegression → sklearn.metrics.mean_squared_error.