VM-LEARNING /class.x ·track.ai ·ch-b3 session: 2026_27
PART B ▪ UNIT 3
08
Evaluating Models
Train-Test Split · Accuracy · Confusion Matrix · Precision · Recall · F1
Model Evaluation is the process of using different evaluation metrics to understand a machine-learning model's performance. It helps us find the best model that represents our data and estimates how well the chosen model will work in the future.
In the AI Project Cycle, after building a model in the Modeling stage, we must evaluate it before deploying. Evaluation is the report card of the AI model — it tells us where the model is strong, where it is weak, and where we need to improve.

Introduction — Why Evaluate?

So far we have learnt about four stages of the AI Project Cycle — Problem Scoping, Data Acquisition, Data Exploration, and Modeling. While modeling we can build different types of models — but how do we check whether one is better than another? That's where Evaluation comes in.

Learning Outcome 1: Understand the role of evaluation in AI development

3.1 Importance of Model Evaluation

📋 1. What is Evaluation?

Analogy — School Report Card: Your academic performance is measured by grades, percentage, percentile, rank. Each parameter tells you where to work more to do better. Model evaluation is the report card of your AI model.

🎯 2. Why Do We Need Model Evaluation?

  • Find strengths — where the model is already good.
  • Find weaknesses — where the model makes mistakes.
  • Check suitability for the task at hand.
  • Build trustworthy and reliable AI systems.
  • Enable feedback loops — essential for continuous improvement.
  • Estimate how the model will perform on new, unseen data.
  • Compare multiple candidate models and pick the best one.
Learning Outcome 2: Understand Train-Test split method

3.2 Splitting the Training Set Data for Evaluation — Train-Test Split

Train-Test Split is a technique for evaluating the performance of a machine-learning algorithm. The procedure takes a dataset and divides it into two subsets — the training dataset and the testing dataset. It can be used for any supervised-learning algorithm when a sufficiently large dataset is available.
🔹 How Train-Test Split Works
🔀 TRAIN-TEST SPLIT
Full Dataset → Training Set (≈70-80%, used to make the model learn) + Testing Set (≈20-30%, used to check the model's predictions)
🔹 Need for Train-Test Split
Never evaluate a model on the same data used to train it! The model will simply remember the whole training set and will always predict the correct label for any point in it. This is called overfitting — a model that looks perfect on training data but fails on new data.
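The split described above can be sketched in a few lines of plain Python (in practice, libraries such as scikit-learn provide a ready-made `train_test_split`); the 10-point dataset below is made-up illustration data:

```python
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    """Shuffle the dataset and split it into (training, testing) subsets."""
    rng = random.Random(seed)        # fixed seed so the split is reproducible
    shuffled = data[:]               # copy so the original list is untouched
    rng.shuffle(shuffled)
    test_size = int(len(shuffled) * test_ratio)
    return shuffled[test_size:], shuffled[:test_size]

dataset = list(range(1, 11))         # a toy dataset of 10 points
train, test = train_test_split(dataset, test_ratio=0.2)
print(len(train), len(test))         # 8 2
```

Because the split is shuffled, the test set is a random sample of the data — the model never sees it during training, which is exactly what guards against overfitting.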
Learning Outcome 3: Understand Accuracy and Error

3.3 Accuracy and Error

Intro Puzzle — Bob & Billy at a Concert: The concert entry fee was ₹500. Bob brought ₹300, Billy brought ₹550.
• Who is more accurate? Billy (off by only ₹50 vs Bob's ₹200).
• Error for Bob = |500 − 300| = ₹200. Error for Billy = |500 − 550| = ₹50.
✅ Accuracy

An evaluation metric that measures the proportion of predictions a model gets right. Higher accuracy = better performance. Accuracy ∝ Performance.

❌ Error

The difference between a model's prediction and the actual outcome. Error quantifies how often, and by how much, the model makes mistakes.

🔹 Accuracy & Error in a Medical Example
Training a model to predict if a patient has a disease:
Error: The model predicts "no disease" but the patient actually has the disease → an error.
Accuracy: The fraction of patients for whom the model's disease/no-disease prediction is correct; if every prediction in a period is right, the model has 100 % accuracy for that period.

3.4 Calculating Accuracy — House Price Example

Error = |Actual − Predicted|
Error Rate = Error / Actual
Accuracy = 1 − Error Rate
Accuracy % = Accuracy × 100%
Predicted (USD) | Actual (USD) | Error | Error Rate | Accuracy | Accuracy %
391 k | 402 k | 11 k | 11/402 = 0.027 | 0.973 | 97.3 %
453 k | 488 k | 35 k | 0.072 | 0.928 | 92.8 %
125 k | 97 k | 28 k | 0.289 | 0.711 | 71.1 %
871 k | 907 k | 36 k | 0.040 | 0.960 | 96.0 %
322 k | 425 k | 103 k | 0.242 | 0.758 | 75.8 %

Mean Accuracy = (97.3 + 92.8 + 71.1 + 96.0 + 75.8) / 5 ≈ 86.6 % overall accuracy.
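The four formulas above can be applied row by row in Python; the prices are the same made-up values (in thousands of USD) as in the table:

```python
# Accuracy from error, following the formulas above (prices in thousands of USD)
predicted = [391, 453, 125, 871, 322]
actual    = [402, 488,  97, 907, 425]

accuracies = []
for p, a in zip(predicted, actual):
    error = abs(a - p)          # Error = |Actual − Predicted|
    error_rate = error / a      # Error Rate = Error / Actual
    accuracy = 1 - error_rate   # Accuracy = 1 − Error Rate
    accuracies.append(accuracy * 100)

mean_accuracy = sum(accuracies) / len(accuracies)
print(round(mean_accuracy, 1))  # 86.6
```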

Learning Outcome 4: Evaluation metrics for Classification — Confusion Matrix, Precision, Recall, F1

3.5 Evaluation Metrics for Classification

🎯 1. What is Classification?

Classification refers to a problem where a specific type of class label is the result to be predicted from the input data.
Supermarket trolleys: Place fruits and vegetables in one trolley, grocery items (bread, oil, egg) in another. You are classifying items into two classes — fruits/vegetables vs grocery.
🔹 Popular Metrics for Classification
1. Confusion Matrix: A table showing prediction vs reality — the foundation of all classification metrics.
2. Classification Accuracy: Ratio of correct predictions to total predictions.
3. Precision: Of all predicted positives, how many are really positive?
4. Recall: Of all actual positives, how many did we catch?
5. F1 Score: Combined balance of Precision and Recall.

3.6 The Confusion Matrix

The Confusion Matrix is a handy presentation of the accuracy of a model with two or more classes. The table presents actual values on the y-axis and predicted values on the x-axis. Each cell shows the number of predictions that fall into that category.

For a binary classifier that predicts 0 or 1:

True Positive (TP): Actual = 1 & Predicted = 1
Model correctly predicted positive.
False Positive (FP): Actual = 0 but Predicted = 1
Model wrongly predicted a negative as positive.
False Negative (FN): Actual = 1 but Predicted = 0
Model wrongly predicted a positive as negative.
True Negative (TN): Actual = 0 & Predicted = 0
Model correctly predicted negative.
🔹 TP / TN / FP / FN — Real-World Examples
Term | Football World Cup | Disease Diagnosis
TP | Predicted France would win, and it won. | Predicted "yes" (has disease), and the patient actually has it.
TN | Predicted Germany would not win, and it lost. | Predicted "no" (no disease), and the patient actually doesn't have it.
FP | Predicted Germany would win, but it lost. | Predicted "yes" but the patient doesn't have the disease. (False alarm)
FN | Predicted France would not win, but it won. | Predicted "no" but the patient actually has the disease. (Missed case)

🔨 Build a Confusion Matrix from Scratch

Given data (10 predictions): Count actual YES/NO vs predicted YES/NO for each row. Fill the 4 cells:
• Top-left: Both columns YES → TP count
• Top-right: Actual YES, Predicted NO → FN count
• Bottom-left: Actual NO, Predicted YES → FP count
• Bottom-right: Both columns NO → TN count

Example: 7 correct out of 10 → TP + TN = 7, FP + FN = 3.
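The cell-filling procedure above is just counting; the ten label pairs below are made-up illustration data, chosen so that 7 of the 10 predictions are correct:

```python
# Count TP / FN / FP / TN from paired actual vs predicted labels
actual    = ["YES", "YES", "NO", "YES", "NO", "YES", "NO", "NO", "YES", "NO"]
predicted = ["YES", "NO",  "NO", "YES", "YES", "YES", "NO", "YES", "YES", "NO"]

tp = sum(1 for a, p in zip(actual, predicted) if a == "YES" and p == "YES")
fn = sum(1 for a, p in zip(actual, predicted) if a == "YES" and p == "NO")
fp = sum(1 for a, p in zip(actual, predicted) if a == "NO"  and p == "YES")
tn = sum(1 for a, p in zip(actual, predicted) if a == "NO"  and p == "NO")

print(tp, fn, fp, tn)   # 4 1 2 3  — the four cells of the confusion matrix
```

Here TP + TN = 7 correct and FP + FN = 3 wrong, matching the example.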

3.7 Accuracy from Confusion Matrix

Classification Accuracy = Correct Predictions / Total Predictions
= (TP + TN) / (TP + TN + FP + FN)
🔹 Can We Use Accuracy All The Time?
Accuracy is only suitable when there is a balanced dataset (equal observations in each class) AND when all prediction errors are equally important. In practice, datasets are rarely balanced.
Unbalanced Dataset Problem:
• Test data = 1000 points → 900 "Yes", 100 "No" (unbalanced).
• A faulty model predicts "Yes" for everything.
• TP = 900, TN = 0, FP = 100, FN = 0.
• Accuracy = (900 + 0) / 1000 = 0.90 = 90 %.
• Yet the model never predicts a single "No" correctly — it's useless for detecting negatives!
So, for unbalanced datasets, we should use other metrics — Precision, Recall, or F1 Score.
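The trap described above is easy to verify numerically, using the counts from the faulty always-"Yes" model:

```python
# The unbalanced-dataset trap: a model that always predicts "Yes"
tp, tn, fp, fn = 900, 0, 100, 0          # counts from the example above

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall_of_no = tn / (tn + fp)            # how many actual "No" cases were caught

print(accuracy)      # 0.9 — looks impressive
print(recall_of_no)  # 0.0 — but not a single "No" is detected
```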

3.8 Precision

Precision is the ratio of correctly classified positive examples to the total number of predicted positive examples.
Precision = TP / (TP + FP)

For example, Precision = 0.843 means: when our model predicts a patient has heart disease, it is correct about 84 % of the time.

🔹 When to Use Precision?

Use Precision with unbalanced datasets when False Positives are costly — the model must reduce FPs as much as possible.

Precision Use Case — Satellite Launch Based on Weather:
Positive class = favorable weather day. Negative class = non-favorable day.
• Missing a good weather day (low recall) is OK — we can wait.
• But predicting a BAD weather day as a GOOD day (False Positive) to launch the satellite can be disastrous!
→ So False Positives must be minimised → use Precision.

3.9 Recall

Recall is the measure of our model correctly identifying True Positives. Of all patients who actually have heart disease, how many did we correctly identify?
Recall = TP / (TP + FN)

Recall = 0.86 means: out of the total patients who have heart disease, 86 % have been correctly identified.

Recall is also called Sensitivity or True Positive Rate.
🔹 When to Use Recall?

Use Recall with unbalanced datasets when False Negatives are costly — the model must reduce FNs as much as possible.

Recall Use Case — COVID-19 Detection:
Positive = COVID-19 affected. Negative = non-affected.
• If a COVID-19 patient (Positive) is wrongly predicted as non-affected (Negative) — a False Negative — the patient gets no treatment and may infect many others.
→ So False Negatives must be minimised → use Recall.

3.10 F1 Score

F1 Score provides a way to combine both Precision and Recall into a single measure that captures both properties. Use F1 when the dataset is unbalanced and you can't decide whether FP or FN is more important.
F1 Score = (2 × Precision × Recall) / (Precision + Recall)
F1 Use Case — Credit Card Fraud Detection:
• Dataset is highly unbalanced (most transactions are legit).
• FN (missing fraud) is very costly — account holder loses money.
• FP (flagging legit as fraud) is costly too — customer annoyed, bank blocked.
• Use F1 Score to balance both.

3.11 Choosing the Right Metric — Quick Reference

Metric | Formula | Use When…
Accuracy | (TP + TN) / (TP + TN + FP + FN) | Dataset is balanced and all errors are equally important.
Precision | TP / (TP + FP) | Cost of False Positives is high (e.g., satellite launch, spam blocking important mail).
Recall | TP / (TP + FN) | Cost of False Negatives is high (e.g., cancer diagnosis, COVID detection).
F1 Score | 2·P·R / (P + R) | Dataset is unbalanced AND both FP and FN matter (e.g., fraud detection).
🔹 Practice Cases — Precision or Recall?
Scenario | Use
Email spam detection — a legitimate mail marked as spam is a disaster for the user. | Precision
Cancer diagnosis — missing a cancer case is a disaster. | Recall
Legal cases — "innocent until proven guilty"; wrongly convicting an innocent is worse. | Precision
Fraud detection — missing a fraudulent transaction is worse. | Recall
Safe-content filtering (e.g., YouTube Kids) — unsafe content reaching a child is a disaster. | Recall

3.12 Worked Case Study — Spam Email Detection

Out of 1000 emails:
• True Positives (TP) = 150 spam emails correctly classified as spam.
• False Positives (FP) = 50 legitimate emails incorrectly marked as spam.
• True Negatives (TN) = 750 legit emails correctly classified as not spam.
• False Negatives (FN) = 50 spam emails incorrectly classified as not spam.
Accuracy = (150 + 750) / 1000 = 900 / 1000 = 0.90 (90 %)
Precision = 150 / (150 + 50) = 150 / 200 = 0.75 (75 %)
Recall = 150 / (150 + 50) = 150 / 200 = 0.75 (75 %)
F1 Score = 2 × 0.75 × 0.75 / (0.75 + 0.75) = 0.75 (75 %)
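All four calculations above follow directly from the confusion-matrix counts, and can be checked in a few lines of Python:

```python
# Spam-detection case study: all four metrics from the confusion-matrix counts
tp, fp, tn, fn = 150, 50, 750, 50        # counts from the 1000-email example

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)   # 0.9 0.75 0.75 0.75
```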
Learning Outcome 5: Ethical concerns around model evaluation

3.13 Ethical Concerns Around Model Evaluation

While evaluating an AI model, three ethical concerns must always be kept in mind:

⚖️
1. Bias
Evaluation data must represent all groups fairly — not just the majority. A model that has 95 % accuracy overall but only 60 % accuracy for a minority group is biased.
🔍
2. Transparency
Users should know how the model reached its decision. Evaluation metrics, datasets, and limitations should be publicly disclosed — so people can trust and verify.
🎯
3. Accountability / Accuracy
When the model fails, someone must be answerable — developer, company, or regulator. Accuracy claims must be honest; hidden errors (edge cases, biased samples) must be reported.


3.14 Improving an AI Model — The Iterative Loop

🔄 CONTINUOUS IMPROVEMENT LOOP
1. Train model → 2. Evaluate on test set → 3. Study metrics → 4. Improve data / model → 5. Re-train → 6. Re-evaluate → (repeat)

Quick Revision — Key Points to Remember

  • Model Evaluation = using metrics to understand model performance; like a report card.
  • Why evaluate: find strengths/weaknesses · build trust · estimate real-world performance · compare models.
  • Train-Test Split: divide dataset into training (70-80%) and testing (20-30%) subsets. Test on unseen data.
  • Overfitting = model memorises training data, fails on new data. Avoid by NEVER evaluating on training data.
  • Accuracy = total correct predictions / total predictions. Directly proportional to performance.
  • Error = difference between prediction and actual outcome. Goal: minimise error, maximise accuracy.
  • Accuracy limitation: misleading on unbalanced datasets.
  • Confusion Matrix = table with 4 cells: TP · TN · FP · FN.
  • TP = actual+predicted positive · TN = actual+predicted negative · FP = actual negative wrongly flagged positive · FN = actual positive missed.
  • Accuracy = (TP+TN) / (TP+TN+FP+FN).
  • Precision = TP / (TP+FP). Use when FP is costly (spam blocking important mail, satellite launch).
  • Recall (Sensitivity) = TP / (TP+FN). Use when FN is costly (cancer, COVID, fraud, content safety).
  • F1 Score = 2·P·R / (P+R). Use when dataset is unbalanced AND both FP and FN matter.
  • 3 Ethical concerns: Bias (fairness across groups) · Transparency (disclose how the model works) · Accountability/Accuracy (honest reporting + chain of responsibility).
  • Improvement loop: train → evaluate → study metrics → improve → retrain → re-evaluate.