Introduction — Why Evaluate?
So far we have learnt about four stages of the AI Project Cycle: Problem Scoping, Data Acquisition, Data Exploration and Modelling. While modelling we can build different types of models, but how do we check whether one is better than another? That is where Evaluation comes in.
3.1 Importance of Model Evaluation
📋 1. What is Evaluation?
- Evaluation uses different evaluation metrics to understand a model's performance.
- An AI model gets better with constructive feedback.
- You build → get feedback from metrics → improve → repeat — until you achieve a desirable accuracy.
🎯 2. Why Do We Need Model Evaluation?
- Find strengths — where the model is already good.
- Find weaknesses — where the model makes mistakes.
- Check suitability for the task at hand.
- Build trustworthy and reliable AI systems.
- Enable feedback loops — essential for continuous improvement.
- Estimate how the model will perform on new, unseen data.
- Compare multiple candidate models and pick the best one.
3.2 Splitting the Training Set Data for Evaluation — Train-Test Split
🔹 How Train-Test Split Works
- The dataset is divided into two subsets: a training set (typically 70–80 % of the data) and a test set (the remaining 20–30 %).
🔹 Need for Train-Test Split
- The train dataset makes the model learn.
- Input elements of the test dataset are given to the trained model. It makes predictions, and predicted values are compared with expected values.
- Objective: Estimate the performance of the ML model on NEW data — data not used to train it.
- This mimics real-world use — we fit on available data with known outputs, then predict new examples in future where we don't have the target values.
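The split described above can be sketched in plain Python. The helper below is hand-rolled for illustration (its name mirrors scikit-learn's widely used `train_test_split`, but this version has no dependencies), and the tiny X/y arrays are made-up examples:

```python
# A minimal, dependency-free sketch of a train-test split.
import random

def train_test_split(X, y, test_size=0.3, seed=42):
    """Shuffle the indices, then carve off a test_size fraction for testing."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)        # fixed seed → reproducible split
    n_test = int(len(X) * test_size)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

X = [[i] for i in range(10)]        # 10 example inputs
y = [2 * i for i in range(10)]      # their known outputs (targets)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
print(len(X_train), len(X_test))    # 7 3 — model trains on 7, is judged on 3
```

The key point is that the 3 held-out examples are never shown to the model during training, so its score on them estimates performance on genuinely new data.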
3.3 Accuracy and Error
• Setup: an item actually costs ₹500. Bob predicts ₹300; Billy predicts ₹550.
• Error for Bob = |500 − 300| = ₹200. Error for Billy = |500 − 550| = ₹50.
• Who is more accurate? Billy, since he is off by only ₹50 versus Bob's ₹200.
Accuracy: an evaluation metric that measures the proportion of predictions a model gets right. Higher accuracy = better performance. Accuracy ∝ Performance.
Error: the difference between a model's prediction and the actual outcome, i.e., an inaccurate or wrong prediction. It quantifies how often (and by how much) the model makes mistakes.
🔹 Accuracy & Error in a Medical Example
• Error: Model predicts "no disease" but patient actually has the disease → error.
• Accuracy: Model correctly predicts disease/no-disease for a period → 100% accuracy for that period.
🔹 Key Points on Accuracy & Error
- The goal is to minimise error and maximise accuracy.
- Real-world data is messy — even the best models make mistakes.
- Focusing solely on accuracy can be misleading. In medical diagnosis, a model with slightly lower accuracy but strong focus on not declaring a sick person as healthy might be preferable.
- Choosing the right metric depends on the specific task and its requirements.
3.4 Calculating Accuracy — House Price Example
Error Rate = Error / Actual
Accuracy = 1 − Error Rate
Accuracy % = Accuracy × 100%
| Predicted (USD) | Actual (USD) | Error | Error Rate | Accuracy | Accuracy % |
|---|---|---|---|---|---|
| 391 k | 402 k | 11 k | 11/402 = 0.027 | 0.973 | 97.3 % |
| 453 k | 488 k | 35 k | 0.072 | 0.928 | 92.8 % |
| 125 k | 97 k | 28 k | 0.289 | 0.711 | 71.1 % |
| 871 k | 907 k | 36 k | 0.040 | 0.960 | 96.0 % |
| 322 k | 425 k | 103 k | 0.242 | 0.758 | 75.8 % |
Mean Accuracy = (97.3 + 92.8 + 71.1 + 96.0 + 75.8) / 5 ≈ 86.6 % overall accuracy.
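The table's arithmetic can be checked with a short script (values in thousands of USD, as in the table above):

```python
# Reproducing the house-price table: per-row error, error rate,
# accuracy, and the mean accuracy across all five predictions.
predicted = [391, 453, 125, 871, 322]   # thousands of USD
actual    = [402, 488,  97, 907, 425]

accuracies = []
for p, a in zip(predicted, actual):
    error = abs(p - a)              # Error = |Predicted − Actual|
    error_rate = error / a          # Error Rate = Error / Actual
    accuracies.append(1 - error_rate)

mean_accuracy = sum(accuracies) / len(accuracies)
print(f"{mean_accuracy:.1%}")       # ≈ 86.6%
```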
3.5 Evaluation Metrics for Classification
🎯 1. What is Classification?
Classification is the task of predicting a discrete category or label (e.g., spam / not-spam, disease / no-disease) rather than a continuous numeric value.
🔹 Classification Use-Case Test
- House price prediction → not classification (continuous value → Regression).
- Credit-card fraud detection → classification (fraud / legit).
- Salary prediction → not classification (continuous value → Regression).
🔹 Popular Metrics for Classification
The most commonly used metrics are Accuracy, Precision, Recall and the F1 Score; each is covered in the sections below.
3.6 The Confusion Matrix
For a binary classifier that predicts 0 or 1:
- True Positive (TP): model correctly predicted positive.
- False Positive (FP): model wrongly predicted an actual negative as positive.
- False Negative (FN): model wrongly predicted an actual positive as negative.
- True Negative (TN): model correctly predicted negative.
🔹 TP / TN / FP / FN — Real-World Examples
| Term | Football World Cup | Disease Diagnosis |
|---|---|---|
| TP | Predicted France would win, and it won. | Predicted "yes" (has disease), and patient actually has it. |
| TN | Predicted Germany would not win, and it lost. | Predicted "no" (no disease), and patient actually doesn't have it. |
| FP | Predicted Germany would win, but it lost. | Predicted "yes" but patient doesn't have the disease. (False alarm) |
| FN | Predicted France would not win, but it won. | Predicted "no" but patient actually has the disease. (Missed case) |
🔨 Build a Confusion Matrix from Scratch
• Top-left: Both columns YES → TP count
• Top-right: Actual YES, Predicted NO → FN count
• Bottom-left: Actual NO, Predicted YES → FP count
• Bottom-right: Both columns NO → TN count
Example: 7 correct out of 10 → TP + TN = 7, FP + FN = 3.
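The four counts can be tallied directly from paired actual/predicted labels. The ten labels below are made up to match the 7-correct-out-of-10 example above:

```python
# Counting TP/FN/FP/TN from paired actual and predicted labels.
actual    = ["Y", "Y", "Y", "Y", "Y", "N", "N", "N", "N", "N"]
predicted = ["Y", "Y", "Y", "Y", "N", "N", "N", "N", "Y", "Y"]

tp = sum(1 for a, p in zip(actual, predicted) if a == "Y" and p == "Y")
fn = sum(1 for a, p in zip(actual, predicted) if a == "Y" and p == "N")
fp = sum(1 for a, p in zip(actual, predicted) if a == "N" and p == "Y")
tn = sum(1 for a, p in zip(actual, predicted) if a == "N" and p == "N")

print(tp, fn, fp, tn)   # 4 1 2 3 → TP + TN = 7 correct, FP + FN = 3 wrong
```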
3.7 Accuracy from Confusion Matrix
Accuracy = (TP + TN) / (TP + TN + FP + FN)
🔹 Can We Use Accuracy All The Time?
• Test data = 1000 points → 900 "Yes", 100 "No" (unbalanced).
• A faulty model predicts "Yes" for everything.
• TP = 900, TN = 0, FP = 100, FN = 0.
• Accuracy = (900 + 0) / 1000 = 0.90 = 90 %.
• Yet the model never predicts a single "No" correctly — it's useless for detecting negatives!
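The trap in numbers. Besides the inflated accuracy, it helps to compute how many actual "No" cases the faulty model catches, which is none:

```python
# The unbalanced-data trap: a faulty model that always answers "Yes"
# still scores 90 % accuracy on this 900/100 test set.
tp, tn, fp, fn = 900, 0, 100, 0             # counts from the scenario above

accuracy = (tp + tn) / (tp + tn + fp + fn)  # overall accuracy
no_detection_rate = tn / (tn + fp)          # fraction of actual "No"s caught

print(accuracy, no_detection_rate)          # 0.9 0.0
```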
3.8 Precision
Precision = TP / (TP + FP): the fraction of the model's positive predictions that are actually correct.
Example: Precision = 0.843 means that when our model predicts a patient has heart disease, it is correct around 84 % of the time.
🔹 When to Use Precision?
Use Precision with unbalanced datasets when False Positives are costly — the model must reduce FPs as much as possible.
Example (satellite launch): positive class = favorable weather day; negative class = non-favorable day.
• Missing a good weather day (low recall) is OK — we can wait.
• But predicting a BAD weather day as a GOOD day (False Positive) to launch the satellite can be disastrous!
→ So False Positives must be minimised → use Precision.
3.9 Recall
Recall = TP / (TP + FN): the fraction of actual positives the model correctly identifies.
Example: Recall = 0.86 means that out of all the patients who have heart disease, 86 % have been correctly identified.
🔹 When to Use Recall?
Use Recall with unbalanced datasets when False Negatives are costly — the model must reduce FNs as much as possible.
Example (COVID-19 testing): positive = COVID-19 affected; negative = non-affected.
• If a COVID-19 patient (Positive) is wrongly predicted as non-affected (Negative) — a False Negative — the patient gets no treatment and may infect many others.
→ So False Negatives must be minimised → use Recall.
3.10 F1 Score
F1 Score = 2 × Precision × Recall / (Precision + Recall), the harmonic mean of Precision and Recall.
Example (credit-card fraud detection):
• Dataset is highly unbalanced (most transactions are legit).
• FN (missing fraud) is very costly — account holder loses money.
• FP (flagging legit as fraud) is costly too — customer annoyed, bank blocked.
• Use F1 Score to balance both.
3.11 Choosing the Right Metric — Quick Reference
| Metric | Formula | Use When… |
|---|---|---|
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Dataset is balanced and all errors are equally important. |
| Precision | TP / (TP + FP) | Cost of False Positives is high (e.g., satellite launch, spam blocking important mail). |
| Recall | TP / (TP + FN) | Cost of False Negatives is high (e.g., cancer diagnosis, COVID detection). |
| F1 Score | 2·P·R / (P + R) | Dataset is unbalanced AND both FP and FN matter (e.g., fraud detection). |
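The four formulas in the table can be wrapped as tiny helper functions. Applying them to the always-"Yes" model from section 3.7 (TP = 900, TN = 0, FP = 100, FN = 0) shows that precision and F1 can also look deceptively high when the positive class dominates:

```python
# The four evaluation formulas from the quick-reference table.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(p, r):
    return 2 * p * r / (p + r)

# Quick check with the faulty always-"Yes" model from section 3.7:
p = precision(900, 100)             # 0.9
r = recall(900, 0)                  # 1.0 (it never misses a positive)
print(round(f1_score(p, r), 3))     # 0.947 — still inflated, so the choice
                                    # of positive class and metric matters
```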
🔹 Practice Cases — Precision or Recall?
| Scenario | Use |
|---|---|
| Email spam detection — a legitimate mail marked as spam is a disaster for the user. | Precision |
| Cancer diagnosis — missing a cancer case is a disaster. | Recall |
| Legal cases — "innocent until proven guilty"; wrongly convicting an innocent is worse. | Precision |
| Fraud detection — missing a fraudulent transaction is worse. | Recall |
| Safe-content filtering (e.g., YouTube Kids) — unsafe content reaching a child is a disaster. | Recall |
3.12 Worked Case Study — Spam Email Detection
• True Positives (TP) = 150 spam emails correctly classified as spam.
• False Positives (FP) = 50 legitimate emails incorrectly marked as spam.
• True Negatives (TN) = 750 legit emails correctly classified as not spam.
• False Negatives (FN) = 50 spam emails incorrectly classified as not spam.
Precision = 150 / (150 + 50) = 150 / 200 = 0.75 (75 %)
Recall = 150 / (150 + 50) = 150 / 200 = 0.75 (75 %)
F1 Score = 2 × 0.75 × 0.75 / (0.75 + 0.75) = 0.75 (75 %)
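The case-study numbers above can be verified in a few lines (accuracy is included as an extra check, using the formula from section 3.7):

```python
# Verifying the spam-detection case study.
tp, fp, tn, fn = 150, 50, 750, 50             # counts from the case study

precision = tp / (tp + fp)                    # 150/200 = 0.75
recall    = tp / (tp + fn)                    # 150/200 = 0.75
f1        = 2 * precision * recall / (precision + recall)
accuracy  = (tp + tn) / (tp + tn + fp + fn)   # (150+750)/1000 = 0.90

print(precision, recall, f1, accuracy)        # 0.75 0.75 0.75 0.9
```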
3.13 Ethical Concerns Around Model Evaluation
While evaluating an AI model, three ethical concerns must always be kept in mind:
1. Bias
Evaluation data must represent all groups fairly — not just the majority. A model that has 95 % accuracy overall but only 60 % accuracy for a minority group is biased.
2. Transparency
Users should know how the model reached its decision. Evaluation metrics, datasets, and limitations should be publicly disclosed — so people can trust and verify.
3. Accountability / Accuracy
When the model fails, someone must be answerable — developer, company, or regulator. Accuracy claims must be honest; hidden errors (edge cases, biased samples) must be reported.
⚖️ 1. Bias — Detail
- Bias can creep in from training data, not just the algorithm.
- Evaluate the model separately on each subgroup (gender, region, age, race).
- If a model performs well on one group but poorly on another, it is biased.
- Example: the US hospital AI we saw in Unit 1 was biased against Western-region patients because it used healthcare expense as a proxy for illness.
🔍 2. Transparency — Detail
- Document the training data used, its sources and any limitations.
- Share the evaluation metrics openly — accuracy, precision, recall, F1 for each subgroup.
- Explain the model's logic — why did it predict this?
- Allow auditing by third-party experts.
🎯 3. Accountability & Honest Accuracy — Detail
- Be honest about accuracy — don't quote only the best-case number.
- Have a clear chain of responsibility — who is answerable if the model causes harm?
- Provide a feedback and appeal mechanism for users affected by wrong decisions.
- Have a plan to retrain or withdraw the model if serious flaws emerge.
3.14 Improving an AI Model — The Iterative Loop
🔹 Ways to Improve a Weak Model
- More data — especially for underrepresented groups.
- Better features — remove noise, add informative columns.
- Tune hyperparameters — adjust the model's internal settings.
- Try a different algorithm — maybe a simpler or more powerful one works.
- Fix labelling errors — bad labels lead to bad models.
- Balance the dataset — oversample minority classes or under-sample majority.
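The last point, balancing by oversampling, can be sketched in a few lines. Random duplication is the simplest approach (libraries such as imbalanced-learn offer smarter techniques like SMOTE); the 9-to-1 dataset below is made up for illustration:

```python
# Balancing a 9:1 dataset by randomly duplicating the minority class.
import random

data = [("x%d" % i, "Yes") for i in range(9)] + [("x9", "No")]  # 9 Yes, 1 No

majority = [row for row in data if row[1] == "Yes"]
minority = [row for row in data if row[1] == "No"]

rng = random.Random(0)
# Duplicate random minority rows until both classes are the same size.
oversampled = minority + [rng.choice(minority)
                          for _ in range(len(majority) - len(minority))]
balanced = majority + oversampled

n_no  = sum(1 for _, label in balanced if label == "No")
n_yes = sum(1 for _, label in balanced if label == "Yes")
print(n_yes, n_no)   # 9 9 — classes are now balanced
```

Note that oversampling is applied only to the training set; the test set must stay untouched so evaluation still reflects the real class balance.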
Quick Revision — Key Points to Remember
- Model Evaluation = using metrics to understand model performance; like a report card.
- Why evaluate: find strengths/weaknesses · build trust · estimate real-world performance · compare models.
- Train-Test Split: divide dataset into training (70-80%) and testing (20-30%) subsets. Test on unseen data.
- Overfitting = model memorises training data, fails on new data. Avoid by NEVER evaluating on training data.
- Accuracy = total correct predictions / total predictions. Directly proportional to performance.
- Error = difference between prediction and actual outcome. Goal: minimise error, maximise accuracy.
- Accuracy limitation: misleading on unbalanced datasets.
- Confusion Matrix = table with 4 cells: TP · TN · FP · FN.
- TP = actual+predicted positive · TN = actual+predicted negative · FP = actual negative wrongly flagged positive · FN = actual positive missed.
- Accuracy = (TP+TN) / (TP+TN+FP+FN).
- Precision = TP / (TP+FP). Use when FP is costly (spam blocking important mail, satellite launch).
- Recall (Sensitivity) = TP / (TP+FN). Use when FN is costly (cancer, COVID, fraud, content safety).
- F1 Score = 2·P·R / (P+R). Use when dataset is unbalanced AND both FP and FN matter.
- 3 Ethical concerns: Bias (fairness across groups) · Transparency (disclose how the model works) · Accountability/Accuracy (honest reporting + chain of responsibility).
- Improvement loop: train → evaluate → study metrics → improve → retrain → re-evaluate.