Introduction — Why Evaluate?
So far we have learnt about four stages of the AI Project Cycle: Problem Scoping, Data Acquisition, Data Exploration and Modelling. While modelling we can build different types of models, but how do we check whether one is better than another? That is where Evaluation comes in.
3.1 Importance of Model Evaluation
📋 1. What is Evaluation?
- Evaluation uses different evaluation metrics to understand a model's performance.
- An AI model gets better with constructive feedback.
- You build → get feedback from metrics → improve → repeat — until you achieve a desirable accuracy.
🎯 2. Why Do We Need Model Evaluation?
- Find strengths — where the model is already good.
- Find weaknesses — where the model makes mistakes.
- Check suitability for the task at hand.
- Build trustworthy and reliable AI systems.
- Enable feedback loops — essential for continuous improvement.
- Estimate how the model will perform on new, unseen data.
- Compare multiple candidate models and pick the best one.
3.2 Splitting the Training Set Data for Evaluation — Train-Test Split
🔹 How Train-Test Split Works
- The dataset is divided into two subsets: a training set (typically 70–80 % of the data) and a test set (the remaining 20–30 %).
🔹 Need for Train-Test Split
- The train dataset makes the model learn.
- Input elements of the test dataset are given to the trained model. It makes predictions, and predicted values are compared with expected values.
- Objective: Estimate the performance of the ML model on NEW data — data not used to train it.
- This mimics real-world use — we fit on available data with known outputs, then predict new examples in future where we don't have the target values.
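The split described above can be sketched in plain Python. The helper below is hand-rolled for illustration (its name mirrors scikit-learn's widely used `train_test_split`, but this version has no dependencies), and the tiny X/y arrays are made-up examples:

```python
# A minimal, dependency-free sketch of a train-test split.
import random

def train_test_split(X, y, test_size=0.3, seed=42):
    """Shuffle the indices, then carve off a test_size fraction for testing."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)        # fixed seed → reproducible split
    n_test = int(len(X) * test_size)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

X = [[i] for i in range(10)]        # 10 example inputs
y = [2 * i for i in range(10)]      # their known outputs (targets)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
print(len(X_train), len(X_test))    # 7 3 — model trains on 7, is judged on 3
```

The key point is that the 3 held-out examples are never shown to the model during training, so its score on them estimates performance on genuinely new data.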
3.3 Accuracy and Error
• Setup: an item actually costs ₹500. Bob predicts ₹300; Billy predicts ₹550.
• Error for Bob = |500 − 300| = ₹200. Error for Billy = |500 − 550| = ₹50.
• Who is more accurate? Billy, since he is off by only ₹50 versus Bob's ₹200.
Accuracy: an evaluation metric that measures the proportion of predictions a model gets right. Higher accuracy = better performance. Accuracy ∝ Performance.
Error: the difference between a model's prediction and the actual outcome, i.e., an inaccurate or wrong prediction. It quantifies how often (and by how much) the model makes mistakes.
🔹 Accuracy & Error in a Medical Example
• Error: Model predicts "no disease" but patient actually has the disease → error.
• Accuracy: Model correctly predicts disease/no-disease for a period → 100% accuracy for that period.
🔹 Key Points on Accuracy & Error
- The goal is to minimise error and maximise accuracy.
- Real-world data is messy — even the best models make mistakes.
- Focusing solely on accuracy can be misleading. In medical diagnosis, a model with slightly lower accuracy but strong focus on not declaring a sick person as healthy might be preferable.
- Choosing the right metric depends on the specific task and its requirements.
3.4 Calculating Accuracy — House Price Example
Error Rate = Error / Actual
Accuracy = 1 − Error Rate
Accuracy % = Accuracy × 100%
| Predicted (USD) | Actual (USD) | Error | Error Rate | Accuracy | Accuracy % |
|---|---|---|---|---|---|
| 391 k | 402 k | 11 k | 11/402 = 0.027 | 0.973 | 97.3 % |
| 453 k | 488 k | 35 k | 0.072 | 0.928 | 92.8 % |
| 125 k | 97 k | 28 k | 0.289 | 0.711 | 71.1 % |
| 871 k | 907 k | 36 k | 0.040 | 0.960 | 96.0 % |
| 322 k | 425 k | 103 k | 0.242 | 0.758 | 75.8 % |
Mean Accuracy = (97.3 + 92.8 + 71.1 + 96.0 + 75.8) / 5 ≈ 86.6 % overall accuracy.
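The table's arithmetic can be checked with a short script (values in thousands of USD, as in the table above):

```python
# Reproducing the house-price table: per-row error, error rate,
# accuracy, and the mean accuracy across all five predictions.
predicted = [391, 453, 125, 871, 322]   # thousands of USD
actual    = [402, 488,  97, 907, 425]

accuracies = []
for p, a in zip(predicted, actual):
    error = abs(p - a)              # Error = |Predicted − Actual|
    error_rate = error / a          # Error Rate = Error / Actual
    accuracies.append(1 - error_rate)

mean_accuracy = sum(accuracies) / len(accuracies)
print(f"{mean_accuracy:.1%}")       # ≈ 86.6%
```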
3.5 Evaluation Metrics for Classification
🎯 1. What is Classification?
Classification is the task of predicting a discrete category or label (e.g., spam / not-spam, disease / no-disease) rather than a continuous numeric value.
🔹 Classification Use-Case Test
- House price prediction → not classification (continuous value → Regression).
- Credit-card fraud detection → classification (fraud / legit).
- Salary prediction → not classification (continuous value → Regression).
🔹 Popular Metrics for Classification
The most commonly used metrics are Accuracy, Precision, Recall and the F1 Score; each is covered in the sections below.
3.6 The Confusion Matrix
For a binary classifier that predicts 0 or 1:
- True Positive (TP): model correctly predicted positive.
- False Positive (FP): model wrongly predicted an actual negative as positive.
- False Negative (FN): model wrongly predicted an actual positive as negative.
- True Negative (TN): model correctly predicted negative.
🔹 TP / TN / FP / FN — Real-World Examples
| Term | Football World Cup | Disease Diagnosis |
|---|---|---|
| TP | Predicted France would win, and it won. | Predicted "yes" (has disease), and patient actually has it. |
| TN | Predicted Germany would not win, and it lost. | Predicted "no" (no disease), and patient actually doesn't have it. |
| FP | Predicted Germany would win, but it lost. | Predicted "yes" but patient doesn't have the disease. (False alarm) |
| FN | Predicted France would not win, but it won. | Predicted "no" but patient actually has the disease. (Missed case) |
🔨 Build a Confusion Matrix from Scratch
• Top-left: Both columns YES → TP count
• Top-right: Actual YES, Predicted NO → FN count
• Bottom-left: Actual NO, Predicted YES → FP count
• Bottom-right: Both columns NO → TN count
Example: 7 correct out of 10 → TP + TN = 7, FP + FN = 3.
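The four counts can be tallied directly from paired actual/predicted labels. The ten labels below are made up to match the 7-correct-out-of-10 example above:

```python
# Counting TP/FN/FP/TN from paired actual and predicted labels.
actual    = ["Y", "Y", "Y", "Y", "Y", "N", "N", "N", "N", "N"]
predicted = ["Y", "Y", "Y", "Y", "N", "N", "N", "N", "Y", "Y"]

tp = sum(1 for a, p in zip(actual, predicted) if a == "Y" and p == "Y")
fn = sum(1 for a, p in zip(actual, predicted) if a == "Y" and p == "N")
fp = sum(1 for a, p in zip(actual, predicted) if a == "N" and p == "Y")
tn = sum(1 for a, p in zip(actual, predicted) if a == "N" and p == "N")

print(tp, fn, fp, tn)   # 4 1 2 3 → TP + TN = 7 correct, FP + FN = 3 wrong
```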
3.7 Accuracy from Confusion Matrix
Accuracy = (TP + TN) / (TP + TN + FP + FN)
🔹 Can We Use Accuracy All The Time?
• Test data = 1000 points → 900 "Yes", 100 "No" (unbalanced).
• A faulty model predicts "Yes" for everything.
• TP = 900, TN = 0, FP = 100, FN = 0.
• Accuracy = (900 + 0) / 1000 = 0.90 = 90 %.
• Yet the model never predicts a single "No" correctly — it's useless for detecting negatives!
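The trap in numbers. Besides the inflated accuracy, it helps to compute how many actual "No" cases the faulty model catches, which is none:

```python
# The unbalanced-data trap: a faulty model that always answers "Yes"
# still scores 90 % accuracy on this 900/100 test set.
tp, tn, fp, fn = 900, 0, 100, 0             # counts from the scenario above

accuracy = (tp + tn) / (tp + tn + fp + fn)  # overall accuracy
no_detection_rate = tn / (tn + fp)          # fraction of actual "No"s caught

print(accuracy, no_detection_rate)          # 0.9 0.0
```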
3.8 Precision
Precision = TP / (TP + FP): the fraction of the model's positive predictions that are actually correct.
Example: Precision = 0.843 means that when our model predicts a patient has heart disease, it is correct around 84 % of the time.
🔹 When to Use Precision?
Use Precision with unbalanced datasets when False Positives are costly — the model must reduce FPs as much as possible.
Example (satellite launch): positive class = favorable weather day; negative class = non-favorable day.
• Missing a good weather day (low recall) is OK — we can wait.
• But predicting a BAD weather day as a GOOD day (False Positive) to launch the satellite can be disastrous!
→ So False Positives must be minimised → use Precision.
3.9 Recall
Recall = TP / (TP + FN): the fraction of actual positives the model correctly identifies.
Example: Recall = 0.86 means that out of all the patients who have heart disease, 86 % have been correctly identified.
🔹 When to Use Recall?
Use Recall with unbalanced datasets when False Negatives are costly — the model must reduce FNs as much as possible.
Example (COVID-19 testing): positive = COVID-19 affected; negative = non-affected.
• If a COVID-19 patient (Positive) is wrongly predicted as non-affected (Negative) — a False Negative — the patient gets no treatment and may infect many others.
→ So False Negatives must be minimised → use Recall.
3.10 F1 Score
F1 Score = 2 × Precision × Recall / (Precision + Recall), the harmonic mean of Precision and Recall.
Example (credit-card fraud detection):
• Dataset is highly unbalanced (most transactions are legit).
• FN (missing fraud) is very costly — account holder loses money.
• FP (flagging legit as fraud) is costly too — customer annoyed, bank blocked.
• Use F1 Score to balance both.
3.11 Choosing the Right Metric — Quick Reference
| Metric | Formula | Use When… |
|---|---|---|
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Dataset is balanced and all errors are equally important. |
| Precision | TP / (TP + FP) | Cost of False Positives is high (e.g., satellite launch, spam blocking important mail). |
| Recall | TP / (TP + FN) | Cost of False Negatives is high (e.g., cancer diagnosis, COVID detection). |
| F1 Score | 2·P·R / (P + R) | Dataset is unbalanced AND both FP and FN matter (e.g., fraud detection). |
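The four formulas in the table can be wrapped as tiny helper functions. Applying them to the always-"Yes" model from section 3.7 (TP = 900, TN = 0, FP = 100, FN = 0) shows that precision and F1 can also look deceptively high when the positive class dominates:

```python
# The four evaluation formulas from the quick-reference table.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(p, r):
    return 2 * p * r / (p + r)

# Quick check with the faulty always-"Yes" model from section 3.7:
p = precision(900, 100)             # 0.9
r = recall(900, 0)                  # 1.0 (it never misses a positive)
print(round(f1_score(p, r), 3))     # 0.947 — still inflated, so the choice
                                    # of positive class and metric matters
```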
🔹 Practice Cases — Precision or Recall?
| Scenario | Use |
|---|---|
| Email spam detection — a legitimate mail marked as spam is a disaster for the user. | Precision |
| Cancer diagnosis — missing a cancer case is a disaster. | Recall |
| Legal cases — "innocent until proven guilty"; wrongly convicting an innocent is worse. | Precision |
| Fraud detection — missing a fraudulent transaction is worse. | Recall |
| Safe-content filtering (e.g., YouTube Kids) — unsafe content reaching a child is a disaster. | Recall |
3.12 Worked Case Study — Spam Email Detection
• True Positives (TP) = 150 spam emails correctly classified as spam.
• False Positives (FP) = 50 legitimate emails incorrectly marked as spam.
• True Negatives (TN) = 750 legit emails correctly classified as not spam.
• False Negatives (FN) = 50 spam emails incorrectly classified as not spam.
Precision = 150 / (150 + 50) = 150 / 200 = 0.75 (75 %)
Recall = 150 / (150 + 50) = 150 / 200 = 0.75 (75 %)
F1 Score = 2 × 0.75 × 0.75 / (0.75 + 0.75) = 0.75 (75 %)
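The case-study numbers above can be verified in a few lines (accuracy is included as an extra check, using the formula from section 3.7):

```python
# Verifying the spam-detection case study.
tp, fp, tn, fn = 150, 50, 750, 50             # counts from the case study

precision = tp / (tp + fp)                    # 150/200 = 0.75
recall    = tp / (tp + fn)                    # 150/200 = 0.75
f1        = 2 * precision * recall / (precision + recall)
accuracy  = (tp + tn) / (tp + tn + fp + fn)   # (150+750)/1000 = 0.90

print(precision, recall, f1, accuracy)        # 0.75 0.75 0.75 0.9
```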
3.13 Ethical Concerns Around Model Evaluation
While evaluating an AI model, three ethical concerns must always be kept in mind:
1. Bias
Evaluation data must represent all groups fairly — not just the majority. A model that has 95 % accuracy overall but only 60 % accuracy for a minority group is biased.
2. Transparency
Users should know how the model reached its decision. Evaluation metrics, datasets, and limitations should be publicly disclosed — so people can trust and verify.
3. Accountability / Accuracy
When the model fails, someone must be answerable — developer, company, or regulator. Accuracy claims must be honest; hidden errors (edge cases, biased samples) must be reported.
⚖️ 1. Bias — Detail
- Bias can creep in from training data, not just the algorithm.
- Evaluate the model separately on each subgroup (gender, region, age, race).
- If a model performs well on one group but poorly on another, it is biased.
- Example: the US hospital AI we saw in Unit 1 was biased against Western-region patients because it used healthcare expense as a proxy for illness.
🔍 2. Transparency — Detail
- Document the training data used, its sources and any limitations.
- Share the evaluation metrics openly — accuracy, precision, recall, F1 for each subgroup.
- Explain the model's logic — why did it predict this?
- Allow auditing by third-party experts.
🎯 3. Accountability & Honest Accuracy — Detail
- Be honest about accuracy — don't quote only the best-case number.
- Have a clear chain of responsibility — who is answerable if the model causes harm?
- Provide a feedback and appeal mechanism for users affected by wrong decisions.
- Have a plan to retrain or withdraw the model if serious flaws emerge.
3.14 Improving an AI Model — The Iterative Loop
🔹 Ways to Improve a Weak Model
- More data — especially for underrepresented groups.
- Better features — remove noise, add informative columns.
- Tune hyperparameters — adjust the model's internal settings.
- Try a different algorithm — maybe a simpler or more powerful one works.
- Fix labelling errors — bad labels lead to bad models.
- Balance the dataset — oversample minority classes or under-sample majority.
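The last point, balancing by oversampling, can be sketched in a few lines. Random duplication is the simplest approach (libraries such as imbalanced-learn offer smarter techniques like SMOTE); the 9-to-1 dataset below is made up for illustration:

```python
# Balancing a 9:1 dataset by randomly duplicating the minority class.
import random

data = [("x%d" % i, "Yes") for i in range(9)] + [("x9", "No")]  # 9 Yes, 1 No

majority = [row for row in data if row[1] == "Yes"]
minority = [row for row in data if row[1] == "No"]

rng = random.Random(0)
# Duplicate random minority rows until both classes are the same size.
oversampled = minority + [rng.choice(minority)
                          for _ in range(len(majority) - len(minority))]
balanced = majority + oversampled

n_no  = sum(1 for _, label in balanced if label == "No")
n_yes = sum(1 for _, label in balanced if label == "Yes")
print(n_yes, n_no)   # 9 9 — classes are now balanced
```

Note that oversampling is applied only to the training set; the test set must stay untouched so evaluation still reflects the real class balance.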
Quick Revision — Key Points to Remember
- Model Evaluation = using metrics to understand model performance; like a report card.
- Why evaluate: find strengths/weaknesses · build trust · estimate real-world performance · compare models.
- Train-Test Split: divide dataset into training (70-80%) and testing (20-30%) subsets. Test on unseen data.
- Overfitting = model memorises training data, fails on new data. Avoid by NEVER evaluating on training data.
- Accuracy = total correct predictions / total predictions. Directly proportional to performance.
- Error = difference between prediction and actual outcome. Goal: minimise error, maximise accuracy.
- Accuracy limitation: misleading on unbalanced datasets.
- Confusion Matrix = table with 4 cells: TP · TN · FP · FN.
- TP = actual+predicted positive · TN = actual+predicted negative · FP = actual negative wrongly flagged positive · FN = actual positive missed.
- Accuracy = (TP+TN) / (TP+TN+FP+FN).
- Precision = TP / (TP+FP). Use when FP is costly (spam blocking important mail, satellite launch).
- Recall (Sensitivity) = TP / (TP+FN). Use when FN is costly (cancer, COVID, fraud, content safety).
- F1 Score = 2·P·R / (P+R). Use when dataset is unbalanced AND both FP and FN matter.
- 3 Ethical concerns: Bias (fairness across groups) · Transparency (disclose how the model works) · Accountability/Accuracy (honest reporting + chain of responsibility).
- Improvement loop: train → evaluate → study metrics → improve → retrain → re-evaluate.