PART B ▪ UNIT 5
Data Literacy — Data Collection to Data Analysis
Collection · Levels of Measurement · Statistics · Visualization · Matrices · Pre-processing
AI is essentially data-driven — it converts large amounts of raw data into actionable information. Data literacy means being able to find and use data effectively: collecting it, organising it, checking its quality, analysing it, understanding the results, and using it ethically. This unit covers the full pipeline — from collecting data and exploring its levels of measurement, through statistical analysis and visualisation with matplotlib, to matrices, data pre-processing, and finally modelling & evaluation.
Learning Outcomes: Explain the importance of data literacy · Identify data collection methods · Understand matrices and operations · Apply basic data analysis · Visualise data using different techniques

1.1 What is Data Literacy?

Data is a representation of facts or instructions about some entity (students, a school, a business, animals) that can be processed or communicated by humans or machines. Data literacy means being able to find and use data effectively — this includes collecting, organising, checking quality, analysing, interpreting and using it ethically.
AI is data-driven. Data may be structured, semi-structured, or unstructured — it must be collected, organised and analysed properly to know whether the input for AI models is valid and appropriate.

1.2 Data Collection

Data collection means pooling data by scraping, capturing or loading it from multiple offline and online sources.

Two Sources of Data

🆕 Primary Sources
Data created newly for the analysis. The researcher collects it first-hand.
📚 Secondary Sources
Data already stored and ready for reuse — books, journals, newspapers, websites, databases.

Primary Data Collection Methods

Method | Description | Example
📋 Survey | Gather data via interviews, questionnaires or online forms. Measures opinions, behaviours, demographics. | A researcher uses a questionnaire to learn consumer preferences for a new product.
🎙️ Interview | Direct communication with individuals or groups. Can be structured, semi-structured, or unstructured. | Face-to-face conversations with employees about their job satisfaction.
👁️ Observation | Watch and record behaviours or events as they occur. Used when direct interaction isn't possible. | Observing children's play patterns in a schoolyard to understand social dynamics.
🧪 Experiment | Manipulate variables to observe effects on outcomes. Establishes cause-and-effect relationships. | Testing the effectiveness of different ad campaigns on a group of people.
📢 Marketing Campaign | Use customer data to predict behaviour and optimise campaign performance. | Personalised email based on past customer purchases.
📝 Questionnaire | A specific survey tool — a list of questions. Can collect quantitative or qualitative data. | Rate satisfaction 1–5, plus open-ended feedback.

Secondary Data Collection Methods

Method | Description | Example
💬 Social Media Tracking | Collect user posts, comments, interactions. | Analysing sentiment around a product launch.
🕸️ Web Scraping | Automated tools extract content and data from websites. | Scraping product prices from e-commerce sites for comparison.
🛰️ Satellite Data | Gather information about Earth's surface / atmosphere via satellites. | Monitoring weather patterns and environmental changes.
☁️ Online Data Platforms | Websites offering pre-compiled datasets. | Kaggle · GitHub · data.gov.in.

1.3 Exploring Data — Levels of Measurement

The way a set of data is measured is called its level of measurement. Not all data can be treated equally — data is classified as Qualitative (Nominal, Ordinal) or Quantitative (Interval, Ratio), and may be discrete or continuous.
1️⃣ Nominal
Categories / names only. No order, no calculation.
Examples: eye colour, yes/no responses, gender, smartphone brands, player jersey numbers.
2️⃣ Ordinal
Ordered categories — but differences cannot be measured. Still qualitative.
Examples: ratings (unpalatable → delicious), grades (A/B/C/D), survey "excellent / good / satisfactory / unsatisfactory".
3️⃣ Interval
Ordered with measurable differences — but no true zero. Ratios are meaningless.
Examples: temperature in °C or °F (0° doesn't mean absence of temperature — –20 °F, –30 °C exist).
4️⃣ Ratio
Like interval but has a true zero point. Ratios are meaningful; all arithmetic operations allowed.
Examples: weight, height, exam score (0-100), number of books.
Practical (Syllabus): Given a variable, identify its level of measurement. E.g., "Opinion about a new law (favour/oppose)" = Nominal; "Letter grade A/B/C" = Ordinal; "Teacher rating 1-10" = Interval.
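For the practical above, the classification can be sketched as a small lookup table (the variable names here are illustrative examples, not an exhaustive list):

```python
# Illustrative mapping of example variables to their level of measurement.
LEVELS = {
    "eye colour": "Nominal",
    "opinion about a new law (favour/oppose)": "Nominal",
    "letter grade (A/B/C/D)": "Ordinal",
    "temperature in Celsius": "Interval",
    "exam score (0-100)": "Ratio",
}

def level_of(variable):
    """Return the level of measurement for a known example variable."""
    return LEVELS.get(variable, "Unknown")

print(level_of("letter grade (A/B/C/D)"))   # Ordinal
print(level_of("exam score (0-100)"))       # Ratio
```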

1.4 Statistical Analysis — Measures of Central Tendency

Central Tendency summarises a dataset in a single value that represents the whole distribution. The three common measures are Mean, Median, Mode. In Python, use the statistics library.

Python statistics Library — Quick Reference

import statistics

statistics.mean(data)        # arithmetic mean
statistics.median(data)      # median (middle value)
statistics.mode(data)        # most frequent value
statistics.variance(data)    # variance
statistics.stdev(data)       # standard deviation

📏 Mean (Arithmetic Average)

M = Σ fx / n

where M = mean, Σ = sum, f = frequency (1 for each score in ungrouped data), x = score, n = total number of cases.

Example: S = {5, 10, 15, 20, 30}
Mean = (5 + 10 + 15 + 20 + 30) / 5 = 80 / 5 = 16

📐 Median (Middle Value)

Arrange data in ascending / descending order — the middle value is the median. For 13 values (odd), the 7th is the median.

Marks: 17, 32, 35, 15, 21, 41, 32, 11, 10, 20, 27, 28, 30
Sorted: 10, 11, 15, 17, 20, 21, 27, 28, 30, 32, 32, 35, 41 → Median = 27

🏔️ Mode (Most Frequent Value)

Ages: 17, 17, 18, 18, 19, 20, 20, 21, 21, 22, 22, 22, 23, 24 → Mode = 22

When to Use Mean / Median / Mode?

Measure | When to use
Mean | Data is evenly spread with no exceptionally high / low values.
Median | Data includes exceptionally high or low values (outliers); also for ordinal data.
Mode | Finding the distribution peak — e.g., most popular book, most common shoe size.
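The table's advice can be seen in a quick sketch: one exceptionally high value pulls the mean upward, while the median stays near the typical value (the figures are made up for illustration):

```python
import statistics

# Salaries in thousands, with one exceptionally high value (an outlier)
salaries = [30, 32, 35, 38, 40, 300]

print("Mean   =", statistics.mean(salaries))    # dragged up by the outlier
print("Median =", statistics.median(salaries))  # still reflects a typical salary
```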

Program 1 — Mean / Median / Mode of student heights

import statistics

heights = [145, 151, 152, 149, 147, 152, 151, 149,
           152, 151, 147, 148, 155, 147, 152, 151,
           149, 145, 147, 152, 146, 148, 150, 152, 151]

print("Mean   =", statistics.mean(heights))
print("Median =", statistics.median(heights))
print("Mode   =", statistics.mode(heights))

1.5 Statistical Analysis — Measures of Dispersion

Central tendency tells us the centre. Dispersion tells us how spread out the data is around that centre. The two key measures are Variance and Standard Deviation.

Variance = Σ (x − mean)² / n · Standard Deviation = √Variance
Dog heights (mm): 600, 470, 170, 430, 300
Mean = 1970 / 5 = 394 mm
Deviations: 206, 76, −224, 36, −94
Squared: 42436 + 5776 + 50176 + 1296 + 8836 = 108520
Variance = 108520 / 5 = 21704
Standard Deviation = √21704 = 147.32 mm
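The worked example can be verified in Python. Note that statistics.pvariance() and pstdev() divide by n, matching the formula above, while statistics.variance() and stdev() divide by n − 1 (sample statistics), so they give a slightly larger answer:

```python
import statistics

# Dog heights (mm) from the worked example above
heights = [600, 470, 170, 430, 300]

# Population statistics (divide by n) — match the formula in this section
print("Population variance =", statistics.pvariance(heights))        # 21704
print("Population std dev  =", round(statistics.pstdev(heights), 2)) # 147.32

# Sample statistics (divide by n - 1) — what variance()/stdev() compute
print("Sample variance     =", statistics.variance(heights))
```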

Interpreting Variance / Std Deviation

⬇️ Low Variance / Low SD
Data points are very close to the mean and to each other — tightly clustered.
⬆️ High Variance / High SD
Data points are very spread out from the mean — widely scattered.

Program 2 — Variance & Standard Deviation

import statistics

heights = [145, 151, 152, 149, 147, 152, 151, 149,
           152, 151, 147, 148, 155, 147, 152, 151,
           149, 145, 147, 152, 146, 148, 150, 152, 151]

print("Variance  =", statistics.variance(heights))
print("Std. Dev. =", statistics.stdev(heights))

1.6 Data Representation & Visualisation

Data representation is classified into two categories:

📋 Non-Graphical
Tabular form, case form. Traditional formats — not well suited to large datasets or to quick decision-making.
📊 Graphical (Data Visualisation)
Points, lines, dots, shapes — the human brain handles complex data better when visualised. The 5 core types are Line · Bar · Histogram · Scatter · Pie.

Matplotlib — Python's Plotting Library

pip install matplotlib
# or:
python -m pip install -U matplotlib

Import and use:

import matplotlib.pyplot as plt

Common pyplot Functions

Function | Description
title(text) | Adds a title to the chart.
xlabel(text) / ylabel(text) | Sets the X-axis / Y-axis labels.
xlim(a, b) / ylim(a, b) | Sets value limits on the axes.
xticks(list) / yticks(list) | Sets tick marks.
show() | Displays the graph on screen.
savefig("path") | Saves the graph to disk.
figure(figsize=(w, h)) | Sets the figure size in inches.
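A minimal sketch combining several of these functions in one chart (it uses the non-interactive Agg backend and a hypothetical output file demo_chart.png so it runs without a display window):

```python
import matplotlib
matplotlib.use("Agg")          # render to a file instead of opening a window
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 4))                 # figure size in inches
plt.plot([1, 2, 3, 4], [10, 20, 15, 25])   # simple line plot
plt.title("Demo chart")
plt.xlabel("X values")
plt.ylabel("Y values")
plt.xticks([1, 2, 3, 4])                   # explicit tick marks
plt.ylim(0, 30)                            # y-axis limits
plt.savefig("demo_chart.png")              # save instead of show()
```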

1.7 Five Chart Types

A. Line Graph — for continuous data / trends over time

Shows trends and changes. The line slopes up (increase) or down (decrease). Function: plt.plot().

import matplotlib.pyplot as plt

tests = ["T1", "T2", "T3", "T4", "T5"]
marks = [25, 34, 49, 40, 48]

plt.plot(tests, marks, marker="o", color="blue")
plt.title("Kavya's AI marks")
plt.xlabel("Test")
plt.ylabel("Marks")
plt.show()

B. Bar Graph — for comparison between categories

Rectangular bars with heights proportional to values. Function: plt.bar().

import matplotlib.pyplot as plt

schools = ["Oxford", "Delhi", "Jyothis", "Sanskriti", "Bombay"]
students = [123, 87, 105, 146, 34]

plt.bar(schools, students, color="green", edgecolor="black")
plt.title("Deep Learning Seminar Attendance")
plt.show()

C. Histogram — for frequency distribution

Vertical rectangles showing frequencies of value ranges (bins). A histogram displays the distribution of a single variable. Function: plt.hist().

import matplotlib.pyplot as plt

heights = [141, 145, 142, 147, 144, 148, 141,
           142, 149, 144, 143, 149, 146, 141,
           147, 142, 143]

plt.hist(heights, bins=5, color="skyblue", edgecolor="black")
plt.title("Height distribution — Class XII girls")
plt.show()

D. Scatter Graph — for relationships between two variables

Plots data points on x and y axes to reveal correlations, clusters, trends. Function: plt.scatter().

import matplotlib.pyplot as plt

study_hours = [2, 4, 5, 7, 8, 10]
math_score = [55, 65, 70, 80, 85, 95]

plt.scatter(study_hours, math_score, color="red")
plt.title("Study Hours vs Math Score")
plt.xlabel("Hours per week")
plt.ylabel("Math %")
plt.show()

E. Pie Chart — for proportions / composition

Circular graph divided into segments representing percentages. Limit to ~7 categories for clarity. Cannot show zero values or changes over time. Function: plt.pie().

import matplotlib.pyplot as plt

subjects = ["English", "Maths", "Science", "Social", "AI", "PE"]
periods = [6, 8, 8, 7, 3, 2]

plt.pie(periods, labels=subjects, autopct="%.1f%%")
plt.title("Weekly periods per subject")
plt.show()

🌧️ Practical — Using rainfall.csv

From syllabus: Use the rainfall.csv dataset (12-month rainfall data for Tamil Nadu) to plot a line graph, bar graph, histogram, scatter graph and pie chart.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("rainfall.csv")   # assumes columns named "Month" and "Rainfall"

# Line chart
plt.plot(df["Month"], df["Rainfall"], marker="o")
plt.title("Monthly Rainfall - Tamil Nadu")
plt.show()

# Bar chart
plt.bar(df["Month"], df["Rainfall"], color="teal")
plt.show()

# Histogram
plt.hist(df["Rainfall"], bins=5, edgecolor="black")
plt.show()

# Scatter chart
plt.scatter(df["Month"], df["Rainfall"], color="red")
plt.show()

# Pie chart
plt.pie(df["Rainfall"], labels=df["Month"], autopct="%.1f%%")
plt.show()

1.8 Introduction to Matrices

A matrix (plural matrices) is a rectangular arrangement of numbers in rows and columns. Matrices are powerful in mathematics and foundational in AI — especially in Computer Vision, where every digital image is stored as a matrix of pixel intensities.

Example — Purchases Table

Aditi bought 25 pencils, 5 erasers · Adit bought 10 pencils, 2 erasers · Manu bought 5 pencils, 1 eraser.

Tabular form → Matrix form:

    A = [ 25   5 ]
        [ 10   2 ]
        [  5   1 ]

— a 3×2 matrix (3 rows, 2 columns).

Order of a Matrix

A matrix with m rows and n columns has order m × n. Total elements = m × n. Each element is written as aᵢⱼ, where i = row number and j = column number.

Three Matrix Operations

➕ Addition
Add corresponding elements. Both matrices must have the same order.
A + B = [aᵢⱼ + bᵢⱼ]
➖ Subtraction
Subtract corresponding elements. Same order required.
A − B = [aᵢⱼ − bᵢⱼ]
🔄 Transpose (Aᵀ)
Interchange rows and columns. A 3×2 matrix becomes a 2×3 matrix.
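A minimal sketch of the three operations on plain nested lists, using the 3×2 purchases matrix from above (in real AI code, NumPy arrays would normally be used instead):

```python
# The purchases matrix from the example, plus an arbitrary matrix of the same order
A = [[25, 5],
     [10, 2],
     [ 5, 1]]
B = [[ 1, 1],
     [ 1, 1],
     [ 1, 1]]

def add(X, Y):
    # element-wise sum; X and Y must have the same order
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def subtract(X, Y):
    # element-wise difference; same order required
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def transpose(X):
    # rows become columns: a 3x2 matrix becomes 2x3
    return [list(col) for col in zip(*X)]

print(add(A, B))        # [[26, 6], [11, 3], [6, 2]]
print(subtract(A, B))   # [[24, 4], [9, 1], [4, 0]]
print(transpose(A))     # [[25, 10, 5], [5, 2, 1]]
```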

Applications of Matrices in AI

Application | How matrices are used
🖼️ Image Processing | Digital images are stored as matrices of pixel intensities (0–255 for grayscale; 0 = black, 255 = white). Colour images use 3 channels (R, G, B).
🎯 Recommender Systems | Relate users to the products they purchased / viewed — each cell records a user–product interaction.
🗣️ NLP | Vectors (1-D matrices) represent the distribution of words across documents.

1.9 Data Pre-processing

Data pre-processing cleans, transforms, reduces, integrates and normalises data — making it machine-learning-friendly. It is a crucial step before training.

Five Stages of Pre-processing

🧹 1. Data Cleaning — Handle missing values · outliers · inconsistent data · duplicates.
🔄 2. Data Transformation — Convert categorical → numerical. Create new features or modify existing ones.
📉 3. Data Reduction — Dimensionality reduction (fewer features). Sampling for large datasets.
🔗 4. Data Integration & Normalisation — Merge multiple sources. Scale all features to a similar range.
⭐ 5. Feature Selection — Keep only the features that contribute most to the target; drop irrelevant ones.

Data Cleaning — Four Common Issues

Issue | How to handle
❓ Missing Data | Delete rows / columns with missing values, or impute (fill with mean / median / mode), or use algorithms that tolerate missing data.
📍 Outliers | Identify (values far from the rest — errors or rare events), then remove, transform, or use robust statistics.
⚠️ Inconsistent Data | Fix typographical errors and mismatched data types. Standardise formats.
♊ Duplicate Data | Identify and remove duplicate records to ensure data integrity.
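A hedged sketch of these fixes with pandas, on a tiny hypothetical table (the column names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical raw data illustrating three of the issues above:
# a duplicate row, a missing score, and an impossible outlier (920)
raw = pd.DataFrame({
    "name":  ["Asha", "Ravi", "Ravi", "Meena", "Kiran"],
    "score": [78, None, None, 85, 920],
})

df = raw.drop_duplicates().copy()                        # remove duplicate rows
df["score"] = df["score"].fillna(df["score"].median())   # impute missing with median
df = df[df["score"] <= 100]                              # drop the out-of-range outlier
print(df)
```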

1.10 Data in Modelling & Evaluation

Train–Test Split

After pre-processing, the data is split into a training set (typically ~80%) used to fit the model, and a test set (~20%) held back to evaluate how well the model generalises to unseen data.

Validation Techniques

🎯 Train-Test Split
Train on training set, evaluate on a separate test set. Simple & fast.
🔁 Cross-Validation
Split the data into k folds, train/test across rotations to ensure performance is consistent across different subsets.
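Both ideas can be sketched without any libraries (in practice, scikit-learn's train_test_split and cross_val_score are typically used):

```python
import random

data = list(range(1, 101))      # 100 hypothetical samples
random.seed(42)
random.shuffle(data)            # always shuffle before splitting

# Train-test split: 80% train, 20% test
cut = int(0.8 * len(data))
train, test = data[:cut], data[cut:]
print(len(train), len(test))    # 80 20

# k-fold cross-validation: rotate which fold is held out
k = 5
folds = [data[i::k] for i in range(k)]
for i in range(k):
    held_out = folds[i]
    training = [x for j, f in enumerate(folds) if j != i for x in f]
    # ...train on `training`, evaluate on `held_out`...
```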

Evaluation Metrics

Problem type | Common metrics
📦 Classification | Accuracy · Precision · Recall · F1-Score · ROC curve.
📈 Regression | MSE (Mean Squared Error) · RMSE · MAE (Mean Absolute Error) · R² (R-Squared).
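These metrics can be computed by hand on hypothetical predictions to see exactly what each one measures:

```python
# Classification: count true/false positives/negatives (labels are made up)
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

accuracy  = (tp + tn) / len(y_true)                 # fraction correct overall
precision = tp / (tp + fp)                          # of predicted 1s, how many right
recall    = tp / (tp + fn)                          # of actual 1s, how many found
f1        = 2 * precision * recall / (precision + recall)

# Regression: average error between actual and predicted values (made up)
actual    = [3.0, 5.0, 2.0, 7.0]
predicted = [2.5, 5.0, 3.0, 6.5]
mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

print(accuracy, precision, recall, round(f1, 2), mse, mae)
```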

1.11 Practical Programs (Syllabus)

Program A — Mean / Median / Mode of ages

import statistics

ages = [25, 28, 30, 35, 40, 45, 50, 55, 60, 65]

print("Mean   =", statistics.mean(ages))
print("Median =", statistics.median(ages))
# Note: before Python 3.8, mode() raises StatisticsError when no value
# repeats; from 3.8 onward it returns the first value encountered.
print("Mode   =", statistics.mode(ages))

Program B — Variance / SD of temperatures

import statistics

temps = [20, 22, 25, 18, 23]
print("Variance  =", statistics.variance(temps))
print("Std. Dev. =", statistics.stdev(temps))

Program C — Line chart (weekly inquiries)

import matplotlib.pyplot as plt

weeks = ["Week 1", "Week 2", "Week 3", "Week 4"]
inquiries = [150, 170, 180, 200]

plt.plot(weeks, inquiries, marker="o", color="purple")
plt.title("Customer Inquiries Per Week")
plt.ylabel("Number of inquiries")
plt.show()

Program D — Bar chart (book sales by genre)

import matplotlib.pyplot as plt

genres = ["Fiction", "Mystery", "Sci-Fi", "Romance", "Biography"]
sales  = [120, 90, 80, 110, 70]

plt.bar(genres, sales, color="orange", edgecolor="black")
plt.title("Books Sold by Genre")
plt.show()

Program E — Pie chart (city transportation)

import matplotlib.pyplot as plt

modes = ["Car", "Public Transit", "Walking", "Bicycle"]
shares = [40, 30, 20, 10]

plt.pie(shares, labels=modes, autopct="%.1f%%")
plt.title("How people commute in the city")
plt.show()

1.12 Certification (Syllabus)

IBM SkillsBuild — Data Visualisation with Python (Modules 1, 2, 3).

Quick Revision — Key Points to Remember

  • Data literacy = ability to find and use data effectively (collect, organise, analyse, interpret, use ethically).
  • Data types: structured · semi-structured · unstructured.
  • Two sources of data: Primary (newly created) · Secondary (already stored).
  • 6 primary methods: Survey · Interview · Observation · Experiment · Marketing Campaign · Questionnaire.
  • 4 secondary methods: Social media tracking · Web scraping · Satellite data · Online data platforms (Kaggle, GitHub).
  • 4 levels of measurement: Nominal (names) · Ordinal (ordered) · Interval (ordered + measurable diffs, no true 0) · Ratio (true 0, all ops allowed).
  • Central tendency: Mean (even spread) · Median (skewed / ordinal) · Mode (peak).
  • Dispersion: Variance = Σ(x−mean)²/n · Std Dev = √variance. Low → clustered; High → spread.
  • 5 chart types: Line (trends) · Bar (comparison) · Histogram (frequency) · Scatter (relationships) · Pie (proportions).
  • Plotting library: matplotlib.pyplot → plot / bar / hist / scatter / pie.
  • Matrix = rectangular number arrangement in rows × columns. Order = m × n.
  • 3 matrix ops: Addition · Subtraction · Transpose (Aᵀ).
  • AI uses matrices for: image processing · recommender systems · NLP word vectors.
  • Pre-processing 5 stages: Cleaning · Transformation · Reduction · Integration/Normalisation · Feature Selection.
  • Cleaning 4 issues: Missing data · Outliers · Inconsistent data · Duplicates.
  • Modelling: Train/Test split (80/20) · Cross-validation.
  • Evaluation: Classification → Accuracy/Precision/Recall/F1/ROC · Regression → MSE/RMSE/MAE/R².
  • Certification: IBM SkillsBuild — Data Visualisation with Python (Modules 1-3).