1.1 What is Data Literacy?
1.2 Data Collection
Data collection means gathering data by scraping, capturing, or loading it from multiple offline and online sources. Key guidelines:
- Quantity depends on the number of features and model complexity. A license-plate detector needs less data; a medical AI needs huge volumes.
- Diversity matters more than raw quantity — diverse data helps the model cover more scenarios.
- Start with small batches, then scale up.
- Collection is iterative — as the project develops, data requirements may change.
Two Sources of Data
🆕 Primary Sources
Data created newly for the analysis. The researcher collects it first-hand.

📚 Secondary Sources

Data already stored and ready for reuse — books, journals, newspapers, websites, databases.

Primary Data Collection Methods
| Method | Description | Example |
|---|---|---|
| 📋 Survey | Gather data via interviews, questionnaires or online forms. Measures opinions, behaviours, demographics. | Researcher uses a questionnaire to learn consumer preferences for a new product. |
| 🎙️ Interview | Direct communication with individuals or groups. Can be structured, semi-structured, or unstructured. | One-on-one interviews with employees to gather feedback about job satisfaction. |
| 👁️ Observation | Watch and record behaviours or events as they occur. Used when direct interaction isn't possible. | Observing children's play patterns in a schoolyard to understand social dynamics. |
| 🧪 Experiment | Manipulate variables to observe effects on outcomes. Establishes cause-and-effect relationships. | Testing effectiveness of different ad campaigns on a group of people. |
| 📢 Marketing Campaign | Use customer data to predict behaviour and optimise campaign performance. | Personalised email based on past customer purchases. |
| 📝 Questionnaire | A specific survey tool — list of questions. Can collect quantitative or qualitative data. | Rate satisfaction 1-5, plus open-ended feedback. |
Secondary Data Collection Methods
| Method | Description | Example |
|---|---|---|
| 💬 Social Media Tracking | Collect user posts, comments, interactions. | Analysing sentiment around a product launch. |
| 🕸️ Web Scraping | Automated tools extract content and data from websites. | Scraping product prices from e-commerce sites for comparison. |
| 🛰️ Satellite Data | Gather info about Earth's surface / atmosphere via satellites. | Monitoring weather patterns and environmental changes. |
| ☁️ Online Data Platforms | Websites offering pre-compiled datasets. | Kaggle · GitHub · data.gov.in. |
1.3 Exploring Data — Levels of Measurement
1️⃣ Nominal
Categories / names only. No order, no calculation.

Examples: eye colour, yes/no responses, gender, smartphone brands, player jersey numbers.
2️⃣ Ordinal
Ordered categories — but differences cannot be measured. Still qualitative.

Examples: ratings (unpalatable → delicious), grades (A/B/C/D), survey "excellent / good / satisfactory / unsatisfactory".
3️⃣ Interval
Ordered with measurable differences — but no true zero. Ratios are meaningless.

Examples: temperature in °C or °F (0° doesn't mean absence of temperature — −20 °F and −30 °C exist).
4️⃣ Ratio
Like interval but has a true zero point. Ratios are meaningful; all arithmetic operations allowed.

Examples: weight, height, exam score (0-100), number of books.
1.4 Statistical Analysis — Measures of Central Tendency
Measures of central tendency — mean, median and mode — describe the centre of a dataset. In Python they can be computed with the built-in statistics library.
Python statistics Library — Quick Reference
```python
import statistics

statistics.mean(data)      # arithmetic mean
statistics.median(data)    # median (middle value)
statistics.mode(data)      # most frequent value
statistics.variance(data)  # variance
statistics.stdev(data)     # standard deviation
```
📏 Mean (Arithmetic Average)
M = Σfx / n, where M = mean, Σ = sum, f = frequency, x = score, n = total cases. For simple (ungrouped) data, the mean is the sum of the values divided by their count:
Mean = (5 + 10 + 15 + 20 + 30) / 5 = 80 / 5 = 16
📐 Median (Middle Value)
Arrange data in ascending / descending order — the middle value is the median. For 13 values (odd), the 7th is the median.
Sorted: 10, 11, 15, 17, 20, 21, 27, 28, 30, 32, 32, 35, 40 → Median = 27
🏔️ Mode (Most Frequent Value)
The mode is the value that occurs most often in a dataset. A dataset may have one mode, more than one, or none at all.
When to Use Mean / Median / Mode?
| Measure | When to use |
|---|---|
| Mean | Data is evenly spread with no exceptionally high / low values. |
| Median | Data includes exceptionally high or low values (outliers); also for ordinal data. |
| Mode | Finding the distribution peak — e.g., most popular book, most common shoe size. |
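The mean-vs-median guidance in the table can be seen with a small sketch — the salary figures below are made up, with one deliberate outlier:

```python
import statistics

# Hypothetical salaries (thousands); 1000 is an extreme outlier
salaries = [30, 32, 35, 38, 40, 1000]

print("Mean   =", statistics.mean(salaries))    # dragged upward by the outlier
print("Median =", statistics.median(salaries))  # 36.5 — robust to the outlier
```

The median (36.5) describes a typical salary far better than the mean here, which is why the median is preferred when data contains exceptionally high or low values.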
Program 1 — Mean / Median / Mode of student heights
```python
import statistics

heights = [145, 151, 152, 149, 147, 152, 151, 149, 152, 151, 147, 148, 155,
           147, 152, 151, 149, 145, 147, 152, 146, 148, 150, 152, 151]

print("Mean   =", statistics.mean(heights))
print("Median =", statistics.median(heights))
print("Mode   =", statistics.mode(heights))
```
1.5 Statistical Analysis — Measures of Dispersion
Central tendency tells us the centre. Dispersion tells us how spread out the data is around that centre. The two key measures are Variance and Standard Deviation.
Worked example — data values (mm): 600, 470, 170, 430, 300.

Mean = 1970 / 5 = 394 mm
Deviations from the mean: 206, 76, −224, 36, −94
Squared deviations: 42436 + 5776 + 50176 + 1296 + 8836 = 108520
Variance = 108520 / 5 = 21704
Standard Deviation = √21704 ≈ 147.32 mm
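The worked example can be checked with the statistics module; the five data values (600, 470, 170, 430, 300) are inferred from the stated mean and deviations. Note that the example divides by n, which is the population variance (pvariance), whereas statistics.variance divides by n − 1:

```python
import statistics

# Values inferred from the worked example above
data = [600, 470, 170, 430, 300]

print("Mean     =", statistics.mean(data))       # 394
print("Variance =", statistics.pvariance(data))  # population variance: 21704
print("Std Dev  =", round(statistics.pstdev(data), 2))  # ≈ 147.32
```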
Interpreting Variance / Std Deviation
⬇️ Low Variance / Low SD
Data points are very close to the mean and to each other — tightly clustered.

⬆️ High Variance / High SD

Data points are very spread out from the mean — widely scattered.

Program 2 — Variance & Standard Deviation
```python
import statistics

heights = [145, 151, 152, 149, 147, 152, 151, 149, 152, 151, 147, 148, 155,
           147, 152, 151, 149, 145, 147, 152, 146, 148, 150, 152, 151]

print("Variance =", statistics.variance(heights))
print("Std. Dev. =", statistics.stdev(heights))
```
1.6 Data Representation & Visualisation
Data representation is classified into two categories:
📋 Non-Graphical
Tabular form, case form. Old format — not suitable for large datasets or decision-making.

📊 Graphical (Data Visualisation)

Points, lines, dots, shapes — the human brain handles complex data better when visualised. The 5 core types are Line · Bar · Histogram · Scatter · Pie.

Matplotlib — Python's Plotting Library
```shell
pip install matplotlib   # or: python -m pip install -U matplotlib
```
Import and use:
```python
import matplotlib.pyplot as plt
```
Common pyplot Functions
| Function | Description |
|---|---|
| `title(text)` | Adds a title to the chart. |
| `xlabel(text)` / `ylabel(text)` | Sets the X-axis / Y-axis label. |
| `xlim(a, b)` / `ylim(a, b)` | Sets value limits on the axes. |
| `xticks(list)` / `yticks(list)` | Sets tick marks. |
| `show()` | Displays the graph on screen. |
| `savefig("path")` | Saves the graph to disk. |
| `figure(figsize=(w, h))` | Sets the figure size in inches. |
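The functions in the table can be combined in one short script. This sketch uses the Agg backend and the filename demo.png (both are choices for running without a display, not part of the syllabus):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 4))               # 6 x 4 inches
plt.plot([1, 2, 3, 4], [10, 20, 15, 30])
plt.title("Demo of common pyplot functions")
plt.xlabel("x values")
plt.ylabel("y values")
plt.xlim(0, 5)                           # axis limits
plt.xticks([1, 2, 3, 4])                 # tick marks
plt.savefig("demo.png")                  # write to disk instead of show()
```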
1.7 Five Chart Types
A. Line Graph — for continuous data / trends over time
Shows trends and changes. The line slopes up (increase) or down (decrease). Function: plt.plot().
```python
import matplotlib.pyplot as plt

tests = ["T1", "T2", "T3", "T4", "T5"]
marks = [25, 34, 49, 40, 48]

plt.plot(tests, marks, marker="o", color="blue")
plt.title("Kavya's AI marks")
plt.xlabel("Test")
plt.ylabel("Marks")
plt.show()
```
B. Bar Graph — for comparison between categories
Rectangular bars with heights proportional to values. Function: plt.bar().
```python
import matplotlib.pyplot as plt

schools = ["Oxford", "Delhi", "Jyothis", "Sanskriti", "Bombay"]
students = [123, 87, 105, 146, 34]

plt.bar(schools, students, color="green", edgecolor="black")
plt.title("Deep Learning Seminar Attendance")
plt.show()
```
C. Histogram — for frequency distribution
Vertical rectangles showing frequencies of value ranges (bins). One distribution per axis. Function: plt.hist().
```python
import matplotlib.pyplot as plt

heights = [141, 145, 142, 147, 144, 148, 141, 142, 149,
           144, 143, 149, 146, 141, 147, 142, 143]

plt.hist(heights, bins=5, color="skyblue", edgecolor="black")
plt.title("Height distribution — Class XII girls")
plt.show()
```
D. Scatter Graph — for relationships between two variables
Plots data points on x and y axes to reveal correlations, clusters, trends. Function: plt.scatter().
```python
import matplotlib.pyplot as plt

study_hours = [2, 4, 5, 7, 8, 10]
math_score = [55, 65, 70, 80, 85, 95]

plt.scatter(study_hours, math_score, color="red")
plt.title("Study Hours vs Math Score")
plt.xlabel("Hours per week")
plt.ylabel("Math %")
plt.show()
```
E. Pie Chart — for proportions / composition
Circular graph divided into segments representing percentages. Limit to ~7 categories for clarity. Cannot show zero values or changes over time. Function: plt.pie().
```python
import matplotlib.pyplot as plt

subjects = ["English", "Maths", "Science", "Social", "AI", "PE"]
periods = [6, 8, 8, 7, 3, 2]

plt.pie(periods, labels=subjects, autopct="%.1f%%")
plt.title("Weekly periods per subject")
plt.show()
```
🌧️ Practical — Using rainfall.csv
Use the rainfall.csv dataset (12-month rainfall data for Tamil Nadu) to plot a line graph, bar graph, histogram, scatter graph and pie chart.
```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("rainfall.csv")

# Line chart
plt.plot(df["Month"], df["Rainfall"], marker="o")
plt.title("Monthly Rainfall - Tamil Nadu")
plt.show()

# Bar chart
plt.bar(df["Month"], df["Rainfall"], color="teal")
plt.show()

# Histogram — frequency distribution of the rainfall values
plt.hist(df["Rainfall"], bins=5, edgecolor="black")
plt.show()

# Scatter chart — month vs rainfall
plt.scatter(df["Month"], df["Rainfall"])
plt.show()

# Pie chart
plt.pie(df["Rainfall"], labels=df["Month"], autopct="%.1f%%")
plt.show()
```
1.8 Introduction to Matrices
Example — Purchases Table
Tabular form → Matrix form:
Order of a Matrix
A matrix with m rows and n columns has order m × n. Total elements = m × n. Each element is written as aᵢⱼ, where i = row number and j = column number.
Three Matrix Operations
Addition
Add corresponding elements. Both matrices must be of the same order.

A + B = [aᵢⱼ + bᵢⱼ]
Subtraction
Subtract corresponding elements. Same order required.

A − B = [aᵢⱼ − bᵢⱼ]
Transpose (Aᵀ)
Interchange rows and columns. A 3×2 matrix becomes a 2×3 matrix.

Applications of Matrices in AI
| Application | How matrices are used |
|---|---|
| 🖼️ Image Processing | Digital images stored as matrices of pixel intensities (0-255 for grayscale; 0 = black, 255 = white). Colour images use 3 channels (R, G, B). |
| 🎯 Recommender Systems | Relate users to the products they purchased / viewed — each cell rates a user-product interaction. |
| 🗣️ NLP | Vectors (1-D matrices) depict the distribution of words across documents. |
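The three matrix operations can be sketched with plain Python lists — no external library needed; the example matrices here are made up for illustration:

```python
# Two 3x2 matrices (made-up values)
A = [[1, 2],
     [3, 4],
     [5, 6]]
B = [[6, 5],
     [4, 3],
     [2, 1]]

# Addition and subtraction: element-by-element, same order required
add = [[A[i][j] + B[i][j] for j in range(2)] for i in range(3)]
sub = [[A[i][j] - B[i][j] for j in range(2)] for i in range(3)]

# Transpose: rows become columns, so the 3x2 matrix becomes 2x3
transpose = [[A[i][j] for i in range(3)] for j in range(2)]

print(add)        # [[7, 7], [7, 7], [7, 7]]
print(sub)        # [[-5, -3], [-1, 1], [3, 5]]
print(transpose)  # [[1, 3, 5], [2, 4, 6]]
```

In practice NumPy arrays are used for this, but the list version makes the element-wise definitions visible.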
1.9 Data Pre-processing
Five Stages of Pre-processing
Data Cleaning · Data Transformation · Data Reduction · Data Integration/Normalisation · Feature Selection.
Data Cleaning — Four Common Issues
| Issue | How to handle |
|---|---|
| ❓ Missing Data | Delete rows / columns with missing values, or impute (fill with mean / median / mode), or use algorithms that tolerate missing data. |
| 📍 Outliers | Identify (values far from the rest — errors or rare events), then remove, transform, or use robust statistics. |
| ⚠️ Inconsistent Data | Fix typographical errors, mismatched data types. Standardise formats. |
| ♊ Duplicate Data | Identify and remove duplicate records to ensure data integrity. |
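Two of these fixes — removing duplicates and imputing missing values with the mean — can be sketched with pandas (the tiny DataFrame below is a made-up example):

```python
import pandas as pd

# Hypothetical table with one duplicate row and one missing mark
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Ravi", "Meena"],
    "marks": [85, 90, 90, None],
})

df = df.drop_duplicates()                             # remove duplicate records
df["marks"] = df["marks"].fillna(df["marks"].mean())  # impute missing with mean
print(df)
```

After cleaning, the duplicate "Ravi" row is gone and Meena's missing mark is filled with the mean of the remaining marks (87.5).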
1.10 Data in Modelling & Evaluation
Train–Test Split
After pre-processing, data is split into:
- Training set — used to train the model (usually 80%).
- Testing set — used to evaluate the trained model (usually 20%).
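The 80/20 split is commonly done with scikit-learn's train_test_split (scikit-learn is not named in the syllabus; the toy data here is invented):

```python
from sklearn.model_selection import train_test_split

X = list(range(10))        # ten toy samples
y = [v * 2 for v in X]     # toy targets

# test_size=0.2 gives the usual 80/20 split;
# random_state fixes the shuffle so results are reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))  # 8 2
```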
Validation Techniques
🎯 Train-Test Split
Train on training set, evaluate on a separate test set. Simple & fast.

🔁 Cross-Validation

Split the data into k folds, train/test across rotations to ensure performance is consistent across different subsets.

Evaluation Metrics
| Problem type | Common metrics |
|---|---|
| 📦 Classification | Accuracy · Precision · Recall · F1-Score · ROC curve. |
| 📈 Regression | MSE (Mean Squared Error) · RMSE · MAE (Mean Absolute Error) · R² (R-Squared). |
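Two of these metrics can be computed with scikit-learn (an assumption — the syllabus does not name a library; the labels below are invented for illustration):

```python
from sklearn.metrics import accuracy_score, mean_squared_error

# Classification: accuracy = correct predictions / total predictions
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print("Accuracy =", accuracy_score(y_true, y_pred))  # 4/5 = 0.8

# Regression: MSE = mean of squared differences between actual and predicted
actual = [3.0, 5.0, 7.0]
predicted = [2.5, 5.0, 8.0]
print("MSE =", mean_squared_error(actual, predicted))  # (0.25 + 0 + 1) / 3
```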
1.11 Practical Programs (Syllabus)
Program A — Mean / Median / Mode of ages
```python
import statistics

ages = [25, 28, 30, 35, 40, 45, 50, 55, 60, 65]

print("Mean   =", statistics.mean(ages))
print("Median =", statistics.median(ages))
# Note: when every value is unique, mode() raises StatisticsError before
# Python 3.8 and returns the first value from Python 3.8 onwards.
print("Mode   =", statistics.mode(ages))
```
Program B — Variance / SD of temperatures
```python
import statistics

temps = [20, 22, 25, 18, 23]

print("Variance =", statistics.variance(temps))
print("Std. Dev. =", statistics.stdev(temps))
```
Program C — Line chart (weekly inquiries)
```python
import matplotlib.pyplot as plt

weeks = ["Week 1", "Week 2", "Week 3", "Week 4"]
inquiries = [150, 170, 180, 200]

plt.plot(weeks, inquiries, marker="o", color="purple")
plt.title("Customer Inquiries Per Week")
plt.ylabel("Number of inquiries")
plt.show()
```
Program D — Bar chart (book sales by genre)
```python
import matplotlib.pyplot as plt

genres = ["Fiction", "Mystery", "Sci-Fi", "Romance", "Biography"]
sales = [120, 90, 80, 110, 70]

plt.bar(genres, sales, color="orange", edgecolor="black")
plt.title("Books Sold by Genre")
plt.show()
```
Program E — Pie chart (city transportation)
```python
import matplotlib.pyplot as plt

modes = ["Car", "Public Transit", "Walking", "Bicycle"]
shares = [40, 30, 20, 10]

plt.pie(shares, labels=modes, autopct="%.1f%%")
plt.title("How people commute in the city")
plt.show()
```
1.12 Certification (Syllabus)
Quick Revision — Key Points to Remember
- Data literacy = ability to find and use data effectively (collect, organise, analyse, interpret, use ethically).
- Data types: structured · semi-structured · unstructured.
- Two sources of data: Primary (newly created) · Secondary (already stored).
- 6 primary methods: Survey · Interview · Observation · Experiment · Marketing Campaign · Questionnaire.
- 4 secondary methods: Social media tracking · Web scraping · Satellite data · Online data platforms (Kaggle, GitHub).
- 4 levels of measurement: Nominal (names) · Ordinal (ordered) · Interval (ordered + measurable diffs, no true 0) · Ratio (true 0, all ops allowed).
- Central tendency: Mean (even spread) · Median (skewed / ordinal) · Mode (peak).
- Dispersion: Variance = Σ(x−mean)²/n · Std Dev = √variance. Low → clustered; High → spread.
- 5 chart types: Line (trends) · Bar (comparison) · Histogram (frequency) · Scatter (relationships) · Pie (proportions).
- Plotting library: matplotlib.pyplot → plot / bar / hist / scatter / pie.
- Matrix = rectangular number arrangement in rows × columns. Order = m × n.
- 3 matrix ops: Addition · Subtraction · Transpose (Aᵀ).
- AI uses matrices for: image processing · recommender systems · NLP word vectors.
- Pre-processing 5 stages: Cleaning · Transformation · Reduction · Integration/Normalisation · Feature Selection.
- Cleaning 4 issues: Missing data · Outliers · Inconsistent data · Duplicates.
- Modelling: Train/Test split (80/20) · Cross-validation.
- Evaluation: Classification → Accuracy/Precision/Recall/F1/ROC · Regression → MSE/RMSE/MAE/R².
- Certification: IBM SkillsBuild — Data Visualisation with Python (Modules 1-3).