1.1 What is Data Literacy?
1.2 Data Collection
Data collection means gathering data by scraping, capturing, or loading it from multiple offline and online sources. Key guidelines:
- Quantity depends on the number of features and model complexity. A license-plate detector needs less data; a medical AI needs huge volumes.
- Diversity matters more than raw quantity — diverse data helps the model cover more scenarios.
- Start with small batches, then scale up.
- Collection is iterative — as the project develops, data requirements may change.
Two Sources of Data
🆕 Primary Sources
Data created newly for the analysis. The researcher collects it first-hand.

📚 Secondary Sources

Data already stored and ready for reuse — books, journals, newspapers, websites, databases.

Primary Data Collection Methods
| Method | Description | Example |
|---|---|---|
| 📋 Survey | Gather data via interviews, questionnaires or online forms. Measures opinions, behaviours, demographics. | Researcher uses a questionnaire to learn consumer preferences for a new product. |
| 🎙️ Interview | Direct communication with individuals or groups. Can be structured, semi-structured, or unstructured. | One-on-one interviews with employees to gather feedback about job satisfaction. |
| 👁️ Observation | Watch and record behaviours or events as they occur. Used when direct interaction isn't possible. | Observing children's play patterns in a schoolyard to understand social dynamics. |
| 🧪 Experiment | Manipulate variables to observe effects on outcomes. Establishes cause-and-effect relationships. | Testing effectiveness of different ad campaigns on a group of people. |
| 📢 Marketing Campaign | Use customer data to predict behaviour and optimise campaign performance. | Personalised email based on past customer purchases. |
| 📝 Questionnaire | A specific survey tool — list of questions. Can collect quantitative or qualitative data. | Rate satisfaction 1-5, plus open-ended feedback. |
Secondary Data Collection Methods
| Method | Description | Example |
|---|---|---|
| 💬 Social Media Tracking | Collect user posts, comments, interactions. | Analysing sentiment around a product launch. |
| 🕸️ Web Scraping | Automated tools extract content and data from websites. | Scraping product prices from e-commerce sites for comparison. |
| 🛰️ Satellite Data | Gather info about Earth's surface / atmosphere via satellites. | Monitoring weather patterns and environmental changes. |
| ☁️ Online Data Platforms | Websites offering pre-compiled datasets. | Kaggle · GitHub · data.gov.in. |
1.3 Exploring Data — Levels of Measurement
1️⃣ Nominal
Categories / names only. No order, no calculation.

Examples: eye colour, yes/no responses, gender, smartphone brands, player jersey numbers.
2️⃣ Ordinal
Ordered categories — but differences cannot be measured. Still qualitative.

Examples: ratings (unpalatable → delicious), grades (A/B/C/D), survey "excellent / good / satisfactory / unsatisfactory".
3️⃣ Interval
Ordered with measurable differences — but no true zero. Ratios are meaningless.

Examples: temperature in °C or °F (0° doesn't mean absence of temperature — −20 °F and −30 °C exist).
4️⃣ Ratio
Like interval but has a true zero point. Ratios are meaningful; all arithmetic operations allowed.

Examples: weight, height, exam score (0-100), number of books.
1.4 Statistical Analysis — Measures of Central Tendency
Measures of central tendency — mean, median and mode — describe the centre of a dataset. In Python they can be computed with the built-in statistics library.
Python statistics Library — Quick Reference
```python
import statistics

statistics.mean(data)      # arithmetic mean
statistics.median(data)    # median (middle value)
statistics.mode(data)      # most frequent value
statistics.variance(data)  # variance
statistics.stdev(data)     # standard deviation
```
📏 Mean (Arithmetic Average)
M = Σfx / n, where M = mean, Σ = sum, f = frequency, x = score, n = total cases. For simple (ungrouped) data, the mean is the sum of the values divided by their count:
Mean = (5 + 10 + 15 + 20 + 30) / 5 = 80 / 5 = 16
📐 Median (Middle Value)
Arrange data in ascending / descending order — the middle value is the median. For 13 values (odd), the 7th is the median.
Sorted: 10, 11, 15, 17, 20, 21, 27, 28, 30, 32, 32, 35, 40 → Median = 27
🏔️ Mode (Most Frequent Value)
The mode is the value that occurs most often in a dataset. A dataset may have one mode, more than one, or none at all.
When to Use Mean / Median / Mode?
| Measure | When to use |
|---|---|
| Mean | Data is evenly spread with no exceptionally high / low values. |
| Median | Data includes exceptionally high or low values (outliers); also for ordinal data. |
| Mode | Finding the distribution peak — e.g., most popular book, most common shoe size. |
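The mean-vs-median guidance in the table can be seen with a small sketch — the salary figures below are made up, with one deliberate outlier:

```python
import statistics

# Hypothetical salaries (thousands); 1000 is an extreme outlier
salaries = [30, 32, 35, 38, 40, 1000]

print("Mean   =", statistics.mean(salaries))    # dragged upward by the outlier
print("Median =", statistics.median(salaries))  # 36.5 — robust to the outlier
```

The median (36.5) describes a typical salary far better than the mean here, which is why the median is preferred when data contains exceptionally high or low values.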
Program 1 — Mean / Median / Mode of student heights
```python
import statistics

heights = [145, 151, 152, 149, 147, 152, 151, 149, 152, 151, 147, 148, 155,
           147, 152, 151, 149, 145, 147, 152, 146, 148, 150, 152, 151]

print("Mean   =", statistics.mean(heights))
print("Median =", statistics.median(heights))
print("Mode   =", statistics.mode(heights))
```
1.5 Statistical Analysis — Measures of Dispersion
Central tendency tells us the centre. Dispersion tells us how spread out the data is around that centre. The two key measures are Variance and Standard Deviation.
Worked example — data values (mm): 600, 470, 170, 430, 300.

Mean = 1970 / 5 = 394 mm
Deviations from the mean: 206, 76, −224, 36, −94
Squared deviations: 42436 + 5776 + 50176 + 1296 + 8836 = 108520
Variance = 108520 / 5 = 21704
Standard Deviation = √21704 ≈ 147.32 mm
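The worked example can be checked with the statistics module; the five data values (600, 470, 170, 430, 300) are inferred from the stated mean and deviations. Note that the example divides by n, which is the population variance (pvariance), whereas statistics.variance divides by n − 1:

```python
import statistics

# Values inferred from the worked example above
data = [600, 470, 170, 430, 300]

print("Mean     =", statistics.mean(data))       # 394
print("Variance =", statistics.pvariance(data))  # population variance: 21704
print("Std Dev  =", round(statistics.pstdev(data), 2))  # ≈ 147.32
```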
Interpreting Variance / Std Deviation
⬇️ Low Variance / Low SD
Data points are very close to the mean and to each other — tightly clustered.

⬆️ High Variance / High SD

Data points are very spread out from the mean — widely scattered.

Program 2 — Variance & Standard Deviation
```python
import statistics

heights = [145, 151, 152, 149, 147, 152, 151, 149, 152, 151, 147, 148, 155,
           147, 152, 151, 149, 145, 147, 152, 146, 148, 150, 152, 151]

print("Variance =", statistics.variance(heights))
print("Std. Dev. =", statistics.stdev(heights))
```
1.6 Data Representation & Visualisation
Data representation is classified into two categories:
📋 Non-Graphical
Tabular form, case form. Old format — not suitable for large datasets or decision-making.

📊 Graphical (Data Visualisation)

Points, lines, dots, shapes — the human brain handles complex data better when visualised. The 5 core types are Line · Bar · Histogram · Scatter · Pie.

Matplotlib — Python's Plotting Library
```shell
pip install matplotlib   # or: python -m pip install -U matplotlib
```
Import and use:
```python
import matplotlib.pyplot as plt
```
Common pyplot Functions
| Function | Description |
|---|---|
| `title(text)` | Adds a title to the chart. |
| `xlabel(text)` / `ylabel(text)` | Sets the X-axis / Y-axis label. |
| `xlim(a, b)` / `ylim(a, b)` | Sets value limits on the axes. |
| `xticks(list)` / `yticks(list)` | Sets tick marks. |
| `show()` | Displays the graph on screen. |
| `savefig("path")` | Saves the graph to disk. |
| `figure(figsize=(w, h))` | Sets the figure size in inches. |
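The functions in the table can be combined in one short script. This sketch uses the Agg backend and the filename demo.png (both are choices for running without a display, not part of the syllabus):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 4))               # 6 x 4 inches
plt.plot([1, 2, 3, 4], [10, 20, 15, 30])
plt.title("Demo of common pyplot functions")
plt.xlabel("x values")
plt.ylabel("y values")
plt.xlim(0, 5)                           # axis limits
plt.xticks([1, 2, 3, 4])                 # tick marks
plt.savefig("demo.png")                  # write to disk instead of show()
```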
1.7 Five Chart Types
A. Line Graph — for continuous data / trends over time
Shows trends and changes. The line slopes up (increase) or down (decrease). Function: plt.plot().
```python
import matplotlib.pyplot as plt

tests = ["T1", "T2", "T3", "T4", "T5"]
marks = [25, 34, 49, 40, 48]

plt.plot(tests, marks, marker="o", color="blue")
plt.title("Kavya's AI marks")
plt.xlabel("Test")
plt.ylabel("Marks")
plt.show()
```
B. Bar Graph — for comparison between categories
Rectangular bars with heights proportional to values. Function: plt.bar().
```python
import matplotlib.pyplot as plt

schools = ["Oxford", "Delhi", "Jyothis", "Sanskriti", "Bombay"]
students = [123, 87, 105, 146, 34]

plt.bar(schools, students, color="green", edgecolor="black")
plt.title("Deep Learning Seminar Attendance")
plt.show()
```
C. Histogram — for frequency distribution
Vertical rectangles showing frequencies of value ranges (bins). One distribution per axis. Function: plt.hist().
```python
import matplotlib.pyplot as plt

heights = [141, 145, 142, 147, 144, 148, 141, 142, 149,
           144, 143, 149, 146, 141, 147, 142, 143]

plt.hist(heights, bins=5, color="skyblue", edgecolor="black")
plt.title("Height distribution — Class XII girls")
plt.show()
```
D. Scatter Graph — for relationships between two variables
Plots data points on x and y axes to reveal correlations, clusters, trends. Function: plt.scatter().
```python
import matplotlib.pyplot as plt

study_hours = [2, 4, 5, 7, 8, 10]
math_score = [55, 65, 70, 80, 85, 95]

plt.scatter(study_hours, math_score, color="red")
plt.title("Study Hours vs Math Score")
plt.xlabel("Hours per week")
plt.ylabel("Math %")
plt.show()
```
E. Pie Chart — for proportions / composition
Circular graph divided into segments representing percentages. Limit to ~7 categories for clarity. Cannot show zero values or changes over time. Function: plt.pie().
```python
import matplotlib.pyplot as plt

subjects = ["English", "Maths", "Science", "Social", "AI", "PE"]
periods = [6, 8, 8, 7, 3, 2]

plt.pie(periods, labels=subjects, autopct="%.1f%%")
plt.title("Weekly periods per subject")
plt.show()
```
🌧️ Practical — Using rainfall.csv
Use the rainfall.csv dataset (12-month rainfall data for Tamil Nadu) to plot a line graph, bar graph, histogram, scatter graph and pie chart.
```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("rainfall.csv")

# Line chart
plt.plot(df["Month"], df["Rainfall"], marker="o")
plt.title("Monthly Rainfall - Tamil Nadu")
plt.show()

# Bar chart
plt.bar(df["Month"], df["Rainfall"], color="teal")
plt.show()

# Histogram — frequency distribution of the rainfall values
plt.hist(df["Rainfall"], bins=5, edgecolor="black")
plt.show()

# Scatter chart — month vs rainfall
plt.scatter(df["Month"], df["Rainfall"])
plt.show()

# Pie chart
plt.pie(df["Rainfall"], labels=df["Month"], autopct="%.1f%%")
plt.show()
```
1.8 Introduction to Matrices
Example — Purchases Table
Tabular form → Matrix form:
Order of a Matrix
A matrix with m rows and n columns has order m × n. Total elements = m × n. Each element is written as aᵢⱼ, where i = row number and j = column number.
Three Matrix Operations
Addition
Add corresponding elements. Both matrices must be of the same order.

A + B = [aᵢⱼ + bᵢⱼ]
Subtraction
Subtract corresponding elements. Same order required.

A − B = [aᵢⱼ − bᵢⱼ]
Transpose (Aᵀ)
Interchange rows and columns. A 3×2 matrix becomes a 2×3 matrix.

Applications of Matrices in AI
| Application | How matrices are used |
|---|---|
| 🖼️ Image Processing | Digital images stored as matrices of pixel intensities (0-255 for grayscale; 0 = black, 255 = white). Colour images use 3 channels (R, G, B). |
| 🎯 Recommender Systems | Relate users to the products they purchased / viewed — each cell rates a user-product interaction. |
| 🗣️ NLP | Vectors (1-D matrices) depict the distribution of words across documents. |
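The three matrix operations can be sketched with plain Python lists — no external library needed; the example matrices here are made up for illustration:

```python
# Two 3x2 matrices (made-up values)
A = [[1, 2],
     [3, 4],
     [5, 6]]
B = [[6, 5],
     [4, 3],
     [2, 1]]

# Addition and subtraction: element-by-element, same order required
add = [[A[i][j] + B[i][j] for j in range(2)] for i in range(3)]
sub = [[A[i][j] - B[i][j] for j in range(2)] for i in range(3)]

# Transpose: rows become columns, so the 3x2 matrix becomes 2x3
transpose = [[A[i][j] for i in range(3)] for j in range(2)]

print(add)        # [[7, 7], [7, 7], [7, 7]]
print(sub)        # [[-5, -3], [-1, 1], [3, 5]]
print(transpose)  # [[1, 3, 5], [2, 4, 6]]
```

In practice NumPy arrays are used for this, but the list version makes the element-wise definitions visible.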
1.9 Data Pre-processing
Five Stages of Pre-processing
Data Cleaning · Data Transformation · Data Reduction · Data Integration/Normalisation · Feature Selection.
Data Cleaning — Four Common Issues
| Issue | How to handle |
|---|---|
| ❓ Missing Data | Delete rows / columns with missing values, or impute (fill with mean / median / mode), or use algorithms that tolerate missing data. |
| 📍 Outliers | Identify (values far from the rest — errors or rare events), then remove, transform, or use robust statistics. |
| ⚠️ Inconsistent Data | Fix typographical errors, mismatched data types. Standardise formats. |
| ♊ Duplicate Data | Identify and remove duplicate records to ensure data integrity. |
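Two of these fixes — removing duplicates and imputing missing values with the mean — can be sketched with pandas (the tiny DataFrame below is a made-up example):

```python
import pandas as pd

# Hypothetical table with one duplicate row and one missing mark
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Ravi", "Meena"],
    "marks": [85, 90, 90, None],
})

df = df.drop_duplicates()                             # remove duplicate records
df["marks"] = df["marks"].fillna(df["marks"].mean())  # impute missing with mean
print(df)
```

After cleaning, the duplicate "Ravi" row is gone and Meena's missing mark is filled with the mean of the remaining marks (87.5).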
1.10 Data in Modelling & Evaluation
Train–Test Split
After pre-processing, data is split into:
- Training set — used to train the model (usually 80%).
- Testing set — used to evaluate the trained model (usually 20%).
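The 80/20 split is commonly done with scikit-learn's train_test_split (scikit-learn is not named in the syllabus; the toy data here is invented):

```python
from sklearn.model_selection import train_test_split

X = list(range(10))        # ten toy samples
y = [v * 2 for v in X]     # toy targets

# test_size=0.2 gives the usual 80/20 split;
# random_state fixes the shuffle so results are reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))  # 8 2
```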
Validation Techniques
🎯 Train-Test Split
Train on training set, evaluate on a separate test set. Simple & fast.

🔁 Cross-Validation

Split the data into k folds, train/test across rotations to ensure performance is consistent across different subsets.

Evaluation Metrics
| Problem type | Common metrics |
|---|---|
| 📦 Classification | Accuracy · Precision · Recall · F1-Score · ROC curve. |
| 📈 Regression | MSE (Mean Squared Error) · RMSE · MAE (Mean Absolute Error) · R² (R-Squared). |
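Two of these metrics can be computed with scikit-learn (an assumption — the syllabus does not name a library; the labels below are invented for illustration):

```python
from sklearn.metrics import accuracy_score, mean_squared_error

# Classification: accuracy = correct predictions / total predictions
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print("Accuracy =", accuracy_score(y_true, y_pred))  # 4/5 = 0.8

# Regression: MSE = mean of squared differences between actual and predicted
actual = [3.0, 5.0, 7.0]
predicted = [2.5, 5.0, 8.0]
print("MSE =", mean_squared_error(actual, predicted))  # (0.25 + 0 + 1) / 3
```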
1.11 Practical Programs (Syllabus)
Program A — Mean / Median / Mode of ages
```python
import statistics

ages = [25, 28, 30, 35, 40, 45, 50, 55, 60, 65]

print("Mean   =", statistics.mean(ages))
print("Median =", statistics.median(ages))
# Note: when every value is unique, mode() raises StatisticsError before
# Python 3.8 and returns the first value from Python 3.8 onwards.
print("Mode   =", statistics.mode(ages))
```
Program B — Variance / SD of temperatures
```python
import statistics

temps = [20, 22, 25, 18, 23]

print("Variance =", statistics.variance(temps))
print("Std. Dev. =", statistics.stdev(temps))
```
Program C — Line chart (weekly inquiries)
```python
import matplotlib.pyplot as plt

weeks = ["Week 1", "Week 2", "Week 3", "Week 4"]
inquiries = [150, 170, 180, 200]

plt.plot(weeks, inquiries, marker="o", color="purple")
plt.title("Customer Inquiries Per Week")
plt.ylabel("Number of inquiries")
plt.show()
```
Program D — Bar chart (book sales by genre)
```python
import matplotlib.pyplot as plt

genres = ["Fiction", "Mystery", "Sci-Fi", "Romance", "Biography"]
sales = [120, 90, 80, 110, 70]

plt.bar(genres, sales, color="orange", edgecolor="black")
plt.title("Books Sold by Genre")
plt.show()
```
Program E — Pie chart (city transportation)
```python
import matplotlib.pyplot as plt

modes = ["Car", "Public Transit", "Walking", "Bicycle"]
shares = [40, 30, 20, 10]

plt.pie(shares, labels=modes, autopct="%.1f%%")
plt.title("How people commute in the city")
plt.show()
```
1.12 Certification (Syllabus)
Quick Revision — Key Points to Remember
- Data literacy = ability to find and use data effectively (collect, organise, analyse, interpret, use ethically).
- Data types: structured · semi-structured · unstructured.
- Two sources of data: Primary (newly created) · Secondary (already stored).
- 6 primary methods: Survey · Interview · Observation · Experiment · Marketing Campaign · Questionnaire.
- 4 secondary methods: Social media tracking · Web scraping · Satellite data · Online data platforms (Kaggle, GitHub).
- 4 levels of measurement: Nominal (names) · Ordinal (ordered) · Interval (ordered + measurable diffs, no true 0) · Ratio (true 0, all ops allowed).
- Central tendency: Mean (even spread) · Median (skewed / ordinal) · Mode (peak).
- Dispersion: Variance = Σ(x−mean)²/n · Std Dev = √variance. Low → clustered; High → spread.
- 5 chart types: Line (trends) · Bar (comparison) · Histogram (frequency) · Scatter (relationships) · Pie (proportions).
- Plotting library: matplotlib.pyplot → plot / bar / hist / scatter / pie.
- Matrix = rectangular number arrangement in rows × columns. Order = m × n.
- 3 matrix ops: Addition · Subtraction · Transpose (Aᵀ).
- AI uses matrices for: image processing · recommender systems · NLP word vectors.
- Pre-processing 5 stages: Cleaning · Transformation · Reduction · Integration/Normalisation · Feature Selection.
- Cleaning 4 issues: Missing data · Outliers · Inconsistent data · Duplicates.
- Modelling: Train/Test split (80/20) · Cross-validation.
- Evaluation: Classification → Accuracy/Precision/Recall/F1/ROC · Regression → MSE/RMSE/MAE/R².
- Certification: IBM SkillsBuild — Data Visualisation with Python (Modules 1-3).