Introduction — What Will We Learn?
- Review the basics of the NumPy and Pandas libraries — arrays, Series, DataFrames and essential functions.
- Efficiently import and export data between CSV files and Pandas DataFrames.
- Detect and handle missing values in a dataset.
- (Advanced) Implement a Linear Regression algorithm including data preparation and model training.
Prerequisite: Foundational understanding of Python from Class XI and familiarity with basic programming.
1.1 Python Libraries
In data science and analytics, two libraries stand out — NumPy and Pandas. They form the backbone of data manipulation and analysis in Python, enabling users to handle large datasets with ease and precision.
1.2 NumPy — Numerical Python
NumPy's core object is the N-dimensional array (ndarray). In NumPy, the number of dimensions of an array is called the rank of the array.
🔹 Quick NumPy Recap (from Class XI)
- Creation: np.array([1, 2, 3, 4]), np.zeros(5), np.ones((2,3)), np.arange(10), np.linspace(0,1,5).
- Shape & rank: arr.shape, arr.ndim, arr.size, arr.dtype.
- Arithmetic: element-wise + − * / between arrays of same shape.
- Statistics: np.mean, np.median, np.std, np.var, np.min, np.max, np.sum.
- Indexing / slicing: arr[0], arr[1:4], arr[:, 0] (column 0 of a 2-D array).
- NaN: np.nan represents a missing numeric value (older code may spell it np.NaN, which NumPy 2.0 removed).
```python
import numpy as np

a = np.array([90, 100, 110, 120])
print(a.shape)   # (4,)
print(a.mean())  # 105.0
```
1.3 Pandas — Why and Where We Use It
Suppose we have a dataset of marketing campaigns — campaign type, budget, duration, reach, engagement metrics and sales performance. Pandas is used to:
- Load the dataset.
- Display summary statistics.
- Perform group-wise analysis to understand campaign performance.
- Pair with Matplotlib for visualising sales and average engagement per campaign type.
This capability is invaluable in AI and data-driven decision-making — it lets businesses gain actionable insights from their data.
1.4 Pandas Data Structures — Series & DataFrame
Pandas provides two main data structures for manipulating data:
- Series — a one-dimensional labelled array (a single column of data with an index).
- DataFrame — a two-dimensional labelled table (rows + columns, like a spreadsheet).
🧩 1. Creating a Series from a List of Scalar Values
```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s)
```
📐 2. Creating a DataFrame from NumPy Arrays
```python
import numpy as np
import pandas as pd

array1 = np.array([90, 100, 110, 120])
array2 = np.array([50, 60, 70])          # one value shorter than the others
array3 = np.array([10, 20, 30, 40])

# Each array becomes one row; the shorter row is padded with NaN.
marksDF = pd.DataFrame([array1, array2, array3])
print(marksDF)
```

Note: passing columns=['A','B','C','D'] here would raise a ValueError, because array2 supplies only three values for four column labels. Without an explicit columns argument, Pandas pads the short row with NaN instead.
🗃️ 3. Creating a DataFrame from a Dictionary of Arrays / Lists
```python
import pandas as pd

data = {'Name': ['Varun', 'Ganesh', 'Joseph', 'Abdul', 'Reena'],
        'Age':  [37, 30, 38, 39, 40]}
df = pd.DataFrame(data)
print(df)
```
The dictionary keys become column labels and the lists become the column data.
📋 4. Creating a DataFrame from a List of Dictionaries
```python
import pandas as pd

listDict = [{'a': 10, 'b': 20},
            {'a': 5, 'b': 10, 'c': 20}]
a = pd.DataFrame(listDict)
print(a)
```
There will be as many rows as dictionaries in the list. Missing keys become NaN.
1.5 Dealing with Rows and Columns
➕ 1. Adding a New Column to a DataFrame
```python
import pandas as pd

ResultSheet = {
    'Rajat':     pd.Series([90, 91, 97], index=['Maths', 'Science', 'Hindi']),
    'Amrita':    pd.Series([92, 81, 96], index=['Maths', 'Science', 'Hindi']),
    'Meenakshi': pd.Series([89, 91, 88], index=['Maths', 'Science', 'Hindi']),
    'Rose':      pd.Series([81, 71, 67], index=['Maths', 'Science', 'Hindi']),
    'Karthika':  pd.Series([94, 95, 99], index=['Maths', 'Science', 'Hindi'])
}
Result = pd.DataFrame(ResultSheet)
Result['Fathima'] = [89, 78, 76]  # new column
print(Result)
```
➕ 2. Adding a New Row Using .loc[]
```python
Result.loc['English'] = [90, 92, 89, 80, 90, 88]  # one mark per student (six columns, including Fathima)
print(Result)
```
✏️ 3. Updating a Row with .loc[]
```python
Result.loc['Science'] = [92, 84, 90, 72, 96, 88]  # change Science marks
print(Result)
```
1.6 Deleting Rows or Columns — .drop()
- axis = 0 → delete a row.
- axis = 1 → delete a column.
```python
Result = Result.drop('Hindi', axis=0)  # delete row 'Hindi'
Result = Result.drop(['Rajat', 'Meenakshi', 'Karthika'], axis=1)  # delete columns
print(Result)
```
1.7 Attributes of DataFrames
```python
import pandas as pd

# Use 'data' rather than 'dict' as the variable name, to avoid shadowing the built-in dict type.
data = {
    "Student": pd.Series(["Arnav", "Neha", "Priya", "Rahul"],
                         index=["Data 1", "Data 2", "Data 3", "Data 4"]),
    "Marks":   pd.Series([85, 92, 78, 83],
                         index=["Data 1", "Data 2", "Data 3", "Data 4"]),
    "Sports":  pd.Series(["Cricket", "Volleyball", "Hockey", "Badminton"],
                         index=["Data 1", "Data 2", "Data 3", "Data 4"])
}
df = pd.DataFrame(data)
print(df)
```
| Attribute | Returns |
|---|---|
| df.index | Row labels (Index object). |
| df.columns | Column labels. |
| df.shape | (rows, cols) tuple — e.g., (4, 3). |
| df.head(n) | First n rows (default 5). |
| df.tail(n) | Last n rows. |
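The attributes in the table can be tried out on the four-row DataFrame built above (rebuilt here in full so the snippet runs on its own):

```python
import pandas as pd

data = {
    "Student": ["Arnav", "Neha", "Priya", "Rahul"],
    "Marks":   [85, 92, 78, 83],
    "Sports":  ["Cricket", "Volleyball", "Hockey", "Badminton"],
}
df = pd.DataFrame(data, index=["Data 1", "Data 2", "Data 3", "Data 4"])

print(df.shape)          # (4, 3)
print(list(df.index))    # ['Data 1', 'Data 2', 'Data 3', 'Data 4']
print(list(df.columns))  # ['Student', 'Marks', 'Sports']
print(df.head(2))        # first two rows
print(df.tail(1))        # last row
```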
2.1 What Is a CSV File?
A CSV (Comma-Separated Values) file stores tabular data as plain text: each line is one row, and the values within a row are separated by commas. In Python, CSV files are fundamental for data analysis, and Pandas provides powerful tools to read them into DataFrames, the go-to format for data scientists.
2.2 Importing a CSV File to a DataFrame — read_csv()
Use pd.read_csv("filename.csv") to import tabular data from a CSV into a Pandas DataFrame.
☁️ 1. Using Google Colab
- Open Google Colab at colab.research.google.com → File → New notebook.
- Click the Folder icon on the left sidebar.
- Click the ↑ Upload button → select the CSV file (e.g., studentsmarks.csv).
- Execute the code:
```python
import pandas as pd

df = pd.read_csv("studentsmarks.csv")
print(df)
```
💻 2. Using a Local Python IDE
Give the complete path of the CSV file:
```python
import pandas as pd

df = pd.read_csv('C:/PANDAS/studentsmarks.csv', sep=",", header=0)
print(df)
```
2.3 Exporting a DataFrame to a CSV File — to_csv()
Use df.to_csv() to save a DataFrame to a text or CSV file.
💾 On Local Python IDE
```python
df.to_csv(path_or_buf='C:/PANDAS/resultout.csv', sep=',')
```
This creates resultout.csv on the hard disk. Opening it in any text editor or spreadsheet shows the data with row labels and column headers separated by commas.
☁️ On Google Colab
```python
df.to_csv("resultout.csv", index=False)
```
The index=False argument prevents the row index from being written into the CSV.
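A quick round-trip sketch, using a small made-up two-column frame, shows that writing with index=False and reading the file back preserves the data exactly:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Asha', 'Ravi'], 'Marks': [88, 91]})
df.to_csv('resultout.csv', index=False)  # write without the row index
back = pd.read_csv('resultout.csv')      # read it back into a DataFrame
print(back.equals(df))                   # the round trip preserves the data
```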
2.4 Handling Missing Values
The two most common strategies to handle missing values are:
- Drop the row having missing values.
- Estimate (fill) the missing value.
🔍 1. Checking Missing Values — isnull()
Pandas provides the isnull() function to check for missing values. It returns a table of the same shape as the data, with True wherever a value is missing and False elsewhere; chaining .any() on a column tells whether that column has any missing value.
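A minimal sketch on a small hypothetical two-column frame (the column names and values are invented for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 5, np.nan]})

print(df.isnull())              # element-wise True/False table
print(df.isnull().any())        # per column: does it contain any NaN?
print(df.isnull().sum().sum())  # total number of missing values: 2
```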
🗑️ 2. Drop Missing Values — dropna()
Dropping removes the entire row (object) that has missing value(s). This reduces the size of the dataset, so use it only when missing values appear in only a few rows.
```python
drop = marks.dropna()
print(drop)
```
🔢 3. Estimate the Missing Value — fillna()
Missing values can be filled with estimations — the value just before/after, the average/minimum/maximum of that attribute, or simply 0 or 1.
```python
FillZero = marks.fillna(0)  # replace missing with 0
print(FillZero)
FillOne = marks.fillna(1)   # replace missing with 1
```
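The text above also mentions filling with the average of an attribute. A small sketch of that idea, using a hypothetical one-column Series of marks:

```python
import pandas as pd
import numpy as np

marks = pd.Series([90, np.nan, 80, 70], index=['a', 'b', 'c', 'd'])
filled = marks.fillna(marks.mean())  # mean of 90, 80, 70 is 80.0
print(filled)
```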
2.5 Case Study — Student Marks with Missing Values
Meera and Suhana couldn't attend Science and Hindi exams (fever). Joseph attended a national-level science exhibition on AI exam day. Their marks are missing.
```python
import pandas as pd
import numpy as np

students = ['Heena', 'Shefali', 'Meera', 'Joseph', 'Suhana', 'Bismeet']
ResultSheet = {
    'Maths':   pd.Series([90, 91, 97, 89, 65, 93], index=students),
    'Science': pd.Series([92, 81, np.nan, 87, 50, 88], index=students),
    'English': pd.Series([89, 91, 88, 78, 77, 82], index=students),
    'Hindi':   pd.Series([81, 71, 67, 82, np.nan, 89], index=students),
    'AI':      pd.Series([94, 95, 99, np.nan, 96, 99], index=students)
}
marks = pd.DataFrame(ResultSheet)
print(marks)
```
🔹 Check for Missing Values
```python
>>> print(marks.isnull())        # element-wise True/False table; three True values, three missing pieces
>>> print(marks['Science'].isnull().any())
True
>>> print(marks['Maths'].isnull().any())
False
>>> marks.isnull().sum().sum()   # total NaN in the whole dataset
3
```
🔹 Apply dropna() and fillna()
```python
# Option 1: drop rows with NaN
drop = marks.dropna()
print(drop)

# Option 2: fill NaN with 0
FillZero = marks.fillna(0)
print(FillZero)
```
3.1 Linear Regression — What & Why
🔹 Typical Linear Regression Workflow
- Load the dataset (with Pandas).
- Perform Exploratory Data Analysis (EDA) — head, info, describe, plots.
- Split data into features (X) and target (y).
- Split into training set (80%) and testing set (20%).
- Train the model using sklearn.linear_model.LinearRegression.
- Predict on the test set.
- Evaluate actual vs predicted values.
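The workflow above can be sketched end to end on a small synthetic dataset. The Area/Price columns here are invented for illustration (the real activity in 3.2 uses USA_Housing.csv), and the price is generated as a linear function of area plus noise so the model has something to recover:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# 1. Build a synthetic dataset: price depends linearly on area, plus noise
rng = np.random.default_rng(0)
area = rng.uniform(500, 2000, size=100)
price = 50 * area + 10000 + rng.normal(0, 5000, size=100)
df = pd.DataFrame({'Area': area, 'Price': price})

# 2. Split into features (X) and target (y)
X = df[['Area']]
y = df['Price']

# 3. 80:20 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 4. Train and predict
model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

print(len(X_train), len(X_test))  # 80 20
```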
3.2 Practical Activity — USA Housing Dataset
Dataset (USA Housing) available from the CBSE Handbook's Google-Drive link.
📥 1. Load & Peek
```python
import pandas as pd

df = pd.read_csv('USA_Housing.csv')
df.head()      # first five rows
df.info()      # column types and non-null counts
df.describe()  # summary statistics per column
```
Examining the count row of describe() (or the non-null counts from info()) reveals that every column contains 5000 values, so there are no missing values anywhere in the dataset.
🔍 2. Exploratory Data Analysis
Use df.corr() for correlations, and Matplotlib/Seaborn for scatter-plots between each feature and the target (Price). Strong linear trends indicate good predictors.
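A minimal corr() sketch on a tiny hypothetical frame (the Income/Price columns and values are invented for illustration, not taken from USA_Housing.csv):

```python
import pandas as pd

df = pd.DataFrame({
    'Income': [60, 70, 80, 90, 100],
    'Price':  [300, 340, 390, 430, 480],
})
print(df.corr())                       # pairwise correlation matrix
print(df['Income'].corr(df['Price'])) # close to 1: a strong linear trend
```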
📊 3. Train–Test Split (80 : 20)
```python
from sklearn.model_selection import train_test_split

X = df[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
        'Avg. Area Number of Bedrooms', 'Area Population']]
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# 4000 rows for training (80% of 5000), 1000 rows for testing
```
🤖 4. Apply Linear Regression
```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
```
📈 5. Predict & Compare
```python
import matplotlib.pyplot as plt

predictions = model.predict(X_test)
plt.scatter(y_test, predictions)
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Actual vs Predicted')
plt.show()
```
We observe there is a difference between actual and predicted values. The next chapter (Data Science Methodology · Unit 2) covers how to calculate the error, evaluate the model and test its accuracy.
- Primary data structure in Pandas → Series (not List/Tuple/Matrix).
- fillna(0) → fills missing values with zeros.
- Library typically used for importing/managing data for Linear Regression → Pandas.
- Read CSV → pd.read_csv("filename.csv").
- df.shape → number of rows and columns in the DataFrame.
- Export DataFrame → to_csv().
Quick Revision — Key Points to Remember
- Python libraries = pre-written toolkits so we don't code from scratch.
- NumPy (Numerical Python) = numerical computing + array-processing. Rank = number of dimensions.
- Pandas = data manipulation & analysis — vital in AI / data-driven decisions.
- 2 Pandas data structures: Series (1-D) · DataFrame (2-D).
- Create DataFrame from: NumPy arrays · Dictionary of lists · List of dictionaries.
- Add column: df['NewCol'] = [...]. Add row: df.loc['RowLabel'] = [...].
- Delete: df.drop(label, axis=0) row · axis=1 column.
- Attributes: df.index · df.columns · df.shape · df.head(n) · df.tail(n).
- CSV = Comma-Separated Values — simple tabular text files.
- Import CSV: pd.read_csv("file.csv").
- Export CSV: df.to_csv("file.csv", index=False).
- Handling missing values — 2 strategies: Drop the row · Estimate (fill) the value.
- isnull() → True/False for missing. dropna() → removes rows with NaN. fillna(num) → replaces NaN with num.
- np.nan (older spelling: np.NaN) represents a missing numeric value.
- Linear Regression (advanced) predicts continuous values using y = mx + c.
- LR workflow: Load data → EDA → Split X/y → 80:20 Train-Test split → Train with LinearRegression() → Predict → Compare actual vs predicted.
- sklearn modules used: model_selection.train_test_split · linear_model.LinearRegression.
- Case-study dataset: USA_Housing.csv (5000 rows · no missing values · 80:20 → 4000 train / 1000 test).