Introduction — What Will We Learn?
- Review the basics of the NumPy and Pandas libraries — arrays, Series, DataFrames and essential functions.
- Efficiently import and export data between CSV files and Pandas DataFrames.
- Detect and handle missing values in a dataset.
- (Advanced) Implement a Linear Regression algorithm including data preparation and model training.
Prerequisite: Foundational understanding of Python from Class XI and familiarity with basic programming.
1.1 Python Libraries
In data science and analytics, two libraries stand out — NumPy and Pandas. They form the backbone of data manipulation and analysis in Python, enabling users to handle large datasets with ease and precision.
1.2 NumPy — Numerical Python
In NumPy, the number of dimensions of the array is called the rank of the array.
Quick NumPy Recap (from Class XI)
- Creation: np.array([1, 2, 3, 4]), np.zeros(5), np.ones((2,3)), np.arange(10), np.linspace(0,1,5).
- Shape & rank: arr.shape, arr.ndim, arr.size, arr.dtype.
- Arithmetic: element-wise + − * / between arrays of same shape.
- Statistics: np.mean, np.median, np.std, np.var, np.min, np.max, np.sum.
- Indexing / slicing: arr[0], arr[1:4], arr[:, 0] (column 0 of a 2-D array).
- NaN: np.nan represents a missing numeric value (the older spelling np.NaN was removed in NumPy 2.0).
import numpy as np

a = np.array([90, 100, 110, 120])
print(a.shape)   # (4,)
print(a.mean())  # 105.0
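The recap items above can be tried together on a small 2-D array. This is a minimal sketch; the `marks` and `bonus` arrays are made-up values used only for illustration.

```python
import numpy as np

# Hypothetical marks table: 2 rows (students) x 3 columns (subjects)
marks = np.array([[90, 80, 70],
                  [60, 50, 40]])

print(marks.ndim)    # 2 -> the rank (number of dimensions)
print(marks.shape)   # (2, 3)
print(marks.size)    # 6 elements in total

bonus = np.ones((2, 3), dtype=int) * 5   # same shape as marks
boosted = marks + bonus                  # element-wise addition

print(boosted[0])        # first row: [95 85 75]
print(boosted[:, 0])     # column 0: [95 65]
print(np.mean(marks))    # 65.0
```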
1.3 Pandas — Why and Where We Use It
Suppose we have a dataset of marketing campaigns — campaign type, budget, duration, reach, engagement metrics and sales performance. Pandas is used to:
- Load the dataset.
- Display summary statistics.
- Perform group-wise analysis to understand campaign performance.
- Pair with Matplotlib for visualising sales and average engagement per campaign type.
This capability is invaluable in AI and data-driven decision-making — it lets businesses gain actionable insights from their data.
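As a rough sketch of the first three steps, the snippet below builds a tiny campaign table in memory instead of loading a file. The column names and all the numbers are invented for illustration and do not come from a real dataset.

```python
import pandas as pd

# Hypothetical campaign data, invented for illustration only
campaigns = pd.DataFrame({
    "campaign_type": ["Email", "Social", "Email", "Social", "TV"],
    "budget":        [1000, 1500, 1200, 1800, 5000],
    "sales":         [200, 340, 260, 400, 900],
})

print(campaigns.describe())   # summary statistics for the numeric columns

# Group-wise analysis: total sales and mean budget per campaign type
summary = campaigns.groupby("campaign_type").agg(
    total_sales=("sales", "sum"),
    mean_budget=("budget", "mean"),
)
print(summary)
```

The `summary` table has one row per campaign type, which is the shape a bar chart in Matplotlib would usually expect.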
1.4 Pandas Data Structures — Series & DataFrame
Pandas provides two main data structures for manipulating data:
- Series — a one-dimensional labelled array (a single column of data with an index).
- DataFrame — a two-dimensional labelled table (rows + columns, like a spreadsheet).
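The relationship between the two structures can be seen in a short sketch: a single column taken from a DataFrame is itself a Series. The `City` values here are hypothetical.

```python
import pandas as pd

ages = pd.Series([37, 30], index=['Varun', 'Ganesh'])        # 1-D, labelled
people = pd.DataFrame({'Age': [37, 30], 'City': ['Pune', 'Kochi']},
                      index=['Varun', 'Ganesh'])             # 2-D, labelled

print(ages['Ganesh'])           # 30: lookup by index label
print(type(people['Age']))      # one column of a DataFrame is a Series
print(ages.ndim, people.ndim)   # 1 2
```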
1. Creating a Series from Scalar Values
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s)
2. Creating a DataFrame from NumPy Arrays
import numpy as np
import pandas as pd

array1 = np.array([90, 100, 110, 120])
array2 = np.array([50, 60, 70])          # one value fewer than the others
array3 = np.array([10, 20, 30, 40])

# Each array becomes one row of the DataFrame
marksDF = pd.DataFrame([array1, array2, array3], columns=['A', 'B', 'C', 'D'])
print(marksDF)
3. Creating a DataFrame from a Dictionary of Arrays / Lists
import pandas as pd

data = {'Name': ['Varun', 'Ganesh', 'Joseph', 'Abdul', 'Reena'],
        'Age' : [37, 30, 38, 39, 40]}
df = pd.DataFrame(data)
print(df)
The dictionary keys become column labels and the lists become the column data.
4. Creating a DataFrame from a List of Dictionaries
listDict = [{'a':10, 'b':20},
{'a':5, 'b':10, 'c':20}]
a = pd.DataFrame(listDict)
print(a)
There will be as many rows as dictionaries in the list. Missing keys become NaN.
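A quick way to confirm this behaviour is to inspect the shape and the missing-value mask of the result. The sketch below reuses the same two dictionaries; the variable names `list_dict` and `frame` are chosen here for illustration.

```python
import pandas as pd

list_dict = [{'a': 10, 'b': 20},
             {'a': 5, 'b': 10, 'c': 20}]
frame = pd.DataFrame(list_dict)

print(frame.shape)           # (2, 3): two dictionaries, three distinct keys
print(frame['c'].isnull())   # True for row 0, which had no 'c' key
print(frame.dtypes)          # column 'c' is float64 because NaN is a float
```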
1.5 Dealing with Rows and Columns
1. Adding a New Column to a DataFrame
ResultSheet = {
'Rajat' : pd.Series([90,91,97], index=['Maths','Science','Hindi']),
'Amrita' : pd.Series([92,81,96], index=['Maths','Science','Hindi']),
'Meenakshi': pd.Series([89,91,88], index=['Maths','Science','Hindi']),
'Rose' : pd.Series([81,71,67], index=['Maths','Science','Hindi']),
'Karthika' : pd.Series([94,95,99], index=['Maths','Science','Hindi'])
}
Result = pd.DataFrame(ResultSheet)
Result['Fathima'] = [89, 78, 76] # new column
print(Result)
2. Adding a New Row Using .loc[]
Result.loc['English'] = [90, 92, 89, 80, 90, 88]
print(Result)
3. Updating a Row with .loc[]
Result.loc['Science'] = [92, 84, 90, 72, 96, 88]   # change Science marks
print(Result)
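Adding and updating use the same .loc[] syntax; what happens depends only on whether the row label already exists. A minimal sketch on a smaller, hypothetical sheet:

```python
import pandas as pd

# Hypothetical two-student sheet: rows are subjects, columns are students
sheet = pd.DataFrame({'Rajat': [90, 91], 'Amrita': [92, 81]},
                     index=['Maths', 'Science'])

sheet.loc['English'] = [88, 79]   # label not present yet: a new row is appended
sheet.loc['Science'] = [95, 85]   # label already present: the row is overwritten

print(sheet.index.tolist())            # ['Maths', 'Science', 'English']
print(sheet.loc['Science'].tolist())   # [95, 85]
```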
1.6 Deleting Rows or Columns — .drop()
- axis = 0 → delete a row.
- axis = 1 → delete a column.
Result = Result.drop('Hindi', axis=0)                              # delete row 'Hindi'
Result = Result.drop(['Rajat', 'Meenakshi', 'Karthika'], axis=1)  # delete columns
print(Result)
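Note that .drop() returns a new DataFrame and leaves the original untouched, which is why the result is assigned back to Result above. The sketch below shows this on a small hypothetical sheet, along with the equivalent index= and columns= keywords.

```python
import pandas as pd

# Small hypothetical result sheet: rows are subjects, columns are students
result = pd.DataFrame({'Rajat': [90, 91, 97], 'Amrita': [92, 81, 96]},
                      index=['Maths', 'Science', 'Hindi'])

without_hindi = result.drop('Hindi', axis=0)   # returns a new DataFrame
print(result.shape)          # (3, 2): the original is unchanged
print(without_hindi.shape)   # (2, 2)

# Keyword form of the same idea: index= for rows, columns= for columns
only_amrita = result.drop(index='Hindi', columns='Rajat')
print(only_amrita)
```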
1.7 Attributes of DataFrames
import pandas as pd

labels = ["Data 1", "Data 2", "Data 3", "Data 4"]
# Named student_data rather than dict, so the built-in dict is not shadowed
student_data = {
    "Student": pd.Series(["Arnav", "Neha", "Priya", "Rahul"], index=labels),
    "Marks"  : pd.Series([85, 92, 78, 83], index=labels),
    "Sports" : pd.Series(["Cricket", "Volleyball", "Hockey", "Badminton"], index=labels)
}
df = pd.DataFrame(student_data)
print(df)
| Attribute / method | Returns |
|---|---|
| df.index | Row labels (Index object). |
| df.columns | Column labels. |
| df.shape | (rows, cols) tuple, e.g. (4, 3). |
| df.head(n) | First n rows (default 5). Called with parentheses because it is a method. |
| df.tail(n) | Last n rows (default 5). Also a method. |
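Each entry in the table can be checked on a small frame. This sketch uses a two-column version of the student data; the variable name `students` is chosen here for illustration.

```python
import pandas as pd

students = pd.DataFrame(
    {"Student": ["Arnav", "Neha", "Priya", "Rahul"],
     "Marks": [85, 92, 78, 83]},
    index=["Data 1", "Data 2", "Data 3", "Data 4"],
)

print(list(students.index))     # ['Data 1', 'Data 2', 'Data 3', 'Data 4']
print(list(students.columns))   # ['Student', 'Marks']
print(students.shape)           # (4, 2)
print(students.head(2))         # first two rows
print(students.tail(1))         # last row only
```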
2.1 What Is a CSV File?
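A CSV (Comma-Separated Values) file is a plain-text file that stores tabular data: one row per line, with the values in each row separated by commas. Pandas can read a CSV file directly into a DataFrame, which makes CSV a common exchange format in data analysis.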
2.2 Importing a CSV File to a DataFrame — read_csv()
Use pd.read_csv("filename.csv") to import tabular data from a CSV into a Pandas DataFrame.
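To see what read_csv() does without needing a file on disk, the sketch below feeds it CSV text through io.StringIO, which stands in for an open file. The CSV content is hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV content; a real file on disk would contain the same text
csv_text = """Name,Maths,Science
Heena,90,92
Shefali,91,81
"""

df = pd.read_csv(io.StringIO(csv_text))   # first line becomes the column headers
print(df.columns.tolist())   # ['Name', 'Maths', 'Science']
print(df.shape)              # (2, 3)
print(df)
```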
1. Using Google Colab
- Open Google Colab at colab.research.google.com → File → New notebook.
- Click the Folder icon on the left sidebar.
- Click the Upload button → select the CSV file (e.g., studentsmarks.csv).
- Execute the code:
import pandas as pd

df = pd.read_csv("studentsmarks.csv")
print(df)
2. Using a Local Python IDE
Give the complete path of the CSV file:
import pandas as pd

# sep="," and header=0 are the defaults; they are written out here for clarity
df = pd.read_csv('C:/PANDAS/studentsmarks.csv', sep=",", header=0)
print(df)
2.3 Exporting a DataFrame to a CSV File — to_csv()
Use df.to_csv() to save a DataFrame to a text or CSV file.
On Local Python IDE
df.to_csv(path_or_buf='C:/PANDAS/resultout.csv', sep=',')
This creates resultout.csv on the hard disk. Opening it in any text editor or spreadsheet shows the data with row labels and column headers separated by commas.
On Google Colab
df.to_csv("resultout.csv", index=False)
The index=False argument prevents the row index from being written into the CSV.
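The effect of index=False is easiest to see by comparing the two outputs side by side. In this sketch to_csv() is called without a path, in which case it returns the CSV text as a string instead of writing a file; the small DataFrame is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'Maths': [90, 91]}, index=['Heena', 'Shefali'])

with_index = df.to_csv()                 # no path given: returns the CSV text
without_index = df.to_csv(index=False)   # row labels are left out

print(with_index)
print(without_index)
```

With the index included, each line starts with the row label (Heena, Shefali); with index=False only the Maths values remain.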
2.4 Handling Missing Values
The two most common strategies to handle missing values are:
- Drop the row having missing values.
- Estimate (fill) the missing value.
1. Checking Missing Values — isnull()
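Pandas provides the isnull() method to check for missing values. Called on a DataFrame or Series, it returns an object of the same shape that holds True wherever a value is missing and False everywhere else. Chaining .any() onto a column's result tells you whether that column has at least one missing value.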
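A minimal sketch of isnull() on a two-row, hypothetical table, together with the .any() and .sum() summaries that are commonly chained onto it:

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({'Maths': [90, 65], 'Hindi': [81, np.nan]},
                      index=['Heena', 'Suhana'])

mask = scores.isnull()   # same shape as scores, one boolean per cell
print(mask)
print(mask.any())        # per column: does it contain any NaN?
print(mask.sum())        # per column: how many NaN values?
```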
2. Drop Missing Values — dropna()
Dropping removes the entire row (object) that has missing value(s). This reduces the size of the dataset, so use it only when missing values appear on few rows.
drop = marks.dropna()
print(drop)
3. Estimate the Missing Value — fillna()
Missing values can be filled with estimations — the value just before/after, the average/minimum/maximum of that attribute, or simply 0 or 1.
FillZero = marks.fillna(0)   # replace missing with 0
print(FillZero)
FillOne = marks.fillna(1)    # replace missing with 1
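The other estimates mentioned above (the value just before or after, or a column statistic) can be sketched as follows. The `science` Series is hypothetical, and ffill() / bfill() are one possible way to take the neighbouring value.

```python
import numpy as np
import pandas as pd

# Hypothetical marks with one missing entry
science = pd.Series([92, 80, np.nan, 86],
                    index=['Heena', 'Shefali', 'Meera', 'Joseph'])

filled_prev = science.ffill()                  # copy the value just before
filled_next = science.bfill()                  # copy the value just after
filled_mean = science.fillna(science.mean())   # mean() skips NaN by default
filled_min = science.fillna(science.min())     # smallest known value

print(filled_prev['Meera'])   # 80.0
print(filled_next['Meera'])   # 86.0
print(filled_mean['Meera'])   # 86.0
print(filled_min['Meera'])    # 80.0
```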
2.5 Case Study — Student Marks with Missing Values
Meera missed the Science exam and Suhana missed the Hindi exam because of fever. Joseph was attending a national-level science exhibition on the day of the AI exam. Their marks for those subjects are therefore missing.
import pandas as pd
import numpy as np

students = ['Heena', 'Shefali', 'Meera', 'Joseph', 'Suhana', 'Bismeet']
ResultSheet = {
    'Maths'  : pd.Series([90, 91, 97, 89, 65, 93], index=students),
    'Science': pd.Series([92, 81, np.nan, 87, 50, 88], index=students),
    'English': pd.Series([89, 91, 88, 78, 77, 82], index=students),
    'Hindi'  : pd.Series([81, 71, 67, 82, np.nan, 89], index=students),
    'AI'     : pd.Series([94, 95, 99, np.nan, 96, 99], index=students)
}
marks = pd.DataFrame(ResultSheet)   # np.nan marks each missing score
print(marks)
Check for Missing Values
>>> print(marks.isnull())   # True/False table; the three True cells are the three missing marks
>>> print(marks['Science'].isnull().any())
True
>>> print(marks['Maths'].isnull().any())
False
>>> print(marks.isnull().sum().sum())   # total number of NaN values in the whole dataset
3
Apply dropna() and fillna()
# Option 1: drop rows with NaN
drop = marks.dropna()
print(drop)

# Option 2: fill NaN with 0
FillZero = marks.fillna(0)
print(FillZero)
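To make the trade-off concrete, the sketch below rebuilds the same marks table and shows which students survive dropna(). It also tries filling each column with its own mean, which is an alternative to fillna(0) offered here as an illustration rather than as the case study's prescribed method.

```python
import numpy as np
import pandas as pd

students = ['Heena', 'Shefali', 'Meera', 'Joseph', 'Suhana', 'Bismeet']
marks = pd.DataFrame({
    'Maths':   [90, 91, 97, 89, 65, 93],
    'Science': [92, 81, np.nan, 87, 50, 88],
    'English': [89, 91, 88, 78, 77, 82],
    'Hindi':   [81, 71, 67, 82, np.nan, 89],
    'AI':      [94, 95, 99, np.nan, 96, 99],
}, index=students)

print(marks.isnull().sum())            # one NaN each in Science, Hindi and AI
print(marks.dropna().index.tolist())   # ['Heena', 'Shefali', 'Bismeet']

# Alternative to fillna(0): fill each column with that column's mean
fill_mean = marks.fillna(marks.mean())
print(fill_mean.loc['Meera', 'Science'])   # mean of the five known Science marks
```

Dropping rows loses half of the students here (three of six), while filling with 0 would drag down their averages; that is the reason an estimate such as the column mean is often considered.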
3.1 Linear Regression — What & Why
Typical Linear Regression Workflow
- Load the dataset (with Pandas).
- Perform Exploratory Data Analysis (EDA) — head, info, describe, plots.
- Split data into features (X) and target (y).
- Split into training set (80%) and testing set (20%).
- Train the model using sklearn.linear_model.LinearRegression.
- Predict on the test set.
- Evaluate actual vs predicted values.
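The workflow above can be sketched end to end on synthetic data. Everything about the data here is an assumption made for illustration: the column names x1, x2 and y, and the rule y = 3*x1 + 2*x2 + 5 plus a little noise, stand in for a real dataset loaded with Pandas.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset: y = 3*x1 + 2*x2 + 5 plus small noise
rng = np.random.default_rng(0)
data = pd.DataFrame({'x1': rng.uniform(0, 10, 500),
                     'x2': rng.uniform(0, 10, 500)})
data['y'] = 3 * data['x1'] + 2 * data['x2'] + 5 + rng.normal(0, 0.1, 500)

X = data[['x1', 'x2']]   # features
y = data['y']            # target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)   # 80:20 split

model = LinearRegression()
model.fit(X_train, y_train)            # train on the 80%
predictions = model.predict(X_test)    # predict on the held-out 20%

print(len(X_train), len(X_test))       # 400 100
print(model.coef_, model.intercept_)   # close to [3, 2] and 5
```

Because the data were generated from a known rule, the learned coefficients can be compared directly with the true ones, which is not possible with a real dataset.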
3.2 Practical Activity — USA Housing Dataset
Dataset (USA Housing) available from the CBSE Handbook's Google-Drive link.
1. Load & Peek
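import pandas as pd

df = pd.read_csv('USA_Housing.csv')
print(df.head())       # first five rows
df.info()              # column names, non-null counts and dtypes
print(df.describe())   # count, mean, std, min, quartiles and max per numeric column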
Examining the count reveals that all columns contain 5000 values → no missing values anywhere.
2. Exploratory Data Analysis
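Use df.corr() to compute pairwise correlations, and Matplotlib/Seaborn to draw scatter plots of each feature against the target (Price). A strong linear trend suggests that the feature is a good predictor. If the DataFrame also holds a text column (the widely shared copy of USA_Housing.csv includes an Address column), call df.corr(numeric_only=True), because recent pandas versions raise an error when asked to correlate non-numeric columns.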
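A minimal sketch of the correlation step on synthetic data. The `toy` table, its column names and the relationship between Income and Price are all invented for illustration; select_dtypes('number') is used here as one way to leave out text columns before calling corr().

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
income = rng.uniform(40000, 100000, 200)

# Hypothetical table: Price depends strongly on Income, Address is plain text
toy = pd.DataFrame({
    'Income': income,
    'Price': 20 * income + rng.normal(0, 50000, 200),
    'Address': ['house %d' % i for i in range(200)],
})

numeric = toy.select_dtypes('number')   # keeps Income and Price only
corr_with_price = numeric.corr()['Price'].sort_values(ascending=False)
print(corr_with_price)                  # Income correlates strongly with Price
```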
3. Train–Test Split (80 : 20)
from sklearn.model_selection import train_test_split

X = df[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
        'Avg. Area Number of Bedrooms', 'Area Population']]
y = df['Price']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# 4000 rows for training (80% of 5000), 1000 rows for testing
4. Apply Linear Regression
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
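The intercept and coefficients are all the model needs to make a prediction: prediction = intercept + sum of (coefficient × feature value). The sketch below checks this on a tiny, made-up dataset that follows y = 2*x1 + 3*x2 + 1 exactly; the names `new_point`, `by_model` and `by_hand` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny hypothetical dataset that follows y = 2*x1 + 3*x2 + 1 exactly
X = np.array([[1, 1], [2, 1], [3, 2], [4, 3], [5, 5]])
y = 2 * X[:, 0] + 3 * X[:, 1] + 1

model = LinearRegression().fit(X, y)

new_point = np.array([[6, 4]])
by_model = model.predict(new_point)[0]
by_hand = model.intercept_ + np.dot(model.coef_, new_point[0])

print(model.intercept_, model.coef_)   # approximately 1 and [2, 3]
print(by_model, by_hand)               # both approximately 25 (2*6 + 3*4 + 1)
```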
5. Predict & Compare
import matplotlib.pyplot as plt

predictions = model.predict(X_test)

plt.scatter(y_test, predictions)
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Actual vs Predicted')
plt.show()
We observe there is a difference between actual and predicted values. The next chapter (Data Science Methodology · Unit 2) covers how to calculate the error, evaluate the model and test its accuracy.
- Primary data structure in Pandas → Series (not List/Tuple/Matrix).
- fillna(0) → fills missing values with zeros.
- Library typically used for importing/managing data for Linear Regression → Pandas.
- Read CSV → pd.read_csv("filename.csv").
- df.shape → number of rows and columns in the DataFrame.
- Export DataFrame → to_csv().
Quick Revision — Key Points to Remember
- Python libraries = pre-written toolkits so we don't code from scratch.
- NumPy (Numerical Python) = numerical computing + array-processing. Rank = number of dimensions.
- Pandas = data manipulation & analysis — vital in AI / data-driven decisions.
- 2 Pandas data structures: Series (1-D) · DataFrame (2-D).
- Create DataFrame from: NumPy arrays · Dictionary of lists · List of dictionaries.
- Add column: df['NewCol'] = [...]. Add row: df.loc['RowLabel'] = [...].
- Delete: df.drop(label, axis=0) row · axis=1 column.
- Attributes: df.index · df.columns · df.shape · df.head(n) · df.tail(n).
- CSV = Comma-Separated Values — simple tabular text files.
- Import CSV: pd.read_csv("file.csv").
- Export CSV: df.to_csv("file.csv", index=False).
- Handling missing values — 2 strategies: Drop the row · Estimate (fill) the value.
- isnull() → True/False for missing. dropna() → removes rows with NaN. fillna(num) → replaces NaN with num.
- np.nan represents a missing numeric value.
- Linear Regression (advanced) predicts continuous values using y = mx + c.
- LR workflow: Load data → EDA → Split X/y → 80:20 Train-Test split → Train with LinearRegression() → Predict → Compare actual vs predicted.
- sklearn modules used: model_selection.train_test_split · linear_model.LinearRegression.
- Case-study dataset: USA_Housing.csv (5000 rows · no missing values · 80:20 → 4000 train / 1000 test).