PART B ▪ UNIT 1 · Python Programming – II

NumPy · Pandas · CSV I/O · Missing Values · Linear Regression (Practical only)

Python Programming II is a practical-only unit. It recaps NumPy and Pandas (Class XI foundations), then moves on to CSV import/export, handling missing values, and (for advanced learners) the Linear Regression algorithm — all preparing students for the Capstone Project and Data Science Methodology (Unit 2).

Introduction — What Will We Learn?

Review the basics of the NumPy and Pandas libraries — arrays, Series, DataFrames and essential functions.
Efficiently import and export data between CSV files and Pandas DataFrames.
Detect and handle missing values in a dataset.
(Advanced) Implement a Linear Regression algorithm including data preparation and model training.

Prerequisite: Foundational understanding of Python from Class XI and familiarity with basic programming.

Learning Outcome 1: Apply the fundamental concepts of NumPy and Pandas libraries to perform data manipulation and analysis tasks

1.1 Python Libraries

Python Library — a collection of pre-written code we can use to perform common tasks. Libraries are toolkits that provide functions and methods, so we avoid writing code from scratch.

In data science and analytics, two libraries stand out — NumPy and Pandas. They form the backbone of data manipulation and analysis in Python, enabling users to handle large datasets with ease and precision.

1.2 NumPy — Numerical Python

NumPy (Numerical Python) — a powerful library used for numerical computing. It is a general-purpose array-processing package.

In NumPy, the number of dimensions of the array is called the rank of the array.

Quick NumPy Recap (from Class XI)

Creation: np.array([1, 2, 3, 4]), np.zeros(5), np.ones((2,3)), np.arange(10), np.linspace(0,1,5).
Shape & rank: arr.shape, arr.ndim, arr.size, arr.dtype.
Arithmetic: element-wise + − * / between arrays of same shape.
Statistics: np.mean, np.median, np.std, np.var, np.min, np.max, np.sum.
Indexing / slicing: arr[0], arr[1:4], arr[:, 0] (column 0 of a 2-D array).
NaN: np.nan represents a missing numeric value (the older alias np.NaN was removed in NumPy 2.0).

import numpy as np
a = np.array([90, 100, 110, 120])
print(a.shape)   # (4,)
print(a.mean())  # 105.0
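
The recap items above can be tried together in one short sketch. The array values and the names marks, bonus and scores are illustrative assumptions, not taken from the handbook.

```python
import numpy as np

# A 2-D array: 2 rows x 3 columns (values are illustrative).
marks = np.array([[90, 100, 110],
                  [50,  60,  70]])

print(marks.ndim)    # 2  -> rank (number of dimensions)
print(marks.shape)   # (2, 3)
print(marks.size)    # 6  -> total number of elements
print(marks[:, 0])   # [90 50] -> column 0

# Element-wise arithmetic between arrays of the same shape.
bonus = np.array([[5, 5, 5],
                  [10, 10, 10]])
total = marks + bonus
print(total)

# np.nan marks a missing value; nanmean skips it when averaging.
scores = np.array([80.0, np.nan, 100.0])
print(np.nanmean(scores))   # 90.0
```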

1.3 Pandas — Why and Where We Use It

Suppose we have a dataset of marketing campaigns — campaign type, budget, duration, reach, engagement metrics and sales performance. Pandas is used to:

Load the dataset.
Display summary statistics.
Perform group-wise analysis to understand campaign performance.
Pair with Matplotlib for visualising sales and average engagement per campaign type.

This capability is invaluable in AI and data-driven decision-making — it lets businesses gain actionable insights from their data.
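
A minimal sketch of the summary and group-wise steps listed above. The column names (campaign_type, sales, engagement) and all values are hypothetical; a real campaign dataset would normally be loaded from a file.

```python
import pandas as pd

# Hypothetical campaign data; in practice this would come from a CSV file.
campaigns = pd.DataFrame({
    "campaign_type": ["Email", "Social", "Email", "Social"],
    "sales":         [100, 200, 300, 400],
    "engagement":    [10, 20, 30, 40],
})

# Summary statistics for the numeric columns.
print(campaigns.describe())

# Group-wise analysis: average sales and engagement per campaign type.
summary = campaigns.groupby("campaign_type")[["sales", "engagement"]].mean()
print(summary)
```

Here summary has one row per campaign type, which is the shape a bar chart in Matplotlib would typically expect.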

1.4 Pandas Data Structures — Series & DataFrame

Pandas provides two main data structures for manipulating data:

Series — a one-dimensional labelled array (a single column of data with an index).
DataFrame — a two-dimensional labelled table (rows + columns, like a spreadsheet).

1. Creating a Series from Scalar Values

import pandas as pd
s = pd.Series([10, 20, 30, 40], index=['a','b','c','d'])
print(s)

2. Creating a DataFrame from NumPy Arrays

import numpy as np
import pandas as pd
array1 = np.array([90, 100, 110, 120])
array2 = np.array([50, 60, 70])          # only 3 values, so column D is NaN for this row
array3 = np.array([10, 20, 30, 40])
marksDF = pd.DataFrame([array1, array2, array3], columns=['A','B','C','D'])
print(marksDF)

3. Creating a DataFrame from a Dictionary of Arrays / Lists

import pandas as pd
data = {'Name':['Varun','Ganesh','Joseph','Abdul','Reena'],
        'Age' :[37, 30, 38, 39, 40]}
df = pd.DataFrame(data)
print(df)

The dictionary keys become column labels and the lists become the column data.

4. Creating a DataFrame from a List of Dictionaries

listDict = [{'a':10, 'b':20},
            {'a':5,  'b':10, 'c':20}]
a = pd.DataFrame(listDict)
print(a)

There will be as many rows as dictionaries in the list. Missing keys become NaN.

1.5 Dealing with Rows and Columns

1. Adding a New Column to a DataFrame

ResultSheet = {
  'Rajat'    : pd.Series([90,91,97], index=['Maths','Science','Hindi']),
  'Amrita'   : pd.Series([92,81,96], index=['Maths','Science','Hindi']),
  'Meenakshi': pd.Series([89,91,88], index=['Maths','Science','Hindi']),
  'Rose'     : pd.Series([81,71,67], index=['Maths','Science','Hindi']),
  'Karthika' : pd.Series([94,95,99], index=['Maths','Science','Hindi'])
}
Result = pd.DataFrame(ResultSheet)
Result['Fathima'] = [89, 78, 76]       # new column
print(Result)

2. Adding a New Row Using .loc[]

Result.loc['English'] = [90, 92, 89, 80, 90, 88]
print(Result)

3. Updating a Row with .loc[]

Result.loc['Science'] = [92, 84, 90, 72, 96, 88]   # change Science marks
print(Result)

1.6 Deleting Rows or Columns — .drop()

axis = 0 → delete a row.
axis = 1 → delete a column.

Result = Result.drop('Hindi', axis=0)                       # delete row 'Hindi'
Result = Result.drop(['Rajat','Meenakshi','Karthika'], axis=1)  # delete columns
print(Result)
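
Note that drop() returns a new DataFrame and leaves the original untouched, which is why the code above assigns the result back to Result. The small table below is an illustrative assumption used only to show this behaviour.

```python
import pandas as pd

# Small illustrative table: subjects as rows, students as columns.
result = pd.DataFrame(
    {"Rajat": [90, 91, 97], "Amrita": [92, 81, 96]},
    index=["Maths", "Science", "Hindi"],
)

without_hindi = result.drop("Hindi", axis=0)   # new DataFrame without the row
without_rajat = result.drop("Rajat", axis=1)   # new DataFrame without the column

print(result.shape)          # (3, 2) -> original is unchanged
print(without_hindi.shape)   # (2, 2)
print(without_rajat.shape)   # (3, 1)
```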

1.7 Attributes of DataFrames

import pandas as pd
# Named student_data rather than dict, so the built-in dict type is not shadowed.
student_data = {
  "Student": pd.Series(["Arnav","Neha","Priya","Rahul"],
                       index=["Data 1","Data 2","Data 3","Data 4"]),
  "Marks"  : pd.Series([85, 92, 78, 83],
                       index=["Data 1","Data 2","Data 3","Data 4"]),
  "Sports" : pd.Series(["Cricket","Volleyball","Hockey","Badminton"],
                       index=["Data 1","Data 2","Data 3","Data 4"])
}
df = pd.DataFrame(student_data)
print(df)

Attribute / Method	Returns
df.index	Row labels (Index object).
df.columns	Column labels.
df.shape	(rows, cols) tuple, e.g. (4, 3).
df.head(n)	Method: first n rows (default 5).
df.tail(n)	Method: last n rows (default 5).
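
A short sketch of these attributes and methods in use. It rebuilds a two-column version of the student table so that it runs on its own; the reduced column set is an assumption made for brevity.

```python
import pandas as pd

df = pd.DataFrame(
    {"Student": ["Arnav", "Neha", "Priya", "Rahul"],
     "Marks": [85, 92, 78, 83]},
    index=["Data 1", "Data 2", "Data 3", "Data 4"],
)

print(list(df.index))     # ['Data 1', 'Data 2', 'Data 3', 'Data 4']
print(list(df.columns))   # ['Student', 'Marks']
print(df.shape)           # (4, 2)
print(df.head(2))         # first two rows
print(df.tail(1))         # last row
```

index, columns and shape are read without parentheses, while head() and tail() are called like functions.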

Learning Outcome 2: Import and export data between CSV files and Pandas DataFrames, ensuring data integrity and consistency

2.1 What Is a CSV File?

CSV (Comma-Separated Values) — a simple text file used to store tabular data. Each line = one row; each value in a row is separated by a comma. Easy to read/write for humans and computers.

In Python, CSV files are fundamental for data analysis. Pandas provides powerful tools to read CSV files into DataFrames — the go-to format for data scientists.
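
To see what "one line per row, values separated by commas" looks like, the sketch below reads CSV text held in a string instead of a file. The sample content is an illustrative assumption; io.StringIO is only used so that the example needs no file on disk.

```python
import io
import pandas as pd

# CSV content as plain text: a header line, then one line per row.
csv_text = """Name,Maths,Science
Heena,90,92
Shefali,91,81
"""

# io.StringIO lets read_csv treat the string like an open file.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
print(df.shape)   # (2, 3)
```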

2.2 Importing a CSV File to a DataFrame — read_csv()

Use pd.read_csv("filename.csv") to import tabular data from a CSV into a Pandas DataFrame.

1. Using Google Colab

Open Google Colab at colab.research.google.com → File → New notebook.
Click the Folder icon on the left sidebar.
Click the Upload button → select the CSV file (e.g., studentsmarks.csv).

Execute the code:

import pandas as pd
df = pd.read_csv("studentsmarks.csv")
print(df)

2. Using a Local Python IDE

Give the complete path of the CSV file:

import pandas as pd
# sep="," and header=0 are the defaults; they are written out here for clarity.
df = pd.read_csv('C:/PANDAS/studentsmarks.csv', sep=",", header=0)
print(df)

2.3 Exporting a DataFrame to a CSV File — to_csv()

Use df.to_csv() to save a DataFrame to a text or CSV file.

On Local Python IDE

df.to_csv(path_or_buf='C:/PANDAS/resultout.csv', sep=',')

This creates resultout.csv on the hard disk. Opening it in any text editor or spreadsheet shows the data with row labels and column headers separated by commas.

On Google Colab

df.to_csv("resultout.csv", index=False)

The index=False argument prevents the row index from being written into the CSV.
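
The effect of index=False can be checked without writing a file, because to_csv() returns the CSV text when no path is given. The tiny DataFrame here is an illustrative assumption.

```python
import pandas as pd

df = pd.DataFrame({"Maths": [90, 91]}, index=["Heena", "Shefali"])

# With no path given, to_csv() returns the CSV text instead of writing a file.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index)      # first column holds the row labels
print(without_index)   # row labels are left out
```

If the row labels carry meaning (student names here), keeping the index is usually the safer choice; index=False suits the default 0, 1, 2, ... labels.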

2.4 Handling Missing Values

The two most common strategies to handle missing values are:

Drop the row having missing values.
Estimate (fill) the missing value.

1. Checking Missing Values — isnull()

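Pandas provides the isnull() method to check for missing values. It returns an object of the same shape as the data, containing True wherever a value is missing and False elsewhere. Chaining .any() onto a column tells us whether that column has any missing value at all.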

2. Drop Missing Values — dropna()

Dropping removes every row (object) that contains at least one missing value. This reduces the size of the dataset, so use it only when missing values appear in just a few rows.

drop = marks.dropna()
print(drop)

3. Estimate the Missing Value — fillna()

Missing values can be filled with estimations — the value just before/after, the average/minimum/maximum of that attribute, or simply 0 or 1.

FillZero = marks.fillna(0)   # replace missing with 0
print(FillZero)
FillOne  = marks.fillna(1)   # replace missing with 1
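
The other estimates mentioned above (value just before or after, average, minimum, maximum) might be sketched as follows. The Series s and its values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative column with one missing value.
s = pd.Series([10.0, np.nan, 30.0, 40.0])

print(s.ffill())             # value just before -> 10.0
print(s.bfill())             # value just after  -> 30.0
print(s.fillna(s.mean()))    # average of the known values
print(s.fillna(s.min()))     # minimum -> 10.0
print(s.fillna(s.max()))     # maximum -> 40.0
```

mean(), min() and max() skip NaN by default, so the estimate is computed from the known values only.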

2.5 Case Study — Student Marks with Missing Values

Meera and Suhana had fever and could not attend the Science and Hindi exams respectively. Joseph attended a national-level science exhibition on the day of the AI exam. Their marks in those subjects are therefore missing.

import pandas as pd
import numpy as np

ResultSheet = {
  'Maths'  : pd.Series([90,91,97,89,65,93],
                       index=['Heena','Shefali','Meera','Joseph','Suhana','Bismeet']),
  'Science': pd.Series([92,81,np.nan,87,50,88],
                       index=['Heena','Shefali','Meera','Joseph','Suhana','Bismeet']),
  'English': pd.Series([89,91,88,78,77,82],
                       index=['Heena','Shefali','Meera','Joseph','Suhana','Bismeet']),
  'Hindi'  : pd.Series([81,71,67,82,np.nan,89],
                       index=['Heena','Shefali','Meera','Joseph','Suhana','Bismeet']),
  'AI'     : pd.Series([94,95,99,np.nan,96,99],
                       index=['Heena','Shefali','Meera','Joseph','Suhana','Bismeet'])
}
marks = pd.DataFrame(ResultSheet)
print(marks)

Check for Missing Values

>>> print(marks.isnull())
# shows True/False table; three True values → three missing pieces
>>> print(marks['Science'].isnull().any())
True
>>> print(marks['Maths'].isnull().any())
False
>>> print(marks.isnull().sum().sum())    # total NaN in whole dataset
3
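
Beyond the overall total, it can help to see which subjects and which students are affected. The sketch below rebuilds the same marks table in a compact form (plain lists with a shared index, an assumption made only to keep it short) and counts missing values per column.

```python
import numpy as np
import pandas as pd

students = ['Heena', 'Shefali', 'Meera', 'Joseph', 'Suhana', 'Bismeet']
marks = pd.DataFrame({
    'Maths':   [90, 91, 97, 89, 65, 93],
    'Science': [92, 81, np.nan, 87, 50, 88],
    'English': [89, 91, 88, 78, 77, 82],
    'Hindi':   [81, 71, 67, 82, np.nan, 89],
    'AI':      [94, 95, 99, np.nan, 96, 99],
}, index=students)

# Missing values per column (subject).
per_subject = marks.isnull().sum()
print(per_subject)

# Students (rows) that have at least one missing mark.
incomplete = marks[marks.isnull().any(axis=1)]
print(list(incomplete.index))   # ['Meera', 'Joseph', 'Suhana']

# dropna() would keep only the complete rows.
print(marks.dropna().shape)     # (3, 5)
```

Dropping here removes half of the students, which illustrates why filling is often preferred on small datasets.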

Apply dropna() and fillna()

# Option 1 — Drop rows with NaN
drop = marks.dropna()
print(drop)

# Option 2 — Fill NaN with 0
FillZero = marks.fillna(0)
print(FillZero)

Learning Outcome 3 (for Advanced Learners): Implement the Linear Regression algorithm on Google Colab or any Python IDE

3.1 Linear Regression — What & Why

Linear Regression — a supervised machine-learning algorithm that models the relationship between an independent variable (x) and a dependent variable (y) using a straight line: y = mx + c. Used to predict continuous numeric values (price, temperature, sales).
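
As a small worked example of y = mx + c, the sketch below fits a straight line to four illustrative points with NumPy. The points are chosen (as an assumption) to lie exactly on y = 2x + 1, so the fitted slope and intercept should come out close to 2 and 1.

```python
import numpy as np

# Illustrative points that lie exactly on y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Fit a degree-1 polynomial (a straight line); returns slope, then intercept.
m, c = np.polyfit(x, y, 1)
print(round(m, 2), round(c, 2))

# Predict y for a new x using y = mx + c.
x_new = 5.0
y_new = m * x_new + c
print(round(y_new, 2))
```

With real data the points rarely sit exactly on a line, so the fitted line is the one that keeps the overall squared error smallest.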

Typical Linear Regression Workflow

Load the dataset (with Pandas).
Perform Exploratory Data Analysis (EDA) — head, info, describe, plots.
Split data into features (X) and target (y).
Split into training set (80%) and testing set (20%).
Train the model using sklearn.linear_model.LinearRegression.
Predict on the test set.
Evaluate actual vs predicted values.

3.2 Practical Activity — USA Housing Dataset

Dataset (USA Housing) available from the CBSE Handbook's Google-Drive link.

1. Load & Peek

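import pandas as pd
df = pd.read_csv('USA_Housing.csv')
print(df.head())       # first five rows
df.info()              # column names, non-null counts and dtypes (prints directly)
print(df.describe())   # count, mean, std, min, max for the numeric columns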

Examining the count reveals that all columns contain 5000 values → no missing values anywhere.

2. Exploratory Data Analysis

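Use df.corr(numeric_only=True) for correlations (numeric_only=True skips any text columns, which would otherwise raise an error in recent Pandas versions), and Matplotlib/Seaborn for scatter plots between each feature and the target (Price). Strong linear trends indicate good predictors.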

3. Train–Test Split (80 : 20)

from sklearn.model_selection import train_test_split

X = df[['Avg. Area Income','Avg. Area House Age','Avg. Area Number of Rooms',
        'Avg. Area Number of Bedrooms','Area Population']]
y = df['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 4000 rows for training (80% of 5000) · 1000 rows for testing
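
The 4000 / 1000 split sizes can be confirmed on a synthetic stand-in of the same length. The arrays below are assumptions used only to check shapes; they are not the housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a 5000-row dataset (not the real housing data).
X = np.arange(10000).reshape(5000, 2)
y = np.arange(5000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)   # (4000, 2) (1000, 2)
print(y_train.shape, y_test.shape)   # (4000,) (1000,)
```

Fixing random_state makes the shuffle repeatable, so the same rows land in the test set on every run.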

4. Apply Linear Regression

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)

5. Predict & Compare

predictions = model.predict(X_test)

import matplotlib.pyplot as plt
plt.scatter(y_test, predictions)
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Actual vs Predicted')
plt.show()

We observe there is a difference between actual and predicted values. The next chapter (Data Science Methodology · Unit 2) covers how to calculate the error, evaluate the model and test its accuracy.
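
One way to inspect that difference is to place actual and predicted values side by side in a DataFrame. The sketch below does so on synthetic data; the feature names rooms and age, the formula used to generate the prices, and the name comparison are all assumptions for illustration and are not part of the housing dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption): price = 3*rooms + 2*age + random noise.
rng = np.random.default_rng(0)
X = pd.DataFrame({"rooms": rng.uniform(1, 10, 200),
                  "age":   rng.uniform(0, 50, 200)})
y = 3 * X["rooms"] + 2 * X["age"] + rng.normal(0, 1, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

# Put actual and predicted values next to each other.
comparison = pd.DataFrame({"Actual": y_test.to_numpy(),
                           "Predicted": predictions})
comparison["Difference"] = comparison["Actual"] - comparison["Predicted"]
print(comparison.head(10))
```

The same comparison table could be built from the y_test and predictions of the housing model above.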

Practical assignments for the lab:

(1) Create the student-marks DataFrame, use isnull / dropna / fillna to handle the three missing values, and save the cleaned result to resultout.csv.
(2) Load USA_Housing.csv, do an 80:20 split, train a Linear-Regression model, and print the first 10 actual vs predicted prices.
(3) A marketing CSV: group data by campaign type and calculate average sales and engagement.

Check Your Progress — quick MCQ pointers:

Primary data structure in Pandas → Series (List, Tuple and Matrix are not Pandas structures; DataFrame is the other main one).
fillna(0) → fills missing values with zeros.
Library typically used for importing/managing data for Linear Regression → Pandas.
Read CSV → pd.read_csv("filename.csv").
df.shape → number of rows and columns in the DataFrame.
Export DataFrame → to_csv().

Quick Revision — Key Points to Remember

Python libraries = pre-written toolkits so we don't code from scratch.
NumPy (Numerical Python) = numerical computing + array-processing. Rank = number of dimensions.
Pandas = data manipulation & analysis — vital in AI / data-driven decisions.
2 Pandas data structures: Series (1-D) · DataFrame (2-D).
Create DataFrame from: NumPy arrays · Dictionary of lists · List of dictionaries.
Add column: df['NewCol'] = [...]. Add row: df.loc['RowLabel'] = [...].
Delete: df.drop(label, axis=0) row · axis=1 column.
Attributes: df.index · df.columns · df.shape. Methods: df.head(n) · df.tail(n).
CSV = Comma-Separated Values — simple tabular text files.
Import CSV: pd.read_csv("file.csv").
Export CSV: df.to_csv("file.csv", index=False).
Handling missing values — 2 strategies: Drop the row · Estimate (fill) the value.
isnull() → True/False for missing. dropna() → removes rows with NaN. fillna(num) → replaces NaN with num.
np.nan represents a missing numeric value (np.NaN was removed in NumPy 2.0).
Linear Regression (advanced) predicts continuous values using y = mx + c.
LR workflow: Load data → EDA → Split X/y → 80:20 Train-Test split → Train with LinearRegression() → Predict → Compare actual vs predicted.
sklearn modules used: model_selection.train_test_split · linear_model.LinearRegression.
Case-study dataset: USA_Housing.csv (5000 rows · no missing values · 80:20 → 4000 train / 1000 test).
