Introduction — Play the Emoji Scavenger Hunt
The challenge: find 8 items in the real world within a time limit. Point your phone camera at objects that match the emoji shown.
Reflect: Did you manage to win? What strategy worked? Was the computer able to identify all items? Did room lighting affect the identification?
🔹 A Quick Overview of Computer Vision
Computer Vision is a field of AI that enables systems to process, analyse and make sense of visual data (images and videos) in much the way humans do.
5.1 Computer Vision vs Image Processing
👁️ Computer Vision
Deals with extracting information from input images/videos to infer meaningful understanding and predict visual input.
Superset of Image Processing.
Examples: Object detection, Handwriting recognition.
🖼️ Image Processing
Mainly focused on processing raw input images to enhance them or prepare them for other tasks.
Subset of Computer Vision.
Examples: Rescaling image, Correcting brightness, Changing tones.
5.2 Applications of Computer Vision
Computer Vision was first introduced in the 1970s. Today it is widely used across industries — from facial recognition and retail analytics to self-driving cars and medical imaging.
5.3 Computer Vision Tasks
Every CV application is built on a set of core tasks performed on input images:
Classification: Assigning an input image one label from a fixed set of categories. Core CV problem — simple but widely used.
Classification + Localisation: Identifying what object is present AND where it is in the image. Used for single objects.
Object Detection: Finding instances of real-world objects — faces, bicycles, buildings — in images or videos. Uses extracted features + learning algorithms. Common in image retrieval, parking systems.
Instance Segmentation: Detecting objects, assigning each a category, and labelling every pixel accordingly. Output: a collection of regions/segments.
5.4 Basics of Images — Pixels
A pixel is the smallest unit of a digital image, arranged in a 2D grid.
5.5 Resolution
🔹 Two Ways to Express Resolution
- As dimensions: Width × Height (e.g., 1920 × 1080 pixels).
- As a total pixel count in megapixels (e.g., 1920 × 1080 ≈ 2.1 MP).
5.6 Pixel Value
🔹 Why the 0-255 Range?
- Computer data = binary system (0s and 1s).
- Each pixel uses 1 byte = 8 bits.
- Each bit has 2 possible values (0 or 1).
- 8 bits → 2⁸ = 256 possibilities (0 to 255).
- 0 = no colour / black. 255 = full colour / white.
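The 8-bit arithmetic above can be checked directly in Python — a minimal sketch using NumPy's `uint8` type, which is exactly the one-byte-per-pixel representation described here:

```python
import numpy as np

# 8 bits -> 2**8 = 256 possible values per pixel: 0 through 255.
print(2 ** 8)  # 256

# A single pixel stored in one byte (8 bits):
pixel = np.uint8(255)            # brightest possible value
print(pixel, pixel.dtype)        # 255 uint8

# The full representable range of an unsigned 8-bit integer:
print(np.iinfo(np.uint8).min, np.iinfo(np.uint8).max)  # 0 255
```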
5.7 Grayscale Images
- Darkest shade = Black (pixel value 0).
- Lightest shade = White (pixel value 255).
- Intermediate shades = equal brightness of the three primary colours.
- Each pixel = 1 byte (8 bits). Image = single plane / 2D array of pixels.
- Size of grayscale image = Height × Width (single-plane).
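A grayscale image really is just a single 2D array, as the points above describe. A minimal NumPy sketch (the 4×5 dimensions are arbitrary, chosen for illustration):

```python
import numpy as np

# A tiny 4x5 grayscale image: one 2D plane of 8-bit pixels.
gray = np.zeros((4, 5), dtype=np.uint8)  # all pixels 0 -> black
gray[1:3, 2:4] = 255                     # a small white patch

print(gray.shape)  # (4, 5) -> Height x Width, no channel axis
print(gray.size)   # 20 pixels = Height * Width
```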
5.8 RGB Images
🔹 How RGB Images Are Stored
Every RGB image is stored in three separate channels:
Each pixel has 3 values (one per channel). All three channels combined form a colour image. Size of RGB image = Height × Width × 3.
- Output colour when R=G=B=255 → White.
- Output colour when R=G=B=0 → Black.
- How does the colour vary when one of the three channels is 0 and the other two vary? → you get mixes of the remaining two colours (e.g., with B = 0, varying R and G gives shades of yellow).
- When all three vary in same proportion → shades of gray.
- RGB value of your favourite colour from the palette → experiment!
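The colour rules above can be verified with a tiny 2×2 RGB array — a minimal NumPy sketch (pixel positions and values chosen purely for illustration):

```python
import numpy as np

# A 2x2 RGB image: Height x Width x 3 channels.
img = np.zeros((2, 2, 3), dtype=np.uint8)
img[0, 0] = [255, 255, 255]  # R=G=B=255 -> white
img[0, 1] = [0, 0, 0]        # R=G=B=0   -> black
img[1, 0] = [128, 128, 128]  # equal mid values -> a shade of gray
img[1, 1] = [255, 0, 0]      # only the red channel -> pure red

print(img.shape)  # (2, 2, 3) -> Height x Width x 3
```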
5.9 No-Code AI Tools for CV
🎨 1. Lobe
- Lobe.ai is an AutoML tool — a no-code AI tool.
- Works with image classification.
- Provide a set of labelled images → Lobe automatically finds the best-suited model to classify them.
🎯 2. Teachable Machine
- Machine-learning tool developed by Google in 2017.
- Runs on TensorFlow.js.
- Web-based tool to train a model using images, audio, or poses — input via webcam or pictures.
🟠 3. Orange Data Mining
Has dedicated widgets for image classification — covered in Unit 4.
🔹 Activity — Smart Sorter
- Form groups of 4 members.
- Find images of Bottles, Cans and Paper online or from your surroundings.
- Visit the no-code AI tool.
- Build 3 different classes: Bottles · Cans · Paper.
- Train the model.
- Test the classifier on new images!
🔹 Use-Case Walkthrough — Coral Bleaching Detection
Coral Bleaching happens when corals lose their colour due to stress (rising sea temperatures, pollution, acidification). It disturbs the balance of aquatic ecosystems. Detecting bleached corals early can help save marine biodiversity.
Using Orange Data Mining with image-based widgets, we can build a classification model to detect healthy vs bleached corals — saving marine biodiversity.
5.10 Image Features
🔹 What Makes a Good Feature?
Imagine a security camera capturing an image. Three types of patches we might try to find:
Blobs (flat regions): Spread over a large area. Can appear anywhere in that region — hardest to locate exactly.
Edges: Edges of a building. You can find the approximate location, but the pattern looks the same all along the edge — still hard to pin down.
Corners: Corners of a building. Wherever you move this patch, it looks different — easiest to find, and therefore the best features.
5.11 Convolution
An image convolution = element-wise multiplication of an image region with another array called the kernel, followed by summing the products.
I * K = Result of applying convolution
The kernel is passed over the whole image (slid across) to get the resulting array after convolution.
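The slide-multiply-sum operation described above can be sketched in plain NumPy. This is a minimal illustration, not an optimised implementation; the image and kernel values are arbitrary examples:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image; at each position,
    multiply element-wise and sum the products."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1  # output shrinks without padding
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.array([[1, 2, 3, 0],
                  [4, 5, 6, 1],
                  [7, 8, 9, 2]], dtype=float)
kernel = np.array([[0, 1],
                   [1, 0]], dtype=float)

print(convolve2d(image, kernel))  # a 2x3 result: smaller than the input
```

Note the output is smaller than the input image — a 3×4 image convolved with a 2×2 kernel gives a 2×3 result, which is why padding (discussed below) is used when the original size must be kept.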
5.12 What is a Kernel?
A kernel is a small matrix that is slid over the image to modify it — different kernels produce different effects (blur, sharpen, edge detection).
Try:
- Change all kernel values to positive → see what happens.
- Change all to negative → observe.
- Mix negative and positive values → observe.
- Make 4 numbers negative, one positive → notice the pattern.
🔹 Why Use Convolution in CNN?
- To extract features from images — edges, corners, patterns.
- Convolution is used in the Convolutional Neural Network (CNN) for feature extraction (next section).
- The centre of the kernel overlaps each image pixel as it slides; the output becomes smaller because edge pixels can't be fully covered by the kernel.
- To keep the same size, we extend edges with zero padding.
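Zero padding can be demonstrated with NumPy's `np.pad` — a minimal sketch with an arbitrary 2×2 example image:

```python
import numpy as np

img = np.array([[1, 2],
                [3, 4]])

# Extend every edge with a border of zeros so the kernel's centre
# can sit on every original pixel and the output keeps the same size.
padded = np.pad(img, pad_width=1, mode='constant', constant_values=0)
print(padded.shape)  # (4, 4)
print(padded)
# [[0 0 0 0]
#  [0 1 2 0]
#  [0 3 4 0]
#  [0 0 0 0]]
```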
5.13 Convolutional Neural Network (CNN)
🔹 CNN Workflow
Input image → Convolution Layer → ReLU Layer → Pooling Layer → Fully Connected Layer → Output label.
5.14 Layers of CNN
🔲 1. Convolution Layer
- The first layer of a CNN.
- Objective: extract features such as edges from the input image.
- The first Convolution Layer captures low-level features (edges, colour, gradient orientation).
- Added layers capture high-level features — giving the network a complete understanding of the images.
- Uses the convolution operation with several kernels to produce several features.
- Output = Feature Map (also called Activation Map).
- Benefits of Feature Map:
- Reduces image size for efficient processing.
- Focuses only on features important for further processing — e.g., eyes, nose, mouth are enough to recognise a person; you don't need the whole face.
📈 2. Rectified Linear Unit (ReLU) Layer
- After the feature map is generated, it is passed to the ReLU layer.
- What it does: simply gets rid of all negative numbers in the feature map; lets positive numbers stay as-is.
- This introduces non-linearity in the feature map.
- Why? It makes colour change more abrupt and obvious — sharper edges.
- Smooth grey gradient → more abrupt edges → better features for later layers → stronger CNN.
If x < 0 → output = 0 · If x ≥ 0 → output = x
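The rule above, ReLU(x) = max(0, x), is a one-liner in NumPy — a minimal sketch with an arbitrary example feature map:

```python
import numpy as np

def relu(x):
    # Negative values become 0; non-negative values pass through unchanged.
    return np.maximum(0, x)

feature_map = np.array([[-3, 1],
                        [ 2, -7]])
print(relu(feature_map))  # negatives zeroed: 0 and 1 on top, 2 and 0 below
```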
🎯 3. Pooling Layer
- Similar to convolution layer — responsible for reducing the spatial size of the convolved feature while retaining important features.
- Two types of pooling:
Returns the maximum value from the portion of the image covered by the kernel.
Returns the average of values from the portion of the image covered by the kernel.
🔹 Why Pooling?
- Makes the image smaller and more manageable.
- Makes the image resistant to small transformations, distortions, translations — a small difference in input creates a very similar pooled image.
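Max pooling with a 2×2 window can be sketched in NumPy as follows (a minimal, non-strided illustration; the feature-map values are arbitrary):

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Max pooling: keep only the largest value in each size x size window."""
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(0, h - size + 1, size):
        for j in range(0, w - size + 1, size):
            out[i // size, j // size] = feature_map[i:i + size, j:j + size].max()
    return out

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 1],
               [0, 1, 3, 4]], dtype=float)

print(max_pool(fm))  # a 4x4 map shrinks to 2x2, keeping each window's maximum
```

Average pooling works the same way with `.mean()` in place of `.max()`.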
🔗 4. Fully Connected (FC) Layer
- The final layer of the CNN.
- Takes results of convolution + pooling and uses them to classify the image into a label.
- Output of conv/pooling is flattened into a single vector of values, each representing a probability that a feature belongs to a certain label.
- Example: for an image of a cat, features like whiskers or fur should have high probability for the label "cat".
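The flatten-then-classify step can be sketched in NumPy. Note the weight values and the two labels below are made up purely for illustration — a real FC layer learns its weights during training:

```python
import numpy as np

# Hypothetical pooled feature map from the earlier layers.
pooled = np.array([[6., 4.],
                   [7., 9.]])

flat = pooled.flatten()  # 2x2 map -> a single vector of 4 values
print(flat.shape)        # (4,)

# Illustrative-only weights for two labels ("cat", "dog").
weights = np.array([[0.2, 0.1, 0.3, 0.4],
                    [0.1, 0.3, 0.2, 0.1]])
scores = weights @ flat

# Softmax turns raw scores into label probabilities that sum to 1.
probs = np.exp(scores) / np.exp(scores).sum()
print(probs.sum())  # 1.0
```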
5.15 CNN Summary — Putting It All Together
| Layer | Purpose | What Happens |
|---|---|---|
| 1. Convolution | Extract features | Kernels slide over image producing feature maps (edges, colour, gradient). |
| 2. ReLU | Non-linearity | Replaces all negative values with 0; positive values unchanged. |
| 3. Pooling | Reduce size | Max or Average pooling shrinks feature map while keeping important info. |
| 4. Fully Connected | Classify | Flattens features and assigns final label probabilities. |
5.16 Python Libraries for Computer Vision
Three major Python libraries are used to build CV projects:
TensorFlow: Open-source ML library by Google. Used for building and training deep-learning models, including CNNs for image classification and object detection.
Keras: High-level Deep Learning API that runs on top of TensorFlow. Makes it simple to build and train neural networks with just a few lines of Python code.
OpenCV: Open-source Computer Vision library. Works with real-time image and video processing — reading, editing, applying filters, face detection, and more.
🔹 Applications of OpenCV
- Face detection in photos and videos.
- Object tracking in live video streams.
- Motion detection in security footage.
- Image editing — rescaling, rotating, cropping, colour changes.
- OCR (Optical Character Recognition) — read text from images.
- Augmented reality apps.
- Self-driving cars — detect lanes, traffic signs, vehicles.
- Medical imaging — analyse X-rays, MRIs, CT scans.
🔹 Sample Python CV Code (OpenCV)
```python
import cv2

img = cv2.imread('photo.jpg')
print("Shape:", img.shape)
print("Data type:", img.dtype)

gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

cv2.imshow('Original', img)
cv2.imshow('Grayscale', gray)
cv2.waitKey(0)
cv2.destroyAllWindows()
```
5.17 Practical Programs for CV (Class X)
As per the CBSE Class X syllabus (Practical list), you should be able to write these programs:
- Read an image and display using Python (OpenCV imread + imshow).
- Read an image and identify its shape using Python (img.shape gives Height × Width × Channels).
Program: Read and Display an Image
```python
import cv2

img = cv2.imread('photo.jpg')
cv2.imshow('My Image', img)
cv2.waitKey(0)
cv2.destroyAllWindows()
```
Program: Identify Image Shape
```python
import cv2

img = cv2.imread('photo.jpg')
print("Shape of image:", img.shape)
print("Height:", img.shape[0])
print("Width:", img.shape[1])
print("Channels:", img.shape[2])
```
Quick Revision — Key Points to Remember
- Computer Vision (CV) = AI field that enables machines to see, observe and make sense of visual data.
- CV vs Image Processing: CV extracts meaning (superset). Image Processing enhances images (subset).
- Applications: Facial Recognition · Face Filters · Google Image Search · Retail Analytics · Inventory · Self-Driving Cars · Medical Imaging · Google Translate · Agricultural Drones · Attendance.
- 4 CV Tasks: Classification · Classification+Localisation · Object Detection · Instance Segmentation.
- Pixel = smallest unit of a digital image, arranged in a 2D grid.
- Resolution = number of pixels. Expressed as Width×Height or in megapixels.
- Pixel value = 0 to 255 (1 byte / 8 bits). 0 = black, 255 = white.
- Grayscale image = single 2D plane of pixels, each pixel 0-255.
- RGB image = 3 channels (R, G, B) stacked. Each pixel has 3 values. Size = H × W × 3.
- No-Code CV tools: Lobe · Teachable Machine · Orange Data Mining. Activities: Smart Sorter, Coral Bleaching detection.
- Image Features: blobs · edges · corners. Corners are the best features (unique locations).
- Convolution = element-wise multiplication of image + kernel, followed by sum. Slide kernel over image.
- Kernel = matrix that modifies the image — different kernels produce different effects (blur, sharpen, edge-detect).
- CNN (Convolutional Neural Network) = DL algorithm for image classification. 4 layers: Convolution → ReLU → Pooling → Fully Connected.
- Convolution Layer: extracts features using kernels → Feature Map.
- ReLU Layer: removes negatives; introduces non-linearity; ReLU(x) = max(0, x).
- Pooling Layer: Max Pooling (max value) or Average Pooling. Reduces size, keeps important features.
- Fully Connected (FC) Layer: flattens features → classifies image into label probabilities.
- Python Libraries: TensorFlow (Google's DL library) · Keras (high-level API on top of TF) · OpenCV (CV & real-time video/image processing).
- OpenCV applications: face detection, object tracking, motion detection, image editing, OCR, AR, self-driving cars, medical imaging.