Introduction — Why Machines Need to See
With social media (Facebook, Instagram, Twitter) and smartphone cameras, billions of images and videos are shared every day. Unlike text, images are hard to index and search: algorithms must organise image data by colour, texture, shape or metadata for quick retrieval. Traditionally, image search relied on manually written descriptions (metadata); today we need computers to perceive the visual content of images themselves.
Just as children are taught to associate an image (e.g., an apple) with the letter "A", computers must develop similar capabilities by repeatedly viewing labelled images.
🔹 Key Concepts You'll Learn
- Introduction to Computer Vision
- Working of Computer Vision
- Applications of Computer Vision
- Challenges of Computer Vision
- The Future of Computer Vision
Prerequisite: Basic understanding of digital imaging and knowledge of machine learning.
1.1 How Do Machines See?
Computer Vision is analogous to human vision:
| Human Vision | Computer Vision |
|---|---|
| Retina | Camera sensor |
| Optic nerve | Data bus |
| Visual cortex | Deep-learning model / algorithm |
CV systems inspect products, infrastructure or production assets in real time and can flag defects or issues faster than human inspectors. Because of their speed, objectivity, continuity, accuracy and scalability, they can outperform people on narrow, well-defined tasks. On some benchmark tasks, such as facial recognition, object detection and image classification, modern deep-learning models have matched or exceeded reported human accuracy.
2.1 Working of Computer Vision — Basics of Digital Images
2.2 Interpretation of an Image in Digital Form
- During digitisation, an image is converted into a grid of pixels.
- Resolution = number of pixels in the image. Higher resolution → finer detail, closer to the original scene.
- For monochrome (grayscale, often loosely called black & white) images, each pixel holds a single 8-bit value ranging from 0 (black) through shades of grey to 255 (white).
- 1 byte = 8 bits → 2⁸ = 256 distinct values (0–255).
- For coloured images, each pixel has 3 numbers based on the RGB (Red · Green · Blue) colour model — each channel 0–255 → over 16 million possible colours per pixel.
🔹 Binary-to-Decimal Quick Reference
Binary 00000000 = 2⁷·0 + 2⁶·0 + … + 2⁰·0 = 0 (black).
Binary 11111111 = 2⁷·1 + 2⁶·1 + … + 2⁰·1 = 255 (white).
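The place-value arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the helper name `binary_to_decimal` is made up for illustration and is not part of any library.

```python
def binary_to_decimal(bits: str) -> int:
    """Convert a binary string such as '11111111' to its decimal value."""
    value = 0
    # The rightmost bit has place value 2**0, the next 2**1, and so on.
    for position, bit in enumerate(reversed(bits)):
        value += int(bit) * 2 ** position
    return value


black = binary_to_decimal("00000000")   # all bits off
white = binary_to_decimal("11111111")   # all bits on
levels_per_channel = 2 ** 8             # distinct values in one byte
rgb_colours = levels_per_channel ** 3   # three channels: R, G and B

print(black, white)         # 0 255
print(levels_per_channel)   # 256
print(rgb_colours)          # 16777216
```

The last figure is where "over 16 million colours" comes from: 256 × 256 × 256 = 16,777,216.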
2.3 Computer Vision Process — The 5 Stages
📷 Stage 1 · Image Acquisition
The initial stage — capturing digital images or videos. This is the raw data for subsequent analysis.
- Sources: digital cameras, scanners (for physical photos/documents), design-software generation.
- Quality depends on device capability & resolution, lighting conditions and angle.
- Specialised scientific imaging: MRI (Magnetic Resonance Imaging), CT (Computed Tomography) scans for medical diagnosis.
🧼 Stage 2 · Preprocessing
Aims to enhance the quality of the acquired image. Main goals: remove noise · highlight important features · ensure consistency across the dataset.
- Noise Reduction — removes random spots, grain and similar distortions, usually by smoothing (which can slightly soften fine detail). Example: removing grainy effects in low-light photos.
- Image Normalization — standardises pixel values to a consistent range (e.g., 0–1 or −1 to +1). Example: scaling 0–255 → 0–1.
- Resizing / Cropping — makes all images uniform. Example: resize all inputs to 224 × 224 pixels before a neural network.
- Histogram Equalization — adjusts brightness & contrast by spreading pixel intensities evenly, enhancing detail in dark or bright regions.
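Two of the preprocessing steps above, normalization and histogram equalization, can be sketched in plain Python on a tiny grayscale grid. The function names `normalise` and `equalise_histogram` are illustrative assumptions; in practice a library such as OpenCV would be used (for example `cv2.equalizeHist`), and its exact rounding may differ from this sketch.

```python
def normalise(image):
    """Scale 8-bit pixel values (0-255) to the range 0.0-1.0."""
    return [[pixel / 255 for pixel in row] for row in image]


def equalise_histogram(image, levels=256):
    """Stretch the grey levels of an image using its cumulative histogram."""
    pixels = [p for row in image for p in row]

    # Count how many pixels have each grey level.
    histogram = [0] * levels
    for p in pixels:
        histogram[p] += 1

    # Cumulative distribution: number of pixels at or below each level.
    cdf, running = [], 0
    for count in histogram:
        running += count
        cdf.append(running)

    cdf_min = next(c for c in cdf if c > 0)
    total = len(pixels)
    if total == cdf_min:  # every pixel is identical, nothing to spread
        return [row[:] for row in image]

    # Map each old level to a new level spread across 0..levels-1.
    lookup = [
        round(max(c - cdf_min, 0) / (total - cdf_min) * (levels - 1))
        for c in cdf
    ]
    return [[lookup[p] for p in row] for row in image]


dark_image = [[50, 52], [52, 54]]   # low contrast: values bunched together
scaled = normalise([[0, 255]])
equalised = equalise_histogram(dark_image)

print(scaled)      # [[0.0, 1.0]]
print(equalised)   # [[0, 170], [170, 255]]
```

After equalization the three grey levels that were only two units apart now span the full 0–255 range, which is why detail in dark regions becomes easier to see.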
🔍 Stage 3 · Feature Extraction
Identifies and extracts relevant visual patterns or attributes. Common techniques:
- Edge Detection — identifies boundaries where there is a significant change in intensity.
- Corner Detection — identifies points where two or more edges meet (high-curvature areas).
- Texture Analysis — extracts features like smoothness, roughness or repetition.
- Colour-Based Feature Extraction — quantifies colour distributions to discriminate objects/regions.
In deep-learning approaches, feature extraction is performed automatically by Convolutional Neural Networks (CNNs) during training.
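As a rough illustration of edge detection, the sketch below marks a pixel as an edge when its intensity differs sharply from its right-hand neighbour. The function name `intensity_edges` and the threshold of 50 are assumptions chosen for the example; real systems typically use operators such as Sobel or Canny.

```python
def intensity_edges(image, threshold=50):
    """Return a 0/1 map marking pixels whose right-hand neighbour differs
    in intensity by more than `threshold`."""
    edges = []
    for row in image:
        edge_row = []
        for x in range(len(row) - 1):
            difference = abs(row[x + 1] - row[x])
            edge_row.append(1 if difference > threshold else 0)
        edge_row.append(0)  # the last column has no right-hand neighbour
        edges.append(edge_row)
    return edges


# A dark region (10) next to a bright region (200).
image = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]
edge_map = intensity_edges(image)
for row in edge_map:
    print(row)   # [0, 1, 0, 0] for every row
```

Because it only compares horizontal neighbours, this sketch finds vertical boundaries; a fuller detector would also compare pixels above and below.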
🎯 Stage 4 · Detection / Segmentation
Identifies objects or regions of interest. Split into two broad categories:
🔹 Single-Object Tasks
- Classification — determines the category/class of one object. Algorithms: KNN (K-Nearest Neighbour) for supervised classification; K-means clustering for unsupervised grouping of unlabelled images.
- Classification + Localization — also draws a bounding box tightly around the object.
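To make the KNN idea concrete, here is a minimal sketch that classifies an image from a simple feature vector. It assumes, purely for illustration, that each image has already been reduced to its average (R, G, B) colour; the labels, feature values and the function name `knn_classify` are made up.

```python
import math
from collections import Counter


def knn_classify(sample, labelled_examples, k=3):
    """Return the majority label among the k training examples nearest to
    `sample`, measured by Euclidean distance."""
    distances = sorted(
        (math.dist(sample, features), label)
        for features, label in labelled_examples
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training set: (average R, G, B) of each labelled image.
training = [
    ((200, 30, 30), "apple"),
    ((210, 40, 35), "apple"),
    ((190, 25, 40), "apple"),
    ((240, 220, 50), "banana"),
    ((235, 210, 60), "banana"),
    ((245, 225, 40), "banana"),
]

reddish = knn_classify((205, 35, 30), training)
yellowish = knn_classify((238, 215, 55), training)
print(reddish, yellowish)   # apple banana
```

KNN is supervised because it needs the labels in `training`; K-means, by contrast, would group the same feature vectors into clusters without being told which is which.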
🔹 Multiple-Object Tasks
- Object Detection — identifies and locates multiple objects; draws bounding boxes with class labels. Algorithms: R-CNN (Region-Based CNN), R-FCN (Region-Based Fully Convolutional Network), YOLO (You Only Look Once), SSD (Single Shot Detector).
- Image Segmentation — creates a pixel-wise mask for each object. Classical methods often use edge detection to find discontinuities in brightness. Two popular kinds:
  - Semantic Segmentation — classifies pixels by class; objects of the same class are not differentiated (e.g., all "animals" get one mask).
  - Instance Segmentation — differentiates every object even if they share a class (each animal separately masked).
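The difference between the two kinds of segmentation can be shown on a toy image. In this sketch a brightness threshold stands in for the "class" decision (an assumption for illustration; real semantic segmentation uses trained models), and a flood fill then gives each connected blob its own label. The function names are hypothetical.

```python
def semantic_mask(image, threshold=128):
    """One mask for the whole class: 1 where a pixel is bright, else 0."""
    return [[1 if p >= threshold else 0 for p in row] for row in image]


def instance_labels(mask):
    """Give every 4-connected blob in a binary mask its own label (1, 2, ...).
    Returns the label grid and the number of blobs found."""
    height, width = len(mask), len(mask[0])
    labels = [[0] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if mask[y][x] == 1 and labels[y][x] == 0:
                count += 1                 # found a new, unlabelled object
                labels[y][x] = count
                stack = [(y, x)]
                while stack:               # flood fill its neighbours
                    cy, cx = stack.pop()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < height and 0 <= nx < width
                                and mask[ny][nx] == 1 and labels[ny][nx] == 0):
                            labels[ny][nx] = count
                            stack.append((ny, nx))
    return labels, count


image = [
    [200, 200, 0, 0, 0],
    [200, 200, 0, 0, 220],
    [0,   0,   0, 0, 220],
]
mask = semantic_mask(image)
labels, instance_count = instance_labels(mask)

print(mask)            # [[1, 1, 0, 0, 0], [1, 1, 0, 0, 1], [0, 0, 0, 0, 1]]
print(labels)          # [[1, 1, 0, 0, 0], [1, 1, 0, 0, 2], [0, 0, 0, 0, 2]]
print(instance_count)  # 2
```

The semantic mask treats both bright regions as the same thing, while the instance labels tell them apart as object 1 and object 2.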
🔹 Key Difference: Classification vs Detection
Classification considers the whole image and predicts one class. Detection identifies multiple objects in an image and classifies each.
🧠 Stage 5 · High-Level Processing
The final stage — interpreting and extracting meaningful information. Tasks include recognising objects, understanding scenes and analysing context. Empowers CV systems to extract valuable insights and drive intelligent decision-making in domains from autonomous driving to medical diagnostics.
3.1 Applications of Computer Vision
- Facial recognition — detects and identifies faces; for example, Facebook has used it to suggest tags for people in photos.
- Healthcare — evaluates cancerous tumours, identifies diseases and abnormalities, tracks objects in medical imaging (MRI, CT, X-ray).
- Self-Driving Vehicles — capture video around the car to detect other vehicles, traffic signals, pedestrian paths.
- Optical Character Recognition (OCR) — extracts printed or handwritten text from images (invoices, bills, articles).
- Machine Inspection — detects defects, functional flaws, and irregularities in manufactured products using tuned lighting and handling.
- 3D Model Building — constructs 3D models of objects for robotics, autonomous driving, 3D tracking, scene reconstruction, AR/VR.
- Surveillance — live CCTV analysis to identify suspicious behaviour and dangerous objects; helps maintain law and order.
- Fingerprint Recognition & Biometrics — validates user identity for banking, immigration, attendance.
4.1 Challenges of Computer Vision
🧩 1. Reasoning and Analytical Issues
CV relies on more than just image identification — it requires accurate interpretation. Without robust reasoning, extracting meaningful insights is limited.
📸 2. Difficulty in Image Acquisition
Hindered by lighting variations, perspectives, scales, occlusions and complex multi-object scenes. Obtaining high-quality data amid these challenges is crucial.
🔐 3. Privacy & Security Concerns
Vision-powered surveillance can infringe on privacy rights. Facial recognition raises ethical dilemmas — regulatory scrutiny and public debate surround such technologies.
🎭 4. Duplicate & False Content
Malicious actors can create misleading or fraudulent content (deepfakes, forged images). Together with data breaches, such content fosters misinformation and reputational damage.
5.1 The Future of Computer Vision
CV has evolved from basic image processing to systems that understand and interpret visual data with human-like precision. Breakthroughs in deep learning and the availability of vast labelled training datasets have propelled the field forward.
🔹 What's Next?
- Personalised Healthcare Diagnostics — CV-powered tools for early disease detection.
- Immersive AR / VR — real-time scene understanding for headsets and smart glasses.
- Smart Cities — traffic monitoring, waste sorting, infrastructure inspection.
- Agriculture — crop health analysis, pest detection, yield prediction from drone imagery.
- Education — automatic answer-sheet checking, engagement analysis.
By embracing innovation, fostering collaboration and prioritising ethics, we can harness the transformative power of CV for humanity.
6.1 Introduction to OpenCV
🔹 Installing OpenCV
pip install opencv-python
6.2 Loading & Displaying an Image
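```python
import cv2

image = cv2.imread('example.jpg')      # load image (None if the file cannot be read)
if image is None:
    raise FileNotFoundError('example.jpg could not be loaded')
cv2.imshow('original image', image)    # display image
cv2.waitKey(0)                         # wait for any key press
cv2.destroyAllWindows()                # close all OpenCV windows
```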
🔹 Function Reference
- cv2.imread('path') — loads an image into a NumPy array (colour channels in BGR order); returns None if the file cannot be read.
- cv2.imshow('title', image) — opens a window displaying the image.
- cv2.waitKey(0) — waits indefinitely for a key press (use a positive number for milliseconds).
- cv2.destroyAllWindows() — closes any OpenCV-created windows.
6.3 Resizing an Image
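```python
import cv2

image = cv2.imread('example.jpg')      # load image (None if the file cannot be read)
if image is None:
    raise FileNotFoundError('example.jpg could not be loaded')

new_width = 300
new_height = 300
resized_image = cv2.resize(image, (new_width, new_height))  # size is (width, height)

cv2.imshow('Resized Image', resized_image)
cv2.waitKey(0)
cv2.destroyAllWindows()
```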
Use cv2.resize(image, (width, height)) to set fixed dimensions. Common use: standardise all inputs (e.g., 300 × 300 or 224 × 224) before feeding them into a model.
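Resizing to a fixed square such as 300 × 300 stretches any image that is not already square. If distortion matters, the target size can be computed from the original aspect ratio first. The helper below is a hypothetical sketch (the name `size_preserving_aspect` is not an OpenCV function); its result could then be passed to `cv2.resize` as the `(width, height)` pair.

```python
def size_preserving_aspect(original_width, original_height, target_width):
    """Return (width, height) for a resize that keeps the aspect ratio."""
    scale = target_width / original_width
    return target_width, round(original_height * scale)


new_size = size_preserving_aspect(1200, 800, 300)
print(new_size)   # (300, 200)
```

Whether to stretch, crop or pad depends on the model being fed; many networks expect an exact input size, so a fixed resize is still common.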
6.4 Converting an Image to Grayscale
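Converting to grayscale reduces computational cost by collapsing the three colour channels into a single intensity channel.
```python
import cv2

image = cv2.imread('example.jpg')      # load image (None if the file cannot be read)
if image is None:
    raise FileNotFoundError('example.jpg could not be loaded')

grayscale_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # BGR -> single channel

cv2.imshow('Grayscale Image', grayscale_image)
cv2.waitKey(0)
cv2.destroyAllWindows()
```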
Use cv2.cvtColor(src, cv2.COLOR_BGR2GRAY) — the colour-conversion code cv2.COLOR_BGR2GRAY converts BGR → grayscale.
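The conversion is a weighted sum of the three channels rather than a plain average. OpenCV's documentation gives the formula Y = 0.299·R + 0.587·G + 0.114·B; the sketch below applies it to single pixels in pure Python. The function name `bgr_to_gray` is made up for illustration, and the library's internal rounding may differ slightly from `round()`.

```python
def bgr_to_gray(pixel):
    """Convert one (blue, green, red) pixel to a single grey level."""
    blue, green, red = pixel   # OpenCV stores channels in B, G, R order
    return round(0.299 * red + 0.587 * green + 0.114 * blue)


pure_red = bgr_to_gray((0, 0, 255))
pure_green = bgr_to_gray((0, 255, 0))
pure_blue = bgr_to_gray((255, 0, 0))
white = bgr_to_gray((255, 255, 255))

print(pure_red, pure_green, pure_blue, white)   # 76 150 29 255
```

Green gets the largest weight because the human eye is most sensitive to it, so a pure green pixel ends up much brighter in grayscale than a pure blue one.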
- Field that helps computers "see" → Computer Vision.
- Task of assigning a class label to an input image → Image Classification.
- 1 byte = 8 bits. Monochrome pixel range: 0–255.
- Capturing a digital image → Image Acquisition.
- KNN is used for supervised classification; K-means is used for unsupervised clustering.
- A computer sees an image as a series of pixels.
- High-level processing drives intelligent decision-making.
- Edge detection identifies abrupt intensity changes / object boundaries.
- Preprocessing does not include edge/corner detection — those are feature-extraction steps.
- Incorrect statements: "RGB is only for camera images" (RGB applies to all colour images) and "fewer pixels resemble the original image better" (more pixels give more detail).
Quick Revision — Key Points to Remember
- Computer Vision (CV) = AI field enabling machines to see, observe & understand visual data; a.k.a. Machine Vision.
- Human vs CV analogy: Retina → Sensor · Optic Nerve → Data Bus · Visual Cortex → Model.
- Why CV can surpass humans on specific tasks: speed · objectivity · continuity · accuracy · scalability.
- Digital image = grid of pixels (picture elements); more pixels = higher resolution = finer detail.
- Monochrome pixel: 0 (black) to 255 (white); 1 byte = 8 bits = 2⁸ = 256 values.
- RGB colour model: 3 channels (Red · Green · Blue); each 0–255 → over 16 million colours.
- CV Process — 5 Stages: Image Acquisition → Preprocessing → Feature Extraction → Detection/Segmentation → High-Level Processing.
- Preprocessing techniques: Noise Reduction · Image Normalization · Resizing/Cropping · Histogram Equalization.
- Feature Extraction: Edge Detection · Corner Detection · Texture Analysis · Colour-Based (CNN does it automatically in DL).
- Single-object tasks: Classification (KNN for supervised classification, K-means for unsupervised clustering) · Classification + Localization (bounding box).
- Multi-object tasks: Object Detection (R-CNN · R-FCN · YOLO · SSD) · Image Segmentation (Semantic · Instance).
- Classification vs Detection: Whole image → single class vs Multiple objects → bounding boxes + classes.
- Applications: Facial recognition · Healthcare · Self-driving · OCR · Machine inspection · 3D modelling · Surveillance · Biometrics.
- 4 Challenges: Reasoning · Image Acquisition difficulty · Privacy/Security · Duplicate/False content (deepfakes).
- Future: Personalised healthcare · AR/VR · Smart cities · Agriculture · Education — with ethics at the core.
- Activities: Binary Art (0s/1s re-form the image) · Website with ML model via Teachable Machine + Weebly.
- OpenCV essentials: pip install opencv-python · cv2.imread · cv2.imshow · cv2.waitKey(0) · cv2.destroyAllWindows · cv2.resize · cv2.cvtColor(..., cv2.COLOR_BGR2GRAY).