PART B ▪ UNIT 4 · AI with Orange Data Mining Tool

Widgets · Data Science · Computer Vision · NLP (Practical only)

Orange Data Mining is a free, open-source, component-based visual programming tool for data visualization, machine learning, data mining, and data analysis. Its components are called widgets and are connected on a canvas to build workflows, with no coding required.

Introduction — Why Orange?

Students will learn to use Orange's intuitive interface across the domains of Data Science, Computer Vision and Natural Language Processing.
Through hands-on projects & case studies, they gain practical insights into widget usage for data visualisation, preprocessing, feature selection, modelling and evaluation.

Prerequisites: Awareness of Data Science, NLP and Computer Vision concepts + basic knowledge of ML algorithms.

4.1 What Is Data Mining?

Data Mining — the process of discovering trends, useful information and patterns from large datasets. It analyses and interprets data to extract meaningful insights that aid decision-making.

4.2 Introduction to Orange Data Mining Tool

Within Orange, components are widgets. Their functionalities range from basic data visualization, subset selection and preprocessing to empirical evaluation of learning algorithms and predictive modelling. Visual programming: workflows are built by interconnecting widgets on a canvas.

4.3 Beneficiaries of Orange Data Mining

User Group	How Orange Helps
Data Analysts & Scientists	User-friendly interface — accessible even without extensive programming skills.
Researchers	Explore research data, test hypotheses, generate insights from experimental results.
Educators & Students	Intuitive interface + visual programming — introduce complex topics approachably.
Business Professionals	Identify trends, predict customer behaviour, optimise processes, improve performance.
Open-Source Community	Source code freely available; large community of contributors.

Learning Outcome 1: Develop proficiency in utilizing the Orange Data Mining tool — navigate its interface, employ its features, and execute data-analysis tasks effectively

1.1 Getting Started with Orange — Installation

1. Visit the Orange Website

Go to the official site: orangedatamining.com/download/.

2. Choose the Correct Version

Windows users: download the Standalone Installer.
Mac users: select the build that matches your Mac's processor, e.g. Orange for Apple Silicon on M-series Macs.

3. Download the Installer

Click the respective download link to begin the download.

4. Install the Software

Windows: double-click the installer and follow on-screen instructions.
Mac: mount the disk image and drag the Orange application into the Applications folder.

5. Launch Orange

Start the tool from the system's Applications menu or its desktop icon.

1.2 Components of the Orange Data Mining Tool

Blank Canvas — the workspace where you build analysis workflows by dragging and dropping widgets. Arrange and connect widgets to form a data-processing pipeline from input to output.
Widgets — graphical elements that perform specific tasks or operations on data (e.g., File, Data Table, Scatter Plot).
Connectors — lines that link widgets on the canvas, representing the flow of data from one widget's output to another's input.
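
The canvas, widget and connector idea can be pictured in code. The sketch below is only an analogy in plain Python, not how Orange is implemented: each function is a hypothetical stand-in for a widget, the rows are made-up values, and passing one function's return value into the next plays the role of a connector.

```python
# Analogy only: each function plays the role of a widget, and passing a
# return value to the next function plays the role of a connector.

def file_widget():
    """Stand-in for the File widget: emits a small table (list of rows)."""
    # Made-up rows for illustration, not real measurements.
    return [
        {"name": "a", "length": 1.0},
        {"name": "b", "length": 3.0},
        {"name": "c", "length": 5.0},
    ]


def select_rows_widget(rows, min_length):
    """Stand-in for a subset-selection widget: keeps rows above a threshold."""
    return [row for row in rows if row["length"] >= min_length]


def data_table_widget(rows):
    """Stand-in for the Data Table widget: formats rows for display."""
    return [f"{row['name']}\t{row['length']}" for row in rows]


# "Connecting" the widgets: the output of one becomes the input of the next.
table = file_widget()
subset = select_rows_widget(table, min_length=2.0)
lines = data_table_widget(subset)
print("\n".join(lines))
```

In Orange the same chain is built by dragging three widgets onto the canvas and drawing two connectors between them.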

1.3 Default Widget Catalog — 6 Categories

1. Data Widgets

Used for data manipulation — load, store and read datasets.

File — reads an input data file (table of instances) and sends it to its output channel.
Data Table — displays attribute-value data in a spreadsheet view.
SQL Table — reads data directly from an SQL database.

2. Transform Widgets

Apply various transformations to the dataset within the workflow — e.g., select columns, continuise, discretise, normalise, concatenate, impute, and preprocess.
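
To make two of these transformations concrete, here is a minimal plain-Python sketch of imputing and normalising a single column. Orange's own widgets offer several strategies; mean imputation and min-max scaling are chosen here only as simple examples, and the function names and the column values are hypothetical.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def normalise_min_max(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


column = [2.0, None, 4.0, 6.0]       # made-up column with one missing value
filled = impute_mean(column)         # [2.0, 4.0, 4.0, 6.0]
scaled = normalise_min_max(filled)   # [0.0, 0.5, 0.5, 1.0]
print(filled, scaled)
```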

3. Visualize Widgets

Tools for visualising data — scatter plots, bar charts, heat maps, box plots, distributions, histograms, parallel coordinate plots.

4. Model Widgets

Apply machine-learning algorithms — classification, regression, clustering, anomaly detection — to build predictive models and analyse data patterns.

5. Evaluate Widgets

Evaluate model performance via cross-validation, confusion matrix, ROC analysis, Test & Score, Predictions, etc.

6. Unsupervised Widgets

Facilitate exploratory data analysis & pattern recognition without labelled data — clustering, dimensionality reduction, association-rule mining.

Learning Outcome 2: Demonstrate the ability to apply Orange in real-world scenarios across AI domains — Data Science · Computer Vision · Natural Language Processing — through hands-on projects and case studies

2.1 Three Key Domains of AI with Orange

Data Science with Orange
Computer Vision with Orange
Natural Language Processing with Orange

2.2 Data Science with Orange — The Iris Flower Case Study

The violet-coloured iris comes in three main types: Iris Setosa · Iris Versicolor · Iris Virginica. The key differences lie in sepal and petal length and width. We'll use Orange to explore these measurements and classify the flowers.

A. Data Visualization — Exploring Iris Dimensions

Launch Orange — opens a blank canvas.
File widget — drag onto the canvas → double-click → select the iris dataset. Set the iris column role to target.
Data Table widget — connect File → Data Table to view the 150 samples in tabular form.
Scatter Plot widget — connect File → Scatter Plot; select variables like sepal length vs sepal width. Each point = one iris sample.
Experiment with Histograms, Box Plots, Parallel Coordinate Plots for extra perspectives.

B. Classification with the Tree Widget

Prepare testing data in a spreadsheet — columns for sepal length, sepal width, petal length, petal width (same names as training, cm units).
Tree widget — drag it onto the canvas; connect the training File → Tree.
Predictions widget: connect Tree → Predictions (this passes the trained model), then connect a second File widget holding the testing data → Predictions.
Double-click Predictions — Orange displays the predicted class for each test sample.

Data Table is optional — it just displays data; not required to connect for classification.
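
A decision tree classifies by asking threshold questions about the measurements. The sketch below shows the simplest possible case, a one-question tree (a "stump") on a single feature with two classes. It is a simplified illustration, not the algorithm behind Orange's Tree widget; the function names are hypothetical and the petal lengths are made-up values, not the real Iris data.

```python
def fit_stump(values, labels):
    """Find the threshold on one feature that misclassifies the fewest samples.

    The model predicts one class for value < threshold and the other otherwise.
    """
    classes = sorted(set(labels))
    assert len(classes) == 2, "this sketch handles two classes only"
    ordered = sorted(set(values))
    best = None
    # Candidate thresholds: midpoints between neighbouring distinct values.
    for low, high in zip(ordered, ordered[1:]):
        threshold = (low + high) / 2
        # Try both ways of assigning the two classes to the two sides.
        for left, right in (classes, classes[::-1]):
            errors = sum(
                (left if v < threshold else right) != y
                for v, y in zip(values, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return threshold, left, right


def predict_stump(model, value):
    """Apply the learned threshold rule to one new measurement."""
    threshold, left, right = model
    return left if value < threshold else right


# Made-up petal lengths in cm (illustrative values, not the real Iris data).
train_lengths = [1.4, 1.5, 1.3, 4.5, 4.7, 4.4]
train_labels = ["setosa", "setosa", "setosa",
                "versicolor", "versicolor", "versicolor"]

model = fit_stump(train_lengths, train_labels)
predictions = [predict_stump(model, v) for v in [1.6, 4.2]]
print(model, predictions)
```

A full tree repeats this search on each resulting subset and across all four measurements, which is what lets it separate three classes.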

C. Evaluating the Classification Model

Test & Score widget: connect both the File (data) and the Tree (learner) to it. Orange uses cross-validation with 10 subsets (folds) by default.
Inspect Accuracy · Precision · Recall · F1 Score. In the Iris example, accuracy comes out to ≈ 93%.
Confusion Matrix widget — connect Test & Score → Confusion Matrix. Reveals TP / TN / FP / FN for each class, showing (e.g.) that Setosa is separated cleanly while Versicolor and Virginica overlap.
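
The four scores can be derived from the TP / TN / FP / FN counts for one class. The following is a minimal sketch using the standard textbook formulas; the labels are made up for illustration, the function names are hypothetical, and how Orange averages the per-class scores into a single figure is not covered here.

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, TN, FP, FN, treating `positive` as the class of interest."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1          # predicted positive, actually another class
        elif a == positive:
            fn += 1          # missed a real positive
        else:
            tn += 1
    return tp, tn, fp, fn


def scores(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1 from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


# Made-up labels for illustration only.
actual = ["versicolor", "versicolor", "versicolor", "virginica",
          "virginica", "virginica", "setosa", "setosa"]
predicted = ["versicolor", "versicolor", "virginica", "versicolor",
             "virginica", "virginica", "setosa", "setosa"]

tp, tn, fp, fn = confusion_counts(actual, predicted, positive="versicolor")
accuracy, precision, recall, f1 = scores(tp, tn, fp, fn)
print(tp, tn, fp, fn, accuracy, precision, recall, f1)
```

Note that every mistake in this toy example is a Versicolor/Virginica mix-up, mirroring the overlap described above.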

Cross-Validation (Rotation Estimation)

Cross-validation (a.k.a. rotation estimation) — resampling and sample-splitting method that uses different portions of the data to test and train a model across multiple iterations. Default in Orange = 10-fold CV.
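
The rotation can be sketched in a few lines of plain Python. This is an illustration of the general k-fold idea, not Orange's code: the function name is hypothetical, and the sketch splits the rows in order, whereas practical tools usually shuffle the data first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# 150 samples (the size of the Iris dataset) split into 10 folds.
folds = list(k_fold_indices(n_samples=150, k=10))
for train, test in folds[:2]:
    print(len(train), "training rows,", len(test), "testing rows")
```

Each sample is used for testing exactly once and for training in the other nine rounds; the reported score is the average over the rounds.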

Practical — Differentiate fruits vs vegetables by nutrition: collect data on energy, water, protein, fat, carbs, fibre, sugars, calcium, iron, magnesium, phosphorus, potassium, sodium (e.g., from Kaggle). Split into train/test. Train several classifiers via Orange widgets. Evaluate with Accuracy, Precision, Recall, F1.

2.3 Computer Vision with Orange — Dogs vs Cats Clustering

Step 1 · Install Image Analytics Add-On

Go to Options → Add-ons → Image Analytics → install → restart Orange. The image widgets now appear in the side panel.

Step 2 · Import Images

Drag the Import Images widget onto the canvas and upload the folder containing dog and cat images (CBSE provides a sample dataset via the Handbook's Google-Drive link).

Step 3 · Image Viewer

Add the Image Viewer widget, connect Import Images → Image Viewer, and double-click to browse all the thumbnails.

Step 4 · Image Embedding

Connect the Image Embedding widget to Import Images. It sends each image to a server where a deep neural network trained on millions of real-life images converts it into a numerical vector (embedding).

Step 5 · Distance (Cosine)

Connect the Distance widget to the output of Image Embedding. Double-click and select Cosine distance — usually the best-working option for images.
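
Cosine distance compares the direction of two vectors rather than their length. Below is a minimal sketch using the standard formula (1 minus cosine similarity); the two-number "embeddings" are made up for illustration and real embeddings are typically far longer.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Tiny made-up "embeddings" for illustration only.
dog_a = [1.0, 0.0]
dog_b = [2.0, 0.0]   # same direction as dog_a, different length
cat_a = [0.0, 1.0]   # perpendicular to dog_a

print(cosine_distance(dog_a, dog_b))  # vectors pointing the same way
print(cosine_distance(dog_a, cat_a))  # vectors at right angles
```

Vectors pointing the same way get a distance of 0 regardless of their length, while perpendicular vectors get a distance of 1.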

Step 6 · Hierarchical Clustering

Drag the Hierarchical Clustering widget and connect the Distance matrix to it. Double-click to see the dendrogram — a tree showing how images group by similarity.
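
The dendrogram records the order in which items are merged. The sketch below builds that merge list with a simple bottom-up procedure. It assumes single linkage (cluster distance = distance between the closest members), which is only one of several possible linkage choices; the function names are hypothetical and the one-number "features" are made up.

```python
def single_linkage(points, distance):
    """Agglomerative clustering: repeatedly merge the two closest clusters.

    Returns the merges as (cluster_a, cluster_b, distance) tuples, which is
    the information a dendrogram draws.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(distance(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


def absolute_difference(x, y):
    return abs(x - y)


# Made-up 1-D "features": two tight groups far apart.
values = [1.0, 1.2, 9.0, 9.5]
merges = single_linkage(values, absolute_difference)
for a, b, d in merges:
    print(a, "+", b, "at distance", round(d, 2))
```

The two tight groups merge first at small distances, and only the final merge joins them at a large distance; in a dendrogram that last merge is the long branch separating the two main clusters.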

Step 7 · Visualise Clusters

Use the Image Viewer to explore each cluster selected on the dendrogram — you'll notice dogs grouped together and cats in another cluster.

Practical: Cluster images of birds vs other animals. Collect enough labelled (or unlabelled) images of several species, import them into Orange, run Image Embedding → Distance → Hierarchical Clustering, and interpret the resulting dendrogram for patterns and similarities.

2.4 Natural Language Processing with Orange

Step 1 · Install Text Add-On

Options → Add-ons → Text → install → restart Orange to activate the NLP widgets.

Step 2 · Load or Create Textual Data

Corpus widget — to load a prepared dataset (e.g., articles, reviews).
Create Corpus widget — to type your own text directly into Orange.

Step 3 · Corpus Viewer

Connect Corpus Viewer to browse through the text, search for specific words (which it highlights) and preview documents.

Step 4 · Word Cloud — Visualise Word Frequencies

Connect the Word Cloud widget to the Corpus output. More frequent words are shown larger — a quick glimpse of prominent themes in the text.
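
The numbers behind a word cloud are simple word counts. A minimal sketch with Python's standard `collections.Counter`, using a made-up sentence:

```python
from collections import Counter

# Made-up sentence for illustration.
text = "the turtle and the rabbit ran a race and the turtle won"

# Count how often each whitespace-separated word occurs.
counts = Counter(text.split())
top = counts.most_common(3)
print(top)
```

On raw text like this, a filler word such as "the" has the highest count, so it would be drawn largest.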

Step 5 · Preprocess Text

Connect the Preprocess Text widget. It performs text normalisation:

Convert text to lowercase.
Tokenise into individual words.
Remove punctuation.
Filter out stop words (the, is, a, an…).
Optionally apply stemming or lemmatisation to reduce words to their base form.
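
The first four steps can be sketched in plain Python as follows. This is an illustration, not the Preprocess Text widget's implementation: the stop-word list is a tiny hypothetical one, tokenising is done on whitespace only, and the optional stemming/lemmatisation step is left out.

```python
import string

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "is", "a", "an", "and"}


def preprocess(text):
    """Lowercase, tokenise, strip punctuation and drop stop words."""
    lowered = text.lower()                                    # 1. lowercase
    tokens = lowered.split()                                  # 2. tokenise on whitespace
    # 3. remove punctuation from the start and end of each token
    stripped = [t.strip(string.punctuation) for t in tokens]
    # 4. drop empty tokens and stop words
    return [t for t in stripped if t and t not in STOP_WORDS]


tokens = preprocess("The Turtle and the Rabbit: a race is on!")
print(tokens)
```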

Step 6 · Visualise Cleaned Text

Connect Preprocess Text → Word Cloud again. Now only meaningful words appear — the main themes stand out clearly (e.g., in a story about a race, "Turtle" and "Rabbit" appear largest).

Practical — Build your own corpus: pick a story or article, create a corpus in Orange, apply text normalisation (lowercase, tokenise, remove stop words), generate a Word Cloud before and after applying stemming/lemmatisation, and compare the most-frequent words.

Check Your Progress — quick MCQ pointers:

Widget to see Accuracy, Precision, Recall, F1 Score → Test & Score.
Widget giving detailed TP/TN/FP/FN breakdown → Confusion Matrix.
Widget performing text normalisation (lowercase, tokenise, stop-word removal) → Preprocess Text.
Cross-validation is also called → Rotation Estimation.
Word Cloud — more frequent words appear larger → True.
Widget to convert raw images into numerical vectors → Image Embedding.
Widget to compare embeddings & compute similarities → Distance.
Lines linking widgets on the canvas → Connectors.
Open-source visual-programming tool for data viz + ML + mining → Orange.
Add-on to cluster images of 2-legged vs 4-legged animals → Image Analytics.

Quick Revision — Key Points to Remember

Data Mining = discovering trends, useful information and patterns from large datasets.
Orange Data Mining = free, open-source, component-based visual-programming tool for data-viz, ML, data mining and analysis.
5 Beneficiaries: Data Analysts & Scientists · Researchers · Educators/Students · Business Professionals · Open-Source Community.
Install: orangedatamining.com/download/ (Windows: Standalone Installer; Mac: the build matching your processor, e.g. Apple Silicon).
3 Components: Blank Canvas · Widgets · Connectors (lines between widgets showing data flow).
6 Widget Categories: Data · Transform · Visualize · Model · Evaluate · Unsupervised.
Key Data widgets: File · Data Table · SQL Table.
3 AI Domains with Orange: Data Science · Computer Vision · NLP.
Iris Data Science workflow: File → Data Table → Scatter Plot → Tree → Predictions → Test & Score → Confusion Matrix.
3 Iris types: Setosa · Versicolor · Virginica (150-sample dataset). Accuracy ≈ 93%.
Cross-Validation (a.k.a. Rotation Estimation) — default 10 folds in Orange.
Computer Vision workflow (Dogs vs Cats): Image Analytics add-on → Import Images → Image Viewer → Image Embedding → Distance (cosine) → Hierarchical Clustering → Dendrogram.
Embedding = numerical vector of an image computed by a deep NN trained on millions of images.
NLP workflow: Text add-on → Corpus / Create Corpus → Corpus Viewer → Word Cloud → Preprocess Text → Word Cloud (cleaned).
Preprocess Text steps: lowercase · tokenise · remove punctuation · remove stop words · (optional) stemming/lemmatisation.
Word Cloud: more frequent words appear larger.
