PART B ▪ UNIT 4 · Statistical Data

Data Science · No-Code AI · Orange Data Mining · Statistics

Data Science is the field that unifies statistics, data analysis, machine learning and related methods to understand and analyse real-world phenomena with data. It uses techniques from Mathematics, Statistics, Computer Science and Information Science. This unit introduces No-Code AI tools (Orange Data Mining, MS Excel) to work with statistical data — so you can build AI models without writing code.

This unit is practical-based. It has two sub-units: (4.1) Introduction to Data Science + No-Code AI (Orange Data Mining, Lobe, Teachable Machine, etc.). (4.2) Statistical Data Use-Case Walkthrough — important statistical concepts + MS Excel + full Orange workflow with Palmer Penguins case study.

Introduction — Data Science & Its Applications

AI depends entirely on data. Based on the type of data fed to a machine, AI splits into three broad domains — Data Science (statistical data), Computer Vision, and NLP. Data Science is the foundation that makes machines understand numbers, patterns and trends.

🔹 Real-World Applications of Data Science

🔎 Internet Search: Google, Bing and Yahoo all use data-science algorithms to deliver the best result in a fraction of a second. Google processes more than 20 petabytes of data daily.

📣 Targeted Advertising: From display banners to digital billboards at airports, almost all digital ads are placed using data-science algorithms. This is why digital ads have higher Click-Through Rates (CTR) than traditional ones.

🛒 Website Recommendations: Amazon, Twitter, Google Play, Netflix, LinkedIn and IMDb all recommend products or content based on a user's past searches and activity.

🧬 Genetics & Genomics: Data science integrates different kinds of genomic data to understand how individuals react to drugs and diseases, enabling personalised treatment.

🏥 Healthcare Analytics: Predicting patient outcomes, drug discovery and hospital resource planning are all driven by statistical AI.

💳 Fraud Detection: Banks analyse transaction patterns in milliseconds to flag fraudulent purchases.

📈 Stock Market: Statistical modelling is used to predict prices, trends and volatility.

🌤️ Weather Forecasting: The India Meteorological Department (IMD) and weather apps use historical and real-time data to forecast rain, storms and temperature.

Learning Outcome 1: Define No-Code / Low-Code AI and identify differences

4.1 No-Code AI Tool — Introduction

Imagine: You want to build a food-delivery application. How do you start? Three approaches exist — High-Code, Low-Code, No-Code.

🧑‍💻 High-Code · Low-Code · No-Code Comparison

Aspect	High-Code	Low-Code	No-Code
What it is	Traditional development — coders write all code manually using Java, Python, C#.	Platforms with visual interfaces + pre-built components. Some manual code still needed.	Create applications without any coding or scripting. Drag-and-drop only.
Coding knowledge	Mandatory — deep expertise needed.	Partial — developers write some code.	Not required — anyone can build.
Cost	Expensive.	Less expensive than high-code.	Least expensive.
Customisation	Full — you own the product, can make anything.	Limited customisation.	Lacks customisable options — limited to tool's built-in functions.
Ease of use	Complex — needs coders.	Moderate.	Simple — drag & drop.
Example	Custom chatbot built from scratch.	Mendix, OutSystems, Microsoft Power Apps.	Orange Data Mining, Teachable Machine, Lobe.

Custom code is also known as high-code. The company owns everything they build.

4.2 Why Do We Need No-Code AI?

🐞 No Code Errors: Coding invites many kinds of errors, which can be troublesome. With no code to write, there are no coding errors to debug.

💰 Saves Cost: Fully coded AI systems are costly to build. No-Code helps businesses cut expenses.

👥 No AI Hiring Needed: Companies can implement AI without hiring specialised AI staff.

🎓 Easy to Use: Even middle-school students can create AI using No-Code tools.

👁️ Visual / Drag & Drop: You see what you're building in real time through an intuitive interface.

⚡ Fast Prototyping: Build and test AI ideas in minutes, not weeks.

4.3 Who Can Use No-Code AI?

No-Code AI makes AI accessible to the general public. Non-technical people such as doctors, architects, musicians and teachers can quickly build working AI models without any coding.

Scenario — Kayla the Zoo Dietitian:
Kayla manages the food budget at a zoo. With rising prices of meat and vegetables, the zoo wants to predict future food prices to raise sponsorship. Kayla has never coded, but using a No-Code tool like Orange Data Mining, she builds a price-prediction model herself!

4.4 Benefits of No-Code Tools

Accessibility — anyone, even non-tech people, can use AI.
Cost-effective — much cheaper than writing code.
Time-saving — drag-and-drop builds models in minutes.
Visual learning — easy to understand workflows.
Real-time feedback — see results as you build.
Low risk of errors — tool handles coding automatically.
Easy iteration — change parameters, re-run in seconds.

4.5 Disadvantages of No-Code Tools

🔗 Lack of Flexibility: Drag-and-drop is convenient, but you are limited to fixed elements, so customisation is restricted.

🤖 Automation Bias: Humans tend to favour suggestions from automated systems and ignore contradictory information, even when the automation is wrong.

🔒 Security Issues: No-code platforms don't always enforce security best practices, so they are not ideal for sensitive data.

4.6 Popular No-Code AI Tools

Tool	Released	Details
Azure Machine Learning	July 2014	Cloud-based service by Microsoft. Build ML models without coding, clean data, train and evaluate models, put into production.
Google Cloud AutoML	January 2018	Users with limited ML knowledge can train high-quality custom models specific to business needs. Build in minutes and use in apps and websites.
Orange Data Mining	October 1996	Open-source data-visualisation + ML + data-mining toolkit by University of Ljubljana. Perform data analysis through drag-and-drop widgets.
Lobe AI	2015	Machine-learning platform to create custom ML models using a visual interface. Train models with a free, easy-to-use tool that auto-trains a custom model shippable in an app.
Teachable Machine	November 2017	Web-based tool by Google. Train a computer to recognise your own images, sounds and poses. No expertise or coding required.
DataRobot	2012	Automated ML platform for enterprise-grade models.

4.7 Building a Simple Price Prediction Model — Orange Data Mining

Let's help Kayla build a food-price prediction model in Orange Data Mining.

🔄 7-STEP WORKFLOW IN ORANGE

1. Download Dataset ➜ 2. Open Orange ➜ 3. Upload Dataset (File widget) ➜ 4. View Dataset (Data Table)

5. Select Model (Linear Regression) ➜ 6. Evaluate (Test & Score) ➜ 7. Prediction

Step	Action
Step 1	Download the dataset from fao.org/worldfoodsituation/foodpricesindex/
Step 2	Double-click the Orange icon to open the tool.
Step 3	Click the File widget under Data Menu → appears on canvas. Click to browse and upload the dataset. Select Food Price Index as the target variable.
Step 4	Click the Data Table widget under Data Menu. Connect File → Data Table. Click to view the dataset.
Step 5	Click Linear Regression widget under Model Menu. Connect File → Linear Regression.
Step 6	Click Test & Score widget under Evaluate Menu. Connect File + Linear Regression → Test & Score. View performance parameters.
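Step 7	Click the Prediction widget (named Predictions in recent Orange releases) under Evaluate Menu. Connect Linear Regression → Prediction to supply the trained model, and File → Prediction to supply the data. Click to view predicted prices.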

Kayla now has a model that can predict future food prices — without writing a single line of code! She can now make a systematic fund-raising plan for the zoo.
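
For readers curious about what the widgets do behind the scenes, the workflow above can be sketched in plain Python. This is only an illustration under stated assumptions: the index values are made up (they are not FAO data), the helper names fit_line and predict are hypothetical, and a simple hold-out split stands in for Test & Score, which offers other sampling options as well.

```python
# Hypothetical monthly Food Price Index values (illustrative, not FAO data).
months = list(range(1, 11))
index = [100, 102, 105, 107, 110, 112, 115, 117, 120, 122]

# Steps 3-4: load the data and hold back the last two months for testing.
train_x, test_x = months[:8], months[8:]
train_y, test_y = index[:8], index[8:]


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Step 5: train the linear regression model on the training months.
slope, intercept = fit_line(train_x, train_y)


def predict(month):
    """Predicted index value for a given month number."""
    return slope * month + intercept


# Step 6: evaluate with the mean absolute error on the held-back months.
errors = [abs(predict(x) - y) for x, y in zip(test_x, test_y)]
mae = sum(errors) / len(errors)

# Step 7: predict a month the model has never seen.
forecast = predict(11)

print(f"slope={slope:.3f} intercept={intercept:.3f}")
print(f"mean absolute error on test months: {mae:.3f}")
print(f"forecast for month 11: {forecast:.2f}")
```

A small error on the held-back months suggests the straight line describes this toy series well; real price data is usually noisier.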

4.8 Other No-Code Tools — Lobe & Teachable Machine

🎨 Lobe

Makes ML easy with everything needed to turn ideas into models.
Train models with a free, easy-to-use tool.
Auto-trains custom ML model shippable in your app.

🎯 Teachable Machine

Web-based tool by Google.
Train a computer to recognise images, sounds, poses.
Fast, easy, accessible to everyone — no expertise or coding required.

Learning Outcome 2: Statistical concepts · MS Excel · Orange Data Mining Use Case

4.9 Important Concepts in Statistics

📊 1. Statistical Sampling

Population = the entire set of raw data available for a test or experiment.
You cannot always measure patterns and trends across the entire population.
Take a sample — a portion of the population — and perform computations on it.
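
A minimal sketch of sampling, assuming a made-up population of 1,000 body-mass values generated at random. The numbers and variable names are illustrative only; the point is that the mean of a small random sample lands close to the mean of the whole population.

```python
import random
import statistics

# Hypothetical population: body masses (g) of 1,000 imaginary animals.
rng = random.Random(42)  # fixed seed so the run is repeatable
population = [rng.gauss(4200, 800) for _ in range(1000)]

# Draw a simple random sample of 50 values without replacement.
sample = rng.sample(population, 50)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"Population mean: {population_mean:.1f}")
print(f"Sample mean:     {sample_mean:.1f}")
```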

🎯 2. Descriptive Statistics — Mean, Median, Mode

Describe the data and help understand its underlying characteristics.

📏 Mean: The central / average value. Sum of all values ÷ number of values.

🎯 Median: The middle value when the data is ordered from low to high. With an even number of values, it is the average of the two middle values.

🔢 Mode: The value which occurs most often in the dataset.

For data 2, 3, 3, 5, 7, 8, 9 →
• Mean = (2+3+3+5+7+8+9) / 7 = 37 / 7 ≈ 5.29
• Median = 5 (middle value)
• Mode = 3 (appears twice)
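
The same three measures can be checked with Python's built-in statistics module. This is just a quick verification sketch for the seven values above.

```python
import statistics

data = [2, 3, 3, 5, 7, 8, 9]

mean_value = statistics.mean(data)      # 37 / 7
median_value = statistics.median(data)  # 4th of the 7 sorted values
mode_value = statistics.mode(data)      # most frequent value

print(round(mean_value, 2), median_value, mode_value)  # prints: 5.29 5 3
```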

📈 3. Distributions

Distribution = how often each value occurs in a dataset, usually displayed as a chart or graph such as a histogram.
Skewed distribution = values bunch up on one side with a long tail on the other, often because a few values are much larger (or smaller) than the rest.
Normal Distribution = symmetrical, bell-shaped, with most values around the central peak.
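
A small sketch of a frequency distribution, assuming ten made-up quiz scores. It prints a text histogram and compares the mean with the median; a mean well above the median is one informal hint that the data is skewed to the right.

```python
from collections import Counter
import statistics

# Hypothetical quiz scores; the single 30 creates a long right tail.
scores = [4, 5, 5, 6, 6, 6, 7, 7, 8, 30]

# Count how often each score occurs and draw one '#' per occurrence.
frequency = Counter(scores)
for value in sorted(frequency):
    print(f"{value:>2} | {'#' * frequency[value]}")

mean_score = statistics.mean(scores)
median_score = statistics.median(scores)

# The large value pulls the mean up, while the median barely moves.
print(f"mean={mean_score} median={median_score}")
```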

🎲 4. Probability

Probability = likelihood of an event occurring.
An event is an outcome, or a set of outcomes, of an experiment.
Events can be independent (one doesn't affect the other) or dependent.
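
To make independence concrete, here is a sketch that lists every outcome of tossing two fair coins and counts the favourable ones. The helper name probability is hypothetical. For independent events, the probability of both happening equals the product of the individual probabilities.

```python
from fractions import Fraction
from itertools import product

# Sample space for tossing two fair coins: HH, HT, TH, TT.
outcomes = list(product("HT", repeat=2))


def probability(event):
    """Share of equally likely outcomes for which event(outcome) is true."""
    favourable = [o for o in outcomes if event(o)]
    return Fraction(len(favourable), len(outcomes))


p_first_heads = probability(lambda o: o[0] == "H")
p_second_heads = probability(lambda o: o[1] == "H")
p_both_heads = probability(lambda o: o == ("H", "H"))

print(p_first_heads, p_second_heads, p_both_heads)  # prints: 1/2 1/2 1/4

# Independent events: P(A and B) equals P(A) * P(B).
print(p_both_heads == p_first_heads * p_second_heads)  # prints: True
```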

📉 5. Variance, Standard Deviation, Outlier

Variance = the average of the squared differences between each value and the mean. It measures the spread of the numbers.
Standard Deviation = the square root of the variance: a single value, in the same units as the data, that represents how widely the values are spread.
Outlier = a data point that lies at an abnormal distance from other values.
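
The sketch below computes the population variance and standard deviation for the values 2, 3, 3, 5, 7, 8, 9, then flags outliers. The "more than 2 standard deviations from the mean" rule is an assumption chosen for illustration (other rules exist), and the helper name find_outliers is hypothetical.

```python
import statistics

data = [2, 3, 3, 5, 7, 8, 9]

variance = statistics.pvariance(data)  # mean of squared deviations
std_dev = statistics.pstdev(data)      # square root of the variance

print(f"variance={variance:.2f} std_dev={std_dev:.2f}")


def find_outliers(values, threshold=2.0):
    """Values lying more than `threshold` standard deviations from the mean."""
    mean_value = statistics.mean(values)
    spread = statistics.pstdev(values)
    return [v for v in values if abs(v - mean_value) > threshold * spread]


print(find_outliers(data))         # no value is far from the rest
print(find_outliers(data + [40]))  # the added 40 stands apart
```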

4.10 MS Excel for Statistical Analysis

MS Excel is one of the most accessible statistical tools. With the Analysis ToolPak add-in, Excel can perform regression, histograms, descriptive statistics, and much more.

Activity — Speed vs Distance Linear Regression in Excel:

Step 1 — Get the Add-in: File → Options → Add-ins → set Manage to Excel Add-ins → Go → tick Analysis ToolPak → OK. The Data Analysis option then appears on the Data tab.
Step 2 — View Data: Identify independent (X = Speed) and dependent (Y = Distance) features.
Step 3 — Visualise: Select both columns → Insert → Charts → Scatter. Add chart title "Distance vs Speed".
Step 4 — Add Regression Line: Click scatter plot → Chart Design → Add Chart Element → Trendline → More Trendline Options → Linear → tick "Display Equation on chart" and "Display R-squared value on chart".
Step 5 — Verify Coefficients: Data → Data Analysis → Regression → Y Range (Distance column with label) → X Range (Speed column with label) → tick Labels → choose output cell → OK. Summary stats appear.
Step 6 — Predict: Use the generated equation y = mx + c. For Speed = 6, calculate Distance.
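
The trendline equation and R-squared value from Steps 4 to 6 can be reproduced by hand with the least-squares formulas. The sketch below assumes five made-up speed and distance readings (the activity's actual data is not shown here), and the helper names fit_line and r_squared are hypothetical.

```python
# Hypothetical readings: X = Speed, Y = Distance (illustrative values only).
speed = [1, 2, 3, 4, 5]
distance = [2.1, 3.9, 6.2, 7.8, 10.0]


def fit_line(xs, ys):
    """Least-squares slope m and intercept c for y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


def r_squared(xs, ys, m, c):
    """Share of the variation in y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


m, c = fit_line(speed, distance)
r2 = r_squared(speed, distance, m, c)

# Step 6: plug Speed = 6 into y = m*x + c.
predicted_distance = m * 6 + c

print(f"y = {m:.2f}x + {c:.2f}, R^2 = {r2:.4f}")
print(f"Predicted distance at speed 6: {predicted_distance:.2f}")
```

For these sample readings the fitted line is roughly y = 1.97x + 0.09, so a speed of 6 gives a distance of about 11.91.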

4.11 Orange Data Mining — What is It?

Orange Data Mining (ODM) is an open-source data-mining and machine-learning software suite designed for data analysis, visualisation and exploration. It has a graphical user interface (GUI) that lets users interactively build data-analysis workflows using components called widgets.

🔹 Key Features of Orange

A machine-learning tool for data analysis through Python + visual programming.
Perform operations on data through simple drag-and-drop steps.
Visualise data; perform data mining and machine learning.
No code required — use without writing a single line.
Relatively easy with beautiful visuals.
Open-source and free.

4.12 Orange Widgets — By Category

📥 1. Data Loading Widgets

Bring your data into Orange from files or online sources:

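File — loads data from local files such as tab-delimited text, CSV and Excel (SQL databases are read through the separate SQL Table widget).
URL — loads data from a URL; in recent Orange releases this is a field inside the File widget rather than a separate widget.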
Data Table — displays loaded data in tabular format.

🔍 2. Data Exploration Widgets

Look at data in different ways to spot patterns:

Scatter Plot — visualises the relationship between two variables.
Data Table — manual inspection and exploration.
Distributions — histograms and statistical distributions of variables.

🧼 3. Preprocessing Widgets

Clean up data before modelling:

Impute — handles missing values.
Normalize — scales data to a common scale.
Select Columns — pick specific columns.

🎯 4. Feature Selection Widgets

Select Columns — choose relevant features.
Select Best Features — auto-selects best features using criteria like mutual information or correlation.

🤖 5. Modelling Widgets

Build models from your data:

Classification Tree — decision-tree classifier.
k-Means — clustering algorithm.
Support Vector Machine (SVM) — classifier.
Logistic Regression — classification model.
Linear Regression — regression model.

📊 6. Evaluation Widgets

Check model performance:

Test & Score — evaluates model on a test dataset.
Cross Validation — assesses model performance.
ROC Curve — plots receiver-operating-characteristic curve for binary classifiers.

📈 7. Visualization Widgets

Turn data into visual representations:

Bar Chart — displays data in bar-chart format.
Heat Map — visualises data using a heatmap.
Scatter Plot — 2-variable relationship.

4.13 AI Project Cycle Mapped to Orange Data Mining

AI Project Cycle Stage	Orange Widget / Action
1. Problem Scoping	Define the problem statement (done outside Orange — what are you predicting?).
2. Data Acquisition	File / URL widget to load the dataset.
3. Data Exploration	Data Table, Scatter Plot, Distributions widgets to explore.
4. Modeling	Linear Regression, Classification Tree, k-Means, SVM, Logistic Regression widgets.
5. Evaluation	Test & Score, Cross Validation, ROC Curve widgets.
6. Deployment	Prediction widget; then integrate into an application.

4.14 Case Study — Palmer Penguins

About the dataset: The Palmer Penguins dataset records measurements of three penguin species observed on islands of the Palmer Archipelago, in the Antarctic Peninsula region. Researchers there study penguin behaviour, habitat, population dynamics and the effects of climate change. The dataset is a popular alternative to the famous Iris dataset and is available on Kaggle.

🎯 Stage 1 — Problem Scoping

The researchers want to predict the species of Palmer Penguins based on collected data. The dataset has three species — Adelie, Chinstrap, Gentoo — and physical features differ across them.

📥 Stage 2 — Data Acquisition

Features in the dataset include:

Species — Adelie / Chinstrap / Gentoo (target label).
Bill Length (mm).
Bill Depth (mm).
Flipper Length (mm).
Body Mass (g).
Island — where the penguin was observed.
Sex — male / female.

🔍 Stage 3 — Data Exploration

Load dataset via File widget → connect to Data Table to inspect.
Use Scatter Plot — check how Bill Length vs Flipper Length clusters the three species.
Use Distributions — see how Body Mass differs across species.
Look for missing values → use Impute if needed.

🤖 Stage 4 — Modelling

Since "species" is a category, this is a Classification problem.
Use widgets like Classification Tree, k-Nearest Neighbours, or Logistic Regression.
Connect File → Classifier → Test & Score, and also connect File → Test & Score so that the evaluator receives the data.
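
To show what a classifier does with these features, here is a minimal k-Nearest Neighbours sketch in plain Python. The nine training rows are illustrative values invented for this example (they are not taken from the real dataset), only two features are used, and the function name classify is hypothetical.

```python
import math
from collections import Counter

# Illustrative rows: (bill length mm, flipper length mm) -> species.
training = [
    ((39.1, 181.0), "Adelie"),
    ((38.6, 190.0), "Adelie"),
    ((40.3, 195.0), "Adelie"),
    ((49.5, 196.0), "Chinstrap"),
    ((50.2, 200.0), "Chinstrap"),
    ((48.7, 193.0), "Chinstrap"),
    ((46.1, 215.0), "Gentoo"),
    ((47.8, 220.0), "Gentoo"),
    ((45.5, 212.0), "Gentoo"),
]


def classify(features, k=3):
    """Majority species among the k training rows closest to `features`."""
    by_distance = sorted(training, key=lambda row: math.dist(row[0], features))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


print(classify((39.5, 186.0)))
print(classify((49.8, 197.0)))
print(classify((46.5, 216.0)))
```

Because flipper length spans a wider numeric range than bill length, it dominates the distance here; normalising the features first is a common way to balance them.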

📊 Stage 5 — Evaluation

Use Test & Score — check Accuracy, Precision, Recall, F1 for each species.
Use Confusion Matrix to see which species is being confused with which.
If accuracy is poor, try different algorithms or tune parameters.
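
The scores reported in this stage can be worked out from two lists: the actual species and the predicted species. The sketch below uses ten made-up predictions, and the helper name class_metrics is hypothetical; it builds the confusion counts and computes accuracy plus precision, recall and F1 for each species.

```python
from collections import Counter

# Hypothetical results for ten penguins (A = Adelie, C = Chinstrap, G = Gentoo).
actual = ["A", "A", "A", "A", "C", "C", "C", "G", "G", "G"]
predicted = ["A", "A", "A", "C", "C", "C", "A", "G", "G", "G"]

# Confusion counts: how often each (actual, predicted) pair occurs.
confusion = Counter(zip(actual, predicted))

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def class_metrics(label):
    """Precision, recall and F1 for one species."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == label and p == label)
    fp = sum(1 for a, p in pairs if a != label and p == label)
    fn = sum(1 for a, p in pairs if a == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


print(f"accuracy = {accuracy:.2f}")
for species in ["A", "C", "G"]:
    precision, recall, f1 = class_metrics(species)
    print(f"{species}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy result one Adelie is mistaken for a Chinstrap and one Chinstrap for an Adelie, which is exactly the kind of confusion the matrix makes visible.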

🚀 Stage 6 — Prediction / Deployment

Feed new unseen penguin measurements into the trained model.
The Prediction widget outputs the predicted species.
Export the trained model for use in a real app — e.g., a wildlife-research mobile app.

4.15 Limitations of No-Code AI Tools

Only work for standard problem types — custom problems still need code.
Limited customisation — fixed widgets and options.
May not scale for very large datasets.
Security concerns with sensitive data.
Automation bias — users may trust results without questioning.
Not suitable for complex Deep-Learning research.

Quick Revision — Key Points to Remember

Data Science = unifies statistics + data analysis + ML + their related methods.
Applications: internet search (Google processes 20 PB/day), targeted ads, recommendations (Amazon/Netflix), genetics & genomics, healthcare, fraud detection, stock market, weather.
3 coding approaches: High-Code (manual coding) · Low-Code (visual + some code) · No-Code (drag-drop only, no coding).
Why No-Code AI: no code errors, saves cost, no AI hiring, easy to use, visual, fast prototyping.
Who uses No-Code: non-technical people — doctors, architects, musicians, teachers.
Benefits: accessible, cost-effective, time-saving, visual, low-risk, easy iteration.
Disadvantages: lack of flexibility · automation bias · security issues.
Popular No-Code Tools: Azure ML (2014) · Google Cloud AutoML (2018) · Orange Data Mining (1996) · Lobe AI (2015) · Teachable Machine (2017) · DataRobot (2012).
Kayla's Zoo Example: food-price prediction built in Orange without coding using 7 steps.
Statistics concepts: Population & Sample · Mean (average) · Median (middle) · Mode (most frequent) · Distribution (frequency chart) · Normal distribution · Probability · Variance · Standard Deviation · Outlier.
MS Excel: Analysis ToolPak add-in → scatter plot → trendline → regression equation (y = mx + c).
Orange Data Mining = open-source drag-and-drop data-mining tool by Uni of Ljubljana.
Orange widget categories: Data Loading (File/URL/Data Table) · Exploration (Scatter Plot, Distributions) · Preprocessing (Impute, Normalize) · Feature Selection · Modelling (Classification Tree, k-Means, SVM, Linear/Logistic Regression) · Evaluation (Test & Score, Cross Validation, ROC) · Visualization (Bar Chart, Heat Map).
AI Project Cycle in Orange: Problem Scoping → File widget → Data Table/Scatter → Model widget → Test & Score → Prediction.
Palmer Penguins Case Study: classify species (Adelie, Chinstrap, Gentoo) using Bill Length, Bill Depth, Flipper Length, Body Mass, Island, Sex.
No-Code limitations: limited customisation, scalability issues, security concerns, automation bias.
