PART B ▪ UNIT 5 · Introduction to Big Data and Data Analytics

Types · 6 Vs · Analytics · Data Streams · Future Trends

Big Data — extremely large and complex datasets that regular computer programs and databases cannot handle. Special tools (Hadoop, Spark, Orange, Tableau) are required to store, process and extract valuable insights from Big Data.

Introduction — Why Big Data Matters

In today's digital age, Big Data is a game-changer. Understanding its types, characteristics, and analytics helps us manage vast information, discover patterns, and unlock new possibilities in fields as diverse as healthcare, finance, entertainment, and environmental science.

🔹 Key Concepts You'll Learn

Introduction to Big Data
Types of Big Data
Advantages and Disadvantages of Big Data
Characteristics of Big Data (6 Vs)
Big Data Analytics
Working on Big Data Analytics
Mining Data Streams
Future of Big Data Analytics

Prerequisites: Understanding of the concept of data and reasonable fluency in English.

Learning Outcome 1: Define Big Data and identify its various types

1.1 What Is Big Data? — Small Data vs Big Data

Small Data refers to datasets easily comprehensible by people — accessible, informative, actionable. Example: a small store tracks daily sales to decide what to restock.

Big Data refers to extremely large and complex datasets that regular programs cannot handle. It comes from three main sources:

Transactional data — online purchases, banking transactions.
Machine data — sensor readings, IoT devices.
Social data — social media posts, tweets, comments.

Amazon and Netflix use Big Data to recommend products or shows based on users' past activities.

1.2 Types of Big Data

Aspect	Structured	Semi-Structured	Unstructured
Definition	Quantitative data with defined structure	Mix of quantitative & qualitative properties	No inherent structure or formal rules
Data model	Dedicated data model	May lack a specific data model	Lacks a consistent data model
Organisation	Clearly defined columns	Less organised than structured	Organisation exhibits variability over time
Accessibility	Easily accessible & searchable	Depends on the specific data format	Accessible but harder to analyse
Examples	Customer info, transaction records, product directories, CSV and Excel tables	XML, JSON, HTML, semi-structured documents	Audio, images, videos, emails, social media posts
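The three types can be contrasted with a small sketch. All table names, field names and sample values below are hypothetical and exist only to illustrate the difference in how each kind of data is accessed.

```python
import json
import sqlite3

# Structured: a table with clearly defined columns, easy to query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id INTEGER, city TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(101, "Pune", 250), (102, "Delhi", 400)],
)
total_amount = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
conn.close()

# Semi-structured: fields are tagged, but records need not share one shape.
json_text = '[{"user": "asha", "tags": ["sale", "shoes"]}, {"user": "ravi"}]'
records = json.loads(json_text)
tag_counts = [len(record.get("tags", [])) for record in records]

# Unstructured: free text with no fields; it can only be scanned as a whole.
post = "Loved the new shoes, delivery was quick!"
word_count = len(post.split())

print(total_amount, tag_counts, word_count)
```

The structured data answers a question with one query, the semi-structured data needs a check for missing fields, and the unstructured text offers nothing to query until it is analysed further.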

Learning Outcome 2: Evaluate the advantages and disadvantages of Big Data

2.1 Advantages of Big Data — 5 Key Benefits

🎯 1. Enhanced Decision-Making

Big Data analytics empowers organisations to make data-driven decisions based on insights derived from large & diverse datasets.

⚙️ 2. Improved Efficiency & Productivity

Analysing vast data helps businesses identify inefficiencies, streamline processes and optimise resource allocation.

👥 3. Better Customer Insights

Deeper understanding of customer behaviour, preferences and needs — enabling personalised marketing and improved customer experiences.

🏆 4. Competitive Advantage

Uncovers market trends, identifies opportunities, helps stay ahead of competitors.

💡 5. Innovation & Growth

Facilitates new products, services and business models based on data-driven insights — driving growth and expansion.

2.2 Disadvantages of Big Data — 5 Challenges

🔐 1. Privacy and Security Concerns

Large volumes of data raise risks of unauthorised access, data breaches and misuse of personal information.

❓ 2. Data Quality Issues

Ensuring accuracy, reliability and completeness is hard — Big Data often contains unstructured and heterogeneous sources, leading to errors and biases.

🛠️ 3. Technical Complexity

Implementing and managing Big Data infrastructure and analytics tools requires specialised skills & expertise.

📜 4. Regulatory Compliance

Organisations must meet data-protection laws like GDPR (General Data Protection Regulation) and India's Digital Personal Data Protection Act, 2023. Non-compliance invites legal risks and penalties.

💰 5. Cost & Resource Intensiveness

Acquisition, storage, processing, analytics and hiring skilled staff are expensive — a burden on smaller organisations with limited budgets.

Learning Outcome 3: Recognize the characteristics of Big Data

3.1 The 3V and 6V Frameworks

Characteristics of Big Data start with the 3Vs — Volume, Velocity, Variety — and extend to the 6Vs by adding Veracity, Variability, Value.

⏩ 1. Velocity

Speed at which data is generated, delivered and analysed. Millions of people create and share information online every moment, so data arrives at enormous speed.

By a commonly cited estimate, Google alone handles 40,000+ search queries per second.

📦 2. Volume

The sheer amount of data generated every day. Data crossing gigabytes falls into the realm of Big Data, with sizes ranging from terabytes → petabytes → exabytes.

By one widely cited estimate, about 328.77 million terabytes (roughly 0.33 zettabytes) of data are created each day.
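To get a feel for the scale of the Velocity and Volume figures quoted above, the arithmetic can be worked through in a few lines. This is a rough sketch that assumes decimal (SI) units, where 1 terabyte = 10^12 bytes and 1 zettabyte = 10^21 bytes, and that the quoted rates hold steadily all day.

```python
TB = 10**12  # bytes in one terabyte (decimal SI units, an assumption)
ZB = 10**21  # bytes in one zettabyte

# Volume: 328.77 million terabytes per day, expressed in zettabytes.
daily_bytes = 328.77e6 * TB
daily_zb = daily_bytes / ZB
yearly_zb = daily_zb * 365

# Velocity: 40,000 searches per second, expressed per day.
searches_per_day = 40_000 * 60 * 60 * 24

print(round(daily_zb, 3), "ZB per day")
print(round(yearly_zb), "ZB per year")
print(searches_per_day, "searches per day")
```

Under these assumptions the daily volume is about a third of a zettabyte, and the search rate adds up to roughly 3.5 billion queries a day.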

🗂️ 3. Variety

Big Data spans many formats — structured, unstructured, semi-structured and highly complex. Ranges from simple numbers to text, images, audio, video. Unstructured data is hard to store in RDBMS but often provides the most valuable insights.

✅ 4. Veracity

Concerns consistency, accuracy, quality and trustworthiness of data. Not all data holds value — data must be cleaned before storage or analysis.

💎 5. Value

The business value derived from the data — perhaps the most critical characteristic. Without valuable insights, the other Vs hold little significance.

🔀 6. Variability

How much the data stream changes in flow, format and meaning over time. A Big Data system must still extract meaningful data despite these fluctuating conditions.

3.2 Case Study — OnDemandDrama (OTT Platform)

🔹 3V Framework

Volume — huge data from millions of users: watch history, ratings, searches, preferences.
Velocity — data processed in real time to adjust recommendations and surface trending content.
Variety — diverse data: user profiles (structured), watch lists, video content, reviews — a mix of structured / semi-structured / unstructured.

🔹 Additional 3 Vs (6V Framework)

Veracity — filters out irrelevant or low-quality data (e.g., incomplete profiles) for accurate recommendations.
Value — personalises experiences to drive engagement and retention.
Variability — handles inconsistencies in data streams from changing user behaviour, trends, events or regions.

Learning Outcome 4: Explain the concept of Big Data Analytics and its significance

4.1 Data Analytics vs Big Data Analytics

Data Analytics — analysing datasets of any size to uncover insights, trends and patterns. Tools include statistical software, data-visualisation tools, RDBMS.

Big Data Analytics — uses advanced analytic techniques against huge, diverse datasets (structured/semi-/unstructured) from many sources and sizes ranging from terabytes to zettabytes.

Big-Data Analytics covers the methodologies, tools and practices for data collection, organisation, storage, statistical analysis — used in business to refine processes, enhance decision-making and improve performance.

4.2 4 Types of Big Data Analytics (Recap)

From Unit 2 — Descriptive · Diagnostic · Predictive · Prescriptive.

4.3 4 Global Trends Driving Big Data Analytics

Moore's Law — exponential growth of computing power enables handling & analysing massive datasets.
Mobile Computing — widespread smartphones give real-time access to vast data from anywhere.
Social Networking — Facebook, Foursquare, Pinterest generate massive user-generated content and interactions.
Cloud Computing — access hardware/software remotely on a pay-as-you-go basis, no heavy on-premises investment.

Learning Outcome 5: Describe how Big Data Analytics works

5.1 4 Steps of Big Data Analytics

📥 Step 1 · Gather Data

Collect structured and unstructured data from various sources — cloud storage, mobile apps, IoT sensors, social feeds, internal databases.

⚙️ Step 2 · Process Data

Once collected, data must be processed correctly — especially when large and unstructured. Two common options:

Batch Processing — examines large blocks of data over time.
Stream Processing — processes small batches at once, giving quicker insight and shorter delay.
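The difference between the two options can be sketched with a simple average. The sensor readings are hypothetical, and the stream version handles one reading at a time, which is the smallest possible batch.

```python
from typing import Iterable, Iterator, List


def batch_average(readings: List[float]) -> float:
    """Batch processing: wait for the whole block, then compute once."""
    return sum(readings) / len(readings)


def stream_averages(readings: Iterable[float]) -> Iterator[float]:
    """Stream processing: update the answer as each reading arrives."""
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count  # an up-to-date result after every reading


sensor_readings = [20.0, 22.0, 21.0, 25.0]  # hypothetical temperatures
batch_result = batch_average(sensor_readings)
stream_results = list(stream_averages(sensor_readings))

print(batch_result)
print(stream_results)
```

Both approaches agree once all the data has arrived, but the stream version already has a usable answer after the first reading.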

🧹 Step 3 · Clean Data

Scrubbing all data improves quality and results. Correct formatting and elimination of duplicate or irrelevant entries are essential. Erroneous or missing data lead to inaccurate insights.
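A minimal sketch of such a cleaning pass is shown here. The record layout (`name`, `city`) and the sample rows are hypothetical, and dropping incomplete rows is only one possible choice; filling them in is another.

```python
def clean_records(records):
    """Fix formatting, drop incomplete rows, and remove duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        # Correct formatting: trim spaces and use consistent capitalisation.
        name = (record.get("name") or "").strip().title()
        city = (record.get("city") or "").strip().title()
        if not name or not city:
            continue  # a required field is missing, so skip the row
        key = (name, city)
        if key in seen:
            continue  # same record seen before, so skip the duplicate
        seen.add(key)
        cleaned.append({"name": name, "city": city})
    return cleaned


raw = [
    {"name": " asha ", "city": "pune"},
    {"name": "Asha", "city": "Pune "},  # duplicate once formatting is fixed
    {"name": "Ravi", "city": None},     # missing city
    {"name": "meena", "city": "DELHI"},
]
cleaned = clean_records(raw)
print(cleaned)
```

Note that the duplicate only becomes visible after the formatting is fixed, which is why formatting comes first.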

📈 Step 4 · Analyze Data

Once the data is ready, advanced analytics turns Big Data into Big Insights.

🔹 Popular Big Data Analytics Tools

Tableau — data visualisation.
Apache Hadoop — distributed file system + processing engine for massive datasets.
Cassandra — NoSQL distributed database.
MongoDB — NoSQL document database.
SAS — Statistical Analysis System.

5.2 Big Data Analytics with Orange — Heart Disease Dataset

📂 Step 1 · Gather Data

Use the File widget to load Orange's built-in Heart Disease dataset. Features: age, gender, chest pain, rest_sbp (resting systolic blood pressure), cholesterol, rest_ecg, max_hr, etc. Target: diameter narrowing (1 = narrowed arteries / heart-disease risk, 0 = healthy).

🧪 Step 2 · Process Data — Normalization

Use the Preprocess widget to normalise features (scale numerical values to a specific range such as 0–1 or −1 to +1). Connect a Data Table to verify the result: with the 0–1 option, every numerical value should now lie between 0 and 1.
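The idea behind scaling to a range can be sketched with min-max scaling, one common way to normalise. This is an illustration of the arithmetic only, not a description of what the widget runs internally, and the cholesterol values are hypothetical.

```python
def min_max_normalise(values, low=0.0, high=1.0):
    """Rescale values linearly so the smallest maps to low, the largest to high."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        return [low for _ in values]  # all values equal: avoid dividing by zero
    span = largest - smallest
    return [low + (v - smallest) * (high - low) / span for v in values]


cholesterol = [180, 240, 300]  # hypothetical readings
scaled_01 = min_max_normalise(cholesterol)              # range 0 to 1
scaled_pm1 = min_max_normalise(cholesterol, -1.0, 1.0)  # range -1 to +1

print(scaled_01)
print(scaled_pm1)
```

After scaling, features measured in very different units (age in years, cholesterol in mg/dl) sit on the same scale, so no single feature dominates just because its numbers are larger.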

🧼 Step 3 · Clean Data — Impute Missing Values

The Impute widget handles missing values. Strategies:

Average (mean)
Most frequent (mode)
Fixed value
Random value

Verify with a Data Table that missing values are replaced.
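The four strategies can be sketched in plain Python as follows. The function name, the strategy labels and the sample heart-rate values are hypothetical; missing entries are represented as `None`.

```python
import random
from collections import Counter
from statistics import mean


def impute(values, strategy="average", fixed=0, rng=None):
    """Replace None entries using one of four simple strategies."""
    known = [v for v in values if v is not None]
    if strategy == "random":
        # Each gap gets its own randomly chosen known value.
        rng = rng or random.Random()
        return [rng.choice(known) if v is None else v for v in values]
    if strategy == "average":
        fill = mean(known)
    elif strategy == "most_frequent":
        fill = Counter(known).most_common(1)[0][0]
    elif strategy == "fixed":
        fill = fixed
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


max_hr = [140, None, 180, 140, 160]  # hypothetical values, one missing

by_average = impute(max_hr, "average")
by_mode = impute(max_hr, "most_frequent")
by_fixed = impute(max_hr, "fixed", fixed=0)
by_random = impute(max_hr, "random", rng=random.Random(1))

print(by_average, by_mode, by_fixed, by_random)
```

The average and the most frequent value give different fills here (155 versus 140), which shows why the choice of strategy matters for the later analysis.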

📊 Step 4 · Analyze — Build & Test Model

Drop a classifier widget such as Logistic Regression (or Tree / kNN) on the canvas. K-Means does not fit here because it is a clustering method, not a classifier.
Add Test & Score widget. Connect learner + processed data.
Double-click Test & Score → choose Cross-Validation.
Connect Predictions widget to view outputs.
Optional: use Scatter Plot · Box Plot · Heat Map to visualise patterns.
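What Cross-Validation does in Test & Score can be sketched without Orange. To keep the example short, a 1-nearest-neighbour classifier stands in for Logistic Regression, and the six data rows are made up; only the fold logic is the point.

```python
def k_fold_indices(n_rows, k):
    """Split row positions 0..n_rows-1 into k folds of near-equal size."""
    return [list(range(start, n_rows, k)) for start in range(k)]


def nearest_neighbour_predict(train, point):
    """Return the label of the training row closest to point."""
    closest = min(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], point)),
    )
    return closest[1]


def cross_validate(rows, k):
    """Average accuracy over k folds; each fold is the test set exactly once."""
    accuracies = []
    for test_idx in k_fold_indices(len(rows), k):
        held_out = set(test_idx)
        train = [rows[i] for i in range(len(rows)) if i not in held_out]
        correct = sum(
            nearest_neighbour_predict(train, rows[i][0]) == rows[i][1]
            for i in test_idx
        )
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


# Hypothetical normalised features with a 0/1 target.
rows = [
    ((0.10, 0.20), 0), ((0.20, 0.10), 0), ((0.15, 0.25), 0),
    ((0.80, 0.90), 1), ((0.90, 0.80), 1), ((0.85, 0.75), 1),
]
accuracy = cross_validate(rows, k=3)
print(accuracy)
```

Because every row is tested by a model that never saw it during training, the averaged score is a fairer estimate than testing on the training data itself.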

5.3 Mining Data Streams

Data Stream — a continuous, real-time flow of data generated by sources like sensors, satellite images, internet & web traffic.

Mining Data Streams — the process of extracting meaningful patterns, trends and knowledge from a continuous flow of real-time data. Unlike traditional mining, it processes data as it arrives, without storing it completely.

A sudden spike in web searches for "election results" on a particular day can indicate elections have just been held — or reveal the level of public interest in the outcome.
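A spike like this can be spotted while the data is still flowing. The sketch below keeps only the last few counts in memory and flags any count far above the recent average. The window size, the threshold factor and the hourly search counts are all hypothetical choices.

```python
from collections import deque


def detect_spikes(counts, window=3, factor=3.0):
    """Yield (position, count) when a count exceeds factor x the recent average."""
    recent = deque(maxlen=window)  # older counts are discarded automatically
    for position, count in enumerate(counts):
        if len(recent) == window and count > factor * (sum(recent) / window):
            yield position, count
        recent.append(count)


# Hypothetical hourly search counts for "election results".
hourly_searches = [100, 120, 110, 105, 900, 950, 130]
spikes = list(detect_spikes(hourly_searches))
print(spikes)
```

Only three counts are held at any moment, so the same code works whether the stream has seven values or seven billion. The second high value (950) is not flagged because the window has already adapted to the new level.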

Learning Outcome 6: Explore future trends and advancements in Big Data Analytics

6.1 Future of Big Data Analytics — 3 Key Trends

⚡ 1. Real-Time Analytics

Data processed instantaneously for immediate insights — live customer-behaviour monitoring, supply-chain tracking, fraud detection.

🧠 2. Advanced Predictive Analytics

Predictive analytics will evolve with sophisticated machine-learning and AI algorithms, enabling organisations to forecast trends and behaviours with greater precision.

⚛️ 3. Quantum Computing

Promises unprecedented processing power. Quantum computers could solve certain complex problems much faster than classical computers, which may change how Big Data analytics is done.

Activity 1 (research-based group) — Explore Big Data & Data Analytics applications in three fields: Education · Environmental Science · Media & Entertainment. For each, note the video/article resource used and the insights + futuristic developments.
Activity 2 — List the 4 steps of Big Data Analytics in order (Gather → Process → Clean → Analyze).

Check Your Progress — quick MCQ pointers:

"Volume" in Big Data → amount of data generated.
Key characteristic of Big Data → Variety.
Verification is NOT one of the V's (they are Volume, Velocity, Variety, Veracity, Value, Variability).
Primary purpose of preprocessing → improve data quality.
Technique for patterns & relationships in large datasets → data mining.
Extracting useful info from large datasets → data analytics.
Benefit of BDA → improved decision-making.
Hadoop = distributed file system for storing and processing big data.
Primary veracity challenge → ensuring data quality and reliability.
CSV & Excel → Structured; JSON & XML → Semi-structured; audio, images, social-media posts → Unstructured.
"Positive / Negative / Neutral" terms → Sentiment Analysis.
Analysing large textual materials to capture concepts & trends → Text Mining.

Quick Revision — Key Points to Remember

Big Data = extremely large/complex datasets that normal programs cannot handle; sources: transactional · machine · social.
3 Types: Structured (tables, CSV) · Semi-Structured (XML, JSON, HTML) · Unstructured (audio, images, videos, social posts).
5 Advantages: Enhanced decision-making · Improved efficiency · Customer insights · Competitive advantage · Innovation & growth.
5 Disadvantages: Privacy & security · Data quality · Technical complexity · Regulatory compliance (GDPR, DPDP Act 2023) · Cost & resources.
3 Vs: Volume · Velocity · Variety.
6 Vs: Volume · Velocity · Variety · Veracity · Value · Variability.
Volume: 328.77 million TB/day. Velocity: Google — 40,000+ searches/sec.
Case study: OnDemandDrama uses all 6 Vs for personalised recommendations.
Big Data Analytics uses advanced techniques on terabyte-to-zettabyte datasets.
4 Types of analytics (from Unit 2): Descriptive · Diagnostic · Predictive · Prescriptive.
4 Drivers of BDA: Moore's Law · Mobile Computing · Social Networking · Cloud Computing.
4 Working Steps: Gather → Process (Batch / Stream) → Clean → Analyze.
5 Tools: Tableau · Apache Hadoop · Cassandra · MongoDB · SAS.
Orange BDA pipeline (Heart Disease): File → Preprocess (Normalize) → Impute (missing values) → Logistic Regression / Decision Tree → Test & Score → Predictions.
Impute strategies: Average · Most Frequent · Fixed · Random.
Data Stream = continuous real-time flow; Mining Data Streams extracts patterns as data arrives, without fully storing.
Future 3 Trends: Real-Time Analytics · Advanced Predictive Analytics · Quantum Computing.
