Introduction — Why Big Data Matters
In today's digital age, Big Data is a game-changer. Understanding its types, characteristics, and analytics helps us manage vast information, discover patterns, and unlock new possibilities in fields as diverse as healthcare, finance, entertainment, and environmental science.
🔹 Key Concepts You'll Learn
- Introduction to Big Data
- Types of Big Data
- Advantages and Disadvantages of Big Data
- Characteristics of Big Data (6 Vs)
- Big Data Analytics
- How Big Data Analytics Works
- Mining Data Streams
- Future of Big Data Analytics
Prerequisites: Understanding of the concept of data and reasonable fluency in English.
1.1 What Is Big Data? — Small Data vs Big Data
Small Data refers to datasets easily comprehensible by people — accessible, informative, actionable. Example: a small store tracks daily sales to decide what to restock.
Big Data refers to extremely large and complex datasets that regular programs cannot handle. It comes from three main sources:
- Transactional data — online purchases, banking transactions.
- Machine data — sensor readings, IoT devices.
- Social data — social media posts, tweets, comments.
1.2 Types of Big Data
| Aspect | Structured | Semi-Structured | Unstructured |
|---|---|---|---|
| Definition | Quantitative data with defined structure | Mix of quantitative & qualitative properties | No inherent structure or formal rules |
| Data model | Dedicated data model | May lack a specific data model | Lacks a consistent data model |
| Organisation | Clearly defined columns | Less organised than structured | Varies; no consistent organisation |
| Accessibility | Easily accessible & searchable | Depends on the specific data format | Accessible but harder to analyse |
| Examples | Customer info, transaction records, product directories | XML, CSV, JSON, HTML, semi-structured documents | Audio, images, videos, emails, social media posts |
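A minimal Python sketch can make the three types concrete. All sample records below are invented for illustration; the point is only how each type is accessed.

```python
import csv, io, json

# Structured: CSV rows follow a fixed schema of named columns.
csv_text = "customer_id,amount\n101,250.0\n102,99.5\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: JSON carries keys/tags but no rigid, uniform schema;
# one record may hold nested fields another record lacks.
json_text = '{"user": "asha", "tags": ["sale"], "profile": {"city": "Pune"}}'
record = json.loads(json_text)

# Unstructured: raw text has no inherent fields; any structure
# (word counts, sentiment, topics) must be inferred.
post = "Loved the new store layout! Will visit again."
word_count = len(post.split())

print(rows[0]["amount"], record["profile"]["city"], word_count)
```

Structured data is queried by column name, semi-structured data by key paths that may differ per record, and unstructured data only after some analysis step derives structure from it.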
2.1 Advantages of Big Data — 5 Key Benefits
🎯 1. Enhanced Decision-Making
Big Data analytics empowers organisations to make data-driven decisions based on insights derived from large & diverse datasets.
⚙️ 2. Improved Efficiency & Productivity
Analysing vast data helps businesses identify inefficiencies, streamline processes and optimise resource allocation.
👥 3. Better Customer Insights
Deeper understanding of customer behaviour, preferences and needs — enabling personalised marketing and improved customer experiences.
🏆 4. Competitive Advantage
Uncovers market trends, identifies opportunities, helps stay ahead of competitors.
💡 5. Innovation & Growth
Facilitates new products, services and business models based on data-driven insights — driving growth and expansion.
2.2 Disadvantages of Big Data — 5 Challenges
🔐 1. Privacy and Security Concerns
Large volumes of data raise risks of unauthorised access, data breaches and misuse of personal information.
❓ 2. Data Quality Issues
Ensuring accuracy, reliability and completeness is hard — Big Data often contains unstructured and heterogeneous sources, leading to errors and biases.
🛠️ 3. Technical Complexity
Implementing and managing Big Data infrastructure and analytics tools requires specialised skills & expertise.
📜 4. Regulatory Compliance
Organisations must meet data-protection laws like GDPR (General Data Protection Regulation) and India's Digital Personal Data Protection Act, 2023. Non-compliance invites legal risks and penalties.
💰 5. Cost & Resource Intensiveness
Acquisition, storage, processing, analytics and hiring skilled staff are expensive — a burden on smaller organisations with limited budgets.
3.1 The 3V and 6V Frameworks
Characteristics of Big Data start with the 3Vs — Volume, Velocity, Variety — and extend to the 6Vs by adding Veracity, Variability, Value.
⏩ 1. Velocity
The speed at which data is generated, delivered and analysed. With millions of people creating and sharing information online every moment, data now arrives at enormous speed.
📦 2. Volume
The sheer amount of data generated every day. Datasets beyond the gigabyte scale fall into the realm of Big Data, ranging from terabytes to petabytes to exabytes.
🗂️ 3. Variety
Big Data spans many formats — structured, unstructured, semi-structured and highly complex. Ranges from simple numbers to text, images, audio, video. Unstructured data is hard to store in RDBMS but often provides the most valuable insights.
✅ 4. Veracity
Concerns consistency, accuracy, quality and trustworthiness of data. Not all data holds value — data must be cleaned before storage or analysis.
💎 5. Value
The business value derived from the data — perhaps the most critical characteristic. Without valuable insights, the other Vs hold little significance.
🔀 6. Variability
How the flow, format and meaning of data change over time, sometimes unpredictably. Handling variability ensures meaningful data can still be extracted despite fluctuating conditions.
3.2 Case Study — OnDemandDrama (OTT Platform)
🔹 3V Framework
- Volume — huge data from millions of users: watch history, ratings, searches, preferences.
- Velocity — data processed in real time to adjust recommendations and surface trending content.
- Variety — diverse data: user profiles (structured), watch lists, video content, reviews — a mix of structured / semi-structured / unstructured.
🔹 Additional 3 Vs (6V Framework)
- Veracity — filters out irrelevant or low-quality data (e.g., incomplete profiles) for accurate recommendations.
- Value — personalises experiences to drive engagement and retention.
- Variability — handles inconsistencies in data streams from changing user behaviour, trends, events or regions.
4.1 Data Analytics vs Big Data Analytics
Data analytics, broadly, is the practice of extracting useful information from datasets of any size. Big Data Analytics applies this at much greater scale: the methodologies, tools and practices for data collection, organisation, storage and statistical analysis of massive datasets, used in business to refine processes, enhance decision-making and improve performance.
4.2 4 Types of Big Data Analytics (Recap)
From Unit 2 — Descriptive · Diagnostic · Predictive · Prescriptive.
4.3 4 Global Trends Driving Big Data Analytics
- Moore's Law — exponential growth of computing power enables handling & analysing massive datasets.
- Mobile Computing — widespread smartphones give real-time access to vast data from anywhere.
- Social Networking — Facebook, Foursquare, Pinterest generate massive user-generated content and interactions.
- Cloud Computing — access hardware/software remotely on a pay-as-you-go basis, no heavy on-premises investment.
5.1 4 Steps of Big Data Analytics
📥 Step 1 · Gather Data
Collect structured and unstructured data from various sources — cloud storage, mobile apps, IoT sensors, social feeds, internal databases.
⚙️ Step 2 · Process Data
Once collected, data must be processed correctly, especially when it is large and unstructured. Two common options:
- Batch Processing — accumulates large blocks of data and analyses them together at intervals.
- Stream Processing — processes data in small increments as it arrives, giving quicker insights and lower latency.
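The difference between the two options can be sketched in plain Python. The event counts below are invented; the contrast is when the result becomes available.

```python
from statistics import mean

events = [12, 7, 9, 14, 3, 8]   # e.g. transactions per minute (invented)

# Batch processing: wait for the whole block, then analyse it in one pass.
batch_avg = mean(events)

# Stream processing: update a running result as each event arrives,
# so an insight is available after every item, not only at the end.
count, total, running_avg = 0, 0, []
for value in events:
    count += 1
    total += value
    running_avg.append(total / count)

print(batch_avg, running_avg[-1])
```

Both approaches reach the same final average; the stream version simply produces intermediate answers along the way, which is what shortens the delay to insight.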
🧹 Step 3 · Clean Data
Scrubbing all data improves quality and results. Correct formatting and elimination of duplicate or irrelevant entries are essential. Erroneous or missing data lead to inaccurate insights.
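The cleaning operations just described, formatting fixes and duplicate removal, can be sketched as follows. The records are invented for illustration.

```python
raw = [
    {"id": 1, "name": " Asha ", "city": "pune"},
    {"id": 1, "name": "Asha", "city": "Pune"},   # duplicate of record 1
    {"id": 2, "name": "Ravi", "city": None},     # missing city value
]

cleaned, seen = [], set()
for rec in raw:
    rec = {
        "id": rec["id"],
        # Correct formatting: trim whitespace, normalise capitalisation.
        "name": rec["name"].strip().title(),
        "city": rec["city"].strip().title() if rec["city"] else "Unknown",
    }
    # Eliminate duplicates: keep only the first record seen per id.
    if rec["id"] not in seen:
        seen.add(rec["id"])
        cleaned.append(rec)

print(cleaned)
```

Real pipelines use dedicated tools for this, but the principle is the same: standardise formats first, then drop redundant or irrelevant entries before analysis.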
📈 Step 4 · Analyze Data
Once the data is ready, advanced analytics turns Big Data into Big Insights.
🔹 Popular Big Data Analytics Tools
- Tableau — data visualisation.
- Apache Hadoop — distributed file system + processing engine for massive datasets.
- Cassandra — NoSQL distributed database.
- MongoDB — NoSQL document database.
- SAS — Statistical Analysis System.
5.2 Big Data Analytics with Orange — Heart Disease Dataset
📂 Step 1 · Gather Data
Use the File widget to load Orange's built-in Heart Disease dataset. Features: age, gender, chest pain, rest_sbp (resting systolic blood pressure), cholesterol, rest_ecg, max_hr, etc. Target: diameter narrowing (1 = narrowed arteries / heart-disease risk, 0 = healthy).
🧪 Step 2 · Process Data — Normalization
Use the Preprocess widget to normalise features (scale numerical values to a specific range like 0–1 or −1 to +1). Connect a Data Table to verify that all numerical values are now between 0 and 1.
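Orange's Preprocess widget performs this scaling internally. The min-max idea it applies can be sketched in plain Python; the cholesterol values below are invented samples, not taken from the dataset.

```python
def min_max_normalize(values, low=0.0, high=1.0):
    """Rescale values linearly so the minimum maps to `low`
    and the maximum maps to `high`."""
    lo, hi = min(values), max(values)
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]

cholesterol = [180, 240, 300, 210]   # invented sample values
scaled = min_max_normalize(cholesterol)
print(scaled)   # every value now lies in [0, 1]
```

Scaling all numeric features into the same range prevents features with large raw magnitudes (like cholesterol) from dominating ones with small magnitudes (like a 0/1 flag) during model training.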
🧼 Step 3 · Clean Data — Impute Missing Values
The Impute widget handles missing values. Strategies:
- Average (mean)
- Most frequent (mode)
- Fixed value
- Random value
Verify with a Data Table that missing values are replaced.
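Orange's Impute widget applies these strategies internally; a plain-Python sketch of the four, using invented blood-pressure readings with gaps, looks like this.

```python
import random
from statistics import mean, mode

column = [120, None, 140, 130, None, 140]   # invented readings with gaps

def impute(values, strategy, fixed=0):
    """Replace None entries using one of the four strategies."""
    observed = [v for v in values if v is not None]
    if strategy == "average":
        fill = mean(observed)           # mean of the observed values
    elif strategy == "most_frequent":
        fill = mode(observed)           # mode of the observed values
    elif strategy == "fixed":
        fill = fixed                    # a constant chosen by the analyst
    elif strategy == "random":
        fill = random.choice(observed)  # a randomly drawn observed value
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

print(impute(column, "average"))         # gaps filled with 132.5
print(impute(column, "most_frequent"))   # gaps filled with 140
```

The right strategy depends on the feature: averages suit continuous measurements, the mode suits categorical values, and fixed values suit cases where "missing" has a known meaning.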
📊 Step 4 · Analyze — Build & Test Model
- Drop the Logistic Regression (or Decision Tree / K-Means) widget on the canvas.
- Add Test & Score widget. Connect learner + processed data.
- Double-click Test & Score → choose Cross-Validation.
- Connect Predictions widget to view outputs.
- Optional: use Scatter Plot · Box Plot · Heat Map to visualise patterns.
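The Test & Score widget handles cross-validation internally. The underlying k-fold idea, partition the samples and hold each fold out once as a test set, can be sketched in plain Python (the function name and sizes here are illustrative, not Orange's API):

```python
def k_fold_indices(n_samples, k):
    """Split sample indices into k roughly equal, non-overlapping folds.
    Each fold serves once as the test set; the rest form the training set."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = fold_size + (1 if i < remainder else 0)  # spread the remainder
        folds.append(indices[start:start + size])
        start += size
    return folds

folds = k_fold_indices(10, 3)
for i, test_fold in enumerate(folds):
    train = [j for f in folds if f is not test_fold for j in f]
    print(f"fold {i}: test={test_fold}, train size={len(train)}")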
5.3 Mining Data Streams
A data stream is a continuous, real-time flow of data, such as sensor feeds or clickstreams. Mining data streams means extracting patterns and statistics as the data arrives, without first storing the entire stream.
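Because a stream may be unbounded, stream-mining algorithms typically work in a single pass with constant memory, updating summaries as each item arrives. A minimal sketch (the sensor readings are invented):

```python
class StreamStats:
    """Maintain count, mean and max over a stream in O(1) memory,
    without storing the items themselves."""
    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.maximum = float("-inf")

    def update(self, x):
        self.count += 1
        # Incremental mean update: shift the mean toward x by 1/count.
        self.mean += (x - self.mean) / self.count
        self.maximum = max(self.maximum, x)

stats = StreamStats()
for reading in [3, 9, 4, 8, 6]:   # items arrive one at a time
    stats.update(reading)

print(stats.count, stats.mean, stats.maximum)
```

The same one-pass principle underlies heavier stream-mining techniques such as sampling, sketching and windowed counting: keep a small summary current, never the raw history.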
6.1 Future of Big Data Analytics — 3 Key Trends
⚡ 1. Real-Time Analytics
Data processed instantaneously for immediate insights — live customer-behaviour monitoring, supply-chain tracking, fraud detection.
🧠 2. Advanced Predictive Analytics
Predictive analytics will evolve with sophisticated machine-learning and AI algorithms, enabling organisations to forecast trends and behaviours with greater precision.
⚛️ 3. Quantum Computing
Promises unprecedented processing power — quantum computers will solve complex problems much faster than classical computers, revolutionising Big Data analytics.
Activity 2 — List the 4 steps of Big Data Analytics in order (Gather → Process → Clean → Analyze).
- "Volume" in Big Data → amount of data generated.
- Key characteristic of Big Data → Variety.
- Verification is NOT one of the V's (they are Volume, Velocity, Variety, Veracity, Value, Variability).
- Primary purpose of preprocessing → improve data quality.
- Technique for patterns & relationships in large datasets → data mining.
- Extracting useful info from large datasets → data analytics.
- Benefit of BDA → improved decision-making.
- Hadoop = distributed file system for storing and processing big data.
- Primary veracity challenge → ensuring data quality and reliability.
- CSV & Excel → Structured; JSON & XML → Semi-structured; audio, images, social-media posts → Unstructured.
- "Positive / Negative / Neutral" terms → Sentiment Analysis.
- Analysing large textual materials to capture concepts & trends → Text Mining.
Quick Revision — Key Points to Remember
- Big Data = extremely large/complex datasets that normal programs cannot handle; sources: transactional · machine · social.
- 3 Types: Structured (tables, CSV) · Semi-Structured (XML, JSON, HTML) · Unstructured (audio, images, videos, social posts).
- 5 Advantages: Enhanced decision-making · Improved efficiency · Customer insights · Competitive advantage · Innovation & growth.
- 5 Disadvantages: Privacy & security · Data quality · Technical complexity · Regulatory compliance (GDPR, DPDP Act 2023) · Cost & resources.
- 3 Vs: Volume · Velocity · Variety.
- 6 Vs: Volume · Velocity · Variety · Veracity · Value · Variability.
- Volume: 328.77 million TB/day. Velocity: Google — 40,000+ searches/sec.
- Case study: OnDemandDrama uses all 6 Vs for personalised recommendations.
- Big Data Analytics uses advanced techniques on terabyte-to-zettabyte datasets.
- 4 Types of analytics (from Unit 2): Descriptive · Diagnostic · Predictive · Prescriptive.
- 4 Drivers of BDA: Moore's Law · Mobile Computing · Social Networking · Cloud Computing.
- 4 Working Steps: Gather → Process (Batch / Stream) → Clean → Analyze.
- 5 Tools: Tableau · Apache Hadoop · Cassandra · MongoDB · SAS.
- Orange BDA pipeline (Heart Disease): File → Preprocess (Normalize) → Impute (missing values) → Logistic Regression / Decision Tree → Test & Score → Predictions.
- Impute strategies: Average · Most Frequent · Fixed · Random.
- Data Stream = continuous real-time flow; Mining Data Streams extracts patterns as data arrives, without fully storing.
- Future 3 Trends: Real-Time Analytics · Advanced Predictive Analytics · Quantum Computing.