Introduction — Why Big Data Matters
In today's digital age, Big Data is a game-changer. Understanding its types, characteristics, and analytics helps us manage vast information, discover patterns, and unlock new possibilities in fields as diverse as healthcare, finance, entertainment, and environmental science.
Key Concepts You'll Learn
- Introduction to Big Data
- Types of Big Data
- Advantages and Disadvantages of Big Data
- Characteristics of Big Data (6 Vs)
- Big Data Analytics
- Working on Big Data Analytics
- Mining Data Streams
- Future of Big Data Analytics
Prerequisites: Understanding of the concept of data and reasonable fluency in English.
1.1 What Is Big Data? — Small Data vs Big Data
Small Data refers to datasets easily comprehensible by people — accessible, informative, actionable. Example: a small store tracks daily sales to decide what to restock.
Big Data refers to extremely large and complex datasets that regular programs cannot handle. It comes from three main sources:
- Transactional data — online purchases, banking transactions.
- Machine data — sensor readings, IoT devices.
- Social data — social media posts, tweets, comments.
1.2 Types of Big Data
| Aspect | Structured | Semi-Structured | Unstructured |
|---|---|---|---|
| Definition | Quantitative data with defined structure | Mix of quantitative & qualitative properties | No inherent structure or formal rules |
| Data model | Dedicated data model | May lack a specific data model | Lacks a consistent data model |
| Organisation | Clearly defined columns | Less organised than structured | Organisation exhibits variability over time |
| Accessibility | Easily accessible & searchable | Depends on the specific data format | Accessible but harder to analyse |
| Examples | Customer info, transaction records, product directories, CSV and Excel tables | XML, JSON, HTML, semi-structured documents | Audio, images, videos, emails, social media posts |
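The three types in the table can be contrasted with a minimal Python sketch. All sample values (customer IDs, user names, the review text) are invented for illustration and do not come from any real dataset.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields.
structured_text = "customer_id,amount\n101,250.0\n102,99.5\n"
rows = list(csv.DictReader(io.StringIO(structured_text)))
total_amount = sum(float(row["amount"]) for row in rows)

# Semi-structured: self-describing keys, but records may differ in shape.
semi_structured_text = '[{"user": "asha", "tags": ["drama"]}, {"user": "ravi"}]'
records = json.loads(semi_structured_text)
users_with_tags = [r["user"] for r in records if "tags" in r]

# Unstructured: free text with no fields; only crude measures such as a
# word count are available until further processing is applied.
unstructured_text = "Loved the new series, but the app kept buffering."
word_count = len(unstructured_text.split())

print(total_amount)     # 349.5
print(users_with_tags)  # ['asha']
print(word_count)       # 9
```

Note how the structured rows can be summed directly, the semi-structured records need a check for a missing key, and the unstructured text offers nothing to query by name.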
2.1 Advantages of Big Data — 5 Key Benefits
1. Enhanced Decision-Making
Big Data analytics empowers organisations to make data-driven decisions based on insights derived from large & diverse datasets.
2. Improved Efficiency & Productivity
Analysing vast data helps businesses identify inefficiencies, streamline processes and optimise resource allocation.
3. Better Customer Insights
Deeper understanding of customer behaviour, preferences and needs — enabling personalised marketing and improved customer experiences.
4. Competitive Advantage
Uncovers market trends, identifies opportunities, helps stay ahead of competitors.
5. Innovation & Growth
Facilitates new products, services and business models based on data-driven insights — driving growth and expansion.
2.2 Disadvantages of Big Data — 5 Challenges
1. Privacy and Security Concerns
Large volumes of data raise risks of unauthorised access, data breaches and misuse of personal information.
2. Data Quality Issues
Ensuring accuracy, reliability and completeness is hard: Big Data often combines unstructured and heterogeneous sources, which introduces errors and biases.
3. Technical Complexity
Implementing and managing Big Data infrastructure and analytics tools requires specialised skills & expertise.
4. Regulatory Compliance
Organisations must meet data-protection laws like GDPR (General Data Protection Regulation) and India's Digital Personal Data Protection Act, 2023. Non-compliance invites legal risks and penalties.
5. Cost & Resource Intensiveness
Acquisition, storage, processing, analytics and hiring skilled staff are expensive — a burden on smaller organisations with limited budgets.
3.1 The 3V and 6V Frameworks
Characteristics of Big Data start with the 3Vs — Volume, Velocity, Variety — and extend to the 6Vs by adding Veracity, Variability, Value.
1. Velocity
Speed at which data is generated, delivered and analysed. Today millions of people store information online, so the speed is enormous.
2. Volume
The sheer amount of data generated every day. Once data grows well beyond gigabytes it enters the realm of Big Data, with volumes ranging from terabytes → petabytes → exabytes.
3. Variety
Big Data spans many formats — structured, unstructured, semi-structured and highly complex. Ranges from simple numbers to text, images, audio, video. Unstructured data is hard to store in RDBMS but often provides the most valuable insights.
4. Veracity
Concerns consistency, accuracy, quality and trustworthiness of data. Not all data holds value — data must be cleaned before storage or analysis.
5. Value
The business value derived from the data — perhaps the most critical characteristic. Without valuable insights, the other Vs hold little significance.
6. Variability
How much the data stream changes over time: its flow rate, format and meaning can fluctuate unpredictably. Analytics must still extract meaningful data despite these fluctuating conditions.
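To make the Volume scale concrete, here is a small sketch that converts a raw byte count into storage units. It assumes decimal (SI) units, where each step up is a factor of 1000; the function name `human_readable` is chosen for this example only.

```python
# Decimal storage units in ascending order (assumption: 1 TB = 10**12 bytes).
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def human_readable(num_bytes):
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(num_bytes)
    unit_index = 0
    while value >= 1000 and unit_index < len(UNITS) - 1:
        value /= 1000
        unit_index += 1
    return f"{value:.2f} {UNITS[unit_index]}"

print(human_readable(1500))         # 1.50 KB
print(human_readable(2 * 10**12))   # 2.00 TB
print(human_readable(5 * 10**15))   # 5.00 PB
```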
3.2 Case Study — OnDemandDrama (OTT Platform)
3V Framework
- Volume — huge data from millions of users: watch history, ratings, searches, preferences.
- Velocity — data processed in real time to adjust recommendations and surface trending content.
- Variety — diverse data: user profiles (structured), watch lists, video content, reviews — a mix of structured / semi-structured / unstructured.
Additional 3 Vs (6V Framework)
- Veracity — filters out irrelevant or low-quality data (e.g., incomplete profiles) for accurate recommendations.
- Value — personalises experiences to drive engagement and retention.
- Variability — handles inconsistencies in data streams from changing user behaviour, trends, events or regions.
4.1 Data Analytics vs Big Data Analytics
Big Data Analytics covers the methodologies, tools and practices for data collection, organisation, storage and statistical analysis. Businesses use it to refine processes, enhance decision-making and improve performance. It applies these techniques to datasets ranging from terabytes to zettabytes.
4.2 4 Types of Big Data Analytics (Recap)
From Unit 2 — Descriptive · Diagnostic · Predictive · Prescriptive.
4.3 4 Global Trends Driving Big Data Analytics
- Moore's Law — exponential growth of computing power enables handling & analysing massive datasets.
- Mobile Computing — widespread smartphones give real-time access to vast data from anywhere.
- Social Networking — Facebook, Foursquare, Pinterest generate massive user-generated content and interactions.
- Cloud Computing — access hardware/software remotely on a pay-as-you-go basis, no heavy on-premises investment.
5.1 4 Steps of Big Data Analytics
Step 1 · Gather Data
Collect structured and unstructured data from various sources — cloud storage, mobile apps, IoT sensors, social feeds, internal databases.
Step 2 · Process Data
Once collected, data must be processed correctly — especially when large and unstructured. Two common options:
- Batch Processing: examines large blocks of data collected over time.
- Stream Processing: processes small batches as they arrive, giving quicker insight and shorter delay.
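The difference between the two options can be sketched with a simple average. This is an illustration only: the sensor readings are made up, and real batch or stream engines do far more than this.

```python
from typing import Iterable, Iterator, List

def batch_average(readings: List[float]) -> float:
    """Batch style: wait for the whole block, then compute once."""
    return sum(readings) / len(readings)

def stream_averages(readings: Iterable[float]) -> Iterator[float]:
    """Stream style: update the running average as each reading arrives."""
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count

sensor_readings = [20.0, 22.0, 21.0, 25.0]  # hypothetical IoT temperatures

batch_result = batch_average(sensor_readings)
stream_results = list(stream_averages(sensor_readings))

print(batch_result)    # 22.0
print(stream_results)  # [20.0, 21.0, 21.0, 22.0]
```

The stream version has an answer after every reading, while the batch version produces one answer at the end. Both agree once all the data has been seen.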
Step 3 · Clean Data
Scrubbing all data improves quality and results. Correct formatting and elimination of duplicate or irrelevant entries are essential. Erroneous or missing data lead to inaccurate insights.
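A minimal cleaning sketch, assuming records with hypothetical `name` and `city` fields: it fixes formatting, drops rows with missing values and removes duplicates. Real pipelines would choose their own rules for what counts as a duplicate or an irrelevant entry.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]

def clean_records(records: List[Record]) -> List[Dict[str, str]]:
    """Trim whitespace, normalise case, drop incomplete rows and duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        name = (record.get("name") or "").strip().title()
        city = (record.get("city") or "").strip().title()
        if not name or not city:
            continue  # missing value: skip the row
        key = (name, city)
        if key in seen:
            continue  # duplicate after formatting: skip the row
        seen.add(key)
        cleaned.append({"name": name, "city": city})
    return cleaned

raw = [
    {"name": " asha ", "city": "pune"},
    {"name": "Asha", "city": "Pune"},    # duplicate once formatting is fixed
    {"name": "Ravi", "city": None},      # missing city
    {"name": "meera", "city": "DELHI"},
]

cleaned = clean_records(raw)
print(cleaned)
# [{'name': 'Asha', 'city': 'Pune'}, {'name': 'Meera', 'city': 'Delhi'}]
```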
Step 4 · Analyze Data
Once the data is ready, advanced analytics turns Big Data into Big Insights.
Popular Big Data Analytics Tools
- Tableau — data visualisation.
- Apache Hadoop — distributed file system + processing engine for massive datasets.
- Cassandra — NoSQL distributed database.
- MongoDB — NoSQL document database.
- SAS — Statistical Analysis System.
5.2 Big Data Analytics with Orange — Heart Disease Dataset
Step 1 · Gather Data
Use the File widget to load Orange's built-in Heart Disease dataset. Features: age, gender, chest pain, rest_sbp (resting systolic blood pressure), cholesterol, rest_ecg, max_hr, etc. Target: diameter narrowing (1 = narrowed arteries / heart-disease risk, 0 = healthy).
Step 2 · Process Data — Normalization
Use the Preprocess widget to normalise features (scale numerical values to a specific range like 0–1 or −1 to +1). Connect a Data Table to verify that all numerical values are now between 0 and 1.
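The scaling itself is usually min-max normalisation. The sketch below shows the arithmetic in plain Python; it is not Orange's own implementation, and the cholesterol values are hypothetical rather than taken from the dataset.

```python
def min_max_normalise(values, low=0.0, high=1.0):
    """Rescale values linearly so the minimum maps to low and the maximum to high."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        # All values identical: no spread to scale, so map everything to low.
        return [low for _ in values]
    span = largest - smallest
    return [low + (v - smallest) * (high - low) / span for v in values]

cholesterol = [180.0, 240.0, 300.0]  # hypothetical readings

print(min_max_normalise(cholesterol))             # [0.0, 0.5, 1.0]
print(min_max_normalise(cholesterol, -1.0, 1.0))  # [-1.0, 0.0, 1.0]
```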
Step 3 · Clean Data — Impute Missing Values
The Impute widget handles missing values. Strategies:
- Average (mean)
- Most frequent (mode)
- Fixed value
- Random value
Verify with a Data Table that missing values are replaced.
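The four strategies can be sketched in plain Python as follows. This is an illustration of the idea, not the Impute widget's code; the strategy names, the `impute` function and the heart-rate values are assumptions made for this example, and missing values are represented as `None`.

```python
import random
import statistics

def impute(values, strategy="average", fixed_value=0.0, rng=None):
    """Replace None entries using one of four simple strategies."""
    known = [v for v in values if v is not None]
    rng = rng or random.Random(0)  # seeded so the random strategy is repeatable
    result = []
    for v in values:
        if v is not None:
            result.append(v)
        elif strategy == "average":
            result.append(statistics.mean(known))
        elif strategy == "most_frequent":
            result.append(statistics.mode(known))
        elif strategy == "fixed":
            result.append(fixed_value)
        elif strategy == "random":
            result.append(rng.choice(known))  # pick one of the observed values
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return result

max_hr = [140.0, None, 170.0, 140.0, 150.0]  # hypothetical, one value missing

print(impute(max_hr, "average"))        # [140.0, 150.0, 170.0, 140.0, 150.0]
print(impute(max_hr, "most_frequent"))  # [140.0, 140.0, 170.0, 140.0, 150.0]
print(impute(max_hr, "fixed", 0.0))     # [140.0, 0.0, 170.0, 140.0, 150.0]
```

The random strategy is left out of the printed examples because its result depends on the seed; it always returns one of the observed values.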
Step 4 · Analyze — Build & Test Model
- Drop the Logistic Regression (or Decision Tree) widget on the canvas. K-Means is a clustering method, so it is not used as a learner here.
- Add Test & Score widget. Connect learner + processed data.
- Double-click Test & Score → choose Cross-Validation.
- Connect Predictions widget to view outputs.
- Optional: use Scatter Plot · Box Plot · Heat Map to visualise patterns.
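What Cross-Validation does can be sketched in plain Python. To keep the example short and self-contained, a toy threshold learner stands in for Logistic Regression, the folds are assigned round-robin, and the (cholesterol, label) pairs are invented; none of this reproduces Orange's internals.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, int]  # (feature value, class label 0 or 1)

def k_fold_indices(n_samples: int, k: int) -> List[List[int]]:
    """Split indices 0..n-1 into k folds by round-robin assignment."""
    return [list(range(fold, n_samples, k)) for fold in range(k)]

def train_threshold(train: Sequence[Sample]) -> float:
    """Toy learner: place the threshold halfway between the two class means."""
    class0 = [x for x, y in train if y == 0]
    class1 = [x for x, y in train if y == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2

def predict(threshold: float, x: float) -> int:
    return 1 if x > threshold else 0

def cross_validate(data: Sequence[Sample], k: int) -> float:
    """Train on k-1 folds, test on the held-out fold, and average the accuracy."""
    scores = []
    for test_idx in k_fold_indices(len(data), k):
        held_out = set(test_idx)
        train = [s for i, s in enumerate(data) if i not in held_out]
        test = [data[i] for i in test_idx]
        threshold = train_threshold(train)
        correct = sum(predict(threshold, x) == y for x, y in test)
        scores.append(correct / len(test))
    return sum(scores) / k

# Hypothetical (cholesterol, diameter narrowing) pairs.
data = [(180.0, 0), (260.0, 1), (190.0, 0), (270.0, 1), (200.0, 0), (280.0, 1)]

print(k_fold_indices(len(data), 3))  # [[0, 3], [1, 4], [2, 5]]
print(cross_validate(data, 3))       # 1.0
```

Every sample is used for testing exactly once and never appears in the training set of its own fold, which is why the averaged score is a fairer estimate than testing on the training data.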
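5.3 Mining Data Streams
A data stream is a continuous, real-time flow of data. Mining data streams means extracting patterns as the data arrives, without storing the full stream.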
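One common way to find patterns in a stream without keeping all of it is a sliding window: only the most recent items are held in memory, and counts are updated as items enter and leave. The sketch below is a minimal illustration; the class name `SlidingWindowCounter` and the genre stream are assumptions made for this example.

```python
from collections import Counter, deque

class SlidingWindowCounter:
    """Track item frequencies over the most recent window_size items only."""

    def __init__(self, window_size: int):
        self.window_size = window_size
        self.window = deque()
        self.counts = Counter()

    def add(self, item):
        """Admit a new item and evict the oldest one if the window is full."""
        self.window.append(item)
        self.counts[item] += 1
        if len(self.window) > self.window_size:
            oldest = self.window.popleft()
            self.counts[oldest] -= 1
            if self.counts[oldest] == 0:
                del self.counts[oldest]

    def trending(self):
        """Return the most frequent item in the current window, or None if empty."""
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

# Hypothetical stream of genres being watched right now.
stream = ["drama", "news", "drama", "sport", "sport", "sport"]

counter = SlidingWindowCounter(window_size=3)
snapshots = []
for genre in stream:
    counter.add(genre)
    snapshots.append(counter.trending())

print(snapshots[2])   # drama
print(snapshots[-1])  # sport
```

Memory use stays bounded by the window size however long the stream runs, at the cost of forgetting anything older than the window.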
6.1 Future of Big Data Analytics — 3 Key Trends
1. Real-Time Analytics
Data processed instantaneously for immediate insights — live customer-behaviour monitoring, supply-chain tracking, fraud detection.
2. Advanced Predictive Analytics
Predictive analytics will evolve with sophisticated machine-learning and AI algorithms, enabling organisations to forecast trends and behaviours with greater precision.
3. Quantum Computing
Promises unprecedented processing power — quantum computers will solve complex problems much faster than classical computers, revolutionising Big Data analytics.
Activity 2 — List the 4 steps of Big Data Analytics in order (Gather → Process → Clean → Analyze).
- "Volume" in Big Data → amount of data generated.
- Key characteristic of Big Data → Variety.
- Verification is NOT one of the V's (they are Volume, Velocity, Variety, Veracity, Value, Variability).
- Primary purpose of preprocessing → improve data quality.
- Technique for patterns & relationships in large datasets → data mining.
- Extracting useful info from large datasets → data analytics.
- Benefit of BDA → improved decision-making.
- Hadoop = distributed file system for storing and processing big data.
- Primary veracity challenge → ensuring data quality and reliability.
- CSV & Excel → Structured; JSON & XML → Semi-structured; audio, images, social-media posts → Unstructured.
- "Positive / Negative / Neutral" terms → Sentiment Analysis.
- Analysing large textual materials to capture concepts & trends → Text Mining.
Quick Revision — Key Points to Remember
- Big Data = extremely large/complex datasets that normal programs cannot handle; sources: transactional · machine · social.
- 3 Types: Structured (tables, CSV) · Semi-Structured (XML, JSON, HTML) · Unstructured (audio, images, videos, social posts).
- 5 Advantages: Enhanced decision-making · Improved efficiency · Customer insights · Competitive advantage · Innovation & growth.
- 5 Disadvantages: Privacy & security · Data quality · Technical complexity · Regulatory compliance (GDPR, DPDP Act 2023) · Cost & resources.
- 3 Vs: Volume · Velocity · Variety.
- 6 Vs: Volume · Velocity · Variety · Veracity · Value · Variability.
- Volume: 328.77 million TB/day. Velocity: Google — 40,000+ searches/sec.
- Case study: OnDemandDrama uses all 6 Vs for personalised recommendations.
- Big Data Analytics uses advanced techniques on terabyte-to-zettabyte datasets.
- 4 Types of analytics (from Unit 2): Descriptive · Diagnostic · Predictive · Prescriptive.
- 4 Drivers of BDA: Moore's Law · Mobile Computing · Social Networking · Cloud Computing.
- 4 Working Steps: Gather → Process (Batch / Stream) → Clean → Analyze.
- 5 Tools: Tableau · Apache Hadoop · Cassandra · MongoDB · SAS.
- Orange BDA pipeline (Heart Disease): File → Preprocess (Normalize) → Impute (missing values) → Logistic Regression / Decision Tree → Test & Score → Predictions.
- Impute strategies: Average · Most Frequent · Fixed · Random.
- Data Stream = continuous real-time flow; Mining Data Streams extracts patterns as data arrives, without fully storing.
- Future 3 Trends: Real-Time Analytics · Advanced Predictive Analytics · Quantum Computing.