6.1 Features of Natural Languages
A natural language is a human language — French, Spanish, English, Hindi, Japanese, etc. Its features:
🔹 Same-Sounding Words Have Different Meanings
• "I am so tired; I want to take a ___." (nap / knap)
• "Let's ___ her a letter." (write / right)
Same sound, different spelling, totally different meaning!
🔹 Same Word, Different Meanings in Context
- His face turned red after he found out he took the wrong bag. → ashamed / angry?
- The red car zoomed past his nose. → colour of the car
- His face turns red after consuming the medicine. → allergic reaction?
6.2 Computer Language vs Natural Language
💻 Computer Languages
Languages used to interact with a computer — Python, C++, Java, HTML. Strict rules, no ambiguity.
🗣️ Natural Languages
Human languages — English, Hindi, Tamil. Many rules, many exceptions, lots of ambiguity.
6.3 Why is NLP Important?
- Computers can only process electronic signals in binary form — they don't understand "hello" directly.
- NLP converts natural language to digital form so computers can process it.
- Makes communication between humans and computers possible.
- Provides tools and techniques that help convey intent and context more clearly.
- Powers everyday tools — search engines, chatbots, email filters, voice assistants, translation apps.
- Enables accessibility — auto-captions for the hearing impaired, screen readers for the visually impaired.
6.4 Real-Life Applications of NLP
Step 1: Visit https://cloud.google.com/natural-language.
Step 2: Paste any paragraph in the text box → Click Analyze.
Observe: Keywords from the paragraph get highlighted in different colours — e.g., "Google", "Mountain View". Try your own text and see results!
6.5 Stages of Natural Language Processing
NLP processes text through 5 sequential stages — each stage adds a deeper layer of understanding:
📝 1. Lexical Analysis
Lexicon = the collection of words and phrases used in a language. In this stage, the machine scans the text and divides it into paragraphs, sentences and words (tokens).
🔗 2. Syntactic Analysis / Parsing
The sentence is checked against the grammar rules of the language. Incorrect grammar: "Boy runs the fast." ✗ — rejected at this stage.
💭 3. Semantic Analysis
Does NOT make sense: "Hot ice cream" — semantically wrong. Rejected.
Also rejected: "The fox jumped into the dog." — grammatically right but semantically odd.
📖 4. Discourse Integration
"Rahul went to the store. He bought milk. Then he came home."
Each sentence connects with the ones before and after it — the pronoun "He" clearly refers to Rahul, so the machine can follow the flow.
🎯 5. Pragmatic Analysis
Pragmatic (intended): "Pass the salt." → the real meaning is a request, not a yes/no question. Pragmatic analysis captures this real-world intent.
🔹 Summary of 5 Stages
| Stage | What It Checks |
|---|---|
| 1. Lexical | Structure — break text into paragraphs, sentences, words, tokens. |
| 2. Syntactic | Grammar — are the words arranged correctly? |
| 3. Semantic | Meaning — does this make any sense? |
| 4. Discourse | Story flow — do sentences connect? |
| 5. Pragmatic | Intent — what's the real-world meaning? |
6.6 Chatbots
A chatbot is a computer program that simulates human conversation — one of the most familiar applications of NLP.
🔹 Play with Chatbots — Activity
- Elizabot — masswerk.at/elizabot/ (a web version of ELIZA, one of the earliest chatbots, from 1966)
- Mitsuku / Kuki — kuki.ai (multiple times winner of the Loebner Prize)
- Cleverbot — cleverbot.com (learns from conversations)
- Singtel — singtel.com/personal/support (customer-service chatbot)
6.7 Script Bot vs Smart Bot
After interacting with chatbots, you'll notice two types:
📜 Script Bot (Traditional)
- Works on pre-defined scripts / rules.
- Can only handle topics the developer programmed.
- Cheap and fast to build.
- Cannot learn from conversations.
- Falls apart when user asks something unscripted.
- Example: a pizza-ordering bot that only handles pizza orders.
🧠 Smart Bot (AI-Powered)
- Uses Machine Learning and NLP.
- Handles wider range of topics naturally.
- Learns and improves from conversations.
- Costlier and more complex to build.
- Can maintain context across a conversation.
- Example: ChatGPT, Alexa, Mitsuku.
6.8 Why Text Processing?
Computers can only process numbers. So the first step of NLP is to convert human language to numbers. This happens in stages — starting with Text Normalisation.
6.9 Steps of Text Normalisation
✂️ 1. Sentence Segmentation
The whole corpus is divided into individual sentences. Each sentence becomes a separate data item.
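As a minimal sketch — assuming every sentence ends with ".", "?" or "!" — segmentation can be done with a regular-expression split:

```python
import re

def segment(corpus):
    # Split the corpus on sentence-ending punctuation; drop empty pieces.
    return [s.strip() for s in re.split(r"[.?!]", corpus) if s.strip()]

print(segment("Rahul went to the store. He bought milk. Then he came home."))
# ['Rahul went to the store', 'He bought milk', 'Then he came home']
```

Real segmenters handle trickier cases (abbreviations like "Dr.", decimal numbers), which this toy split does not.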
🔤 2. Tokenisation
Each sentence is further divided into tokens. A token is any word, number, or special character occurring in the sentence — each one becomes its own separate token.
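A simple regex-based sketch of tokenisation (the pattern and example sentence are illustrative, not from any particular library):

```python
import re

def tokenise(sentence):
    # \w+ grabs runs of letters/digits; [^\w\s] grabs each special character.
    return re.findall(r"\w+|[^\w\s]", sentence)

print(tokenise("Email me at raj@school.in!"))
# ['Email', 'me', 'at', 'raj', '@', 'school', '.', 'in', '!']
```

Note how "@" and "." become tokens of their own — which is why, in the next step, an email-address dataset must be careful about which special characters it removes.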
🚫 3. Removing Stop Words, Special Characters & Numbers
Examples of stop words: a, an, the, is, in, at, of, to, and, but, or, this, that, which, these, those, for, with, from, by, on, …
- Special characters and numbers are also removed if they don't matter.
- Sometimes they matter — e.g., in email-address dataset you keep "@" and ".".
- Removing stop words helps the computer focus on meaningful terms.
🔡 4. Converting Text to a Common Case
After stop-word removal, the whole text is converted to a common case — usually lowercase — so that the machine does not treat "Hello" and "hello" as two different words.
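Steps 3 and 4 can be sketched together. The `STOP_WORDS` set below is just a subset of the list above, for illustration:

```python
STOP_WORDS = {"a", "an", "the", "is", "in", "at", "of", "to", "and", "but",
              "or", "this", "that", "which", "these", "those", "for", "with",
              "from", "by", "on"}

def normalise(tokens):
    # Lowercase each token, then drop stop words, numbers and special characters.
    kept = []
    for tok in tokens:
        tok = tok.lower()
        if tok in STOP_WORDS or not tok.isalpha():
            continue
        kept.append(tok)
    return kept

print(normalise(["The", "red", "car", "zoomed", "past", "his", "nose", "!"]))
# ['red', 'car', 'zoomed', 'past', 'his', 'nose']
```

In a real pipeline (e.g. an email dataset) you would keep characters such as "@" and "." when they carry meaning.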
🌱 5. Stemming
| Word | Affix | Stem |
|---|---|---|
| healed | -ed | heal |
| healing | -ing | heal |
| healer | -er | heal |
| studies | -es | studi (not meaningful!) |
| studying | -ing | study |
📖 6. Lemmatisation
| Word | Affix | Lemma |
|---|---|---|
| healed | -ed | heal |
| healing | -ing | heal |
| healer | -er | heal |
| studies | -es | study ✓ |
| studying | -ing | study |
🔹 Stemming vs Lemmatisation — Quick Compare
🌱 Stemming
Faster · removes affixes blindly · result may be meaningless (studies → studi).
📖 Lemmatisation
Slower · uses dictionary · result is always meaningful (studies → study).
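The contrast can be shown with a deliberately naive stemmer (not NLTK's Porter stemmer — just blind affix-chopping) next to a tiny hand-made lemma dictionary:

```python
def naive_stem(word):
    # Blindly chop common affixes, with no dictionary check at the end.
    for affix in ("ing", "es", "ed", "er", "s"):
        if word.endswith(affix) and len(word) > len(affix) + 2:
            return word[: -len(affix)]
    return word

# Illustrative lemma dictionary -- real lemmatisers use a full lexicon.
LEMMA_DICT = {"studies": "study", "studying": "study",
              "healed": "heal", "healing": "heal", "healer": "heal"}

def lemmatise(word):
    # Look the word up; fall back to the word itself if unknown.
    return LEMMA_DICT.get(word, word)

print(naive_stem("studies"))  # studi -- not a real word
print(lemmatise("studies"))   # study -- always meaningful
```

The stemmer is faster (pure string slicing) but can emit non-words; the lemmatiser always returns a dictionary word at the cost of a lookup.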
6.10 Bag of Words Algorithm
The Bag of Words algorithm extracts two things from the corpus:
- A vocabulary of words for the corpus.
- The frequency of these words (how many times each has occurred).
🔹 4-Step Bag of Words Implementation
1. Text Normalisation — pre-process the whole corpus.
2. Create Dictionary — list every unique word across all documents.
3. Create Document Vectors — count each dictionary word in one document.
4. Repeat — build a vector for every remaining document.
🔹 Worked Example
• Doc 1: "Aman and Avni are stressed"
• Doc 2: "Aman went to a therapist"
• Doc 3: "Avni went to download a health chatbot"
🔹 Step 1 — After Text Normalisation
- Doc 1: [aman, and, avni, are, stressed]
- Doc 2: [aman, went, to, a, therapist]
- Doc 3: [avni, went, to, download, a, health, chatbot]
🔹 Step 2 — Create Dictionary (unique words from all docs)
Dictionary: aman · and · avni · are · stressed · went · to · a · therapist · download · health · chatbot (12 unique words)
🔹 Step 3 & 4 — Document Vector Table
| Doc | aman | and | avni | are | stressed | went | to | a | therapist | download | health | chatbot |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Doc 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Doc 2 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| Doc 3 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
Each cell = how many times the word appears in that document. This table is the document vector table.
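The four steps above can be sketched in a few lines of Python, reproducing the document vector table for the worked example:

```python
# The three documents after text normalisation (Step 1).
docs = [
    ["aman", "and", "avni", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["avni", "went", "to", "download", "a", "health", "chatbot"],
]

# Step 2: dictionary of unique words, in order of first appearance.
dictionary = []
for doc in docs:
    for word in doc:
        if word not in dictionary:
            dictionary.append(word)

# Steps 3-4: one frequency vector per document.
vectors = [[doc.count(word) for word in dictionary] for doc in docs]

print(dictionary)   # 12 unique words
print(vectors[0])   # Doc 1: [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
```

Each vector row matches the corresponding row of the document vector table above.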
6.11 TF-IDF (Term Frequency–Inverse Document Frequency)
📊 Term Frequency (TF)
TF(W) = the number of times word W occurs in a single document. In our example, TF of "aman" in Doc 1 = 1; TF of "stressed" in Doc 1 = 1.
📉 Document Frequency (DF)
DF(W) = the number of documents in which word W appears, counted across the whole corpus. From our vocabulary:
- DF of "aman" = 2 (appears in Docs 1 and 2)
- DF of "avni" = 2 (Docs 1 and 3)
- DF of "went" = 2 (Docs 2 and 3)
- DF of "to" = 2 (Docs 2 and 3)
- DF of "a" = 2 (Docs 2 and 3)
- DF of "stressed", "therapist", "download", "health", "chatbot" = 1 (one doc each)
📈 Inverse Document Frequency (IDF)
IDF(W) = Total number of documents ÷ DF(W). For our 3-document example: IDF of "aman" = 3/2 = 1.5; IDF of "stressed" = 3/1 = 3.
🎯 TF-IDF Formula
TF-IDF(W) = TF(W) × log(IDF(W)) = TF(W) × log(Total documents ÷ DF(W)). (log is to base 10.)
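Applying the formula to the 3-document example from the Bag of Words section — a small sketch with no library dependencies beyond `math`:

```python
import math

def tf_idf(word, doc, docs):
    # TF-IDF(W) = TF(W) * log10(total documents / DF(W)).
    tf = doc.count(word)                      # term frequency in this document
    df = sum(1 for d in docs if word in d)    # document frequency (assumed > 0)
    idf = len(docs) / df                      # inverse document frequency
    return tf * math.log10(idf)

docs = [
    ["aman", "and", "avni", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["avni", "went", "to", "download", "a", "health", "chatbot"],
]

print(round(tf_idf("aman", docs[0], docs), 3))      # log10(3/2) ≈ 0.176
print(round(tf_idf("stressed", docs[0], docs), 3))  # log10(3/1) ≈ 0.477
```

"stressed" (rare, DF = 1) scores higher than "aman" (DF = 2) — exactly the behaviour described below: rarer words carry more value.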
🔹 Understanding the Plot — Words vs Value
- Highest frequency across all documents but negligible value — stop words like "a", "and", "the", "is". Removed during pre-processing.
- Adequate occurrence — words that talk about the document's subject; these carry some value.
- Lowest occurrence but highest value — rare words that add the most meaning to the corpus.
🔹 TF-IDF Worked Example — "air pollution"
Case 1 — "and" (stop word):
• Total documents = 10. "and" occurs in all 10 documents.
• IDF(and) = 10/10 = 1. log(1) = 0.
• TF-IDF(and) = TF × 0 = 0. Correctly identified as worthless.
Case 2 — "pollution" (valuable word):
• Total documents = 10. "pollution" occurs in 3 documents.
• IDF(pollution) = 10/3 = 3.33. log(3.33) ≈ 0.522.
• TF-IDF(pollution) = TF × 0.522 → a non-zero score. Correctly identified as important.
- Words in all documents with high TF → low value → stop words.
- For high TF-IDF → word needs high TF but low DF — important for one doc, not common everywhere.
- Higher the value → more important the word is for a given corpus.
6.12 Applications of TF-IDF
- Document Classification — the most important words help classify a document into categories.
- Topic Modelling — helps identify the topics a document talks about.
- Information Retrieval — search engines rank documents by how relevant their words are to a query.
- Stop-Word Filtering — words with near-zero TF-IDF can be dropped as stop words.
6.13 Code vs No-Code NLP Tools
| Type | Tools | Details |
|---|---|---|
| 💻 Code NLP | NLTK, SpaCy | NLTK (Natural Language Tool Kit) — a Python package for text processing, providing functions and modules for NLP. SpaCy — an open-source NLP library for tokenisation, POS tagging, named-entity recognition and dependency parsing. |
| 🟠 No-Code NLP | Orange, MonkeyLearn | Orange Data Mining — an ML tool usable through Python scripting or visual drag-and-drop programming. MonkeyLearn — a text-analysis platform offering NLP tools and ML models for classification, sentiment analysis and entity recognition. |
6.14 Sentiment Analysis — Use Case Walkthrough
🔹 Real-World Use Cases of Sentiment Analysis
- Product reviews — is this review positive/negative? (Amazon, Flipkart)
- Social-media monitoring — public opinion about a brand, politician, movie.
- Customer service — identifying angry customers for priority handling.
- Stock prediction — Twitter sentiment affects market movements.
- Election analysis — voter sentiment from social posts.
- Feedback analysis — post-event feedback forms.
🔹 Sentiment Analysis in Orange Data Mining
Orange has dedicated NLP widgets for sentiment analysis. The flow mirrors the AI Project Cycle:
| Stage | Orange Widget / Action |
|---|---|
| 1. Problem Scoping | Decide — are we analysing movie reviews, tweets, or customer feedback? |
| 2. Data Acquisition | Load text dataset via Corpus or File widget. |
| 3. Data Exploration | Word Cloud, Data Table to see the data. |
| 4. Pre-processing | Preprocess Text widget — tokenisation, stop-word removal, lowercase, stemming/lemmatisation. |
| 5. Modelling | Sentiment Analysis widget (VADER / Liu-Hu) — scores each sentence. |
| 6. Evaluation / Visualisation | Box Plot, Scatter Plot to see the positive/negative/neutral distribution. |
🔹 Interpretation of Sentiment Scores
- A positive score → the text expresses positive sentiment.
- A negative score → the text expresses negative sentiment.
- A score close to zero → the text is neutral.
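To see the idea behind lexicon-based scorers like VADER, here is a deliberately tiny sketch — the `POSITIVE` and `NEGATIVE` word sets are made up for illustration, and real tools use far richer lexicons plus rules for negation and intensity:

```python
# Toy sentiment lexicons -- illustrative only, not from any real tool.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "angry"}

def sentiment(text):
    # Score = (positive word count) - (negative word count).
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this great phone"))       # positive
print(sentiment("Terrible battery, bad camera"))  # negative
print(sentiment("The phone arrived today"))       # neutral
```

Orange's Sentiment Analysis widget applies the same basic idea at scale, producing per-sentence scores you can visualise with Box Plot or Scatter Plot.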
Quick Revision — Key Points to Remember
- NLP = AI sub-field to analyse, understand, process human language. Purpose: enable communication between computers and humans.
- Features of natural languages: governed by rules · redundant · change over time · same-sounding words · context-dependent meaning.
- Applications: Auto-captions · Voice Assistants (Alexa, Siri, Google) · Translation (Google Translate) · Sentiment Analysis · Text Classification · Keyword Extraction · Spam Filters · Chatbots.
- 5 Stages of NLP: Lexical → Syntactic (Parsing) → Semantic → Discourse Integration → Pragmatic Analysis.
- Chatbot = program that simulates human conversation.
- Script Bot = rule-based, cheap, limited. Smart Bot = AI-powered, adaptive, ChatGPT/Mitsuku.
- Corpus = whole collection of textual data from multiple documents.
- Text Normalisation (6 steps): Sentence Segmentation → Tokenisation → Remove Stop Words → Common Case → Stemming → Lemmatisation.
- Token = word/number/special character in a sentence.
- Stop Words = common words (a, the, is, and, to) that add no meaning. Removed.
- Stemming = fast affix-removal, result may be meaningless (studies → studi).
- Lemmatisation = slower, meaningful result (studies → study).
- Bag of Words = extract vocabulary + frequency. 4 steps: Pre-process → Dictionary → Document Vector → Repeat.
- Document Vector Table = matrix with unique words as columns, documents as rows, counts in cells.
- TF-IDF = Term Frequency × log(Inverse Document Frequency). Formula: TF-IDF(W) = TF(W) × log(IDF(W)).
- TF = frequency of word in ONE document.
- IDF = Total Documents / Document Frequency.
- Word value: Stop words = low · Frequent words = medium · Rare valuable words = high.
- TF-IDF Applications: Document Classification · Topic Modelling · Information Retrieval · Stop-Word Filtering.
- Code NLP tools: NLTK, SpaCy. No-Code NLP tools: Orange Data Mining, MonkeyLearn.
- Sentiment Analysis = classify text as positive / negative / neutral. Used in reviews, social-media monitoring, customer service, stock prediction.