PART B ▪ UNIT 6
Natural Language Processing
Stages · Chatbots · Text Processing · Bag of Words · TF-IDF
Natural Language Processing (NLP) is a sub-field of AI focused on enabling computers to analyse, understand and process human languages to derive meaningful information. It is what makes Google Translate, Alexa, ChatGPT and email spam filters work.
Computers only understand binary language (0s and 1s). NLP facilitates the conversion from natural language into digital form, making communication between computer systems and humans possible.
Learning Outcome 1: Understand Natural Languages & the need for NLP

6.1 Features of Natural Languages

A natural language is a human language — French, Spanish, English, Hindi, Japanese, etc. Its features:

📏 Governed by Rules: Syntax, lexicon and semantics — every language follows set rules.
🔄 Redundancy: The same information can be conveyed in multiple ways — "I am tired" / "I feel exhausted".
⏳ Languages Change Over Time: New words are added, old ones fade; grammar evolves.
🔹 Same-Sounding Words Have Different Meanings
Choose the right word:
• "I am so tired; I want to take a ___." (nap / knap)
• "Let's ___ her a letter." (write / right)

Same sound, different spelling, totally different meaning!
🔹 Same Word, Different Meanings in Context
Three sentences using "red":
  • His face turned red after he found out he took the wrong bag. → ashamed / angry?
  • The red car zoomed past his nose. → colour of the car
  • His face turns red after consuming the medicine. → allergic reaction?
Context is everything! We understand sentences almost intuitively based on our history of using the language. The same word can have multiple meanings — NLP must figure out the intended meaning from context.

6.2 Computer Language vs Natural Language

💻 Computer Languages

Languages used to interact with a computer — Python, C++, Java, HTML. Strict rules, no ambiguity.

🗣️ Natural Languages

Human languages — English, Hindi, Tamil. Many rules, many exceptions, lots of ambiguity.

6.3 Why is NLP Important?

  • Computers can only process electronic signals in binary form — they don't understand "hello" directly.
  • NLP converts natural language to digital form so computers can process it.
  • Makes communication between humans and computers possible.
  • Creates different tools and techniques for better communication of intent and context.
  • Powers everyday tools — search engines, chatbots, email filters, voice assistants, translation apps.
  • Enables accessibility — auto-captions for the hearing impaired, screen readers for the visually impaired.
Learning Outcome 2: Explore real-life applications of NLP

6.4 Real-Life Applications of NLP

🎬 Auto-Generated Captions: Captions generated by turning natural speech into text in real time. Enhances accessibility. Example: YouTube and Google Meet auto-captions.
🗣️ Voice Assistants: Take natural speech, process it, return an output. "Hey Google, set an alarm for 3:30 pm" · "Hey Alexa, play music" · "Hey Siri, what's the weather?"
🌐 Language Translation: Converts text/speech from one language to another. Example: Google Translate, Microsoft Translator — helps cross-linguistic communication.
😊 Sentiment Analysis: Detects whether a text is positive, negative or neutral. Used in customer-service feedback, social-media monitoring, brand analysis.
📚 Text Classification: Classifies a sentence or document category-wise. Example: news articles categorised into Food, Sports, Politics.
🔑 Keyword Extraction: Automatically extracts the most important words / expressions from text. Useful for SEO, customer service, social-media insights.
📧 Spam Filters: Identifies spam emails by analysing the subject line, content and sender — one of the earliest NLP applications.
🤖 Chatbots: Simulate human conversation for customer support, bookings, FAQs.
Activity — Keyword Extraction with Google API:
Step 1: Visit https://cloud.google.com/natural-language.
Step 2: Paste any paragraph in the text box → Click Analyze.
Observe: Keywords from the paragraph get highlighted in different colours — e.g., "Google", "Mountain View". Try your own text and see results!
Learning Outcome 3: Understand the stages of NLP

6.5 Stages of Natural Language Processing

NLP processes text through 5 sequential stages — each stage adds a deeper layer of understanding:

🔄 5 STAGES OF NLP
1. Lexical 2. Syntactic 3. Semantic 4. Discourse 5. Pragmatic

📝 1. Lexical Analysis

NLP starts with identifying the structure of the input. Lexical analysis is the process of dividing a large chunk of text into paragraphs, sentences, and words.

Lexicon = collection of various words and phrases used in a language.

Long text → broken down into chunks: sentences → words → tokens.
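The text → sentences → words → tokens breakdown can be sketched in a few lines of Python. This is an illustrative regex-based splitter, not a production segmenter (real libraries also handle abbreviations such as "Dr."):

```python
import re

def segment_sentences(text):
    # Naive sentence segmentation: split after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

corpus = "NLP is fun. It powers chatbots! Does it power translation too?"
print(segment_sentences(corpus))
# → ['NLP is fun.', 'It powers chatbots!', 'Does it power translation too?']
```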

🔗 2. Syntactic Analysis / Parsing

Process of checking the grammar of sentences and phrases. Forms a relationship among words and eliminates logically incorrect sentences.
Correct grammar: "The boy runs fast." ✓
Incorrect grammar: "Boy runs the fast." ✗ — rejected at this stage.

💭 3. Semantic Analysis

Input text is checked for meaning. Every word and phrase is checked for meaningfulness.
Makes sense: "The dog jumped over the fox." ✓
Does NOT make sense: "Hot ice cream" — semantically wrong. Rejected.
Also rejected: "The fox jumped into the dog." — grammatically right but semantically odd.

📖 4. Discourse Integration

Process of forming the story of the sentence. Every sentence should have a relationship with its preceding and succeeding sentences.
Good flow:
"Rahul went to the store. He bought milk. Then he came home."
Each sentence connects with the next — the pronouns and connectors are clear.

🎯 5. Pragmatic Analysis

Sentences are checked for their relevance in the real world. Pragmatic means practical or logical — understanding the intent of the sentence and discarding the literal meaning when necessary.
Literal: "Can you pass the salt?" → semantic answer: "Yes."
Pragmatic (intended): "Pass the salt." → the real meaning is a request, not a yes/no question. Pragmatic analysis captures this real-world intent.
🔹 Summary of 5 Stages
Stage · What It Checks
1. Lexical: Structure — break text into paragraphs, sentences, words, tokens.
2. Syntactic: Grammar — are the words arranged correctly?
3. Semantic: Meaning — does this make any sense?
4. Discourse: Story flow — do sentences connect?
5. Pragmatic: Intent — what's the real-world meaning?
Learning Outcome 4: Understand Chatbots — Script Bot vs Smart Bot

6.6 Chatbots

A chatbot is a computer program designed to simulate human conversation through voice commands, text chats, or both. It can learn over time how to interact better, answer questions, troubleshoot problems, qualify leads and boost sales.
🔹 Play with Chatbots — Activity
Try each chatbot below and compare the experience:
  • Elizabot: masswerk.at/elizabot/ (one of the earliest chatbots, from 1966)
  • Mitsuku / Kuki: kuki.ai (multiple-time winner of the Loebner Prize)
  • Cleverbot: cleverbot.com (learns from conversations)
  • Singtel: singtel.com/personal/support (customer-service chatbot)
Discussion: Which chatbot did you try? What was its purpose? Did it feel like talking to a human or a robot? Does the chatbot have a personality?

6.7 Script Bot vs Smart Bot

After interacting with chatbots, you'll notice two types:

📜 Script Bot (Traditional)
  • Works on pre-defined scripts / rules.
  • Can only handle topics the developer programmed.
  • Cheap and fast to build.
  • Cannot learn from conversations.
  • Falls apart when user asks something unscripted.
  • Example: a pizza-ordering bot that only handles pizza orders.
🧠 Smart Bot (AI-Powered)
  • Uses Machine Learning and NLP.
  • Handles wider range of topics naturally.
  • Learns and improves from conversations.
  • Costlier and more complex to build.
  • Can maintain context across a conversation.
  • Example: ChatGPT, Alexa, Mitsuku.
Learning Outcome 5: Understand Text Processing — Normalisation, Bag of Words, TF-IDF

6.8 Why Text Processing?

Computers can only process numbers. So the first step of NLP is to convert human language to numbers. This happens in stages — starting with Text Normalisation.

Text Normalisation cleans up textual data so its complexity is reduced to a level the machine can handle. Before we begin, we need to understand: a corpus is the whole collection of written text from multiple documents.

6.9 Steps of Text Normalisation

🔄 TEXT NORMALISATION PIPELINE
1. Sentence Segmentation 2. Tokenisation 3. Remove Stop Words
4. Convert to Common Case 5. Stemming 6. Lemmatisation

✂️ 1. Sentence Segmentation

The whole corpus is divided into individual sentences. Each sentence becomes a separate data item.

🔤 2. Tokenisation

Each sentence is further divided into tokens. A token is any word, number, or special character occurring in the sentence; each one becomes a separate token.

"I love AI!"Tokens: ["I", "love", "AI", "!"]

🚫 3. Removing Stop Words, Special Characters & Numbers

Stop words are words that occur very frequently in the corpus but do not add value to the meaning. They are grammatical glue — useful for humans, useless for machines.

Examples of stop words: a, an, the, is, in, at, of, to, and, but, or, this, that, which, these, those, for, with, from, by, on, …
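Stop-word removal is a simple filter against a fixed word list. The set below is a small illustrative subset of the examples above; real toolkits ship much longer lists:

```python
# Illustrative stop-word list (a real toolkit's list is longer).
STOP_WORDS = {"a", "an", "the", "is", "in", "at", "of", "to", "and", "but",
              "or", "this", "that", "which", "these", "those", "for",
              "with", "from", "by", "on"}

def remove_stop_words(tokens):
    # Keep only tokens that carry meaning; drop the grammatical glue.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "dog", "jumped", "over", "the", "fence"]))
# → ['dog', 'jumped', 'over', 'fence']
```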

🔡 4. Converting Text to a Common Case

After stop-word removal, convert the whole text to lowercase (usually). This ensures that case sensitivity doesn't make the machine treat the same word as different.

All 6 forms — "Hello", "HELLO", "hello", "HeLLo", "heLLO", "hEllO" — become "hello" and are treated as the same word.

🌱 5. Stemming

Process in which the affixes (prefixes/suffixes) of words are removed and words are converted to their base form.
Word       Affix   Stem
healed     -ed     heal
healing    -ing    heal
healer     -er     heal
studies    -es     studi (not meaningful!)
studying   -ing    study
Stemming limitation: The stemmed word may not be meaningful — it just removes affixes mechanically. "studies" becomes "studi" which isn't a word. Stemming is faster but less accurate.
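The table above can be reproduced with a toy stemmer that strips suffixes mechanically. This is a deliberately crude sketch (not the real Porter algorithm) to show why the result may not be a real word:

```python
def crude_stem(word):
    # Strip the first matching suffix; there is no dictionary check,
    # so the result may be meaningless (the stemming limitation).
    for suffix in ("ing", "ed", "er", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

for w in ["healed", "healing", "healer", "studies", "studying"]:
    print(w, "->", crude_stem(w))  # studies -> studi (not a real word!)
```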

📖 6. Lemmatisation

Like stemming — but the result (called a lemma) is always a meaningful word. Lemmatisation takes longer but produces accurate base forms.
Word       Affix   Lemma
healed     -ed     heal
healing    -ing    heal
healer     -er     heal
studies    -es     study
studying   -ing    study
🔹 Stemming vs Lemmatisation — Quick Compare
🌱 Stemming

Faster · removes affixes blindly · result may be meaningless (studies → studi).

📖 Lemmatisation

Slower · uses dictionary · result is always meaningful (studies → study).
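The dictionary-based behaviour can be mimicked with a lookup table. The tiny table below is illustrative only; real lemmatisers (e.g. NLTK's WordNetLemmatizer or spaCy) consult a full dictionary plus part-of-speech information:

```python
# Tiny illustrative lemma dictionary.
LEMMAS = {"healed": "heal", "healing": "heal", "healer": "heal",
          "studies": "study", "studying": "study"}

def lemmatise(word):
    # A dictionary lookup guarantees a meaningful base form;
    # unknown words are returned unchanged.
    return LEMMAS.get(word, word)

print(lemmatise("studies"))  # → study (stemming would give "studi")
```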

6.10 Bag of Words Algorithm

Bag of Words is an NLP model that extracts features from text which can be used in machine-learning algorithms. It gives us:
  1. A vocabulary of words for the corpus.
  2. The frequency of these words (how many times each has occurred).
The "bag" term symbolises that the sequence of words does not matter — we only care about unique words and their frequency.
🔹 4-Step Bag of Words Implementation
🛍️ BAG OF WORDS
1. Text Processing 2. Create Dictionary 3. Create Document Vectors 4. Repeat for All Documents
🔹 Worked Example
Three sample documents:
Doc 1: "Aman and Avni are stressed"
Doc 2: "Aman went to a therapist"
Doc 3: "Avni went to download a health chatbot"
🔹 Step 1 — After Text Normalisation
Doc 1 → [aman, and, avni, are, stressed]
Doc 2 → [aman, went, to, a, therapist]
Doc 3 → [avni, went, to, download, a, health, chatbot]
(Each document is lowercased and tokenised.)
🔹 Step 2 — Create Dictionary (unique words from all docs)

Dictionary: aman · and · avni · are · stressed · went · to · a · therapist · download · health · chatbot (12 unique words)

🔹 Step 3 & 4 — Document Vector Table
Doc     aman  and  avni  are  stressed  went  to  a  therapist  download  health  chatbot
Doc 1    1     1    1     1      1       0    0   0      0          0        0       0
Doc 2    1     0    0     0      0       1    1   1      1          0        0       0
Doc 3    0     0    1     0      0       1    1   1      0          1        1       1

Each cell = how many times the word appears in that document. This table is the document vector table.
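The four steps above can be sketched in plain Python (splitting on spaces for simplicity; a real pipeline would run full text normalisation first):

```python
def bag_of_words(documents):
    # Step 2: dictionary of unique words, in order of first appearance.
    vocab = []
    for doc in documents:
        for word in doc.lower().split():
            if word not in vocab:
                vocab.append(word)
    # Steps 3-4: one frequency vector per document.
    return vocab, [[doc.lower().split().count(w) for w in vocab]
                   for doc in documents]

docs = ["Aman and Avni are stressed",
        "Aman went to a therapist",
        "Avni went to download a health chatbot"]
vocab, vectors = bag_of_words(docs)
print(len(vocab))   # → 12
print(vectors[0])   # → [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
```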

6.11 TF-IDF (Term Frequency–Inverse Document Frequency)

TF-IDF = Term Frequency × Inverse Document Frequency. It helps us identify the value of each word — words that appear a lot in one document but rarely across the corpus are highly valuable.

📊 Term Frequency (TF)

Term Frequency = frequency of a word in one document. Found directly in the document vector table.

In our example, TF of "aman" in Doc 1 = 1. TF of "stressed" in Doc 1 = 1.

📉 Document Frequency (DF)

Document Frequency = number of documents in which the word occurs, irrespective of how many times.

From our vocabulary: aman = 2 · and = 1 · avni = 2 · are = 1 · stressed = 1 · went = 2 · to = 2 · a = 2 · therapist = 1 · download = 1 · health = 1 · chatbot = 1.

📈 Inverse Document Frequency (IDF)

IDF(word) = Total Number of Documents / Document Frequency of the word

For our 3-document example: IDF of "aman" = 3/2 = 1.5. IDF of "stressed" = 3/1 = 3.

🎯 TF-IDF Formula

TF-IDF(W) = TF(W) × log(IDF(W))

(log is to base 10.)
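The formula is a one-liner; the values below recompute the example from the three-document corpus, with log to base 10 as stated:

```python
import math

def tf_idf(tf, doc_freq, total_docs):
    # TF-IDF(W) = TF(W) × log10(Total Documents / Document Frequency)
    return tf * math.log10(total_docs / doc_freq)

# "aman": TF 1 in Doc 1, occurs in 2 of the 3 documents.
print(round(tf_idf(1, 2, 3), 3))  # → 0.176
# "stressed": TF 1 in Doc 1, occurs in only 1 document, so a higher value.
print(round(tf_idf(1, 1, 3), 3))  # → 0.477
```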

🔹 Understanding the Plot — Words vs Value
🔻 Stop Words

Highest frequency in all documents but negligible value — "a, and, the, is". Removed at pre-processing.

⚖️ Frequent Words

Adequate occurrence — talk about the document's subject. Some amount of value.

💎 Rare / Valuable Words

Lowest occurrence but highest value — they add the most meaning to the corpus.

🔹 TF-IDF Worked Example — "air pollution"
Case 1 — "and" (stop word):
• Total documents = 10. "and" occurs in all 10 documents.
• IDF(and) = 10/10 = 1. log(1) = 0.
• TF-IDF(and) = TF × 0 = 0. Correctly identified as worthless.

Case 2 — "pollution" (valuable word):
• Total documents = 10. "pollution" occurs in 3 documents.
• IDF(pollution) = 10/3 = 3.33. log(3.33) ≈ 0.522.
• TF-IDF(pollution) = TF × 0.522 = high value. Correctly identified as important.
Summary of TF-IDF Insight:
  1. Words in all documents with high TF → low value → stop words.
  2. For high TF-IDF → word needs high TF but low DF — important for one doc, not common everywhere.
  3. Higher the value → more important the word is for a given corpus.

6.12 Applications of TF-IDF

📑 Document Classification: Classifies the type and genre of a document.
🎯 Topic Modelling: Predicts the topic for a corpus.
🔍 Information Retrieval: Extracts important information from a large corpus — search engines.
🗑️ Stop-Word Filtering: Helps remove unnecessary words from text automatically.

6.13 Code vs No-Code NLP Tools

💻 Code NLP · Tools: NLTK, SpaCy
  • NLTK (Natural Language Toolkit): a Python package for text processing, providing functions and modules for NLP.
  • SpaCy: an open-source NLP library for tokenisation, POS tagging, named-entity recognition and dependency parsing.
🟠 No-Code NLP · Tools: Orange, MonkeyLearn
  • Orange Data Mining: an ML tool usable through Python or visual drag-and-drop programming.
  • MonkeyLearn: a text-analysis platform offering NLP tools and ML models for classification, sentiment analysis and entity recognition.

6.14 Sentiment Analysis — Use Case Walkthrough

Sentiment Analysis is the process of using NLP to determine whether a piece of text expresses a positive, negative, or neutral sentiment. Also called opinion mining.
🔹 Real-World Use Cases of Sentiment Analysis: product-review mining · social-media monitoring · customer-service feedback · brand analysis.
🔹 Sentiment Analysis in Orange Data Mining

Orange has dedicated NLP widgets for sentiment analysis. The flow mirrors the AI Project Cycle:

Stage · Orange Widget / Action
1. Problem Scoping: Decide — are we analysing movie reviews, tweets, or customer feedback?
2. Data Acquisition: Load the text dataset via the Corpus or File widget.
3. Data Exploration: Word Cloud and Data Table widgets to see the data.
4. Pre-processing: Preprocess Text widget — tokenisation, stop-word removal, lowercasing, stemming/lemmatisation.
5. Modelling: Sentiment Analysis widget (VADER / Liu-Hu) — scores each sentence.
6. Evaluation / Visualisation: Box Plot and Scatter Plot to see the positive/negative/neutral distribution.
🔹 Interpretation of Sentiment Scores
😊 Positive (> 0.05): "Great product!", "Loved it!", "Amazing service".
😐 Neutral (−0.05 to 0.05): "The product arrived today." · "The shop opens at 10."
😡 Negative (< −0.05): "Worst service!", "Totally disappointed", "Broken on arrival".
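These thresholds can be wired into a bare-bones scorer. This is an illustrative sketch with made-up mini-lexicons, not the VADER or Liu-Hu lexicons the Orange widget actually uses:

```python
# Hypothetical mini-lexicons for illustration only.
POSITIVE = {"great", "loved", "amazing", "good", "excellent"}
NEGATIVE = {"worst", "disappointed", "broken", "bad", "terrible"}

def sentiment(text):
    # Score = (positive hits - negative hits) / word count,
    # then apply the ±0.05 thresholds from the table above.
    words = [w.strip("!,.").lower() for w in text.split()]
    score = (sum(w in POSITIVE for w in words)
             - sum(w in NEGATIVE for w in words)) / max(len(words), 1)
    if score > 0.05:
        return "positive"
    if score < -0.05:
        return "negative"
    return "neutral"

print(sentiment("Great product!"))         # → positive
print(sentiment("The shop opens at 10."))  # → neutral
print(sentiment("Worst service!"))         # → negative
```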

Quick Revision — Key Points to Remember

  • NLP = AI sub-field to analyse, understand, process human language. Purpose: enable communication between computers and humans.
  • Features of natural languages: governed by rules · redundant · change over time · same-sounding words · context-dependent meaning.
  • Applications: Auto-captions · Voice Assistants (Alexa, Siri, Google) · Translation (Google Translate) · Sentiment Analysis · Text Classification · Keyword Extraction · Spam Filters · Chatbots.
  • 5 Stages of NLP: Lexical → Syntactic (Parsing) → Semantic → Discourse Integration → Pragmatic Analysis.
  • Chatbot = program that simulates human conversation.
  • Script Bot = rule-based, cheap, limited. Smart Bot = AI-powered, adaptive, ChatGPT/Mitsuku.
  • Corpus = whole collection of textual data from multiple documents.
  • Text Normalisation (6 steps): Sentence Segmentation → Tokenisation → Remove Stop Words → Common Case → Stemming → Lemmatisation.
  • Token = word/number/special character in a sentence.
  • Stop Words = common words (a, the, is, and, to) that add no meaning. Removed.
  • Stemming = fast affix-removal, result may be meaningless (studies → studi).
  • Lemmatisation = slower, meaningful result (studies → study).
  • Bag of Words = extract vocabulary + frequency. 4 steps: Pre-process → Dictionary → Document Vector → Repeat.
  • Document Vector Table = matrix with unique words as columns, documents as rows, counts in cells.
  • TF-IDF = Term Frequency × log(Inverse Document Frequency). Formula: TF-IDF(W) = TF(W) × log(IDF(W)).
  • TF = frequency of word in ONE document.
  • IDF = Total Documents / Document Frequency.
  • Word value: Stop words = low · Frequent words = medium · Rare valuable words = high.
  • TF-IDF Applications: Document Classification · Topic Modelling · Information Retrieval · Stop-Word Filtering.
  • Code NLP tools: NLTK, SpaCy. No-Code NLP tools: Orange Data Mining, MonkeyLearn.
  • Sentiment Analysis = classify text as positive / negative / neutral. Used in reviews, social-media monitoring, customer service, stock prediction.
🧠 Practice Quiz — test yourself on this chapter