Chapter 06 · Encoding Schemes


UNIT 1 ▪ CHAPTER 6

Encoding Schemes

ASCII · ISCII · Unicode · UTF-8 · UTF-32

Learning Outcome 1: Explain encoding schemes — ASCII, ISCII and Unicode (UTF-8 / UTF-32).

6.1 Why Encoding?

A computer can store numbers directly in binary, but what about letters like A or क, or a smiley emoji? Each such character must first be assigned a unique number (called its code point); then the computer simply stores that number in binary. The rule that maps every character to a number is called a character encoding scheme.

How the letter "A" is stored: A → code point 65 → binary 01000001.
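The character → number → binary chain can be tried out in Python. This is a minimal sketch using the built-in ord(), chr() and format() functions; the helper name to_binary is made up for this illustration.

```python
def to_binary(ch, width=8):
    """Return the code point of ch as a zero-padded binary string."""
    return format(ord(ch), f"0{width}b")

print(ord("A"))        # 65        (character -> code point)
print(to_binary("A"))  # 01000001  (code point written in 8-bit binary)
print(chr(65))         # A         (code point -> character)
```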

6.2 ASCII — American Standard Code for Information Interchange

Introduced in 1963. Became the dominant scheme for English text on computers.
Originally a 7-bit code — gives 2⁷ = 128 unique code points (0–127).
Covers upper & lower case English letters, digits, punctuation and control codes (e.g., ENTER, TAB, BELL).
Later extended to 8 bits ("Extended ASCII", in practice several incompatible variants such as ISO 8859-1) → 256 code points, adding symbols like é, ñ, ©, ½.

Character	ASCII (decimal)	ASCII (binary, 8-bit)
`A`	65	01000001
`B`	66	01000010
`Z`	90	01011010
`a`	97	01100001
`z`	122	01111010
`0`	48	00110000
`9`	57	00111001
Space	32	00100000
`!`	33	00100001
Enter (CR)	13	00001101

Handy facts:

Upper-case letters start at 65 (A) → 90 (Z).
Lower-case letters start at 97 (a) → 122 (z).
Difference between upper- and lower-case of the same letter is exactly 32.
Digits '0'…'9' start at 48 → 57. The digit character '5' is not the number 5 — it is 53!
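The handy facts above can be checked with a few lines of Python. This is only a sketch: the function names swap_case_ascii and digit_value are hypothetical helpers written for this example, and they deliberately handle plain ASCII characters only.

```python
def swap_case_ascii(ch):
    """Flip the case of one ASCII letter by adding or subtracting 32."""
    code = ord(ch)
    if 65 <= code <= 90:       # 'A'..'Z'
        return chr(code + 32)
    if 97 <= code <= 122:      # 'a'..'z'
        return chr(code - 32)
    return ch                  # not an ASCII letter: leave unchanged

def digit_value(ch):
    """Convert a digit character '0'..'9' to the number it stands for."""
    return ord(ch) - ord("0")  # ord("0") is 48

print(swap_case_ascii("A"))    # a
print(swap_case_ascii("z"))    # Z
print(ord("5"))                # 53 (the code of the character '5')
print(digit_value("5"))        # 5  (the number itself)
```

In everyday code the built-in str.upper(), str.lower() and int() do the same job; the arithmetic version is shown only to make the ASCII layout visible.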

6.3 ISCII — Indian Script Code for Information Interchange

Developed by the Bureau of Indian Standards (BIS) in 1991, updated in 1999.
An 8-bit scheme (256 code points). Lower 128 are identical to ASCII; upper 128 are used for Indian scripts.
Supports ten Indian scripts (Devanagari, Bengali, Assamese, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam) through a clever script-switching byte.
Served pre-Unicode Indian computing (Govt. records, Akashvani broadcast-automation systems) but has been largely replaced by Unicode.

6.4 Unicode — One Code for Every Character in the World

By the 1990s, with the Internet spreading globally, each region had its own encoding (Shift-JIS for Japanese, GB for Chinese, ISCII for Indian scripts, etc.). This made sharing documents across regions a nightmare. Unicode was designed to end the chaos: one universal code table meant to cover every character of every script in use, plus historical scripts, mathematical symbols and even emoji.

Version 1.0 was published in 1991; the standard is maintained by the Unicode Consortium.
Each character is assigned a code point written as U+HHHH (hexadecimal).
Examples: A = U+0041, अ = U+0905, € = U+20AC, 😀 = U+1F600.
Current standard contains over 1,40,000 characters spanning 150+ scripts.
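The U+HHHH notation is simply the code point written in hexadecimal, padded to at least four digits. A short Python sketch (code_point_label is a hypothetical helper name, not a standard function) reproduces the examples:

```python
def code_point_label(ch):
    """Format a character's code point as U+HHHH (at least 4 hex digits)."""
    return f"U+{ord(ch):04X}"

for ch in ["A", "अ", "€", "😀"]:
    print(ch, code_point_label(ch))
# A U+0041
# अ U+0905
# € U+20AC
# 😀 U+1F600
```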

6.4.1 UTF-8 vs UTF-32

A code point is a number; UTF-8 and UTF-32 are two different ways of encoding that number as bytes, whether on disk, in memory or over a network.

Scheme	Bytes per character	ASCII-compatible?	Pros	Cons
UTF-8	Variable: 1 – 4 bytes	Yes — ASCII fits in 1 byte unchanged	Space-efficient for English text; dominant on the Web & in files.	Slightly more work to find the n-th character (variable length).
UTF-32	Fixed: always 4 bytes	No — every ASCII char takes 4 bytes	Every character is the same size — easy to index.	Wastes space for English-heavy text.
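The "bytes per character" column can be observed directly with Python's str.encode(). In this sketch, byte_lengths is a hypothetical helper, and the codec name "utf-32-be" is chosen (as an assumption for the example) so that no byte-order mark is added to the output.

```python
def byte_lengths(ch):
    """Return (UTF-8 size, UTF-32 size) in bytes for one character."""
    utf8 = ch.encode("utf-8")
    utf32 = ch.encode("utf-32-be")  # big-endian, no byte-order mark
    return len(utf8), len(utf32)

for ch in ["A", "é", "अ", "😀"]:
    print(ch, byte_lengths(ch))
# A (1, 4)
# é (2, 4)
# अ (3, 4)
# 😀 (4, 4)
```

The plain "utf-32" codec in Python prepends a 4-byte byte-order mark, which is why the sizes would look 4 bytes larger with it.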

Same word, different encodings. The word "Hi" (2 characters), with bytes shown in hexadecimal and UTF-32 in big-endian byte order:

ASCII       :  48 69                         (2 bytes)
UTF-8       :  48 69                         (2 bytes, same as ASCII)
UTF-32      :  00 00 00 48  00 00 00 69      (8 bytes)

The word "नमस्ते" cannot be written in ASCII at all. It consists of 6 code points, each of which takes 3 bytes in UTF-8, so UTF-8 stores it in exactly 18 bytes and UTF-32 in exactly 24 bytes.
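These byte counts can be verified in Python. The sketch below assumes Python 3.8 or later (for the separator argument of bytes.hex()); hex_bytes is a hypothetical helper name, and "utf-32-be" is used so that no byte-order mark is included.

```python
def hex_bytes(text, encoding):
    """Encode text and return its bytes as space-separated hex pairs."""
    return text.encode(encoding).hex(" ")

print(hex_bytes("Hi", "ascii"))      # 48 69
print(hex_bytes("Hi", "utf-8"))      # 48 69
print(hex_bytes("Hi", "utf-32-be"))  # 00 00 00 48 00 00 00 69

word = "नमस्ते"
print(len(word))                      # 6 code points
print(len(word.encode("utf-8")))      # 18 bytes
print(len(word.encode("utf-32-be")))  # 24 bytes

try:
    word.encode("ascii")
except UnicodeEncodeError:
    print("not representable in ASCII")
```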

Why UTF-8 won the Web: it is backward compatible with ASCII, keeps English text tiny, and can still represent every Unicode character. Today more than 98% of all web pages are served in UTF-8. When you save a Python .py file in VS Code, it is UTF-8 by default.

Quick Revision — Chapter 6 at a Glance

Encoding = rule mapping every character to a unique number (code point).
ASCII (1963) — 7-bit / 128 codes; extended 8-bit / 256. 'A' = 65, 'a' = 97, diff = 32.
ISCII — 8-bit Indian standard (BIS, 1991). Lower 128 = ASCII; upper 128 = Indian scripts.
Unicode — one universal table, code points written U+HHHH. Over 1,40,000 characters.
UTF-8 — variable 1–4 bytes, ASCII-compatible, over 98% of the Web.
UTF-32 — fixed 4 bytes per character — easy to index, wasteful for English text.
