6.1 Why Encoding?
A computer can store numbers directly in binary, but what about letters like A or क, or a smiley emoji? Each such character must first be assigned a unique number (called its code point); then the computer simply stores that number in binary. The rule that maps every character to a number is called a character encoding scheme.
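The character-to-number mapping can be explored directly in Python. The sketch below is only an illustration: it uses the built-in `ord()` and `chr()` functions, and the variable name `code_points` is chosen here for the example.

```python
# ord() returns the code point of a single character;
# chr() converts a code point back to its character.
code_points = {}
for ch in ["A", "क", "😀"]:
    code_points[ch] = ord(ch)
    # bin() shows the binary form the computer would work with
    print(ch, ord(ch), bin(ord(ch)))

# Going the other way: number -> character
print(chr(65))
```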
6.2 ASCII — American Standard Code for Information Interchange
- Introduced in 1963. Became the dominant scheme for English text on computers.
- Originally a 7-bit code, which gives 2⁷ = 128 unique code points (0–127).
- Covers upper- and lower-case English letters, digits, punctuation and control codes (e.g., ENTER, TAB, BELL).
- Later extended to 8 bits (Extended ASCII) → 256 code points, adding symbols like é, ñ, ©, ½.
| Character | ASCII (decimal) | ASCII (binary, 8-bit) |
|---|---|---|
| A | 65 | 01000001 |
| B | 66 | 01000010 |
| Z | 90 | 01011010 |
| a | 97 | 01100001 |
| z | 122 | 01111010 |
| 0 | 48 | 00110000 |
| 9 | 57 | 00111001 |
| Space | 32 | 00100000 |
| ! | 33 | 00100001 |
| Enter (CR) | 13 | 00001101 |
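Rows like the ones in this table can be reproduced with a few lines of Python. This is a minimal sketch; the helper name `ascii_row` is hypothetical and exists only for this example.

```python
def ascii_row(ch):
    """Return 'character | decimal | 8-bit binary' for one ASCII character."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not an ASCII character")
    # repr() makes invisible characters such as '\r' readable;
    # the format spec 08b pads the binary form to 8 digits.
    return f"{ch!r} | {code} | {code:08b}"

for ch in ["A", "a", "0", " ", "!", "\r"]:
    print(ascii_row(ch))
```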
- Upper-case letters run from 65 (A) to 90 (Z).
- Lower-case letters run from 97 (a) to 122 (z).
- The difference between the upper- and lower-case forms of the same letter is exactly 32.
- Digits '0'…'9' run from 48 to 57. The digit character '5' is not the number 5: its code is 53!
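The two facts above (the fixed gap of 32 and the digit codes starting at 48) are enough to convert case and read digits by arithmetic alone. The following is a sketch for illustration; `to_upper_ascii` and `digit_value` are hypothetical helper names, and in real programs `str.upper()` and `int()` do this work.

```python
CASE_OFFSET = ord("a") - ord("A")  # 97 - 65 = 32

def to_upper_ascii(ch):
    """Convert one lower-case ASCII letter to upper case by subtracting 32."""
    if "a" <= ch <= "z":
        return chr(ord(ch) - CASE_OFFSET)
    return ch  # anything else is returned unchanged

def digit_value(ch):
    """Turn a digit character into the number it represents."""
    if not "0" <= ch <= "9":
        raise ValueError("not a digit character")
    return ord(ch) - ord("0")  # e.g. 53 - 48 = 5

print(to_upper_ascii("q"))
print(ord("5"), digit_value("5"))
```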
6.3 ISCII — Indian Script Code for Information Interchange
- Developed by the Bureau of Indian Standards (BIS) in 1991, updated in 1999.
- An 8-bit scheme (256 code points). Lower 128 are identical to ASCII; upper 128 are used for Indian scripts.
- Supports ten Indian scripts (Devanagari, Bengali, Assamese, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam) through a clever script-switching byte.
- Served pre-Unicode Indian computing (Govt. records, Akashvani broadcast-automation systems) but has been largely replaced by Unicode.
6.4 Unicode — One Code for Every Character in the World
By the 1990s, with the Internet spreading globally, each country had its own encoding (Shift-JIS for Japanese, GB for Chinese, ISCII for Indian, etc.). This made sharing documents across regions a nightmare. Unicode was designed to end the chaos: one universal code table for every character of every living script, plus historical scripts, mathematical symbols and even emojis.
- Started in 1991; maintained by the Unicode Consortium.
- Each character is assigned a code point written as U+HHHH, where the H's are hexadecimal digits (four to six of them).
- Examples: A = U+0041, अ = U+0905, € = U+20AC, 😀 = U+1F600.
- Current standard contains over 1,40,000 characters spanning 150+ scripts.
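The U+ notation is simply the code point written in hexadecimal, padded to at least four digits. A minimal sketch, with the hypothetical helper names `u_notation` and `from_u_notation`:

```python
def u_notation(ch):
    """Format a character's code point in U+HHHH style."""
    # 04X = upper-case hexadecimal, zero-padded to at least 4 digits
    return f"U+{ord(ch):04X}"

def from_u_notation(text):
    """Convert a string such as 'U+20AC' back to its character."""
    return chr(int(text[2:], 16))

for ch in ["A", "अ", "€", "😀"]:
    print(ch, u_notation(ch))
```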
6.4.1 UTF-8 vs UTF-32
A code point is a number; UTF-8 and UTF-32 are two different ways of encoding that number as bytes on disk.
| Scheme | Bytes per character | ASCII-compatible? | Pros | Cons |
|---|---|---|---|---|
| UTF-8 | Variable: 1 – 4 bytes | Yes — ASCII fits in 1 byte unchanged | Space-efficient for English text; dominant on the Web & in files. | Slightly more work to find the n-th character (variable length). |
| UTF-32 | Fixed: always 4 bytes | No — every ASCII char takes 4 bytes | Every character is the same size — easy to index. | Wastes space for English-heavy text. |
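The "variable vs fixed" difference in the table can be checked by encoding a few characters and counting bytes. This sketch uses Python's built-in `str.encode()`; the codec name `"utf-32-be"` (big-endian, no byte-order mark) is chosen so that only the character's own four bytes are counted, and `byte_lengths` is a hypothetical helper name.

```python
def byte_lengths(ch):
    """Return (UTF-8 length, UTF-32 length) in bytes for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-32-be"))

# "\u00e9" is é written as an escape so it is a single code point
for ch in ["A", "\u00e9", "अ", "😀"]:
    utf8_len, utf32_len = byte_lengths(ch)
    print(ch, "UTF-8:", utf8_len, "bytes", "UTF-32:", utf32_len, "bytes")
```

The UTF-8 length grows with the size of the code point, while the UTF-32 length stays at 4.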
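Example: the text "Hi" (H = 48, i = 69 in hexadecimal) stored in each scheme, bytes shown in hexadecimal:
- ASCII: 48 69 (2 bytes)
- UTF-8: 48 69 (2 bytes, same as ASCII)
- UTF-32: 00 00 00 48 00 00 00 69 (8 bytes)

The word "नमस्ते" cannot be written in ASCII at all. It consists of 6 code points (न, म, स, ्, त, े); UTF-8 stores each of them in 3 bytes, 18 bytes in total, while UTF-32 needs exactly 24 bytes.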
When you save a .py file in VS Code, it is UTF-8 by default.
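The byte counts for "Hi" and "नमस्ते" can be verified with a short script. This is a sketch under a few assumptions: `hex_bytes` is a hypothetical helper, the Hindi word is written with escapes so its six code points are unambiguous, `"utf-32-be"` is used to leave out the byte-order mark, and `bytes.hex(" ")` needs Python 3.8 or later.

```python
word = "\u0928\u092e\u0938\u094d\u0924\u0947"  # नमस्ते, six code points

def hex_bytes(text, encoding):
    """Return the encoded bytes of text as space-separated hex pairs."""
    return text.encode(encoding).hex(" ").upper()

for encoding in ["ascii", "utf-8", "utf-32-be"]:
    print(encoding, ":", hex_bytes("Hi", encoding))

print(len(word), "code points")
print(len(word.encode("utf-8")), "bytes in UTF-8")
print(len(word.encode("utf-32-be")), "bytes in UTF-32")

# ASCII has no codes for Devanagari, so encoding fails
try:
    word.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print("fits in ASCII:", ascii_ok)
```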
📌 Quick Revision — Chapter 6 at a Glance
- Encoding = rule mapping every character to a unique number (code point).
- ASCII (1963) — 7-bit / 128 codes; extended 8-bit / 256. 'A' = 65, 'a' = 97, diff = 32.
- ISCII — 8-bit Indian standard (BIS, 1991). Lower 128 = ASCII; upper 128 = Indian scripts.
- Unicode — one universal table, code points written U+HHHH. Over 1,40,000 characters.
- UTF-8 — variable 1–4 bytes, ASCII-compatible, over 98% of the Web.
- UTF-32 — fixed 4 bytes per character — easy to index, wasteful for English text.