Legacy encodings vs. Unicode
| LEGACY ENCODINGS |
1. Character Set vs. Encoding
- Character set: a repertoire ("bucket") of characters, e.g. ASCII: 94 printable graphic characters plus space and the control characters (extended by ISO 8859-1:1998).
- Encoding: specifies how the code points of a coded character set are represented and transmitted by a computer; put simply, the mapping of a character to a numeric (byte) value.
- Asian character encodings usually take 2 bytes: in principle 256 rows x 256 cells, but typically only a 94×94 matrix is used for the printable characters.
- A mixed one- and two-byte character stream efficiently represents a mixture of English and Asian text.
e.g. EUC-KR (0xA1A1 ~ 0xFEFE) = 0x5E x 0x5E => 94 x 94 => 8836 code points, which covers KS C 5601 = 4888 (hanja) + 2350 (hangul) + 986 (symbols) = 8224 characters.
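The arithmetic above can be checked with a few lines of Python. This is only a sketch: `row_cell` is a hypothetical helper name, and it assumes the plain 0xA1 - 0xFE range for both bytes as stated in the example.

```python
# Sketch: the 94 x 94 grid behind a double-byte EUC code area.
LO, HI = 0xA1, 0xFE                  # lead and trail byte range (from the text)

rows = cells = HI - LO + 1           # 0x5E = 94
total_code_points = rows * cells     # 94 * 94

def row_cell(two_bytes):
    """Return the 1-based (row, cell) position of an EUC double-byte code."""
    lead, trail = two_bytes
    if not (LO <= lead <= HI and LO <= trail <= HI):
        raise ValueError("not inside the 0xA1A1-0xFEFE area")
    return lead - 0xA0, trail - 0xA0

ks_c_5601_total = 4888 + 2350 + 986  # hanja + hangul + symbols

print(rows, total_code_points, ks_c_5601_total)
print(row_cell(b"\xb0\xa1"))         # position of hangul GA (0xB0A1)
```

The gap between 8836 and 8224 is simply the part of the grid that KS C 5601 leaves unassigned.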
2. Encoding Types
- MODAL (requires escape sequences for switching between character sets, 7bits): ISO 2022, UTF-7
- NON-MODAL (makes use of numeric values of bytes in order to decide when to switch, 8bits): Big Five, EUC, GBK, Johab, Shift-JIS, UTF-8, UTF-16 and so forth.
- FIXED WIDTH: ASCII, UCS-2, UCS-4 (for internal text processing)
- Simplified Chinese
| Charsets | GB 2312-80 | Encodings | HZ(7bit, ASCII + GB2312-80, MIME:HZ-GB-2312), CN-GB(8bit, ASCII + GB2312-80, MIME: CN-GB), EUC-CN(8bit, GB 2312:80), CP936 |
- Traditional Chinese
| Charsets | GB 12345-90, CNS 11643: 1992(Plane 1 and 2 are identical to Big Five), Big Five(8bit), Big Five Plus, Eten, HKSCS | Encodings | EUC-TW(8bit, modal encoding because shift chars are used), Big Five, CP950 |
- Korean
| Charsets | KS X 1001:1992(= KSC 5601), KSX 1003:1993 (KS Roman), KSX 1005:1995(Unicode 2.0) | Encodings | EUC-KR(=KSX 1001 + KSX 1003), CP949, ISO 2022-KR, Johab, CP1361 |
- Unicode and GB 18030
| Charsets | Unicode/ISO 10646/GB 13000-1992, GB 18030-2000 | Encodings | Unicode/ISO 10646, GB 18030-2000, ISO 2022-CN, ISO 2022-CN-EXT |
Big Five and GB 18030 are each defined as both a character set and an encoding.
KS X 1001 is a character set only; it relies on ISO 2022 as its byte-encoding standard.
3. Escape Sequence
- ASCII and double-byte characters can be distinguished by use of the shift function.
Reference: IRCCSS (International Register Of Coded Character Sets To Be Used With Escape Sequences): http://www.itscj.ipsj.or.jp/ISO-IR/
- ISO 2022
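Two-stage mapping is needed to encode 94 / 96 character sets in ISO 2022: an escape sequence first designates a character set to one of G0 - G3, then a shift function (SO, SI, SS2, SS3) invokes it.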
- ISO 2022-CN
7 bit, multibyte encoding
- Includes four charsets: ASCII, GB 2312-80, CNS 11643-1992 Plane 1, CNS 11643-1992 Plane 2
Designators: (Must appear on each line containing characters from a double byte character set)
| ESC $ ) A | GB 2312-80 | ESC $ ) G | CNS 11643-1992 Plane 1 |
| ESC $ * H | CNS 11643-1992 Plane 2 |
| SO (0x0e = Ctrl-N) | to double byte mode (e.g. KSX 1001, GB 2312, CNS 11643-1992 Plane 1) |
| SI (0x0f = Ctrl-O) | to single byte mode (e.g. ASCII) |
| SS2 (Single Shift Sequence) | 0x1B 0x4E -> ISO 2022-CN, CNS 11643-1992 Plane 2: double byte mode for only the next two bytes. |
| SS3 | 0x1B 0x4F: CNS 11643-1992 Plane 3 - 7 |
| NOTE: ESC = 0x1B | |
e.g. [GB 2312 characters] and [CNS 11643-1992 Plane 1 characters]
0x1B $ ) A 0x0E [GB 2312 characters] 0x0F and 0x1B $ ) G 0x0E [CNS 11643-1992 Plane 1 characters] 0x0F
- ISO 2022-KR
| Designator | ESC $ ) C (0x1b 24 29 43) -> G1 | SO (0x0e = Ctrl-N) | Shift out to double byte mode (KSX 1001) |
| SI (0x0f = Ctrl-O) | Shift in to single byte mode (e.g. ASCII) |
| SS2, SS3 - not available in KSX 1001 | |
- HZ
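~{ switches to GB mode (each GB 2312 character is sent as two 7-bit bytes)
~} switches out of GB mode, back to ASCII
~~ stands for a literal ~ in ASCII mode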
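The three HZ escapes can be illustrated with a small hand-written encoder. This is a minimal sketch, not a full implementation: `to_hz` is a hypothetical helper, it ignores HZ's line-continuation rule, and it leans on Python's built-in `gb2312` and `hz` codecs for the table lookup and for the round-trip check.

```python
def to_hz(text):
    """Encode ASCII + GB 2312 text as HZ by hand (hypothetical helper)."""
    out = bytearray()
    in_gb = False
    for ch in text:
        if ord(ch) < 0x80:
            if in_gb:                      # leave GB mode before ASCII
                out += b"~}"
                in_gb = False
            out += b"~~" if ch == "~" else ch.encode("ascii")
        else:
            hi, lo = ch.encode("gb2312")   # raises if not in GB 2312
            if not in_gb:                  # enter GB mode
                out += b"~{"
                in_gb = True
            out += bytes([hi & 0x7F, lo & 0x7F])   # clear the MSB of each byte
    if in_gb:
        out += b"~}"
    return bytes(out)

sample = "a~\u4e2d\u6587b"                 # "a~" + two hanzi (D6D0 CEC4) + "b"
encoded = to_hz(sample)
print(encoded)
print(encoded.decode("hz") == sample)
```

Every output byte stays below 0x80, which is the whole point of HZ for 7-bit mail transport.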
- EUC
8-bit, non-modal double byte encoding.
Defines four code sets. Code Set 0 is ASCII; Code Sets 1 - 3 are language dependent.
Shift characters: only SS2 (0x8E) for Code Set 2 and SS3 (0x8F) for Code Set 3 are used. Code Set 1 needs no shift character because the MSB of each byte identifies it.
EUC-KR :
8 bit, Encodes KS X 1001 + KS X 1003
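EUC-KR is converted to 7-bit ISO 2022-KR (especially when sending email) by clearing the MSB of each double-byte character and wrapping the run in SO / SI.
e.g. A GA B -> 41 B0A1 42 -> ESC $ ) C A ^N 3021 ^O B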
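The "A GA B" example can be reproduced with a short conversion routine. This is a sketch under stated assumptions: `euc_kr_to_iso2022_kr` is a hypothetical helper, it emits the designator once at the start, and it assumes well-formed EUC-KR input (no truncated double-byte character).

```python
ESC_DESIGNATOR = b"\x1b$)C"   # ESC $ ) C : designate KS X 1001 to G1
SO, SI = b"\x0e", b"\x0f"     # shift out (Ctrl-N) / shift in (Ctrl-O)

def euc_kr_to_iso2022_kr(data):
    """Convert EUC-KR bytes to 7-bit ISO 2022-KR (hypothetical helper)."""
    out = bytearray(ESC_DESIGNATOR)
    shifted = False
    i = 0
    while i < len(data):
        b = data[i]
        if b >= 0xA1:                      # lead byte of a KS X 1001 character
            if not shifted:
                out += SO
                shifted = True
            out += bytes([b & 0x7F, data[i + 1] & 0x7F])
            i += 2
        else:                              # plain ASCII byte
            if shifted:
                out += SI
                shifted = False
            out.append(b)
            i += 1
    if shifted:
        out += SI
    return bytes(out)

euc = "A\uac00B".encode("euc_kr")          # 41 B0A1 42
seven_bit = euc_kr_to_iso2022_kr(euc)
print(euc.hex(), seven_bit)
```

Decoding `seven_bit` with Python's `iso2022_kr` codec should give back the original three characters, which is a convenient cross-check.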
EUC-CN:
8 bit, Encodes GB2312:80 in Code Set 1 (= CN-GB), no Code set 2 and 3.
EUC-TW: of the EUC encodings listed here, the only one that needs a shift character (SS2), which is why it is classed as a modal encoding above.
8 bit
ASCII in Code Set 0
CNS 11643-1992 Plane 1 in Code Set 1 (2byte)
CNS 11643-1992 Planes 1-7 in Code Set 2 (4 bytes): SS2 (0x8E), plane byte 0xA[1-7], row, cell
Code Set 3 is not used
- Big Five
8-bit, non-modal.
Correspond to CNS 11643-1992 planes 1 and 2.
MIME: CN-BIG5
- GB 18030:2000
Descendant of GB 2312 and GBK
Adds the character repertoire of Unicode 3.0
1-, 2-, or 4-byte encoding
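The 1-, 2-, and 4-byte forms of GB 18030 can be observed with Python's built-in `gb18030` codec. The sample characters below are my own choice for illustration, not taken from the standard.

```python
# Sketch: byte lengths produced by the gb18030 codec for a few code points.
samples = {
    "A": "ASCII letter",
    "\u4e2d": "hanzi also found in GB 2312",
    "\uac00": "hangul syllable GA, outside GBK",
    "\U00010000": "first code point beyond the BMP",
}

lengths = {ch: len(ch.encode("gb18030")) for ch in samples}

for ch, note in samples.items():
    print(f"U+{ord(ch):04X} {note}: {lengths[ch]} byte(s)")
```

Characters inherited from GB 2312 and GBK keep their two-byte codes, while everything added from Unicode falls into the four-byte area.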
4. Problem with legacy encodings
Transcoding issues: some characters are not encoded in GB 2312 but are encoded in GBK or CP936, yet most Chinese web pages declare GB2312 as their charset.
Too many variations of Big Five.
No international character repertoire.
Difficult to convert between encodings.
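The GB 2312 versus GBK mismatch is easy to reproduce. The sketch below uses a character of my own choosing (U+9F8D, a traditional form that GB 2312 does not contain); `encodable` is a hypothetical helper built on Python's standard codecs.

```python
def encodable(text, codec):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

ch = "\u9f8d"   # assumption for illustration: a hanzi outside GB 2312
for codec in ("gb2312", "gbk", "cp936", "gb18030"):
    print(codec, encodable(ch, codec))

# Bytes produced with GBK but labelled GB2312 fail under a strict decoder.
page = ch.encode("gbk")
try:
    page.decode("gb2312")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print("strict GB2312 decode succeeded:", strict_ok)
```

In practice this is why decoders often treat a declared `GB2312` label as GBK or a later superset rather than failing on such bytes.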
| UNICODE / ISO 10646 |
ISO 10646 (from ISO, the International Organization for Standardization) and Unicode are separate standards that are kept in step with each other.
Both define the same code ranges. The main difference is probably that Unicode gives a more detailed explanation of the character ranges and of the character-handling mechanisms.
ISO 646 = CJKV Roman (=US-ASCII, backslash (\) replaced with currency symbol = KS X 1003)
ISO 2022
ISO 10646-1:1993(Unicode) => KS X 1001-1:1995
ISO 10646-1(Unicode 1.5 -> hangul 6656) => KS X 1005-1 (Hangul(11172): ISO 10646-1 AMD 5)
Unicode 1.1 – 1.5 : 6656 -> 11172(Unicode 2.0) = ISO 10646-1 AMD 5
- UNICODE/ISO 10646
UCS-4 (UCS = Universal Multiple-Octet Coded Character Set):
0x00000000 ~ 0x7FFFFFFF (01111111 11111111 11111111 11111111)
G=00, P=00 ~ G=7f, P=ff (G = group, P = plane)
1 plane is addressed by 2 bytes (row, cell) = 65536 code points
Total 128 groups x 256 planes = 32768 planes
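The group / plane / row / cell layout is just the four octets of the 31-bit value. A minimal sketch, with `split_ucs4` as a hypothetical helper name:

```python
def split_ucs4(code_point):
    """Split a 31-bit UCS-4 value into (group, plane, row, cell) octets."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("outside the UCS-4 range")
    return (
        (code_point >> 24) & 0x7F,   # group: 7 bits, 00-7F
        (code_point >> 16) & 0xFF,   # plane: 00-FF
        (code_point >> 8) & 0xFF,    # row
        code_point & 0xFF,           # cell
    )

groups, planes_per_group, per_plane = 0x7F + 1, 0xFF + 1, 1 << 16

print(groups * planes_per_group)     # total number of planes
print(per_plane)                     # code points in one plane
print(split_ucs4(0xAC00))            # a BMP character: group 0, plane 0
print(split_ucs4(0x7FFFFFFF))        # last UCS-4 value
```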
UCS-2:
BMP (Basic Multilingual Plane): only one plane [G=00, P=00]
UTF (UCS Transformation Format) - raw UCS can't be used in a 7-bit environment because its bytes fall into the C0 (0x00-1f) and C1 (0x80-9f) control ranges. Therefore it does not follow ISO 2022.
C0 -> 000xxxxx, C1 -> 100xxxxx
Therefore, byte values of the form x00xxxxx cannot be used.
UTF-1 and UTF-7 avoid C0 and C1 bytes, but UTF-16 and UTF-8 use them; therefore Base64 or quoted-printable (QP) has to be applied when transmitting data in those formats over a 7-bit channel.
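The 7-bit constraint can be made concrete with Python's standard codecs and the `base64` and `quopri` modules. This is a sketch only; `is_7bit` is a hypothetical helper and the sample string is my own.

```python
import base64
import quopri

text = "A\uac00B"                     # "A", hangul GA, "B"
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")
utf7 = text.encode("utf-7")

def is_7bit(data):
    """True if no byte has its high bit set."""
    return all(b < 0x80 for b in data)

# UTF-8 output contains a byte in the C1 range; UTF-16 contains 0x00 (C0).
print([hex(b) for b in utf8 if 0x80 <= b <= 0x9F])
print(0x00 in utf16)

# UTF-7 is 7-bit by design; UTF-8 needs a transfer encoding on 7-bit channels.
b64 = base64.b64encode(utf8)
qp = quopri.encodestring(utf8)
print(is_7bit(utf8), is_7bit(utf7), is_7bit(b64), is_7bit(qp))
```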
- UTF-16 and UTF-8
UTF-16: BMP + surrogate area = (0x00000000 ~ 0x0010FFFF -> 17 planes)
BMP (G=00, P=00) + surrogates [G=00, P=01 ~ 10]
High Surrogate: 0xd800 - dbff (1024 code points)
Low Surrogate: 0xdc00 - dfff (1024 code points)
No characters are defined in the two-byte range of the surrogate areas; a high / low pair addresses 1024 x 1024 = 1,048,576 code points.
e.g. The range of the BMP is 0x0000 - 0xffff. How about 0xffff + 0x0001 = ?
d800 dc00 => 0x00010000
16 planes (1,048,576) + 1 plane (65,536) = 1,114,112
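The surrogate arithmetic can be sketched as follows; `to_surrogates` and `from_surrogates` are hypothetical helper names, and the result is cross-checked against Python's own `utf-16-be` codec.

```python
def to_surrogates(code_point):
    """Split a code point in 0x10000-0x10FFFF into a (high, low) pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points beyond the BMP use surrogates")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

def from_surrogates(high, low):
    """Recombine a surrogate pair into a code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print([hex(u) for u in to_surrogates(0x10000)])   # first code point after the BMP
print(hex(from_surrogates(0xDBFF, 0xDFFF)))       # last pair
print(1024 * 1024 + 0x10000)                      # 16 planes + the BMP
```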
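UTF-8: 1 - 6 bytes (as originally defined; RFC 3629 later restricted it to 1 - 4 bytes to match the UTF-16 range)
US-ASCII: 1 byte
UCS-2 (rest of the BMP): 2 to 3 bytes
UCS-4 (beyond the BMP): 4 to 6 bytes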
e.g. Korean character 'GA' (U+AC00)
AC00 -> 1010 1100 0000 0000 -> 1010 110000 000000
prefix 1110 to 1010 -> 11101010
prefix 10 to 110000 -> 10110000
prefix 10 to 000000 -> 10000000
Therefore, 11101010 10110000 10000000
= EA B0 80
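The same bit regrouping, written out as code for the three-byte case only. A minimal sketch; `utf8_three_bytes` is a hypothetical helper name.

```python
def utf8_three_bytes(code_point):
    """Encode a code point in 0x0800-0xFFFF as three UTF-8 bytes."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("the three-byte form covers 0x0800-0xFFFF only")
    return bytes([
        0b11100000 | (code_point >> 12),            # 1110 + top 4 bits
        0b10000000 | ((code_point >> 6) & 0b111111),  # 10 + middle 6 bits
        0b10000000 | (code_point & 0b111111),       # 10 + low 6 bits
    ])

ga = utf8_three_bytes(0xAC00)
print(" ".join(f"{b:08b}" for b in ga), ga.hex())
```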
UCS 4 to UTF-8
0000 0000 – 0000 007F:
00000000 00000000 00000000 0AAAAAAA ->
0aaaaaaa
0000 0080 – 0000 07FF:
00000000 00000000 00000AAA AABBBBBB ->
110aaaaa 10bbbbbb
0000 0800 – 0000 FFFF: ——> BMP
00000000 00000000 AAAABBBB BBCCCCCC ->
1110aaaa 10bbbbbb 10cccccc
0001 0000 – 001F FFFF:
00000000 000AAABB BBBBCCCC CCDDDDDD ->
11110aaa 10bbbbbb 10cccccc 10dddddd
0020 0000 – 03FF FFFF:
000000AA BBBBBBCC CCCCDDDD DDEEEEEE ->
111110aa 10bbbbbb 10cccccc 10dddddd 10eeeeee
0400 0000 – 7FFF FFFF:
0ABBBBBB CCCCCCDD DDDDEEEE EEFFFFFF ->
1111110a 10bbbbbb 10cccccc 10dddddd 10eeeeee 10ffffff
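The table above translates fairly directly into a table-driven encoder. This is a sketch of the original 1 - 6 byte scheme shown here, not of modern UTF-8 (which stops at four bytes and rejects surrogates); `RANGES` and `ucs4_to_utf8` are hypothetical names.

```python
# (upper bound of the range, lead-byte marker, number of continuation bytes)
RANGES = [
    (0x7F, 0x00, 0),
    (0x7FF, 0xC0, 1),
    (0xFFFF, 0xE0, 2),
    (0x1FFFFF, 0xF0, 3),
    (0x3FFFFFF, 0xF8, 4),
    (0x7FFFFFFF, 0xFC, 5),
]

def ucs4_to_utf8(code_point):
    """Encode a UCS-4 value with the 1-6 byte layout from the table."""
    if code_point < 0:
        raise ValueError("negative code point")
    for upper, marker, n_cont in RANGES:
        if code_point <= upper:
            # Lead byte: marker bits plus the bits left over above the tail.
            lead = marker | (code_point >> (6 * n_cont))
            # Continuation bytes: 10xxxxxx, six bits each, high bits first.
            tail = [0x80 | ((code_point >> (6 * i)) & 0x3F)
                    for i in reversed(range(n_cont))]
            return bytes([lead] + tail)
    raise ValueError("outside the UCS-4 range")

for cp in (0x41, 0x80, 0xAC00, 0x10000, 0x200000, 0x7FFFFFFF):
    print(f"{cp:08X} -> {ucs4_to_utf8(cp).hex()}")
```

For code points up to 0x10FFFF (surrogates aside) the output should agree with Python's built-in `utf-8` codec, which makes a handy sanity check.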