Vincent Park » Legacy encodings vs. Unicode

Legacy encodings vs. Unicode

LEGACY ENCODINGS

1. Character Set vs. Encoding

- Character set: a repertoire ("bucket") of characters, e.g. ASCII: 94 printable characters plus space and control characters (extended by ISO 8859-1:1998).

- Encoding: specifies how the code points of a coded character set are represented and transmitted on a computer; in short, the mapping of each character to a numeric value.

- Asian character encodings usually use 2 bytes (row 256 x cell 256); typically a 94×94 matrix within that space holds the printable characters.

- A mixed one- and two-byte character stream efficiently represents a mixture of English and Asian text.

e.g. EUC-KR (0xA1A1 ~ 0xFEFE) = 0x5E x 0x5E => 94 x 94 => 8836 code points, which covers KS C 5601 = 4888 (hanja) + 2350 (hangul) + 986 (symbols) = 8224 characters.
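The arithmetic in the EUC-KR example can be checked with a few lines of Python. This is only a sketch of the counting; the variable names are made up for illustration.

```python
# Count the two-byte code points available to EUC-KR:
# lead byte and trail byte both run from 0xA1 to 0xFE.
lead_bytes = range(0xA1, 0xFE + 1)
trail_bytes = range(0xA1, 0xFE + 1)
code_points = len(lead_bytes) * len(trail_bytes)

# KS C 5601 repertoire as listed above: hanja + hangul + symbols.
ksc5601_chars = 4888 + 2350 + 986

# Cells of the 94 x 94 matrix that KS C 5601 leaves unassigned.
unused = code_points - ksc5601_chars

print(code_points, ksc5601_chars, unused)  # 8836 8224 612
```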

2. Encoding Types

- MODAL (requires escape sequences for switching between character sets, 7bits): ISO 2022, UTF-7

- NON-MODAL (makes use of numeric values of bytes in order to decide when to switch, 8bits): Big Five, EUC, GBK, Johab, Shift-JIS, UTF-8, UTF-16 and so forth.

- FIXED WIDTH: ASCII, UCS-2, UCS-4 (for internal text processing)
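As a rough illustration of the three types, the sketch below encodes the same short string with Python's built-in `iso2022_kr`, `euc_kr` and `utf-16-be` codecs. Using UTF-16BE as a stand-in for UCS-2 is an assumption that only holds for BMP characters; the helper name `is_seven_bit` is hypothetical.

```python
text = "A\uac00B"  # "A", the hangul syllable GA (U+AC00), "B"

modal = text.encode("iso2022_kr")        # 7-bit, switches sets with escape/shift bytes
non_modal = text.encode("euc_kr")        # 8-bit, the byte value itself marks a 2-byte char
fixed_width = text.encode("utf-16-be")   # 2 bytes for every BMP character, like UCS-2


def is_seven_bit(data: bytes) -> bool:
    """True when no byte has its eighth bit set."""
    return all(b < 0x80 for b in data)


print("modal      :", modal, "7-bit:", is_seven_bit(modal))
print("non-modal  :", non_modal.hex(" "), "7-bit:", is_seven_bit(non_modal))
print("fixed width:", fixed_width.hex(" "), "bytes per char:", len(fixed_width) // len(text))
```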

- Simplified Chinese

Charsets: GB 2312-80

Encodings: HZ (7-bit, ASCII + GB 2312-80, MIME: HZ-GB-2312), CN-GB (8-bit, ASCII + GB 2312-80, MIME: CN-GB), EUC-CN (8-bit, GB 2312-80), CP936

- Traditional Chinese

Charsets: GB 12345-90, CNS 11643:1992 (Planes 1 and 2 are identical to Big Five), Big Five (8-bit), Big Five Plus, Eten, HKSCS

Encodings: EUC-TW (8-bit, modal encoding because shift chars are used), Big Five, CP950

- Korean

Charsets: KS X 1001:1992 (= KS C 5601), KS X 1003:1993 (KS Roman), KS X 1005:1995 (Unicode 2.0)

Encodings: EUC-KR (= KS X 1001 + KS X 1003), CP949, ISO 2022-KR, Johab, CP1361

- Unicode and GB 18030

Charsets: Unicode/ISO 10646/GB 13000-1992, GB 18030-2000

Encodings: Unicode/ISO 10646, GB 18030-2000, ISO 2022-CN, ISO 2022-CN-EXT

Big Five and GB 18030 each define both a character set and an encoding.

KS X 1001 relies on ISO 2022 as its byte-encoding standard.

3. Escape Sequence

- ASCII and double-byte characters can be distinguished by use of the shift function.

Reference: IRCCSS (International Register Of Coded Character Sets To Be Used With Escape Sequences): http://www.itscj.ipsj.or.jp/ISO-IR/

- ISO 2022

Encoding 94- and 96-character sets with escape sequences in ISO 2022 requires a two-stage mapping:

1

Graphic character sets (slots: G0, G1, G2, G3). Initially, G0 contains ISO 646 and G1-G3 are not defined.

This can be changed using designators.

Designators:

esc $ ( ID    designate multi-byte 94-character set into G0

esc $ ) ID    designate multi-byte 94-character set into G1

esc $ * ID    designate multi-byte 94-character set into G2

esc $ + ID    designate multi-byte 94-character set into G3

Designator IDs for multi-byte 94-character sets are:

@ : JIS C 6226-1978

A : GB 2312-1980

B : JIS X0208-1990

C : KSC 5601-1987

D : JIS X0212-1990

E : GB 2312-1980 plus GB 8565-1989

G : CNS 11643-1986 level 1

H : CNS 11643-1986 level 2

I : CNS 11643-1992 plane 3

J : CNS 11643-1992 plane 4

K : CNS 11643-1992 plane 5

L : CNS 11643-1992 plane 6

M : CNS 11643-1992 plane 7
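A designator can be assembled mechanically from the two tables above. The following is a minimal sketch that follows those tables literally; the function name `designate` and the dictionary names are hypothetical, and only a few of the IDs are included.

```python
ESC = b"\x1b"

# Intermediate byte that selects the target slot (from the designator list above).
SLOT_BYTE = {"G0": b"(", "G1": b")", "G2": b"*", "G3": b"+"}

# Final byte (ID) of some multi-byte 94-character sets (from the ID list above).
CHARSET_ID = {
    "GB 2312-1980": b"A",
    "JIS X0208-1990": b"B",
    "KSC 5601-1987": b"C",
    "CNS 11643-1986 level 1": b"G",
    "CNS 11643-1986 level 2": b"H",
}


def designate(charset: str, slot: str) -> bytes:
    """Build 'esc $ <slot byte> <ID>' for a multi-byte 94-character set."""
    return ESC + b"$" + SLOT_BYTE[slot] + CHARSET_ID[charset]


print(designate("KSC 5601-1987", "G1"))           # b'\x1b$)C'
print(designate("CNS 11643-1986 level 2", "G2"))  # b'\x1b$*H'
```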

2

The slots above are assigned to two byte ranges:

GL (Graphic Left): bytes with the eighth bit turned off = 0x20-0x7F for 96-character sets, 0x21-0x7E for 94-character sets (space and DEL stay defined)

GR (Graphic Right): bytes with the eighth bit turned on = 0xA0-0xFF for 96-character sets, 0xA1-0xFE for 94-character sets

Initially G0 is assigned to the GL range and G1 to the GR range.

These assignments may be changed with shift sequences:

o A single shift changes the assignment for a single character (one or more bytes).

o A locking shift changes the assignment until another locking shift replaces it.

[ISO 2022 shift sequences]

Shifts for use with 7-bit encodings (only using GL)

The GL range may be switched between G0-3 with the locking shift sequences:

si       (called LS0 in 8-bit form) lock G0 into GL

so       (called LS1 in 8-bit form) lock G1 into GL

esc n    LS2: lock G2 into GL

esc o    LS3: lock G3 into GL

A single character from G2 or G3 may be placed in GL by preceding it with

esc N    SS2: shift single character from G2 into GL

esc O    SS3: shift single character from G3 into GL

Shifts for use with 8-bit encodings

The GR range may be switched between G1-3 with the locking shift sequences:

esc ~    LS1R: lock G1 into GR

esc }    LS2R: lock G2 into GR

esc |    LS3R: lock G3 into GR

In an 8-bit environment, the following single shifts are used:

8E       SS2: shift single character from G2

8F       SS3: shift single character from G3

The high bit on the following character is ignored.

- ISO 2022-CN

7-bit, multibyte encoding

- Includes four charsets: ASCII, GB 2312-80, CNS 11643-1992 Plane 1, CNS 11643-1992 Plane 2

Designators: (must appear on each line containing characters from a double-byte character set)

ESC $ ) A : GB 2312-80
ESC $ ) G : CNS 11643-1992 Plane 1
ESC $ * H : CNS 11643-1992 Plane 2
SO (0x0E = Ctrl-N): to double-byte mode (e.g. KS X 1001, GB 2312, CNS 11643-1992 Plane 1)
SI (0x0F = Ctrl-O): to single-byte mode (e.g. ASCII)
SS2 (single shift) 0x1B 0x4E: ISO 2022-CN, CNS 11643-1992 Plane 2; double-byte mode for the next two bytes only
SS3 0x1B 0x4F: CNS 11643-1992 Planes 3-7
NOTE: ESC = 0x1B

e.g. [GB 2312 characters] and [CNS 11643-1992 Plane 1 characters]

0x1B $ ) A 0x0E [GB 2312 characters] 0x0F and 0x1B $ ) G 0x0E [CNS 11643-1992 Plane 1 characters] 0x0F
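The GB 2312 half of this example can be reproduced by hand. Python ships no ISO 2022-CN codec, so the sketch below takes the EUC-CN bytes from the built-in `gb2312` codec and clears the eighth bit; the function name `gb2312_run_to_iso2022_cn` is hypothetical, and the input is assumed to contain only GB 2312 double-byte characters.

```python
ESC, SO, SI = b"\x1b", b"\x0e", b"\x0f"


def gb2312_run_to_iso2022_cn(text: str) -> bytes:
    """Wrap a run of GB 2312 characters as: ESC $ ) A, SO, 7-bit row/cell bytes, SI."""
    euc_cn = text.encode("gb2312")                # EUC-CN form: both bytes have the high bit set
    seven_bit = bytes(b & 0x7F for b in euc_cn)   # same row/cell values, moved into GL
    return ESC + b"$)A" + SO + seven_bit + SI


sample = "\u4e2d\u6587"  # the two characters of "Chinese (language)", D6D0 CEC4 in EUC-CN
print(gb2312_run_to_iso2022_cn(sample))  # b'\x1b$)A\x0eVPND\x0f'
```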

- ISO 2022-KR

Designator: ESC $ ) C (0x1B 0x24 0x29 0x43) -> G1
SO (0x0E = Ctrl-N): shift out to double-byte mode (KS X 1001)
SI (0x0F = Ctrl-O): shift in to single-byte mode (e.g. ASCII)
SS2, SS3: not available in KS X 1001

- HZ

~{ switches to GB mode

~} switches out of GB mode (back to ASCII)

~~ represents a literal ~ in ASCII mode
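The three HZ rules are enough for a toy encoder. This is a sketch under the assumption that every non-ASCII character in the input exists in GB 2312; `hz_encode` is a hypothetical helper, not the standard library's own `hz` codec (which is only used here to decode the result).

```python
def hz_encode(text: str) -> bytes:
    """Encode text as HZ: ASCII as-is, GB 2312 runs wrapped in ~{ ... ~}, '~' doubled."""
    out = bytearray()
    in_gb = False
    for ch in text:
        if ord(ch) < 0x80:
            if in_gb:                 # leave GB mode before emitting ASCII
                out += b"~}"
                in_gb = False
            out += b"~~" if ch == "~" else ch.encode("ascii")
        else:
            if not in_gb:             # enter GB mode before the first 2-byte char
                out += b"~{"
                in_gb = True
            # GB 2312 row/cell bytes with the eighth bit cleared
            out += bytes(b & 0x7F for b in ch.encode("gb2312"))
    if in_gb:
        out += b"~}"
    return bytes(out)


encoded = hz_encode("a\u4e2d\u6587b")
print(encoded)                 # b'a~{VPND~}b'
print(encoded.decode("hz"))    # round trip through Python's built-in HZ decoder
```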

- EUC

8-bit, non-modal double-byte encoding.

Defines four code sets. Code Set 0 is ASCII; Code Sets 1-3 are language dependent.

Shift characters: only SS2 (for Code Set 2) and SS3 (for Code Set 3) are used. Code Set 1 needs no shift character because the most significant bit (MSB) already marks its bytes.
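In other words, the first byte of a character is enough to tell which code set it belongs to. A minimal sketch, assuming the usual EUC byte ranges (0xA1-0xFE for Code Set 1, 0x8E for SS2, 0x8F for SS3); the function name `euc_code_set` is hypothetical.

```python
SS2, SS3 = 0x8E, 0x8F


def euc_code_set(lead: int) -> int:
    """Return the EUC code set (0-3) selected by the first byte of a character."""
    if lead < 0x80:
        return 0            # ASCII: MSB is clear
    if lead == SS2:
        return 2            # single shift 2 announces Code Set 2
    if lead == SS3:
        return 3            # single shift 3 announces Code Set 3
    if 0xA1 <= lead <= 0xFE:
        return 1            # MSB set, no shift character needed
    raise ValueError(f"0x{lead:02X} cannot start an EUC character")


for byte in (0x41, 0xB0, 0x8E, 0x8F):
    print(f"0x{byte:02X} -> Code Set {euc_code_set(byte)}")
```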

EUC-KR :

8-bit, encodes KS X 1001 + KS X 1003

EUC-KR is converted to 7-bit ISO 2022-KR when needed (especially when sending email)

e.g. A GA B -> 41 B0A1 42 -> ESC $ ) C A ^N 3021 ^O B
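The conversion in this example amounts to clearing the eighth bit of each double-byte character and bracketing the run with SO/SI after a single designator. The sketch below does that by hand; `euc_kr_to_iso2022_kr` is a hypothetical helper and assumes well-formed EUC-KR input.

```python
DESIGNATE_KSX1001 = b"\x1b$)C"   # ESC $ ) C
SO, SI = b"\x0e", b"\x0f"        # ^N, ^O


def euc_kr_to_iso2022_kr(data: bytes) -> bytes:
    """Convert well-formed EUC-KR bytes to 7-bit ISO 2022-KR."""
    out = bytearray(DESIGNATE_KSX1001)
    in_double = False
    i = 0
    while i < len(data):
        if data[i] >= 0xA1:                      # lead byte of a KS X 1001 character
            if not in_double:
                out += SO
                in_double = True
            out.append(data[i] & 0x7F)           # e.g. 0xB0 -> 0x30
            out.append(data[i + 1] & 0x7F)       # e.g. 0xA1 -> 0x21
            i += 2
        else:                                    # single-byte (ASCII) character
            if in_double:
                out += SI
                in_double = False
            out.append(data[i])
            i += 1
    if in_double:
        out += SI
    return bytes(out)


euc = "A\uac00B".encode("euc_kr")                # 41 B0A1 42
print(euc.hex(" "))
print(euc_kr_to_iso2022_kr(euc))                 # b'\x1b$)CA\x0e0!\x0fB'
```

The bytes 0x30 0x21 print as `0!`, which is the `3021` of the example above.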

EUC-CN:

8 bit, Encodes GB2312:80 in Code Set 1 (= CN-GB), no Code set 2 and 3.

EUC-TW: the only EUC encoding covered here that needs a shift character; this is why it is treated as a modal encoding.

8-bit

ASCII in Code Set 0

CNS 11643-1992 Plane 1 in Code Set 1 (2 bytes)

CNS 11643-1992 Planes 1-7 in Code Set 2 (4 bytes): SS2 (0x8E), plane byte 0xA1-0xA7, row, cell

Code Set 3 is not used
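The four-byte Code Set 2 form can be built directly from a plane, row and cell number. Python has no built-in EUC-TW codec, so this is a hand-written sketch; the function name `euc_tw_code_set_2` is hypothetical, and rows and cells are assumed to be numbered 1-94.

```python
SS2 = 0x8E


def euc_tw_code_set_2(plane: int, row: int, cell: int) -> bytes:
    """Build the 4-byte EUC-TW form: SS2, 0xA0 + plane, 0xA0 + row, 0xA0 + cell."""
    if not 1 <= plane <= 7:
        raise ValueError("plane must be 1-7")
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be 1-94")
    return bytes([SS2, 0xA0 + plane, 0xA0 + row, 0xA0 + cell])


# First cell of plane 2 and last cell of plane 7.
print(euc_tw_code_set_2(2, 1, 1).hex(" "))    # 8e a2 a1 a1
print(euc_tw_code_set_2(7, 94, 94).hex(" "))  # 8e a7 fe fe
```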

- Big Five

8-bit, non-modal.

Corresponds to CNS 11643-1992 planes 1 and 2.

MIME: CN-BIG5

- GB 18030:2000

Descendant of GB 2312 and GBK

Adds the character repertoire of Unicode 3.0

1-, 2-, or 4-byte encoding
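The three lengths are easy to observe with Python's built-in `gb18030` codec. The sample characters are chosen for illustration only: an ASCII letter, a hanzi that is already in GB 2312, and a character outside the BMP.

```python
samples = {
    "ASCII letter": "A",
    "GB 2312 hanzi": "\u4e2d",
    "non-BMP character": "\U0001F600",
}

# Number of bytes each sample needs in GB 18030.
lengths = {name: len(ch.encode("gb18030")) for name, ch in samples.items()}

for name, n in lengths.items():
    print(f"{name}: {n} byte(s)")
```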

4. Problem with legacy encodings

Transcoding issues: some characters are not encoded in GB 2312 but are encoded in GBK or CP936, yet most Chinese web pages specify GB2312 as their charset.

Too many variations of Big Five

No international character repertoire

Difficult to convert between encodings
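The GB 2312 vs. GBK mismatch can be reproduced with Python's built-in codecs. In this sketch the traditional character U+9AD4 stands in for "a character that GBK has and GB 2312 lacks"; `can_encode` is a hypothetical helper.

```python
def can_encode(text: str, codec: str) -> bool:
    """True when every character of text exists in the given codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


traditional = "\u9ad4"   # traditional form of "body"
simplified = "\u4f53"    # simplified form of the same character

for codec in ("gb2312", "gbk"):
    print(codec, can_encode(simplified, codec), can_encode(traditional, codec))
```

A page labelled GB2312 that actually contains such characters only decodes correctly if the reader treats the label as GBK or CP936.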

UNICODE / ISO 10646

ISO 10646 (from ISO, the International Organization for Standardization) and Unicode are two separate standards that describe the same thing.

Both define the same code ranges. The main difference is that Unicode provides more detailed explanations of the character ranges and of character handling mechanisms.

ISO 646 = CJKV Roman (= US-ASCII with the backslash (\) replaced by a currency symbol, e.g. KS X 1003)

ISO 2022

ISO 10646-1:1993 (Unicode) => KS X 1005-1:1995

ISO 10646-1(Unicode 1.5 -> hangul 6656) => KS X 1005-1 (Hangul(11172): ISO 10646-1 AMD 5)

Unicode 1.1 – 1.5 : 6656 -> 11172(Unicode 2.0) = ISO 10646-1 AMD 5

- UNICODE/ISO 10646

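UCS-4 (Universal Multiple-Octet Coded Character Set, 4 octets):

0x00000000 ~ 0x7FFFFFFF (01111111 11111111 11111111 11111111)

G=00, P=00 ~ G=7F, P=FF (G = group octet, P = plane octet)

1 plane is addressed by 2 bytes (row and cell) = 65536 code points

Total: 128 groups x 256 planes = 32768 planes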
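The group/plane/row/cell layout can be made concrete by splitting a code point into its four octets. A small sketch; the function name `ucs4_octets` is hypothetical.

```python
def ucs4_octets(code_point: int) -> tuple:
    """Split a UCS-4 code point into (group, plane, row, cell)."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("UCS-4 code points are 31 bits wide")
    group = (code_point >> 24) & 0x7F
    plane = (code_point >> 16) & 0xFF
    row = (code_point >> 8) & 0xFF
    cell = code_point & 0xFF
    return group, plane, row, cell


total_planes = 128 * 256          # 128 groups, 256 planes each
points_per_plane = 256 * 256      # 256 rows, 256 cells each

print(ucs4_octets(0xAC00))        # (0, 0, 172, 0): group 00, plane 00 is the BMP
print(total_planes, points_per_plane)  # 32768 65536
```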

UCS-2:

BMP (Basic Multilingual Plane): only one plane [G=00, P=00]

UTF (UCS Transformation Format): UCS cannot be used as-is in a 7-bit environment, because its octets can fall in the C0 (0x00-0x1F) and C1 (0x80-0x9F) control ranges. It therefore does not follow ISO 2022.

C0 -> 000xxxxx, C1 -> 100xxxxx

Therefore, byte values of the form x00xxxxx cannot be used.

UTF-1 and UTF-7 avoid the C0 and C1 byte values, but UTF-16 and UTF-8 use them; Base64 or quoted-printable (QP) therefore has to be used when transmitting data in those formats over such channels.
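Which transformation formats produce bytes in the C0/C1 ranges can be checked on a single character. The sketch below uses the hangul syllable U+AC00 with Python's built-in `utf-7`, `utf-8` and `utf-16-be` codecs; `control_range_bytes` is a hypothetical helper.

```python
def control_range_bytes(data: bytes) -> list:
    """Return the bytes of data that fall in C0 (0x00-0x1F) or C1 (0x80-0x9F)."""
    return [b for b in data if b <= 0x1F or 0x80 <= b <= 0x9F]


ga = "\uac00"  # hangul syllable GA

for codec in ("utf-7", "utf-8", "utf-16-be"):
    encoded = ga.encode(codec)
    hits = [f"0x{b:02X}" for b in control_range_bytes(encoded)]
    print(f"{codec:10} {encoded.hex(' '):10} C0/C1 bytes: {hits}")
```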

- UTF-16 and UTF-8

UTF-16: BMP + surrogate area = (0x00000000 ~ 0x0010FFFF -> 17 planes)

BMP [G=00, P=00] + planes reached through surrogates [G=00, P=01 ~ P=10]

High surrogate: 0xD800 - 0xDBFF (1024 code points)

Low surrogate: 0xDC00 - 0xDFFF (1024 code points)

No characters are assigned to the two-byte values in the surrogate areas; they are only used in pairs.

e.g. The range of the BMP is 0x0000 - 0xFFFF. How about 0xFFFF + 0x0001 = ?

D800 DC00 => 0x00010000

16 planes (1024 x 1024 = 1,048,576) + 1 plane (65,536) = 1,114,112
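The surrogate arithmetic can be sketched as follows: subtract 0x10000, then put the upper 10 bits on the high surrogate and the lower 10 bits on the low surrogate. The function name `to_surrogates` is hypothetical; the result is compared with Python's own `utf-16-be` codec.

```python
def to_surrogates(code_point: int) -> tuple:
    """Return the (high, low) UTF-16 surrogate pair for a code point above the BMP."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points 0x10000-0x10FFFF need surrogates")
    offset = code_point - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)           # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)          # lower 10 bits
    return high, low


high, low = to_surrogates(0x10000)
print(f"{high:04X} {low:04X}")               # D800 DC00

# Code points reachable in total: 1024 x 1024 surrogate pairs plus the BMP.
reachable = 1024 * 1024 + 0x10000
print(reachable)                             # 1114112
```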

UTF-8: 1 - 6 bytes (as originally defined; RFC 3629 later limited it to 4 bytes, up to U+10FFFF)

US-ASCII: 1 byte

UCS-2: 2 to 3 bytes

UCS-4: 4 to 6 bytes

e.g. Korean character ‘GA’ (AC00)

AC00 -> 1010 1100 0000 0000 -> 1010 110000 000000

prefix 1110 -> 1110 1010

prefix 10 -> 10 110000

prefix 10 -> 10 000000

Therefore, 11101010 10110000 10000000

= EA B0 80
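The same three steps can be written with bit operations. This is a sketch for the three-byte case only, using U+AC00 as in the example; the result is compared with Python's built-in `utf-8` codec.

```python
code_point = 0xAC00  # hangul syllable GA

# Split the 16 bits into 4 + 6 + 6 and add the UTF-8 marker bits.
byte1 = 0b11100000 | (code_point >> 12)            # 1110 + top 4 bits
byte2 = 0b10000000 | ((code_point >> 6) & 0b111111)  # 10 + middle 6 bits
byte3 = 0b10000000 | (code_point & 0b111111)         # 10 + low 6 bits

encoded = bytes([byte1, byte2, byte3])
print(f"{byte1:08b} {byte2:08b} {byte3:08b}")  # 11101010 10110000 10000000
print(encoded.hex())                           # eab080
```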

UCS 4 to UTF-8

0000 0000 – 0000 007F:

00000000 00000000 00000000 0AAAAAAA ->

0aaaaaaa

0000 0080 – 0000 07FF:

00000000 00000000 00000AAA AABBBBBB ->

110aaaaa 10bbbbbb

0000 0800 – 0000 FFFF: ——> BMP

00000000 00000000 AAAABBBB BBCCCCCC ->

1110aaaa 10bbbbbb 10cccccc

0001 0000 – 001F FFFF:

00000000 000AAABB BBBBCCCC CCDDDDDD ->

11110aaa 10bbbbbb 10cccccc 10dddddd

0020 0000 – 03FF FFFF:

000000AA BBBBBBCC CCCCDDDD DDEEEEEE ->

111110aa 10bbbbbb 10cccccc 10dddddd 10eeeeee

0400 0000 – 7FFF FFFF:

0ABBBBBB CCCCCCDD DDDDEEEE EEFFFFFF ->

1111110a 10bbbbbb 10cccccc 10dddddd 10eeeeee 10ffffff
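The six rows of this table can be folded into one routine. The sketch below implements the original 1-6 byte scheme shown above, not the 4-byte limit that current UTF-8 uses, so values above 0x10FFFF have no counterpart in Python's own codec; the function name `ucs4_to_utf8` is hypothetical.

```python
# (largest code point, total bytes, lead-byte marker) for the multi-byte rows of the table
ROWS = (
    (0x7FF, 2, 0xC0),
    (0xFFFF, 3, 0xE0),
    (0x1FFFFF, 4, 0xF0),
    (0x3FFFFFF, 5, 0xF8),
    (0x7FFFFFFF, 6, 0xFC),
)


def ucs4_to_utf8(code_point: int) -> bytes:
    """Encode a 31-bit UCS-4 value with the original 1-6 byte UTF-8 scheme."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("UCS-4 code points are 31 bits wide")
    if code_point < 0x80:
        return bytes([code_point])               # 0aaaaaaa
    for limit, length, marker in ROWS:
        if code_point <= limit:
            break
    # Lead byte: marker bits plus whatever is left above the continuation bytes.
    lead = marker | (code_point >> (6 * (length - 1)))
    # Continuation bytes: 10xxxxxx, six bits each, most significant first.
    tail = [0x80 | ((code_point >> (6 * i)) & 0x3F) for i in range(length - 2, -1, -1)]
    return bytes([lead] + tail)


print(ucs4_to_utf8(0xAC00).hex(" "))      # ea b0 80
print(ucs4_to_utf8(0x7FFFFFFF).hex(" "))  # fd bf bf bf bf bf
```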

