charset 1 encoding 1 study 4

Character encoding

http://nozer0.github.io/en/technology/system/character-encoding/
2014-06-08 by nozer0

Last time, we talk about numbers in computer, and we will take a look how characters stored in computer this time.

Introduce

Here are some terminologies in this article, such as ‘character set’, ‘character encoding system’, ‘QuWei code’ and ‘code point’, we can have an overall concept from the diagram below first, and simple explanations for each will be referred next.

	                 CharSet                     Computer
	              +-----------+                  Encoding
	              | a, b, ... |   encode    +-----------------+
	              | |  |   |  | ----------> | 0x61, 0x62, ... |
	code point -->| 97 98 ... |             +-----------------+
	              +-----------+

First, let us check the concepts of Character set and Encoding. Always, people talk these two interchangeably, but sometimes, they have their own meanings for each other.

For most situations in Computer Science, these 2 things are same, like ‘ASCII’, ‘ISO8859’; sometimes are not, for example, ‘EUC-CN’ is an encoding way of character set ‘GB2312’.

SBCS

Single-byte character system, we can know from name that these character system only need 7 or 8 bits to represent characters, in other words, maximum 128 or 256 characters in these systems.

We will introduce the three of them, ‘ASCII’ and ‘ISO 8859’, both are most popular worldwide, and the last special one ‘JIS X 0201’ for Japanese.

ASCII

‘American Standard Code for Information Interchange’ system, also AKA ‘ISO646-US’ or ‘cp367’ from IBM. For most people, it should be the earliest encoding bring in.

ASCII bases on the English alphabet, includes 96 printable characters(26 characters, 0-9 numbers, some basic punctuation symbols and blank space), 32 control codes, 128 characters totally, which can be represented by 7 bits. Only 1 byte is enough to represent almost everything for English people, and here is one more bit left which can be used for parity check, simple and small, isn’t it?

Full ASCII table.

Hex 0 1 2 3 4 5 6 7
0 NUL (Null character) DLE (Data Link Escape) SP 0 @ P ` p
1 SOH (Start of Header) DC1 (Device Control 1) ! 1 A Q a q
2 STX (Start of Text) DC2 (Device Control 2) 2 B R b r
3 ETX (End of Text) DC3 (Device Control 3) # 3 C S c s
4 EOT (End of Transmission) DC4 (Device Control 4) $ 4 D T d t
5 ENQ (Enquiry) NAK (Negative Acknowledgment) % 5 E U e u
6 ACK (Acknowledgment) SYN (Synchronous idle) & 6 F V f v
7 BEL (Bell) ETB (End of Transmission Block) 7 G W g w
8 BS (Backspace, ‘\b’) CAN (Cancel) ( 8 H X h x
9 HT (Horizontal Tab, ‘\t’) EM (End of Medium) ) 9 I Y i y
A LF (Line feed, ‘\n’) SUB (Substitute) * : J Z j z
B VT (Vertical Tab, ‘\v’) ESC (Escape) + ; K [ k {
C FF (Form feed, ‘\f’) FS (File Separator) , < L \ l |
D CR (Carriage return, ‘\r’) GS (Group Separator) - = M ] m }
E SO (Shift Out) RS (Record Separator) . > N ^ n ~
F SI (Shift In) US (Unit Separator) / ? O _ o DEL

ISO/IEC 8859

ASCII is good enough for US, however, for other Latin language countries, they need more additional symbols to be represented, so the extended encoding system using one more bit than ASCII was defined. And it has several parts for different locales, such as ‘ISO-8859-1’ for Western European, ‘ISO-8859-2’ for Central European, etc.

All ‘ISO-8859’ variable parts are use 8 bits, and the beginning 128 characters are the same as ASCII, this is called ASCII compliant character set. Actually, ‘ISO 8859’ only defines the bytes range ‘0xA0~0xFF’, and ‘ISO 6429’ defines the C0 and C1 control characters(0x00~0x1F, 0x80~0x9F), more accurately, ISO-8859 = ASCII + ISO 6429 + ISO 8859, be aware of the dash symbol ‘-‘.

JIS X 0201

This is the single byte character set early used in Japan, different than normal ASCII, it replaces ‘' to ‘¥’ on ‘0x5C’, that’s why we always see such path ‘C:¥windows¥system’ in Japanese Environment, and also ‘~’ to ‘‾’ on ‘0x7E’. And the bytes of ‘0xA1~0xDF’ is used for half-width katakana.

MBCS

Multi-byte character system, contrast to SBCS, which is mainly used for Latin language family countries, MBCS is defined for the other countries who have much more characters to be represented, like China or Japan. Obviously, only 1 byte is not enough.

ISO/IEC 2022

ISO 2022 is neither a character set nor encoding system, it’s a standard or code extension technique to use single encoding system to represent multiple character sets. It specifies that 1 byte (accurately 7 bits) to represent 94 printable characters, to be consistent with ASCII, using ‘0x21~0x7E’; 2 bytes, 94 rows x 94 columns = 8836 characters, for most language, this is enough; specially, 3 bytes, 94 planes x 8836 = 830584 characters.

In addition, ISO 2022 specifies 2-layer mapping between character codes and display characters, here are 4 working sets designated, named from G0 to G3, it should detect which working set is used now, and then get the correct character in the set based on byte values. This is the default situations.

Set Range Name
CL 0x00~0x1F control codes on left, default as C0
GL 0x20~0x7F graph characters on left, default as G0
CL 0x80~0x9F default as C1
GR 0xA0~0xFF default as G1

Moreover, ISO 2022 can use ‘Escape sequences’ to indicate which working set should be used for next bytes.

Code Hex Abbr Name Effect
SI 0F LS0 Shift In / Locking shift zero GL encodes G0 from now on
SO 0E LS1 Shift Out / Locking shift one GL encodes G1 from now on
ESC n 1B 6E LS2 Locking shift two GL encodes G2 from now on
ESC o 1B 6F LS3 Locking shift three GL encodes G3 from now on
ESC N 8E/1B 4E SS2 Single shift two GL encodes G2 for next character only
ESC O 8F/1B 4F SS3 Single shift three GL encodes G3 for next character only
ESC ~ 1B 7E LS1R Locking shift one right GR encodes G1 from now on
ESC } 1B 7D LS2R Locking shift two right GR encodes G2 from now on
ESC | 1B 7C LS3R Locking shift three right GR encodes G3 from now on
ESC ! 1B 21 CZD C0-designate C0 control character set to be used
ESC “ 1B 22 C1D C1-designate C1 control character set to be used
ESC % 1B 25 DOCS Designate other coding system 8-bit code; use ESC % @ to return to ISO 2022
ESC % / 1B 25 2F DOCS Designate other coding system 8-bit code; there is no standard way to return
ESC ( 1B 28 GZD4 G0-designate 94-set 94-character set to be used for G0
ESC ( B 1B 28 42     switch to ASCII
ESC ( I/J 1B 28 49/4A     switch to JIX X 0201
ESC ) 1B 29 G1D4 G1-designate 94-set 94-character set to be used for G1
ESC * 1B 2A G2D4 G2-designate 94-set 94-character set to be used for G2
ESC + 1B 2B G3D4 G3-designate 94-set 94-character set to be used for G3
ESC - 1B 2D G1D6 G1-designate 96-set 96-character set to be used for G1
ESC . 1B 2E G2D6 G2-designate 96-set 96-character set to be used for G2
ESC . A 1B 2E 41     switch to ISO 8859-1
ESC . F 1B 2E 46     switch to ISO 8859-7
ESC / 1B 2F G3D6 G3-designate 96-set 96-character set to be used for G3
ESC $ ( 1B 24 28 GZDM4 G0-designate multibyte 94-set 94^n-character set to be used for G0
ESC $ ( C 1B 24 44     switch to KS X 1001
ESC $ ( D 1B 24 44     switch to JIS X 0212
ESC $ ( O..Q 1B 24 4F..51     switch to JIS X 0213
ESC $ ) 1B 24 29 G1DM4 G1-designate multibyte 94-set 94^n-character set to be used for G1
ESC $ ) A 1B 24 29 41     switch to GB 2312
ESC $ ) C 1B 24 29 43     switch to KS X 1001
ESC $ * 1B 24 2A G2DM4 G2-designate multibyte 94-set 94^n-character set to be used for G2
ESC $ + 1B 24 2B G3DM4 G3-designate multibyte 94-set 94^n-character set to be used for G3
ESC $ - 1B 24 2D G1DM6 G1-designate multibyte 96-set 96^n-character set to be used for G1
ESC $ . 1B 24 2E G2DM6 G2-designate multibyte 96-set 96^n-character set to be used for G2
ESC $ / 1B 24 2F G3DM6 G3-designate multibyte 96-set 96^n-character set to be used for G3
ESC $ @/B 1B 24 40/42     switch to JIS X 0208
ESC $ A 1B 24 41     switch to GB 2312

So that we can know byte ‘0x0F21’ represents the first character in G0 of regarding character set, correspondingly, ‘0x0E21’ represents the first character in G1. For instance, the bytes of ‘啊’ are ‘0x1B 0x24 0x29 0x41 0x0E 0x30 0x21’, the first 4 bytes ‘ESC $ ) A’ indicates to use ‘GB2312’, and ‘S0’ denotes G1, the last 2 bytes ‘0x3021’ points to ‘16-01’ + 0x20(G1), which is the code point of ‘啊’ in GB2312.

GB2312

‘GB2312’, and the later extend character sets, ‘GBK’, and ‘GB18030’, are all the official character sets of China. ‘GB’ abbreviates ‘GuoBiao(国家标准)’, which means national standard.

GB arranges characters into 94x94 grid, which is also called QuWei(区位) code, following ISO/IEC 2022 standard. And it calls the character zone number as ‘Qu(区)’, the position within zone as ‘Wei(位)’. For example, ‘啊’ is the first character of 16th Qu, so that its code point is 16-01.

GB2312 includes totally 7445 characters, and groups them into 3 levels.

Zone Num Description
1~9 682 Comprising punctuation and other special characters; also Hiragana, Katakana, Greek, Cyrillic, Pinyin.
16~55 3755 The first plane for Chinese characters, arranged according to Pinyin.
56~87 3008 The second plane for Chinese characters, arranged according to radical and strokes.

Let us see several regarding encoding systems.

GBK

GBK is the extension of GB2312, 21886 characters totally, in addition to the ones included in GB2312, traditional Chinese characters and other unrepresentable Chinese characters are also included. It has 5 levels, the level 1 and 2 is the same as GB2312, and level 3~5 take the byte range unused in GB2312. Check the table for details.

Level Byte 1 Byte 2
GBK/1 0xA1~0xA9 0xA1~0xFE
GBK/2 0xB0~0xF7 0xA1~0xFE
GBK/3 0x81~0xA0 0x40~0xFE except 0x7F
GBK/4 0xAA~0xFE 0x40~0xA0 except 0x7F
GBK/5 0xA8~0xA9 0x40~0xA0 except 0x7F

CNS 11643

CNS 11643, Chinese National Standard 11643, it is the official standard of Republic of China. Please pay attention, it’s Republic of China(中华民国), not PRC(People’s Republic of China), so it’s used in Taiwan, in fact, Big5 is more commonly used.

It contains 16 planes, from plane 12 to 15 are designated for user-defined characters, 141376 characters maximumly. The relevant EUC form encoding is ‘EUC-TW’.

Bytes Byte 1 Byte 2 Byte 3 Byte 4
1 0x21~0x7E      
2 0xA1~0xFE 0xA1~0xFE    
4 0x8E 0xA1~0xB0 0xA1~0xFE 0xA1~0xFE

Big5

Big5 is both a traditional Chinese character set and the relevant encoding system, which is popular used in Taiwan, HongKong and Macau, also referred as ‘cp950’ in Windows.

It does not conform the ISO 2022 standard, and with the first byte range ‘0x81~0xFE’, second byte range ‘0x40~0x7E, 0xA1~0xFE’, the detail structure is.

Byte 1 Byte 2 Description
0x81~0xA0 0x40~0x7E, 0xA1~0xFE Reserved for user-defined characters
0xA1~0xA3 0x40~0x7E, 0xA1~0xBF Graphical characters
0xA3~0xA3 0xC0~0xFE Reserved, not for user-defined characters
0xA4~0xC6 0x40~0x7E Frequently used characters, arranged according to radical and strokes
0xC6~0xC8 0xA1~0xFE Reserved for user-defined characters
0xC9~0xF9 0x40~0x7E, 0xA1~0xD5 Less frequently used characters, arranged according to radical and strokes
0xF9~0xFE 0xD6~0xFE Reserved for user-defined characters

Strictly speaking, it is a fixed DBCS, double bytes character system, it needs to use other SBCS like ASCII to represent the ones like English characters, generally, we can take it as variable-width character system like other CJK encoding systems.

And because the byte range of second byte is overlapped with SBCS ‘0x20~0x7F’, it may lead problem sometimes, for example, ‘功(0xA55C)’, the second byte is the same value of ‘\(0x5C)’, which is always used as escape character. This problem only occurs in Big5 and Shift-JIS, and won’t happen in ISO 2022 and EUC compliant encodings.

JIS X

‘JIS X 0208’, Japanese Industrial Standards, it’s very similar like GB2312, groups all characters into 94 row x 94 positions called ‘Kuten(区点)’ code, the same thing called ‘QuWei(区位)’ code in GB. Just like ‘啊’ in GB2312, the code point 16-01 in JIS X points to ‘亜’. Actually, the same structure is employed by Korean with ‘KS X’. As well as GB2312, JIS X 0208 has new extend character sets, ‘JIS X 0212’ and ‘JIS X 0213’.

JIS X 0208 includes 6879 graph characters totally, among it 6355 kanji and 524 hikanji. Unlike GB2312, it’s not fully compliant with ASCII, it uses bytes ‘0xA1~0xDF’ to represent half-width kana.

Zone Num Description
1~8 524 Special characters, numbers, Hiragana, Katakana, Greek, Cyrillic, etc.
16~47 2965 The kanji characters of level 1.
48~84 3390 The kanji characters of level 2.

Here are some regarding encoding systems.

‘Extend Unix code for JP’, it is most often used encoding system on Unix-like system in Japan, just like Shift-JIS in Windows. It has some differences with ‘EUC-CN’, for the half-width kana, with the lead byte ‘0x8E’ and latter byte ‘0xA1~0xDF’, for example, ‘0x8EA7’ represents ‘ァ’. The same reason, we can get the first byte range is ‘0xA1~0xF4’, and second ‘0xA1~0xFE’. To be noticed, if the represented character is from JIS X 0212(C3) instead of JIS X 0208, it may use 3 bytes with lead one ‘0x8F’, and others ‘0xA1~0xFE’.
| Bytes |   Byte 1  |  Byte 2   |  Byte 3   |
|:-----:|-----------|-----------|-----------|
|   1   | 0x00~0x7F |           |           |
|   2   | 0x8E      | 0xA1~0xDF |           |
|   2   | 0xA1~0xF4 | 0xA1~0xFE |           |
|   3   | 0x8F      | 0xA1~0xFE | 0xA1~0xFE |

‘JIS X 0212’ is the extended set of JIS X 0208 for supplementary plane, including 6067 characters totally, effectively dead. And the later one, ‘JIS X 0213’, 11233 characters totally, divided into 2 planes, plane 1 as the superset of ‘JIS X 0208’, and plane 2 as the revised ‘JIS X 0212’.

UCS

There are so many different languages in the world so as to define so many different characters sets, it is difficult to transfer data between different systems. Finally, someone stands up to do the unique work that groups all characters into one super big set, the universal character set, abbreviates ‘UCS’.

Here are 2 organizations publish their own definition at almost same time, Unicode calls it as ‘Unicode’, and ISO calls ‘ISO 10646’, both of them are compatible to each other, but Unicode is easy to remember so as to be more widely known. Sometimes, when saying ‘UCS’ or ‘Unicode’, taken as the same thing.

It has 16 planes, the first one, which is called as ‘BMP’ with the first 256 characters identical to ISO-8859-1, is always the mapping set for some encodings.

Plane Code Point Description
0 U+0000~U+FFFF BMP, Basic Multilingual Plane
  U+D800~U+DBFF High-surrogate codes, 1024 totally
  U+DC00~U+DFFF Low-surrogate codes, 1024 totally
  U+E000~U+F8FF Private use area, 6400 totally
  U+FDD0~U+FDEF Noncharacters, never to be used
1 U+10000~U+13FFF, U+16000~U+16FFF, U+1B000~U+1BFFF, U+1D000~U+1FFFF SMP, Supplementary Multilingual Plane
2 U+20000~U+2BFFF, U+2F000~U+2FFFF SIP, Supplementary Ideographic Plane
3~13 U+30000~U+DFFFF  
14 U+E0000~U+EFFFF SSP, Supplementary Special-purpose Plane
15~16 U+F0000~U+10FFFF S PUA, Supplementary Private Use Area

* All codes ending with ‘FFFE/FFFF’ are noncharacters, such as ‘U+FFFE’ and ‘U+10FFFF’.

Several corresponding encoding systems.

	lead surrogate = x >> 10 & 0x3FF + 0xD800
	trail surrogate = x & 0x3FF + 0xDC00

BOM

Byte Order Mark, a Unicode character preceded the first actual characters, which used to detect the byte order, and also indicates the encoding system sometimes.

UTF-16 uses ‘0xFEFF’ as BOM, if the opposite order byte ‘0xFFFE’ is detected at the first of Unicode text, we can know that the byte order is little-endian, this is also why ‘U+FFFE’ is defined as noncharacter in Unicode.

UTF-8 uses ‘0xEF 0xBB 0xBF’ as BOM instead, which is the corresponding representation of ‘U+FEFF’ in UTF-8, because UTF-8 doesn’t care the byte order, it always use to denote UTF-8 encoding if appears in the beginning of text.

Others

Here are some other encoding ways used for data transmission or other areas.

ANSI

ANSI, American National Standards Institute. The first code page ‘cp1252’ defined by Microsoft, is based on a ANSI draft of ‘ISO-8859-1’, so as to the Windows code pages are also known as ANSI code pages.

If we choose save file with ‘ANSI’, that means we choose the default encoding based on system language, for examples, ‘ASCII’ for EN, ‘GB2312 / GBK’ for CN, ‘JIS’ for JP.

Base64

This is an encoding way to represent binary data with 64 ASCII characters, which uses ‘A-Za-z0-9’ fixed 62 characters, and other 2 printable characters which may be different in each variations. We can always see this encoding in email or images representation.

Since it has 64 characters totally, so that 3 bytes are encoded into 4 Base64 characters, if string bytes number is not divisible by 3, append 0 in bits at the end, and use pad symbol to indicate how many significant byte in last 3 bytes group.

Let us take an example based on RFC 1421, it uses ‘+/’ as last 2 characters, and ‘=’ as pad symbol.

	   base64
	h ------> aA==

	|      h      |
	|     68      |
	| 01 10 10 00 |
	| 01 10 10 00 00 00 |
	|    26   |    0    |
	|    a    |    A    |

	   base64
	hi ------> aGk=

	|      h     |     i     |
	|     68     |    69     |
	| 01 10 10 00 01 10 10 01 |
	| 01 10 10 00 01 10 10 01 00 |
	|    26   |   6    |   36    |
	|    a    |   G    |    k    |

‘UTF-7’ is a variant of Base64, it encodes the UTF-16 encoding bytes via the way of Base64, and it can use ‘+/v8 | +/v9 | +/v+ | +/v/’ as BOM, so that sometimes UTF-7 encoding strings can jump over the normal ways to check script tags like htmlentities.

Quoted-printable

This is an encoding to change all bytes except printable ASCII characters to 2 hexadecimal characters with ‘=’ as prefix, for instance, the bytes of ‘啊’ in GB2312 are ‘0xB0A1’, changes to the QP encoding way is ‘=B0=A1’. The ‘=’ character itself is represented by ‘=3D’.

This is a simple email source with the same string ‘你好,世界’ for both subject and content, you can see that the subject is use Base64 encoding indicates by ‘?B?’ and content QP encoding.

Subject: =?utf-8?B?5L2g5aW977yM5LiW55WM?=
Content-Transfer-Encoding: quoted-printable
Content-Type: TEXT/PLAIN; charset=utf-8
=E4=BD=A0=E5=A5=BD=EF=BC=8C=E4=B8=96=E7=95=8C

Similar ‘Percent-encoding’ is used for URL.

Review

This is the table as summary including all encodings byte ranges and BOM mentioned above. Basically, character encoding detect tool will check BOM or escape sequence first, this is the most significant and simple way to know the encoding; and then excludes the encodings if out of byte range; if still no result gets, it checks the bytes based on word frequency to infer the most closed encoding.

Encoding Byte 1 Byte 2 Byte 3 Byte 4 BOM / Escape sequence
ASCII 0x00~0x7F        
ISO 8859-* 0x00~0xFF        
ISO 2022-CN 0x00~0x7F        
  0x21~0x7E 0x21~0x7E     0x1B 0x24 [0x29] 0x41
ISO 2022-JP 0x00~0x7F        
  0x21~0x7E 0x21~0x7E     0x1B 0x24 0x42
EUC-CN 0x00~0x7F        
  0xA1~0xF7 0xA1~0xFE      
EUC-TW 0x00~0x7F        
  0xA1~0xFE 0xA1~0xFE      
  0x8E 0xA1~0xB0 0xA1~0xFE 0xA1~0xFE  
EUC-JP 0x00~0x7F        
  0x8E 0xA1~0xDF      
  0xA1~0xF4 0xA1~0xFE      
  0x8F 0xA1~0xFE 0xA1~0xFE    
Big5 0x00~0x7F        
  0xB1~0xA0 0x40~0x7E, 0xA1~0xFE      
  0xA1~0xA3 0x40~0x7E, 0xA1~0xBF      
  0xA3~0xA3 0xC0~0xFE      
  0xA4~0xC6 0x40~0x7E      
  0xC6~0xC8 0xA1~0xFE      
  0xC9~0xF9 0x40~0x7E, 0xA1~0xD5      
  0xF9~0xFE 0xD6~0xFE      
Shift-JIS 0x00~0x7F, 0xA1~0xDF        
  0x81~0x9F, 0xE0~0xEA 0x40~0x7E, 0x80~0xFC      
UTF-8 0x00~0x7F       0xEF 0xBB 0xBF
  0xC2~0xDF 0x80~0xBF      
  0xE0~0xEF 0x80~0xBF 0x80~0xBF    
  0xF0~0xF4 0x80~0xBF 0x80~0xBF 0x80~0xBF  
UCS-2 0x00~0xFF 0x00~0xFF      
UTF-16 0x00~0x7F       0xFF 0xFE / 0xFE 0xFF
  0x00~0xFF 0x00~0xFF      

Thanks so much for the work of ashtuchkin, here are the detail tables for different encoding systems.


Here also has a small tool to convert between different character encodings.

charset 1 encoding 1 study 4