Character encoding notes: ASCII, Unicode and UTF-8

A long, long time ago, a group of people decided to use 8 transistors, each of which could be switched on or off, in different combinations to represent everything in the world. They saw that the 8 switch states were good, so they called them a "byte". Later they built machines that could process these bytes; when the machines ran, they could use bytes to compose many states, and the states kept changing. They saw that this was good, so they called these machines "computers".

At first, computers were used only in the United States. An eight-bit byte can be combined into 256 (2 to the 8th power) different states. The 32 states numbered from 0 upward were reserved for special purposes: whenever a terminal or printer received one of these agreed-upon bytes, it had to perform some agreed-upon action. On meeting 0x0A the terminal starts a new line; on meeting 0x07 it beeps at the user; on meeting 0x1B the printer prints in reverse video, or the terminal shows letters in color. They saw that this was very good, so they called the byte states below 0x20 "control codes".

They then assigned consecutive byte values to all the spaces, punctuation marks, digits and upper- and lowercase letters, up to number 127, so that computers could store English text in bytes. Everyone saw that this was good, so everyone called this scheme the ASCII code (American Standard Code for Information Interchange). At that time, every computer in the world used the same ASCII scheme to store English text.

Later, just as with the Tower of Babel, computers spread all over the world, but many countries do not use English, and many of their letters are not in ASCII. To store their own text on the computer, they decided to use the values after 127 to represent the new letters and symbols, and they also added many shapes needed for drawing tables, such as horizontal lines, vertical lines and crosses, numbering the values all the way up to the last state, 255. The characters from 128 to 255 on this page are called the "extended character set". From then on, greedy mankind had no spare states left to use.

The American imperialists probably never imagined that people in third-world countries would also hope to use computers! When the Chinese got hold of computers, there were no byte states left to represent Chinese characters, yet more than 6,000 commonly used Chinese characters needed to be stored. But this did not stump the resourceful Chinese. We unceremoniously discarded the strange symbols after 127 and stipulated: a byte less than 127 keeps its original meaning, but two bytes greater than 127 joined together represent one Chinese character; the first byte (called the high byte) runs from 0xA1 to 0xF7, and the second byte (the low byte) from 0xA1 to 0xFE. In this way we could compose roughly 7,000 simplified Chinese characters. Into these codes we also packed mathematical symbols, Roman and Greek letters, and Japanese kana. Even the digits, punctuation marks and letters that already existed in ASCII were re-encoded as two-byte codes; these are what we usually call "full-width" characters, while those below 127 are called "half-width" characters. The Chinese saw that this was very good, so they called this scheme "GB2312". GB2312 is a Chinese extension of ASCII.
But China has far too many Chinese characters, and we soon found that many people's names could not be typed, which was especially awkward when those people were national leaders. So we had to keep hunting for code points that GB2312 had not used and press them into service. Eventually even that was not enough, so the low byte was no longer required to be a code after 127: as long as the first byte is greater than 127, it always marks the beginning of a Chinese character, whether or not the byte that follows belongs to the extended character set. The expanded scheme was named GBK. GBK includes everything in GB2312 and adds nearly 20,000 new Chinese characters (including traditional characters) and symbols. Later, ethnic minorities began to use computers too, so the scheme was extended again with thousands of new minority-script characters, and GBK grew into GB18030. From then on, the culture of the Chinese nation could be carried forward into the computer age.

Chinese programmers saw that this series of Chinese character encoding standards was good, so they called them "DBCS" (Double Byte Character Set). The defining feature of the DBCS family is that two-byte Chinese characters and one-byte English characters coexist in the same encoding scheme. Therefore, to support Chinese in their programs, programmers had to watch the value of every byte in a string: if a value was greater than 127, a character from the double-byte character set had begun. In those days, every computer monk blessed with the ability to program had to recite the following mantra hundreds of times a day: "One Chinese character counts as two English characters! One Chinese character counts as two English characters..."

Because every country at the time developed its own encoding standard just as China did, nobody understood anyone else's encoding and nobody supported anyone else's either. Even mainland China and Taiwan, brother regions only 150 nautical miles apart and using the same language, adopted different DBCS schemes. In those days, if the Chinese wanted a computer to display Chinese characters, they had to install a "Chinese character system" dedicated to handling the display and input of Chinese; but a fortune-telling program written in Taiwan required the BIG5-encoded "ETen (Yitian) Chinese System" to run. Install the wrong character system and the display turned into a mess! And what about the poorer peoples in the forest of nations, who for the time being had no access to computers at all? What would become of their scripts? This truly was the computer world's Tower of Babel.

At that moment the archangel Gabriel appeared just in time: an international body called ISO (the International Organization for Standardization) decided to tackle the problem. Their approach was very simple: abolish all the regional encoding schemes and build a brand-new code that includes every culture, every letter and every symbol on Earth! They planned to call it the "Universal Multiple-Octet Coded Character Set", UCS for short, commonly known as "UNICODE".
When UNICODE was first being drawn up, computer memory had already grown enormously and space was no longer a problem, so ISO simply stipulated that every character must be represented uniformly with two bytes, that is, 16 bits. For the "half-width" characters of ASCII, UNICODE keeps the original code values unchanged but extends their length from the original 8 bits to 16 bits, while the characters of all other languages and cultures are re-encoded from scratch. Since a "half-width" English character only needs the low 8 bits, the high 8 bits are always 0, so this generous scheme wastes twice the space when storing English text.

At this point, programmers who had come up in the old society noticed a strange phenomenon: their strlen function was no longer reliable, and a Chinese character was no longer equivalent to two characters but to one! Yes, starting from UNICODE, a half-width English letter and a full-width Chinese character are both uniformly "one character", and both are uniformly "two bytes". Note that "character" and "byte" are two different terms: a "byte" is an 8-bit physical storage unit, while a "character" is a culturally meaningful symbol. In UNICODE, one character is two bytes. The era of "one Chinese character counts as two English characters" was almost over.

Earlier, when multiple character sets coexisted, companies making multilingual software ran into plenty of trouble. To sell the same software in different countries, they had to wrestle with double-byte character sets when localizing it, being careful not to slip up anywhere and converting the software's text back and forth between character sets. UNICODE was a neat all-in-one solution for them, so starting with Windows NT, Microsoft took the opportunity to rework its operating system and converted all the core code to a version that runs on UNICODE. From then on, WINDOWS finally no longer needed various local language systems to be installed, and could display the characters of every culture in the world.

However, UNICODE was designed without maintaining compatibility with any existing encoding scheme, so GBK and UNICODE lay out the internal codes of Chinese characters completely differently. There is no simple arithmetic that converts text between UNICODE and another encoding; the conversion has to be done by table lookup.

As mentioned earlier, UNICODE represents one character with two bytes, which gives 65,536 different combinations in total, probably enough to cover every cultural symbol in the world. And if that is not enough, ISO has the UCS-4 scheme ready: simply put, four bytes per character, which allows about 2.1 billion different characters (the highest bit is reserved for other uses). That should last until the day the Galactic Federation is founded!

UNICODE arrived together with the rise of computer networks, and how UNICODE is transmitted over a network also had to be considered, so a number of transmission-oriented UTF (UCS Transformation Format) standards appeared. As the names suggest, UTF-8 transmits data 8 bits at a time and UTF-16 transmits 16 bits at a time. For the sake of reliable transmission, there is no direct one-to-one correspondence between UNICODE and UTF; the conversion follows certain algorithms and rules.
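
If you want to see the "character versus byte" distinction for yourself, here is a minimal Python sketch (the sample string and variable name are just for illustration): the same text has one length in characters and quite different lengths in bytes, depending on the encoding.

text = "汉字ABC"                        # 2 Chinese characters + 3 English letters
print(len(text))                       # 5, counted in characters, not bytes
print(len(text.encode("utf-16-le")))   # 10: two bytes per character, UCS-2 style
print(len(text.encode("gbk")))         # 7: 2 bytes per Chinese character, 1 per letter
print(len(text.encode("utf-8")))       # 9: 3 bytes per Chinese character, 1 per letter
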
Here is a set of conversion rules from UNICODE to UTF-8, quoted from the Internet:

Unicode (hexadecimal) | UTF-8 (binary)
0000-007F             | 0xxxxxxx
0080-07FF             | 110xxxxx 10xxxxxx
0800-FFFF             | 1110xxxx 10xxxxxx 10xxxxxx

For example, the Unicode code of "汉" is 6C49. 6C49 lies between 0800 and FFFF, so the three-byte template 1110xxxx 10xxxxxx 10xxxxxx is used. Writing 6C49 in binary gives 0110 1100 0100 1001. Splitting this bit stream according to the three-byte template gives 0110 110001 001001, and filling these groups into the x positions of the template in turn gives 1110-0110 10-110001 10-001001, that is, E6 B1 89. This is its UTF-8 encoding.

Speaking of which, let us mention a famous curiosity. Create a new file in Windows Notepad, type the word "联通" (Unicom), save it, close it and open it again, and you will find that the two characters have vanished, replaced by a few garbled ones! Some people joke that this is why China Unicom cannot beat China Mobile. In fact, it happens because the GB2312 encoding and the UTF-8 encoding collide. When you create a new text file, Notepad's default encoding is ANSI, and if you type Chinese characters under ANSI encoding, what is actually used is the GB-series encoding. Under that encoding, the internal code of "联通" is:

c1 1100 0001
aa 1010 1010
cd 1100 1101
a8 1010 1000

Notice anything? The first two bytes, and likewise the third and fourth bytes, begin with "110" and "10", which exactly matches the two-byte template in the UTF-8 rules. So when you open the file again, Notepad mistakes it for a UTF-8 encoded file. Strip the 110 from the first byte and the 10 from the second and we get "00001 101010"; aligning the bits and padding with leading zeros gives "0000 0000 0110 1010", which, sorry to say, is UNICODE 006A, the lowercase letter "j". The next two bytes, decoded with the UTF-8 rule, give 0368, which is not a character that displays anything by itself. That is why a file containing only the word "联通" cannot be displayed properly in Notepad. If you type a few more words after "联通", the encodings of the other characters will most likely not happen to be bytes starting with 110 and 10, so when you open the file again Notepad will not insist that it is a UTF-8 file; it will interpret it as ANSI, and no garbage will appear.

Computer monks blessed with the gift of network programming know that a very important issue when transmitting information over a network is how to interpret the high- and low-order bytes of data. Some computers send the low-order byte first, such as our PCs with the INTEL architecture; this is called little endian. Others send the high-order byte first, which is called big endian. When exchanging data over the network, to check whether both sides agree about the byte order, a very simple method is used: an identifier is sent at the beginning of the text stream. If the text that follows is high byte first, "FEFF" is sent; otherwise, "FFFE" is sent. If you do not believe it, open a file saved in one of the UTF-16 formats in binary mode and see whether the first two bytes are exactly these.

By the way, here is the origin of the two networking terms little endian and big endian. In "Gulliver's Travels", Lilliput splits into rival factions over whether eggs should be cracked from the big end or from the small end.
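
For the curious, both facts used above are easy to verify in Python (a small sketch, not part of the original story): "汉" really is U+6C49 and encodes to E6 B1 89 in UTF-8, and the GBK bytes of "联通" really do match the UTF-8 lead/continuation patterns 110xxxxx and 10xxxxxx.

print(hex(ord("汉")))               # 0x6c49
print("汉".encode("utf-8").hex())   # e6b189

for b in "联通".encode("gbk"):      # the bytes c1 aa cd a8
    print(f"{b:02x} {b:08b}")
# c1 11000001  (looks like a 2-byte UTF-8 lead byte, 110xxxxx)
# aa 10101010  (looks like a continuation byte, 10xxxxxx)
# cd 11001101
# a8 10101000
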
Wars were fought over it, and even an emperor lost his life. In the development of computer technology, the question of whether the big end or the little end goes first has caused equally serious problems for communication between the hardware of different systems, so the technical experts (the great majority of whom are humorous people) adopted the word "endian", a term with a strong political flavour.

Well, I can finally answer NICO's question. In a database, the string types with the n prefix are the UNICODE types. In these types, a character always occupies exactly two bytes, whether it is a Chinese character, an English letter or anything else. The following example should illustrate the difference between unicode and ansi type fields. Create a table, in any kind of database, containing the following fields:

nc nchar(10)
c  char(10)

Then try to insert the following two records:

"1234567890", "1234567890"
"一二三四五六七八九十", "一二三四五六七八九十"

For the first record, both fields can hold all 10 characters, and neither can hold even one character more. For the second record, the nc field can store everything from "一" to "十", while the c field can store at most "五"; anything more causes an error. Why? Because in an nchar field one Chinese character is one character, so a field 10 characters wide can hold 10 Chinese characters; in a char field one Chinese character counts as two characters, so a field 10 characters wide can hold only 5 Chinese characters, and anything more causes an error.

The information stored in a computer is represented as binary numbers; the English letters, Chinese characters and other symbols we see on the screen are the result of converting those binary numbers. In plain terms, the rule by which characters such as 'a' are stored in the computer is called "encoding"; conversely, parsing the binary numbers stored in the computer and displaying them is called "decoding", much like encryption and decryption in cryptography. If the wrong decoding rule is used during decoding, 'a' may be parsed as 'b', or turned into garbled text.
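
As a minimal Python illustration of encoding, decoding and the garbling caused by the wrong rule (the choice of GB2312 and Latin-1 here is just for demonstration):

data = "严".encode("gb2312")     # encoding: character to bytes, here b'\xd1\xcf'
print(data.decode("gb2312"))     # 严, decoded with the right rule
print(data.decode("latin-1"))    # ÑÏ, decoded with the wrong rule: garbled text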

Charset: the collection of all abstract characters supported by a system. "Character" is the general term for all kinds of letters and symbols, including the written characters of various languages, punctuation marks, graphic symbols, digits and so on.

Character Encoding: a set of rules that pairs a set of natural-language characters (such as an alphabet or a syllabary) with a set of something else (such as numbers or electrical pulses); that is, it establishes a correspondence between a set of symbols and a number system, and it is a basic technique of information processing. People usually express information with collections of symbols (generally, written characters), while computer-based information-processing systems store and process information using combinations of the different states of their components (hardware). These combinations of states can represent the numbers of a number system, so character encoding converts symbols into numbers of a number system that the computer can accept, called digital codes.

Common character sets include the ASCII character set, the GB2312 character set, the BIG5 character set, the GB18030 character set, the Unicode character set and so on. To process the characters of these various character sets correctly, the computer needs a character encoding so that it can recognize and store them.

1. ASCII code

We know that inside a computer, all information is ultimately represented as a binary string. Each binary bit has two states, 0 and 1, so eight binary bits can be combined into 256 states; this unit is called a byte. In other words, a byte can represent 256 different states in total, each state corresponding to one symbol, that is 256 symbols, ranging from 00000000 to 11111111.

The United States drew up a set of character codes that uniformly specified the relationship between English characters and binary numbers. This is the ASCII code, and it is still in use today.

The ASCII code specifies 128 characters in total. For example, the space character "SPACE" is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 32 control characters that cannot be printed) occupy only the last 7 bits of a byte; the first bit is uniformly set to 0.
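
These values are easy to verify; for instance, a quick Python check (just a sketch):

print(ord(" "), ord("A"))        # 32 65
print(format(ord("A"), "08b"))   # 01000001, the leading bit is 0
print("A".encode("ascii"))       # b'A', one byte per character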

2. Non-ASCII encoding

For English, 128 symbols are enough, but for other languages 128 symbols are not. In French, for instance, a letter with a diacritical mark above it cannot be represented in ASCII. So some European countries decided to use the unused highest bit of the byte to encode new symbols. For example, the code of é in French is 130 (binary 10000010). In this way, the encoding systems used in these European countries could represent up to 256 symbols.

However, different countries have different letters, so even though they all use this kind of 256-value encoding, the same value stands for different letters. For example, 130 represents é in the French encoding, the letter Gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In any case, in all of these encodings the symbols 0 through 127 are the same; the only difference lies in the range 128 through 255.
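
Here is a small Python sketch of that point. The particular legacy code pages are chosen only for illustration (CP437 for the old US/Western DOS set, CP862 for the Hebrew DOS set), but the effect is general: the same byte value means different letters under different encodings.

b = bytes([130])
print(b.decode("cp437"))                     # é, the Western interpretation of byte 130
print(b.decode("cp862"))                     # ג, the Hebrew interpretation of the same byte
print(b.decode("ascii", errors="replace"))   # not a valid ASCII byte at all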

The scripts of Asian countries use even more symbols; there are as many as 100,000 Chinese characters. One byte can represent only 256 symbols, which is certainly not enough, so multiple bytes must be used to represent one symbol. For example, the common encoding for simplified Chinese is GB2312, which uses two bytes to represent one Chinese character, so in theory it can represent up to 256 x 256 = 65536 symbols.
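
As a quick check of the two-byte claim, a Python sketch (the sample characters are arbitrary):

raw = "中文".encode("gb2312")
print(raw.hex(" "))                 # d6 d0 ce c4: two bytes per Chinese character
print(all(b > 0x7F for b in raw))   # True: every byte is above 127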

The issue of Chinese encoding deserves a dedicated article and is not covered here. It is only pointed out that, although multiple bytes are used to represent one symbol, the GB-family Chinese character encodings have nothing to do with the Unicode and UTF-8 discussed below.

3. Unicode

There are many encoding methods in the world, and the same binary number can be interpreted as different symbols. Therefore, if you want to open a text file, you must know its encoding method, otherwise garbled characters will appear if you decode it with the wrong encoding method. Why do emails often appear garbled? It is because the coding method used by the sender and the recipient is different.

It is conceivable that if there were a single encoding that included every symbol in the world, with each symbol assigned a unique code, then the problem of garbled text would disappear. This is Unicode: as its name implies, an encoding for all symbols.

Unicode is of course a very large collection; its current scale can accommodate more than a million symbols. Each symbol has a different code. For example, U+0639 represents the Arabic letter Ain, U+0041 represents the English capital letter A, and U+4E25 represents the Chinese character 严 ("strict"). For the specific correspondence between symbols and code points, you can consult the Unicode character tables.
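
These code points can be inspected directly; a minimal Python sketch:

print(hex(ord("A")))    # 0x41, i.e. U+0041
print(hex(ord("严")))   # 0x4e25, i.e. U+4E25
print("\u0639")         # ع, i.e. U+0639, the Arabic letter Ain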

4. The Unicode problem

It should be noted that Unicode is only a character set: it specifies the binary code of each symbol, but it does not specify how that binary code should be stored.

For example, the Unicode code of the Chinese character 严 is the hexadecimal number 4E25, which in binary is 15 bits long (100111000100101), so representing this symbol requires at least 2 bytes. Other, larger symbols may require 3 or 4 bytes, or even more.

There are two serious problems here. The first: how can Unicode be distinguished from ASCII? How does the computer know that three bytes represent one symbol rather than three separate symbols? The second: we already know that an English letter needs only one byte, so if Unicode stipulated that every symbol be represented by three or four bytes, then every English letter would necessarily be preceded by two or three bytes that are all 0, which is a great waste of storage; text files would become two or three times larger, which is unacceptable.

The result is that multiple storage formats for Unicode appeared; that is, there are many different binary formats that can be used to represent Unicode.
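
To make the trade-off concrete, here is a small Python sketch comparing a few of these storage formats for a single English letter (the fixed-width formats pad it with zero bytes, UTF-8 does not):

print("A".encode("utf-32-be").hex(" "))   # 00 00 00 41: 4 bytes, mostly zeros
print("A".encode("utf-16-be").hex(" "))   # 00 41: 2 bytes, one of them zero
print("A".encode("utf-8").hex(" "))       # 41: 1 byte, same as ASCII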

5. UTF-8

UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 (characters are represented by two or four bytes) and UTF-32 (characters are represented by four bytes), but they are rarely used on the Internet. To repeat: UTF-8 is just one of the ways of implementing Unicode.

One of the biggest features of UTF-8 is that it is a variable-length encoding method. It can use 1 to 4 bytes to represent a symbol, and the byte length varies according to different symbols.

UTF-8 encoding rules are very simple, there are only two:

1) For a single-byte symbol, the first bit of the byte is set to 0, and the following 7 bits are the unicode code of this symbol. So for English letters, UTF-8 encoding and ASCII code are the same.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are all set to 1, the (n+1)th bit is set to 0, and the first two bits of each of the following bytes are set to 10. The remaining bits, not mentioned above, are filled with the Unicode code of the symbol.

The following table summarizes the coding rules, the letter x represents the available coded bits.

Unicode symbol range (hexadecimal) | UTF-8 encoding method (binary)
-----------------------------------+-------------------------------------
0000 0000-0000 007F                | 0xxxxxxx
0000 0080-0000 07FF                | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF                | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF                | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Interpreting UTF-8 encoding is very simple. If the first bit of a byte is 0, the byte by itself is one character; if the first bit is 1, then the number of consecutive leading 1s tells how many bytes the current character occupies.
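
The rule is simple enough to write down as code. The following Python sketch (the helper name utf8_char_length is made up for this example) reads the lead byte and reports how many bytes the character occupies:

def utf8_char_length(lead_byte: int) -> int:
    if lead_byte < 0b10000000:      # 0xxxxxxx: a single-byte (ASCII) character
        return 1
    if lead_byte >> 5 == 0b110:     # 110xxxxx: a 2-byte character
        return 2
    if lead_byte >> 4 == 0b1110:    # 1110xxxx: a 3-byte character
        return 3
    if lead_byte >> 3 == 0b11110:   # 11110xxx: a 4-byte character
        return 4
    raise ValueError("not a valid UTF-8 lead byte")

print(utf8_char_length("严".encode("utf-8")[0]))   # 3
print(utf8_char_length(b"A"[0]))                   # 1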

Next, take the Chinese character 严 ("strict") as an example to demonstrate how the UTF-8 encoding is derived.

We know that the Unicode code of 严 is 4E25 (100111000100101). From the table above, 4E25 falls in the range of the third row (0000 0800-0000 FFFF), so the UTF-8 encoding of 严 requires three bytes, that is, the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last binary digit of 严's code, fill the x positions of the format from back to front, and pad the remaining positions with 0. The UTF-8 encoding of 严 is therefore "11100100 10111000 10100101", which in hexadecimal is E4B8A5.
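
The same steps can be reproduced in a few lines of Python (a sketch, checked against the built-in codec):

cp = 0x4E25                                # code point of 严
b1 = 0b11100000 | (cp >> 12)               # 1110xxxx, filled with the top 4 bits
b2 = 0b10000000 | ((cp >> 6) & 0b111111)   # 10xxxxxx, filled with the middle 6 bits
b3 = 0b10000000 | (cp & 0b111111)          # 10xxxxxx, filled with the low 6 bits
print(bytes([b1, b2, b3]).hex())           # e4b8a5
print("严".encode("utf-8").hex())          # e4b8a5, the codec agrees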

6. Conversion between Unicode and UTF-8

As you can see, the Unicode code of 严 is 4E25 while its UTF-8 encoding is E4B8A5. The two are different, and converting between them can be done with a program.
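
In Python, for example, the conversion is simply a decode followed by an encode (a minimal sketch; inside the codecs this is table lookup, not arithmetic):

gb = "严".encode("gb2312")          # b'\xd1\xcf', the GB2312 bytes
text = gb.decode("gb2312")          # back to the Unicode character U+4E25
print(text.encode("utf-8").hex())   # e4b8a5, the UTF-8 form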

On the Windows platform, one of the simplest conversion methods is to use the built-in Notepad applet Notepad.exe. After opening the file, click the "Save As" command in the "File" menu, and a dialog box will pop up, with a drop-down bar of "Encoding" at the bottom.

There are four options: ANSI, Unicode, Unicode big endian and UTF-8.

1) ANSI is the default encoding method. For English files, it is ASCII encoding, and for simplified Chinese files, it is GB2312 encoding (only for Windows simplified Chinese version, if it is traditional Chinese version, Big5 code will be used).

2) Unicode encoding here refers to the UCS-2 encoding, that is, storing each character's Unicode code directly in two bytes. This option uses the little endian format.

3) Unicode big endian encoding corresponds to the previous option. I will explain the meaning of little endian and big endian in the next section.

4) UTF-8 encoding, which is the encoding method discussed in the previous section.

After selecting "Encoding Method", click the "Save" button, and the encoding method of the file will be converted immediately.

7. Little endian and Big endian

A Unicode code can be stored directly in UCS-2 format. Take the Chinese character 严 as an example: its Unicode code is 4E25 and it needs two bytes of storage, one byte being 4E and the other 25. Storing 4E first and 25 second is the Big endian mode; storing 25 first and 4E second is the Little endian mode. Big endian and little endian are simply different ways for a CPU to handle multi-byte numbers. Take "汉" as another example: its Unicode code is 6C49. When writing it to a file, should 6C be written first, or 49? Writing 6C first is big endian; writing 49 first is little endian.

In other words, storing the high-order byte first is the "Big endian" mode, and storing the low-order byte first is the "Little endian" mode.
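
A quick Python sketch of the two byte orders for the same character:

print("严".encode("utf-16-be").hex(" "))   # 4e 25, big endian: high byte first
print("严".encode("utf-16-le").hex(" "))   # 25 4e, little endian: low byte first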

A natural question then arises: how does the computer know which byte order a particular file uses?

The Unicode specification defines that a character indicating the byte order is added at the very beginning of each file. The name of this character is "ZERO WIDTH NO-BREAK SPACE", and it is represented by FEFF. This is exactly two bytes, and FF is 1 greater than FE.

If the first two bytes of a text file are FE FF, it means that the file adopts big-end mode; if the first two bytes are FF FE, it means that the file adopts small-end mode.
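
That check is easy to express in code. A minimal Python sketch (the function name sniff_bom is made up for this example):

import codecs

def sniff_bom(first_two_bytes: bytes) -> str:
    if first_two_bytes == codecs.BOM_UTF16_BE:   # b'\xfe\xff'
        return "UTF-16, big endian"
    if first_two_bytes == codecs.BOM_UTF16_LE:   # b'\xff\xfe'
        return "UTF-16, little endian"
    return "no UTF-16 byte order mark"

data = codecs.BOM_UTF16_BE + "严".encode("utf-16-be")
print(sniff_bom(data[:2]))   # UTF-16, big endian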

8. Examples

Below is an example.

Open the "Notepad" program Notepad.exe, create a new text file, the content is a "strict" word, and save it in ANSI, Unicode, Unicode big endian and UTF-8 encoding in turn.

Then use the hexadecimal view in the text editor UltraEdit to observe the file's internal encoding.

1) ANSI: the file's encoding is the two bytes "D1 CF", which is exactly the GB2312 encoding of 严; this also implies that GB2312 stores the high byte first (big endian order).

2) Unicode: the encoding is the four bytes "FF FE 25 4E", where "FF FE" indicates little endian storage, and the real encoding is 4E25.

3) Unicode big endian: The encoding is four bytes "FE FF 4E 25", where "FE FF" indicates that it is stored in big endian format.

4) UTF-8: the encoding is the six bytes "EF BB BF E4 B8 A5". The first three bytes "EF BB BF" indicate that this is UTF-8 encoding, and the last three, "E4 B8 A5", are the actual encoding of 严; their storage order is the same as the encoding order.
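
Those six bytes can be reproduced in Python as well (a small sketch): the three-byte UTF-8 byte order mark followed by the UTF-8 encoding of 严; the "utf-8-sig" codec strips the mark again when decoding.

import codecs

data = codecs.BOM_UTF8 + "严".encode("utf-8")
print(data.hex(" "))              # ef bb bf e4 b8 a5
print(data.decode("utf-8-sig"))   # 严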