What is Unicode?
– What is a string? What is text? you might ask!
– A collection of characters?
– Good guess.
– And a character is?
– Any human readable one, as in ¥, é, 😀, (, Յ, or أ.
– And to translate this for the computer?
– Well, you form a character set, which contains the characters that you wish to represent, as in {a, b, c ...}.
– Any well known character sets?
– ASCII, ISO-8859-1, and the Unicode character set, which is the set of all the characters in the world.
– Okay.
– And the computer understands?
– Numbers.
– Correct! So at this stage, you assign to each character in the character set a number. This number is given a name; in Unicode, for example, it is called a code point.
– Yeah, but the computer understands what?
– Bits.
– Ah, okay.
– So you need to translate your numbers into bits, meaning 0 and 1.
– Okay.
– This can be done in a direct manner.
– As in?
– The code point chosen for the character in the character set is converted into its binary representation, and that is it.
– Oh, okay, so this is what encoding is about!
– Correct. And if you prefer not to use the code point's binary representation directly as the encoding ...
– Yes?
– You can specify a different encoding scheme, that is, a different way of having the code point of a character in a character set represented, or encoded, using 1 and 0.
– Any examples?
– Yes! Unicode defines utf-8, utf-16 and utf-32 for encoding the code points of the Unicode character set.
– A concrete one?
– Okay! To give a concrete example, the letter c, which belongs to the Unicode character set, has a code point of U+0063. Its utf-8 encoding is 01100011, its utf-16 encoding on a little endian machine is 01100011 00000000, and its utf-32 encoding on a little endian machine is 01100011 00000000 00000000 00000000.
Unicode code points
As explained in the previous section, Unicode is a character set representing all the characters of the world. Each character in this character set has a code point assigned to it, and additionally it has an encoding.
A code point is just a number, which is conventionally written in hexadecimal. Examples of characters and their code points are a, which has a Unicode code point of U+0061, and the ohm sign Ω, which has a code point of U+2126.
Code points can be between 0x0 and 0x10_FF_FF in hexadecimal, or in other words, between 0 and 1_114_111 in decimal.
This being said, 21 bits are sufficient for representing all the Unicode code points. The range of 21 bits is actually between 0 and 2_097_151 in decimal, but Unicode code points are restricted to values between 0 and 1_114_111 in decimal.
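In Python, this upper bound can be checked directly: chr accepts 0x10_FF_FF but rejects anything above it.
>>> chr(0x10FFFF)     # the last valid code point
'\U0010ffff'
>>> chr(0x110000)     # one past the last code point
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: chr() arg not in range(0x110000)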
Encoding
As explained in the first section, computers understand only numbers, and ultimately bits, which are represented using electric signals.
After having decided what characters are to form a character set, numbers, which are the code points, are assigned to these characters, and these numbers or code points are next converted into bits. This is what is called encoding.
The conversion can be direct, as in having the code point's binary representation used as the encoding. A historical example is the ASCII character set. In ASCII, the letter a has a code point of 97 in decimal, and its encoding is also 97 in decimal, which in bits is 1100001. ASCII uses 7 bits for encoding.
In Unicode, multiple encoding schemes exist:
- utf-8 is a variable length encoding. The first 128 Unicode code points, that is, code points from 0 to 127, are encoded using their binary representation directly. For code points larger than 127, the code point's binary representation is dispersed among multiple bytes.
- utf-16 is a variable length encoding. The first 65536 code points, that is, code points from 0 to 65535, are encoded using their binary bits directly, whereas for code points larger than 65535, 0x1_00_00 is subtracted from the code point, and the bits of the result are dispersed among 4 bytes.
- utf-32 is a fixed length encoding, which uses 4 bytes to encode all code points. Using 4 bytes means that all code points can be encoded directly, using their binary representation.
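As a small illustration of the three schemes, here is the same character, é (U+00E9), encoded with each of them in Python (the -be variants are used here to leave out the BOM, which is discussed in a later section):
>>> "é".encode("utf-8")
b'\xc3\xa9'
>>> "é".encode("utf-16-be")
b'\x00\xe9'
>>> "é".encode("utf-32-be")
b'\x00\x00\x00\xe9'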
utf-8
In utf-8, a code point can be encoded using either 1, 2, 3, or 4 bytes.
1 byte encoding
Unicode code points between 0 and 127 in decimal are encoded in utf-8 using only 1 byte, by directly using their binary representation.
How does this map to bits, you might ask? Well, 0 maps to 00000000, and 127, which is 0x7F in hexadecimal, maps to 01111111.
This being said, when using 1 byte to encode code points between 0 and 127 in decimal, or 0 and 0x7F in hexadecimal, the leading bit is always 0.
0000 0000    0x00
0111 1111    0x7F
A demo of how to get code points and encodings:
# python
>>> ord('0')            # Code point
# ord can be used to get the
# code point in decimal of a
# character in python.
# The character 0 has a code
# point of 48.
48
>>> hex(ord('0'))       # hex Code point
# If you want to get the code points
# in hexadecimal in python, then you
# can use hex.
'0x30'
# The unicode code point in hexadecimal
# of 0 is 0x30
>>> bin(ord('0'))       # Binary Code point
# For code points between 0
# and 127, you can get their
# binary encoding in python
# by first using ord to get
# their code point, and then
# using bin to get its bits
# representation.
'0b00110000'

# Swift
> import Foundation
> var char: Character = "0"
> char.unicodeScalars.forEach { print(String(format: "%u", $0.value)) }
/* Code point of 0 in decimal. */
48
> char.unicodeScalars.forEach { print(String(format: "0x%X", $0.value)) }
/* Code point of 0 in hex. */
0x30
> char.utf8.forEach { print(String($0, radix: 2)) }
/* utf-8 encoding of 0. */
110000

# lisp
> (char-code #\0)        #| Get the character code |#
                         #| in decimal. |#
48
> (setq *print-base* 16) #| Set the print base to 16 |#
> (char-code #\0)        #| Get the character code in |#
                         #| hex |#
30
> (setq *print-base* 2)  #| Set the print base to binary. |#
> (char-code #\0)        #| For unicode code points between |#
                         #| 0 and 127, the character code |#
                         #| in binary is the encoding. |#
110000
2 bytes encoding
When using utf-8 two bytes encoding, the scheme used for encoding a code point over these two bytes is as follows:
110_00000 10_000000    0xC0_80
110_11111 10_111111    0xDF_BF
Only the last 5 bits of the first byte are used, and only the last 6 bits of the second byte are used. The code point's binary representation is split over these 11 bits.
What does this mean, you might ask? Well, the 6 bits in the second byte allow representing 2 ** 6 = 64 values, and the 5 bits in the first byte allow representing 2 ** 5 = 32 values.
Multiplying 32 by 64 yields 2048 possibilities that can be used for encoding.
Using utf-8 one byte encoding, the code points between U+0000 and U+007F were already encoded; as such, utf-8 two bytes encoding is used for the code points between U+0080 and U+07FF.
To give a concrete example of how the encoding is done, let us take the first Unicode code point that is encoded using utf-8 two bytes encoding, which is U+0080, and which represents a control character.
U+0080 has a binary representation of 10_000000. What is going to happen is that the last six bits, 000000, of its binary representation are placed in the second byte of the utf-8 two bytes encoding, and the remaining first two bits, 10, are placed in the first byte. This leads to U+0080, with its binary representation of 10_000000, being encoded as 110_00010 10_000000 in binary, or 0xC2_80 in hexadecimal.
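This placement can be reproduced by hand in Python; the helper below is a minimal sketch (encode_2_bytes is a made-up name for this illustration), splitting the code point as 5 + 6 bits:
>>> def encode_2_bytes(cp):
...     first  = 0b110_00000 | (cp >> 6)           # upper 5 bits
...     second = 0b10_000000 | (cp & 0b111111)     # lower 6 bits
...     return bytes([first, second])
...
>>> encode_2_bytes(0x80)
b'\xc2\x80'
>>> encode_2_bytes(0x80) == "\u0080".encode("utf-8")
True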
This being said, utf-8 two bytes encoding starts from 0xC2_80, since the first 128 code points, from 0 to 127, were already encoded using utf-8 one byte encoding. What this means is that the part of the utf-8 two bytes range between 0xC0_80 and 0xC2_7F inclusive does not represent any code point, and trying to decode it results in an error.
# Python
>>> encoding = b"\xC0\x00"
# C0_00 does not represent the
# encoding of any code point in
# utf-8 2 bytes encoding.
>>> encoding.decode("utf-8")
# Trying to decode it causes an error.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 0: invalid start byte
>>> encoding = bytes("\u0080", "utf-8")
# The encoding of U+0080 is:
>>> encoding
b'\xc2\x80'

# Swift
> import Foundation
> var char: Character = "\u{0080}"
> char.utf8.map { String(format: "%x", $0) }.joined(separator: " ")
/* Get the utf-8 encoding of the code point U+0080. */
c2 80
3 bytes encoding
The encoding scheme for utf-8 three bytes encoding is about dispersing the binary representation of the code point over three bytes.
1110_0000 10_000000 10_000000    0xE0_80_80
1110_1111 10_111111 10_111111    0xEF_BF_BF
For the first byte, the first four bits are reserved to have the value 1110, so as to differentiate between the utf-8 1, 2, 3, and 4 bytes encodings. What this means is that only the last 4 bits of the first byte of the utf-8 3 bytes encoding are used for encoding a code point, so you have 2 ** 4, which gives sixteen possibilities.
For the second and third bytes, only six bits are reserved for encoding, so what this leaves you with is 2 ** 6, which is 64 possibilities for each of the second and third bytes.
What this means is that the total number of possible encodings that can be performed using utf-8 three bytes encoding is 16 * 64 * 64, which is 65536. The first 2048 code points having already been encoded, they are not encoded again.
The range of code points encoded when using utf-8 three bytes encoding is between U+0800 and U+FFFF.
To give a concrete example, the Samaritan letter alaf is represented in Unicode by U+0800. Written over 16 bits, this code point has the binary representation 0000_100000_000000, grouped here as 4 + 6 + 6 bits.
The last six bits of the binary representation are stored in the third byte of the utf-8 three bytes encoding, the previous six bits are stored in the second byte, and the first four bits are stored in the first byte.
So what this leaves you with is that U+0800 is encoded as 1110_0000 10_100000 10_000000 in binary, which is 0xE0_A0_80 in hexadecimal.
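The same 4 + 6 + 6 split can be checked by hand in Python; as before, encode_3_bytes is just a made-up helper name for this sketch:
>>> def encode_3_bytes(cp):
...     first  = 0b1110_0000 | (cp >> 12)               # upper 4 bits
...     second = 0b10_000000 | ((cp >> 6) & 0b111111)   # middle 6 bits
...     third  = 0b10_000000 | (cp & 0b111111)          # lower 6 bits
...     return bytes([first, second, third])
...
>>> encode_3_bytes(0x0800)
b'\xe0\xa0\x80'
>>> encode_3_bytes(0x0800) == "\u0800".encode("utf-8")
True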
This being said, utf-8 three bytes values between 0xE0_00_80 and 0xE0_A0_7F inclusive are not used for encoding; they do not represent any Unicode code point, and as such they cannot be decoded into one.
# Python
>>> encoding = b'\xE0\x00\x80'
# utf-8 three bytes encoding which
# does not represent any valid code
# point.
>>> encoding.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 0: invalid continuation byte
>>> encoding = bytes("\u0800", "utf-8")
# The encoding of U+0800 using utf-8
# 3 bytes encoding is:
>>> encoding
b'\xe0\xa0\x80'

# Swift
> import Foundation
> var char: Character = "\u{0800}"
> char.utf8.map { String(format: "%x", $0) }.joined(separator: " ")
/* The utf-8 encoding of the code point U+0800 is: */
e0 a0 80
4 bytes encoding
As explained earlier, to encode a code point using utf-8, the code point's binary representation is dispersed among the bytes used for the encoding, in this case among four bytes.
11110_000 10_000000 10_000000 10_000000    0xF0_80_80_80
11110_111 10_111111 10_111111 10_111111    0xF7_BF_BF_BF
The last three bytes have only six bits reserved for encoding, so this gives 2 ** 6, which is 64 possibilities for each.
The first byte is the one used to differentiate between the various utf-8 1, 2, 3, and 4 byte encodings. The first five bits of this byte are reserved to have the value 11110, and as such only three bits are available for encoding, which gives 2 ** 3, which is 8 possibilities.
What does this mean, you might ask? Well, it means that the number of possible encodings that can be done using utf-8 four bytes encoding is 8 * 64 * 64 * 64, which is 2097152 values.
The first 65536 Unicode code points have already been encoded using the 1, 2, and 3 bytes utf-8 encodings, so what must be encoded here is only the Unicode code points between U+10000 and U+10FFFF.
U+10000 is the LINEAR B SYLLABLE B008 A. Written over 21 bits, it has the binary representation 000_010000_000000_000000, grouped here as 3 + 6 + 6 + 6 bits. To encode U+10000, the last six bits of its code point binary representation are placed in the fourth byte of the utf-8 four bytes encoding, the previous six bits are placed in the third byte, the six bits before those are placed in the second byte, and the first three bits are stored in the first byte.
What this gives is that U+10000 is encoded, using utf-8 four bytes encoding, as 11110_000 10_010000 10_000000 10_000000 in binary, which is 0xF0_90_80_80 in hexadecimal.
Unicode's last code point is U+10FFFF. Over 21 bits, it has the binary representation 100_001111_111111_111111, so its encoding in binary is 11110_100 10_001111 10_111111 10_111111, which is 0xF4_8F_BF_BF in hexadecimal.
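The same dispersal can be reproduced by hand in Python; encode_4_bytes below is a made-up helper name for this sketch, splitting the 21 bit code point as 3 + 6 + 6 + 6 bits:
>>> def encode_4_bytes(cp):
...     first  = 0b11110_000 | (cp >> 18)               # upper 3 bits
...     second = 0b10_000000 | ((cp >> 12) & 0b111111)
...     third  = 0b10_000000 | ((cp >> 6) & 0b111111)
...     fourth = 0b10_000000 | (cp & 0b111111)
...     return bytes([first, second, third, fourth])
...
>>> encode_4_bytes(0x10FFFF)
b'\xf4\x8f\xbf\xbf'
>>> encode_4_bytes(0x10FFFF) == "\U0010FFFF".encode("utf-8")
True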
This being said, when using utf-8 four bytes encoding, values between 0xF0_80_80_80 and 0xF0_90_80_7F inclusive are not used to encode any code point; additionally, values larger than 0xF4_8F_BF_BF are not used for encoding code points.
# Python
>>> encoding = b'\xF0\x80\x80\x80'
# Values between 0xF0_80_80_80
# and 0xF0_90_80_7F inclusively,
# and values larger than 0xF4_8F_BF_BF,
# are not used to encode any unicode
# code point.
>>> encoding.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 0: invalid continuation byte
>>> encoding = bytes("\U00010000", "utf-8")
>>> encoding
b'\xf0\x90\x80\x80'

# Swift
> import Foundation
> var char: Character = "\u{10000}"
> char.utf8.map { String(format: "%x", $0) }.joined(separator: " ")
/* U+10000 utf-8 encoding is: */
f0 90 80 80
How is it done in C?
The following is example source code which converts data encoded in utf-8 into code points, and code points into utf-8 data.
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdbool.h>

typedef uint8_t byte_t;
typedef uint32_t code_point_t;
typedef uint64_t length_t;
typedef uint8_t utf8_t;
typedef uint16_t utf16_t;
typedef uint32_t utf32_t;

length_t decode_utf8 (utf8_t *utf8, length_t bytes_count, code_point_t **code_points){
  *code_points = calloc (bytes_count, sizeof (code_point_t)); // free when done.
  length_t code_points_idx = 0;
  utf8_t byte = 0;
  utf8_t byte_1 = 0;
  utf8_t byte_2 = 0;
  utf8_t byte_3 = 0;
  utf8_t byte_4 = 0;
  bool isByte_1 = false;
  bool isByte_2 = false;
  bool isByte_3 = false;
  code_point_t tmp_code_point;
  for (length_t l = 0; l < bytes_count; l++){
    byte = utf8[l];
    if (!(byte >> 3 ^ 0b11110)){ // 11110_000, leading byte of a 4 bytes encoding
      isByte_1 = true;
      byte_1 = ((utf8_t) (byte << 5)) >> 5;}
    else if (!(byte >> 4 ^ 0b1110)){ // 1110_0000, leading byte of a 3 bytes encoding
      isByte_2 = true;
      byte_2 = ((utf8_t) (byte << 4)) >> 4;}
    else if (!(byte >> 5 ^ 0b110)){ // 110_00000, leading byte of a 2 bytes encoding
      isByte_3 = true;
      byte_3 = ((utf8_t) (byte << 3)) >> 3;}
    else if (!(byte >> 6 ^ 0b10)){ // 10_000000, continuation byte
      if (!isByte_1 && !isByte_2 && isByte_3){
        byte_4 = ((utf8_t) (byte << 2)) >> 2;
        tmp_code_point = (byte_3 << 6) + byte_4;
        (*code_points)[code_points_idx++] = tmp_code_point;
        tmp_code_point = 0;
        byte_3 = 0;
        byte_4 = 0;
        isByte_3 = false;}
      else if (!isByte_1 && isByte_2){
        if (!isByte_3){
          isByte_3 = true;
          byte_3 = ((utf8_t) (byte << 2)) >> 2;}
        else {
          byte_4 = ((utf8_t) (byte << 2)) >> 2;
          tmp_code_point = (byte_2 << 12) + (byte_3 << 6) + byte_4;
          (*code_points)[code_points_idx++] = tmp_code_point;
          tmp_code_point = 0;
          byte_2 = 0;
          byte_3 = 0;
          byte_4 = 0;
          isByte_2 = false;
          isByte_3 = false;}}
      else if (isByte_1){
        if (!isByte_2){
          isByte_2 = true;
          byte_2 = ((utf8_t) (byte << 2)) >> 2;}
        else if (!isByte_3){
          isByte_3 = true;
          byte_3 = ((utf8_t) (byte << 2)) >> 2;}
        else {
          byte_4 = ((utf8_t) (byte << 2)) >> 2;
          tmp_code_point = (byte_1 << 18) + (byte_2 << 12) + (byte_3 << 6) + byte_4;
          (*code_points)[code_points_idx++] = tmp_code_point;
          tmp_code_point = 0;
          byte_1 = 0;
          byte_2 = 0;
          byte_3 = 0;
          byte_4 = 0;
          isByte_1 = false;
          isByte_2 = false;
          isByte_3 = false;}}}
    else if (!(byte >> 7 ^ 0)){ // 0_0000000, 1 byte encoding
      (*code_points)[code_points_idx++] = byte;}}
  return code_points_idx;}

length_t encode_utf8 (code_point_t *code_points, length_t code_points_count, utf8_t **utf8){
  *utf8 = calloc (code_points_count * 4, sizeof (code_point_t)); // Free when done.
  code_point_t code_point = 0;
  length_t utf8_idx = 0;
  byte_t byte_11110_000 = (byte_t) 0b11110000u; // 4 bytes encoding leading byte
  byte_t byte_1110_0000 = (byte_t) 0b11100000u; // 3 bytes encoding leading byte
  byte_t byte_110_00000 = (byte_t) 0b11000000u; // 2 bytes encoding leading byte
  byte_t byte_10_000000 = (byte_t) 0b10000000u; // continuation byte
  for (length_t l = 0; l < code_points_count; l++){
    code_point = code_points[l];
    if (code_point >= 0x10000u && code_point <= 0x10FFFFu){ // encoding using 4 bytes
      (*utf8)[utf8_idx++] = byte_11110_000 + (code_point >> 18 << 29 >> 29); // 21 bits, keep the upper 3 bits [18..20], counting from 0
      (*utf8)[utf8_idx++] = byte_10_000000 + (code_point >> 12 << 26 >> 26); // kept bits [12..17]
      (*utf8)[utf8_idx++] = byte_10_000000 + (code_point >> 6 << 26 >> 26); // kept bits [6..11]
      (*utf8)[utf8_idx++] = byte_10_000000 + (code_point << 26 >> 26);} // kept bits [0..5]
    else if (code_point >= 0x0800u && code_point <= 0xFFFFu){ // 3 bytes encoding
      (*utf8)[utf8_idx++] = byte_1110_0000 + (code_point >> 12 << 28 >> 28); // 16 bits, keep the upper 4 bits [12..15]
      (*utf8)[utf8_idx++] = byte_10_000000 + (code_point >> 6 << 26 >> 26); // kept bits [6..11]
      (*utf8)[utf8_idx++] = byte_10_000000 + (code_point << 26 >> 26);} // kept bits [0..5]
    else if (code_point >= 0x0080u && code_point <= 0x07FFu){ // 2 bytes encoding
      (*utf8)[utf8_idx++] = byte_110_00000 + (code_point >> 6 << 27 >> 27); // 11 bits, kept bits [6..10]
      (*utf8)[utf8_idx++] = byte_10_000000 + (code_point << 26 >> 26);} // kept bits [0..5]
    else { // 1 byte encoding
      (*utf8)[utf8_idx++] = (utf8_t) (code_point << 25 >> 25);}} // keep only 7 bits
  return utf8_idx;}
utf-16
utf-16 is a variable length encoding: it uses two bytes to encode code points between U+0000 and U+FFFF inclusive, and it uses four bytes to encode code points between U+10000 and U+10FFFF inclusive.
2 bytes encoding
The range of code points encoded when using utf-16 two bytes encoding is between U+0000 and U+FFFF inclusive. To be more precise, code points between U+D800 and U+DFFF do not represent any character, because of the way utf-16 four bytes encoding is done.
The binary bits of the code points are directly used as the encoding, and the range of possible encoding values is:
0000 0000 0000 0000    0x00_00
1111 1111 1111 1111    0xFF_FF
To give an example, the code point U+0000 has a binary representation of all zeros, and as such it has an encoding of all zeros, which means 0000_0000 0000_0000 in binary, or 0x00_00 in hexadecimal.
As another example, the code point U+FFFF is the last code point encoded using utf-16 two bytes encoding. It has a binary representation of 1111_1111 1111_1111, and its encoding is the same: 1111_1111 1111_1111 in binary, or 0xFF_FF in hexadecimal.
# python
>>> encoding = bytes("\u0000", "utf-16")
>>> encoding
b'\xff\xfe\x00\x00'
# The first two bytes are called
# the BOM mark, and they specify
# the byte order, whether little
# or big endian. This will be
# explained in further detail
# in the BOM mark section.

# Swift
> import Foundation
> var char: Character = "\u{0000}"
> char.utf16.map { String(format: "%x", $0) }.joined(separator: " ")
/* U+0000 utf-16 encoding is: */
0
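To see that the bits really are used directly, the big endian variant (which adds no BOM, as discussed later) can be used; in Python, U+FFFF comes out as exactly its own two bytes:
>>> "\uffff".encode("utf-16-be")
b'\xff\xff'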
4 bytes encoding
The Unicode code point range between U+0000 and U+FFFF inclusive was already encoded using utf-16 two bytes encoding, so what remains to be encoded using utf-16 four bytes encoding is the range between U+10000 and U+10FFFF inclusive.
This being said, the range of bits for the encoding is:
110110_00 0000_0000 110111_00 0000_0000    0xD8_00_DC_00
110110_11 1111_1111 110111_11 1111_1111    0xDB_FF_DF_FF
The first and third bytes each have their first six bits assigned a constant value; as such, only two bits of each can be used for encoding, which gives four possibilities for each.
The reason these bits are reserved, as can be seen from the hexadecimal range, is to differentiate between utf-16 two and four bytes encoding. In Unicode, the range between U+D800 and U+DFFF inclusive does not represent any character.
The second and fourth bytes do not have any reserved bits. This gives 2 ** 8, which means 256 possibilities for each, that can be used for encoding.
In other words, the total number of possible encoding values when using utf-16 four bytes encoding is 4 * 4 * 256 * 256, which is 1048576.
This being said, the question to ask is how to disperse the code point bits into the encoding bits. The scheme is quite simple: subtract the hexadecimal value 0x1_00_00 from the code point. The resulting value is twenty bits long. The upper 10 bits of the result are dispersed among the upper two bytes of the utf-16 four bytes encoding, and they are called the high surrogate, whereas the lower 10 bits are dispersed among the lower two bytes, and they are called the low surrogate.
Too difficult!!
So simple! Let us give an example. The old italic letter BE, 𐌁, has a Unicode code point of U+10301, which means the code point is 0x1_03_01. To encode this code point, we must first subtract 0x1_00_00 from it. The result of this subtraction is 0x0_03_01, which in binary is 0000_0000_00 11_0000_0001.
The higher ten bits are 0000_0000_00, so they are placed in the upper two bytes, and the lower ten bits are 11_0000_0001, so they are placed in the lower two bytes, which gives an encoding of 110110_00 0000_0000 110111_11 0000_0001 in binary, or 0xD8_00_DF_01 in hexadecimal.
# python
>>> encoding = bytes("\U00010301", "utf-16")
>>> encoding
# The first two bytes are the
# BOM mark, reserved bytes used
# to detect whether the encoding
# is little or big endian; they
# will be explained in more detail
# in their own section.
# In all cases, note that here the
# encoding is a little endian one,
# and hence every two bytes are
# actually reversed.
b'\xff\xfe\x00\xd8\x01\xdf'

# Swift
> import Foundation
> var char: Character = "\u{10301}"
> char.utf16.map { String(format: "%x", $0) }.joined(separator: " ")
/* U+10301 utf-16 encoding is: */
d800 df01
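The surrogate arithmetic described above can also be reproduced by hand in Python; surrogate_pair is a made-up helper name for this sketch:
>>> def surrogate_pair(cp):
...     cp -= 0x1_00_00                    # leaves a 20 bit value
...     high = 0xD800 | (cp >> 10)         # upper 10 bits, high surrogate
...     low  = 0xDC00 | (cp & 0x3FF)       # lower 10 bits, low surrogate
...     return hex(high), hex(low)
...
>>> surrogate_pair(0x10301)
('0xd800', '0xdf01')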
How is it done in C?
This code shows how to encode Unicode code points into utf-16, with a machine dependent endianness; additionally, it shows how to decode big endian utf-16 data into code points.
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdbool.h>

typedef uint8_t byte_t;
typedef uint32_t code_point_t;
typedef uint64_t length_t;
typedef uint8_t utf8_t;
typedef uint16_t utf16_t;
typedef uint32_t utf32_t;

length_t decode_utf16 /* Big endian utf-16 decoder. */
(utf16_t *utf16, length_t utf16_count, code_point_t **code_points){
  *code_points = calloc (utf16_count, sizeof (code_point_t));
  length_t code_points_idx = 0;
  utf16_t data = 0;
  uint32_t utf16_cst = 0x10000;
  uint16_t upper_encoding = 0;
  uint16_t lower_encoding = 0;
  code_point_t tmp_code_point;
  for (length_t l = 0; l < utf16_count; l++){
    data = utf16[l];
    if (!(data >> 10 ^ 0b110110)){ // high surrogate check, big endian
      upper_encoding = ((utf16_t) (data << 6)) >> 6;}
    else if (!(data >> 10 ^ 0b110111)){ // low surrogate check, big endian
      lower_encoding = ((utf16_t) (data << 6)) >> 6;
      tmp_code_point = utf16_cst + (upper_encoding << 10) + lower_encoding;
      (*code_points)[code_points_idx++] = tmp_code_point;
      tmp_code_point = 0;
      upper_encoding = 0;
      lower_encoding = 0;}
    else { // the encoding is the code point itself
      (*code_points)[code_points_idx++] = data;}}
  return code_points_idx;}

length_t encode_utf16 (code_point_t *code_points, length_t code_points_count, utf16_t **utf16){
  /* Encode utf-16, in the machine endianness. */
  *utf16 = calloc (code_points_count * 2, sizeof (code_point_t));
  code_point_t code_point = 0;
  length_t utf16_idx = 0;
  uint8_t byte_110110_00 = (uint8_t) 0b11011000u; // 4 bytes utf-16, 110110_00 in the first 16 bits
  uint8_t byte_110111_00 = (uint8_t) 0b11011100u; // 4 bytes utf-16, 110111_00 in the second 16 bits
  utf16_t tmp_utf16 = 0;
  for (length_t l = 0; l < code_points_count; l++){
    code_point = code_points[l];
    if (code_point >= 0x10000u && code_point <= 0x10FFFFu){ // encoding using 4 bytes
      code_point = code_point - 0x10000u; // 20 bits
      tmp_utf16 = 0;
      tmp_utf16 = byte_110110_00 + (code_point >> 18 << 30 >> 30); // first 2 bits
      tmp_utf16 = tmp_utf16 << 8;
      tmp_utf16 = tmp_utf16 + (code_point >> 10 << 24 >> 24); // next 8 bits
      (*utf16)[utf16_idx++] = tmp_utf16; // first 2 bytes
      tmp_utf16 = 0; // next 2 bytes
      tmp_utf16 = byte_110111_00 + (code_point >> 8 << 30 >> 30);
      tmp_utf16 = tmp_utf16 << 8;
      tmp_utf16 = tmp_utf16 + (code_point << 24 >> 24);
      (*utf16)[utf16_idx++] = tmp_utf16;}
    else { // 2 bytes encoding
      (*utf16)[utf16_idx++] = code_point << 16 >> 16;}}
  return utf16_idx;}
utf-32
4 bytes encoding
When using utf-32, the encoding is always done using four bytes; hence utf-32 is a fixed length encoding. This being said, utf-32 has an encoding range of:
0000_0000 0000_0000 0000_0000 0000_0000    0x00_00_00_00
1111_1111 1111_1111 1111_1111 1111_1111    0xFF_FF_FF_FF
As such, and as can be seen from the encoding range, all Unicode code points can be encoded directly using their binary representation. So the range of code points that can be encoded is between U+0000 and U+10FFFF, inclusive.
To give an example, the Unicode code point U+AB11 has a binary representation of 0000 0000 0000 0000 1010 1011 0001 0001, and this is also its encoding in binary; its encoding in hex is 0x00_00_AB_11.
# python
>>> encoding = bytes("\uAB11", "utf-32")
# Encoding of U+AB11 using utf-32.
>>> encoding
b'\xff\xfe\x00\x00\x11\xab\x00\x00'
# The first four bytes are the
# BOM mark, and are used to tell
# whether the encoding is little or
# big endian. This will be discussed
# in the next sections.
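The BOM-free big endian variant shows the direct representation more plainly; with utf-32-be, U+AB11 is just its code point padded to four bytes:
>>> bytes("\uAB11", "utf-32-be")
b'\x00\x00\xab\x11'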
How is it done in C?
This code shows how to encode and decode Unicode code points to and from utf-32, in a machine endianness dependent manner.
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdbool.h>

typedef uint8_t byte_t;
typedef uint32_t code_point_t;
typedef uint64_t length_t;
typedef uint8_t utf8_t;
typedef uint16_t utf16_t;
typedef uint32_t utf32_t;

length_t decode_utf32 (utf32_t *utf32, length_t utf32_count, code_point_t **code_points){
  /* Decoding is based on the machine endianness. */
  *code_points = calloc (utf32_count, sizeof (code_point_t));
  length_t code_points_idx = 0;
  for (length_t l = 0; l < utf32_count; l++){
    (*code_points)[code_points_idx++] = utf32[l];}
  return code_points_idx;}

length_t encode_utf32 (code_point_t *code_points, length_t code_points_count, utf32_t **utf32){
  /* The encoding endianness is machine dependent. */
  *utf32 = calloc (code_points_count * 1, sizeof (code_point_t));
  length_t utf32_idx = 0;
  for (length_t l = 0; l < code_points_count; l++){
    (*utf32)[utf32_idx++] = code_points[l];}
  return utf32_idx;}
BOM, little endian, big endian
What are big endian and little endian?
Big endian and little endian describe how multi-byte data is ordered in a computer. In big endian, the bytes are ordered from the most significant to the least significant. In little endian, the bytes are ordered from the least significant to the most significant. So in big endian, the most significant byte is stored to the left, and in little endian, the most significant byte is stored to the right.
For example:
###### data is 0x03_05
# most significant is 0x03
# least significant is 0x05
# Big endian ordering is 0x03_05
# most significant byte 0x03 is stored to the left, or first
# little endian ordering is 0x05_03
# most significant byte 0x03 is stored to the right, or last

###### data is 0x55
# most significant is 0x55
# Big endian is 0x55
# little endian is 0x55
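The same ordering can be observed with Python's struct module, where '>' asks for big endian and '<' for little endian:
>>> import struct
>>> struct.pack('>H', 0x0305)   # big endian: most significant byte first
b'\x03\x05'
>>> struct.pack('<H', 0x0305)   # little endian: least significant byte first
b'\x05\x03'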
utf-16 and utf-32 encodings are multi-byte, since they use more than one byte to encode a character. As such, when a utf-16 or a utf-32 encoding is stored, the byte ordering must be specified: is it big endian, or little endian?
utf-16
Little endian encoding is specified by using utf-16LE.
For instance, in this example:
>>> bytes('ˆ', 'utf-16LE')
b'\xc6\x02'
We have specified that we want to encode the circumflex accent ˆ using utf-16, and that the byte ordering is little endian. As such, the most significant byte is stored last.
Big endian encoding is specified by using utf-16BE. For instance, in this example,
>>> bytes('ˆ', 'utf-16BE')
b'\x02\xc6'
the circumflex accent ˆ is encoded using utf-16. We have specified that the byte ordering is to be big endian, hence the most significant byte is stored first.
If we don't specify the byte ordering, as in by just using utf-16, unicode recommends that the default ordering be big endian, but python uses little endian.
>>> bytes('ˆ', 'utf-16')
b'\xff\xfe\xc6\x02'
As can be seen, the byte ordering is little endian, and you can also notice that two bytes are added before the encoding of the circumflex accent character. These two bytes are actually the inverse of the byte order mark.
The byte order mark is 0xFE_FF, and it allows us to detect the byte order when using utf-16. If it is encountered as 0xFE_FF, the byte ordering is big endian, and if it is encountered as 0xFF_FE, the byte ordering is little endian. If it is not encountered, then unicode recommends assuming that the encoding is big endian when using utf-16.
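This detection logic can be sketched in Python using the BOM constants from the codecs module (utf16_byte_order is a made-up helper name):
>>> import codecs
>>> def utf16_byte_order(data):
...     if data.startswith(codecs.BOM_UTF16_BE):   # b'\xfe\xff'
...         return "big endian"
...     if data.startswith(codecs.BOM_UTF16_LE):   # b'\xff\xfe'
...         return "little endian"
...     return "no BOM: unicode recommends assuming big endian"
...
>>> utf16_byte_order(bytes('ˆ', 'utf-16'))
'little endian'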
So when using:
- utf-16LE: the encoding and decoding is little endian, and no BOM mark is added. If a BOM mark is encountered when decoding, it is decoded into a code point.
- utf-16BE: the encoding and decoding is big endian, and no BOM mark is added. If a BOM mark is encountered when decoding, it is decoded into a code point.
- utf-16: when encoding, a BOM mark is added. Unicode recommends the default encoding to be big endian; java uses big endian, and python uses little endian, as the default. This being said, java will add the BOM mark 0xFE_FF and do a big endian encoding, and python will add the BOM mark 0xFF_FE and do a little endian encoding. When decoding using utf-16, if the BOM mark is encountered in the first two bytes, decoding is done based on this BOM mark. If it is not encountered, unicode recommends the decoding to be big endian; python will use little endian, and java will use big endian. BOM marks encountered after the first two bytes are decoded into their corresponding code point.
# encoding using utf-16BE
>>> encoding = bytes('\u00F8', 'utf-16BE')
# encoding is specified to
# be big endian.
# when encoding using
# utf-16BE or utf-16LE,
# no bom mark is added.
>>> encoding
b'\x00\xf8'
>>> encoding.decode('utf-16')
'\uf800'
# When decoding using utf-16,
# if no bom mark is found,
# unicode recommends that
# decoding should be done as
# if the encoding was done
# using big endian. But python
# decodes as if the encoding
# was done using little
# endian. The resulting character
# is different from the
# originally encoded one.
>>> encoding.decode('utf-16BE')
'ø'
# utf-16BE will always decode using
# big endian, ø has a code point
# of U+00F8.
>>> encoding.decode('utf-16LE')
'\uf800'
# utf-16LE will always decode using
# little endian, we ended up with a
# character different from the
# originally encoded one.

# encoding using utf-16
>>> encoding = bytes('\u00F8', 'utf-16')
# when encoding using
# utf-16, a bom mark is
# always added.
>>> encoding
b'\xff\xfe\xf8\x00'
# python utf-16 default encoding
# is little endian, hence 0xFF_FE
# is added.
# unicode recommends the default
# encoding to be big endian.
>>> encoding.decode('utf-16')
'ø'
# when decoding using utf-16,
# if the first two bytes are the
# BOM mark, they are used to decide
# whether the decoding is to be
# big or little endian.
# if the bom mark is not encountered,
# unicode recommends that the default
# decoding be big endian;
# python uses little endian, java uses
# big endian.
>>> encoding.decode('utf-16LE')
'\ufeffø'
# utf-16LE will decode the bom mark
# as the unicode code point U+FEFF
>>> encoding.decode('utf-16BE')
'\ufffe\uf800'
# utf-16BE will decode the bom mark
# as the unicode code point U+FFFE
utf-32
When using utf-32, unicode recommends the ordering to be big endian. Python uses little endian, and Java uses big endian.
The BOM, or byte order mark, used by utf-32 is the same as the one used in utf-16, but it is now stored as the four bytes 0x00_00_FE_FF when using big endian, or as the four bytes 0xFF_FE_00_00 when using little endian.
When encoding using utf-32, the BOM mark is added. When decoding using utf-32, if the byte order mark 0x00_00_FE_FF is encountered in the first four bytes, the byte order is big endian; if its reverse, 0xFF_FE_00_00, is encountered in the first four bytes, the byte order is little endian. When no BOM mark is encountered, unicode recommends the byte order to be assumed big endian; python assumes it to be little endian, and java assumes it to be big endian. If a BOM is encountered after the first four bytes, it is decoded into its code point.
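These two four byte marks are also exposed as constants in Python's codecs module, which can help when checking a stream by hand:
>>> import codecs
>>> codecs.BOM_UTF32_BE
b'\x00\x00\xfe\xff'
>>> codecs.BOM_UTF32_LE
b'\xff\xfe\x00\x00'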
When encoding using utf-32LE, the encoding is little endian, and no BOM mark is added to the encoding. When decoding using utf-32LE, if a BOM mark is encountered, it is decoded into a code point.
When encoding using utf-32BE, the encoding is big endian, and no BOM mark is added to the encoding. When decoding using utf-32BE, if a BOM mark is encountered, it is decoded into a code point.
# encoding using utf-32BE
>>> encoding = bytes('\uab11', 'utf-32BE')
>>> encoding
b'\x00\x00\xab\x11'
# U+ab11 is encoded using utf-32BE,
# no bom is added.
>>> encoding.decode('utf-32')
UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
# When using python, utf-32 will
# decode using utf-32LE
# when no bom mark is encountered.
# Unicode actually recommends decoding
# using utf-32BE.
# The encoding and the decoding are done
# using different endianness, as such
# in this case, an error was thrown.
>>> encoding.decode('utf-32LE')
UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
# utf-32LE will try to decode using little
# endian, so it will try to decode
# 0x11_ab_00_00,
# which is not in the range of valid
# code points.

# encoding using utf-32LE
>>> encoding = bytes('\uab11', 'utf-32LE')
>>> encoding
b'\x11\xab\x00\x00'
# U+ab11 is encoded using utf-32LE,
# no bom is added.
>>> encoding.decode('utf-32')
'ꬑ'
# In the case of python, when using
# utf-32, and since no bom mark
# is encountered, the decoding is done
# as if the encoding was little endian.
# In this case, the decoding succeeds,
# since the encoding was originally
# little endian.
# Unicode recommends that, when no
# bom mark is encountered,
# the decoding be big endian.
>>> encoding.decode('utf-32BE')
UnicodeDecodeError: 'utf-32-be' codec can't decode bytes in position 0-3: code point not in range(0x110000)
# When trying to decode using big endian:
# the original encoding was little endian,
# so 0x11_ab_00_00 is read from left
# to right; this value is not in the
# range of valid code points, hence
# an error is raised.

# encoding using utf-32
>>> encoding = bytes('\uab11', 'utf-32')
>>> encoding
b'\xff\xfe\x00\x00\x11\xab\x00\x00'
# U+ab11 is encoded using utf-32;
# for python this means encoding using
# utf-32LE, and a bom mark is added.
# Unicode recommends that the encoding
# be big endian in this case.
>>> encoding.decode('utf-32')
'ꬑ'
# utf-32 will check the first 4 bytes
# for the bom mark; if it is found,
# decoding is done based on the bom
# mark.
# if it is not found, in the case of
# python, decoding is done as if the
# encoding was little endian; unicode
# recommends it to be big endian.
>>> encoding.decode('utf-32LE')
'\ufeffꬑ'
# utf-32LE will decode using little endian;
# the bom mark will be decoded into its
# code point.
>>> encoding.decode('utf-32BE')
UnicodeDecodeError: 'utf-32-be' codec can't decode bytes in position 0-3: code point not in range(0x110000)
# when trying to decode using big endian:
# the encoding was originally done using
# little endian, so the data is read in
# the inverse direction; in this case
# this has caused an error.