Unicode encoding: utf-8, utf-16, utf-32

A character in unicode is represented by a unicode code point. For example, the character a has a unicode code point of U+0061. Code points range from 0x0 to 0x10_FF_FF in hexadecimal, or from 0 to 1_114_111 in decimal. As such, a unicode code point can be represented by 21 bits.
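These numbers can be checked directly in Python, where ord gives a character's code point and chr goes the other way:

```python
# ord returns the code point of a character, chr does the reverse
print(ord('a'))                       # 97, i.e. 0x61
print(hex(ord('a')))                  # '0x61'
print(chr(0x10FFFF) == '\U0010FFFF')  # True: the largest code point
# 0x10_FF_FF is 1_114_111 in decimal and fits in 21 bits
print(0x10FFFF)                       # 1114111
print((0x10FFFF).bit_length())        # 21
```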

21 bits can hold a minimum value of 0 and a maximum value of 2_097_151 in decimal. The reason why the maximum unicode code point is 0x10_FF_FF, and not 2_097_151, is because of how a unicode character is encoded.

Encoding is about storing a unicode code point in the computer. For example, the character a has a unicode code point of U+0061, which is represented in binary as 0b0110_0001. The encoding of this code point, that is, the byte sequence actually stored, can be the same as the code point's binary representation, or it can be a different binary representation.
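For instance, Python shows that the same code point can be stored as different byte sequences depending on the encoding chosen:

```python
# the code point of 'a' is U+0061; utf-8 stores it as the single
# byte 0x61 (the same as the code point), while utf-16BE stores
# the same code point as the two bytes 0x00 0x61
print('a'.encode('utf-8'))     # b'a', i.e. the byte 0x61
print('a'.encode('utf-16BE'))  # b'\x00a', i.e. 0x00 0x61
```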

There are three encodings that can be used with unicode, they are

  • utf-8
  • utf-16
  • utf-32

utf-8

utf-8 encoding can encode (store, or represent in binary) a unicode code point using either:

  • 1 byte
  • 2 bytes
  • 3 bytes
  • 4 bytes

1 byte encoding

A utf-8 1-byte encoding uses 8 bits to encode a unicode code point. The first bit is always 0, and the remaining 7 bits can be either 0 or 1. As such, in 1-byte utf-8 encoding, the binary values that we can have are between

0000 0000
0111 1111

so we can use these values to represent code points whose binary representation is between 0000 0000 and 0111 1111. In other words, we can represent unicode code points between 0 and 127 in decimal, or between 0x0 and 0x7F in hexadecimal.

# the character 0 has a unicode code point of U+0030
# the code point number is 0x0030 written in hexadecimal
# its decimal value is 48, which is 0011_0000 in binary
# 0011_0000 can be represented using utf-8 1-byte encoding, since
# utf-8 1-byte encoding can hold binary values between
# 0000_0000 and 0111_1111
 
 
>>> ord('0')
48
# unicode code point in decimal of 0 is 48
 
>>> hex(ord('0'))
'0x30'
# unicode code point in hexadecimal of 0 is 0x30
 
>>> bin(48)
'0b110000'
# the decimal number 48 in binary

Hence when encoding using utf-8 1-byte, the unicode code point and its encoding are the same.
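We can verify this with Python's encoder: for code points up to U+007F the utf-8 byte is numerically equal to the code point.

```python
# the utf-8 encoding of '0' is the single byte 0x30, which is
# exactly its code point U+0030
print('0'.encode('utf-8'))        # b'0'
print('0'.encode('utf-8')[0])     # 48, the same number as ord('0')
# U+007F is the last code point that fits in utf-8 1-byte encoding
print(chr(0x7F).encode('utf-8'))  # b'\x7f'
```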

2 bytes encoding

When encoding a unicode code point using two bytes, the first byte must start with 110, and the second byte must start with 10. As such, the binary values that can be represented by the two bytes are in the range of

1100_0000 1000_0000
1101_1111 1011_1111

or 0xC0_80 and 0xDF_BF in hexadecimal.

The second byte can only use its last 6 bits, since its first 2 bits are always 10, hence it can have 2**6 = 64 values.

The first byte can only have 32 values, since its first three bits cannot change, and the remaining 5 bits can have 2**5 = 32 values.

So when using utf-8 2-byte encoding, we can only encode 32 * 64 = 2048 unicode code points, covering the range U+0000 till U+07FF. The first 128 code points are already encoded using utf-8 1-byte, so we don't re-encode them.

The first unicode code point that is actually encoded with two bytes is U+0080. This code point represents a control character, and it comes just after the last code point that is encoded using utf-8 1-byte. U+07FF is the last unicode code point that can be encoded using utf-8 2-bytes.

utf-8 2-byte encoding therefore starts from 0xC2_80, since the first 128 values, from 0 till 127, are already encoded using utf-8 1-byte. As such, encodings in the range 0xC0_80 - 0xC2_7F, which is smaller than 0xC2_80, don't represent any code point.

To give some examples, U+0080 is a control character. In binary it is written as 1000_0000, and in decimal it is 128. It cannot be encoded using utf-8 1-byte encoding, since the maximum value that this encoding can represent is 0111_1111, which is 127 in decimal. As such, we must use utf-8 2-byte encoding. utf-8 2-byte encoding starts at 1100_0010 1000_0000, and the second byte can store this value, so U+0080 is encoded into 1100_0010 1000_0000 in binary, or 0xC2_80 in hexadecimal.

The inverted question mark character ¿ has a unicode code point of U+00BF, which is 191 in decimal. The code point is represented in binary by 1011_1111, so it can be stored in utf-8 2-byte encoding as 1100_0010 1011_1111, and its encoding in hexadecimal is 0xC2_BF.

The latin capital letter A with grave, À, has a unicode code point of U+00C0, which is 192 in decimal. The code point's binary representation is 1100_0000. It cannot be stored in the second byte of the utf-8 2-byte encoding, since the maximum value that the second byte can store is 1011_1111. As such, to store this code point using utf-8 2-byte encoding, we increment the first byte, and the encoding of À is 1100_0011 1000_0000, which is 0xC3_80 in hexadecimal.
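The three worked examples above can be checked with Python's utf-8 encoder:

```python
# U+0080, the first 2-byte code point
print('\u0080'.encode('utf-8'))  # b'\xc2\x80'
# U+00BF, the inverted question mark
print('¿'.encode('utf-8'))       # b'\xc2\xbf'
# U+00C0, latin capital letter A with grave
print('À'.encode('utf-8'))       # b'\xc3\x80'
```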

If we try to decode bytes in the range 0xC0_80 - 0xC2_7F, they don't represent any encoding, and as such they will not be decoded into a code point.

>>> _bytes = b'\xC0\x00'
# a bytes object containing the 2 bytes 0xC0 and 0x00

>>> _bytes.decode('utf-8')
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 0: invalid start byte


>>> _bytes = b'\xC2\xBF'
# a bytes object containing the 2 bytes 0xC2 and 0xBF

>>> _bytes.decode('utf-8')
'¿'
# decode the _bytes object using utf-8; the result is the ¿ character

3 bytes encoding

When encoding a code point using three bytes, the first byte must start with 1110, and the second and third bytes must start with 10. As such, the binary values that can be represented by the three bytes are in the range of

1110_0000 1000_0000 1000_0000 
1110_1111 1011_1111 1011_1111

or 0xE0_80_80 and 0xEF_BF_BF in hexadecimal.

In the first byte we are only using 4 bits, since the first 4 bits are always 1110. Four bits can have values between a min of 0 and a max of 15, so they can represent 16 values.

The second and third bytes can only use 6 bits each, so they can have values between a min of 0 and a max of 63, representing 64 values each.

Hence the total number of values that we can represent using utf-8 3-byte encoding is 16 * 64 * 64 = 65536 values.

The first 2048 code points are already encoded using utf-8 1-byte and 2-bytes, so we only encode code points of 2048 and above, hence we start encoding from 1110_0000 1010_0000 1000_0000 in binary, which is 0xE0_A0_80 in hexadecimal.

The first code point encoded using utf-8 3-bytes is U+0800, which comes just after the last code point that can be encoded using utf-8 2-bytes. The last code point that can be encoded using utf-8 3-bytes is U+FFFF.

The Samaritan letter alaf is represented in unicode by U+0800. U+0800 is represented in binary by 0000_1000 0000_0000. The encoding of this code point using utf-8 3-bytes is 1110_0000 1010_0000 1000_0000 in binary, which is 0xE0_A0_80 in hexadecimal.
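Python confirms both boundaries of the 3-byte range:

```python
# U+0800, samaritan letter alaf, the first 3-byte code point
print(chr(0x0800).encode('utf-8'))  # b'\xe0\xa0\x80'
# U+FFFF, the last code point that fits in utf-8 3-byte encoding
print(chr(0xFFFF).encode('utf-8'))  # b'\xef\xbf\xbf'
```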

utf-8 3-byte values between 0xE0_80_80 and 0xE0_A0_7F are not used; they don't represent any unicode code point, and as such they cannot be decoded into a code point.

>>> _bytes = b'\xE0\x00\x80'
>>> _bytes.decode('utf-8')
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 0: invalid continuation byte

>>> _bytes = b'\xE0\xA0\x7F'
>>> _bytes.decode('utf-8')
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
# UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte

>>> _bytes = b'\xE0\xAF\x80'
>>> _bytes.decode('utf-8')
'ீ'

4 bytes encoding

When encoding a code point using four bytes, the first byte must start with 11110, and the second, third, and fourth bytes must start with 10. As such, the binary values that can be represented by the four bytes are in the range of

1111_0000 1000_0000 1000_0000 1000_0000
1111_0111 1011_1111 1011_1111 1011_1111

or 0xF0_80_80_80 and 0xF7_BF_BF_BF in hexadecimal.

In the first byte we are only using 3 bits, since the first 5 bits are always 11110. Three bits can have values between a min of 0 and a max of 7, so they can represent 8 values.

The second, third, and fourth bytes can only use 6 bits each, so they can have values between a min of 0 and a max of 63, representing 64 values each.

Hence the total number of values that we can represent using utf-8 4-byte encoding is 8 * 64 * 64 * 64 = 2097152 values, so values from 0 till 2097151.

utf-8 4-bytes encodes the code points between U+10000, the first value not encoded by utf-8 3-bytes, and U+10FFFF, the max unicode code point.

The first 65536 code points are already encoded using utf-8 1, 2, and 3 bytes, so we only encode code points of 65536 and above, hence we start encoding from 1111_0000 1001_0000 1000_0000 1000_0000 in binary, which is 0xF0_90_80_80 in hexadecimal.

The first code point that is encoded using utf-8 4-bytes is U+10000 which is just after the last code point that can be encoded using utf-8 3-bytes. The last code point that can be encoded using utf-8 4 bytes is U+10FFFF .

The linear B syllable B008 A letter is represented in unicode by U+10000. U+10000 is represented in binary by 1_0000_0000_0000_0000. The encoding of U+10000 using utf-8 4-bytes is 1111_0000 1001_0000 1000_0000 1000_0000 in binary, or 0xF0_90_80_80 in hexadecimal.

The unassigned code point U+10FFFF is the last unicode code point and is represented in binary by 1_0000_1111_1111_1111_1111. Its encoding using utf-8 4-bytes is 1111_0100 1000_1111 1011_1111 1011_1111 in binary, or 0xF4_8F_BF_BF in hexadecimal.
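Python confirms both boundaries of the 4-byte range:

```python
# U+10000, the first 4-byte code point
print(chr(0x10000).encode('utf-8'))   # b'\xf0\x90\x80\x80'
# U+10FFFF, the last unicode code point
print(chr(0x10FFFF).encode('utf-8'))  # b'\xf4\x8f\xbf\xbf'
```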

utf-8 4-byte values between 0xF0_80_80_80 and 0xF0_90_80_7F, and values larger than 0xF4_8F_BF_BF, are not used; they don't represent any unicode code point, and as such they cannot be decoded into a code point.

>>> _bytes = b'\xF0\x80\x80\x80'
>>> _bytes.decode('utf-8')
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 0: invalid continuation byte

>>> _bytes = b'\xF0\x90\xA0\x79'
>>> _bytes.decode('utf-8')
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
# UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-2: invalid continuation byte

>>> _bytes = b'\xf4\x8f\xbf\xCF'
>>> _bytes.decode('utf-8')
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
# UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-2: invalid continuation byte

utf-16

utf-16 encoding can encode a unicode code point using either:

  • 2 bytes
  • 4 bytes

2 bytes encoding

A utf-16 2-byte encoding uses 16 bits to encode a unicode code point, so the binary values that can be represented using this encoding are between

0000 0000 0000 0000
1111 1111 1111 1111

or 0x00_00 and 0xFF_FF in hexadecimal. As such, using utf-16 2-bytes we can encode the code points between U+0000 and U+FFFF.

The code point U+0000 represents the NULL control character; it has a binary representation of 0000_0000 0000_0000, and it is encoded using utf-16 2-bytes into 0000_0000 0000_0000 in binary, or 0x00_00 in hexadecimal.

The last such code point, U+FFFF, is not assigned to any character; it has a binary representation of 1111_1111 1111_1111, and it is encoded using utf-16 2-bytes into 1111_1111 1111_1111 in binary, or 0xFF_FF in hexadecimal.
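Using an explicit byte order (utf-16BE here, so no bom mark is added, as covered in the BOM section below) we can see the 2-byte encodings directly:

```python
# utf-16BE writes each code point as exactly 2 bytes, most
# significant byte first, with no bom mark
print('\u0000'.encode('utf-16BE'))  # b'\x00\x00'
print('a'.encode('utf-16BE'))       # b'\x00a', i.e. 0x00 0x61
print('\uffff'.encode('utf-16BE'))  # b'\xff\xff'
```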

4 bytes encoding

When encoding a code point using four bytes, the first two bytes (the high surrogate) must start with 1101_10, and the last two bytes (the low surrogate) must start with 1101_11. As such, the binary values that can be represented by the four bytes are in the range of

1101_1000 0000_0000 1101_1100 0000_0000
1101_1011 1111_1111 1101_1111 1111_1111

or 0xD8_00_DC_00 and 0xDB_FF_DF_FF in hexadecimal.

The code points that are encoded using utf-16 4-bytes are in the range U+10000 to U+10FFFF.

To encode a code point, we first subtract the hexadecimal value 0x1_00_00 from it. The result of the subtraction fits in 20 bits, because a unicode code point has a max value of 0x10_FF_FF, and 0x10_FF_FF - 0x1_00_00 = 0xF_FF_FF cannot be larger than 20 bits. The higher 10 bits are added to the first 2 bytes, 0xD8_00, and they are called the high surrogate; the lower 10 bits are added to the last 2 bytes, 0xDC_00, and they are called the low surrogate.

To give an example, the old italic letter BE, 𐌁, has a unicode code point of U+10301, so its code point number is 0x1_03_01. To encode this code point we must first subtract 0x01_00_00 from it. The result of this subtraction is 0x0_03_01, which in binary is 0000_0000_0011_0000_0001.
  • The higher 10 bits are 0000_0000_00, which is 0x000 in hexadecimal. They are added to 0xD8_00, and the result is 0xD8_00.
  • The lower 10 bits are 11_0000_0001, which is 0x301 in hexadecimal. They are added to 0xDC_00, and the result is 0xDF_01.

As such, the encoding of this code point using utf-16 4-bytes is 0xD8_00_DF_01.
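The surrogate computation above can be sketched in a few lines of Python; the variable names are ours, only str.encode is a standard API:

```python
# surrogate-pair computation for U+10301 (old italic letter BE)
cp = 0x10301
v = cp - 0x10000            # 0x00301, always fits in 20 bits
high = 0xD800 + (v >> 10)   # top 10 bits -> high surrogate
low = 0xDC00 + (v & 0x3FF)  # bottom 10 bits -> low surrogate
print(hex(high), hex(low))         # 0xd800 0xdf01
# python's utf-16BE encoder produces the same two surrogates
print(chr(cp).encode('utf-16BE'))  # b'\xd8\x00\xdf\x01'
```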

The unicode code points between U+D800 and U+DFFF do not represent any character, because of the way utf-16 encodes code points: these values are reserved for the surrogates.
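Because these values are reserved, Python refuses to encode a lone surrogate code point:

```python
# U+D800 is a surrogate, not a character, so encoding it fails
try:
    chr(0xD800).encode('utf-16BE')
except UnicodeEncodeError as e:
    print(e)  # surrogates not allowed
```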

utf-32

In utf-32 encoding we use 4 bytes to encode a code point, so the binary values that can be represented are between

0000_0000 0000_0000 0000_0000 0000_0000
1111_1111 1111_1111 1111_1111 1111_1111

or 0x00_00_00_00 and 0xFF_FF_FF_FF in hexadecimal. When encoding using utf-32, the encoding and the code point are the same.

To give an example, the unicode code point U+AB11 has a code point number of 0xAB_11 in hexadecimal. Since we are using utf-32, the encoding is the same as the code point, padded with zeros to 4 bytes, so its encoding is 0x00_00_AB_11.
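Checking with Python's utf-32 encoder (big endian, so the bytes read left to right):

```python
# the utf-32BE encoding of U+AB11 is the code point itself,
# zero-padded to 4 bytes
print('\uAB11'.encode('utf-32BE'))  # b'\x00\x00\xab\x11'
```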

BOM , little endian big endian

Big endian and little endian describe how multi-byte data is ordered in the computer. In big endian, the bytes are ordered from the most significant to the least significant. In little endian, the bytes are ordered from the least significant to the most significant. So in big endian the most significant byte is stored to the left, and in little endian the most significant byte is stored to the right. For example

###### data is 0x03_05
# most significant is 0x03
# least significant is 0x05

# Big Endian ordering is 
0x03_05
# most significant byte 0x03 is stored to the left or first 


# little endian ordering is 
0x05_03
# most significant byte 0x03 is stored to the right or last


###### data is 0x55
# most significant is 0x55

# Big endian is 
0x55

# little endian is 
0x55
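The same orderings can be produced with int.to_bytes, which takes the byte order explicitly:

```python
# ordering the 2-byte value 0x0305 both ways
print((0x0305).to_bytes(2, 'big'))     # b'\x03\x05'
print((0x0305).to_bytes(2, 'little'))  # b'\x05\x03'
# with a single byte there is nothing to reorder
print((0x55).to_bytes(1, 'big'))       # b'U', i.e. 0x55
```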

utf-16 and utf-32 are multi-byte encodings; they use more than one byte to encode a character. Hence when storing the encoding, the byte ordering must be specified, whether it is big endian or little endian.

utf-16

Little endian encoding is specified by using utf-16LE. For instance, in this example

>>> bytes('ˆ','utf-16LE')
b'\xc6\x02'

we have specified that we want to encode the circumflex accent ˆ using utf-16, and that the byte ordering is little endian. As such, the most significant byte is stored last.

Big endian encoding is specified by using utf-16BE. For instance, in this example

>>> bytes('ˆ','utf-16BE')
b'\x02\xc6'

the circumflex accent ˆ is encoded using utf-16. We have specified that the byte ordering is big endian, hence the most significant byte is stored first.

If we don’t specify the byte ordering, by using plain utf-16, unicode recommends the default ordering to be big endian, but python uses little endian.

>>> bytes('ˆ','utf-16')
b'\xff\xfe\xc6\x02'

As we can see, the byte ordering is little endian, and two bytes are added before the encoding of the circumflex accent character. These two bytes are the byte order mark, itself stored in little endian order.

The byte order mark is 0xFE_FF, and it allows us to detect the byte order when using utf-16. If it is encountered as 0xFE_FF, the byte ordering is big endian; if it is encountered as 0xFF_FE, the byte ordering is little endian. If it is not encountered at all, unicode recommends assuming that the encoding is big endian when using utf-16.
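This detection logic can be sketched as a small helper; detect_utf16_order is our own hypothetical function, not a standard API:

```python
# a minimal sketch of utf-16 bom detection (hypothetical helper)
def detect_utf16_order(data: bytes) -> str:
    if data[:2] == b'\xfe\xff':
        return 'big'     # bom stored big endian
    if data[:2] == b'\xff\xfe':
        return 'little'  # bom stored little endian
    return 'big'         # no bom: unicode's recommended default

print(detect_utf16_order(b'\xfe\xff\x02\xc6'))  # 'big'
print(detect_utf16_order(b'\xff\xfe\xc6\x02'))  # 'little'
```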

So when using:

  • utf-16LE : encoding and decoding are little endian and no bom mark is added. If a bom mark is encountered when decoding, it is decoded into a code point.
  • utf-16BE : encoding and decoding are big endian and no bom mark is added. If a bom mark is encountered when decoding, it is decoded into a code point.
  • utf-16 : when encoding, a bom mark is added. Unicode recommends the default encoding to be big endian; java uses big endian and python uses little endian as the default. Hence java will add the bom mark 0xFE_FF and do a big endian encoding, and python will add the bom mark 0xFF_FE and do a little endian encoding. When decoding using utf-16, if the bom mark is encountered in the first two bytes, we decode based on this bom mark. If it is not encountered, unicode recommends the decoding to be big endian; python will use little endian, and java will use big endian. Bom marks encountered after the first two bytes are decoded into their corresponding code point.

# encoding using  utf-16BE

>>> _bytes = bytes('ø','utf-16BE')
# encoding specified as big endian

>>> _bytes
b'\x00\xf8'
# when encoding using 
# utf-16BE or utf-16LE
# no bom mark is added 

>>> _bytes.decode('utf-16')
'\uf800'
# when decoding using utf-16
# if no bom mark is found
# python assumes that the encoding is little endian
# hence the decoding result is '\uf800'

>>> _bytes.decode('utf-16BE')
'ø'
# utf-16BE , will always decode using big endian

>>> _bytes.decode('utf-16LE')
'\uf800'
# utf-16LE , will always decode using little endian



# encoding using utf-16

>>> _bytes = bytes('ø','utf-16')
# when encoding using 
# utf-16 a bom mark is always added 

>>> _bytes
b'\xff\xfe\xf8\x00'
# python utf-16 default encoding is little endian
# hence 0xFF_FE is added 
# unicode recommends the default encoding to be big endian

>>> _bytes.decode('utf-16')
'ø'
# when decoding using utf-16 
# if the first two bytes are the BOM mark
# they are used to set the decoding 
# if it is little endian or big endian 
# if the bom mark is not encountered 
# the default decoding is recommended 
# to be Big endian by unicode 
# python uses little endian , java uses Big endian

>>> _bytes.decode('utf-16LE')
'\ufeffø'
# utf-16LE will decode the bom mark
# as the unicode code point U+FEFF
# which is the zero width no-break space character

>>> _bytes.decode('utf-16BE')
'\ufffe\uf800'
# utf-16BE will decode the bom mark
# as the unicode code point U+FFFE
# which is a noncharacter

utf-32

When using utf-32, unicode recommends the byte ordering to be big endian. Python uses little endian and Java uses big endian.

The byte order mark used by utf-32 is the same as the one used in utf-16, but it is now stored as the 4 bytes 0x00_00_FE_FF when using big endian, or as the 4 bytes 0xFF_FE_00_00 when using little endian.

When encoding using utf-32, the bom mark is added. When decoding using utf-32, if the byte order mark 0x00_00_FE_FF is encountered in the first 4 bytes, the byte order is big endian; if its reverse 0xFF_FE_00_00 is encountered in the first 4 bytes, the byte order is little endian. When no bom mark is encountered, unicode recommends the byte order to be assumed big endian; python assumes little endian, and java assumes big endian. If a bom is encountered after the first four bytes, it is decoded into its code point.

When encoding using utf-32LE, the encoding is little endian, and no bom mark is added. When decoding using utf-32LE, if a bom mark is encountered it is decoded into a code point.

When encoding using utf-32BE, the encoding is big endian, and no bom mark is added. When decoding using utf-32BE, if a bom mark is encountered it is decoded into a code point.

# encoding using utf-32BE

>>> _bytes =  bytes('ꬑ', 'utf-32BE')
>>> _bytes
b'\x00\x00\xab\x11'
# ꬑ is encoded using utf-32BE 
# no bom is added 


>>> _bytes.decode('utf-32')
# UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in 
# position 0-3: code point not in range(0x110000)

# utf-32 will decode using utf-32LE , when no 
# bom mark is encountered , hence it cannot decode 
# correctly


>>> _bytes.decode('utf-32LE')
# UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in 
# position 0-3: code point not in range(0x110000)

# utf-32LE will  try to decode using little 
# endian encoding , so it will try to decode 
# 0x11_ab_00_00
# which is not in the range of a valid code points 


# encoding using utf-32LE

>>> _bytes =  bytes('ꬑ', 'utf-32LE')
>>> _bytes
b'\x11\xab\x00\x00'
# ꬑ is encoded using utf-32LE 
# no bom is added 

>>> _bytes.decode('utf-32')
'ꬑ'
# utf-32 will decode using utf-32LE , when no 
# bom mark is encountered  , the encoding
# is little endian , hence it is decoded correctly


>>> _bytes.decode('utf-32BE')
# UnicodeDecodeError: 'utf-32-be' codec can't decode bytes 
# in position 0-3: code point not in range(0x110000)

# when trying to decode using big endian 
# the encoding of _bytes is little endian 
# so when decoding 0x11_ab_00_00  , it will be 
# read from left to right , and the encoding is not
# in range of a valid code point



# encoding using utf-32 

>>> _bytes = bytes('ꬑ', 'utf-32')
>>> _bytes
b'\xff\xfe\x00\x00\x11\xab\x00\x00'
# ꬑ is encoded using utf-32LE 
# a bom is added 

>>> _bytes.decode('utf-32')
'ꬑ'
# utf-32 will check the first 4 bytes
# for the bom mark , if it is found
# decoding is done based on the bom
# if not it is little endian  when using 
# python , java uses big endian 
# unicode recommends big endian


>>> _bytes.decode('utf-32LE')
'\ufeffꬑ'
# utf-32LE will decode using little endian
# the bom mark will be decoded into its 
# code point


>>> _bytes.decode('utf-32BE')
# UnicodeDecodeError: 'utf-32-be' codec can't decode bytes in 
# position 0-3: code point not in range(0x110000)

# when trying to decode using big endian 
# the encoding of _bytes is little endian 
# when reading from left to right  
# using big endian , the first encountered
# value is not in the range of a valid  code points