# What is a character in Python?

## What is a character in Python?

A character in Python is identified by a Unicode code point. A Unicode code point is written as `U+` followed by a number in hexadecimal. This number has a min value of `0x0` and a max value of `0x10_ff_ff`, which is `1_114_111` in decimal, so a Unicode code point `hex_number` fits in `21` bits.

```
>>> max_unicode_hex_number = 0x10_ff_ff
>>> max_unicode_hex_number.bit_length()
21
```
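The upper limit can also be checked from Python itself. The following is a minimal sketch using `sys.maxunicode` and `chr`, neither of which is mentioned in the text above; the variable names are only illustrative.

```python
import sys

# sys.maxunicode holds the largest code point Python accepts.
largest_code_point = sys.maxunicode
print(hex(largest_code_point))  # 0x10ffff

# chr() turns a code point into a character; the largest one is accepted.
last_character = chr(largest_code_point)

# Anything above the maximum is rejected with a ValueError.
try:
    chr(largest_code_point + 1)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True
print(out_of_range_rejected)  # True
```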

For example, the Unicode code point `U+0000` represents the null character. `0000` is written in hexadecimal; its value is `0` in both decimal and hexadecimal.

The Unicode code point `U+0030` represents the `0` character. `0030` is written in hexadecimal; it has a decimal value of `48` and a hexadecimal value of `30`.

The Unicode code point `U+0061` represents the `a` character. `0061` is written in hexadecimal; it has a decimal value of `97` and a hexadecimal value of `61`.

The Unicode code point `U+0062` represents the `b` character. `0062` is written in hexadecimal; it has a decimal value of `98` and a hexadecimal value of `62`.
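The `U+` notation used in these examples can be reproduced with a format specifier. This is a small sketch; `code_point_label` is a hypothetical helper name, not something Python provides.

```python
def code_point_label(character):
    """Return the code point of a character in U+XXXX notation.

    Hypothetical helper, written only for illustration.
    """
    # 04X pads the hexadecimal number with zeros to at least four digits.
    return f"U+{ord(character):04X}"


# The null character, then the characters 0, a and b.
labels = [code_point_label(c) for c in ['\0', '0', 'a', 'b']]
print(labels)  # ['U+0000', 'U+0030', 'U+0061', 'U+0062']
```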

## Get a character's Unicode code point decimal value using ord

We can use the `ord(character)` function to get the decimal value of a character's Unicode code point. In other words, we get the hexadecimal number in `U+hex_number` expressed in decimal.

```
>>> unicode_number_of_a = ord('a')
>>> unicode_number_of_a
97
>>> unicode_number_of_b = ord('b')
>>> unicode_number_of_b
98
```
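The reverse direction is handled by the built-in `chr` function, which is not covered in the text above. A short sketch with illustrative variable names:

```python
# ord() goes from a character to its code point ...
code_point_of_a = ord('a')

# ... and chr() goes from a code point back to the character.
character_from_code_point = chr(code_point_of_a)
print(character_from_code_point)  # a

# The two functions undo each other.
round_trips = all(chr(ord(c)) == c for c in 'ab0')
print(round_trips)  # True
```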

### Get a character's Unicode code point hexadecimal number using hex

We can use `hex(ord(character))` to get the hexadecimal value of a character's Unicode code point. This gives the hex number in `U+hex_number` as a string prefixed with `0x`.

```
>>> hex_unicode_number_of_a = hex(ord('a'))
>>> hex_unicode_number_of_a
'0x61'
>>> hex_unicode_number_of_b = hex(ord('b'))
>>> hex_unicode_number_of_b
'0x62'
```

In this example, the character `a` (`U+0061`) has the Unicode code point hexadecimal number `0x61`, and the character `b` (`U+0062`) has the Unicode code point hexadecimal number `0x62`.
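Note that `hex` returns a string, not a number. As a small sketch (the variable names are illustrative), the string can be converted back to an integer with `int` and base 16:

```python
# hex() produces a string with a 0x prefix.
hex_string = hex(ord('a'))
print(type(hex_string).__name__, hex_string)  # str 0x61

# int() with base 16 parses it back into the code point.
code_point = int(hex_string, 16)
print(code_point)  # 97

# format() with 'x' gives the hexadecimal digits without the prefix.
digits_only = format(ord('a'), 'x')
print(digits_only)  # 61
```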

## Representing characters in Python using a glyph

Characters that have a glyph, or symbol, can be represented in Python by enclosing the glyph in single, double or triple quotes. Python has no separate character type: a character is simply a string of length one. If the quotes contain more than one character, we get a longer string.

```
>>> single_quote = 'a'
>>> single_quote
'a'
>>> double_quote = "b"
>>> double_quote
'b'
>>> triple_quote = '''c'''
>>> triple_quote
'c'
```

In this example, the characters `a`, `b` and `c` are represented by their glyphs.
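A quick check shows that a quoted single character and a longer quoted text have the same type in Python and differ only in length. The sketch below uses illustrative variable names.

```python
single_character = 'a'
several_characters = 'abc'

# Both values are of type str.
same_type = type(single_character) is type(several_characters)
print(same_type)  # True

# Only the length differs.
print(len(single_character), len(several_characters))  # 1 3

# Indexing a string gives back a string of length one.
first_character = several_characters[0]
print(first_character == single_character)  # True
```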

## Representing characters in Python using an escape sequence

An escape sequence is written inside single, double or triple quotes, and it starts with the backslash character.

We may need an escape sequence for any of the following:

• Representing characters that don't have a glyph or symbol. For example, a new line has the Unicode code point `U+000A`. It has no glyph or symbol, so Python represents it with the escape sequence `'\n'`.
```
>>> newLine_character = '\n'
# A new line has no symbol to represent it.
# We represent it using the escape sequence \n
>>> ord(newLine_character)
10
# The new line character has a Unicode code point
# of 10 in decimal
```

• Escaping characters that Python uses in a special way. For example:
  • The single and double quote characters delimit a string, so to represent a single or double quote itself we must use an escape sequence: `'\''` or `"\""`.
  • The backslash character starts an escape sequence, so to represent the backslash itself we must use the escape sequence `'\\'`.
• Representing characters by something other than their glyph. For example, we can use the escape sequences `\u`, `\U` and `\x` to represent characters by their Unicode code point hex number. We can also represent characters by using an octal escape sequence, or by using their Unicode name.
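The quote and backslash escapes from the list above can be sketched as follows; the variable names are only illustrative.

```python
single_quote = '\''
double_quote = "\""
backslash = '\\'

# Each literal holds exactly one character.
lengths = [len(single_quote), len(double_quote), len(backslash)]
print(lengths)  # [1, 1, 1]

# A quote needs no escaping inside the other kind of quotes.
print("'" == single_quote, '"' == double_quote)  # True True

# The code points of the three characters.
code_points = [hex(ord(c)) for c in (single_quote, double_quote, backslash)]
print(code_points)  # ['0x27', '0x22', '0x5c']
```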

Sometimes we don't want an escape sequence to be interpreted by Python. We can prevent this in two ways. We can either escape the backslash that starts the escape sequence with another backslash:

```
>>> escaping_an_escape_sequence = '\\a'
# We can escape an escape sequence
# by preceding the backslash that starts
# it with another backslash.
# '\\a' creates two characters:
# \\ is the backslash character, which
#    has the Unicode code point U+005c.
# a is the a character, which has the
#   Unicode code point U+0061.
>>> escaping_an_escape_sequence
'\\a'

>>> bell_character = '\a'
# We didn't escape the escape sequence
# \a with a preceding backslash. \a is
# the bell character, which has the Unicode
# code point U+0007.
>>> bell_character
'\x07'
```

Or we can prefix the string containing an escape sequence with an `r` (for raw). This has the same effect as preceding every backslash with another one, so none of the escape sequences are interpreted.

```
>>> escaping_an_escape_sequence = r'\n\a'
# escaping_an_escape_sequence is formed
# of a backslash followed by the n
# character followed by another backslash
# followed by the a character
>>> escaping_an_escape_sequence
'\\n\\a'
>>> print(escaping_an_escape_sequence)
\n\a
>>> escape_sequence = '\n\a'
# escape_sequence is formed of
# the new line character followed
# by the bell character, which has
# the Unicode code point U+0007
>>> escape_sequence
'\n\x07'
```
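Comparing lengths makes the difference between the two approaches and the interpreted form easy to see. A minimal sketch with illustrative variable names:

```python
raw_text = r'\n\a'
escaped_text = '\\n\\a'
interpreted_text = '\n\a'

# The raw literal and the doubled-backslash literal are the same string.
print(raw_text == escaped_text)  # True

# It holds four characters: backslash, n, backslash, a.
print(len(raw_text))  # 4

# The interpreted literal holds two: new line and bell.
print(len(interpreted_text))  # 2
interpreted_code_points = [hex(ord(c)) for c in interpreted_text]
print(interpreted_code_points)  # ['0xa', '0x7']
```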

### Representing characters in Python by using \u \U \x

A character in Python has a Unicode code point, written as `U+hex_number`. Instead of representing a character by its glyph, we can represent it by its Unicode code point `hex_number`.

A Unicode code point hex number fits in `21` bits. It has a min value of `0x0` and a max value of `0x10_ff_ff`.

• When the Unicode code point hex number has a value between `0x0` and `0xff`, we can represent it by using the `\x`, `\u` or `\U` escape sequence. `\x` is followed by exactly two hexadecimal digits.
• When the Unicode code point hex number has a value between `0x0` and `0xff_ff`, we can represent it by using the `\u` or `\U` escape sequence. `\u` is followed by exactly four hexadecimal digits.
• When the Unicode code point hex number has a value between `0x0` and `0x10_ff_ff`, we can represent it by using the `\U` escape sequence, which is followed by exactly eight hexadecimal digits.

For example, the `z` character has a Unicode code point of `U+007a`. Its hex number is less than or equal to `0xff`, so we can represent it by using `\x`, `\u` or `\U`.

```
>>> character_represented_through_its_hex_number = '\x7a'
>>> character_represented_through_its_hex_number
'z'
>>> character_represented_through_its_hex_number = '\u007a'
>>> character_represented_through_its_hex_number
'z'
>>> character_represented_through_its_hex_number = '\U0000007a'
>>> character_represented_through_its_hex_number
'z'
```

The `β `character has a unicode code point of `U+2202` , its hexadecimal number is `0x22_02` . It is larger than `0xff` as such it cannot be represented by using `\x` , and it can only be represented by `\u` or `\U `.

```
>>> character_represented_through_its_hex_number = '\u2202'
>>> character_represented_through_its_hex_number
'∂'
>>> character_represented_through_its_hex_number = '\U00002202'
>>> character_represented_through_its_hex_number
'∂'
```

The `π` character has a unicode code point of` U+10156` , its hexadecimal number is larger than `0xff_ff` , hence it can only represented by using `\U` .

```>>> character_represented_through_its_hex_number = '\U00010156'
>>> character_represented_through_its_hex_number
'π'
```
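The three escape forms can be compared directly with each other and with `chr`, which builds the same characters from the code point number. This is a sketch with illustrative variable names; `chr` is not part of the text above.

```python
# U+007a fits in two hex digits, so all three escapes work.
z_forms = ['\x7a', '\u007a', '\U0000007a', chr(0x7a)]
print(all(form == 'z' for form in z_forms))  # True

# U+2202 needs four hex digits, so only \u and \U work.
partial_forms = ['\u2202', '\U00002202', chr(0x2202)]
print(len(set(partial_forms)))  # 1

# U+10156 needs more than four hex digits, so only \U works.
wide_character = '\U00010156'
print(hex(ord(wide_character)), len(wide_character))  # 0x10156 1
```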

### Representing characters by using octal

We can represent characters whose Unicode code point is between `U+0000` and `U+00FF` by using an octal escape sequence. An octal escape sequence is a backslash followed by one to three octal digits.

For example, the letter `ÿ` has a code point of `U+00FF`. Its hexadecimal number is `FF`, which is `377` in octal, so we can represent it by using the octal escape sequence `\377`.

```
>>> character_represented_through_an_octal_escape_sequence = '\377'
>>> character_represented_through_an_octal_escape_sequence
'ÿ'
```
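The octal digits for a character can be computed rather than looked up. A small sketch, with illustrative variable names, using `format` with the `'o'` specifier:

```python
# The character U+00FF, written with a hex escape.
y_diaeresis = '\xff'

# format() with 'o' gives the code point in octal digits.
octal_digits = format(ord(y_diaeresis), 'o')
print(octal_digits)  # 377

# The octal escape, the hex escape and chr() give the same character.
print('\377' == y_diaeresis == chr(0o377))  # True

# Another example: a is U+0061, which is 141 in octal.
octal_a = '\141'
print(octal_a)  # a
```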

### Representing characters using their Unicode name \N{name}

We can represent characters in Python by using their Unicode name. To find the name of a character, we can look it up in the `UnicodeData.txt` file or use the `unicodedata` module.

The `UnicodeData.txt` file contains the properties of the Unicode characters.

```
03C4;GREEK SMALL LETTER TAU;Ll;0;L;;;;;N;;;03A4;;03A4

# The Unicode code point 03C4 has the name
# GREEK SMALL LETTER TAU. Its category is Ll,
# which is letter, lowercase. It has a
# bidirectional class of L, which is left to
# right. N means that this character cannot
# be mirrored as in using a mirror. For example,
# ( is mirrored to ).
# Its uppercase is Τ, which has the Unicode code
# point 03A4, and its title case is also Τ (03A4).
```
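A line in this format can be split on the semicolons to pull the fields apart. The sketch below hard-codes the line shown above instead of reading the file. The field positions follow the layout of that line, and the variable names are illustrative.

```python
record = '03C4;GREEK SMALL LETTER TAU;Ll;0;L;;;;;N;;;03A4;;03A4'
fields = record.split(';')
print(len(fields))  # 15

code_point = int(fields[0], 16)   # 0x3c4
name = fields[1]                  # GREEK SMALL LETTER TAU
category = fields[2]              # Ll
bidirectional_class = fields[4]   # L
mirrored = fields[9]              # N

# The uppercase and title case mappings are code points in hex.
uppercase = chr(int(fields[12], 16))
titlecase = chr(int(fields[14], 16))

# The uppercase mapping agrees with what str.upper() gives.
print(chr(code_point).upper() == uppercase)  # True
```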

We can also use the `unicodedata` module to get the Unicode properties of a character.

```
>>> import unicodedata

>>> unicodedata.name('τ')
# get the unicode character name of τ
'GREEK SMALL LETTER TAU'

>>> unicodedata.bidirectional('τ')
# get the bidirectional class of τ ,
# it is L which means  Left to Right
'L'

>>> unicodedata.category('τ')
# get the unicode category of τ
# it is 'Ll' , which means
# letter lower case
'Ll'
```

So if we want to write a character using its Unicode name, we can use the `\N{name}` escape sequence like this:

```
>>> import unicodedata

>>> unicodedata.name('a')
'LATIN SMALL LETTER A'
>>> character_a = '\N{LATIN SMALL LETTER A}'
# use the escape sequence \N{name} to represent
# the character a
>>> character_a
'a'

>>> unicodedata.name('p')
'LATIN SMALL LETTER P'
>>> character_p = '\N{LATIN SMALL LETTER P}'
# use the escape sequence \N{name} to represent
# the character p
>>> character_p
'p'
```
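When the name is only known at run time, `unicodedata.lookup` does the same job as the `\N{name}` escape. It is not covered in the text above. The sketch also shows that some characters, such as the new line, have no name at all; the variable names are illustrative.

```python
import unicodedata

# lookup() is the run-time counterpart of the \N{name} escape.
tau = unicodedata.lookup('GREEK SMALL LETTER TAU')
print(tau == '\N{GREEK SMALL LETTER TAU}' == '\u03c4')  # True

# name() and lookup() undo each other.
print(unicodedata.lookup(unicodedata.name('a')))  # a

# Control characters have no name; a default avoids a ValueError.
newline_name = unicodedata.name('\n', '<unnamed>')
print(newline_name)  # <unnamed>
```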

## What is encoding?

A character has a Unicode code point, written as `U+hex_number`. The hexadecimal number has a min value of `0x0` and a max value of `0x10_ff_ff`, so it has a max length of `21` bits.

An encoding defines how these `21` bits are represented in a computer. There are three encoding schemes:

• `UTF-32` encodes the Unicode code point hex number using 32-bit units. As such, we need exactly one UTF-32 unit to encode it.
• `UTF-16` encodes the Unicode code point hex number using 16-bit units. As such, we need one or two UTF-16 units to encode it.
• `UTF-8` encodes the Unicode code point hex number using 8-bit units. As such, we need one, two, three or four UTF-8 units to encode it. The next picture summarizes how to encode Unicode code points by using UTF-8.

## What is ASCII?

`ASCII` stands for American Standard Code for Information Interchange. It encodes characters using only 7 bits, so it has a min value of 0 and a max value of 127. Each character is stored in one byte, so the most significant bit is always 0. ASCII values and the Unicode code points between 0 and 127 are the same.
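The unit counts for the three encodings, and the overlap between ASCII and Unicode, can be checked with `str.encode`. This is a minimal sketch: the sample characters and variable names are illustrative, and the big-endian codec variants are used only so that no byte order mark is added to the output.

```python
samples = ['a', '\u2202', '\U00010156']

# Map each code point to its (UTF-8, UTF-16, UTF-32) unit counts.
unit_counts = {}
for character in samples:
    utf8_units = len(character.encode('utf-8'))            # 8-bit units
    utf16_units = len(character.encode('utf-16-be')) // 2  # 16-bit units
    utf32_units = len(character.encode('utf-32-be')) // 4  # 32-bit units
    unit_counts[ord(character)] = (utf8_units, utf16_units, utf32_units)

print(unit_counts[0x61])     # (1, 1, 1)
print(unit_counts[0x2202])   # (3, 1, 1)
print(unit_counts[0x10156])  # (4, 2, 1)

# For code points 0 to 127, ASCII and UTF-8 produce the same byte.
print('a'.encode('ascii') == 'a'.encode('utf-8') == b'a')  # True

# Characters above 127 cannot be encoded as ASCII.
try:
    '\u2202'.encode('ascii')
    ascii_rejected = False
except UnicodeEncodeError:
    ascii_rejected = True
print(ascii_rejected)  # True
```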