There are two character sets in C. The first is the source character set, which is the set of characters in which a C source file is written.
The source character set therefore includes the characters used in C operators, for example +=, in C identifiers, such as the name of a variable or a function, in C keywords, such as int or if, and in literals, such as integer constants, character constants, and string literals.
The second is the execution character set. This is the set of characters interpreted in the environment where the program is executed. For example, we can have ASCII as the source character set and Unicode as the execution character set. The Unicode character set should not be confused with a Unicode encoding such as UTF-8 or UTF-16 …
When a C program is compiled, and after preprocessing is done, the characters of the source character set that appear in character constants and string literals are converted to the corresponding members of the execution character set.
This conversion includes the conversion of escape sequences, such as \n, found in character constants or string literals.
The conversion between the source and execution character sets is done before the source file is converted to object code.
Both the source and the execution character sets must contain what is called the basic character set, which includes the following characters:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9
! " # % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ { | } ~
The values of the characters 0 through 9, in both the source and execution basic character sets, must be consecutive: each digit has a value one greater than the digit before it.
The basic source and execution character sets must also include the space character, which is typically typed with the spacebar, is displayed as an empty column on the line, and has, for example, a decimal value of 32 in both the ASCII and the Unicode character sets.
The basic source and execution character sets must also include the control characters horizontal tab, vertical tab, form feed, and new line.
The basic execution character set must also include control characters that represent alert, backspace, carriage return, and the null character.
The control characters of the basic execution character set are represented, in character constants and string literals of the source file, by escape sequences formed of the backslash character followed by one or more characters: \a for alert, \b for backspace, \r for carriage return, \0 for null, \t for horizontal tab, \v for vertical tab, \f for form feed, and \n for new line.
The null character indicates the end of a string of characters.
printf("%s", "Hello world\n");    /* Outputs Hello world, followed by a new line */
printf("%s", "Hello\0World\n");   /* Outputs only Hello, because of \0 */
The null character must be represented in the execution character set by a byte with all of its bits set to zero.
Trigraphs can be used to represent characters of the basic character sets that, for example, cannot be typed on some keyboards. A trigraph is formed of two question marks followed by a character. For example, ??= represents #.
Each member of the basic source and execution character sets must fit in a single byte. The number of bits in a byte is implementation-defined, and a byte must be able to hold any character of the basic execution character set. A byte is an addressable unit of storage.
The source and execution character sets can each also include a set of zero or more locale-specific characters that are not part of the basic character set, called extended characters. These characters can be used in character constants, string literals, comments, header names, identifiers, and preprocessing tokens that are not converted to tokens.
When an extended character is used in an identifier, a character constant, or a string literal, it can be represented by its universal character name, which has the format \u followed by four hexadecimal digits, or \U followed by eight hexadecimal digits.
The universal character name is based on the short identifier of the character in Unicode. For example, YEZIDI LETTER ELIF has the short identifier 10E80, also written +10E80, U10E80, or U+10E80 in Unicode, and its universal character name in C is \U00010E80.
If a character of the source character set has no corresponding character in the execution character set, the implementation chooses the character it is converted to, which can be any character other than the null character.
The extended characters, which are not part of the basic character sets, can be encoded using multiple bytes.
The basic character set together with the added locale-specific characters is called the extended character set.