C: source, execution, basic, and extended character sets

C distinguishes two character sets. The first is the source character set: the set of characters in which a C source file is written.

The source character set therefore includes the characters used in C operators such as +=, in identifiers such as the name of a variable or a function, in keywords such as int or if, and in literals such as integer constants, character constants, and string literals.

The second is the execution character set: the set of characters interpreted in the environment where the program runs. For example, an implementation could use ASCII as its source character set and Unicode as its execution character set. The Unicode character set should not be confused with a Unicode encoding such as UTF-8 or UTF-16.

When a C program is compiled, and after preprocessing is done, the characters that appear in character constants and string literals are converted from the source character set to the execution character set.

This conversion also covers escape sequences, such as \n, found in character constants or string literals.

The conversion from the source to the execution character set takes place during translation, before the source file is turned into object code.

Both the source and the execution character sets must contain what is called the basic character set, which includes the following characters:


A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

a b c d e f g h i j k l m n o p q r s t u v w x y z

0 1 2 3 4 5 6 7 8 9

! " # % & ' ( ) * + , - . / :

; < = > ? [ \ ] ^ _ { | } ~

The values of the digit characters 0 through 9, in both the basic source and the basic execution character sets, must be consecutive: each digit's value is one greater than the value of the digit before it.

The basic source and execution character sets must also include the space character, which is typed with the spacebar, is displayed as an empty column on the line, and has, for example, the decimal value 32 in both the ASCII and the Unicode character sets.

The basic source and execution character sets must also include the control characters that represent horizontal tab, vertical tab, and form feed. A source file must additionally have some way of marking the end of each line, which C treats as a single new-line character.

The basic execution character set must also include the control characters that represent alert, backspace, carriage return, and new line, as well as the null character.

The control characters of the basic execution character set are written, in character constants and string literals of the source file, as escape sequences formed of a backslash followed by one or more characters: \a for alert, \b for backspace, \r for carriage return, \0 for null, \t for horizontal tab, \v for vertical tab, \f for form feed, and \n for new line.

The null character marks the end of a string of characters.

printf("%s", "Hello world\n");
/* Outputs Hello world, followed by a new line */
Hello world

printf("%s", "Hello\0World\n");
/* Outputs only Hello, because the string ends at the first \0 */

In the execution character set, the null character must be represented by a byte with all of its bits set to zero.

Trigraphs can be used to represent characters of the basic character set that, for example, cannot be typed on some keyboards. A trigraph is formed of two question marks followed by a third character. For example, ??= represents #.

Each member of the basic source and execution character sets must fit in a single byte. The number of bits in a byte is implementation-defined, and a byte must be able to hold any character of the basic execution character set. A byte is an addressable unit of storage.

The source and execution character sets can each also include zero or more locale-specific characters that are not part of the basic character set, called extended characters. These characters can be used in character constants, string literals, comments, header names, identifiers, and preprocessing tokens that are not converted to tokens.

When an extended character is used in an identifier, a character constant, or a string literal, it can be written using its universal character name, which has the form \u followed by four hexadecimal digits, or \U followed by eight hexadecimal digits.

The universal character name is built from the character's short identifier in Unicode. For example, YEZIDI LETTER ELIF has the short identifier 10E80, also written +10E80, U10E80, or U+10E80, and its universal character name in C is \U00010E80.

If a character of the source character set has no counterpart in the execution character set, it is up to the implementation to choose a corresponding character, as long as it is not the null character.

Extended characters, which are not part of the basic character set, can be encoded using multiple bytes.

The basic character set together with the added locale-specific characters is called the extended character set.