A char is a region of memory, formed of bits, large enough to hold any character of the basic execution character set. The value stored in a char is the encoding of a character of that set.
The region occupied by a char in memory has a size of exactly 1 byte. A byte is an addressable unit of storage whose number of bits is implementation defined (at least 8), and which is capable of holding any character of the basic execution character set.
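As a small sketch of the two facts above, the helpers below report the number of bits in a byte and the size of a char. The function names bits_per_char and bytes_per_char are made up for this illustration; CHAR_BIT comes from limits.h.

```c
#include <limits.h>
#include <stddef.h>

/* Number of bits in one byte, and so in one char, on this implementation.
   The C standard guarantees that CHAR_BIT is at least 8. */
static int bits_per_char(void)
{
    return CHAR_BIT;
}

/* sizeof(char) is 1 by definition, whatever the value of CHAR_BIT is. */
static size_t bytes_per_char(void)
{
    return sizeof(char);
}
```

On most desktop systems bits_per_char() is expected to return 8, but only the lower bound is guaranteed.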
A plain char can behave as a signed or as an unsigned type, depending on the implementation, but char, signed char and unsigned char are three distinct types, just as an int is not a float. This holds even though a char has the same representation as either a signed char or an unsigned char.
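One way to observe that the three character types are distinct is the C11 _Generic selection, which would not compile if two of its associations named the same type. The macro CHAR_TYPE_NAME and the function plain_char_is_signed are hypothetical names used only in this sketch.

```c
#include <limits.h>

/* Maps an expression to the name of its character type.
   Listing char, signed char and unsigned char side by side is only
   accepted by the compiler because they are three distinct types. */
#define CHAR_TYPE_NAME(x) _Generic((x), \
    char: "char", \
    signed char: "signed char", \
    unsigned char: "unsigned char", \
    default: "other")

/* Returns 1 if plain char behaves as a signed type on this
   implementation, 0 if it behaves as an unsigned type. */
static int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}
```

Note that a character constant such as 'a' has type int in C, so CHAR_TYPE_NAME('a') selects the default branch.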
What is meant when it is said that a char is signed or unsigned is that, in C, a char is considered to be an integer. So the bits stored in memory for a char are interpreted as a number, and this number can be interpreted either as signed or as unsigned.
For example, if a char is formed of 8 bits, and the first bit is considered to be a sign bit (sign and magnitude representation), then 00000001 is equal to 1, and 10000001 is equal to -1. If the bit representation is considered to be in two's complement, then 00000001 is equal to 1, and 10000001 is equal to -127. Finally, if the first bit is not treated as a sign bit, so that the representation is unsigned, then 00000001 is equal to 1, and 10000001 is equal to 129.
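The interpretations of an 8 bit pattern discussed here can be sketched with plain arithmetic, without relying on how the machine itself represents signed numbers. The three function names are assumptions made for this illustration, and each function only looks at the low 8 bits of its argument.

```c
/* Unsigned interpretation: all 8 bits contribute to the value. */
static int as_unsigned(unsigned bits)
{
    return (int)(bits & 0xFFu);
}

/* Sign and magnitude: the first bit is the sign,
   the remaining 7 bits are the magnitude. */
static int as_sign_magnitude(unsigned bits)
{
    int magnitude = (int)(bits & 0x7Fu);
    return (bits & 0x80u) ? -magnitude : magnitude;
}

/* Two's complement: the first bit has a weight of -128,
   which is the same as subtracting 256 from the unsigned value. */
static int as_twos_complement(unsigned bits)
{
    int value = (int)(bits & 0xFFu);
    return (bits & 0x80u) ? value - 256 : value;
}
```

With the pattern 10000001, which is 0x81, these give 129, -1 and -127 respectively, while 00000001 gives 1 in all three cases.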
The members of the basic execution character set must all be stored with a nonnegative value in the type char. Other characters, which are not members of the basic execution character set, can be stored in the char type with either a positive or a negative value; this is implementation defined.
signed char is a different type from char. Its first bit is treated as a sign bit, so it is a signed integer type, and it has the same number of bits as a char.
unsigned char is also a different type from char. Its first bit is treated as an ordinary value bit, so it is an unsigned integer type, and it has the same number of bits as a char.
So a character type does not hold the code point of a character in a character set, but the encoding of this code point, which is a numerical value. As an example, in ASCII, A has a code point of 65. The encoding of A is the same as its code point, which is why the numerical value of the char holding the character A is 65. Other character sets might encode their code points with values different from the code points themselves.
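A minimal sketch of this idea: converting a char to int simply exposes the numerical value that encodes the character. The helper name char_value is made up, and the value 65 for A assumes an ASCII based execution character set, which the C standard does not require.

```c
/* Returns the numerical value stored in a char, that is,
   the encoding of the character it holds. */
static int char_value(char c)
{
    return (int)c;
}
```

On an ASCII based system, char_value('A') is expected to be 65, and char_value('B') is expected to be 66.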
The encoding of the basic execution character set can be stored in the char type, but what about the extended execution character set, whose encoding might need more than one byte, and so might not fit into a single char?
For example, if the extended character set is the ISO-8859-1 character set, then the extended character set can be encoded using a single byte. The ISO-8859-1 character set is known as Latin alphabet number one, and it contains the basic character set, in addition to accented and other characters. But what if the extended character set is, for example, Unicode, which cannot be encoded using just one byte, and which therefore has 16 and 32 bit encodings?
The solution to the storage of the extended execution character set encoding is wide characters, which are a fixed length encoding of the characters that are members of the extended execution character set.
First, the C standard came up with the wchar_t type, which is defined in the stddef.h header. This type is used to store a fixed length encoding of the characters that are members of the extended execution character set. wchar_t is an integer type. Its definition can look something like this:
typedef unsigned short wchar_t;
Under Windows, it has a length of 16 bits, and it is used to store the UTF-16 encoding of characters in the extended execution character set. Under Linux, it has a length of 32 bits, and it is used to store UTF-32 encodings.
To declare a string literal or a character constant to be of type wchar_t, precede it with L. For example:
wchar_t aChar = L'a';
const wchar_t *aString = L"a"; /* const, since a string literal must not be modified */
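As a rough sketch of working with wide characters, the helpers below report the width of wchar_t in bits and the number of wide characters in a wide string. The names wide_char_bits and wide_length are hypothetical; wcslen is the standard function declared in wchar.h.

```c
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/* Width of wchar_t in bits on this implementation,
   for example 16 under Windows and 32 under Linux. */
static size_t wide_char_bits(void)
{
    return sizeof(wchar_t) * CHAR_BIT;
}

/* Number of wide characters before the terminating null wide character. */
static size_t wide_length(const wchar_t *s)
{
    return wcslen(s);
}
```

For instance, wide_length(L"abc") is 3, whatever the size of wchar_t is, because the length is counted in wide characters and not in bytes.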
Later on, in C11, the C standard came up with the type char16_t, an unsigned integer type used to store 16 bit encodings, and the type char32_t, an unsigned integer type used to store 32 bit encodings. Both of these types are defined in the Unicode utilities header uchar.h, and can be used to store the encoding of the extended execution character set.
If the macro __STDC_UTF_16__ is defined, then values of type char16_t are UTF-16 encoded; otherwise the encoding is of fixed length, and is implementation defined.
If the macro __STDC_UTF_32__ is defined, then values of type char32_t are UTF-32 encoded; otherwise the encoding is of fixed length, and is implementation defined.
To declare a string literal or a character constant to be of type char16_t, or of type char32_t, use the u and U prefixes, like this:
char16_t aChar = u'1';
const char16_t *aString = u"1";
char32_t bChar = U'1';
const char32_t *bString = U"1";
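The sketch below shows one possible way to check the two macros at compile time and to measure char16_t and char32_t strings. Since the standard library has no length function for these types, the loops are written by hand; all four function names are assumptions made for this example.

```c
#include <stddef.h>
#include <uchar.h>

/* Returns 1 if char16_t values are UTF-16 encoded, 0 otherwise. */
static int char16_is_utf16(void)
{
#ifdef __STDC_UTF_16__
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 if char32_t values are UTF-32 encoded, 0 otherwise. */
static int char32_is_utf32(void)
{
#ifdef __STDC_UTF_32__
    return 1;
#else
    return 0;
#endif
}

/* Number of char16_t elements before the terminating zero. */
static size_t u16_length(const char16_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Number of char32_t elements before the terminating zero. */
static size_t u32_length(const char32_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```

For example, u16_length(u"abc") and u32_length(U"abc") are both 3.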
Multibyte characters are characters which are encoded using a variable length encoding. For example, Unicode characters can be encoded using a fixed length encoding, such as UTF-32, which encodes every Unicode character using 4 bytes, or using a variable length encoding, such as UTF-8, in which a character takes 1, 2, 3 or 4 bytes.
The C extended source and extended execution character sets can both be encoded using a variable length encoding. If they are, then the members of the basic character set that they include shall each be encoded using a single byte. Also, the null character must be encoded using a single byte with all of its bits set to zero. Finally, a byte with all of its bits set to zero shall not occur as part of any other multibyte character.
Multibyte characters in the source file are mapped to the source character set, in an implementation defined manner, at the very start of the compilation process (translation phase 1), before preprocessing is performed.
In the execution environment, there is no built-in type that can hold a variable length encoding, since built-in types have a fixed length, so a single object cannot store an encoding that is 1, 2, 3 or 4 bytes long. Instead, a multibyte character is stored as a sequence of bytes, typically in an array of char. Also, reading a variable length encoding requires the use of special functions, in order to know where each multibyte character encoding starts and stops.
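As a sketch of how such functions find the boundaries, the helper below walks a multibyte string with mbrtowc, declared in wchar.h, which reports how many bytes the next character occupies. The name count_multibyte_chars is made up for this example, and the decoding follows whatever encoding the current locale uses.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Counts the characters of a multibyte string, decoded according to
   the current locale. Returns (size_t)-1 if the string contains an
   invalid or incomplete multibyte sequence. */
static size_t count_multibyte_chars(const char *s)
{
    mbstate_t state;
    size_t remaining = strlen(s);
    size_t count = 0;
    wchar_t wc;

    memset(&state, 0, sizeof state); /* initial conversion state */
    while (remaining > 0) {
        /* used is the number of bytes forming the next character */
        size_t used = mbrtowc(&wc, s, remaining, &state);
        if (used == (size_t)-1 || used == (size_t)-2)
            return (size_t)-1;
        if (used == 0) /* null character, not expected before the end */
            break;
        s += used;
        remaining -= used;
        count++;
    }
    return count;
}
```

A program would normally call setlocale(LC_ALL, "") from locale.h first, so that the environment's encoding, for example UTF-8, is used instead of the default "C" locale.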
Examples of functions that can be used to convert from wchar_t to a variable length encoding, and from a variable length encoding to wchar_t, are wctomb and mbtowc. These two functions are declared in the header stdlib.h, and their restartable versions, wcrtomb and mbrtowc, are declared in the header wchar.h.
Examples of functions that can be used to convert from char16_t and char32_t to multibyte characters are c16rtomb and c32rtomb, and for the inverse conversion, mbrtoc16 and mbrtoc32. Such functions are declared in the header uchar.h.
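A possible use of mbrtoc32 is sketched below: it decodes a multibyte string, in the encoding of the current locale, into an array of char32_t. The function name decode_to_char32 and its parameters are assumptions made for this illustration, and the sketch assumes a C library that provides uchar.h.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Decodes the multibyte string s into out, which has room for cap
   elements. Returns the number of char32_t values written, or
   (size_t)-1 on an invalid or incomplete multibyte sequence. */
static size_t decode_to_char32(const char *s, char32_t *out, size_t cap)
{
    mbstate_t state;
    size_t remaining = strlen(s);
    size_t written = 0;

    memset(&state, 0, sizeof state); /* initial conversion state */
    while (remaining > 0 && written < cap) {
        char32_t c;
        size_t used = mbrtoc32(&c, s, remaining, &state);
        if (used == (size_t)-1 || used == (size_t)-2)
            return (size_t)-1;
        if (used == (size_t)-3) { /* value produced, no bytes consumed */
            out[written++] = c;
            continue;
        }
        if (used == 0) /* null character, not expected before the end */
            break;
        out[written++] = c;
        s += used;
        remaining -= used;
    }
    return written;
}
```

As with the other multibyte functions, the result depends on the current locale, so a call to setlocale(LC_ALL, "") is usually needed before decoding non ASCII text.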