Characters
A char is a region of memory, formed of bits, large enough to hold any character of the basic execution character set. The value stored in a char is the encoding of such a character.
The size of the region occupied by a char in memory is equal to 1 byte. A byte is an addressable unit of storage, whose number of bits is implementation defined, and which is capable of holding any character of the basic execution character set.
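As a small illustration of the relation between a char and a byte, the following sketch checks the two guarantees at compile time and converts a size in bytes to a size in bits. The helper name bits_in is hypothetical and only used for this example.

```c
#include <assert.h>  /* static_assert (C11) */
#include <limits.h>  /* CHAR_BIT */
#include <stddef.h>  /* size_t */

/* sizeof(char) is 1 by definition, whatever the width of a byte. */
static_assert(sizeof(char) == 1, "a char occupies exactly one byte");

/* The number of bits in a byte is implementation defined, but at least 8. */
static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");

/* Number of bits in an object that occupies size_in_bytes bytes
   (hypothetical helper, for illustration only). */
size_t bits_in(size_t size_in_bytes)
{
    return size_in_bytes * CHAR_BIT;
}
```

Calling bits_in(sizeof(char)) gives CHAR_BIT, which is 8 on most current platforms but is not required to be.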
A char can be signed, or it can be unsigned, but char, unsigned char and signed char are three distinct types, just as an int is not a float, even when a char has the same representation as an unsigned char or a signed char.
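One way to observe that the three character types are distinct is a C11 generic selection, which would not compile if two of its associations named the same type. This is only a sketch: the enum and the macro CHAR_KIND are hypothetical names introduced for the example.

```c
#include <assert.h>  /* static_assert (C11) */

/* Hypothetical labels for the example. */
enum char_kind { KIND_CHAR, KIND_SIGNED_CHAR, KIND_UNSIGNED_CHAR, KIND_OTHER };

/* The three types have the same size... */
static_assert(sizeof(signed char) == sizeof(char) &&
              sizeof(unsigned char) == sizeof(char),
              "all three character types occupy one byte");

/* ...but they are distinct types: a generic selection may not list the
   same type twice, so this macro only compiles because they differ. */
#define CHAR_KIND(x) _Generic((x),          \
    char:          KIND_CHAR,               \
    signed char:   KIND_SIGNED_CHAR,        \
    unsigned char: KIND_UNSIGNED_CHAR,      \
    default:       KIND_OTHER)
```

Note that a character constant such as 'a' has type int in C, so CHAR_KIND('a') selects the default association.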
What is meant when it is said that a char is signed or unsigned, is that in C a char is an integer type. The bits stored in memory for a char are interpreted as a number, and this number can be interpreted either as signed or as unsigned.
For example, if a char is formed of 8 bits, and the first bit is considered to be a sign bit (sign and magnitude representation), then 00000001 is equal to 1, and 10000001 is equal to -1. If the bits are considered to be in two's complement representation, then 00000001 is equal to 1, and 10000001 is equal to -127. Finally, if the representation is unsigned, so that the first bit is an ordinary value bit, then 00000001 is equal to 1, and 10000001 is equal to 129.
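The three interpretations described above can be reproduced with plain integer arithmetic. The sketch below assumes an 8 bit pattern passed as an unsigned value, and the function names are hypothetical.

```c
#include <assert.h>

/* Unsigned reading: every bit is a value bit. */
int as_unsigned(unsigned bits)
{
    assert(bits <= 0xFFu);              /* only 8 bit patterns are handled */
    return (int)bits;
}

/* Sign and magnitude: the top bit is the sign, the low 7 bits the magnitude. */
int as_sign_magnitude(unsigned bits)
{
    assert(bits <= 0xFFu);
    int magnitude = (int)(bits & 0x7Fu);
    return (bits & 0x80u) ? -magnitude : magnitude;
}

/* Two's complement: the top bit has the weight -128. */
int as_twos_complement(unsigned bits)
{
    assert(bits <= 0xFFu);
    int low = (int)(bits & 0x7Fu);
    return (bits & 0x80u) ? low - 128 : low;
}
```

For the pattern 10000001, which is 0x81, the three functions return 129, -1 and -127 respectively, and for 00000001 they all return 1.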
The members of the basic execution character set must all be stored with a nonnegative value in the type char. Other characters, which are not members of the basic execution character set, can be stored in the char type with either a positive or a negative value; this is implementation defined.
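Whether plain char is signed on a given implementation can be read from CHAR_MIN, and the nonnegative guarantee can be checked on a string. This is a minimal sketch, with hypothetical function names.

```c
#include <assert.h>
#include <limits.h>   /* CHAR_MIN */
#include <stdbool.h>
#include <stddef.h>   /* NULL */

/* CHAR_MIN is 0 when plain char is unsigned, negative when it is signed. */
bool plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Returns true if every char of the string holds a nonnegative value. */
bool all_nonnegative(const char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (*s < 0)
            return false;
    }
    return true;
}
```

A string made only of members of the basic execution character set, such as letters, digits and punctuation, is expected to pass all_nonnegative on any conforming implementation, whether or not char is signed.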
A signed char is a different type than a char. Its first bit is a sign bit, so it is a signed integer type, and it occupies the same amount of storage as a char.
An unsigned char is also a different type than a char. Its first bit is an ordinary value bit, so it is an unsigned integer type, and it occupies the same amount of storage as a char.
So the value of a character type is not necessarily the code point of a character in a character set, but the encoding of this code point, which is a numerical value. As an example, in ASCII, A has a code point of 65. The encoding of A is the same as its code point, which is why the numerical value of the char holding the character A is 65. Other character sets might encode their code points with values different from the code point itself.
Wide characters
The encoding of the basic execution character set can be stored in the char type, but what about the extended execution character set, whose encoding might need more than one byte, and so might not fit into a single char?
For example, if the extended character set is the ISO-8859-1 character set, then the extended character set can be encoded using a single byte. The ISO-8859-1 character set is known as Latin alphabet number one, and it contains the basic character set, in addition to accented and other characters. But what if the extended character set is, for example, Unicode, which cannot be encoded using a single byte per character, and for which 16 bit and 32 bit encodings exist?
The solution to the storage of the extended execution character set encoding is wide characters, which are fixed length encodings of the characters that are members of the extended execution character set.
First, the C standard came up with the wchar_t type, which is defined in the stddef.h header (and also in wchar.h). This type is used to store a fixed length encoding of the characters that are members of the extended execution character set.
wchar_t is an integer type. Its exact definition is implementation defined, and can look something like this:
typedef unsigned short wchar_t;
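Under Windows, it has a length of 16 bits, and it is used to store the UTF-16 encoding of characters in the extended execution character set. Under Linux, it has a length of 32 bits, and it is used to store the UTF-32 encoding. Note that UTF-16 needs two 16 bit units for characters outside the Basic Multilingual Plane, so a 16 bit wchar_t cannot hold every Unicode character in a single value.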
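Since the width and signedness of wchar_t differ between platforms, they can be queried rather than assumed. The following is a sketch with hypothetical helper names.

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>   /* CHAR_BIT */
#include <stdbool.h>
#include <stddef.h>   /* wchar_t, size_t */
#include <stdint.h>   /* WCHAR_MIN, WCHAR_MAX */

/* The standard only guarantees a small minimum range for wchar_t. */
static_assert(WCHAR_MAX >= 127, "wchar_t can hold at least 0..127");

/* Width of wchar_t in bits: typically 16 on Windows and 32 on Linux. */
size_t wchar_bits(void)
{
    return sizeof(wchar_t) * CHAR_BIT;
}

/* WCHAR_MIN is 0 when wchar_t is unsigned, negative when it is signed. */
bool wchar_is_signed(void)
{
    return WCHAR_MIN < 0;
}
```

Code that must behave the same on every platform should not rely on a particular result of wchar_bits().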
To declare a string literal or a character constant to be of type wchar_t, precede it with L. For example:
wchar_t aChar = L'a';
wchar_t * aString = L"a";
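Wide strings are handled with the functions of wchar.h rather than those of string.h. As a short sketch, the hypothetical helpers below measure a wide string in characters and in bytes, using the standard function wcslen.

```c
#include <assert.h>
#include <stddef.h>   /* size_t, NULL */
#include <wchar.h>    /* wcslen */

/* Number of wchar_t elements before the terminating null wide character. */
size_t wide_length(const wchar_t *s)
{
    assert(s != NULL);
    return wcslen(s);
}

/* Storage needed for the string, terminator included, in bytes. */
size_t wide_storage_bytes(const wchar_t *s)
{
    assert(s != NULL);
    return (wcslen(s) + 1) * sizeof(wchar_t);
}
```

For L"abc", wide_length returns 3, while wide_storage_bytes returns 4 times sizeof(wchar_t), which depends on the platform.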
Later on, the C standard (C11) came up with the type char16_t, an unsigned integer type used to store 16 bit encodings, and the type char32_t, an unsigned integer type used to store 32 bit encodings.
Both of these types are defined in the Unicode utilities header uchar.h, and can be used to store the encoding of characters of the extended execution character set.
If the macro __STDC_UTF_16__ is defined, then values of type char16_t are UTF-16 encoded; otherwise the encoding is implementation defined.
If the macro __STDC_UTF_32__ is defined, then values of type char32_t are UTF-32 encoded; otherwise the encoding is implementation defined.
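The two macros can be tested with the preprocessor. The sketch below wraps each test in a hypothetical function, so that the result can also be used at run time.

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>   /* CHAR_BIT */
#include <stdbool.h>
#include <uchar.h>    /* char16_t, char32_t */

/* The types are at least as wide as their names say. */
static_assert(sizeof(char16_t) * CHAR_BIT >= 16, "char16_t has at least 16 bits");
static_assert(sizeof(char32_t) * CHAR_BIT >= 32, "char32_t has at least 32 bits");

/* True if char16_t values are guaranteed to be UTF-16 encoded. */
bool char16_is_utf16(void)
{
#ifdef __STDC_UTF_16__
    return true;
#else
    return false;
#endif
}

/* True if char32_t values are guaranteed to be UTF-32 encoded. */
bool char32_is_utf32(void)
{
#ifdef __STDC_UTF_32__
    return true;
#else
    return false;
#endif
}
```

When char32_is_utf32() is true, a constant such as U'\u00e9' is expected to have the value 0xE9, which is its Unicode code point.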
To declare a string literal or a character constant to be of type char16_t, or of type char32_t, precede it with u or U respectively:
char16_t aChar = u'1';
char16_t* aString = u"1";
char32_t bchar = U'1';
char32_t* bString = U"1";
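The standard library has no strlen equivalent for char16_t and char32_t strings, so counting their elements is done by hand. The helpers below are hypothetical, and the remark about surrogate pairs assumes that __STDC_UTF_16__ is defined, as it is with GCC and Clang.

```c
#include <assert.h>
#include <stddef.h>   /* size_t, NULL */
#include <uchar.h>    /* char16_t, char32_t */

/* Number of char16_t code units before the terminating zero. */
size_t u16_units(const char16_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        ++n;
    return n;
}

/* Number of char32_t code units before the terminating zero. */
size_t u32_units(const char32_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        ++n;
    return n;
}
```

With UTF-16, a character outside the Basic Multilingual Plane, such as U+1F600, takes two char16_t code units (a surrogate pair), whereas with UTF-32 it takes a single char32_t.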
Multibyte characters
Multibyte characters are characters which are encoded using a variable length encoding. For example, Unicode characters can be encoded using a fixed length encoding such as UTF-32, which encodes every Unicode character using 4 bytes, or using a variable length encoding such as UTF-8, in which a character is encoded using 1, 2, 3 or 4 bytes.
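To make the variable length concrete, here is a sketch of how the length of a UTF-8 sequence can be read from its first byte. The function names are hypothetical, and the check is deliberately simplified: it does not reject overlong forms or code points beyond the Unicode range.

```c
#include <assert.h>
#include <stddef.h>   /* size_t, NULL */

/* Length announced by a UTF-8 lead byte, or 0 if the byte cannot
   start a sequence (for example a continuation byte 10xxxxxx). */
size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80)           return 1;   /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 0;
}

/* Number of characters in a UTF-8 string, or (size_t)-1 if a
   sequence is malformed or truncated. */
size_t utf8_count_characters(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    while (*s != '\0') {
        size_t len = utf8_sequence_length((unsigned char)*s);
        if (len == 0)
            return (size_t)-1;
        /* Every following byte must be a continuation byte; the
           terminating zero is not one, so the loop cannot overrun. */
        for (size_t i = 1; i < len; ++i) {
            if (((unsigned char)s[i] & 0xC0) != 0x80)
                return (size_t)-1;
        }
        s += len;
        ++count;
    }
    return count;
}
```

For the bytes 61 C3 A9 E2 82 AC, which encode a, é and € in UTF-8, utf8_count_characters returns 3 although the string occupies 6 bytes.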
The C extended source and extended execution character sets can both be encoded using a variable length encoding. In that case, the members of the included basic character set shall each be encoded using a single byte. Also, the null character must be encoded using a single byte with all of its bits set to zero. Finally, a byte with all of its bits set to zero shall not occur as part of any other multibyte character.
Multibyte characters in the source file are mapped to the source character set, in an implementation defined manner, at the very start of the compilation process (translation phase 1), before preprocessing is performed.
In the execution environment, there is no built-in type that can hold a variable length encoding, since built-in types have a fixed length, so a single object cannot hold 1, 2, 3 or 4 bytes depending on the character. Instead, a multibyte character is stored as a sequence of char, and reading it requires the use of special functions in order to know where each multibyte character encoding starts and stops.
Examples of functions that can be used to convert from wchar_t to a variable length encoding, and from a variable length encoding to wchar_t, are wctomb and mbtowc, declared in the header stdlib.h, and their restartable counterparts wcrtomb and mbrtowc, declared in the header wchar.h.
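As a sketch of how these functions are used, the hypothetical helpers below walk a multibyte string with mbtowc and convert a single wide character with wctomb. Both standard functions depend on the current locale, which is the "C" locale until the program calls setlocale.

```c
#include <assert.h>
#include <limits.h>   /* MB_LEN_MAX */
#include <stddef.h>   /* size_t, wchar_t, NULL */
#include <stdlib.h>   /* mbtowc, wctomb, MB_CUR_MAX */

/* Number of characters in a multibyte string of the current locale,
   or (size_t)-1 if an invalid sequence is met. */
size_t mb_count_characters(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    (void)mbtowc(NULL, NULL, 0);             /* reset the shift state */
    while (*s != '\0') {
        wchar_t wc;
        int len = mbtowc(&wc, s, MB_CUR_MAX); /* bytes used by one character */
        if (len <= 0)
            return (size_t)-1;
        s += len;
        ++count;
    }
    return count;
}

/* Converts one wide character to its multibyte form; returns the
   number of bytes written to out, or -1 if it cannot be represented. */
int wide_to_multibyte(wchar_t wc, char out[MB_LEN_MAX])
{
    assert(out != NULL);
    (void)wctomb(NULL, 0);                   /* reset the shift state */
    return wctomb(out, wc);
}
```

To process text in the encoding of the user's environment, for example UTF-8, a program would first call setlocale(LC_ALL, "") from locale.h; without it, only the "C" locale rules apply.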
Examples of functions that can be used to convert from char16_t and char32_t to multibyte characters, and the inverse, are c16rtomb, c32rtomb, mbrtoc16 and mbrtoc32. These functions are declared in the header uchar.h.
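The restartable functions of uchar.h keep their conversion state in an mbstate_t object supplied by the caller. The sketch below, with hypothetical helper names, decodes a multibyte string into char32_t values with mbrtoc32 and encodes one value back with c32rtomb. It assumes an implementation that provides uchar.h, which is not the case everywhere.

```c
#include <assert.h>
#include <limits.h>   /* MB_LEN_MAX */
#include <string.h>   /* memset, strlen */
#include <uchar.h>    /* char32_t, mbstate_t, mbrtoc32, c32rtomb */

/* Decodes at most capacity characters of the multibyte string s into
   out. Returns the number of characters stored, or (size_t)-1 if the
   input is invalid or incomplete. */
size_t mb_to_c32(const char *s, char32_t *out, size_t capacity)
{
    assert(s != NULL && out != NULL);
    mbstate_t state;
    memset(&state, 0, sizeof state);         /* initial conversion state */
    size_t remaining = strlen(s);
    size_t count = 0;
    while (remaining > 0 && count < capacity) {
        char32_t c;
        size_t len = mbrtoc32(&c, s, remaining, &state);
        /* The special results (size_t)-1, -2 and -3 are all larger
           than remaining, so one test covers them. */
        if (len == 0 || len > remaining)
            return (size_t)-1;
        out[count++] = c;
        s += len;
        remaining -= len;
    }
    return count;
}

/* Encodes one char32_t value as a multibyte character; returns the
   number of bytes written to out, or (size_t)-1 on failure. */
size_t c32_to_mb(char32_t c, char out[MB_LEN_MAX])
{
    assert(out != NULL);
    mbstate_t state;
    memset(&state, 0, sizeof state);
    return c32rtomb(out, c, &state);
}
```

As with mbtowc and wctomb, the multibyte encoding used is the one of the current locale, so the results for non-ASCII text depend on a prior call to setlocale.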