What is a character, a wide character, and a multibyte character in C?

Characters

A char is a region of memory, made up of bits, large enough to hold any character of the basic execution character set. The value stored in a char is the encoding of a character from that set.

The region occupied by a char in memory is exactly 1 byte, so sizeof(char) is always 1. A byte is an addressable unit of storage whose number of bits is implementation defined (it is given by the CHAR_BIT macro in limits.h, and is at least 8), and which is capable of holding any character of the basic execution character set.
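
As a small illustration of the relation between bytes and bits, here is a minimal sketch. The helper name bits_in is made up for this example; CHAR_BIT is the standard macro from limits.h.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: number of bits in an object that is `size` bytes long.
   CHAR_BIT is the implementation-defined number of bits in a byte. */
size_t bits_in(size_t size)
{
    return size * CHAR_BIT;
}
```

Calling bits_in(sizeof(char)) always gives CHAR_BIT, because sizeof(char) is 1 by definition. On most current platforms this is 8, but the standard only guarantees that it is at least 8.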

A char can be signed or unsigned, depending on the implementation, but char, unsigned char and signed char are three distinct types, just as int and float are distinct types, even though char has the same representation as either signed char or unsigned char.
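
The fact that the three types are distinct can be checked with the C11 _Generic selection, which refuses to compile if two of its associations name compatible types. The sketch below assumes a C11 compiler; the names char_kind, CHAR_KIND and plain_char_is_signed are made up for this example.

```c
#include <limits.h>

enum char_kind { KIND_CHAR, KIND_SIGNED_CHAR, KIND_UNSIGNED_CHAR, KIND_OTHER };

/* Selects a constant according to the type of x. This only compiles
   because char, signed char and unsigned char are three distinct types. */
#define CHAR_KIND(x) _Generic((x),          \
    char: KIND_CHAR,                         \
    signed char: KIND_SIGNED_CHAR,           \
    unsigned char: KIND_UNSIGNED_CHAR,       \
    default: KIND_OTHER)

/* Returns 1 if plain char is signed on this implementation, 0 otherwise. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}
```

Note that CHAR_KIND('a') selects KIND_OTHER, because in C a character constant such as 'a' has type int, not char.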

 
Saying that a char is signed or unsigned means that, in C, a char is an integer type. The bits stored in memory for a char are interpreted as a number, and that number can be interpreted either as signed or as unsigned.

For example, assume that a char is formed of 8 bits. If the first bit is a sign bit (sign and magnitude representation), then 00000001 is equal to 1, and 10000001 is equal to -1. If the bits are interpreted in two's complement, then 00000001 is 1, and 10000001 is -127. Finally, if the bits are interpreted as unsigned, then 00000001 is 1 and 10000001 is 129.
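
The two's complement and unsigned readings of the same bits can be observed with a short sketch. It assumes an 8-bit byte and a two's complement implementation (which is what practically all current platforms use); the function name reinterpret_as_signed is made up for this example.

```c
#include <string.h>

/* Copies the bits of an unsigned char into a signed char, so that the same
   bit pattern is read back as a signed number (no value conversion). */
signed char reinterpret_as_signed(unsigned char bits)
{
    signed char value;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

Under these assumptions, the pattern 10000001 (0x81) is 129 as an unsigned char, and reinterpret_as_signed(0x81) gives -127.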

 
The members of the basic execution character set must all have a nonnegative value when stored in the type char. Other characters, which are not members of the basic execution character set, can be stored in a char as either positive or negative values; this is implementation defined.
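
This guarantee can be checked with a small sketch. The function name all_nonnegative is made up for this example.

```c
/* Returns 1 if every char in the string s has a nonnegative value, else 0. */
int all_nonnegative(const char *s)
{
    for (; *s != '\0'; ++s) {
        if (*s < 0) {
            return 0;
        }
    }
    return 1;
}
```

For a string made only of basic characters, such as "abcXYZ0189 +-*/", the function returns 1 on any conforming implementation. For characters outside the basic set, the result depends on whether plain char is signed.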

 
A signed char is a different type than a char. It is a signed integer type, so its first bit carries the sign, and it has the same number of bits as a char.

An unsigned char is also a different type than a char. It is an unsigned integer type, so it has no sign bit, and it has the same number of bits as a char.

So a character type does not hold the code point of a character in a character set; it holds the encoding of this code point, which is a numerical value. As an example, in ASCII, A has a code point of 65. The encoding of A is the same as its code point, which is why the numerical value of the char holding the character A is 65. Other character sets might encode their code points with values different from the code points themselves.
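
The numerical value held by a char can be read directly, as in this minimal sketch. The function name code_of is made up for this example, and the value 65 for A assumes an ASCII-compatible execution character set.

```c
/* Returns the numerical value stored in c, read as an unsigned char
   so that the result is never negative. */
int code_of(char c)
{
    return (unsigned char)c;
}
```

With an ASCII-compatible execution character set, code_of('A') is 65 and code_of('a') is 97.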

Wide characters

The encoding of the basic execution character set can be stored in the char type, but what about the extended execution character set, whose encoding may need more than one byte and so may not fit into a single char?

For example, if the extended character set is the ISO-8859-1 character set, then it can be encoded using a single byte. The ISO-8859-1 character set is known as Latin alphabet number one, and it contains the basic character set, in addition to accented and other characters. But what if the extended character set is, for example, Unicode, which cannot be encoded using just one byte per character, and which therefore has 16-bit and 32-bit encodings?

The solution for storing the encoding of the extended execution character set is wide characters, which are fixed-length encodings of the characters that are members of the extended execution character set.

First, the C standard introduced the wchar_t type, which is defined in the stddef.h header (and also in wchar.h). This type is used to store a fixed-length encoding of the characters that are members of the extended execution character set.

wchar_t is an integer type. Its definition can look something like this:

typedef unsigned short wchar_t;

Under Windows, it is 16 bits wide and is used to store the UTF-16 encoding of characters in the extended execution character set, so a character outside the Basic Multilingual Plane needs two wchar_t values. Under Linux, it is 32 bits wide and is used to store UTF-32 encodings.

To give a string literal or a character constant the type wchar_t, precede it with L. For example:

  wchar_t aChar = L'a';

  wchar_t * aString = L"a";
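
To see how wide strings are measured and how much storage they take on a given platform, here is a minimal sketch. The function names wide_length and wide_storage_bytes are made up for this example; wcslen is the standard function from wchar.h.

```c
#include <stddef.h>
#include <wchar.h>

/* Number of wchar_t elements in s, not counting the terminating L'\0'. */
size_t wide_length(const wchar_t *s)
{
    return wcslen(s);
}

/* Number of bytes occupied by s, including the terminating L'\0'.
   The result depends on sizeof(wchar_t), which differs between platforms. */
size_t wide_storage_bytes(const wchar_t *s)
{
    return (wcslen(s) + 1) * sizeof(wchar_t);
}
```

For L"abc", wide_length gives 3 everywhere, while wide_storage_bytes gives 8 where wchar_t is 2 bytes long and 16 where it is 4 bytes long.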

 
Later on, the C standard (C11) introduced the type char16_t, an unsigned integer type used to store 16-bit encodings, and the type char32_t, an unsigned integer type used to store 32-bit encodings.

Both of these types have a fixed width. They are defined in the Unicode utilities header, uchar.h, and can be used to store encodings of the extended execution character set.

If the macro __STDC_UTF_16__ is defined, then values of type char16_t are UTF-16 encoded, otherwise the encoding is of fixed length and is implementation defined.

If the macro __STDC_UTF_32__ is defined, then values of type char32_t are UTF-32 encoded, otherwise the encoding is of fixed length and is implementation defined.
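
Both macros are predefined by the compiler, so they can be tested with the preprocessor. The sketch below only reports whether they are defined; the function names char16_is_utf16 and char32_is_utf32 are made up for this example.

```c
/* Returns 1 if char16_t values are UTF-16 encoded on this implementation. */
int char16_is_utf16(void)
{
#ifdef __STDC_UTF_16__
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 if char32_t values are UTF-32 encoded on this implementation. */
int char32_is_utf32(void)
{
#ifdef __STDC_UTF_32__
    return 1;
#else
    return 0;
#endif
}
```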

To give a string literal or a character constant the type char16_t or char32_t, precede it with u or U respectively, like this:

char16_t aChar = u'1';
char16_t* aString = u"1";

char32_t bchar = U'1';
char32_t* bString = U"1";
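
The difference between the two types shows up for characters that do not fit in 16 bits. The sketch below counts the elements of a char16_t and of a char32_t string. It assumes that __STDC_UTF_16__ and __STDC_UTF_32__ are defined and that uchar.h is available; the function names c16_units and c32_units are made up for this example.

```c
#include <stddef.h>
#include <uchar.h>

/* Number of char16_t elements in s, not counting the terminating zero. */
size_t c16_units(const char16_t *s)
{
    size_t n = 0;
    while (s[n] != 0) {
        ++n;
    }
    return n;
}

/* Number of char32_t elements in s, not counting the terminating zero. */
size_t c32_units(const char32_t *s)
{
    size_t n = 0;
    while (s[n] != 0) {
        ++n;
    }
    return n;
}
```

Under these assumptions, the single character U+1F600 written as u"\U0001F600" takes two char16_t elements (a surrogate pair), while U"\U0001F600" takes one char32_t element.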

Multibyte characters

Multibyte characters are characters which are encoded using a variable-length encoding. For example, Unicode characters can be encoded using a fixed-length encoding such as UTF-32, which encodes every Unicode character using 4 bytes, or using a variable-length encoding such as UTF-8, in which a character takes 1, 2, 3 or 4 bytes.
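
In UTF-8, the length of a character can be read from its first byte. The following minimal sketch shows this; the function name utf8_sequence_length is made up for this example, and it only looks at the bit pattern of the lead byte, without validating the rest of the sequence or rejecting lead bytes that UTF-8 never uses.

```c
/* Returns the number of bytes in the UTF-8 sequence that starts with `lead`,
   or 0 if `lead` cannot start a sequence (for example a continuation byte). */
int utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) {
        return 1;               /* 0xxxxxxx: single byte */
    }
    if ((lead & 0xE0) == 0xC0) {
        return 2;               /* 110xxxxx: two bytes */
    }
    if ((lead & 0xF0) == 0xE0) {
        return 3;               /* 1110xxxx: three bytes */
    }
    if ((lead & 0xF8) == 0xF0) {
        return 4;               /* 11110xxx: four bytes */
    }
    return 0;                   /* 10xxxxxx or other: not a lead byte */
}
```

For example, the letter A (byte 0x41) gives 1, and the lead byte 0xC3 of an accented Latin letter gives 2.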

 
The C extended source and extended execution character sets can both be encoded using a variable-length encoding. If the extended source or extended execution character set is encoded using a variable-length encoding, then the members of the basic character set that it includes shall each be encoded using a single byte. Also, the null character must be encoded using a single byte with all of its bits set to zero. Finally, a byte with all of its bits set to zero shall not occur as part of any other multibyte character.

Multibyte characters in the source file are mapped to the source character set in an implementation-defined manner at the start of the compilation process (translation phase 1), before preprocessing is performed.

In the execution environment, there is no built-in type that can hold a variable-length encoding as a single object, since built-in types have a fixed length, so one object cannot be 1, 2, 3 or 4 bytes long depending on the character. A multibyte character is therefore stored as a sequence of char values. Also, reading a variable-length encoding requires the use of special functions, in order to know where each multibyte character encoding starts and stops.

Examples of functions that can be used to convert from wchar_t to a variable-length encoding, and from a variable-length encoding to wchar_t, are wctomb and mbtowc, which are declared in stdlib.h. Their restartable counterparts, wcrtomb and mbrtowc, are declared in wchar.h.
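
As a sketch of how such functions find where each multibyte character starts and stops, the following function walks a string with mbtowc and counts its characters. The function name count_multibyte_chars is made up for this example, and the result depends on the current locale.

```c
#include <stdlib.h>
#include <string.h>

/* Counts the multibyte characters in s according to the current locale.
   Returns -1 if s contains an invalid multibyte sequence. */
long count_multibyte_chars(const char *s)
{
    size_t remaining = strlen(s);
    long count = 0;

    mbtowc(NULL, NULL, 0);  /* reset the internal conversion state */
    while (remaining > 0) {
        wchar_t wc;
        int n = mbtowc(&wc, s, remaining);  /* bytes used by this character */
        if (n < 0) {
            return -1;      /* invalid or incomplete sequence */
        }
        if (n == 0) {
            break;          /* null character reached */
        }
        s += n;
        remaining -= (size_t)n;
        ++count;
    }
    return count;
}
```

In the default "C" locale every character is one byte, so count_multibyte_chars("abc") is 3. After a call such as setlocale(LC_ALL, "") in a UTF-8 environment, a two-byte UTF-8 character would be counted as one character.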

Examples of functions that can be used to convert from char16_t and char32_t to multibyte characters, and back, are c16rtomb, c32rtomb, mbrtoc16 and mbrtoc32. These functions are declared in the header uchar.h.
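
Here is a minimal sketch of a round trip between char32_t and the multibyte encoding of the current locale, using c32rtomb and mbrtoc32. The function names encode_c32 and decode_c32 are made up for this example, and it assumes a platform that provides uchar.h.

```c
#include <limits.h>
#include <string.h>
#include <uchar.h>

/* Converts c to the multibyte encoding of the current locale.
   buf must have room for at least MB_LEN_MAX bytes.
   Returns the number of bytes written, or (size_t)-1 on error. */
size_t encode_c32(char32_t c, char *buf)
{
    mbstate_t state;
    memset(&state, 0, sizeof state);    /* initial conversion state */
    return c32rtomb(buf, c, &state);
}

/* Converts the first multibyte character found in the n bytes at s
   and stores it in *out. Returns the number of bytes consumed,
   0 for the null character, or (size_t)-1 or (size_t)-2 on error. */
size_t decode_c32(char32_t *out, const char *s, size_t n)
{
    mbstate_t state;
    memset(&state, 0, sizeof state);    /* initial conversion state */
    return mbrtoc32(out, s, n, &state);
}
```

For the letter A, both functions work on a single byte in any locale. Characters outside the basic set only convert successfully when the current locale has an encoding for them, for example a UTF-8 locale selected with setlocale.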