C++ character types : char , wchar_t , char8_t , char16_t and char32_t a tutorial !

 

A character can be a letter , a number , a sign , an emoji , or anything that can be written in any form .

Characters Sets in C++

In C++ there is the source character set , and the execution character set .

The source character set , is the set of characters available in the source environment , and which are used in writing , the string literals "Hey" , the character literals 'H' , and which are also used to write variable names , and the program as a whole .

When a C++ program is compiled , it is converted into machine code for the execution environment . The execution environment , is where the machine code is executed , hence what is needed , is to translate the character literals such as 'H' , and the string literals such as "Hey", written in the source character set , into the execution character set , if they are different .

This being said , the source character set and the execution character set must both contain some basic characters , which are called the basic character set .

The source file itself , can be saved in a third character set . When it is compiled , the source file is first converted from that character set to the source character set , then it is compiled , and its character and string literals are translated into the execution character set .

The basic character set is formed from the letters A to Z , a to z , from the digits 0 to 9 , from the new line character , the horizontal tab character , the vertical tab character , the form feed character , the space character , and the characters _ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " ' .

Additionally the basic execution character set , must contain the null character which has a value of 0 , and which is used to terminate C style strings , the backspace character , the carriage return character , and an alert character , which emits a beeping sound , in the execution environment .

The basic character set is represented in the C++ standard using the char type . The basic character set , is called the narrow character set . A narrow character set , is any character set , that can be represented using a single byte . Such a character set , is also called , an ordinary character set . An example of a character set , considered to be a narrow character set , is the ASCII character set , even though it contains additional characters .

What about other character sets , besides the basic character set , are they allowed by the C++ standard ? Yes , they are allowed to exist , and if their encoding does not fit in a single byte , they are called wide character sets .

These wide character sets , must contain the characters , which are stated to belong to the basic character set . They are also known as locale character sets , or extended character sets . An example of a wide character set , is the Unicode character set .

The wide character sets are represented by multiple types , all called wide character types , and which are wchar_t , char16_t , char32_t .

When translating string and character literals , from the source character set , into the execution character set , if no corresponding character exists in the execution character set , it is up to the implementation to define the resulting character .

For example , the source character set can be ISO-8859-1 , and the execution character set can be ISO-8859-8 . ISO-8859-1 contains the character À which is not to be found in ISO-8859-8 , so it is up to the implementation , to define which character in the execution character set , À maps to .

Ordinary and narrow characters

char , unsigned char , and signed char , are called ordinary character types .

char , unsigned char , signed char , and char8_t are called narrow character types . The narrow character types , all have a size of 1 byte .

char , unsigned char , and signed char

The char type can hold any member of the basic character set . The char type can be signed , as in , it can contain both positive and negative values , or it can be unsigned , as in , it can contain only non-negative values , this is implementation defined .

The char type has a size of 1 byte , the number of bits in a byte , is not defined by the C++ standard , but typically a byte , is formed of 8 bits .

This being said , the char type can represent more characters , than those found in the basic character set . In all cases , the characters found , in the basic character set , must be encoded using a non-negative numeric value , and the digit characters from 0 to 9 , must have consecutive values , in ascending order .

The char type , the unsigned char type , and the signed char type , must all have the same size and alignment . An alignment , is where in memory , an object can be placed .

All three character types are distinct : an unsigned char can only hold non-negative values , a signed char , can hold both positive and negative values , whereas a char , can either be signed or unsigned , this is implementation defined .

The C++ standard defines the minimum range , that the char , unsigned char , and signed char types can have .

Type          | Negative or zero value | Positive value
signed char   | -127                   | 127
unsigned char | 0                      | 255
char          | -127 or 0              | 127 or 255

An implementation can provide larger ranges . The actual ranges of the ordinary character types , are given by macros defined in the climits header .

#include<climits>
#include<iostream>

int main(void ){
    std::cout << "SCHAR_MIN is : " << SCHAR_MIN  << "\n";
    std::cout << "SCHAR_MAX is : " << SCHAR_MAX  << "\n";
    std::cout << "UCHAR_MAX is : " << UCHAR_MAX  << "\n";
    std::cout << "CHAR_MIN is  : " << CHAR_MIN  << "\n";
    std::cout << "CHAR_MAX is  : " << CHAR_MAX  << "\n"; }

/*Output :
SCHAR_MIN is : -128
SCHAR_MAX is : 127
UCHAR_MAX is : 255
CHAR_MIN is  : -128
CHAR_MAX is  : 127 */

Before C++20 , the C++ standard did not specify which signed representation is to be used in a host environment , it could be sign and magnitude , one’s complement , or two’s complement . Since C++20 , two’s complement is required .

Ordinary character and string literals

An ordinary character literal is the character literal enclosed by a single quote ' , for example 'a' is an ordinary character literal .

An ordinary character literal is of type char . It can contain members of the basic character set , such as r , and if it contains members of an extended character set , such as À , then these members are converted to their universal character name , in the start of the compilation process .

A universal character name is an escape sequence such as '\u00C0' , which represents the character 'À', hence a character literal can also include escape sequences .

An escape sequence is used , among other things , to represent a character , in the execution character set . For example if a horizontal tab is to be represented in the execution character set , it is represented in the source character set using the escape sequence \t , and it is physically shown in the execution environment , for example on a terminal or on a screen .

An ordinary character literal , when being converted to the execution character set , is converted to the encoding of this character , in the execution character set . So a horizontal tab \t , in the source character set , when being translated to the execution character set , and if the execution character set is ascii , it will have an encoding of 00001001 .

For ordinary character literals , the encoding storage destination , is the char type . The char type can only hold 1 byte , typically a byte is 8 bits .

Ordinary character sets , those that can be represented using a single byte , have an encoding of a character , which is the same as the character code point . A code point is just a way of numbering the characters in the character set , whereas the encoding is the representation of the code point , in the computer as a series of bits .

The available escape sequences in C++ are :

Escape sequence | Action | Description
\a | alert | Causes a beeping sound to be heard , in the execution environment .
\f | form feed | Causes the carriage to go to the start , of a new page .
\r | carriage return | Causes the carriage to return to the start of the current line .
\n | new line | Causes the carriage to go to the start of a new line .
\b | backspace | Causes the carriage to go back one space .
\t | horizontal tab | Causes the carriage to go to the next horizontal tab stop . A horizontal tab stop is typically every 8 characters , counting from 0 .
\v | vertical tab | Causes the carriage , to go to the next vertical tab stop .
\' | single quote | Allows the appearance of a single quote in a character literal . A single quote , cannot be placed directly in a character literal , without being escaped .
\" | double quote | Allows the appearance of a double quote in a string literal . A double quote cannot appear directly in a string literal , without being escaped .
\\ | backslash | Allows the appearance of a backslash character , in a string or a character literal . A backslash cannot appear , in a string or a character literal , without being escaped .
\? | question mark | Escapes a question mark , so that it is not interpreted as part of a trigraph . A trigraph is formed of two question marks , followed by a character .
\{1 to 3 octal digits} | Encoding as octal digits | The actual encoding of a character , represented using one to three octal digits .
\x{1 or more hexadecimal digits} | Encoding as hexadecimal digits | The actual encoding of a character represented as a series of hexadecimal digits .
\uhhhh | Character as universal character name | Represent a character using its universal character name , formed of four hex digits . The universal character name , is its code point in the Unicode character set .
\Uhhhhhhhh | Character as universal character name | Represent a character using , its universal character name formed of eight hex digits . The universal character name , is the code point of a character , in the Unicode character set .

A universal character name , must not be used to represent a character , in the basic character set , or to represent a control character .

#include<iostream>

int main(void ){
  /*An example of escape sequences
    usage .*/

  std::cout << '\a' ;
  /*Emit an alert or a beeping
    sound . */

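  std::cout << "hello5678You" << "\r\tMe";
  /*\r returns to the start of the line ,
    \t moves to the tab stop at column 8 ,
    then Me overwrites 8Y . Output on a
    typical terminal :
    hello567Meou */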

  std::cout << '\n' << '??)' << '\n';
  /*??) is a trigraph , and is
    replaced in the start of ,
    the compilation process with
    ] . The program must be compiled ,
    with the switch -trigraphs ,
    g++ prg.cpp -trigraphs otherwise
    trigraphs are not replaced , and
    '??)' is a multi-character literal
    of type int , printed as a number .
    Trigraphs were removed in C++17 .
    Output with -trigraphs :
    ] */

  std::cout << 'A' << '\x41' << '\101';
  /*\x41 is the encoding of the character
    A in hexadecimal , and \101 is
    the encoding of the character A
    in octal .
    Output :
    AAA */

  /*
  char var_c = '\u00C0';
  \u00C0 is the universal character name
  of the character À . If the execution
  character set is unicode , and its
  encoding is UTF-8 , this will cause
  a compiler error of :
    character too large for enclosing character literal type
  because the encoding of À will be the two bytes
  C380 , which does not fit in a single byte .*/ }

An ordinary string literal is a literal delimited by double quotes " , as in "Hello world" . It is an array of constant ordinary characters , each member of the array stores the encoding of an ordinary character , and the array is terminated with the null character . Adjacent string literals , are concatenated into one string literal .

#include<iostream>

int main(void ){
  const char *ptr_c = "Hello" "World";
  /*The string literal "Hello"
    and "World" are concatenated
    into one .
    ptr_c is a pointer to a null
    terminated array , of constant
    characters , hence the const
    qualifier , which is required
    since C++11 . Trying to change
    the value of one of the characters ,
    as in :
      *ptr_c = 'd';
    , is rejected by the compiler .*/

  char arr_c[ ] = "Hello World";
  /*Create an array containing
    the characters of the string
    literal Hello World , and
    which is null terminated .
    It is perfectly legal , to
    change the values of the
    elements in the created array .*/

  std::cout << arr_c << '\n';
  /*Output :
    Hello World .*/

  arr_c[0 ] = 'd';
  /*Set the value of the first
    element of the array to
    d .*/
  std::cout << arr_c << '\n';
  /*Output :
    dello World .*/ }

char8_t and char8_t character and string literals

char8_t has an underlying type of unsigned char , but is a distinct type .

A char8_t character literal can be specified using the prefix u8 , followed by single quotes enclosing a character , as in u8'a' .

Characters preceded by u8 are encoded , using utf-8 . utf-8 is a multibyte encoding , which encodes unicode characters using : 1 byte , 2 bytes , 3 bytes , or 4 bytes .

When encoding using one byte , utf-8 always sets the most significant bit , of the byte to 0 , hence the available encodings are between 0 and 0x7F in hexadecimal , or between 0 and 0111 1111 in binary . So , the utf-8 encoding of one byte , can only represent 128 characters , this works well for char8_t , which has a storage size of 1 byte .

The 128 characters represented using utf-8 1 byte encoding , are the ascii characters . The ascii character set , is a subset of unicode , and it is a superset of the basic character set .

A string literal that begins with u8 , followed by double quotes , optionally enclosing some characters , as in u8"À" , is a char8_t string literal , also known as , a utf-8 string literal .

A utf-8 string literal , contains the encoding of characters , of the unicode character set , using utf-8 . This results in an array of constant char8_t characters , containing the encoding of each character , in the char8_t string literal , terminated with a null character .

The encoding of each character , can be 1 , 2 , 3 or 4 bytes , depending on the character .

#include<iostream>

int main(void ){
  const char8_t *ptr_cc8 = u8"À";
  /*The encoding of À in utf-8 ,
    is C380 .
    ptr_cc8 , is a pointer to
    a constant character , which
    is the first element of an array
    of constant char8_t characters ,
    containing the encoding of À , and
    terminated with the null character ,
    C3 80 00 .*/
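  std::cout << std::hex
            << static_cast<unsigned>(ptr_cc8[0 ] ) << ' '
            << static_cast<unsigned>(ptr_cc8[1 ] ) << ' '
            << static_cast<unsigned>(ptr_cc8[2 ] ) << '\n';
  /*char8_t cannot be streamed to std::cout
    in C++20 , so each code unit is cast
    to unsigned first .
    Output : c3 80 0 */ }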

The escape sequences , described in the Ordinary character and string literals section , can be used in char8_t character and string literals.

Wide characters

The wide character types are wchar_t , char16_t , and char32_t . Each of them has a fixed size on a given implementation , and they are used to store the encoding of characters in the extended character sets .

wchar_t and wchar_t character and string literals

wchar_t , is a wide character type . It has an integer type , decided by the implementation , this integer type is called its underlying type .

The underlying type can be unsigned or it can be signed , so it can contain only positive or both positive and negative values , but there is no signed wchar_t or unsigned wchar_t .

A wide character literal , starts with an uppercase L , followed by a character , enclosed in single quotes , as in L'À' . The encoding of this character , is stored in the wchar_t type .

Under windows , wchar_t typically has a size of 16 bits , and under linux it typically has a size of 32 bits , so under windows it can store 16 bit encodings of characters , whereas under linux it can store 32 bit encodings of characters .

A wide character string literal starts with an uppercase L , followed by optional characters , enclosed in double quotes " , as in L"Hello world" . A wide character string literal , is stored as an array of constant wchar_t .

In C++ , wchar_t is a built-in type , so no header is needed to use it . The cwchar header contains , the definition of WCHAR_MAX and WCHAR_MIN , which are the max and min values storable in wchar_t .

The cwchar header , also contains utility functions , such as functions to get the length of a wide character string , or to convert a multibyte narrow character string , into a wide character string , or vice versa .

#include<stdio.h>
#include<cwchar>

int main(void ){
  printf("WCHAR_MIN is : %d \n", WCHAR_MIN );
  /*Output Under Linux ( x86-64 ) :
        WCHAR_MIN is : -2147483648
    Output Under Windows :
        WCHAR_MIN is : 0 */

  printf("WCHAR_MAX is : %d \n", WCHAR_MAX );
  /*Output Under Linux :
        WCHAR_MAX is : 2147483647
    Output Under Windows :
        WCHAR_MAX is : 65535 */

  wchar_t var_wc_BE = L'\U00010301';
  /* \U00010301 is the universal
    character name of the
    old italic letter BE . It is
    formed of \U followed by BE
    code point in hexadecimal
    in Unicode .
    The escape sequence is preceded ,
    with L , as such it is a wide
    character escape sequence .
    var_wc_BE is a wide character ,
    and it contains the encoding
    of the wide character literal .*/

  printf("Encoding of BE is : %#x\n", var_wc_BE );
  /*Print the stored encoding
    of old italic letter 𐌁 in
    hexadecimal .
    Output under linux :
        Encoding of BE is : 0x10301
    Output under windows :
        Encoding of BE is : 0xdf01

    Under linux , what is stored is the
    utf-32 encoding of the letter 𐌁 ,
    and which is 00010301 , whereas
    under windows what is stored is
    the utf-16 encoding of the letter
    𐌁 , and which is D800DF01 , since
    under windows wchar_t is 16 bits ,
    only the last 4 hex digits DF01 : are
    stored .*/

  const char *ptr_c_BE = "\U00010301";
  /* \U00010301 is the universal character
    name of the old italic letter BE .
    "\U00010301" is an ordinary string
    literal . A character from the
    extended character set is chosen ,
    and it was not specified that the
    string literal is a wide string
    literal .
    The compiler chooses to encode
    this string literal , using utf-8 ,
    which is a multibyte encoding , hence
    what is stored in the encoding is :
    F0 90 8C 81 */

  printf("Encoding of BE is : 0x%hhx%hhx%hhx%hhx \n", ptr_c_BE[0 ]  , ptr_c_BE[1 ] , ptr_c_BE[2 ] , ptr_c_BE[3 ] ) ;
  /*Print the encoding stored in ptr_c_BE ,
    in hexadecimal .
    Output under linux :
        Encoding of BE is : 0xf0908c81
    Output under windows :
        Encoding of BE is : 0xf0908c81 */


  const wchar_t *ptr_wc_BE = L"\U00010301";
  /*L"\U00010301" is a wide string literal ,
    its encoding is stored as utf-32 under
    linux , and as utf-16 under windows ,
    in both cases , it needs 4 bytes to be stored .*/

  printf("Encoding of BE is : 0x%x 0x%x \n", ptr_wc_BE[0 ]  , ptr_wc_BE[1 ] ) ;
    /*Output for linux , the utf-32
      encoding of BE which is the
      same as its code point
      U00010301 , followed by the
      null character :
        Encoding of BE is : 0x10301 0x0
      Output for windows , the utf-16
      encoding of BE :
        Encoding of BE is : 0xd800 0xdf01 .*/

  const char *ptr_c = "ab" ;
  /*"ab" is an ordinary string
    literal , ptr_c is a pointer
    to the first character ,
    in this constant array
    of characters . */

  printf("%s\n" , ptr_c );
  /*Print the string pointed by , 
    ptr_c . 
    Output :
    ab */

  wchar_t ptr_wc[3];
  /*Define an array of wide characters ,
    formed of three elements  .*/

  mbsrtowcs(ptr_wc , &ptr_c , 3 , 0 );
  /*Convert the ordinary string "ab" ,
    to the wide character string L"ab" .*/

  printf("%ls\n" , ptr_wc );
  /*Print the wide character string
    pointed by ptr_wc .
    Output :
    ab */ }

The escape sequences described in the Ordinary character and string literals section , can be used with wide string and character literals .

char16_t and char16_t character and string literals

char16_t has an underlying type of uint_least16_t , it is used to store , the utf-16 encoding of characters .

A char16_t character literal , starts with the lowercase letter u , followed by single quotes , enclosing a character , as in u'l' .

A char16_t string literal , starts with the lowercase letter u , followed by double quotes , enclosing some optional characters , as in u"Hey" .

The escape sequences , described earlier on , can be used in char16_t , string and character literals .

#include<cstdint>
/*Contain the min , and max values
  for integer types which
  have a least width , 
  specific width , fastest
  type with a least width 
  ... */

#include<iostream>

int main(void ){
    using namespace std ;

    cout << "UINT_LEAST16_MAX is : " <<  UINT_LEAST16_MAX << "\n" ;
    /*The max value that can be stored 
      in char16_t , is the same for 
      windows and linux , and 
      is UINT_LEAST16_MAX .
      The min value that can
      be stored is 0 .
      Output :
      UINT_LEAST16_MAX is : 65535 */

    /* char16_t var_c16 = u'\U00010301';
      This definition of var_c16 , will cause
      a compiler error , of character being ,
      too large for enclosing type .
      The character being used is the
      old italic letter BE , and is
      specified using its universal character
      name , this character has a utf-16
      encoding of D800 DF01 , so
      it needs 32 bits , and cannot fit
      in 16 bits .*/

    const char16_t *ptr_c16 = u"\U00010301";
    /*u"\U00010301" is a utf-16 string
      literal , so the encoding of
      the character old italic letter
      BE , specified using its universal
      character name , is stored in
      an array of constant char16_t
      which its first element is pointed
      by ptr_c16 .*/

    cout<< hex << "UTF-16 encoding of old italic BE is : " << static_cast<unsigned>(*ptr_c16 ) << static_cast<unsigned>(*(ptr_c16+1 ) ) <<endl ;
    /*Print the hexadecimal values stored ,
      in the first and second char16_t ,
      , both elements of the array pointed
      by ptr_c16 .
      Output  , the same thing for windows , 
      and linux :
      UTF-16 encoding of old italic BE is : d800df01 */ }

char32_t and char32_t character and string literals

char32_t has an underlying type of uint_least32_t , and it is used to store , the utf-32 encoding , of characters .

A utf-32 character literal , starts with the capital letter U , followed by a single quote , enclosing a character , as in U'a' .

A utf-32 string literal , starts with the capital letter U , followed by double quotes , enclosing some optional characters , as in U"a" .

The escape sequences , described earlier , can be used with utf-32 , string and character literals .

#include<cstdint>
/*The cstdint header contains
  the min and max values for
  integer types , of specific
  width , least width , fast
  of least width ...*/

#include<iostream>

int main(void ){
    using namespace std ;

    cout << "UINT_LEAST32_MAX is : " <<  UINT_LEAST32_MAX << "\n" ;
    /*The max value that can be stored ,
      in char32_t , is UINT_LEAST32_MAX .
      This is the same , for windows ,
      and linux .
      Output :
      UINT_LEAST32_MAX is : 4294967295 */

    char32_t var_c32 = U'\U00010301';
    /*U'\U00010301' is a char32_t
      character literal , it
      contains an escape sequence ,
      which represents the old italic
      letter BE . This character
      is encoded in utf-32 , and the
      encoding is the same as the code
      point , as such it is :
      00010301 in hex .*/

    cout<< hex << "UTF-32 encoding of old italic BE is : " << static_cast<uint_least32_t>(var_c32 ) <<endl ;
    /*Output , the same thing on windows ,
      and linux :
      UTF-32 encoding of old italic BE is : 10301 */ }

Raw string literals

A raw string literal , is a string literal which is stored as it is written , escape sequences are not interpreted , new lines and white spaces are preserved .

This literal has the following format :

R"opt-delimiter(characters-of-the-literal)opt-delimiter"

The optional delimiter can be formed of at most sixteen characters , which must be members of the basic character set , with the exception of : ( ) \ , the space character , the horizontal tab , the vertical tab , the form feed , and the new line .

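A raw string literal , can be any kind of string literal : the R is placed after the usual prefix , as in LR"(...)" for a wide character string literal , or u8R"(...)" , uR"(...)" , and UR"(...)" for the utf-8 , utf-16 , and utf-32 string literals .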

#include<iostream>

int main(void ){

  const char *ptr_c = "C:\\Windows\\System32\\drivers\\etc";
  /*This is a Windows path , the backslash
    character is escaped in the string 
    literal , as not to be 
    interpreted .*/

  std::cout << ptr_c << "\n";
  /*Print the string literal ,
    pointed by ptr_c .
    Output :
    C:\Windows\System32\drivers\etc */

  ptr_c = R"(C:\Windows\System32\drivers\etc)";
  /*In a raw string literal ,
    nothing is interpreted , it
    is stored as is , so there is
    no need to escape the backslash
    character in the path .*/

  std::cout << ptr_c << "\n";
  /*Print the string literal , 
    pointed by ptr_c .
    Output :
    C:\Windows\System32\drivers\etc */

  ptr_c	=R"DLM(Use a delimiter to be able to use )" in a raw string literal)DLM";
  /*An example of why to use a
    delimiter . This is done ,  
    in order to be able to use )"
    in a raw string literal .*/

  std::cout << ptr_c << "\n";
  /*Print the string literal pointed 
    by ptr_c .
    Output :
    Use a delimiter to be able to use )" in a raw string literal */ }