A character can be a letter, a digit, a symbol, an emoji, or anything else that can be written.
Character Sets in C++
In C++ there is a source character set and an execution character set.
The source character set is the set of characters available in the source environment. It is used to write string literals such as "Hey", character literals such as 'H', variable names, and the program as a whole.
When a C++ program is compiled, it is converted into machine code for the execution environment, which is where that machine code runs. Character literals such as 'H' and string literals such as "Hey", written in the source character set, must therefore be translated into the execution character set if the two differ.
That said, the source character set and the execution character set must both contain a common set of basic characters, called the basic character set.
The source file itself can be saved in a third character set. At compile time, the file is first mapped to the source character set and then compiled, and its character and string literals are translated into the execution character set.
The basic character set consists of the letters A to Z and a to z, the digits 0 to 9, the new-line, horizontal tab, vertical tab, form feed, and space characters, and the characters _ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
Additionally, the basic execution character set must contain the null character, which has a value of 0 and is used to terminate C-style strings, the backspace character, the carriage return character, and an alert character, which emits a beep in the execution environment.
The basic character set is represented in the C++ standard using the char type. The basic character set is a narrow character set, that is, a character set in which every character can be represented using a single byte. Such a character set is also called an ordinary character set. ASCII is an example of a narrow character set, even though it contains characters beyond the basic character set.
What about character sets other than the basic character set: does the C++ standard allow them? Yes, and if their encoding does not fit in a single byte, they are called wide character sets.
Wide character sets must contain all the characters of the basic character set. They are also known as locale character sets or extended character sets. Unicode is an example of a wide character set.
Wide character sets are represented by several types, all called wide character types: wchar_t, char16_t, and char32_t.
When string and character literals are translated from the source character set into the execution character set and a character has no counterpart in the execution character set, the implementation defines which character it is translated to.
For example, the source character set can be ISO-8859-1 and the execution character set ISO-8859-8. ISO-8859-1 contains the character À, which does not exist in ISO-8859-8, so the implementation defines which character of the execution character set À maps to.
Ordinary and narrow characters
char, unsigned char, and signed char are called ordinary character types.
char, unsigned char, signed char, and char8_t are called narrow character types. The narrow character types all have a size of 1 byte.
char , unsigned char , and signed char
The char type can hold any member of the basic character set. Whether char is signed, meaning it can hold both negative and non-negative values, or unsigned, meaning it can hold only non-negative values, is implementation-defined.
The char type has a size of 1 byte. The C++ standard does not fix the number of bits in a byte; it only requires at least 8, and a byte typically has exactly 8 bits.
That said, the char type can represent more characters than those in the basic character set. In all cases, the characters of the basic character set must be encoded with non-negative values, and the digits 0 through 9 must have consecutive, ascending values.
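Because the digits are guaranteed to have consecutive values, subtracting '0' from a digit character yields its numeric value whatever the execution character set is. The following is a minimal sketch of that idea; the helper name digit_value is hypothetical and chosen only for illustration.

```cpp
// Hypothetical helper: convert a decimal digit character to its value.
// It relies only on the guarantee that '0'..'9' have consecutive codes.
constexpr int digit_value(char c) {
    return (c >= '0' && c <= '9') ? c - '0' : -1;  // -1 marks a non-digit
}
```

For example, digit_value('7') evaluates to 7. The same trick is not guaranteed to work for letters, since the standard does not require their codes to be contiguous.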
The char, unsigned char, and signed char types must all have the same size and alignment. The alignment of a type determines the memory addresses at which an object of that type can be placed.
All three character types are distinct. An unsigned char can hold only non-negative values, a signed char can hold both negative and non-negative values, and a char is either signed or unsigned, which is implementation-defined.
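The distinctness of the three types and the signedness of plain char can be checked at compile time with the type_traits header. This is a small sketch; the helper name plain_char_is_signed is hypothetical, and its result depends on the implementation.

```cpp
#include <climits>
#include <type_traits>

// The three ordinary character types are distinct types, even though
// plain char behaves like one of the other two.
static_assert(!std::is_same<char, signed char>::value, "distinct types");
static_assert(!std::is_same<char, unsigned char>::value, "distinct types");
static_assert(sizeof(char) == sizeof(signed char) &&
              sizeof(char) == sizeof(unsigned char), "same size");

// Hypothetical helper: reports whether plain char is signed on the
// implementation that compiles this code.
constexpr bool plain_char_is_signed() {
    return std::is_signed<char>::value;
}
```

Since the types are distinct, a function overloaded on char, signed char, and unsigned char has three separate overloads.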
The C++ standard defines the minimum range that the char, unsigned char, and signed char types must support.
Type | Minimum value | Maximum value |
---|---|---|
signed char | -127 | 127 |
unsigned char | 0 | 255 |
char | -127 or 0 | 127 or 255 |
An implementation can provide larger ranges. The actual limits of the ordinary character types are defined in the climits header.
    #include <climits>
    #include <iostream>

    int main() {
        std::cout << "SCHAR_MIN is : " << SCHAR_MIN << "\n";
        std::cout << "SCHAR_MAX is : " << SCHAR_MAX << "\n";
        std::cout << "UCHAR_MAX is : " << UCHAR_MAX << "\n";
        std::cout << "CHAR_MIN is : " << CHAR_MIN << "\n";
        std::cout << "CHAR_MAX is : " << CHAR_MAX << "\n";
    }
    /* Output :
    SCHAR_MIN is : -128
    SCHAR_MAX is : 127
    UCHAR_MAX is : 255
    CHAR_MIN is : -128
    CHAR_MAX is : 127
    */
Before C++20, the C++ standard did not specify which signed representation a host environment uses: it could be sign and magnitude, one's complement, or two's complement. Since C++20, two's complement is required, so an 8-bit signed char ranges from -128 to 127.
Ordinary character and string literals
An ordinary character literal is a character enclosed in single quotes; for example, 'a' is an ordinary character literal.
An ordinary character literal is of type char. It can contain members of the basic character set, such as r. If it contains members of an extended character set, such as À, these are converted to their universal character names at the start of the compilation process.
A universal character name is an escape sequence such as '\u00C0', which represents the character 'À'; hence a character literal can also contain escape sequences.
An escape sequence is used, among other things, to represent a character of the execution character set. For example, a horizontal tab is written in the source character set as the escape sequence \t, and it takes physical effect in the execution environment, for example on a terminal or a screen.
When an ordinary character literal is translated to the execution character set, it becomes the encoding of that character in the execution character set. So if the execution character set is ASCII, the horizontal tab \t is translated to the encoding 00001001.
The encoding of an ordinary character literal is stored in a char. The char type holds only 1 byte, which is typically 8 bits.
In ordinary character sets, those that can be represented using a single byte, the encoding of a character is the same as its code point. A code point is the number assigned to a character in the character set, whereas the encoding is the representation of that code point in the computer as a series of bits.
The available escape sequences in C++ are:
Escape sequence | Action | Description |
---|---|---|
\a | alert | Causes a beeping sound to be heard in the execution environment. |
\f | form feed | Causes the carriage to go to the start of a new page. |
\r | carriage return | Causes the carriage to return to the start of the current line. |
\n | new line | Causes the carriage to go to the start of a new line. |
\b | backspace | Causes the carriage to go back one space. |
\t | horizontal tab | Causes the carriage to go to the next horizontal tab stop. A horizontal tab stop is typically every 8 characters, counting from 0. |
\v | vertical tab | Causes the carriage to go to the next vertical tab stop. |
\' | single quote | Allows a single quote to appear in a character literal. A single quote cannot be placed directly in a character literal without being escaped. |
\" | double quote | Allows a double quote to appear in a string literal. A double quote cannot appear directly in a string literal without being escaped. |
\\ | backslash | Allows a backslash to appear in a string or character literal. A backslash cannot appear in a string or character literal without being escaped. |
\? | question mark | Escapes a question mark so that it is not interpreted as part of a trigraph. A trigraph is formed of two question marks followed by a character. Trigraphs were removed in C++17. |
\{1 to 3 octal digits} | Encoding as octal digits | The actual encoding of a character, written using one to three octal digits. |
\x{1 or more hexadecimal digits} | Encoding as hexadecimal digits | The actual encoding of a character, written as a series of hexadecimal digits. |
\uhhhh | Character as universal character name | Represents a character using its universal character name, formed of four hex digits. The universal character name is the code point of the character in the Unicode character set. |
\Uhhhhhhhh | Character as universal character name | Represents a character using its universal character name, formed of eight hex digits. The universal character name is the code point of the character in the Unicode character set. |
A universal character name must not be used to represent a character in the basic character set, or to represent a control character.
    #include <iostream>

    int main() {
        /* An example of escape sequence usage. */

        std::cout << '\a';
        /* Emit an alert, typically a beeping sound. */

        std::cout << "hello5678You" << "\r\tMe";
        /* \r returns to the start of the line and \t moves to the tab stop
           at column 8, so on a typical terminal Me overwrites the two
           characters found there.
           Output on such a terminal :
           hello567Meou */

        std::cout << '\n' << '??)' << '\n';
        /* ??) is a trigraph, replaced at the start of the compilation
           process with ] . With g++ in its default mode, the program must
           be compiled with the switch -trigraphs :
           g++ prg.cpp -trigraphs
           Output with trigraphs enabled :
           ]
           Without the switch, the trigraph is not replaced : '??)' is then
           a multicharacter literal of type int, and an
           implementation-defined number is printed instead. */

        std::cout << 'A' << '\x41' << '\101';
        /* \x41 is the encoding of the character A in hexadecimal, and \101
           is its encoding in octal, assuming an ASCII-based execution
           character set.
           Output :
           AAA */

        /* char var_c = '\u00C0';
           \u00C0 is the universal character name of the character À . If
           the execution character set is Unicode encoded as UTF-8, this
           causes a compiler error :
           character too large for enclosing character literal type
           because the encoding of À is the two bytes C3 80, which do not
           fit in a single byte. */
    }
An ordinary string literal is a literal delimited by double quotes, as in "Hello world". It is an array of constant ordinary characters: each element of the array stores the encoding of one ordinary character, and the array is terminated with the null character. Adjacent string literals are concatenated into one string literal.
    #include <iostream>

    int main() {
        const char *ptr_c = "Hello" "World";
        /* The string literals "Hello" and "World" are concatenated into
           one. ptr_c points to a null-terminated array of constant
           characters, so it must be a pointer to const : since C++11 a
           string literal no longer converts to char * . Changing one of
           the characters through a cast is undefined behavior and
           typically fails at run time. */
        std::cout << ptr_c << '\n';
        /* Output : HelloWorld */

        char arr_c[] = "Hello World";
        /* Create an array containing the characters of the string literal
           Hello World, terminated with the null character. It is perfectly
           legal to change the elements of this array. */
        std::cout << arr_c << '\n';
        /* Output : Hello World */

        arr_c[0] = 'd';
        /* Set the first element of the array to d. */
        std::cout << arr_c << '\n';
        /* Output : dello World */
    }
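The size of the array behind a string literal, null terminator included, can be observed at compile time. The sketch below assumes a hypothetical helper named literal_size, which simply reports the array bound deduced from the literal.

```cpp
#include <cstddef>

// Hypothetical helper: number of elements in the array of an ordinary
// string literal, including the terminating null character.
template <std::size_t N>
constexpr std::size_t literal_size(const char (&)[N]) {
    return N;
}

// Adjacent literals are concatenated before the array is formed.
static_assert(literal_size("Hello" "World") == 11, "10 characters + null");
static_assert(literal_size("") == 1, "only the null character");
```

This is why an array such as char arr_c[] = "Hello World"; has 12 elements rather than 11.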
char8_t and char8_t character and string literals
char8_t has an underlying type of unsigned char, but is a distinct type.
A char8_t character literal is written with the prefix u8, followed by a character enclosed in single quotes, as in u8'a'.
Characters preceded by u8 are encoded using UTF-8. UTF-8 is a multibyte encoding, which encodes Unicode characters using 1, 2, 3, or 4 bytes.
When encoding with one byte, UTF-8 always sets the first (most significant) bit of the byte to 0, so the available encodings lie between 0 and 0x7F in hexadecimal, or between 0 and 0111 1111 in binary. The one-byte form of UTF-8 can therefore represent only 128 characters, which works well for a char8_t character literal, since char8_t has a storage size of 1 byte.
The 128 characters represented by the 1-byte UTF-8 encoding are the ASCII characters. The ASCII character set is a subset of Unicode, and it is a superset of the basic character set.
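The leading bits of the first byte tell how many bytes a UTF-8 sequence occupies: a 0 bit marks the one-byte ASCII form, while the bit patterns 110, 1110, and 11110 mark two, three, and four bytes. The sketch below illustrates this under those assumptions; the function name utf8_sequence_length is hypothetical, and the function does not validate the bytes that follow.

```cpp
// Hypothetical helper: length in bytes of a UTF-8 sequence, deduced from
// its first byte. Returns 0 for a byte that cannot start a sequence.
constexpr int utf8_sequence_length(unsigned char lead) {
    return (lead & 0x80) == 0x00 ? 1   // 0xxxxxxx : ASCII, one byte
         : (lead & 0xE0) == 0xC0 ? 2   // 110xxxxx : two bytes
         : (lead & 0xF0) == 0xE0 ? 3   // 1110xxxx : three bytes
         : (lead & 0xF8) == 0xF0 ? 4   // 11110xxx : four bytes
         : 0;                          // 10xxxxxx or 11111xxx : no valid start
}
```

For instance, the first byte of À in UTF-8 is 0xC3, for which the function returns 2.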
A string literal that begins with u8, followed by double quotes optionally enclosing some characters, as in u8"À", is a char8_t string literal, also known as a UTF-8 string literal.
A UTF-8 string literal contains the UTF-8 encoding of characters of the Unicode character set. This results in an array of constant char8_t elements that holds the encoding of each character in the literal and is terminated with a null character.
The encoding of each character can be 1, 2, 3, or 4 bytes long, depending on the character.
    #include <iostream>

    int main() {
        const char8_t *ptr_cc8 = u8"À";
        /* The encoding of À in UTF-8 is C3 80. ptr_cc8 points to the first
           element of an array of constant char8_t, which contains the
           encoding of À and is terminated with the null character :
           C3 80 00. */

        std::cout << std::hex
                  << static_cast<unsigned>(ptr_cc8[0])
                  << static_cast<unsigned>(ptr_cc8[1])
                  << static_cast<unsigned>(ptr_cc8[2]);
        /* In C++20 a char8_t cannot be streamed to std::cout directly, so
           each element is cast to unsigned first. The null character is
           printed as a single 0.
           Output :
           c3800 */
    }
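To make the 1 to 4 byte rule concrete, the following sketch encodes a Unicode code point into UTF-8 by hand. It is an illustration rather than what a compiler literally does: the names Utf8Bytes, byte_of, and encode_utf8 are hypothetical, the code assumes a valid code point (no surrogate, nothing above 0x10FFFF), and it needs C++14 or later.

```cpp
#include <cstdint>

// Hypothetical result type: up to four UTF-8 bytes and how many are used.
struct Utf8Bytes {
    std::uint8_t bytes[4];
    int length;
};

// Hypothetical helper: keep only the low 8 bits of a value.
constexpr std::uint8_t byte_of(std::uint32_t v) {
    return static_cast<std::uint8_t>(v & 0xFF);
}

// Sketch of UTF-8 encoding; cp is assumed to be a valid code point.
constexpr Utf8Bytes encode_utf8(std::uint32_t cp) {
    if (cp < 0x80) {        // 7 bits  : 0xxxxxxx
        return Utf8Bytes{{byte_of(cp), 0, 0, 0}, 1};
    }
    if (cp < 0x800) {       // 11 bits : 110xxxxx 10xxxxxx
        return Utf8Bytes{{byte_of(0xC0 | (cp >> 6)),
                          byte_of(0x80 | (cp & 0x3F)), 0, 0}, 2};
    }
    if (cp < 0x10000) {     // 16 bits : 1110xxxx 10xxxxxx 10xxxxxx
        return Utf8Bytes{{byte_of(0xE0 | (cp >> 12)),
                          byte_of(0x80 | ((cp >> 6) & 0x3F)),
                          byte_of(0x80 | (cp & 0x3F)), 0}, 3};
    }
    // 21 bits : 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return Utf8Bytes{{byte_of(0xF0 | (cp >> 18)),
                      byte_of(0x80 | ((cp >> 12) & 0x3F)),
                      byte_of(0x80 | ((cp >> 6) & 0x3F)),
                      byte_of(0x80 | (cp & 0x3F))}, 4};
}
```

With this sketch, encode_utf8(0xC0) yields the two bytes C3 80, the same bytes stored for u8"À".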
The escape sequences described in the Ordinary character and string literals section can be used in char8_t character and string literals.
Wide characters
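The wide character types are wchar_t, char16_t, and char32_t. An object of one of these types has a fixed size, and they are used to store the encoding of characters in the extended character sets. A character whose encoding does not fit in a single object, such as a character that needs two 16-bit units in UTF-16, occupies several of them.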
wchar_t and wchar_t character and string literals
wchar_t is a wide character type. It is represented like an integer type chosen by the implementation, which is called its underlying type.
The underlying type can be signed or unsigned, so wchar_t can hold either both negative and non-negative values or only non-negative values, but there is no signed wchar_t or unsigned wchar_t type.
A wide character literal starts with an uppercase L, followed by a character enclosed in single quotes, as in L'À'. The encoding of this character is stored in a wchar_t.
Under Windows, wchar_t typically has a size of 16 bits, and under Linux it typically has a size of 32 bits. So under Windows it can store 16-bit encodings of characters, whereas under Linux it can store 32-bit encodings of characters.
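The width and signedness of wchar_t on a given platform can be queried at compile time. The sketch below uses two hypothetical helpers, wchar_bits and wchar_is_signed; their results depend on the implementation.

```cpp
#include <climits>
#include <cwchar>
#include <type_traits>

// Hypothetical helper: number of bits in a wchar_t on this platform.
constexpr int wchar_bits() {
    return static_cast<int>(sizeof(wchar_t) * CHAR_BIT);
}

// Hypothetical helper: whether wchar_t can hold negative values here.
constexpr bool wchar_is_signed() {
    return std::is_signed<wchar_t>::value;
}
```

Following the sizes given above, wchar_bits() would typically be 32 under Linux and 16 under Windows.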
A wide character string literal starts with an uppercase L, followed by optional characters enclosed in double quotes, as in L"Hello world". A wide character string literal is stored as an array of constant wchar_t.
wchar_t is a built-in type in C++, so no header is needed to use it. The cwchar header defines WCHAR_MIN and WCHAR_MAX, which are the minimum and maximum values storable in a wchar_t.
The cwchar header also declares utility functions, for example to get the length of a wide character string, or to convert a multibyte narrow character string into a wide character string and vice versa.
    #include <cstdio>
    #include <cwchar>

    int main() {
        std::printf("WCHAR_MIN is : %d \n", WCHAR_MIN);
        /* Output under Linux :
           WCHAR_MIN is : -2147483648
           Output under Windows :
           WCHAR_MIN is : 0 */

        std::printf("WCHAR_MAX is : %d \n", WCHAR_MAX);
        /* Output under Linux :
           WCHAR_MAX is : 2147483647
           Output under Windows :
           WCHAR_MAX is : 65535 */

        wchar_t var_wc_BE = L'\U00010301';
        /* \U00010301 is the universal character name of the Old Italic
           letter BE : \U followed by the Unicode code point of BE in
           hexadecimal. The literal is prefixed with L, so it is a wide
           character literal, and var_wc_BE contains its encoding. */

        std::printf("Encoding of BE is : %#x\n",
                    static_cast<unsigned>(var_wc_BE));
        /* Print the stored encoding of the Old Italic letter 𐌁 in
           hexadecimal.
           Output under Linux :
           Encoding of BE is : 0x10301
           Output under Windows :
           Encoding of BE is : 0xdf01
           Under Linux, what is stored is the UTF-32 encoding of 𐌁, which
           is 00010301. Under Windows, the UTF-16 encoding of 𐌁 is
           D800 DF01, and since wchar_t is 16 bits there, only the last
           4 hex digits DF01 are stored. */

        const char *ptr_c_BE = "\U00010301";
        /* "\U00010301" is an ordinary string literal that holds a
           character from the extended character set. The compiler chooses
           to encode this literal using UTF-8, a multibyte encoding, so
           the stored bytes are : F0 90 8C 81 */

        std::printf("Encoding of BE is : 0x%hhx%hhx%hhx%hhx \n",
                    ptr_c_BE[0], ptr_c_BE[1], ptr_c_BE[2], ptr_c_BE[3]);
        /* Print the encoding stored in ptr_c_BE in hexadecimal.
           Output under Linux :
           Encoding of BE is : 0xf0908c81
           Output under Windows :
           Encoding of BE is : 0xf0908c81 */

        const wchar_t *ptr_wc_BE = L"\U00010301";
        /* L"\U00010301" is a wide string literal, stored as UTF-32 under
           Linux and as UTF-16 under Windows. In both cases the character
           needs 4 bytes to be stored. */

        std::printf("Encoding of BE is : 0x%x 0x%x \n",
                    static_cast<unsigned>(ptr_wc_BE[0]),
                    static_cast<unsigned>(ptr_wc_BE[1]));
        /* Output under Linux, the UTF-32 encoding of BE, which is the same
           as its code point, followed by the null character :
           Encoding of BE is : 0x10301 0x0
           Output under Windows, the UTF-16 encoding of BE :
           Encoding of BE is : 0xd800 0xdf01 */

        const char *ptr_c = "ab";
        /* "ab" is an ordinary string literal, and ptr_c points to the
           first character of this constant array of characters. */

        std::printf("%s\n", ptr_c);
        /* Print the string pointed to by ptr_c.
           Output : ab */

        wchar_t ptr_wc[3];
        /* Define an array of three wide characters. */

        std::mbsrtowcs(ptr_wc, &ptr_c, 3, nullptr);
        /* Convert the ordinary string "ab" to the wide character string
           L"ab", including its null terminator. */

        std::printf("%ls\n", ptr_wc);
        /* Print the wide character string stored in ptr_wc.
           Output : ab */
    }
The escape sequences described in the Ordinary character and string literals section can be used in wide string and character literals.
char16_t and char16_t character and string literals
char16_t has an underlying type of uint_least16_t, and it is used to store the UTF-16 encoding of characters.
A char16_t character literal starts with the lowercase letter u, followed by a character enclosed in single quotes, as in u'l'.
A char16_t string literal starts with the lowercase letter u, followed by double quotes enclosing some optional characters, as in u"Hey".
The escape sequences described earlier can be used in char16_t string and character literals.
    #include <cstdint>
    /* Contains the min and max values of the integer types that have a
       least width, a specific width, or are the fastest with a least
       width. */
    #include <iostream>

    int main() {
        using namespace std;

        cout << "UINT_LEAST16_MAX is : " << UINT_LEAST16_MAX << "\n";
        /* The max value that can be stored in char16_t is the same for
           Windows and Linux, and is UINT_LEAST16_MAX. The min value that
           can be stored is 0.
           Output :
           UINT_LEAST16_MAX is : 65535 */

        /* char16_t var_c16 = u'\U00010301';
           This definition of var_c16 causes a compiler error about the
           character being too large for the enclosing type. The character
           is the Old Italic letter BE, specified using its universal
           character name. Its UTF-16 encoding is D800 DF01, so it needs
           32 bits and cannot fit in 16 bits. */

        const char16_t *ptr_c16 = u"\U00010301";
        /* u"\U00010301" is a UTF-16 string literal, so the encoding of the
           Old Italic letter BE is stored in an array of constant char16_t
           whose first element is pointed to by ptr_c16. */

        cout << hex << "UTF-16 encoding of old italic BE is : "
             << static_cast<unsigned>(ptr_c16[0])
             << static_cast<unsigned>(ptr_c16[1]) << endl;
        /* Print the hexadecimal values of the first and second char16_t
           of the array. They are cast to unsigned, because in C++20 a
           char16_t cannot be streamed to cout directly.
           Output, the same for Windows and Linux :
           UTF-16 encoding of old italic BE is : d800df01 */
    }
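The pair D800 DF01 is a surrogate pair: code points from 0x10000 upward do not fit in 16 bits, so UTF-16 subtracts 0x10000 and splits the remaining 20 bits between two code units. The sketch below shows this computation. The names Utf16Units and encode_utf16 are hypothetical, the code assumes a valid code point, and it needs C++14 or later.

```cpp
#include <cstdint>

// Hypothetical result type: one or two UTF-16 code units.
struct Utf16Units {
    std::uint16_t units[2];
    int length;
};

// Sketch of UTF-16 encoding; cp is assumed to be a valid code point.
constexpr Utf16Units encode_utf16(std::uint32_t cp) {
    if (cp < 0x10000) {
        // Fits in a single 16-bit code unit.
        return Utf16Units{{static_cast<std::uint16_t>(cp), 0}, 1};
    }
    const std::uint32_t v = cp - 0x10000;  // 20 significant bits remain
    return Utf16Units{
        {static_cast<std::uint16_t>(0xD800 + (v >> 10)),     // high surrogate
         static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF))},  // low surrogate
        2};
}
```

Applied to the Old Italic letter BE, encode_utf16(0x10301) gives the units D800 and DF01, matching the output of the program above.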
char32_t and char32_t character and string literals
char32_t has an underlying type of uint_least32_t, and it is used to store the UTF-32 encoding of characters.
A UTF-32 character literal starts with the capital letter U, followed by a character enclosed in single quotes, as in U'a'.
A UTF-32 string literal starts with the capital letter U, followed by double quotes enclosing some optional characters, as in U"a".
The escape sequences described earlier can be used in UTF-32 string and character literals.
    #include <cstdint>
    /* The cstdint header contains the min and max values of the integer
       types of specific width, least width, and fastest with a least
       width. */
    #include <iostream>

    int main() {
        using namespace std;

        cout << "UINT_LEAST32_MAX is : " << UINT_LEAST32_MAX << "\n";
        /* The max value that can be stored in char32_t is
           UINT_LEAST32_MAX. This is the same for Windows and Linux.
           Output :
           UINT_LEAST32_MAX is : 4294967295 */

        char32_t var_c32 = U'\U00010301';
        /* U'\U00010301' is a char32_t character literal. It contains an
           escape sequence that represents the Old Italic letter BE. The
           character is encoded in UTF-32, and the encoding is the same as
           the code point : 00010301 in hex. */

        cout << hex << "UTF-32 encoding of old italic BE is : "
             << static_cast<uint_least32_t>(var_c32) << endl;
        /* The value is cast to an integer type, because in C++20 a
           char32_t cannot be streamed to cout directly.
           Output, the same on Windows and Linux :
           UTF-32 encoding of old italic BE is : 10301 */
    }
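The three Unicode literal kinds store the same characters in a different number of array elements. The sketch below compares them at compile time for the letter a followed by the Old Italic letter BE; the helper name code_units is hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper: number of elements in a string literal of any
// character type, not counting the terminating null character.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// "a" followed by the Old Italic letter BE (U+10301).
static_assert(code_units(U"a\U00010301") == 2, "UTF-32 : 1 + 1 elements");
static_assert(code_units(u"a\U00010301") == 3, "UTF-16 : 1 + 2 elements");
static_assert(code_units(u8"a\U00010301") == 5, "UTF-8 : 1 + 4 elements");
```

Only in the UTF-32 literal does the element count equal the number of characters.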
Raw string literals
A raw string literal is a string literal that is stored exactly as it is written: escape sequences are not interpreted, and new lines and white space are preserved.
This literal has the following format:
R"opt-delimiter(characters-of-the-literal)opt-delimiter"
The optional delimiter can be formed of at most sixteen characters, which must be members of the basic character set, with the exception of ( ) \ , the space character, the horizontal tab, the vertical tab, the form feed, and the new line.
A raw string literal can be any kind of string literal. For example, a wide character raw string literal is written with the prefix LR, as in LR"(Hello)".
    #include <iostream>

    int main() {
        const char *ptr_c = "C:\\Windows\\System32\\drivers\\etc";
        /* This is a Windows path. The backslash character is escaped in
           the string literal, so that it is not interpreted as the start
           of an escape sequence. */
        std::cout << ptr_c << "\n";
        /* Print the string literal pointed to by ptr_c.
           Output :
           C:\Windows\System32\drivers\etc */

        ptr_c = R"(C:\Windows\System32\drivers\etc)";
        /* In a raw string literal nothing is interpreted, it is stored as
           is, so there is no need to escape the backslash characters in
           the path. */
        std::cout << ptr_c << "\n";
        /* Print the string literal pointed to by ptr_c.
           Output :
           C:\Windows\System32\drivers\etc */

        ptr_c = R"DLM(Use a delimiter to be able to use )" in a raw string literal)DLM";
        /* An example of why to use a delimiter : it makes it possible to
           use )" inside a raw string literal. */
        std::cout << ptr_c << "\n";
        /* Print the string literal pointed to by ptr_c.
           Output :
           Use a delimiter to be able to use )" in a raw string literal */
    }
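The difference between an ordinary and a raw string literal also shows in their lengths, which can be compared at compile time. This is a small sketch with a hypothetical helper named literal_length.

```cpp
#include <cstddef>

// Hypothetical helper: number of characters in a string literal of any
// character type, not counting the terminating null character.
template <typename CharT, std::size_t N>
constexpr std::size_t literal_length(const CharT (&)[N]) {
    return N - 1;
}

// In an ordinary literal, \n is a single new-line character.
static_assert(literal_length("a\nb") == 3, "escape sequence interpreted");
// In a raw literal, the backslash and the n are kept as two characters.
static_assert(literal_length(R"(a\nb)") == 4, "stored as written");
// The same holds for a wide raw string literal.
static_assert(literal_length(LR"(a\nb)") == 4, "wide raw string literal");
// With the delimiter x, the two characters )" can appear in the content.
static_assert(literal_length(R"x()")x") == 2, "content is )\" only");
```

The delimiter is not part of the stored string, so it does not count toward the length.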