Character repertoires in lisp
A character repertoire, such as ASCII, Latin alphabet 1, or Unicode, is a set of characters.
Each character in a repertoire, such as a, is assigned a code point. A code point is just a number. For example, in ASCII, the code point for a is 97 in decimal, or 61 in hexadecimal.
Each code point has an encoding. An encoding is the representation of the code point in the computer. For example, in ASCII, the character a is encoded the same as its code point, which is 61 in hexadecimal, or 0110 0001 in binary.
That said, the characters that can be used in a lisp implementation are defined by the character repertoire, or repertoires, that the implementation supports.
At the minimum, all lisp implementations must support some basic characters, called the standard characters. The set formed by these characters is called the standard character repertoire, standard-char.
The characters that all lisp implementations must support are: a-z, A-Z, 0-9, the punctuation characters
! ? $ " ' ` . : , ; * + - / | \ ~ ^ _ < = > # % @ & ( ) [ ] { }
the non-graphic character new line, and the graphic character space.
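As a quick check of the list above, the following sketch counts the standard characters among the first 128 character codes. It assumes an ASCII-compatible implementation, and the helper name count-standard-chars-below is made up for this illustration. On such an implementation the count should be 96: 95 graphic characters plus the new line.

```lisp
;; Sketch: count the standard characters among the first 128 codes.
;; Assumption: codes 0-127 follow ASCII in this implementation.
(defun count-standard-chars-below (limit)
  "Count the characters with a code below LIMIT that satisfy STANDARD-CHAR-P."
  (loop for code from 0 below limit
        for ch = (code-char code)          ; CODE-CHAR may return NIL
        count (and ch (standard-char-p ch))))

(count-standard-chars-below 128)
```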
Character types in lisp
Characters in lisp can be used to write a program, or they can represent themselves, as data. The lisp reader reads the typed-in characters one by one. Each character read has a syntax type, which affects how the lisp reader interprets it.
Characters can be of the following types:
A macro character. The standard macro characters in lisp are: ( ) ' ` , ; " #
A macro character affects the parsing of the characters that follow it. When a macro character is encountered, its associated reader macro function is called. The reader macro function reads the following characters and returns an object, or nothing at all, as is the case for a comment.
For example, the double quote macro character " reads a string literal, as in "Hello world". The sharp sign macro character #, followed by a backslash \ and a single character or a character name, reads a literal character, as in #\a. Characters following the semicolon macro character ; are ignored up to the end of the line. The single quote macro character ' quotes the object that follows it, so if that object is a symbol or a cons, the read eval loop will not evaluate it. For example:
> '(+ 1 2)
;;;The cons (+ 1 2)
;;;is not evaluated.
(+ 1 2)
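The effect of the macro characters can also be observed without the read eval loop, by handing strings to the reader. The sketch below is only an illustration; read-one is a hypothetical helper, not a standard function.

```lisp
;; Sketch: ask the standard reader what each macro character produces.
(defun read-one (text)
  "Read a single object from the string TEXT using the standard syntax."
  (with-standard-io-syntax
    (values (read-from-string text))))

(read-one "'(+ 1 2)")                            ; the list (QUOTE (+ 1 2))
(read-one "\"Hello world\"")                     ; a string
(read-one "#\\a")                                ; the character a
(read-one (format nil "; ignored comment~%42"))  ; the integer 42
```

The first call shows that the single quote does not evaluate anything by itself: the reader simply wraps the next object in a quote form.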
An escape character. An escape character is not a macro character: no reader macro function is called, and no object is returned.
The single escape character, the backslash \, escapes the one character that follows it. An escaped character is treated as alphabetic, and its case is preserved. For example:
> \'
;;;The macro character ' is escaped.
;;;It is now treated as an alphabetic
;;;character, instead of being
;;;a macro character.
;;;One or more alphabetic characters
;;;can form a symbol.
;;;The symbol is evaluated by the
;;;read eval loop, which causes an
;;;error, as no value is associated
;;;with the symbol '.
*** - SYSTEM::READ-EVAL-PRINT: variable |'| has no value
The following restarts are available:
The multiple escape characters, the vertical bars |, escape the characters they enclose. Enclosed characters are treated as alphabetic, and their case is preserved. A vertical bar or a backslash appearing between vertical bars must itself be escaped, using a single escape character.
> |a(`;b|
;;;Characters in vertical bars
;;;are escaped and treated as
;;;alphabetic, and their case is
;;;preserved. The symbol
;;;|a(`;b| has no value, and
;;;this causes an error.
*** - SYSTEM::READ-EVAL-PRINT: variable |a(`;b| has no value
The following restarts are available:
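To look at escaped tokens without triggering the unbound variable error, the symbol can be read from a string and its name inspected. This is a minimal sketch that assumes the standard readtable is in effect; token-name is a hypothetical helper.

```lisp
;; Sketch: read a token and return the name of the resulting symbol.
(defun token-name (text)
  "Read TEXT as one symbol and return its name as a string."
  (symbol-name (read-from-string text)))

(token-name "\\'")        ; the name is the single character '
(token-name "|a(`;b|")    ; the name is a(`;b
(token-name "|a\\|b|")    ; the name is a|b, the inner bar is escaped
```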
A constituent character. The standard constituent characters are the Backspace, and A-Z a-z 0-9 ! $ % & * + - . / : < = > ? @ [ ] ^ _ { } ~
The Rubout, which is the delete character in ASCII, has the whitespace syntax type in standard syntax.
The lisp reader forms tokens from the constituent characters and the escaped characters that it reads. A token formed by the lisp reader can either be a number, such as 1 or 1.2, or a symbol, such as a or \'. An escaped character is treated as a constituent character.
A number does not have a name, and it evaluates to itself. A symbol has a name, which is what is read by the lisp reader, and it can have values associated with it.
The lisp reader converts constituent characters according to the readtable case, which can be read by using (readtable-case *readtable*), and which is :upcase by default. Constituent characters are therefore converted to uppercase, unless a character is escaped, in which case its case is preserved.
> (readtable-case *readtable*)
;;;Print the case for constituent characters.
:UPCASE
> 'Ab
;;;b is converted to
;;;uppercase.
AB
> 'A\b
;;;The case of b is preserved.
|Ab|
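The readtable case can be changed with setf on readtable-case. The sketch below does so on a private copy of the standard readtable, so that the global reader is left untouched; name-with-case is a hypothetical helper.

```lisp
;; Sketch: read the same token under different readtable cases.
(defun name-with-case (text mode)
  "Read TEXT as a symbol under readtable case MODE and return the symbol's name."
  (let ((*readtable* (copy-readtable nil)))   ; fresh copy of the standard readtable
    (setf (readtable-case *readtable*) mode)
    (symbol-name (read-from-string text))))

(name-with-case "Ab" :upcase)     ; "AB"
(name-with-case "Ab" :preserve)   ; "Ab"
(name-with-case "Ab" :downcase)   ; "ab"
(name-with-case "A\\b" :upcase)   ; "Ab", the escaped b keeps its case
```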
A whitespace character. Among the standard characters, the whitespace characters are the space and the new line. In standard syntax, the Tab, Page, Return, Linefeed, and Rubout characters are also whitespace. Whitespace characters are used to separate tokens.
> '|a b|
;;;One token is formed, since
;;;vertical bar escape characters
;;;are used.
|a b|
> (+ 2 3)
;;;A list of three tokens:
;;;+, 2, and 3.
5
Literal characters
Characters can represent themselves as data; these are the literal characters.
A literal character is input by using #\ followed by a single character, or by a character name.
If #\ is followed by a single character, the character's case is preserved. For example, the literal character capital A is entered as #\A, and the lowercase character a is entered as #\a.
If #\ is followed by a character name, the name is converted to uppercase, and the character that has this name is the resulting character literal. For example, in an implementation that names characters after their Unicode names, #\LATIN_CAPITAL_LETTER_A represents A, and #\LATIN_SMALL_LETTER_A represents a.
Depending on whether printer escaping is enabled or disabled, a literal character is either output with #\ preceding it, or it is just output as is. Whether printer escaping is enabled can be checked by using *print-escape*.
> *print-escape*
T
> #\a
;;;The input is #\a, and the output
;;;is #\a.
#\a
> #\A
;;;The input is #\A, and the output
;;;is #\A.
#\A
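The difference between the two output forms can be seen by printing to a string. In this short sketch, prin1-to-string prints with escaping enabled and princ-to-string prints with escaping disabled; write-to-string follows *print-escape* unless the :escape argument is given.

```lisp
;; Sketch: the same character printed with and without printer escaping.
(prin1-to-string #\a)               ; the three characters #\a
(princ-to-string #\a)               ; the single character a
(write-to-string #\A :escape nil)   ; the single character A
(let ((*print-escape* nil))
  (write-to-string #\a))            ; the single character a
```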
Character categories
A character in lisp can belong to one or more categories.
A graphic character is a character that has a glyph, used to display the character. All standard characters besides the new line character are graphic characters.
The predicate function graphic-char-p can be used to check whether a character is graphic.
> (graphic-char-p #\Space)
;;;The space character is a graphic
;;;character.
T
> (graphic-char-p #\Newline)
;;;The new line character is not graphic.
;;;It is informally called a control character.
NIL
An alphabetic character is a character which is also graphic. The standard characters which are alphabetic in lisp are: A-Z, a-z.
An implementation defined character that has a case must be alphabetic. If it has no case, then whether it is alphabetic is implementation defined.
To check if a character is alphabetic, the predicate function alpha-char-p can be used.
> (alpha-char-p #\1)
;;;The character 1 is not alphabetic.
NIL
> (alpha-char-p #\a)
;;;The character a is alphabetic.
T
A numeric character is also a graphic character. The standard numeric characters are 0-9. An implementation may define other numeric characters.
An alphanumeric character is a graphic character which is either numeric or alphabetic. The standard characters which are alphanumeric are: a-z, A-Z, and 0-9.
The predicate function alphanumericp can be used to check if a character is alphanumeric.
> (alphanumericp #\a)
;;;Check if a is alphanumeric,
;;;returns true.
T
> (alphanumericp #\1)
;;;Check if 1 is alphanumeric,
;;;returns true.
T
> (and (alphanumericp #\1) (not (alpha-char-p #\1)))
;;;Check if a character is numeric, by checking
;;;if it is alphanumeric and is not alphabetic.
;;;Here, check if 1 is numeric.
T
A cased character is an alphabetic character that is either uppercase or lowercase, and that has a counterpart character in the other case.
The standard uppercase characters are A-Z, and the standard lowercase characters are a-z.
The predicate functions upper-case-p and lower-case-p can be used to check if a character is uppercase or lowercase.
The both-case-p predicate function can be used to check if a character has both an uppercase and a lowercase version. Some non-standard characters might not be cased; for example, Arabic characters do not have a case.
The functions char-downcase and char-upcase can be used to get the lowercase and uppercase versions of a character.
> (upper-case-p #\A)
;;;Check if the character A
;;;is uppercase.
T
> (lower-case-p #\a)
;;;Check if the character a
;;;is lowercase.
T
> (both-case-p #\a)
;;;Check if the character a
;;;can be uppercase and
;;;lowercase.
T
> (both-case-p #\1)
;;;Check if the character 1
;;;can be uppercase and lowercase.
NIL
> (both-case-p #\ARABIC_LETTER_ALEF)
;;;Check if the Arabic character alef
;;;has uppercase and lowercase versions.
NIL
> (char-upcase #\a)
;;;Get the uppercase version
;;;of the character a.
#\A
> (char-downcase #\A)
;;;Get the lowercase version
;;;of the character A.
#\a
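The case predicates and the case conversion functions combine naturally. As a small sketch, the hypothetical helper char-swap-case below flips the case of a cased character and returns any other character unchanged.

```lisp
;; Sketch: flip the case of a character using the standard case functions.
(defun char-swap-case (ch)
  "Return CH with its case flipped; characters without case are returned unchanged."
  (cond ((upper-case-p ch) (char-downcase ch))
        ((lower-case-p ch) (char-upcase ch))
        (t ch)))

(char-swap-case #\a)                      ; #\A
(char-swap-case #\1)                      ; #\1
(map 'string #'char-swap-case "Hello 1")  ; "hELLO 1"
```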
A digit character is a digit in a given radix. For example, A is a digit in base 16. The standard radices are between 2 and 36 inclusive, and the digits of a radix range from 0 to Z, where Z has the weight 35. Digits are case insensitive.
The predicate function digit-char-p can be used to check if a character is a digit in a given radix.
> (digit-char-p #\9)
;;;If no radix is specified, then the
;;;default radix, which is 10, is used.
;;;digit-char-p returns either the
;;;weight of the digit in the radix,
;;;or false if the character is
;;;not a digit in the provided
;;;radix.
9
> (digit-char-p #\F 16)
15
The function digit-char can be used to get the character that represents a digit in a given radix.
> (digit-char 9)
;;;If no radix is specified, the default
;;;one is base 10. The number 9 is a
;;;digit in base 10, hence digit-char
;;;returns the character representing it.
#\9
> (digit-char 10)
;;;The number 10 is not a digit
;;;in base 10, hence the digit-char
;;;function returns false.
NIL
> (digit-char 35 36)
;;;The number 35 is a digit
;;;in base 36, hence the function
;;;digit-char returns the character
;;;representing it, capital Z.
#\Z
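The two functions are inverses of each other, which is enough to convert between integers and digit strings. The following is only a sketch of that idea (the standard already offers parse-integer and the printer for real use); digits-to-integer and integer-to-digits are hypothetical names.

```lisp
;; Sketch: radix conversion built on DIGIT-CHAR-P and DIGIT-CHAR.
(defun digits-to-integer (string &optional (radix 10))
  "Return the integer written as STRING in RADIX, or NIL if a character is not a digit."
  (let ((value 0))
    (loop for ch across string
          for weight = (digit-char-p ch radix)   ; weight of the digit, or NIL
          do (if weight
                 (setf value (+ (* value radix) weight))
                 (return-from digits-to-integer nil)))
    value))

(defun integer-to-digits (n &optional (radix 10))
  "Return the digits of the non-negative integer N in RADIX as a string."
  (if (zerop n)
      "0"
      (let ((chars '()))
        (loop while (plusp n)
              do (multiple-value-bind (quotient remainder) (floor n radix)
                   (push (digit-char remainder radix) chars)  ; least significant first
                   (setf n quotient)))
        (coerce chars 'string))))

(digits-to-integer "ff" 16)   ; 255, digits are case insensitive
(digits-to-integer "9a")      ; NIL, a is not a digit in base 10
(integer-to-digits 255 16)    ; "FF"
```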
Character attributes
In lisp, the only attribute that a character must have is its code point.
As mentioned earlier, an implementation can define characters besides the standard characters. Each character in a character repertoire has a code point. An implementation can use, for example, the Unicode character repertoire, and all the characters made available by this implementation through this repertoire will have the Unicode code points as their code points.
To get the code point of a character in lisp, the function char-code can be used:
> (char-code #\a)
;;;Get the code point of the character
;;;a in decimal.
97
By default, the print base in lisp is decimal. To see the code point of a character as it is written in Unicode, the print base can be set to hexadecimal, as follows.
> (setq *print-base* 16)
;;;Set the print base to base 16.
;;;The output is 10, which is 16 written
;;;in base 16, to indicate the set base.
10
> (char-code #\a)
;;;Get the code point of the character
;;;a in hexadecimal.
61
To get the character represented by a given code point, the function code-char can be used.
> (setq *read-base* 16)
;;;Set the read base to base 16, instead
;;;of the default base 10. The outputs below
;;;assume that the print base is decimal.
16
> (code-char 61)
;;;Enter the character code in hexadecimal.
;;;The implementation uses the Unicode
;;;character repertoire, as such 61 in
;;;hexadecimal is the code point for
;;;the character a.
#\a
> (setq *read-base* A)
;;;Set the read base back to the default
;;;base 10, which is written A in base 16.
10
> (code-char 97)
#\a
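Changing *print-base* and *read-base* affects everything that is printed and read afterwards, so for a one-off conversion it can be simpler to leave them alone. A minimal sketch, assuming an implementation whose codes follow ASCII or Unicode: the format directive ~X prints in hexadecimal, and the #x prefix reads a number in hexadecimal. The helper code-point-label is hypothetical.

```lisp
;; Sketch: hexadecimal code points without touching the global bases.
(defun code-point-label (ch)
  "Return the code of CH in U+XXXX style, padded to at least four hex digits."
  (format nil "U+~4,'0X" (char-code ch)))

(code-point-label #\a)       ; "U+0061"
(code-char #x61)             ; #\a, #x reads the number in base 16
(char-code (code-char 97))   ; 97, the two functions are inverses
```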
An implementation might define additional attributes besides the code point. Historically, the font attribute and the bits attribute existed as part of Common Lisp, but they were removed in the process of standardizing Common Lisp; instead, an implementation can define additional attributes.
The font attribute was used to specify the style of a character, for example bold or italics.
Character name, encoding, and other meta information
A character can have a name. The only standard characters for which the standard defines a name are the Newline and the Space characters.
The standard also defines a name for the following semi-standard characters, when the implementation supports them: Rubout, Page, Tab, Backspace, Return, Linefeed.
All non-graphic characters must have a name, unless they have some non-null implementation defined attribute. The name of a character can be obtained as a string by using the function char-name, and a character can be obtained from a string containing its name by using the function name-char.
> (char-name #\a)
;;;Get the name of the character
;;;a.
"LATIN_SMALL_LETTER_A"
> (name-char "LATIN_SMALL_LETTER_A")
;;;Get the character which has
;;;the name LATIN_SMALL_LETTER_A.
#\a
A character has an encoding, which is how it is numerically represented in a computing machine. The encoding of a character in lisp can be obtained by using the char-int function.
> (char-int #\a)
;;;Get the integer value of the encoding
;;;of a character. The print base
;;;is decimal, hence the output for
;;;the character a is 97.
97
The lisp standard states that if the implementation does not define any implementation specific attributes for a character, then the char-int function must return the same value as char-code.
The constant char-code-limit gives the exclusive upper bound on the code that a character can have in a given lisp implementation.
> char-code-limit
1114112
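The names that the standard itself guarantees, and the bound given by char-code-limit, can be exercised portably. The sketch below compares names without regard to case, since the lookup done by name-char ignores case; valid-code-p is a hypothetical helper.

```lisp
;; Sketch: portable use of character names and of CHAR-CODE-LIMIT.
(char-name #\Space)       ; the name of the space character
(name-char "newline")     ; #\Newline, the lookup ignores case
(name-char "no such character name")   ; NIL when no character has that name

(defun valid-code-p (n)
  "True if N is an integer that lies in the range of character codes."
  (and (integerp n) (<= 0 n) (< n char-code-limit)))

(valid-code-p (char-code #\a))    ; T
(valid-code-p char-code-limit)    ; NIL, the limit itself is excluded
```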
Character comparison
To compare two or more characters for equality or inequality, using their attributes, the predicate functions char= and char/= can be used. char= returns true if all its characters have the same attributes, and char/= returns true if all its characters differ from one another.
> (char= #\a #\LATIN_SMALL_LETTER_A #\A)
;;;Compare a, a, and A for equality
;;;using their attributes.
NIL
> (char/= #\a #\LATIN_SMALL_LETTER_A #\A)
;;;Compare a, a, and A for difference
;;;using their attributes.
NIL
> (char= #\a #\LATIN_SMALL_LETTER_A #\a)
;;;Compare a, a, and a for equality, using
;;;their attributes.
T
> (char/= #\a #\b #\c)
;;;Compare a, b, and c for difference,
;;;using their attributes.
T
eql and equal can also be used when comparing only two characters; they yield the same result as char=.
eq can be used to compare two characters for equality, but it does not compare at the character level; it compares at a lower, implementation defined level. eq might yield false when char= yields true, but if eq yields true when comparing two characters, then char= must yield true.
To compare one or more characters for order, using the characters' code points when all other attributes are the same, the predicate functions char<, char<=, char>, and char>= can be used.
The standard characters have the following ordering.
A<B<C<D<E<F<G<H<I<J<K<L<M<N<O<P<Q<R<S<T<U<V<W<X<Y<Z
;;;Ordering of uppercase characters
a<b<c<d<e<f<g<h<i<j<k<l<m<n<o<p<q<r<s<t<u<v<w<x<y<z
;;;Ordering of lowercase characters
0<1<2<3<4<5<6<7<8<9
;;;Ordering of digits
An example of using these functions:
> (char< #\A #\B #\C)
;;;Compare A, B, C for
;;;ascending order.
T
> (char> #\z #\f #\d)
;;;Compare z, f, d for
;;;descending order.
T
> (char<= #\1 #\DIGIT_ONE #\4)
;;;Compare 1, 1, 4 for less or
;;;equal order.
T
> (char>= #\4 #\3 #\5)
;;;Compare 4, 3, and 5 for larger
;;;or equal order.
NIL
If characters are to be compared for equality or for order while ignoring case, and ignoring any other attributes that the implementation dictates are to be ignored, then the functions char-equal, char-not-equal, char-greaterp, char-not-greaterp, char-lessp, and char-not-lessp can be used.
> (char-equal #\a #\A #\Latin_capital_letter_A)
;;;Compare for equality, ignoring case,
;;;the characters a, A, and A.
T
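The case-insensitive predicates are convenient as ordering functions for sorting. The following sketch contrasts the two families of predicates and sorts a list with char-lessp; sort-chars-ignoring-case is a hypothetical helper.

```lisp
;; Sketch: case-sensitive versus case-insensitive comparison.
(char= #\a #\A)        ; NIL, the attributes differ
(char-equal #\a #\A)   ; T, case is ignored
(eql #\a #\a)          ; T, same result as CHAR= for two characters

(defun sort-chars-ignoring-case (chars)
  "Return a new list with CHARS sorted in ascending order, ignoring case."
  (stable-sort (copy-list chars) #'char-lessp))  ; copy, since sorting is destructive

(sort-chars-ignoring-case (list #\b #\A #\c))    ; (#\A #\b #\c)
```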
Converting a character designator to a character
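The function character can be used to convert a character designator to its corresponding character. A character designator is a character, a string of length one, or a symbol whose name has length one. Accepting an integer, as in the first example, is an extension of some implementations and is not required by the standard. For example:
> (character 65)
;;;Implementation extension: the integer
;;;is taken as a character code.
#\A
> (character "a")
#\a
> (character 'a)
#\A
> (character #\A)
#\A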
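Where integers also need to be accepted in a portable way, one option is to route them through code-char explicitly. This is only a sketch, assuming an ASCII-compatible implementation for the expected results; to-character is a hypothetical helper.

```lisp
;; Sketch: a portable conversion that also accepts integer codes.
(defun to-character (x)
  "Convert X to a character; integers are treated as character codes."
  (if (integerp x)
      (code-char x)      ; integer: look up the character with this code
      (character x)))    ; otherwise: X must be a character designator

(to-character 65)    ; #\A on an ASCII-compatible implementation
(to-character "a")   ; #\a
(to-character 'a)    ; #\A, the reader upcased the symbol name
```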
standard-char, base-char, extended-char, character
A standard-char is any one of the 96 standard characters discussed earlier.
Whether a character is a standard character can be checked by using standard-char-p.
> (standard-char-p #\a)
;;;Check if the character a is
;;;a standard character, returns
;;;true.
T
> (standard-char-p #\POUND_SIGN)
;;;Check if the pound sign character
;;;is a standard character, returns
;;;false.
NIL
A base-char is a supertype of standard-char. The set of base-char characters can contain additional characters.
This is mostly related to how an implementation encodes characters. If an implementation, for example, has two encodings of characters, one of 8 bits and one of 16 bits, then base-char corresponds to the characters that are encoded using 8 bits, and the characters encoded using 16 bits are called extended characters.
An extended-char is simply a character that is not a base-char.
The character type is a supertype of both the base-char and the extended-char types. In an implementation where all characters are base-char, there can be no extended-char.
The predicate function characterp can be used to check whether an object is a character.
> (characterp #\POUND_SIGN)
;;;Check if the character pound sign
;;;is a character.
T
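The relationship between the four types can be checked with typep and subtypep. In the sketch below, most-specific-char-type is a hypothetical helper that reports the narrowest of the types discussed above; apart from standard characters, its result depends on the implementation.

```lisp
;; Sketch: find the narrowest standard character type of a character.
(defun most-specific-char-type (ch)
  "Return STANDARD-CHAR, BASE-CHAR, or EXTENDED-CHAR, whichever is narrowest for CH."
  (etypecase ch
    (standard-char 'standard-char)   ; tested first, as it is the narrowest type
    (base-char 'base-char)
    (extended-char 'extended-char)))

(most-specific-char-type #\a)          ; STANDARD-CHAR
(subtypep 'standard-char 'base-char)   ; T
(subtypep 'base-char 'character)       ; T
(subtypep 'extended-char 'character)   ; T
(typep "a" 'character)                 ; NIL, a string is not a character
```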