Characters in Lisp: a tutorial

 

Character repertoires in lisp

A character repertoire, such as ASCII, Latin-1, or Unicode, is a set of characters.

Each character in a repertoire, such as a, is assigned a code point. A code point is just a number. For example, in ASCII the code point for a is 97 in decimal, or 61 in hexadecimal.

Each code point has an encoding. An encoding is the representation of the code point in the computer. For example, in ASCII the character a is encoded as its code point, which is 61 in hexadecimal, or 0110 0001 in binary.

That said, the characters that can be used in a Lisp implementation are determined by the character repertoire, or repertoires, that the implementation supports.

At a minimum, every Lisp implementation must support a basic set of characters, called the standard characters. The set formed by these characters is called the standard character repertoire, and it corresponds to the type standard-char.

The characters that all Lisp implementations must support are: a-z, A-Z, 0-9, ! ? $ " ' ` . : , ; * + - / | \ ~ ^ < = > # % @ & ( ) [ ] { } _ , the non-graphic character Newline, and the graphic character Space. That is 96 characters in total.

Character types in lisp

Characters in Lisp can be used to write a program, or they can represent themselves as data. The Lisp reader reads the characters that are typed in, one by one. Each character has a syntax type, which affects how the Lisp reader interprets it.

Characters can be of the following types:

A macro character. The standard macro characters in Lisp are: ( ) ' ` , ; " # .

A macro character affects the parsing of the characters that follow it. When a macro character is encountered, its associated reader macro function is called. The reader macro function may read the following characters, and it returns an object, or nothing at all, as is the case for a comment.

For example, the double quote macro character " returns a string literal, as in "Hello world". The sharp sign macro character # , followed by a backslash \ and either a single character or a character name, returns a literal character, as in #\a . The characters following the semicolon macro character ; are ignored up to the end of the line. The single quote macro character ' reads the object that follows it, such as a symbol or a cons, and wraps it in quote, so that the read-eval-print loop returns the object itself instead of evaluating it. For example:

> '(+ 1 2 )
;;;The cons (+ 1 2 ) 
;;; is not evaluated . 
(+ 1 2)
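
The effect of each macro character can be observed without evaluation by handing text to the reader directly. The following is a minimal sketch that uses the standard function read-from-string; it assumes the default readtable, and only the first returned value is shown.

```lisp
;; Each call reads one object from a string, without evaluating it.
(read-from-string "'(+ 1 2)")         ; => (QUOTE (+ 1 2)), often printed as '(+ 1 2)
(read-from-string "\"Hello world\"")  ; => "Hello world"
(read-from-string "#\\a")             ; => #\a
;; The semicolon comment runs to the end of the line, so 42 is what gets read.
(read-from-string "; ignored
42")                                  ; => 42
```

The first result shows that the single quote does not prevent evaluation by itself: it only makes the reader return a quote form.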

An escape character. An escape character is not a macro character, so no macro function is called and no object is returned.

The backslash \ is the single escape character. It escapes the one character that follows it. An escaped character is treated as alphabetic, and its case is preserved. For example:

> \'
;;;The macro character ' is escaped ,
;;;It is now treated as an alphabetic
;;;character , instead of being
;;;a macro character .  
;;;One or more alpha characters ,
;;;can form a symbol .
;;; The symbol is evaluated , by the 
;;;read eval loop , this will cause an 
;;;error , as no value is associated
;;;with the symbol '.
*** - SYSTEM::READ-EVAL-PRINT: variable |'| has no value
The following restarts are available:

The vertical bars | are the multiple escape characters. They escape all the characters that they enclose: the enclosed characters are treated as alphabetic, and their case is preserved. A vertical bar or a backslash that has to appear between vertical bars must itself be escaped with the single escape character \ .

> |a(`;b|
;;;Characters in vertical bars ,
;;;are escaped , and treated as 
;;;alpha , and their case is 
;;;preserved . The symbol 
;;;|a(`;b| , has no value , and
;;;this will cause an error .
*** - SYSTEM::READ-EVAL-PRINT: variable |a(`;b| has no value
The following restarts are available:
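
To look at escaped symbols without triggering the unbound variable error, the symbol can be quoted and its name inspected. This is a small sketch using the standard function symbol-name, assuming the default readtable case of :upcase.

```lisp
;; Quoting stops evaluation; symbol-name returns the name as a string.
(symbol-name 'a)          ; => "A"      unescaped, converted to uppercase
(symbol-name '\a)         ; => "a"      single escape preserves the case
(symbol-name '\')         ; => "'"      the macro character ' read as alphabetic
(symbol-name '|a(`;b|)    ; => "a(`;b"  multiple escape, nothing is interpreted
(symbol-name '|a\|b|)     ; => "a|b"    a vertical bar inside bars needs a backslash
```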

A constituent character. The constituent characters are Backspace, and A-Z a-z 0-9 ! $ % & * + - . / : < = > ? @ [ ] ^ _ { } ~ . Rubout, which is the delete character in ASCII, is classified as whitespace in the standard syntax rather than as a constituent.

The Lisp reader builds tokens from the constituent characters and the escaped characters that it reads. A token can be either a number, such as 1 or 1.2 , or a symbol, such as a or \' . An escaped character is treated as a constituent character.

A number does not have a name, and it evaluates to itself. A symbol has a name, which is what the Lisp reader read, and it has zero or more values associated with it, such as a variable value or a function definition.

By default, the Lisp reader converts unescaped constituent characters to uppercase. This behaviour is controlled by the readtable case, which can be read with (readtable-case *readtable* ) and which defaults to :upcase . An escaped constituent character keeps its case.

> (readtable-case *readtable*)
;;;Print the case for constituent characters 
:UPCASE

> 'Ab
;;;b is converted to 
;;;uppercase .
AB

> 'A\b
;;;b case is preserved .
|Ab|
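
The other readtable cases can be tried without disturbing the global readtable. The sketch below defines a hypothetical helper, read-symbol-name (not part of the standard), which binds *readtable* to a fresh copy of the standard readtable, sets its case, and reads one symbol.

```lisp
(defun read-symbol-name (text case)
  "Read TEXT as a symbol under readtable case CASE and return the symbol's name."
  ;; (copy-readtable nil) returns a copy of the standard readtable,
  ;; so the global *readtable* is left untouched.
  (let ((*readtable* (copy-readtable nil)))
    (setf (readtable-case *readtable*) case)
    (symbol-name (read-from-string text))))

(read-symbol-name "Ab" :upcase)    ; => "AB"
(read-symbol-name "Ab" :downcase)  ; => "ab"
(read-symbol-name "Ab" :preserve)  ; => "Ab"
(read-symbol-name "Ab" :invert)    ; => "Ab"  mixed-case tokens are left alone
```

Binding *readtable* with let keeps the change local, which is usually safer than calling setf on the global readtable at the prompt.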

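A whitespace character. Among the standard characters, the whitespace characters are Space and Newline. The semi-standard characters Tab, Page, Return, Linefeed, and Rubout, where an implementation supports them, are also whitespace in the standard syntax. Whitespace characters are used to separate tokens.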

> '|a   b|
;;;One token is formed , since
;;;vertical bar escape characters
;;;are used . The token is quoted ,
;;;so the symbol is not evaluated .
|a   b|

> (+ 2 3 )
;;;a list of three tokens 
;;;+ 2 and 3 .  
5
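
How whitespace separates tokens can also be seen by reading several objects from one string. This is a small sketch; the input text is made up for the illustration, and format is only used to put a real newline into the string.

```lisp
;; Three spaces and a newline separate three tokens.
(with-input-from-string (in (format nil "foo   bar~%baz"))
  (list (read in) (read in) (read in)))
;; => (FOO BAR BAZ)

;; Inside vertical bars the spaces are escaped, so only one token is formed.
(symbol-name (read-from-string "|a   b|"))
;; => "a   b"
```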

Literal characters

Characters can represent themselves as data; these are the literal characters.

A literal character is written as #\ followed by either a single character or a character name.

If #\ is followed by a single character, the case of that character is preserved. For example, the literal uppercase character A is written #\A , and the lowercase character a is written #\a .

If #\ is followed by a character name, the name is matched without regard to case, and the literal denotes the character that has this name. Only a few names are standardized; the others are implementation defined. For example, in the implementation used for these examples, which supports Unicode names, #\LATIN_CAPITAL_LETTER_A represents A , and #\LATIN_SMALL_LETTER_A represents a .

Depending on whether printer escaping is enabled, a literal character is output either preceded by #\ , or as is. The variable *print-escape* tells whether printer escaping is enabled.

> *print-escape*
T

> #\a
;;;input is #\a , and output 
;;;is #\a
#\a

> #\A
;;;input is #\A , and output 
;;;is #\A
#\A
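
The difference between the two output forms can be captured in strings. This is a short sketch using the standard functions prin1-to-string, which prints with escaping, princ-to-string, which prints without it, and write-to-string, which follows the current value of *print-escape* . It assumes *print-readably* has its default value of NIL.

```lisp
(prin1-to-string #\a)      ; => "#\\a"  the string holds the three characters #\a
(princ-to-string #\a)      ; => "a"

;; Bind *print-escape* locally instead of changing it globally.
(let ((*print-escape* nil))
  (write-to-string #\A))   ; => "A"
(let ((*print-escape* t))
  (write-to-string #\A))   ; => "#\\A"
```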

Character categories

A character in Lisp can belong to one or more categories.

A graphic character is a character that has a glyph, which is used to display it. All the standard characters except Newline are graphic characters.

The predicate function graphic-char-p can be used to check whether a character is graphic.

> (graphic-char-p #\ )
;;;The space character is a graphic
;;;character.
T

> (graphic-char-p #\
)
;;;The new line character is not graphical
;;;It is informally called a control character 
NIL

An alphabetic character is always a graphic character. The standard characters that are alphabetic are: A-Z , a-z .

For an implementation-defined character, the rule is as follows: if it has case, it must be alphabetic; if it does not, it is implementation defined whether it is alphabetic.

To check whether a character is alphabetic, the predicate function alpha-char-p can be used.

> (alpha-char-p #\1 )
;;;The character 1 is not alphabetic 
NIL

> (alpha-char-p #\a )
;;;The character a is alphabetic 
T

A numeric character is also a graphic character. The standard numeric characters are 0-9 . An implementation may define other numeric characters.

An alphanumeric character is a graphic character that is either numeric or alphabetic. The standard characters that are alphanumeric are: a-z , A-Z , and 0-9 .

The predicate function alphanumericp can be used to check whether a character is alphanumeric.

> (alphanumericp #\a )
;;;Check if a is alphanumeric , 
;;;returns true .
T

> (alphanumericp #\1 )
;;;Check if 1 , is alphanumeric ,
;;;returns true.
T

> (and (alphanumericp #\1 ) (not (alpha-char-p  #\1 ) ))
;;;Check if a character is numeric , by checking
;;;if it is alphanumeric , and is not alphabetic .
;;;Check if 1 is numeric 
T
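
The predicates seen so far can be combined into one helper that lists the categories of a character. The function char-categories below is a hypothetical name introduced for this sketch, and the keywords it returns are arbitrary labels; the numeric test is the derived one shown above.

```lisp
(defun char-categories (ch)
  "Return a list of keywords naming the categories that CH belongs to."
  (remove nil
          (list (and (graphic-char-p ch) :graphic)
                (and (alpha-char-p ch) :alphabetic)
                ;; There is no standard numeric predicate, so numeric is
                ;; derived as: alphanumeric and not alphabetic.
                (and (alphanumericp ch) (not (alpha-char-p ch)) :numeric)
                (and (alphanumericp ch) :alphanumeric))))

(char-categories #\a)        ; => (:GRAPHIC :ALPHABETIC :ALPHANUMERIC)
(char-categories #\1)        ; => (:GRAPHIC :NUMERIC :ALPHANUMERIC)
(char-categories #\Space)    ; => (:GRAPHIC)
(char-categories #\Newline)  ; => NIL
```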

A cased character is an alphabetic character that is either uppercase or lowercase, and that has a counterpart character in the other case.

The standard uppercase characters are A-Z , and the standard lowercase characters are a-z .

The predicate functions upper-case-p and lower-case-p can be used to check whether a character is uppercase or lowercase.

The predicate function both-case-p can be used to check whether a character has both an uppercase and a lowercase version. Some non-standard characters are not cased; for example, Arabic characters do not have case.

The functions char-downcase and char-upcase return the lowercase and the uppercase version of a character.

> (upper-case-p #\A )
;;;Check if the character A , 
;;;is uppercase .
T

> (lower-case-p #\a )
;;;Check if the character a , 
;;; is lowercase .
T

> (both-case-p #\a )
;;;Check if the character a , 
;;;can be uppercase and 
;;;lowercase . 
T

> (both-case-p #\1 )
;;;Check if the character 1 , 
;;;can be uppercase and lowercase .
NIL

> (both-case-p #\ARABIC_LETTER_ALEF )
;;;Check if the arabic character alef
;;;has uppercase , and lowercase .
NIL

> (char-upcase #\a )
;;;Get the uppercase version 
;;;of the character a .
#\A

> (char-downcase #\A )
;;;Get the lowercase version ,
;;;of the character A .
#\a
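
As an illustration of how the case predicates and the case conversion functions work together, here is a sketch of a function that swaps the case of every cased character in a string. The name swap-case is made up for this example; characters without case are returned unchanged.

```lisp
(defun swap-case (string)
  "Return a copy of STRING with uppercase and lowercase characters exchanged."
  (map 'string
       (lambda (ch)
         (cond ((upper-case-p ch) (char-downcase ch))
               ((lower-case-p ch) (char-upcase ch))
               ;; Digits, punctuation and uncased letters are kept as is.
               (t ch)))
       string))

(swap-case "Hello, World 1")  ; => "hELLO, wORLD 1"
```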

A digit character is a digit in a given radix. For example, A is a digit in base 16. The radix must be between 2 and 36 inclusive, and the digits range from 0 to Z , where Z has the weight 35. Digits are case insensitive.

The predicate function digit-char-p can be used to check whether a character is a digit in a given radix.

> (digit-char-p #\9 )
;;;If no radix is specified , then the 
;;;default radix , which is used is 10 .
;;;digit-char-p , returns either the 
;;;weight of the digit in the radix , 
;;;or false if the character is 
;;;not a digit in the provided 
;;;radix .
9

> (digit-char-p #\F 16 )
15

The function digit-char can be used to get the character that represents a given digit weight in a given radix.

> (digit-char 9 )
;;;If no radix is specified , the default 
;;;one is base 10 . The number 9 , is a 
;;;digit in base 10 , hence digit-char
;;;returns its representing character .
#\9

> (digit-char 10 )
;;;The number 10 , is not a digit 
;;;in base 10 , hence the digit-char
;;;function returns false .
NIL

> (digit-char 35 36 )
;;;The number 35 , is a digit , 
;;;in base 36 , hence the function
;;;digit-char returns its representing
;;;character capital Z .
#\Z
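
Since digit-char-p returns the weight of a digit, it can be used to convert a string of digits to an integer. The function digits-to-integer below is a hypothetical helper written for this sketch; the standard function parse-integer already does this job and is shown for comparison.

```lisp
(defun digits-to-integer (string &optional (radix 10))
  "Return the integer written in STRING in RADIX, or NIL if some character
is not a digit in RADIX. An empty string yields 0."
  (let ((value 0))
    (loop for ch across string
          for weight = (digit-char-p ch radix)
          do (if weight
                 ;; Shift the accumulated value one position and add the digit.
                 (setf value (+ (* value radix) weight))
                 (return-from digits-to-integer nil)))
    value))

(digits-to-integer "ff" 16)     ; => 255
(digits-to-integer "101" 2)     ; => 5
(digits-to-integer "12a")       ; => NIL, a is not a digit in base 10
(parse-integer "ff" :radix 16)  ; => 255, the built-in equivalent
```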

Character attributes

In Lisp, the only attribute that a character must have is its code point.

As discussed earlier, an implementation can define characters besides the standard characters. Each character in a character repertoire has a code point. An implementation can use, for example, the Unicode character repertoire, in which case the characters that it makes available have their Unicode code points as codes.

To get the code point of a character in Lisp, the function char-code can be used:

> (char-code #\a )
;;;Get the code point of the character
;;;a in decimal .
97

By default the print base in Lisp is decimal. To see the code point of a character in hexadecimal, as it is usually written in Unicode, the print base can be set to 16 as follows.

> (setq *print-base* 16 )
;;;Set the print base to base 16 . 
;;;outputs 10 in base 16 , to indicate
;;;the set base .
10

> (char-code #\a )
;;;Get the code point of the character 
;;;a in hexadecimal . 
61

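To get the character represented by a given code point, the function code-char can be used. The print base is first restored to 10, so that the results are printed in decimal again.

> (setq *print-base* 10 )
;;;Restore the print base to the
;;;default base 10 .
10

> (setq *read-base* 16 )
;;;Set the read base , to base 16 , instead
;;;of the default base 10 .
16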

> (code-char 61 )
;;;Enter , the character code in hexadecimal . 
;;;The implementation uses , the unicode , 
;;;character repertoire , as such 61 in
;;;hexadecimal , is the code point for 
;;;the character a .
#\a

> (setq *read-base* A )
;;;Set the read base back to the default
;;;base 10 . The read base is still 16
;;;here , so ten is written as A .
10

> (code-char 97 )
#\a
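
Changing *print-base* and *read-base* affects everything typed and printed afterwards, which is easy to forget. As an alternative sketch, the format directives ~D , ~X and ~B print a number in a fixed base, and the #x reader syntax reads a hexadecimal number whatever the read base is. The values shown assume an implementation whose character codes agree with ASCII.

```lisp
(format nil "~D" (char-code #\a))      ; => "97"        decimal
(format nil "~X" (char-code #\a))      ; => "61"        hexadecimal
(format nil "~8,'0B" (char-code #\a))  ; => "01100001"  binary, padded to 8 digits
(code-char #x61)                       ; => #\a
```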

An implementation might define additional attributes besides the code point. Historically, the font attribute and the bits attribute were part of Common Lisp, but they were removed during standardization; instead, an implementation can define its own additional attributes.

The font attribute was used to specify the style of a character, for example bold or italics.

Character name , encoding , and other meta information

A character can have a name. The only standard characters for which the standard defines a name are Newline and Space.

The standard also defines names for the following semi-standard characters, which an implementation may or may not support: Rubout, Page, Tab, Backspace, Return, Linefeed.

Every non-graphic character must have a name, unless it has some non-null implementation-defined attribute. Whether other characters, such as a , have a name is implementation defined. The name of a character can be obtained as a string with the function char-name , and a character can be obtained from a string that holds its name with the function name-char .

> (char-name #\a )
;;;Get the name of the character
;;;a .
"LATIN_SMALL_LETTER_A"

> (name-char "LATIN_SMALL_LETTER_A" )
;;;Get the character which has 
;;;a name of LATIN_SMALL_LETTER_A .
#\a
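
The Unicode style names above depend on the implementation. A more portable sketch sticks to the two names that the standard defines for standard characters; name-char compares names without regard to case and returns NIL for a name it does not know. The unknown name used below is made up for the illustration.

```lisp
(char-name #\Space)                   ; => "Space"
(name-char "Space")                   ; => #\Space
(name-char "space")                   ; => #\Space, case is ignored
(name-char "NEWLINE")                 ; => #\Newline
(name-char "no-such-character-name")  ; => NIL
```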

A character has an encoding, which is how it is numerically represented in the machine. In Lisp, an integer encoding of a character can be obtained with the function char-int .

> (char-int #\a )
;;;Get the integer value of the encoding
;;;of a character , the print base
;;;is decimal , hence the output for
;;;the character a is 97 .
97

The standard specifies that if a character has no implementation-defined attributes, char-int returns the same value as char-code .

The constant char-code-limit is an exclusive upper bound on the code that a character can have in a given Lisp implementation.

> char-code-limit 
1114112
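
The relations between char-code , code-char , char-int and char-code-limit can be checked over the characters of a string. This is only a sketch; the sample text is arbitrary. The last form is expected to be true in an implementation that defines no extra character attributes, which is an assumption about the implementation, so no result is shown for it.

```lisp
(let ((text "Lisp 1.5"))
  (list
   ;; Every code is a non-negative integer below char-code-limit.
   (every (lambda (ch) (< -1 (char-code ch) char-code-limit)) text)
   ;; code-char is the inverse of char-code for these characters.
   (every (lambda (ch) (char= ch (code-char (char-code ch)))) text)))
;; => (T T)

;; Compare char-int with char-code for each character.
(every (lambda (ch) (= (char-int ch) (char-code ch))) "Lisp 1.5")
```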

Character comparison

To compare two or more characters for equality or inequality using all their attributes, the predicate functions char= and char/= can be used. char= returns true if all the characters are the same, and char/= returns true if all the characters are different from one another.

> (char=  #\a #\LATIN_SMALL_LETTER_A #\A )
;;;compare a , a and A for equality
;;;using their attributes .
NIL

> (char/=  #\a #\LATIN_SMALL_LETTER_A #\A )
;;;Compare a , a , and A , for difference
;;;using their attributes .
NIL

> (char=  #\a #\LATIN_SMALL_LETTER_A #\a )
;;;Compare a , a , a for equality , using
;;;their attributes .
T

> (char/=  #\a #\b #\c )
;;;Compare for difference , a , b , and
;;;c using their attributes .
T

eql and equal can also be used when comparing two characters only; they yield the same result as char= .

eq can also be applied to two characters, but it compares object identity at a lower, implementation-defined level rather than the characters themselves. eq might yield false where char= yields true, but if eq yields true for two characters, then char= must yield true as well.
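
The general equality predicates behave as follows on characters. This is a short sketch; the result of eq is deliberately not shown, since it depends on the implementation.

```lisp
(eql #\a #\a)     ; => T
(equal #\a #\a)   ; => T
(eql #\a #\A)     ; => NIL, the case differs
(equalp #\a #\A)  ; => T, equalp compares characters while ignoring case
(eq #\a #\a)      ; may be T or NIL, so it should not be relied upon
```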

To compare two or more characters for order, the predicate functions char< , char<= , char> , and char>= can be used. The order follows the code points of the characters when all their other attributes are the same.

The standard guarantees the following ordering within each group of standard characters. How the groups are ordered relative to one another is largely implementation defined.

A<B<C<D<E<F<G<H<I<J<K<L<M<N<O<P<Q<R<S<T<U<V<W<X<Y<Z
;;;Ordering of uppercase characters

a<b<c<d<e<f<g<h<i<j<k<l<m<n<o<p<q<r<s<t<u<v<w<x<y<z
;;;Ordering of lowercase characters 

0<1<2<3<4<5<6<7<8<9
;;;Ordering of digits 

An example of using these functions :

> (char< #\A #\B #\C )
;;;Compare A , B , C for 
;;;ascending order .
T

> (char> #\z #\f #\d )
;;;Compare z , f , d for
;;;descending order .
T

> (char<= #\1 #\DIGIT_ONE #\4 )
;;;Compare for less or equal
;;;order 1 , 1 , 4 . 
T

> (char>= #\4 #\3 #\5 )
;;;Compare for larger or equal 
;;;order 4 , 3 and 5 .
NIL

To compare characters for equality or for order while ignoring case, and while ignoring any other attributes that the implementation chooses to ignore, the functions char-equal , char-not-equal , char-greaterp , char-not-greaterp , char-lessp , and char-not-lessp can be used.

> (char-equal #\a #\A #\Latin_capital_letter_A )
;;;Compare for equality , ignoring case ,
;;;the characters a , A , and A .
T
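
The other case-insensitive predicates follow the same pattern, and they can serve as the ordering predicate for sort . The sketch below sorts a copy of a string, since sort is destructive; the sample strings are arbitrary.

```lisp
(char-lessp #\a #\B #\c)              ; => T, a < b < c when case is ignored
(char-not-greaterp #\a #\A #\b)       ; => T, a <= a <= b when case is ignored
(char-not-equal #\a #\A)              ; => NIL

;; Sort the characters of a string without regard to case.
(sort (copy-seq "bCa") #'char-lessp)  ; => "abC"
```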

Converting a character designator to a character

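The function character converts a character designator to the character that it denotes. The standard character designators are a character, a string of length one, and a symbol whose name has length one. The implementation used here also accepts an integer code, which is an extension; code-char is the portable way to convert a code to a character. For example:

> (character 65 )
;;;Accepted here as an extension ,
;;;an integer is not a standard
;;;character designator .
#\A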

> (character "a" )
#\a

> (character 'a )
;;;The reader converts a to uppercase ,
;;;so the name of the symbol is A .
#\A

> (character #\A )
#\A

standard-char , base-char , extended-char , character

A standard-char is any of the 96 standard characters discussed earlier.

Whether a character is a standard character can be checked with standard-char-p .

> (standard-char-p  #\a )
;;;Check if the character a , is 
;;;a standard character , returns
;;;true .
T

> (standard-char-p #\POUND_SIGN )
;;;Check if the pound sign character
;;;is a standard character , returns
;;;false .
NIL
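
The count of 96 can be checked by testing every code below 128. This sketch assumes an implementation whose character codes agree with ASCII, so that all the standard characters fall in that range; code-char is allowed to return NIL for a code that has no character, which the loop guards against.

```lisp
;; Count the codes below 128 whose character is a standard character.
(loop for code from 0 below 128
      for ch = (code-char code)
      count (and ch (standard-char-p ch)))
;; => 96 on an ASCII-compatible implementation:
;;    the 95 graphic characters from Space to ~ , plus Newline
```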

A base-char is a supertype of standard-char . The set of base characters can contain additional characters.

This is mostly related to how an implementation encodes characters. If an implementation has, for example, two encodings of characters, one that uses 8 bits and another that uses 16 bits, then base-char might correspond to the characters encoded using 8 bits, and the characters that need 16 bits would be the extended characters.

An extended-char is simply a character that is not a base-char .

The character type is a supertype of both base-char and extended-char . In an implementation where all characters are of type base-char , there are no extended characters.

The predicate function characterp can be used to check whether an object is a character.

> (characterp #\POUND_SIGN )
;;;Check if the character pound sign
;;;is a character .
T
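
The relations between these types can also be queried with typep and subtypep . This is a brief sketch; subtypep returns two values, and only the relations that hold in every conforming implementation are shown.

```lisp
(typep #\a 'standard-char)            ; => T
(typep #\a 'base-char)                ; => T, every standard-char is a base-char
(typep #\a 'character)                ; => T
(subtypep 'standard-char 'base-char)  ; => T, T
(subtypep 'base-char 'character)      ; => T, T
(characterp "a")                      ; => NIL, a string of length one is not a character
```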