Regex in Java, part one: the Pattern class

What is a regex?

Regex stands for regular expression. It is a string of characters that describes a pattern, and it is used to perform:

  • searching
  • replacing
  • validation

in another string. In addition to regular characters such as a, b and c, a regex can contain meta characters. Meta characters can be:

  • the . meta character, which matches any single character except a line terminator, so it can match a, b, 0 or –
  • [] which defines a set of characters. It can specify a range, such as [a-z], which matches any single character from a to z. It can also list the characters to match, such as [ariu], which matches a or r or i or u.
  • [^] when the caret character is the first character inside the square brackets, it means not, so it matches any single character which is not one of the characters, or in the range of characters, inside the square brackets.
    [^abc]
    input: abcdef
    matches: d, e, f
  • predefined character classes, which match a single character based on its category. \d for example matches a digit, and \p{Upper} matches an upper case letter.
  • positional meta characters, which match a position, such as ^ to match the start of the string, and $ to match the end of the string.
  • character group, used to group the characters that we want to match, as in (ab)+, where we want to match the ab group 1 or more times.
  • quantifier, to quantify a character, a character set or a character group.
    • ? makes the character or character group optional: a?
    • + matches a character or character group 1 or more times: (ab)+
    • * matches a character or character group 0 or more times: (ab)*
    • {n} matches a character or character group exactly n times: a{2}
    • {min,} matches a character or character group at least min times: (ab){2,}
    • {min,max} matches a character or character group at least min and at most max times: [ab]{1,5}
  • the | (or) meta character, which matches either the expression before it or the expression after it: (ab)|c
  • the \ meta character, which escapes other meta characters, so if we want to match a literal dot we can use \. (written "\\." inside a Java string literal, since the backslash itself must be escaped there).
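The meta characters listed above can be tried out with a small helper. This is only a minimal sketch: the class name MetaCharacterSketch and the helper findAll are made up for illustration, and the helper relies on the Matcher class, which is only mentioned briefly in this part.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharacterSketch {

    // Hypothetical helper: collect every substring of input matched by regex,
    // scanning from left to right.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findAll("[^abc]", "abcdef")); // [d, e, f]
        System.out.println(findAll("(ab)+", "ababxab")); // [abab, ab]
        System.out.println(findAll("a{2}", "aaaaa"));    // [aa, aa]
        System.out.println(findAll("\\d", "a1b22"));     // [1, 2, 2]
        System.out.println(findAll("\\.", "a.b.c"));     // [., .]
        System.out.println(findAll("(ab)|c", "abcab"));  // [ab, c, ab]
    }
}
```

Note that every backslash of the regex is doubled inside the Java string literal.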

The Pattern class

An instance of the Pattern class is a compiled regex. The regex itself is written as a string formed of one or more characters.

The Pattern class has methods that we can use to match a string against a regex, and to split a string based on a regex. It also has a method that returns an instance of the Matcher class, which can be used to find matches and to perform replacements inside a string.
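As a quick overview, the three kinds of operation just mentioned might look like the following. This is a minimal sketch, not a full tour of the API: the class name PatternOverviewSketch and its helper methods are hypothetical, while split, matcher, replaceAll and matches are the real methods of Pattern and Matcher.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternOverviewSketch {

    // Compile once, then reuse the same Pattern for several operations.
    static final Pattern SPACES = Pattern.compile("\\s+");

    // Split the input around runs of whitespace.
    static String[] words(String input) {
        return SPACES.split(input);
    }

    // Replace each run of whitespace with a single space, through a Matcher.
    static String collapseSpaces(String input) {
        Matcher matcher = SPACES.matcher(input);
        return matcher.replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(words("a  b \t c"))); // [a, b, c]
        System.out.println(collapseSpaces("a  b \t c"));         // a b c
        System.out.println(Pattern.matches("\\s+", " \t "));     // true
    }
}
```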

Compiling a regex into an instance of the Pattern class

There are two static methods that we can use to compile a regex into an instance of the Pattern class:

static Pattern	compile(String regex)
static Pattern	compile(String regex, int flags)

The first one takes a regex specified as a string of characters. The second one takes the same regex string, plus an int made of one or more flags that change how the matching is done.

The flags covered in this part are:

Pattern.CASE_INSENSITIVE

Pattern.UNICODE_CASE
Pattern.UNICODE_CHARACTER_CLASS
Pattern.CANON_EQ

Pattern.DOTALL
Pattern.MULTILINE

Pattern.LITERAL
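Each flag is an int bit mask, so several flags can be combined with the bitwise or operator | and passed to compile together. Some flags can also be switched on from inside the regex with an embedded flag expression, for example (?i) for CASE_INSENSITIVE and (?s) for DOTALL. The following is a minimal sketch of both forms; the class name FlagCombinationSketch and the helper matchesWithFlags are hypothetical.

```java
import java.util.regex.Pattern;

class FlagCombinationSketch {

    // Hypothetical helper: compile regex with the given flags and check whether
    // it matches the entire input.
    static boolean matchesWithFlags(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Two flags combined into one int.
        int flags = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;

        System.out.println(matchesWithFlags("a.b", flags, "A\nB"));  // true
        System.out.println(matchesWithFlags("a.b", 0, "A\nB"));      // false
        // Same two flags, enabled inside the regex itself.
        System.out.println(matchesWithFlags("(?is)a.b", 0, "A\nB")); // true
    }
}
```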

Pattern.CASE_INSENSITIVE

By default pattern matching is case sensitive, so a != A. To make it case insensitive, we can use the CASE_INSENSITIVE flag. On its own, this flag only affects ASCII characters.

import java.util.regex.Pattern;
import java.util.Arrays;
import static java.lang.System.out;

class PatternFlagsCaseInsensitive {

    public static void flagCaseInsensitive() {

        Pattern aPattern;
        String[] arraySplitString;

        aPattern = Pattern.compile("a|b");
        // match either a or b , case sensitive

        arraySplitString = aPattern.split("AcBd");
        /*
         * Split the String AcBd based on the pattern , so if a is found or if b is
         * found split , the string . this method return an array of split String
         */

        out.println(Arrays.toString(arraySplitString));
        /*
         * output : ["AcBd"] no match was found , since matching is case sensitive
         * , as such the result is an array containing the string itself without being
         * split
         */

        aPattern = Pattern.compile("a|b", Pattern.CASE_INSENSITIVE);
        /*
         * create a pattern which is case insensitive it will match a b A B
         */

        arraySplitString = aPattern.split("AcBd");
        out.println(Arrays.toString(arraySplitString));
        /*
         * output : ["", "c", "d"] A is the first match before it there is nothing , so the
         * first element of the array is the empty string , B is the second match before
         * it there is c and after it is d , as such we split around B
         */

        aPattern = Pattern.compile("é", Pattern.CASE_INSENSITIVE);
        /*
         * case insensitive will not work for characters which are not ascii
         */
        arraySplitString = aPattern.split("aÉb");
        out.println(Arrays.toString(arraySplitString));
        /*
         * output : ["aÉb"] Pattern.CASE_INSENSITIVE will only work for ascii character .
         * é is not ascii as such the string is not split around é .
         */

    }

    public static void main(String[] args) {
        flagCaseInsensitive();
    }

}

Pattern.UNICODE_CASE

On its own, CASE_INSENSITIVE only makes matching case insensitive for ASCII characters, so for characters which are not ASCII, É != é. To make case insensitive matching work for all characters, we combine it with UNICODE_CASE. UNICODE_CASE has an effect only together with CASE_INSENSITIVE, and the comparison of the characters is then done in a manner consistent with the Unicode standard.

import java.util.regex.Pattern;
import static java.lang.System.out;
import java.util.Arrays;

class PatternFlagsCaseInsensitiveUnicode {

    public static void flagCaseInsensitiveUnicode() {
        Pattern aPattern;
        String[] arraySplitString;

        aPattern = Pattern.compile("é", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        /*
         * unicode case insensitive pattern which match é and É
         */

        arraySplitString = aPattern.split("aÉb");
        /*
         * Split the String aÉb based on the pattern é
         */

        out.println(Arrays.toString(arraySplitString));
        /*
         * Arrays.toString , will convert the result arraySplitString to a string . The
         * output is: ["a", "b"]
         */

    }

    public static void main(String[] args) {
        flagCaseInsensitiveUnicode();
    }

}

Pattern.UNICODE_CHARACTER_CLASS

The UNICODE_CHARACTER_CLASS flag makes the predefined character classes (such as \d and \w) and the POSIX character classes (such as \p{Lower}) follow the Unicode definitions, instead of being ASCII only.

import java.util.regex.Pattern;
import java.util.Arrays;
import static java.lang.System.out;

class PatternFlagsCharacterClass {

    public static void patternFlagsCharacterClass() {
        Pattern aPattern;
        String[] arraySplitString;

        aPattern = Pattern.compile("\\d");
        /*
         * create a pattern using the Predefined character classes \d , this will match
         * digits between 0 - 9
         */

        arraySplitString = aPattern.split("a١b");
        /*
         * Split the String a١b based on the pattern
         */

        out.println(Arrays.toString(arraySplitString));
        /*
         * output , no match was found , as such the result is the string itself ["a١b"]
         */

        aPattern = Pattern.compile("\\d", Pattern.UNICODE_CHARACTER_CLASS);
        /*
         * create a pattern using the Predefined character classes \d using the
         * UNICODE_CHARACTER_CLASS , so in this case \d will match any character that is
         * considered to be a digit by unicode character property
         */

        arraySplitString = aPattern.split("a١b");
        /*
         * Split the String a١b based on the pattern
         */

        out.println(Arrays.toString(arraySplitString));
        /*
         * output ["a", "b"] , ١ is matched and the string is split around ١
         */

        aPattern = Pattern.compile("\\p{Lower}");
        /*
         * create a pattern using posix character class \p{Lower} , this will match a
         * lower-case alphabetic character: [a-z]
         */
        arraySplitString = aPattern.split("A𐐨B");
        /*
         * Split the String A𐐨B based on the pattern
         */

        out.println(Arrays.toString(arraySplitString));
        /*
         * output ["A𐐨B"] , no match was found , even though 𐐨 is lowercase , since the
         * character class is only matching [a-z] , as such the result is the string
         * itself ["A𐐨B"]
         */

        aPattern = Pattern.compile("\\p{Lower}", Pattern.UNICODE_CHARACTER_CLASS);
        /*
         * create a pattern using posix character class \p{Lower} , which by default
         * only matches [a-z] . We have used UNICODE_CHARACTER_CLASS , so it now
         * matches any character which is considered lowercase by unicode
         */
        arraySplitString = aPattern.split("A𐐨B");
        /*
         * Split the String A𐐨B based on the pattern
         */

        out.println(Arrays.toString(arraySplitString));
        /*
         * output ["A", "B"] . One match is found , 𐐨 is lowercase , and the result is ["A",
         * "B"]
         */

    }

    public static void main(String[] args) {
        patternFlagsCharacterClass();

    }
}
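The flag affects other predefined classes in the same way. As a minimal sketch, \w normally matches only [a-zA-Z_0-9], so it rejects a word containing é unless UNICODE_CHARACTER_CLASS is set. The class name UnicodeWordSketch and the helper isWord are hypothetical, and é is written as \u00e9 so the example does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

class UnicodeWordSketch {

    // Hypothetical helper: true when the whole input is made of \w characters,
    // with or without the UNICODE_CHARACTER_CLASS flag.
    static boolean isWord(String input, boolean unicode) {
        int flags = unicode ? Pattern.UNICODE_CHARACTER_CLASS : 0;
        return Pattern.compile("\\w+", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String word = "\u00e9t\u00e9"; // the French word for summer

        System.out.println(isWord("ete", false)); // true  , ascii letters only
        System.out.println(isWord(word, false));  // false , \u00e9 is not in [a-zA-Z_0-9]
        System.out.println(isWord(word, true));   // true  , \u00e9 is a unicode letter
    }
}
```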

Pattern.CANON_EQ

In Unicode, some characters can be represented by more than one sequence of code points. For example, the é character can be represented either by the single code point "\u00e9" or by the sequence "\u0065\u0301" (the letter e followed by a combining acute accent). If we don't use CANON_EQ, the two representations do not match each other; when we use this flag, they do, because they are canonically equivalent.

import java.util.regex.Pattern;
import java.util.Arrays;
import static java.lang.System.out;

class PatternFlagsCanonEq {

    public static void patternFlagsCanonEq() {
        Pattern aPattern;
        String[] arraySplitString;

        aPattern = Pattern.compile("\u00e9");
        /*
         * create a pattern using the unicode code point of é
         */

        arraySplitString = aPattern.split("a\u0065\u0301b");
        /*
         * Split the String a\u0065\u0301b based on the pattern
         */

        out.println(Arrays.toString(arraySplitString));
        /*
         * output , no match was found , as such the result is the string itself ["aéb"]
         */

        aPattern = Pattern.compile("\u00e9", Pattern.CANON_EQ);
        /*
         * create a pattern using the unicode code point of é and using the CANON_EQ ,
         * so in this case é will match both \u00e9 and \u0065\u0301
         */

        arraySplitString = aPattern.split("a\u0065\u0301b");
        /*
         * Split the String a\u0065\u0301b based on the pattern
         */

        out.println(Arrays.toString(arraySplitString));
        /*
         * output ["a", "b"] , \u0065\u0301 is matched and the string is split around é
         */

    }

    public static void main(String[] args) {
        patternFlagsCanonEq();

    }
}
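A different technique, which is not what CANON_EQ does internally, is to normalize the input to a single form before matching, using java.text.Normalizer. The following is a minimal sketch under the assumption that the regex itself is already written in NFC form; the class name NormalizeBeforeMatchSketch and the helper splitNormalized are hypothetical.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.regex.Pattern;

class NormalizeBeforeMatchSketch {

    // Hypothetical helper: bring the input to NFC (composed) form, then split it
    // with a pattern compiled without CANON_EQ. Assumes regex is already in NFC.
    static String[] splitNormalized(String regex, String input) {
        String nfcInput = Normalizer.normalize(input, Normalizer.Form.NFC);
        return Pattern.compile(regex).split(nfcInput);
    }

    public static void main(String[] args) {
        // The input uses e followed by a combining acute accent.
        String[] parts = splitNormalized("\u00e9", "a\u0065\u0301b");
        System.out.println(Arrays.toString(parts)); // [a, b]
    }
}
```

Unlike CANON_EQ, this changes the text being searched, so match positions refer to the normalized string and not to the original one.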

Pattern.DOTALL

The DOTALL flag alters how the meta character . works. By default the . character does not match the line terminators:

  • '\n' : new line
  • "\r\n" : carriage return followed by new line
  • '\r' : standalone carriage return
  • '\u0085' : next line character
  • '\u2028' : line separator character
  • '\u2029' : paragraph separator character

If we set the DOTALL flag, the meta character . also matches line terminators.

import java.util.regex.Pattern;
import java.util.Arrays;
import static java.lang.System.out;

class PatternFlagsDOTALL {

    public static void patternFlagsDOTALL() {
        Pattern aPattern;
        String[] arraySplitString;

        aPattern = Pattern.compile(".");
        /*
         * create a pattern using the meta character . , this will match everything but
         * line terminators
         */

        arraySplitString = aPattern.split("\nb\n");
        /*
         * Split the String \nb\n based on the pattern
         */

        out.println(Arrays.toString(arraySplitString));
        /*
         * output ["\n","\n"], one match is found which is b , split around b
         */

        aPattern = Pattern.compile(".", Pattern.DOTALL);
        /*
         * create a pattern using the meta character .  we used DOTALL so .
         * will match everything even line terminators
         */

        arraySplitString = aPattern.split("\nb\n");
        /*
         * Split the String \nb\n based on the pattern
         */

        out.println(Arrays.toString(arraySplitString));
        /*
         * output [], . will match everything even line terminators , so every character
         * is a match . Only empty strings are left , and split discards them
         */

    }

    public static void main(String[] args) {
        patternFlagsDOTALL();

    }
}
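A common use of DOTALL is letting .* run across several lines. The following is a minimal sketch of that case; the class name DotAllSketch and the helper firstMatch are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllSketch {

    // Hypothetical helper: return the first match of regex in input,
    // or null when there is none.
    static String firstMatch(String regex, int flags, String input) {
        Matcher matcher = Pattern.compile(regex, flags).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String input = "start\nend";

        // Without DOTALL , .* stops at the line terminator , so there is no match.
        System.out.println(firstMatch("start.*end", 0, input)); // null

        // With DOTALL , .* also consumes the \n and the whole input is matched.
        System.out.println("start\nend".equals(firstMatch("start.*end", Pattern.DOTALL, input))); // true
    }
}
```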

Pattern.MULTILINE

By default, the positional meta characters ^ and $ match only at the start and at the end of the entire input, and ignore the line terminators inside it. The one exception is that $ also matches just before a line terminator that ends the input. The line terminators are:

  • '\n' : new line
  • "\r\n" : carriage return followed by new line
  • '\r' : standalone carriage return
  • '\u0085' : next line character
  • '\u2028' : line separator character
  • '\u2029' : paragraph separator character

When the MULTILINE flag is set, ^ also matches just after a line terminator (except at the end of the input), and $ also matches just before a line terminator.

import java.util.regex.Pattern;
import java.util.Arrays;
import static java.lang.System.out;

class PatternFlagsMULTILINE {

    public static void patternFlagsMULTILINE() {
        Pattern aPattern;
        String[] arraySplitString;

        aPattern = Pattern.compile("$");
        /*
         * create a pattern using the meta character $ , this will match the end of
         * the string , and also just before a line terminator that ends the string
         */

        arraySplitString = aPattern.split("a\nb\nc\n");
        /*
         * Split the String a\nb\nc\n based on the pattern
         */

        out.println(Arrays.toString(arraySplitString));
        /*
         * output ["a\nb\nc","\n"] , two matches are found : just before the final \n ,
         * and at the end of the string . The trailing empty string is discarded , as
         * such the returned array contains two strings
         */

        aPattern = Pattern.compile("$", Pattern.MULTILINE);
        /*
         * create a pattern using the meta character $ , we have used the MULTILINE flag
         * , so $ will also match just before each line terminator
         */

        arraySplitString = aPattern.split("a\nb\nc\n");
        /*
         * Split the String a\nb\nc\n based on the pattern
         */

        out.println(Arrays.toString(arraySplitString));
        /*
         * output ["a","\nb","\nc","\n"] , when using the MULTILINE flag $ will match just
         * before each line terminator and it will match also the end of the string . as
         * such the result is an array containing 4 strings
         */

    }

    public static void main(String[] args) {
        patternFlagsMULTILINE();
    }
}
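Since ^ and $ match positions rather than characters, it can help to print the indexes at which they match. The following is a minimal sketch using Matcher.find and Matcher.start; the class name AnchorPositionsSketch and the helper matchPositions are hypothetical. The input is the same string a\nb\nc\n, in which the line terminators sit at indexes 1, 3 and 5 and the end of the input is index 6.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorPositionsSketch {

    // Hypothetical helper: list the start index of every match of regex in input.
    static List<Integer> matchPositions(String regex, int flags, String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex, flags).matcher(input);
        while (matcher.find()) {
            positions.add(matcher.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        String input = "a\nb\nc\n";

        // $ alone : before the final line terminator , and at the end of the input
        System.out.println(matchPositions("$", 0, input));                 // [5, 6]
        // $ with MULTILINE : before every line terminator , and at the end
        System.out.println(matchPositions("$", Pattern.MULTILINE, input)); // [1, 3, 5, 6]
        // ^ alone : only the start of the input
        System.out.println(matchPositions("^", 0, input));                 // [0]
        // ^ with MULTILINE : the start of each line , but not the end of the input
        System.out.println(matchPositions("^", Pattern.MULTILINE, input)); // [0, 2, 4]
    }
}
```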

Pattern.LITERAL

We can use the LITERAL flag if we want the regex passed to the Pattern to be treated as a literal string: all its characters, including meta characters and escape sequences, are then treated as regular characters.

import java.util.regex.Pattern;
import java.util.Arrays;
import static java.lang.System.out;

class PatternFlagsLITERAL {

    public static void patternFlagsLITERAL() {
        Pattern aPattern;
        String[] arraySplitString;

        aPattern = Pattern.compile("a|b");
        /*
         * create a regex using the meta character | , so we can match either a or b
         */

        arraySplitString = aPattern.split("eaf");
        /*
         * Split the String eaf based on the pattern
         */

        out.println(Arrays.toString(arraySplitString));
        /*
         * output ["e","f"] one match is found which is a as such we split around a , and
         * we have two strings as a result
         */

        aPattern = Pattern.compile("a|b", Pattern.LITERAL);
        /*
         * The meta character | is in this case treated as literal and what we must
         * match is a|b and not a or b
         */

        arraySplitString = aPattern.split("eaf");
        /*
         * Split the String eaf based on the pattern
         */

        out.println(Arrays.toString(arraySplitString));
        /*
         * output : ["eaf"] , since now we must match a|b in order to split , so no match is found
         */

        arraySplitString = aPattern.split("ea|bf");
        out.println(Arrays.toString(arraySplitString));
        /*
         * output : ["e","f"] , one match is found which is a|b split around this match
         */

    }

    public static void main(String[] args) {
        patternFlagsLITERAL();
    }
}

Splitting a string using a Pattern instance

String[]	split(CharSequence input)
String[]	split(CharSequence input, int limit)

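A compiled regex gives us a Pattern instance which we can use to split a CharSequence, which is just an interface implemented by String, StringBuffer and StringBuilder.

The first method only takes a CharSequence, whereas the second one also takes a limit. A positive limit is the maximum number of strings in the returned array, so the pattern is applied at most limit - 1 times. A limit of zero, which is what the first method uses, applies the pattern as many times as possible and discards trailing empty strings. A negative limit also applies the pattern as many times as possible, but keeps the trailing empty strings.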

import java.util.regex.Pattern;
import java.util.Arrays;
import static java.lang.System.out;

class PatternSplit {

    public static void patternSplit() {
        Pattern aPattern;
        String[] arraySplitString;

        aPattern = Pattern.compile(";");
        /*
         * create a regex using the character ;
         */

        arraySplitString = aPattern.split("");
        /*
         * Split the empty string using the pattern instance
         */

        out.println(Arrays.toString(arraySplitString));
        /*
         * output [] : no match was found , so the result is an array containing the
         * input itself , which is the empty string [""] , and it is printed as []
         */

        arraySplitString = aPattern.split(";");
        /*
         * Split the string containing only the regex
         */

        out.println(Arrays.toString(arraySplitString));
        /*
         * output [] the empty array : only empty strings are left around the match ,
         * and trailing empty strings are discarded
         */

        arraySplitString = aPattern.split("b;");
        /*
         * Split the string containing b;
         */

        out.println(Arrays.toString(arraySplitString));
        /*
         * output ["b"] , the regex was found , there is only one string before it
         */

        arraySplitString = aPattern.split(";b");
        /*
         * Split the string containing ;b
         */

        out.println(Arrays.toString(arraySplitString));
        /*
         * output ["","b"] , the regex was found , there is one string after it , and
         * there is the empty string before it
         */

        arraySplitString = aPattern.split("a;b;c", 1);
        /*
         * Split the string containing a;b;c , with a limit of 1
         */

        out.println(Arrays.toString(arraySplitString));
        /*
         * output ["a;b;c"] , since the limit is 1 , the returned array must contain
         * only one element , so the pattern is not applied at all
         */

        arraySplitString = aPattern.split("a;b;c", 2);
        /*
         * Split the string containing a;b;c , with a limit of 2
         */

        out.println(Arrays.toString(arraySplitString));
        /*
         * output ["a" ,"b;c"] , since the limit is 2 , the returned array must contain
         * at most two elements , so the pattern is applied only once
         */

    }

    public static void main(String[] args) {
        patternSplit();
    }
}
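The difference between a limit of zero and a negative limit only shows when the input ends with matches. The following is a minimal sketch of that case; the class name SplitLimitSketch and the helper splitWithLimit are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitLimitSketch {

    // Hypothetical helper: split input around ; with the given limit.
    static String[] splitWithLimit(String input, int limit) {
        return Pattern.compile(";").split(input, limit);
    }

    public static void main(String[] args) {
        String input = "a;b;;";

        // limit 0 : trailing empty strings are discarded , 2 elements
        System.out.println(Arrays.toString(splitWithLimit(input, 0)));  // [a, b]

        // negative limit : trailing empty strings are kept , 4 elements
        System.out.println(splitWithLimit(input, -1).length);           // 4

        // positive limit : at most 2 elements , the rest is left unsplit
        System.out.println(Arrays.toString(splitWithLimit(input, 2)));  // [a, b;;]
    }
}
```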

Get the flags and regex from a Pattern instance

String pattern()
int	flags()
String	toString()

We can get the regex string used to create the Pattern instance by using the pattern method, and the flags it was compiled with, combined into a single int, by using the flags method. The toString method also returns the regex string used to create the Pattern instance.

import java.util.regex.Pattern;
import static java.lang.System.out;

class PatternGetFlagsRegex {

    public static void patternGetFlagsRegex() {
        // get the pattern flags and regex
        Pattern aPattern = Pattern.compile("a\\d", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        String aPatternRegex = aPattern.pattern();
        out.println("The pattern regex is : " + aPatternRegex);
        // output : The pattern regex is : a\d
        int aPatternFlags = aPattern.flags();
        out.println("The pattern flags are : " + aPatternFlags); // CASE_INSENSITIVE is 2 , UNICODE_CASE is 64
        // output : The pattern flags are : 66

        out.println("The pattern to string is : " + aPattern.toString());
        // output : The pattern to string is : a\d

    }

    public static void main(String[] args) {
        patternGetFlagsRegex();
    }
}
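Because flags returns all the flags combined into one int, an individual flag can be tested with the bitwise and operator &. The following is a minimal sketch; the class name FlagCheckSketch and the helper hasFlag are hypothetical.

```java
import java.util.regex.Pattern;

class FlagCheckSketch {

    // Hypothetical helper: true when the given flag bit is set on the pattern.
    static boolean hasFlag(Pattern pattern, int flag) {
        return (pattern.flags() & flag) != 0;
    }

    public static void main(String[] args) {
        Pattern aPattern = Pattern.compile("a\\d", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

        System.out.println(hasFlag(aPattern, Pattern.CASE_INSENSITIVE)); // true
        System.out.println(hasFlag(aPattern, Pattern.UNICODE_CASE));     // true
        System.out.println(hasFlag(aPattern, Pattern.DOTALL));           // false
    }
}
```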

Quote a regex literal

static String	quote(String s)

We can use the static Pattern.quote method to quote a regex literal. This escapes all the regex meta characters, such as the quantifiers, so they are treated as regular characters. The quoting is done by adding \Q at the start of the string, and \E at the end of the string.

import java.util.regex.Pattern;
import static java.lang.System.out;
import java.util.Arrays;

class PatternQuote {

    public static void patternQuote() {

        String quotedRegex = Pattern.quote("a+");
        // Quote the regex String using the Pattern.quote method

        out.println("The quoted regex is : " + quotedRegex);
        // output : The quoted regex is : \Qa+\E

        Pattern aPattern = Pattern.compile(quotedRegex);
        out.println(Arrays.toString(aPattern.split("bab")));
        // output : ["bab"] , the pattern is the literal text a+ , and not a 1 or more times

        out.println(Arrays.toString(aPattern.split("ba+b")));
        // output : ["b","b"] , the literal text a+ is found 1 time

    }

    public static void main(String[] args) {
        patternQuote();
    }
}
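Unlike the LITERAL flag, which applies to the whole regex, quote can be used on just one part of it. The following is a minimal sketch in which a piece of text is quoted and then followed by a regular \d+; the class name QuotePartSketch, the helper prefixThenDigits and the sample prefix v1. are made up for illustration.

```java
import java.util.regex.Pattern;

class QuotePartSketch {

    // Hypothetical helper: the prefix is matched as literal text , while the
    // rest of the regex ( one or more digits ) keeps its usual meaning.
    static Pattern prefixThenDigits(String prefix) {
        return Pattern.compile(Pattern.quote(prefix) + "\\d+");
    }

    public static void main(String[] args) {
        Pattern aPattern = prefixThenDigits("v1.");

        System.out.println(aPattern.matcher("v1.42").matches()); // true
        System.out.println(aPattern.matcher("v1x42").matches()); // false , the dot is literal

        // Without quoting , the dot is a meta character and also matches x.
        System.out.println(Pattern.matches("v1.\\d+", "v1x42")); // true
    }
}
```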

Check if a CharSequence matches a regex from start to end

static boolean	matches(String regex, CharSequence input)

We can use the static matches method to check if a whole CharSequence, from its beginning till its end, matches a regex. We pass to this static method the regex, and the CharSequence that we want to match against it. CharSequence is an interface implemented by the String, StringBuffer and StringBuilder classes.

import java.util.regex.Pattern;
import static java.lang.System.out;

class PatternMatches {

    public static void patternMatches() {
        out.println(Pattern.matches("a+", "This is a string to test"));
        // output : false
        // doesn't match the string from its start till its end
        out.println(Pattern.matches("a+", "aaa"));
        // output : true

    }

    public static void main(String[] args) {
        patternMatches();
    }
}
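To see what matching from start to end means, it can be compared with looking for the regex anywhere inside the input, which is done with Matcher.find. The following is a minimal sketch; the class name MatchesVersusFindSketch and its two helpers are hypothetical.

```java
import java.util.regex.Pattern;

class MatchesVersusFindSketch {

    // True only when the entire input matches the regex.
    static boolean matchesWhole(String regex, CharSequence input) {
        return Pattern.matches(regex, input);
    }

    // True when the regex matches somewhere inside the input.
    static boolean containsMatch(String regex, CharSequence input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "This is a string to test";

        System.out.println(matchesWhole("a+", input));  // false
        System.out.println(containsMatch("a+", input)); // true

        // Any CharSequence can be used , not only a String.
        System.out.println(matchesWhole("a+", new StringBuilder("aaa"))); // true
    }
}
```

Pattern.matches compiles the regex on every call, so when the same regex is used many times it is usually better to compile it once and reuse the Pattern instance.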