Regular Expressions (RegEx) in Java
A regular expression is a sequence of characters that specifies a set of strings and is used to search, edit, and manipulate text. Like most programming languages, Java supports regular expressions. We've already learned some basics of the regex language. In this lesson, we'll explore how it is implemented in Java.
Simple matching
First of all, we can create a regular expression by means of a String. Take a look at the following example:
String aleRegex = "ale"; // the "ale" regex
In Java, the String data type has built-in support for regular expressions. Strings have a special method called matches that takes a regular expression pattern as its argument and checks whether the string matches this pattern. Keep in mind that the method returns true only when the entire string matches the regexp; otherwise, it returns false. The pattern defined by the regex is applied to the text from left to right.
In the example below, we try to match aleRegex against different strings:
"ale".matches(aleRegex); // true
"pale".matches(aleRegex); // false, "pale" string has an additional character
"ALE".matches(aleRegex); // false, uppercase letters don't match lowercase and vice versa
You can see that the string "pale" is not matched by our regex pattern. The reason is that the matches method checks whether the whole string fits the regex pattern, not just some substring. In this regard, it differs from the default matching functions of many other programming languages, which succeed as soon as any substring matches.
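The difference between whole-string and substring matching can be seen in a short sketch. The class and method names below are hypothetical. The substring check relies on Pattern and Matcher from java.util.regex, which this lesson does not cover, so treat it only as a contrast to matches.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: contrasts whole-string matching with substring search.
class WholeMatchDemo {

    // Same rule as in the text: true only if the ENTIRE input fits the regex.
    static boolean matchesWhole(String input, String regex) {
        return input.matches(regex);
    }

    // For contrast only: true if ANY substring of the input fits the regex.
    static boolean containsMatch(String input, String regex) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String aleRegex = "ale";
        System.out.println(matchesWhole("pale", aleRegex));  // false
        System.out.println(containsMatch("pale", aleRegex)); // true
    }
}
```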
Here is another example. The helloRegex pattern has two words separated by a comma and a whitespace character:
String helloRegex = "Hello, World";
"Hello, World".matches(helloRegex); // true
"Hello, world".matches(helloRegex); // false
"Hello,World".matches(helloRegex); // false
As is evident from the previous examples, when our regex is just a sequence of simple characters, it can be matched only by the exact same string. There are easier ways to compare strings, though, and we wouldn't really need regular expressions if that were all they could do.
As you remember, the real power of regular expressions lies in special characters helping you define a pattern that matches several different strings at once.
The dot character and the question mark
The dot . matches any single character, including letters, digits, spaces, and so on. By default, the only characters it cannot match are line terminators, such as the newline character \n. The examples below should look familiar to you:
String learnRegex = "Learn.Regex";
"Learn Regex".matches(learnRegex); // true
"Learn.Regex".matches(learnRegex); // true
"LearnRegex".matches(learnRegex); // false
"Learn, Regex".matches(learnRegex); // false
"Learn\nRegex".matches(learnRegex); // false
As you remember, the question mark ? is a special character that means "the preceding character or nothing". Words with slightly different spelling in American and British English serve as a traditional example of this character's application:
String pattern = "behaviou?r";
"behaviour".matches(pattern); // true
"behavior".matches(pattern); // true
Now let's combine the dot character . and the question mark ? in one regex pattern. This combination (.?) basically means "there can be any single character or no character at all". In the example below, the regex matches any one or two characters:
String pattern = "..?";
"I".matches(pattern); // true
"am".matches(pattern); // true
"".matches(pattern); // false
The dot and question mark characters seem to be very useful, but what if you want to check for those exact symbols in your text? Let's see what we should do about that.
The tricky escape character
Right now you're probably wondering what we should do if we want to use the dot . or the question mark ? as a regular punctuation mark and not as a special symbol within the regex pattern.
Well, in this case, we should protect our special symbol by putting the backslash \ before it. The backslash is called an escape character because it helps symbols to "escape" their working duties. Note that when you want to use the backslash \ itself in its literal meaning, you need to escape it as well! This way, a double backslash \\ in your regex means a single backslash in the matching string.
However, it gets more complicated when you implement such patterns in your Java program. The backslash \ works as an escape character not only for regular expressions but for String literals as well. So, in fact, we have to use an additional backslash to escape the one we need in the regular expression, just like this:
String endRegex = "The End\\.";
"The End.".matches(endRegex); // true
"The End?".matches(endRegex); // false
For instance, the regular expression for any five characters followed by a dot looks like this:
String pattern = ".....\\.";
"a1b2c.".matches(pattern); // true
"Wrong.".matches(pattern); // true
"Hello!".matches(pattern); // false
As you can see, regular expressions are a powerful tool for processing strings in Java. They allow us to define common patterns by using regular characters and characters with special meaning, and then check whether strings match these patterns. The key points are:
- the matches string method is used to compare a regex pattern with a string;
- it returns true only when the whole string matches the regexp;
- when using regexps in Java, we should add an extra backslash for escaping symbols, which results in a double backslash instead of just one.
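The points about the dot and about escaping can be checked with a small sketch. The class and method names are hypothetical, and the C:\temp text is only an illustration. Pattern.quote from java.util.regex is not mentioned in this lesson; it is shown here as one possible alternative to escaping by hand.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the dot and for escaping special symbols.
class BasicsDemo {

    // The dot matches any single character except line terminators (default mode).
    static boolean matchesLearn(String input) {
        return input.matches("Learn.Regex");
    }

    // Manual escaping: the Java literal "The End\\." is the regex The End\.
    static boolean endsManually(String input) {
        return input.matches("The End\\.");
    }

    // Pattern.quote wraps the text so that every character is taken literally.
    static boolean endsQuoted(String input) {
        return input.matches(Pattern.quote("The End."));
    }

    // Four backslashes in the literal -> two in the regex -> one in the text.
    static boolean isTempPath(String input) {
        return input.matches("C:\\\\temp");
    }

    public static void main(String[] args) {
        System.out.println(matchesLearn("Learn\tRegex")); // true, a tab is not a line terminator
        System.out.println(matchesLearn("Learn\rRegex")); // false, \r is a line terminator
        System.out.println(endsManually("The End?"));     // false
        System.out.println(endsQuoted("The End."));       // true
        System.out.println(isTempPath("C:\\temp"));       // true
    }
}
```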
Sets, ranges, alternations
The dot character allows us to write common patterns for matching. The dot, though, matches almost every possible character, and sometimes we want to be more specific in our regex. So our next step is to call the sets to come to our rescue.
The set of characters
Each set corresponds to a single character in the string, but what character exactly it can be is defined by the content of the set. The set is written in square brackets, []. For example, the set "[abc]" means that a single character "a", "b", or "c" can match it. Take a look at the example below:
String pattern = "[bcr]at"; // it matches strings "bat", "cat", "rat", but not "fat"
"rat".matches(pattern); // true
"fat".matches(pattern); // false
You can use as many sets as you want and combine them with usual characters in a line. There are two sets in the following example:
String pattern = "[ab]x[12]"; // can match a or b followed by x followed by either 1 or 2
This pattern can be successfully matched by the following strings:
"ax1", "ax2", "bx1", "bx2"
But the following strings do not match this pattern:
"xa1", "aax1", "bx"
As you can see, the order of sets in regular expressions is important. On the other hand, the order you put the characters within the set does not matter.
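A minimal sketch can confirm both observations: the position of a set in the pattern matters, while the order of characters inside a set does not. The class and method names are hypothetical.

```java
// Hypothetical demo class for character sets.
class SetDemo {

    // The pattern from the text: a or b, then x, then 1 or 2.
    static boolean matchesOriginal(String input) {
        return input.matches("[ab]x[12]");
    }

    // Same sets with their characters listed in a different order.
    static boolean matchesReordered(String input) {
        return input.matches("[ba]x[21]");
    }

    public static void main(String[] args) {
        String[] samples = {"ax1", "ax2", "bx1", "bx2", "xa1", "aax1", "bx"};
        for (String sample : samples) {
            // Both columns are always equal: order inside a set is irrelevant.
            System.out.println(sample + " -> " + matchesOriginal(sample)
                    + " / " + matchesReordered(sample));
        }
    }
}
```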
The range of characters
Sometimes we want to make our character sets quite large. In this case, we don't have to write all the characters down. Instead, we can specify a range designated by the dash symbol -. The character that precedes the dash denotes the starting point of the range, and the character that follows it is the last character that falls into the range. If the needed characters immediately follow each other in the ASCII or Unicode tables, we can put them into a set as a range of characters. This includes alphabetically ordered letters and numeric values. For example, we can write a set that matches every digit:
String anyDigitPattern = "[0-9]"; // matches any digit from 0 to 9
The same works for letter ranges such as "[a-z]" or "[A-Z]". These ranges match all Latin lowercase and uppercase letters respectively. Since patterns are case sensitive, if we want to match any letter regardless of case, we can write the following regex:
String anyLetterPattern = "[a-zA-Z]"; // matches any letter "a", "b", ..., "A", "B", ...
Note that although the range [A-z] is technically valid, it includes additional symbols ([, \, ], ^, _ and `) that are placed between uppercase and lowercase letters in the ASCII table.
As you can see, you can easily put several ranges in one set and mix them with separate characters in any order:
String anyLetterPattern = "[a-z!?.A-Z]"; // matches any letter and "!", "?", "."
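The pitfall of the [A-z] range is easy to verify. The sketch below, with hypothetical class and method names, compares it with the explicit [a-zA-Z] set on characters that sit between "Z" and "a" in the ASCII table.

```java
// Hypothetical demo class comparing two letter ranges.
class RangeDemo {

    // Only Latin letters, upper or lower case.
    static boolean isLatinLetter(String input) {
        return input.matches("[a-zA-Z]");
    }

    // Also accepts the six symbols located between 'Z' and 'a' in ASCII.
    static boolean inWideRange(String input) {
        return input.matches("[A-z]");
    }

    public static void main(String[] args) {
        System.out.println(isLatinLetter("q")); // true
        System.out.println(isLatinLetter("_")); // false
        System.out.println(inWideRange("_"));   // true, '_' lies between 'Z' and 'a'
        System.out.println(inWideRange("^"));   // true for the same reason
    }
}
```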
Excluding characters
In some cases, it is simpler to define which characters are not wanted. Then, we can write a set that will match everything except for the characters mentioned in it. To do that, we write the hat character ^ as the first one in the set.
String regex = "[^abc]"; // matches any single character except for "a", "b", "c"
"a".matches(regex); // false
"b".matches(regex); // false
"c".matches(regex); // false
"d".matches(regex); // true
The same works for ranges:
String regex = "[^1-6]";
"1".matches(regex); // false
"2".matches(regex); // false
"0".matches(regex); // true
"9".matches(regex); // true
Escaping characters in sets
The general rule is that you do not need to escape special characters within sets if they are used in their literal meaning. For example, the set [.?!] will match a single period, a question mark, or an exclamation mark, and nothing else. However, the characters that form a set or a range should be escaped or put in a neutral position when we look for their literal symbols:
- to match the dash character itself, we should put it in the first or in the last position in the set: "[-a-z]" matches lowercase letters and the dash, and "[A-Z-]" matches uppercase letters and the dash;
- the hat ^ does not need to be escaped if placed anywhere but the beginning. This way, the set "[^a-z^]" matches everything except for lowercase letters and the hat character;
- square brackets should always be escaped:
String pattern = "[\\[\\]]"; // matches "[" and "]"
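The exclusion and escaping rules for sets can be checked together in one sketch. The class and method names are hypothetical. Note that an excluding set still has to consume exactly one character, so it never matches an empty string.

```java
// Hypothetical demo class for excluding sets and literal symbols inside sets.
class SetRulesDemo {

    // Any single character except a, b, c.
    static boolean notAbc(String input) {
        return input.matches("[^abc]");
    }

    // The dash in the first position is a literal dash, not a range marker.
    static boolean isLowerOrDash(String input) {
        return input.matches("[-a-z]");
    }

    // The second hat is a literal: everything except lowercase letters and '^'.
    static boolean notLowerOrHat(String input) {
        return input.matches("[^a-z^]");
    }

    // Escaped square brackets are matched literally.
    static boolean isBracket(String input) {
        return input.matches("[\\[\\]]");
    }

    public static void main(String[] args) {
        System.out.println(notAbc("d"));        // true
        System.out.println(notAbc(""));         // false, one character is required
        System.out.println(isLowerOrDash("-")); // true
        System.out.println(notLowerOrHat("^")); // false
        System.out.println(isBracket("]"));     // true
    }
}
```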
Alternations
Until this point, we were talking about single characters. There's another way to match one of the listed items, which includes longer sequences to choose from. The vertical bar | is used to match either the character sequence before it or the one after it.
String pattern = "yes|no|maybe"; // matches "yes", "no", or "maybe", but not "y" or "e"
"no".matches(pattern); // true
This is useful in situations when we want to look for particular words, like, "bear", "bat", or "bird" to complete the sentence: "The giant ___ scared me", and when it's easier to indicate whole words.
The vertical bar can be used together with parentheses, which designate the boundaries of the alternating substrings: each option within the parentheses is one possible substring for that part of the pattern, while the text outside the parentheses is common to all options:
String pattern = "(b|r|go)at"; // matches "bat", "rat" or "goat"
String answer = "The answer is definitely (yes|no|maybe)";
In general, alternations serve a purpose similar to that of sets: they describe multiple alternatives that a particular part of the pattern can match. However, while sets can match only a single character in the string, alternations are used to define multi-character alternatives.
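The effect of the parentheses becomes clearer when the same alternatives are written with and without them. In the sketch below (hypothetical class and method names), the ungrouped pattern treats "goat" as one whole alternative, so the "at" ending no longer applies to "b" and "r".

```java
// Hypothetical demo class comparing grouped and ungrouped alternations.
class AlternationDemo {

    // "b", "r" or "go", each followed by "at".
    static boolean matchesGrouped(String input) {
        return input.matches("(b|r|go)at");
    }

    // Without parentheses the alternatives are "b", "r" and "goat".
    static boolean matchesUngrouped(String input) {
        return input.matches("b|r|goat");
    }

    public static void main(String[] args) {
        System.out.println(matchesGrouped("bat"));    // true
        System.out.println(matchesUngrouped("bat"));  // false
        System.out.println(matchesUngrouped("b"));    // true
        System.out.println(matchesUngrouped("goat")); // true
    }
}
```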
Here's a brief recap:
- Square brackets designate a set and are used to describe the set of characters that can match a pattern, like [123].
- Inside sets, ranges of characters marked by the dash can be used: [1-3].
- There are sets that ban specific characters. These are the sets with the hat character in the initial position: [^123].
- The vertical bar can be used to define multi-character alternating substrings: 1|2|3.
Shorthands
There are some sets that are used more often than the other ones: sets for digits, or alphanumeric characters, or whitespace characters (there are quite a lot of whitespace characters, we must say). To make the usage of such sets easier and quicker, there are special regex characters, which are equivalent to certain sets, but have shorter "names".
The list of shorthands
There are several pre-defined shorthands for the commonly used character sets:
- \d is any digit, short for [0-9];
- \s is a whitespace character (including tab and newline), short for [ \t\n\x0B\f\r];
- \w is a word character (a letter, a digit, or an underscore), short for [a-zA-Z_0-9];
- \b is a word boundary. This one is a bit trickier as it doesn't match any specific character, but rather matches the position between a word character and a non-word character (for example, a whitespace character), or the start or the end of the string when a word character is next to it. This way, "\ba" matches an "a" at the start of a word, "a\b" matches an "a" at the end of a word, and "\ba\b" matches a separate "a" that has no word characters on either side.
There are also counterparts of these shorthands that are equivalent to the restrictive sets, and match everything except for the characters mentioned above:
- \D is a non-digit, short for [^0-9];
- \S is a non-whitespace character, short for [^ \t\n\x0B\f\r];
- \W is a non-word character, short for [^a-zA-Z_0-9];
- \B is a non-word boundary. It matches the situation opposite to that of the \b shorthand: it finds a match at every position where there is no word boundary, for example between two word characters. For example, "a\B" matches an "a" that is immediately followed by another word character, which means this "a" is not at the end of a word.
These shorthands make writing common patterns easier.
Each shorthand has the same first letter as its representation (digit, space, word, boundary). The uppercase characters are used to designate the restrictive shorthands.
Example
Let's consider an example with the listed shorthands. Remember that in Java we use an additional backslash \ for escaping.
String regex = "\\s\\w\\d\\s";
" A5 ".matches(regex); // true
" 33 ".matches(regex); // true
"\tA4\t".matches(regex); // true, because tabs are whitespace as well
"q18q".matches(regex); // false, 'q' is not a space
" AB ".matches(regex); // false, 'B' is not a digit
" -1 ".matches(regex); // false, '-' is not an alphanumeric character, but '1' is OK.
Here's how the boundary shorthand looks in Java code:
String startRegex = "\\bcat"; // matches "cat" at the start of a word
String endRegex = "cat\\b"; // matches "cat" at the end of a word
String wholeRegex = "\\bcat\\b"; // matches the whole word "cat"
So far, we are not applying them in practice, because we only deal with the matches method, which requires the full string to match the regexp.
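To get a feeling for what the boundary patterns do, here is a sketch that looks slightly ahead of this lesson. It assumes the Pattern and Matcher classes from java.util.regex, whose find method searches for a match anywhere in the text. The class and helper names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: word boundaries need a substring search to be useful.
class BoundaryDemo {

    // True if the regex matches somewhere inside the text (not the whole text).
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(found("\\bcat", "a catalog"));        // true, a word starts with "cat"
        System.out.println(found("\\bcat", "bobcat"));           // false, "cat" is not at a word start
        System.out.println(found("cat\\b", "bobcat"));           // true, a word ends with "cat"
        System.out.println(found("\\bcat\\b", "the cat sat"));   // true, "cat" is a separate word
        System.out.println(found("\\bcat\\b", "concatenate"));   // false, "cat" is inside a word
    }
}
```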
As an alternative, if you do not want to use shorthands for the "\\s\\w\\d\\s" example above, you can write the regex shown below to get the same outcome:
String regex = "[ \\t\\n\\x0B\\f\\r][a-zA-Z_0-9][0-9][ \\t\\n\\x0B\\f\\r]";
This regex, however, is long and not nearly as readable as the previous one. It also has a lot of character repetitions. You can use the predefined shorthands instead of commonly used sets and ranges to simplify your regexes and make them more readable.
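The equivalence of the two forms can be checked on the sample strings used earlier. This is a minimal sketch with hypothetical class and method names; it assumes the default matching mode, in which \d, \w and \s cover only the ASCII sets listed above.

```java
// Hypothetical demo class: shorthand pattern versus its expanded form.
class ShorthandDemo {

    static final String SHORT_FORM = "\\s\\w\\d\\s";
    static final String LONG_FORM =
            "[ \\t\\n\\x0B\\f\\r][a-zA-Z_0-9][0-9][ \\t\\n\\x0B\\f\\r]";

    // True when both patterns give the same verdict for the input.
    static boolean sameResult(String input) {
        return input.matches(SHORT_FORM) == input.matches(LONG_FORM);
    }

    public static void main(String[] args) {
        String[] samples = {" A5 ", " 33 ", "\tA4\t", "q18q", " AB ", " -1 "};
        for (String sample : samples) {
            // Prints the verdict of the short form and whether the long form agrees.
            System.out.println(sample.matches(SHORT_FORM) + " " + sameResult(sample));
        }
    }
}
```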