In the previous topics, we learned about the dot and the question mark in the regexp language, the ways of escaping them, and the fact that other regexp metacharacters exist. Now it is time to look at some of those other metacharacters and see how they function.
First of all, let's start with the sets.
Sets
While the dot allows us to match almost every possible character, the sets provide us with the opportunity to be more specific in our regexp templates and narrow down the scope of our search. Each set in the regular expression takes the place of exactly one character in the string, but it defines a whole group of characters that can match it. These characters are listed inside the square brackets, []:
import re

template = '[bd]a[td]'
In the template above, we have two defined sets. The first one corresponds to either b or d in the string, the second one to either t or d. Here are the results for some of the possible strings:
re.match(template, 'bat') # match
re.match(template, 'dad') # match
re.match(template, 'cat') # no match: 'c' is not in the first set
re.match(template, 'dot') # no match: 'o' instead of 'a'
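Since each set stands for exactly one character, the strings a template can accept may be enumerated by picking one character per position. The sketch below is only an illustration: it uses re.fullmatch (a function not used elsewhere in this topic) to require that the whole string matches, and the candidate letters are chosen arbitrarily.

```python
import re
from itertools import product

template = '[bd]a[td]'

# Build every three-letter candidate by picking one character per position.
candidates = [''.join(chars) for chars in product('bcd', 'ao', 'td')]

# Keep only the candidates that the template matches completely.
matching = [word for word in candidates if re.fullmatch(template, word)]
print(matching)  # ['bat', 'bad', 'dat', 'dad']
```

Two sets with two characters each give exactly 2 * 2 = 4 matching strings.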
An empty set causes an error, because a ] placed right after the opening [ is treated as a regular character, so the set is never closed:
re.match('c[]at', 'cat') # raises re.error: unterminated character set at position 1
An unescaped left square bracket, for which no unescaped right square bracket was found, causes the same kind of error:
re.match('[', '[') # raises re.error: unterminated character set at position 0
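If a template comes from user input, it may be worth checking it before use. Here is a minimal sketch, assuming that catching re.error is enough for your purposes; is_valid_pattern is a hypothetical helper name, and the exact error message depends on the Python version.

```python
import re

def is_valid_pattern(pattern):
    """Hypothetical helper: True if the pattern compiles, False otherwise."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(is_valid_pattern('c[]at'))  # False: the set is never closed
print(is_valid_pattern('['))      # False: the same problem
print(is_valid_pattern('c[a]t'))  # True
print(is_valid_pattern(r'\['))    # True: the bracket is escaped
```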
By the way, good news, everyone! There is (almost) no need for boring escaping stuff when we use sets in regexp.
Escaping in sets
Sets in regular expressions have a sort of superpower: they automatically "neutralize" the metacharacters listed inside the square brackets, turning them into regular characters. This way, the dot and the question mark, for example, do not have to be escaped if they are part of a regexp set:
template = 'Hodor[?.]'
re.match(template, 'Hodor?') # match
re.match(template, 'Hodor.') # match
re.match(template, 'Hodor!') # no match
The only metacharacters that do not fall under this rule and keep their special status are, predictably, the right square bracket ] and the backslash \. The right square bracket should be escaped to show that it is a part of the set, not the metacharacter denoting its end:
template = r'=[\]]'
re.match(template, '=]') # match
template = r'=[)]]'
re.match(template, '=]') # no match: the set [)] ends at the first ], and the second ] is a regular character
re.match(template, '=)]') # match (the template only matches strings that start with '=)]')
The backslash in sets, like everywhere else, serves as the starting symbol of escape sequences. So, if you just want to have a backslash in your set, you have to relieve it from this burden by escaping it with a second backslash. Here the backslash is escaped and matches itself in the string:
template = r'¯[\\]_'
re.match(template, r'¯\_(ツ)_/¯') # match
# remember that re.match checks whether regexp matches the beginning of the string, not the whole string
Here the backslash is not escaped and serves as a part of the escape sequence:
template = r'¯[\t]_'
re.match(template, r'¯\_(ツ)_/¯') # no match
re.match(template, '¯\t_') # match
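The number of backslashes depends on whether the template is written as a raw string. The following sketch compares both spellings of a set that holds a backslash and then repeats the tab example; the variable names are made up for illustration.

```python
import re

# Raw string: the two backslashes reach the regexp engine unchanged.
raw_template = r'[\\]'
# Regular string: Python itself turns four backslashes into two first.
plain_template = '[\\\\]'

print(raw_template == plain_template)  # True
print(len(raw_template))               # 4 characters: [ \ \ ]

backslash = '\\'  # a string holding a single backslash
print(bool(re.match(raw_template, backslash)))  # True

# Unescaped backslash: \t inside the set is the tab escape sequence.
tab_template = r'[\t]'
print(bool(re.match(tab_template, '\t')))       # True: a tab character
print(bool(re.match(tab_template, 't')))        # False
print(bool(re.match(tab_template, backslash)))  # False
```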
By the way, you can still escape any punctuation character in the set (even if it is not ] or \): this won't change the set of matching characters. Letters and digits are a different story, because a backslash in front of them either starts an escape sequence such as \t or is rejected as an error. Additional escaping characters also make your regular expression more difficult to read and understand, and, believe us, this is a thing to avoid: real-life regexps are usually barely comprehensible even without unnecessary characters.
Apart from "ordinary" regexp metacharacters, there are, however, some characters that acquire a special meaning only when they are used inside the square brackets.
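As a quick check of this claim, the sketch below runs the Hodor template with and without redundant backslashes and compares the results. It also shows one case where a backslash does change the meaning, using the \t sequence from the example above.

```python
import re

plain = 'Hodor[?.]'
escaped = r'Hodor[\?\.]'  # the same set, only harder to read

results = []
for text in ['Hodor?', 'Hodor.', 'Hodor!']:
    results.append((bool(re.match(plain, text)), bool(re.match(escaped, text))))
print(results)  # [(True, True), (True, True), (False, False)]

# A backslash before a letter is not redundant: \t means a tab, not the letter t.
print(bool(re.match('[t]', 't')))    # True
print(bool(re.match(r'[\t]', 't')))  # False
```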
Ranges
One of the main things about sets is that you may not only list the characters individually but also use ranges of characters. A range is designated by a dash -. For example, if you want your set to match every letter from a to z, you do not have to list out the whole alphabet, you can simply write [a-z].
re.match('ja[a-z].', 'jazz') # match
re.match('[A-Z]ill', 'kill') # no match: [A-Z] matches only uppercase letters
re.match('[A-Z]ill', 'Bill') # match
[0-9] does the same for the digits. Note that regular expressions match characters, not numbers. So, the template [1-100] is just the range 1-1 followed by two zeros: it matches only the characters 1 and 0, not all numbers in the range from 1 to 100.
re.match('[0-9]', '7') # match
re.match('[0-9]', '07') # match
re.match('[1-9]', '07') # no match
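To see why a template like [1-100] does not cover the numbers from 1 to 100, you can test it against every digit. The second part is a minimal sketch of one possible way to check such numbers with sets: is_1_to_100 is a hypothetical helper, it relies on the question mark from the previous topics to make the second digit optional, and it uses re.fullmatch so that the whole string has to match.

```python
import re
import string

# Which single digits does the set [1-100] accept?
accepted = [d for d in string.digits if re.fullmatch('[1-100]', d)]
print(accepted)  # ['0', '1']

def is_1_to_100(text):
    """Hypothetical helper: True for a number from 1 to 100 without leading zeros."""
    # [1-9] is the first digit, [0-9]? is an optional second digit (1..99);
    # 100 is handled separately to keep the template simple.
    return text == '100' or re.fullmatch('[1-9][0-9]?', text) is not None

print([is_1_to_100(s) for s in ['1', '42', '100', '0', '07', '101']])
# [True, True, True, False, False, False]
```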
Several ranges can be easily put in one set. They do not have to follow each other in any way:
re.match('love [a-zA-Z]', 'love U') # match: [a-zA-Z] matches both uppercase and lowercase
re.match('love [a-z!A-Z]', 'love !') # match: [a-z!A-Z] matches letters and !
The characters that fall within the range are determined by their positions (code points) in the ASCII / Unicode table, so be careful when defining ranges: they may include something unexpected or exclude something that was meant to be in your set.
re.match('[A-Z]bermensch', 'Übermensch') # no match: Ü is not within A-Z range
re.match('[À-Ý]bermensch', 'Übermensch') # match
re.match('[À-Ý]bermensch', '×bermensch') # match: × is within À-Ý range
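The built-in ord function shows where each character sits in the Unicode table, which explains the results above. The sketch below prints the code points and then, as one possible workaround, splits the range in two so that the multiplication sign is left out; the variable names are made up for illustration.

```python
import re

# Code points of the characters from the examples above.
for ch in 'AZÀ×ÜÝ':
    print(ch, hex(ord(ch)))

# Characters inside the À-Ý range that are not letters.
non_letters = [chr(c) for c in range(ord('À'), ord('Ý') + 1) if not chr(c).isalpha()]
print(non_letters)  # ['×']

# Two ranges that skip the multiplication sign.
letters_only = '[À-ÖØ-Ý]bermensch'
print(bool(re.match(letters_only, 'Übermensch')))  # True
print(bool(re.match(letters_only, '×bermensch')))  # False
```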
To use the dash - as a regular character in a set, you should "strip" it of the left or right character defining the range, so just put the dash in the first or last position in the set, [-abc] or [abc-]:
re.match('[-1-9]1', '-1') # match
re.match('[1-9-]1', '-1') # match
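The position of the dash decides whether it defines a range or stands for itself. Here is a small comparison; matches is a hypothetical helper, and the escaped form \- is a third option that follows from the escaping rules for sets.

```python
import re

def matches(template, text):
    """Hypothetical helper: True if the template matches the beginning of text."""
    return re.match(template, text) is not None

# Dash between two characters: a range from a to c.
print(matches('[a-c]', 'b'), matches('[a-c]', '-'))      # True False
# Dash in the last position: a regular character.
print(matches('[ac-]', 'b'), matches('[ac-]', '-'))      # False True
# Escaped dash: a regular character in any position.
print(matches(r'[a\-c]', 'b'), matches(r'[a\-c]', '-'))  # False True

# A range written backwards is rejected by the regexp engine.
try:
    re.compile('[c-a]')
except re.error:
    print('invalid range')
```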
Take a look at the table summarizing some of the ranges you might want to use in your programs:
| Set      | Matches                                  |
| [a-z]    | Lowercase Latin letters                  |
| [A-Z]    | Capital Latin letters                    |
| [a-zA-Z] | Both lowercase and capital Latin letters |
| [0-9]    | Digits                                   |
Sets can also be handy in case you want to ban characters from your template. Let's see how this is done!
Exclusion of characters
The hat (also known as the caret) symbol ^ is also a specific set metacharacter: whenever it is placed as the first character in the set, it makes the set specify the characters you do not want to see in the string. Any character that is not a part of such a set will match it:
re.match('[^A-Z]ond', 'Bond') # no match
re.match('Bon[^A-Z]', 'Bond') # match
The hat placed anywhere else in the set, except for the first position, will lose its special meaning and become a regular character:
re.match('[A-Z^]ames', 'James') # match
re.match('[A-Z^]ames', '^ames') # match
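One detail worth checking: a set with the hat still takes the place of exactly one character, so it cannot match "nothing". The sketch below illustrates this, along with a set that excludes the hat itself.

```python
import re

print(bool(re.match('Bon[^A-Z]', 'Bond')))  # True: 'd' is not an uppercase letter
print(bool(re.match('Bon[^A-Z]', 'Bon7')))  # True: neither is '7'
print(bool(re.match('Bon[^A-Z]', 'BonD')))  # False: 'D' is excluded
print(bool(re.match('Bon[^A-Z]', 'Bon')))   # False: there is no character left to match

# The first ^ excludes, the second one is a regular character.
print(bool(re.match('[^^]', '^')))  # False
print(bool(re.match('[^^]', 'a')))  # True
```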
That is pretty much it about sets. Four metacharacters associated with them are easy to remember, but can turn you into a real regexp wizard!
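When a set has to be built from characters that are not known in advance, the four metacharacters can be neutralized automatically. Here is a minimal sketch, assuming the standard re.escape function is acceptable for this job; char_set is a hypothetical helper, and the escaped result is not the shortest possible template.

```python
import re

def char_set(chars):
    """Hypothetical helper: build a set that matches any single character from chars."""
    if not chars:
        raise ValueError('a set needs at least one character')
    # re.escape puts a backslash before ], \, ^ and - (among other characters).
    return '[' + re.escape(chars) + ']'

special = ']\\^-'  # the four set metacharacters
template = char_set(special)
print(template)

print(all(re.fullmatch(template, ch) for ch in special))  # True
print(bool(re.fullmatch(template, 'a')))                  # False
```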
Just in case you forgot: there are a lot of websites where you can test your regular expression, see what could be wrong with it, and correct it. For example, Regex 101 works just fine.
Summary
Let's see what we have learned in this topic!
- the square brackets [] are used to designate sets in regular expressions;
- a set specifies a group of characters, any one of which can match at its position in the string;
- only the backslash \ and the closing bracket ] should be escaped in your sets;
- the dash - allows us to easily put a range of characters in the set;
- a set with the hat ^ in the first position matches every character that is not listed in it.