If you work with Python regexp functions regularly, you know that sometimes they require additional parameters. These additional parameters are called flags. In this topic, we are going to cover compilation flags that can help you with adjusting regular expression patterns.
Compilation flags
The re module provides eight compilation flags. They can be passed as values of the optional flags argument to most functions of the module, such as re.compile, re.search, re.match, and re.findall. These flags change the behavior of your patterns. Most flags have two names: a long one, re.IGNORECASE, for example, and a short one, re.I in this case. Below you can find a table with the flags that will be discussed further:
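Flags | Description |
re.IGNORECASE, re.I | Ignores case. |
re.DOTALL, re.S | Allows the . metacharacter to match any character, including a newline. |
re.MULTILINE, re.M | Allows the ^ and $ metacharacters to match at the beginning and end of each line. |
re.VERBOSE, re.X | Allows whitespaces and comments in pattern compilation. |
re.ASCII, re.A | Makes \w, \W, \b, \B, \d, \D, \s, and \S match ASCII characters only. |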
In version 3.11, a new flag re.NOFLAG was added, which indicates that no flag is applied. re.NOFLAG can be used as the default value of a function's flags argument or as a base value that other flags are conditionally combined with.
They can seem a bit overwhelming. Don't worry, we will take a look at each of them in more detail further on.
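As a quick orientation before the details, here is a minimal sketch of how flags are passed. The sample strings and the helper search_name are hypothetical, introduced only for illustration; the getattr fallback is an assumption added so the snippet also runs on Python versions older than 3.11, where re.NOFLAG does not exist.

```python
import re

# The short name is just an alias for the long one.
same_flag = re.I is re.IGNORECASE

# A flag can be given when compiling a pattern...
compiled = re.compile(r"gandalf", flags=re.IGNORECASE)
found_compiled = compiled.search("GANDALF the Grey") is not None

# ...or passed directly to a module-level function.
found_direct = re.search(r"gandalf", "GANDALF the Grey", flags=re.I) is not None

# Without the flag, the same search is case-sensitive and fails.
found_plain = re.search(r"gandalf", "GANDALF the Grey") is not None


def search_name(text, ignore_case=False):
    """Hypothetical helper: start from 'no flags' and add one conditionally."""
    flags = getattr(re, "NOFLAG", 0)  # re.NOFLAG (3.11+) has the value 0
    if ignore_case:
        flags |= re.IGNORECASE
    return re.search(r"gandalf", text, flags=flags) is not None


print(same_flag, found_compiled, found_direct, found_plain)  # True True True False
print(search_name("GANDALF"), search_name("GANDALF", ignore_case=True))  # False True
```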
Ignoring case
As you may know, regular expressions are case-sensitive by default. To make your regular expression treat uppercase and lowercase letters equally, you can use the re.IGNORECASE (re.I) flag:
import re

lower = 'you shall not pass!'
upper = 'YOU SHALL NOT PASS!'
string = 'You Shall Not Pass!'
result_lower = re.match(lower, string, flags=re.IGNORECASE)  # match
result_upper = re.match(upper, string, flags=re.IGNORECASE)  # match
Changing metacharacter behavior
re.DOTALL, or re.S for short, is another very useful flag. It makes the . metacharacter match any character, including the newline \n character. If you remember, the dot metacharacter doesn't match a newline by default:
pattern = 'precious .'
result_1 = re.search(pattern, 'my precious \n') # no match
result_2 = re.search(pattern, 'my precious \n', flags=re.DOTALL) # match
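A common practical use of re.DOTALL is capturing text that spans several lines. The following is a minimal sketch with a made-up sample text; the BEGIN and END markers are assumptions chosen only for illustration.

```python
import re

text = "BEGIN\nfirst line\nsecond line\nEND"

# Without DOTALL, '.' stops at the first newline, so BEGIN...END is never bridged.
no_flag = re.search(r"BEGIN(.*)END", text)

# With DOTALL, '.' also consumes newlines and the whole block is captured.
with_flag = re.search(r"BEGIN(.*)END", text, flags=re.DOTALL)

print(no_flag)                   # None
print(repr(with_flag.group(1)))  # '\nfirst line\nsecond line\n'
```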
By default, the ^ character matches only at the start of a string, while the $ character matches only at the end of a string (or just before a newline at the very end). The re.MULTILINE (re.M) flag makes them match at the beginning and the end of each line in a string, respectively. Let's compare:
string = '''A million dollars isn’t cool.\nYou know what’s cool?\nA billion dollars.'''
result_1 = re.findall('^(A|You)', string) # ['A']
result_2 = re.findall('^(A|You)', string, flags=re.MULTILINE) # ['A', 'You', 'A']
result_3 = re.findall('(cool.)$', string) # []
result_4 = re.findall('(cool.)$', string, flags=re.MULTILINE) # ['cool.', 'cool?']
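Using both anchors together with re.MULTILINE is a convenient way to process text line by line. The sketch below assumes a small, made-up configuration string in a "key = value" layout:

```python
import re

config = "host = localhost\nport = 8080\nmode = debug"

# With MULTILINE, ^ and $ anchor to every line, so each line is matched separately.
pairs = re.findall(r"^(\w+) = (\w+)$", config, flags=re.MULTILINE)

# Without the flag, the anchors refer to the whole string, and nothing is found.
no_pairs = re.findall(r"^(\w+) = (\w+)$", config)

print(pairs)     # [('host', 'localhost'), ('port', '8080'), ('mode', 'debug')]
print(no_pairs)  # []
```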
Making patterns more readable
Unlike the other flags, re.VERBOSE (re.X) doesn't change what your pattern matches but lets you write the pattern in a more readable form. Many regexp patterns are complex and hard to read, and re.VERBOSE allows for two things inside the pattern:
Whitespaces can be used for alignment;
Hash signs # can be used for adding comments.
Let's see how re.VERBOSE can improve the pattern that matches e-mail addresses:
pattern = re.compile(r"""
    ^([a-z0-9_\.-]+)   # username
    @                  # @ sign
    ([0-9a-z\.-]+)     # host name
    \.                 # a dot .
    ([a-z]{2,6})$      # top level domain
    """, flags=re.VERBOSE)
results = pattern.match('username@example.com')  # match
Tip: If you use the re.VERBOSE flag and want your pattern to match a literal whitespace or hash symbol, you need to escape it, for instance, \#, or place it inside a character class, for instance, [#].
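To illustrate the tip, here is a minimal sketch that compares an escaped and an unescaped hash sign. The "issue #42" text and the variable names are made up for this example.

```python
import re

# Escaped space and hash are treated as literal characters.
escaped = re.compile(r"""
    issue   # the literal word
    \       # an escaped space: matches one literal space
    \#      # an escaped hash: matches a literal hash sign
    (\d+)   # the issue number
""", flags=re.VERBOSE)

# Unescaped: the space is ignored and everything from the hash on is a comment,
# so the pattern shrinks to just "issue" and the group disappears.
unescaped = re.compile(r"issue #(\d+)", flags=re.VERBOSE)

text = "see issue #42 for details"
print(escaped.search(text).group(1))    # 42
print(unescaped.search(text).group(0))  # issue
print(unescaped.groups)                 # 0
```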
Matching ASCII only
By default, \w, \W, \b, \B, \d, \D, \s, and \S use Unicode matching in string patterns. To make them match only ASCII characters, you can use re.ASCII or re.A:
result_1 = re.findall(r'\w', 'ä, Ä, ö. Ö, ü, Ü, ß.')  # ['ä', 'Ä', 'ö', 'Ö', 'ü', 'Ü', 'ß']
result_2 = re.findall(r'\w', 'ä, Ä, ö. Ö, ü, Ü, ß.', flags=re.ASCII)  # []
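The same applies to digits: in Unicode mode, \d matches decimal digits from any script, not only 0-9. A minimal sketch, using a made-up string that ends with the Arabic-Indic digit three (written as the escape \u0663 to keep the source unambiguous):

```python
import re

text = "Room 7, floor \u0663"  # U+0663 is ARABIC-INDIC DIGIT THREE

unicode_digits = re.findall(r"\d", text)
ascii_digits = re.findall(r"\d", text, flags=re.ASCII)

print(unicode_digits == ["7", "\u0663"])  # True
print(ascii_digits)                        # ['7']
```

This matters when validating input such as numeric IDs, where accepting only 0-9 is usually the intent.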
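One last thing. If you are going to use several flags at the same time, combine them with the bitwise OR operator |. The + operator also works when every flag appears only once, but | is the conventional choice because it stays correct even if the same flag is listed twice: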
string = "A million dollars isn’t cool.\nYou know what’s cool?\nA billion dollars."
result = re.findall('^(a|you)', string, flags=re.IGNORECASE | re.MULTILINE)
print(result)  # ['A', 'You', 'A']
Conclusion
In this topic, we have covered the main compilation flags that can improve your regexes. Let's recap them:
Compilation flags are passed as values of the optional flags argument;
To ignore letter case, use re.IGNORECASE;
To make the . metacharacter match newlines, use re.DOTALL;
To make the ^ and $ metacharacters match the beginning and end of each line, use re.MULTILINE;
If you want to add comments and whitespaces inside the regexp pattern, use re.VERBOSE, but don't forget to escape the hash symbol and whitespace if you want your pattern to match them;
And finally, to make your pattern match ASCII characters exclusively, make use of re.ASCII. Remember that by default, regexps match Unicode characters.