Regexp flags in Python

If you work with Python regexp functions regularly, you know that sometimes they require additional parameters. These additional parameters are called flags. In this topic, we are going to cover compilation flags that can help you with adjusting regular expression patterns.

Compilation flags

Python's re module provides eight compilation flags. They can be passed through the optional flags argument of most re functions, such as re.compile, re.search, re.match, and re.findall. These flags change the behavior of your patterns. Most flags have two names: a long one, re.IGNORECASE, for example, and a short one, re.I in this case. Below you can find a table with the flags that will be discussed further:

Flag	Description
`re.IGNORECASE` or `re.I`	Makes matching case-insensitive.
`re.DOTALL` or `re.S`	Allows the `.` metacharacter to match a newline.
`re.MULTILINE` or `re.M`	Allows the `^` and `$` metacharacters to match at the start and end of each line.
`re.VERBOSE` or `re.X`	Allows whitespace and comments inside the pattern.
`re.ASCII` or `re.A`	Makes `\w`, `\W`, `\b`, `\B`, `\d`, `\D`, `\s`, `\S` match only ASCII characters.
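
As a quick check of the naming scheme, the sketch below shows that the long and short names refer to the same flag, and that a flag can be given either to re.compile or directly to a matching function. The sample pattern and text are made up for illustration.

```python
import re

# The short and long names are aliases of the same flag object.
same_flag = re.I is re.IGNORECASE

# Illustrative text (not taken from the topic itself).
text = 'Hello, World'

# Option 1: pass the flag when compiling the pattern.
compiled = re.compile('world', flags=re.I)
found_compiled = compiled.search(text)

# Option 2: pass the flag directly to a module-level function.
found_direct = re.search('world', text, flags=re.IGNORECASE)

print(same_flag)               # True
print(found_compiled.group())  # World
print(found_direct.group())    # World
```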

In Python 3.11, a new flag, re.NOFLAG, was added. It indicates that no flags are applied. re.NOFLAG can be used as the default value of a flags parameter in your own functions.
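
Here is a minimal sketch of that use, assuming a hypothetical helper called count_matches that is not part of the re module. Because re.NOFLAG exists only in Python 3.11 and later, the sketch falls back to 0, the numeric value that means "no flags".

```python
import re

# re.NOFLAG was added in Python 3.11; on older versions fall back to 0,
# which is the numeric value meaning "no flags".
NO_FLAGS = getattr(re, 'NOFLAG', 0)


def count_matches(pattern, text, flags=NO_FLAGS):
    """Hypothetical helper: count non-overlapping matches of pattern in text."""
    return len(re.findall(pattern, text, flags=flags))


text = 'Ring, ring, RING'  # illustrative input
print(count_matches('ring', text))                       # 1
print(count_matches('ring', text, flags=re.IGNORECASE))  # 3
```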

They can seem a bit overwhelming. Don't worry, we will take a look at each of them in more detail further on.

Ignoring case

As you may know, regular expressions are case-sensitive by default. To make your regular expression treat uppercase and lowercase letters equally, you can use the re.IGNORECASE (re.I) flag:

import re

lower = 'you shall not pass!'
upper = 'YOU SHALL NOT PASS!'
string = 'You Shall Not Pass!'
result_lower = re.match(lower, string, flags=re.IGNORECASE)  # match
result_upper = re.match(upper, string, flags=re.IGNORECASE)  # match
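
To see what the flag changes, it can help to run the same match without it. The sketch below reuses the strings from the example above; the variable names strict and relaxed are only illustrative.

```python
import re

lower = 'you shall not pass!'
string = 'You Shall Not Pass!'

# Without the flag, the lowercase pattern does not match the mixed-case string.
strict = re.match(lower, string)

# With re.I (the short name of re.IGNORECASE), letter case is ignored.
relaxed = re.match(lower, string, flags=re.I)

print(strict)           # None
print(relaxed.group())  # You Shall Not Pass!
```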

Changing metacharacter behavior

re.DOTALL, or re.S for short, is another very useful flag. It makes the special . character match any character, including the newline character \n. If you remember, the dot metacharacter doesn't match a newline by default:

pattern = 'precious .'
result_1 = re.search(pattern, 'my precious \n')  # no match
result_2 = re.search(pattern, 'my precious \n', flags=re.DOTALL)  # match
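
A common use of re.DOTALL is capturing a block of text that spans several lines. The following sketch uses a made-up sentence with a bracketed part that contains a newline; the pattern and names are illustrative only.

```python
import re

# Illustrative text: the bracketed part spans two lines.
text = 'He said: [one ring\nto rule them all] and left.'

pattern = r'\[(.*)\]'

# Without the flag, . stops at the newline, so the closing bracket is never reached.
single_line = re.search(pattern, text)

# With re.DOTALL, . also consumes the newline and the whole block is captured.
multi_line = re.search(pattern, text, flags=re.DOTALL)

print(single_line)                 # None
print(repr(multi_line.group(1)))   # 'one ring\nto rule them all'
```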

By default, the ^ character matches only at the start of a string, while the $ character matches only at the end of the string (or just before a newline at the very end). The re.MULTILINE (re.M) flag makes them also match at the beginning and the end of each line in a string, respectively. Let's compare:

string = '''A million dollars isn’t cool.\nYou know what’s cool?\nA billion dollars.'''

result_1 = re.findall('^(A|You)', string)  # ['A']
result_2 = re.findall('^(A|You)', string, flags=re.MULTILINE)  # ['A', 'You', 'A']

result_3 = re.findall('(cool.)$', string)  # []
result_4 = re.findall('(cool.)$', string, flags=re.MULTILINE)  # ['cool.', 'cool?']
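
A typical use of re.MULTILINE is picking out whole lines of a certain kind from a longer text. The sketch below, based on made-up configuration text, collects the lines that begin with a # sign and compares the result with the default behavior.

```python
import re

# Illustrative multi-line text.
config = '# server settings\nhost = localhost\n# port is optional\nport = 8080'

# With the flag, ^ anchors at the start of every line and $ at the end of every line.
comments = re.findall(r'^#.*$', config, flags=re.MULTILINE)

# Without the flag, ^ and $ refer to the whole string, and . cannot cross
# the newlines in between, so nothing matches.
comments_default = re.findall(r'^#.*$', config)

print(comments)          # ['# server settings', '# port is optional']
print(comments_default)  # []
```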

Making patterns more readable

Unlike other flags, re.VERBOSE (re.X) doesn't change what your pattern matches; instead, it lets you write the pattern in a more readable way. Regexp patterns are often complex and hard to read, and re.VERBOSE allows for two things when you write them:

Whitespace can be used for alignment;
Hash signs # can be used for adding comments.

Let's see how re.VERBOSE can improve the pattern that matches e-mail addresses:

pattern = re.compile(r"""
                      ^([a-z0-9_\.-]+)               # username
                       @                             # @ sign
                      ([0-9a-z\.-]+)                 # host name
                       \.                            # a dot .
                      ([a-z]{2,6})$                  # top level domain
                      """, flags=re.VERBOSE)

results = pattern.match('user@example.com')  # match (placeholder address)

Tip: With the re.VERBOSE flag, whitespace in the pattern is ignored and a hash symbol starts a comment. If you want the pattern to match them literally, you need to escape them, for instance, \# for a hash sign.
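
The effect of escaping can be sketched with two versions of the same made-up pattern, one with the space and hash sign escaped and one without. The patterns, the names escaped and unescaped, and the input text are all assumptions chosen for illustration.

```python
import re

# Escaped: "\ " is a literal space and "\#" is a literal hash sign.
escaped = re.compile(r"""
    issue\ \#    # the word issue, an escaped space, and an escaped hash
    (\d+)        # the issue number
    """, flags=re.VERBOSE)

# Not escaped: the space is dropped and everything after the hash sign on
# that line becomes a comment, so the pattern is effectively just "issue".
unescaped = re.compile(r"""
    issue #(\d+)
    """, flags=re.VERBOSE)

text = 'see issue #42'  # illustrative input

print(escaped.search(text).group(1))   # 42
print(unescaped.search(text).group())  # issue
print(unescaped.groups)                # 0
```

Note that the unescaped version still compiles without any error, which is why this kind of mistake is easy to overlook.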

Matching ASCII only

By default, \w, \W, \b, \B, \d, \D, \s and \S work with the full Unicode character set. To make them match only ASCII characters, you can use re.ASCII or re.A:

result_1 = re.findall(r'\w', 'ä, Ä, ö. Ö, ü, Ü, ß.')  # ['ä', 'Ä', 'ö', 'Ö', 'ü', 'Ü', 'ß']
result_2 = re.findall(r'\w', 'ä, Ä, ö. Ö, ü, Ü, ß.', flags=re.ASCII)  # []
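
The same applies to digits. As a small sketch with made-up text, \d matches digits from other scripts by default, but only 0-9 with re.ASCII. The Arabic-Indic digit three is written as the escape \u0663 to keep the source ASCII-only.

```python
import re

# Illustrative text mixing ASCII digits with the Arabic-Indic digit three (U+0663).
text = 'Room 42, floor \u0663'

unicode_digits = re.findall(r'\d', text)
ascii_digits = re.findall(r'\d', text, flags=re.ASCII)

print(len(unicode_digits))  # 3
print(ascii_digits)         # ['4', '2']
```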

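One last thing. If you are going to use several flags at the same time, combine them with the bitwise OR operator |. The + operator also happens to work for distinct flags, but | is the safer choice because adding the same flag twice produces a different value: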

string = "A million dollars isn’t cool.\nYou know what’s cool?\nA billion dollars."

result = re.findall('^(a|you)', string, flags=re.IGNORECASE | re.MULTILINE)
print(result)  # ['A', 'You', 'A']
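
The sketch below inspects a combined flag value and illustrates why | is generally a safer way to combine flags than +. The variable names are illustrative.

```python
import re

combined = re.IGNORECASE | re.MULTILINE

# Membership tests show which flags a combination contains.
has_ignorecase = re.IGNORECASE in combined
has_dotall = re.DOTALL in combined

# OR-ing a flag with itself changes nothing, but adding it to itself
# produces a different number, which no longer means IGNORECASE.
or_twice = re.IGNORECASE | re.IGNORECASE
add_twice = re.IGNORECASE + re.IGNORECASE

print(has_ignorecase)              # True
print(has_dotall)                  # False
print(or_twice == re.IGNORECASE)   # True
print(add_twice == re.IGNORECASE)  # False
```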

Conclusion

In this topic, we have covered the main compilation flags that can improve your regexes. Let's recap them:

Compilation flags are passed through the optional flags argument;
To ignore letter case, use re.IGNORECASE;
To make the . metacharacter match newlines, use re.DOTALL;
To make the ^ and $ metacharacters match the beginning and end of each line, use re.MULTILINE;
If you want to add comments and whitespace inside the regexp pattern, use re.VERBOSE, but don't forget to escape the hash symbol and whitespace if you want your pattern to match them;
And finally, to make your pattern match ASCII characters exclusively, make use of re.ASCII. Remember that by default, regexps match Unicode characters.
