Regexp template syntax is largely the same across major programming languages. What distinguishes Python regexps from those of other languages is the set of functions and objects provided by the re module.
First of all, it helps to know exactly what you get when your regexp template matches a string. In this topic, we'll take a look at the Match object and its methods. We'll also check out the flags that can be passed to any matching function. Flags are a small part of the re module, but they can simplify regexp processing a lot.
Match object
As you may remember from the introductory topic on Python regexps, when your regexp template isn't found in a string, the matching function returns None.
import re
template = "match"
no_match_result = re.match(template, "no match")
print(no_match_result is None) # the output is True
However, when the match is successful and an example of your template is present in the string, match() returns a so-called Match object containing the data about the matching substring: its contents (shown as match when the object is printed) and its starting/ending indexes in the original string (shown as span).
match_result = re.match(template, "match")
print(match_result) # the output is "<re.Match object; span=(0, 5), match='match'>" (older Python versions print "<_sre.SRE_Match object; ...>")
When a Match object is converted to boolean, it is always True, while None is always False. This way, you can use a simple conditional statement if match: to check whether your template matches the string. The condition is False if the matching function returned None and True if it returned a Match object.
if no_match_result:
    print("That's a match!")
else:
    print("No match")
# the output is "No match"
if match_result:
    print("That's a match!")
else:
    print("No match")
# the output is "That's a match!"
This check is usually required to avoid errors in your code: None doesn't support the operations that a Match object does, so calling a Match method on it fails. What can we do with a Match object, then? Let's see...
group() method
What makes Match objects useful is that they contain important information on the results of the matching operation. By calling the group() method with no arguments, you can extract the substring that matches your template:
template = "b.d" # matches "b", then any single character (except a newline), then "d"
match_1 = re.match(template, "bad").group() # the result is a string "bad"
match_2 = re.match(template, "bed").group() # the result is a string "bed"
If you try to call group() on None, it will raise an AttributeError:
re.match(template, "dub").group() # AttributeError: 'NoneType' object has no attribute 'group'
This is the reason why you may want to add the conditional statement we've discussed above if you're not sure that your template will match the string.
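The check described above can be wrapped in a small helper. This is only a minimal sketch: the function name first_match_text is made up for illustration and is not part of the re module.

```python
import re


def first_match_text(template, string):
    """Return the matching substring, or None if the template doesn't match."""
    match = re.match(template, string)
    if match:  # a Match object is truthy, None is falsy
        return match.group()
    return None


print(first_match_text("b.d", "bad"))  # bad
print(first_match_text("b.d", "dub"))  # None
```

With the check in place, a failed match produces None instead of an AttributeError.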
Extracting match indexes
You may need to know where the matching substring starts, in other words, the starting index of the matching substring in the original string. It can be done with the start() method that returns a starting index (an integer):
start = re.match(template, "bad").start() # the result is integer 0
This method may seem redundant when you use the match() function: match() only looks for a match at the beginning of the string, so start() always returns 0. However, with other matching functions that you will encounter a bit later, start() becomes more useful.
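As a brief preview, here is a sketch with re.search(), one of those other matching functions. It scans the whole string, so the starting index doesn't have to be 0. The sample string is made up for illustration.

```python
import re

template = "b.d"
string = "a bad idea"

# match() only tries the beginning of the string, so it finds nothing here
print(re.match(template, string))  # None

# search() looks for the first match anywhere in the string
found = re.search(template, string)
if found:
    print(found.group())  # bad
    print(found.start())  # 2
```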
You can similarly extract the ending index of the matching substring with the end() method:
template = "100%?" # matches strings "100" or "100%"
end_1 = re.match(template, "100").end() # the result is integer 3
end_2 = re.match(template, "100%").end() # the result is integer 4
As you can see, the end() method doesn't return the index of the last character of the matching substring. It returns that index plus 1, that is, the position right after the match. This makes slicing convenient:
template = "100%?"
string = "100% reason to remember the name"
end = re.match(template, string).end()
print(string[end:])
# the output is " reason to remember the name"
To extract both indexes at the same time, use the span() method. Instead of a single integer, it returns a tuple with two integer elements, the starting and the ending index of the matching substring.
span = re.match(template, "100%").span() # the result is tuple (0, 4)
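The two indexes from span() can be unpacked and used directly in a slice. The sketch below, which reuses the template and string from the slicing example, shows that the slice gives the same substring as group().

```python
import re

template = "100%?"
string = "100% reason to remember the name"

match = re.match(template, string)
if match:
    start, end = match.span()  # unpack the tuple into two integers
    print((start, end))        # (0, 4)
    print(string[start:end])   # 100%
    print(string[start:end] == match.group())  # True
```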
Next, let's check out function flags: optional arguments that change how the regexp engine performs matching.
Function flags
It's important to keep in mind that regular expressions are case-sensitive by default: uppercase and lowercase letters are treated as different characters.
lower = r'where is the money, Lebowski\?'
upper = r'WHERE IS THE MONEY, Lebowski\?'
# These are different templates
To make your regex case-insensitive, you can use a special flag when you call the match() function (or any other function for matching). It's called re.IGNORECASE. Pass it to the function as the value of the optional flags argument:
lower = r'where is the money, Lebowski\?'
upper = r'WHERE IS THE MONEY, Lebowski\?'
string = 'Where Is the money, lebowski?'
result_lower = re.match(lower, string, flags=re.IGNORECASE) # match
result_upper = re.match(upper, string, flags=re.IGNORECASE) # match
re.DOTALL is another very useful flag. It is passed in the same fashion and makes the dot character match every character, including \n (as you probably remember, by default, a dot character doesn't match a newline character).
dot_template = 'new line .'
no_flag = re.match(dot_template, 'new line \n') # None
with_flag = re.match(dot_template, 'new line \n', flags=re.DOTALL) # match
To enable several flags at once, pass their sum as the flags value or use the | operator.
result = re.match('FLAG ME.', 'flag me\n', flags=re.IGNORECASE + re.DOTALL) # match
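Here is a sketch of the same call with the | operator, which is the more common way to combine flags. The variable name combined is chosen for illustration. For two different flags, both ways give the same value, but | stays correct even if the same flag is accidentally included twice.

```python
import re

combined = re.IGNORECASE | re.DOTALL

result = re.match('FLAG ME.', 'flag me\n', flags=combined)
print(result is not None)  # True

# for two different flags, the sum and | produce the same value
print(combined == re.IGNORECASE + re.DOTALL)  # True

# repeating a flag is harmless with |, but changes the value with +
print((re.IGNORECASE | re.IGNORECASE) == re.IGNORECASE)  # True
print((re.IGNORECASE + re.IGNORECASE) == re.IGNORECASE)  # False
```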
These flags can make your regexps concise and readable. There are many other flags in the re module; you can read more about them in the documentation if you're curious.
Summary
In this topic we have discussed the following points:
- when a regexp template is not found in a string, None is returned by the matching function;
- when a regexp template is found in a string, you get a Match object as the result of the matching function call;
- to extract a substring that matches your template, use the group() method of the Match object;
- to extract indexes of the matching substring in the original string, use the start(), end() or span() methods;
- flags can simplify your regexps: re.IGNORECASE makes your regexp case-insensitive, and re.DOTALL makes the dot character match \n as well as any other possible character.