
In previous topics, we covered the basics of regular expressions in Python. However, the only regexp function we have used so far is match(). In this topic, we will cover the other main regexp functions, which make string matching more flexible.

Main functions

Take a look at the table below to find the most common functions and their descriptions:

Function | Description

re.match(pattern, string, flags=0) | Checks whether a pattern is present at the beginning of a string.

re.search(pattern, string, flags=0) | Checks whether a pattern is present anywhere in a string.

re.findall(pattern, string, flags=0) | Returns all non-overlapping matches as a list. With one capturing group, returns only the text matched by that group. With more than one group, returns tuples of the groups.

re.finditer(pattern, string, flags=0) | Returns all non-overlapping matches as an iterator of Match objects.

re.split(pattern, string, maxsplit=0, flags=0) | Splits a string by the occurrences of a pattern. If the pattern contains capturing groups, the text matched by the groups is also included in the result. Text matched by the pattern outside the groups is not.

re.sub(pattern, repl, string, count=0, flags=0) | Searches for a pattern and replaces it with a specified piece of text.

re.compile(pattern, flags=0) | Compiles a pattern for reuse.

In the following sections, we will cover them one by one.

Matching the beginning of the string

The match() function takes a regular expression pattern and a string as arguments and checks whether the beginning of the string matches the pattern. It returns a special Match object when a match is found and None otherwise. Let's recall how it works. Consider the following snippet:

import re

string = "roads? where we're going we don't need roads."

# raw strings (r'...') keep backslashes such as \? intact
result_1 = re.match(r'roads\?', string)  # match
result_2 = re.match(r'roads\.', string)  # no match

As you can see, result_2 is None: the string begins with roads?, not roads., so the pattern does not match at the beginning of the string, even though roads. appears at the end.
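Because match() returns None when nothing matches, calling a method on the result without checking it first raises an AttributeError. Below is a minimal sketch of the usual guard. The helper name describe_match is made up for illustration and is not part of the re module:

```python
import re

string = "roads? where we're going we don't need roads."


def describe_match(pattern, text):
    # describe_match is a hypothetical helper, not part of the re module
    result = re.match(pattern, text)
    if result is None:
        return 'no match'
    # group() returns the matched text, span() its (start, end) indices
    return f'{result.group()} at {result.span()}'


print(describe_match(r'roads\?', string))  # roads? at (0, 6)
print(describe_match(r'roads\.', string))  # no match
```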

Matching any string part

The second function is search(). It is very similar to match(): it also takes a regular expression pattern as the first argument and a string as the second. The difference is that search() looks for a match anywhere in the string, not only at the beginning. Like match(), it returns a Match object if there is a match and None otherwise:

string = "roads? where we're going we don't need roads."

result_1 = re.search(r'roads\?', string)  # match
result_2 = re.search(r'roads\.', string)  # match
result_3 = re.search('Roads', string)  # no match: the search is case-sensitive
result_4 = re.search('here', string)  # match: found inside 'where'

Both search() and match() return only the first pattern occurrence in the string. For example, if you search for the roads pattern, only the first occurrence is reported, even though the word appears twice:

string = "roads? where we're going we don't need roads."

result = re.search('roads', string)
print(result) # <re.Match object; span=(0, 5), match='roads'>
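The Roads search shown earlier fails only because matching is case-sensitive by default. The flags parameter from the function signatures can change that. The sketch below assumes the standard re.IGNORECASE flag, which this topic does not otherwise cover, and also shows that the returned Match object still describes the first occurrence only:

```python
import re

string = "roads? where we're going we don't need roads."

# case-sensitive by default: 'Roads' is not in the string
print(re.search('Roads', string))  # None

# re.IGNORECASE makes the match case-insensitive
result = re.search('Roads', string, flags=re.IGNORECASE)
print(result.group())  # roads
print(result.span())   # (0, 5)
```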

Finding all matches

You may wonder what to do if you want to find all pattern occurrences in a string. In this case, the findall() function comes to the rescue. Like the functions discussed above, findall() takes a pattern and a string as arguments. There is one important difference: the function returns not a Match object but a list of the strings that match the pattern. If there are no matches, it returns an empty list:

string = "A million dollars isn’t cool. You know what’s cool? A billion dollars."

result_1 = re.findall('[mb]illion', string)  # ['million', 'billion']
result_2 = re.findall('thousand', string)  # []

Note that the findall() function returns a list of tuples when a pattern contains two or more capturing groups. Each tuple holds the text matched by each group. Let’s have a look at the following example:

string = '3 apples, 2 bananas, 5 pears, 10 strawberries'

results = re.findall(r'(\d+) (\w+)', string)
print(results)
# [('3', 'apples'), ('2', 'bananas'), ('5', 'pears'), ('10', 'strawberries')]

This can be helpful as now you can loop over it to do the computation for each tuple. For instance, you can count the total number of all the fruit you have bought.
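As a small sketch of that idea, the quantities captured by the first group can be converted to integers and added up. The variable names here are chosen for illustration:

```python
import re

string = '3 apples, 2 bananas, 5 pears, 10 strawberries'

# each tuple holds (quantity, fruit name), both as strings
pairs = re.findall(r'(\d+) (\w+)', string)

# convert the quantities to int before adding them up
total = sum(int(quantity) for quantity, _ in pairs)
print(total)  # 20
```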

Be careful when you have a pattern with one capturing group. In this case, findall() will return strings that are matched by this group only. The strings matched by a pattern outside of that group will be omitted:

string = '3 apples, 2 bananas, 5 pears, 10 strawberries'

results = re.findall(r'(\d+) \w+', string)
print(results)  # ['3', '2', '5', '10']

There is another function, finditer(), that also finds all non-overlapping pattern matches in a string. Unlike findall(), it returns an iterator of Match objects instead of a list of strings, regardless of how many groups the pattern contains.
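A short sketch of finditer() on the same fruit string may help. Each item produced by the iterator is a Match object, so the position of a match and its individual groups are both available:

```python
import re

string = '3 apples, 2 bananas, 5 pears, 10 strawberries'

# list() is used here only so the matches can be inspected more than once
matches = list(re.finditer(r'(\d+) (\w+)', string))

for match in matches:
    # group(0) is the whole match; span() gives its (start, end) indices
    print(match.span(), match.group(0))
# (0, 8) 3 apples
# (10, 19) 2 bananas
# (21, 28) 5 pears
# (30, 45) 10 strawberries
```

In real code the iterator is usually consumed directly in the for loop, which avoids building the whole list in memory.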

Splitting

As you may have guessed, the split() function splits a string by occurrences of pattern and returns a list of strings. As usual, it takes a pattern and a string as two arguments. Note that if the beginning (the end) matches the pattern, then the first (the last) element will be an empty string:

string = '111412222234333345555544'

results_1 = re.split('4', string)
print(results_1)
# ['111', '1222223', '3333', '55555', '', '']
# note the empty strings at the end of the list
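If the empty strings are not needed, one possible way to drop them is to filter the list afterwards. This is an extra step added for illustration, not something split() does itself:

```python
import re

string = '111412222234333345555544'

parts = re.split('4', string)

# keep only the non-empty pieces
non_empty = [part for part in parts if part]
print(non_empty)  # ['111', '1222223', '3333', '55555']
```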

This function can take an additional maxsplit argument that limits the number of splits. By default, maxsplit is 0, which means there is no limit: the string is split at every match of the pattern. If, for instance, maxsplit is 3, then at most three splits are made, and the rest of the string is returned as the final element of the list:

string = '111412222234333345555544'

results_2 = re.split('4', string, maxsplit=3)
print(results_2)
# ['111', '1222223', '3333', '5555544']

In addition to what we have just said, when you use split(), the matching substrings are removed from the final list. If you want to store them, you can simply use capturing groups. Compare the results below:

string = "Roads? Where we're going we don't need roads."

result_1 = re.split(r'\W+', string)
print(result_1)
# ['Roads', 'Where', 'we', 're', 'going', 'we', 'don', 't', 'need', 'roads', '']

result_2 = re.split(r'(\W+)', string)
print(result_2)
# ['Roads', '? ', 'Where', ' ', 'we', "'", 're', ' ', 'going', ' ', 'we', ' ', ...]

Be careful! If you employ capturing groups, strings that are matched outside those groups won't be returned:

string = '3 apples, 2 bananas, 5 pears, 10 strawberries'
result_3 = re.split(r'\d+ (\w+)', string)
# ['', 'apples', ', ', 'bananas', ', ', 'pears', ', ', 'strawberries', '']

Searching and replacing

The sub() function takes three arguments: a regular expression pattern, a replacement string, and an initial string. It replaces all pattern occurrences with the specified replacement. If no occurrences are found, it returns the string unchanged. The sub() function also takes an optional count argument, which is the maximum number of pattern occurrences to replace. Let's look at some examples:

string = 'blue jeans, white shirt, yellow socks'
pattern = '(blue|white|yellow)'
replacement = 'black'

result_1 = re.sub(pattern, replacement, string)  # 'black jeans, black shirt, black socks'
result_2 = re.sub(pattern, replacement, string, count=2)  # 'black jeans, black shirt, yellow socks'
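The replacement does not have to be a fixed piece of text. As a sketch of one further option, the replacement string can refer back to capturing groups with \1, \2, and so on. The two-group pattern and the output format below are made up for illustration:

```python
import re

string = 'blue jeans, white shirt, yellow socks'

# group 1 captures the colour, group 2 the item of clothing
pattern = r'(blue|white|yellow) (\w+)'

# \2 and \1 in the replacement insert the text captured by those groups
result = re.sub(pattern, r'\2 (\1)', string)
print(result)  # jeans (blue), shirt (white), socks (yellow)
```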

Precompiling patterns

The last regexp function that we are going to talk about is compile(). It allows you to compile a pattern and reuse it later in the code. It takes a pattern (a string) as an argument and returns a special Pattern object that we can use later with other functions we've covered today. Let's see how to compile a pattern and reuse it:

string = "roads? where we're going we don't need roads."

# define a pattern in a string format
string_pattern = 'roads'

# pass the pattern to the re.compile() function
my_pattern = re.compile(string_pattern)

# use the returned Pattern object to match a pattern 
result_1 = my_pattern.match(string)  # <re.Match object; span=(0, 5), match='roads'>
result_2 = my_pattern.findall(string)  # ['roads', 'roads']
result_3 = my_pattern.split(string)  # ['', "? where we're going we don't need ", '.']
result_4 = my_pattern.sub('cars', string)  # "cars? where we're going we don't need cars."

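Compiling a pattern and saving the resulting Pattern object is convenient if you plan to use it repeatedly: the pattern is written and parsed once, and the code that uses it becomes shorter. Note that the re module also caches recently used patterns, so the speed-up is usually modest in programs that use only a few regular expressions.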
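A typical place for a compiled pattern is a loop over many strings. The sketch below uses a small list of sample lines that is invented for illustration; the pattern is compiled once before the loop and reused on every iteration:

```python
import re

# hypothetical sample data, not taken from the examples above
lines = ['3 apples', 'no fruit here', '10 strawberries', '2 bananas']

# compile once, outside the loop
quantity_pattern = re.compile(r'(\d+) (\w+)')

quantities = {}
for line in lines:
    result = quantity_pattern.match(line)
    # skip lines that do not start with a quantity
    if result is not None:
        quantities[result.group(2)] = int(result.group(1))

print(quantities)  # {'apples': 3, 'strawberries': 10, 'bananas': 2}
```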

Summary

In this topic, we have discussed popular regular expression functions that we can use to match a pattern in a string. Here is the recap:

  • Use match() to find a pattern at the beginning of the string or search() to check whether a pattern is present anywhere in the string;

  • Use findall() or finditer() to find all pattern occurrences: the former returns a list, while the latter returns an iterator of Match objects;

  • Use split() to split a string and sub() to replace a matching string with another one;

  • If you are going to use the same pattern many times, you can precompile it with compile() to keep the code tidy and save some time.
