In the previous topic, we took a look at quantifiers and their role in regular expressions. So far, we've applied quantifiers to one character only, searching for a specific character repeated a certain number of times. But what if we want to look for a repeated substring? What if we want to specify how many times a sequence of different characters should occur in the string? In this case, we should resort to parentheses (). Parentheses in regular expressions group the desired parts of the template into single units that are processed together. Let's discuss the details of their application!

In this topic, we'll also come across the OR operator (alternation) of regular expressions, represented by the vertical bar |. As a programmer, you can imagine situations where this operator is indispensable. In this regard, regexps are no different.

Groups

By default, when we put a quantifier in our template, it applies only to the preceding character or character set. Take h[ao]{2} for example. The quantifier demands that either a or o occur twice in a row, but h can only occur once. To apply a quantifier to a sequence of characters, we must use parentheses () to group the desired symbols and put the quantifier after this group. Take a look at the following example. There, we are looking for the h[ao] substring repeated twice:

import re

template = r"(h[ao]){2}"  # matches a string starting with two "ha" or "ho" in a row
re.match(template, "haha")  # a match
re.match(template, "hoha")  # a match
re.match(template, "haa")  # no match
re.match(template, "hho")  # no match
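
To see the difference the parentheses make, here is a small sketch that runs the ungrouped and the grouped template over the same strings. The variable names and the extra sample "hoo" are made up for illustration.

```python
import re

ungrouped = r"h[ao]{2}"    # the quantifier applies to [ao] only
grouped = r"(h[ao]){2}"    # the quantifier applies to the whole "h[ao]" unit

samples = ["haha", "hoha", "haa", "hoo", "hho"]

# re.match returns None when there is no match, so it works as a filter condition
ungrouped_hits = [s for s in samples if re.match(ungrouped, s)]
grouped_hits = [s for s in samples if re.match(grouped, s)]

print(ungrouped_hits)  # ['haa', 'hoo']
print(grouped_hits)    # ['haha', 'hoha']
```

The two templates accept disjoint sets of strings here, so forgetting the parentheses changes which strings match rather than just being less strict.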

You can apply any quantifier you want, and the syntax stays the same. For example, you can mark an optional substring with the question mark quantifier ?. The group will then match either one occurrence of its content in the string or none at all:

template = r"ha(\?!)?"  # we expect "?!" to occur together and in this exact order
re.match(template, "ha?!")  # a match
re.match(template, "ha")  # a match
# in case "?" or "!" occur separately, the group won't match them
re.match(template, "ha?")  # matches only "ha", but not "?", since there's no "!" succeeding "?"
re.match(template, "ha!")  # matches only "ha", but not "!", since there's no "?" preceding "!"

So, entire parts of the template can be omitted in the string.

Nested groups

We can also make use of nested groups: you can put a group inside a group to specify smaller substring repetitions inside larger substrings. Take a look at this template that matches one or more repetitions of a block made of two substrings of the <letter><digit> type (for example, A0, C3) followed by a dot:

template = r"(([A-Z]\d){2}\.)+"
re.match(template, "A0C3.B8K5.")  # a match
re.match(template, "A0C3.")  # a match
re.match(template, "A0.C3B8K5")  # no match, as a dot separates two letter-digit combinations
re.match(template, "A0.C3.B8K5")  # no match, as a dot follows "A0" where a second letter-digit pair is expected

The depth of nested groups is technically unlimited. The only problem is that templates with a large number of groups are somewhat hard to read. But don't worry: most of the templates in real life are barely readable anyway.

However, the quantifiers aren't the only reason why we need groups. Groups are also a tool that gives your template a structure when you need it.

Method groups()

After comparing a template against a string, we often need to process the result (extract one part of the matching string, rearrange it, replace some of its parts). We will discuss these options in other topics. But for now, we can say that groups can help with such processing. Groups can help you to designate important parts of your regexp.

If you create some groups in your template, you can get the parts of the string that match each group with a single call to the groups() method. This method is available on any match object, that is, any result of the match() function when there's a match. It returns a tuple with the substrings matching the created groups:

template = r"([Pp]ython) (\d)"
match = re.match(template, "Python 3")
print(match.groups())  # The output is ('Python', '3')

As you can see, the first tuple element, the Python part, is the match for the first group, and the second element 3 matches the second group. The number of elements is always equal to the number of groups. If an optional group doesn't take part in the match, None will appear in its place in the resulting tuple:

template = r"([Pp]ython)( \d)?"
match = re.match(template, "Python")
print(match.groups())  # The output is ('Python', None)
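
Two practical details are worth a quick sketch. First, groups() accepts an optional default value that replaces None for groups that didn't take part in the match. Second, re.match() itself returns None when nothing matches, so calling groups() on the result needs a check first. The string "Kotlin 2" and the variable names are illustrative only.

```python
import re

template = r"([Pp]ython)( \d)?"

match = re.match(template, "Python")
print(match.groups())     # ('Python', None)
print(match.groups(""))   # ('Python', '')

# re.match returns None on failure, so guard before calling groups()
no_match = re.match(template, "Kotlin 2")
result = no_match.groups() if no_match else ()
print(result)             # ()
```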

In case you need to extract the substring matching a single group, you can opt for another method.

Method group()

To extract the match for a particular group, you can use the group() method. This method accepts an integer designating the number of the group that you want to extract:

template = r"Python (\d)"
match_1 = re.match(template, "Python 2")
print(match_1.group(1))  # The output is "2"
match_2 = re.match(template, "Python 3")
print(match_2.group(1))  # The output is "3"

The enumeration starts from 1. If you pass 0 or call the method with no arguments, it will return the entire match, that is, the whole part of the string that matched the template:

template = r"Python (\d)"
match_1 = re.match(template, "Python 2")
print(match_1.group())  # The output is "Python 2"
print(match_1.group(0))  # The output is "Python 2"
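
A short sketch of two related behaviors: group(0) covers only the matched part when the string continues past it, and group() also accepts several group numbers at once, returning a tuple. The input string and variable names are made up for illustration.

```python
import re

template = r"(Python) (\d)"
match = re.match(template, "Python 3 is here")

whole = match.group(0)     # the matched part only, not the whole input string
pair = match.group(1, 2)   # several group numbers at once give a tuple

print(whole)  # Python 3
print(pair)   # ('Python', '3')
```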

As you can see, you need to know the group number to extract it. For that, we need to discuss the concept of enumeration.

Group enumeration

The groups are enumerated in linear order, from left to right. To be precise, the group numbers coincide with the numbers of their opening parentheses in the template. The group with the first parenthesis gets the first number.

template = r"(a)(b)(c)"
match = re.match(template, "abc")
match.group(1)  # "a"
match.group(2)  # "b"
match.group(3)  # "c"

This is also true for nested groups. Take a look at the following example:

template = r"((\w+) group) ((\w+) group)"
match = re.match(template, "first group second group")
match.group(1)  # "first group"
match.group(2)  # "first"
match.group(3)  # "second group"
match.group(4)  # "second"

So, in case you have a complex regular expression, just count the opening parentheses of your groups (starting from 1) to get the desired number.
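
If counting parentheses by hand gets tedious, one way to double-check is to compile the template and read the groups attribute of the compiled pattern, which holds the number of groups. The sketch below assumes re.compile() is already familiar, and the numbered dictionary is just an illustrative way to list every group.

```python
import re

template = r"((\w+) group) ((\w+) group)"
pattern = re.compile(template)

print(pattern.groups)  # 4

match = pattern.match("first group second group")
# map each group number to the substring it matched
numbered = {n: match.group(n) for n in range(1, pattern.groups + 1)}
print(numbered)
# {1: 'first group', 2: 'first', 3: 'second group', 4: 'second'}
```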

If your template has a repeated group with another group inside it, the group() method returns only the last substring that the inner group matched. Earlier repetitions are overwritten. For example, in the following piece of code, group(2) can't give you the substring 2, only 3:

template = r"(Python (\d) ){2,}"
re.match(template, "Python 2 Python 3 ").group(2)  # The output is "3"
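
The same applies to the repeated outer group itself. A minimal sketch printing all three numbers side by side (repr() is used only to make the trailing spaces visible):

```python
import re

template = r"(Python (\d) ){2,}"
match = re.match(template, "Python 2 Python 3 ")

whole = match.group(0)   # the entire match covers both repetitions
outer = match.group(1)   # the outer group keeps only its last repetition
inner = match.group(2)   # so does the inner group

print(repr(whole))  # 'Python 2 Python 3 '
print(repr(outer))  # 'Python 3 '
print(repr(inner))  # '3'
```

Only group(0) still contains the first repetition. Retrieving every repetition separately requires other tools from the re module that are outside the scope of this topic.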

There's one more powerful regexp tool. Let's go!

Alternations

In many cases, a pattern we'd like to match can contain alternative substrings — sometimes one, sometimes another. For example, when we search for a web address, it can have .com, .org, .net, etc. as a part of the domain name. We can match several domain types in one template by using |.

| is the or operator in regexps. By separating alternative substrings with vertical bars, you are matching any of these substrings with the template. Here, take a look:

template = r"python|java|kotlin"
re.match(template, "python")  # a match
re.match(template, "java")  # a match
re.match(template, "kotlin")  # a match
re.match(template, "c++")  # no match
re.match(template, "k")  # no match
re.match(template, "jav")  # no match

In the above example, | separates three alternative options. Since re.match() checks the beginning of the string, any string that doesn't start with one of the options is not going to match the template.
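
Two consequences are easy to overlook, so here is a hedged sketch of both. A longer string that begins with one of the options still matches, and the options are tried from left to right, so the first one that fits is taken even if a later one is longer. The strings "pythonic" and "javascript" are illustrative only.

```python
import re

template = r"python|java|kotlin"

# re.match only checks the beginning of the string
prefix = re.match(template, "pythonic")
print(prefix.group())  # python

# alternatives are tried from left to right, and the first one that fits wins
first_wins = re.match(r"java|javascript", "javascript")
print(first_wins.group())  # java

# putting the longer option first lets it match in full
longer_first = re.match(r"javascript|java", "javascript")
print(longer_first.group())  # javascript
```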

Groups and alternations

Also, notice that the vertical bar differs from quantifiers in its scope: instead of the preceding character, it applies to everything on either side of it, up to the next vertical bar or the end of the template. For instance, if we need to find the following strings: python course, kotlin course, python lesson, or kotlin lesson, we might first write the following expression:

template = r"python|kotlin course|lesson"
re.match(template, "kotlin")  # no match: should be "kotlin course" to match
re.match(template, "python")  # a match, even though "python lesson" or "python course" were searched for
re.match(template, "lesson")  # a match, even though "kotlin lesson" or "python lesson" were searched for
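
Printing what this template actually matched makes the problem visible. In the sketch below, the variable names and the list of test strings are made up for illustration.

```python
import re

ungrouped = r"python|kotlin course|lesson"
texts = ["python lesson", "kotlin course", "lesson"]

# group() with no arguments returns the part of the string that matched
matched = [re.match(ungrouped, text).group() for text in texts]
print(matched)  # ['python', 'kotlin course', 'lesson']

# one of the four intended strings is not matched at all
missing = re.match(ungrouped, "kotlin lesson")
print(missing)  # None
```

The template behaves as three separate options: "python", "kotlin course", and "lesson".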

To mark the borders of the OR operator, we need to use groups. Put the parentheses around the entire OR expression, as in (course|lesson):

template = r"(python|kotlin) (course|lesson)"
re.match(template, "kotlin")  # no match
re.match(template, "lesson")  # no match
re.match(template, "python lesson")  # match
re.match(template, "kotlin course")  # match
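
Since the alternatives now sit inside groups, groups() also tells us which option was chosen in each position. A minimal sketch, with language and kind as illustrative variable names:

```python
import re

template = r"(python|kotlin) (course|lesson)"

match = re.match(template, "kotlin lesson")
language, kind = match.groups()
print(language)  # kotlin
print(kind)      # lesson

# strings outside the four combinations still give None
other = re.match(template, "java course")
print(other)  # None
```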

Don't forget about groups when you include alternative options in your template.

Summary

Hey, we've done a great job here! Pretty much the whole inventory of regular expression operations is at your disposal now. In this topic, we've learned that:

  • in regexps, parentheses () can be used to group characters into substrings;

  • quantifiers can be applied to groups;

  • groups are enumerated automatically. They are enumerated by their opening parenthesis, from left to right, starting from 1;

  • matches for groups can be retrieved with groups() and group() methods;

  • vertical bars | separate alternative substrings in a regexp template;

  • groups can limit the scope of vertical bars.

There are still a couple of Python regexp methods left to learn. After that, you'll turn into a true regexp wizard!
