
You have already learned a lot about the concept of regular expressions. In this topic, we will analyze the classic examples of the practical use of regexps.

Groups in regexps

First of all, let's introduce an important concept that we will need to compose complex regexps: groups. Groups have the same meaning as in mathematical expressions: with their help, we can set new priorities for operations.

A part of a regular expression can be enclosed in parentheses to make a group. We can also apply quantifiers to a group: a quantifier placed after the closing parenthesis applies to the entire contents of the group, not to a single character.

In the code below, we find every place where the pair of characters "ho" is repeated one or more times in a row:

Java
// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
Pattern regexWithGroups = Pattern.compile("(ho)+");
String text = "ho hoho hohoho";
        
Matcher resultWithGroups = regexWithGroups.matcher(text);
while (resultWithGroups.find()) {
    System.out.println(resultWithGroups.group());
}
Kotlin
val regexWithGroups = """(ho)+""".toRegex()
val text = "ho hoho hohoho"

val resultWithGroups = regexWithGroups.findAll(text)
for (res in resultWithGroups) println(res.value)

The result will be the following:

ho
hoho
hohoho
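To make the effect of the parentheses concrete, here is a minimal sketch that runs the grouped pattern and an ungrouped variant over the same text. The class name GroupQuantifierDemo, the helper findAll, and the sample text are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupQuantifierDemo {
    // Collects every match of the given regex in the text.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "hoho hooo";
        // The quantifier applies to the whole group "ho".
        System.out.println(findAll("(ho)+", text)); // [hoho, ho]
        // Without parentheses, the quantifier applies only to the character "o".
        System.out.println(findAll("ho+", text));   // [ho, ho, hooo]
    }
}
```

With the group, "hoho" is found as one match; without it, the same text is split into two separate "ho" matches, while "hooo" is matched in full.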

Dates

Let's start with a fairly simple and common task. Suppose you need to find all dates in two different formats: yyyy-mm-dd and yyyy/mm/dd. How can we do it? We need to match the text sections that look like this: 4 digits, then one of the possible separators (/ or -), then 2 digits, and then the same separator and 2 digits. Consider the following regular expression as a solution:

\d{4}(-|\/)\d{2}\1\d{2}

  • We search for 4 digits and then for one of the possible separators: \d{4}(-|\/).

  • Next, we look for two digits and the same delimiter that has been already found: \d{2}\1. With \1 we refer to the first group we encountered in the regex: (-|\/). This is how we look for the already identified separator.

  • Finally, we search for two digits: \d{2}.
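To see what the backreference \1 contributes, the sketch below compares the pattern with a variant in which the second separator is matched independently. The variant, the class name BackreferenceDemo, and the helper fits are hypothetical and exist only for this comparison.

```java
import java.util.regex.Pattern;

class BackreferenceDemo {
    // The date pattern from this section: \1 must repeat whatever group 1 captured.
    static final Pattern WITH_BACKREFERENCE =
            Pattern.compile("\\d{4}(-|/)\\d{2}\\1\\d{2}");
    // Hypothetical variant for comparison: the second separator is matched independently.
    static final Pattern WITHOUT_BACKREFERENCE =
            Pattern.compile("\\d{4}(-|/)\\d{2}(-|/)\\d{2}");

    // True only if the whole candidate string fits the pattern.
    static boolean fits(Pattern pattern, String candidate) {
        return pattern.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String mixed = "2022-06/06";
        System.out.println(fits(WITH_BACKREFERENCE, mixed));        // false
        System.out.println(fits(WITHOUT_BACKREFERENCE, mixed));     // true
        System.out.println(fits(WITH_BACKREFERENCE, "2022/06/06")); // true
    }
}
```

A date with mixed separators is rejected only by the version with \1, because the backreference insists on the same separator that was captured earlier.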

Shall we check it? Take a look at the code below:

Java
Pattern regex = Pattern.compile("\\d{4}(-|/)\\d{2}\\1\\d{2}");
String text = "Date 1: 2022-06-06; " +
              "Date 2: 2021/01/01; date 3: 2020-02-02";

Matcher dates = regex.matcher(text);
while (dates.find()) {
    System.out.println(dates.group());
}
Kotlin
val regex = Regex("""\d{4}(-|\/)\d{2}\1\d{2}""")
val dates = regex.findAll("Date 1: 2022-06-06; " +
        "Date 2: 2021/01/01; date 3: 2020-02-02")

for (date in dates) println(date.value)

The result is:

2022-06-06
2021/01/01
2020-02-02

Everything works! However, keep in mind that this regex is too simple for many real-life situations: it checks only the shape of a date, so it also accepts impossible values such as 2022-99-99, and it does not recognize any other date formats.
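One possible way to tighten the pattern is to restrict the month and day to plausible ranges. The sketch below is only an illustration under the assumption that months run from 01 to 12 and days from 01 to 31; the class name StricterDateDemo and the helper looksLikeDate are made up for this example.

```java
import java.util.regex.Pattern;

class StricterDateDemo {
    // Assumption for this sketch: months are 01-12 and days are 01-31.
    // The separator logic (group 1 and the backreference \1) is unchanged.
    static final Pattern STRICTER_DATE = Pattern.compile(
            "\\d{4}(-|/)(0[1-9]|1[0-2])\\1(0[1-9]|[12]\\d|3[01])");

    // True only if the whole candidate string fits the stricter pattern.
    static boolean looksLikeDate(String candidate) {
        return STRICTER_DATE.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeDate("2022-06-06")); // true
        System.out.println(looksLikeDate("2022-13-06")); // false: there is no month 13
        System.out.println(looksLikeDate("2022-06-32")); // false: there is no day 32
        System.out.println(looksLikeDate("2022-02-31")); // true: the regex knows nothing about calendars
    }
}
```

Even this version accepts February 31, so when real calendar validity matters, a regex is usually combined with a proper date parser.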

Phone numbers

In previous topics, we briefly looked at some regular expressions for phone numbers. In the following example, we will build a slightly different regex. For simplicity, let's assume that phone numbers can be written in one of the following formats: XXX-XXX-XXXX, (XXX)-XXX-XXXX, (XXX)XXXXXXX, and XXXXXXXXXX.

Look at the following regular expression:

\(?[\d]{3}\)?-?[\d]{3}-?[\d]{4}

What's going on here?

  • Section 1 \(?[\d]{3}\)?-? matches an optional opening parenthesis, the first three digits, an optional closing parenthesis, and an optional hyphen delimiter.

  • Section 2 [\d]{3}-? looks for the next three digits and an optional hyphen delimiter.

  • Finally, in section 3 [\d]{4} we match the last four digits.
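As a quick sanity check, the sketch below tests each of the four formats listed above against the pattern as a whole string. The class name PhoneFormatDemo, the helper isPhone, and the sample numbers are invented for this illustration.

```java
import java.util.regex.Pattern;

class PhoneFormatDemo {
    // The phone pattern from this section; [\d] and \d are equivalent.
    static final Pattern PHONE =
            Pattern.compile("\\(?\\d{3}\\)?-?\\d{3}-?\\d{4}");

    // True only if the whole candidate string fits the pattern.
    static boolean isPhone(String candidate) {
        return PHONE.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPhone("123-345-6789"));   // true
        System.out.println(isPhone("(123)-345-6789")); // true
        System.out.println(isPhone("(123)3456789"));   // true
        System.out.println(isPhone("1233456789"));     // true
        System.out.println(isPhone("12-345-6789"));    // false: only two leading digits
        // Each parenthesis is optional on its own, so an unbalanced one also passes.
        System.out.println(isPhone("(123-345-6789"));  // true
    }
}
```

The last line shows a limitation of this simple pattern: because the two parentheses are optional independently of each other, it does not require them to come as a pair.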

Let's check how it works (the Java version writes \d instead of [\d], which is equivalent):

Java
Pattern regex = Pattern.compile("\\(?\\d{3}\\)?-?\\d{3}-?\\d{4}");
String text = "Ann's phone: 123-345-6789 " +
              "Dave's phone: (111)-234-5678, and next phone is (101)-234-5000";

Matcher phones = regex.matcher(text);
while (phones.find()) {
    System.out.println(phones.group());
}
Kotlin
val regex = Regex("""\(?[\d]{3}\)?-?[\d]{3}-?[\d]{4}""")
val phones = regex.findAll("Ann's phone: 123-345-6789 " +
        "Dave's phone: (111)-234-5678, and next phone is (101)-234-5000")

for (phone in phones) println(phone.value)

This code will print the following:

123-345-6789
(111)-234-5678
(101)-234-5000

Great! The following examples will be more complex and universal.

Email

Let's imagine that you need to find all the email addresses that appear in a text. The rules for composing emails are regulated by RFC 5322. A standard email address looks like "[email protected]". The following regular expression will match most of the email addresses compiled according to these rules:

([a-z0-9_\.-]+)@([a-z0-9_\.-]+)\.([a-z\.]{2,6})

It looks rather long and somewhat incomprehensible, doesn't it? Let's take a closer look at what it consists of. The parts of the regular expression work like this:

  • In group 1 ([a-z0-9_\.-]+) we match one or more lowercase letters a–z, digits 0–9, underscores, periods, and hyphens. This is the part of the address before the @ sign, which comes right after the group.

  • Group 2 ([a-z0-9_\.-]+) is very similar to the previous one: the domain name, including any subdomains, may consist of the same kinds of symbols. It is followed by a period: \.

  • Group 3 ([a-z\.]{2,6}) matches the top-level domain: a sequence of 2 to 6 letters or dots.
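The three groups described above can also be read back individually after a match. Here is a minimal sketch of that idea; the address john.smith@mail.example.com, the class name EmailPartsDemo, and the helper split are hypothetical and used only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailPartsDemo {
    // The email pattern from this section, unchanged.
    static final Pattern EMAIL = Pattern.compile(
            "([a-z0-9_\\.-]+)@([a-z0-9_\\.-]+)\\.([a-z\\.]{2,6})");

    // Returns {group 1, group 2, group 3} if the whole address fits the pattern,
    // or an empty array otherwise.
    static String[] split(String address) {
        Matcher matcher = EMAIL.matcher(address);
        if (!matcher.matches()) {
            return new String[0];
        }
        return new String[] {matcher.group(1), matcher.group(2), matcher.group(3)};
    }

    public static void main(String[] args) {
        String[] parts = split("john.smith@mail.example.com");
        System.out.println(parts[0]); // john.smith
        System.out.println(parts[1]); // mail.example
        System.out.println(parts[2]); // com
    }
}
```

Note that group 2 is greedy, so it takes everything up to the last period that still leaves a valid top-level domain for group 3.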

Let's check how it all works:

Java
String patternStr = "([a-z0-9_\\.-]+)@([a-z0-9_\\.-]+)\\.([a-z\\.]{2,6})";
Pattern regex = Pattern.compile(patternStr);
String text = "We have the following emails: [email protected], [email protected]";

Matcher matchResult = regex.matcher(text);
while (matchResult.find()) {
    System.out.println(matchResult.group());
}
Kotlin
val regex = Regex("""([a-z0-9_\.-]+)@([a-z0-9_\.-]+)\.([a-z\.]{2,6})""")
val matchResult = regex.findAll("We have the following emails: " +
        "[email protected], [email protected]")

for (matches in matchResult) println(matches.value)

The result of executing the above piece of code is as follows:

[email protected]
[email protected]
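The character classes in this pattern list only lowercase letters, so an address written with capital letters will not be found as is. One possible remedy in Java is the Pattern.CASE_INSENSITIVE flag, sketched below; the class name CaseInsensitiveEmailDemo, the helper findEmails, and the sample address are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaseInsensitiveEmailDemo {
    static final String EMAIL_REGEX =
            "([a-z0-9_\\.-]+)@([a-z0-9_\\.-]+)\\.([a-z\\.]{2,6})";

    // Finds all addresses in the text, optionally ignoring letter case.
    static List<String> findEmails(String text, boolean ignoreCase) {
        Pattern pattern = ignoreCase
                ? Pattern.compile(EMAIL_REGEX, Pattern.CASE_INSENSITIVE)
                : Pattern.compile(EMAIL_REGEX);
        List<String> found = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Contact: John.Doe@Example.COM";
        System.out.println(findEmails(text, false)); // []
        System.out.println(findEmails(text, true));  // [John.Doe@Example.COM]
    }
}
```

An alternative would be to add A-Z to each character class; the flag simply keeps the pattern itself unchanged.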

URL

Searching and copying links from a text manually can be tedious. But we won't have to do that!

A typical URL may look like https://www.somesite.com/index.html. Below is a regular expression that matches the pattern:

(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w\.-]*)*\/?

Let's analyze it in more detail, group by group:

  • Group 1 (https?:\/\/)? corresponds to the first optional part of a URL. It includes the letters "http", possibly "s", a colon, and two slashes.

  • Groups 2 and 3 ([\da-z\.-]+)\.([a-z\.]{2,6}) match the host name: group 2 is a sequence of lowercase letters, digits, hyphens, and dots (the domain with any subdomains), and group 3, which comes after a dot, is the top-level domain (2 to 6 letters or dots).

  • And group 4 ([\/\w\.-]*)* is needed to identify the path to the file: any sequence of slashes, letters, digits, underscores, hyphens, and dots. Finally, the whole URL may end with an optional slash: \/?
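The groups described above can be read back separately once a URL is found. The sketch below extracts the first three of them; group 4 is left out on purpose, because a group under a repeating quantifier keeps only its last repetition. The class name UrlPartsDemo and the helper firstUrlParts are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlPartsDemo {
    // The URL pattern from this section, unchanged.
    static final Pattern URL = Pattern.compile(
            "(https?:\\/\\/)?([\\da-z\\.-]+)\\.([a-z\\.]{2,6})([\\/\\w\\.-]*)*\\/?");

    // Returns {scheme, domain, top-level domain} for the first URL found in the text,
    // or an empty array if nothing matches. The scheme is null when it is absent.
    static String[] firstUrlParts(String text) {
        Matcher matcher = URL.matcher(text);
        if (!matcher.find()) {
            return new String[0];
        }
        return new String[] {matcher.group(1), matcher.group(2), matcher.group(3)};
    }

    public static void main(String[] args) {
        String[] parts = firstUrlParts("https://hi.hyperskill.org/how-we-teach");
        System.out.println("scheme: " + parts[0]); // scheme: https://
        System.out.println("domain: " + parts[1]); // domain: hi.hyperskill
        System.out.println("tld:    " + parts[2]); // tld:    org
    }
}
```

Since group 1 is optional, group(1) returns null for a URL written without the scheme, so real code should check for that before using the value.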

The example below shows a possible use of this regular expression:

Java
String patternStr = "(https?:\\/\\/)?([\\da-z\\.-]+)\\.([a-z\\.]{2,6})([\\/\\w\\.-]*)*\\/?";
Pattern regex = Pattern.compile(patternStr);
String text = "JetBrains Website: https://www.jetbrains.com/ " +
              "And here is information about Hyperskill: https://hi.hyperskill.org/how-we-teach";
        
Matcher matchResult = regex.matcher(text);
while (matchResult.find()) {
    System.out.println(matchResult.group());
}
Kotlin
val regex = Regex("""(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w\.-]*)*\/?""")
val matchResult = regex.findAll("JetBrains Website: https://www.jetbrains.com/ " +
        "And here is information about Hyperskill: https://hi.hyperskill.org/how-we-teach")

for (matches in matchResult) println(matches.value)

The result of executing the above piece of code is as follows:

https://www.jetbrains.com/
https://hi.hyperskill.org/how-we-teach
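Because the scheme is optional, this pattern also picks up anything that merely looks like name.extension, such as a file name. The sketch below contrasts it with a hypothetical stricter variant in which the scheme is mandatory (the only change is dropping the ? after group 1). The class name UrlSchemeDemo, the helper findAll, and the sample text are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlSchemeDemo {
    // The URL pattern from this section: the scheme (group 1) is optional.
    static final String OPTIONAL_SCHEME =
            "(https?:\\/\\/)?([\\da-z\\.-]+)\\.([a-z\\.]{2,6})([\\/\\w\\.-]*)*\\/?";
    // Hypothetical stricter variant: the same pattern, but the scheme is required.
    static final String REQUIRED_SCHEME =
            "(https?:\\/\\/)([\\da-z\\.-]+)\\.([a-z\\.]{2,6})([\\/\\w\\.-]*)*\\/?";

    // Collects every match of the given regex in the text.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "Docs are in readme.txt and at https://hi.hyperskill.org/how-we-teach";
        System.out.println(findAll(OPTIONAL_SCHEME, text));
        // [readme.txt, https://hi.hyperskill.org/how-we-teach]
        System.out.println(findAll(REQUIRED_SCHEME, text));
        // [https://hi.hyperskill.org/how-we-teach]
    }
}
```

Which variant fits better depends on the text: requiring the scheme avoids false positives like file names, but it also skips links written as www.somesite.com.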

Conclusion

In this topic, we have considered some practical examples of using regexes of varying complexity. Step by step, we have discussed relatively simple regexes for dates and phone numbers, as well as more complex and universal ones for URLs and email addresses.

The acquired skills will help you both in the following tasks and in your future projects.
