Regular expressions are very versatile. They can be used to automate many tedious text-related tasks, such as input text validation or data collection. In this topic, we will have a look at two examples of simple yet powerful programs that employ regular expressions.

Email validation program

Let's have a look at a basic program that checks whether a text contains email addresses and, if it does, prints them in order of appearance:

import re

def find_emails(string):
    # Here we compile our simple pattern that will match email addresses
    pattern = re.compile(r'[\w\.-]+@[\w\.-]+')

    # Remember that re.findall() returns a list of all matched email strings
    emails = re.findall(pattern, string) 

    # To print the matched strings one by one
    for email in emails:
        print(email)

The program above carries out a rather simple check: it verifies that the @ character is preceded and followed by one or more characters, each of which can be an alphanumeric character, an underscore, a dot, or a dash. Mind that \w is equal to [A-Za-z0-9_].
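If you want to double-check which characters \w covers, here is a quick illustration (the sample string is made up for this purpose):

import re

# \w matches letters, digits, and the underscore,
# but not the dot or the dash
print(re.findall(r'\w', 'a_1-.'))  # ['a', '_', '1']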

Let's test our program:

# Suppose we have a text with various email addresses
string = '''cat fox_99@gmail.com, dog 456
          john.smith@mail.org kate_winslet@abc.fr'''
find_emails(string)
# fox_99@gmail.com
# john.smith@mail.org
# kate_winslet@abc.fr

The downside is that our program will also match strings like _@._. They obviously cannot be considered email addresses.
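You can confirm this with a quick check (the sample string is made up for illustration):

find_emails('contact me at _@._ tomorrow')
# _@._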

Email validation 2.0

Our first pattern accepts usernames and domain names that are unrealistically short, which may lead to rather bad scenarios. We can set some restrictions when compiling our pattern to avoid this:

# Let's say we want the username to be at least 5 characters long
# and the part after the dot to be 2 to 4 characters long
pattern = re.compile(r'[\w\.-]{5,}@[\w-]+\.\w{2,4}')

Let's break it down piece by piece:

  • [\w\.-]{5,} matches at least 5 characters, each of which can be an alphanumeric character, an underscore, a dot, or a dash;

  • @ matches the @ sign;

  • [\w-]+\. matches one or more alphanumeric characters, underscores, or dashes, followed by a dot;

  • \w{2,4} matches 2 to 4 alphanumeric characters or underscores.

Here's our final program:

import re

def find_emails(string):
    # Here we compile our stricter pattern that will match email addresses
    pattern = re.compile(r'[\w\.-]{5,}@[\w-]+\.\w{2,4}')

    # Remember that re.findall() returns a list of all matched email strings
    emails = re.findall(pattern, string) 

    # To print the matched strings one by one
    for email in emails:
        print(email)

Let's test it:

string = '''_@._ mary_liu@abc._ fox_99@gmail.com, dog 456
            ab@cd.ef john.smith@mail.org kate_winslet@abc.fr'''
find_emails(string)
# fox_99@gmail.com
# john.smith@mail.org
# kate_winslet@abc.fr

As you can see, simple restrictions can make our pattern more powerful. You can also make your pattern match more specific strings by, for example, adding the boundary shorthands \b at the beginning and the end of the pattern, as well as lookaround assertions.
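For instance, here is a minimal sketch of how the \b shorthands could be attached to our pattern; the sample addresses are made up for illustration. Without the trailing \b, the pattern would happily match the first 2 to 4 letters of a longer top-level domain:

import re

# The same pattern, now anchored with word boundaries
pattern = re.compile(r'\b[\w\.-]{5,}@[\w-]+\.\w{2,4}\b')

# The unanchored pattern would match 'john.smith@mail.orga' here
print(re.findall(pattern, 'john.smith@mail.organization'))  # []
print(re.findall(pattern, 'john.smith@mail.org'))  # ['john.smith@mail.org']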

Keep in mind that the regex we have created is for experimentation and learning purposes only, so it cannot validate all possible email addresses.

Tokenization

When working with text, one of the first things you often need to do is split it into pieces so that you can process each word separately. In technical fields, this process is called tokenization; essentially, it just means breaking a text into smaller parts.

The most straightforward approach to tokenization is to split a text by whitespaces. Let's see how it works:

import re

def tokenize(string):
    tokens = re.split(r'\s+', string)
    return tokens

string = "This is a sample string. (And here's another one!!)"
tokenize(string)
# ['This', 'is', 'a', 'sample', 'string.', '(And', "here's", 'another', 'one!!)']

After giving it a thorough look, you can spot the elephant in the room — punctuation marks. Let's get rid of them before we split our sentence:

import re

def tokenize(string):
    # Let's create a pattern that contains punctuation marks
    punctuation = re.compile(r'[\.,\?!\*:;()]')

    # Substitute the punctuations with empty strings
    no_punct = re.sub(punctuation, '', string)
    print(no_punct)
    # This is a sample string And here's another one

    # Split sentences by whitespaces
    tokens = re.split(r'\s+', no_punct)
    return tokens

tokenize(string)
# ['This', 'is', 'a', 'sample', 'string', 'And', "here's", 'another', 'one']

Note that we have deliberately left the apostrophe ' out of the punctuation mark list. This is quite important, as we do not want to split words like Let's, here's, or Mary's into two different tokens and change their meaning.
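As an alternative, both steps can be merged into one: instead of deleting punctuation and then splitting, we can directly collect the tokens we are interested in. Here is a minimal sketch, assuming a token is any run of word characters and apostrophes:

import re

def tokenize(string):
    # Collect runs of word characters and apostrophes,
    # skipping whitespace and punctuation altogether
    return re.findall(r"[\w']+", string)

tokenize("This is a sample string. (And here's another one!!)")
# ['This', 'is', 'a', 'sample', 'string', 'And', "here's", 'another', 'one']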

As you can see, tokenization can be a bit tricky, but regex can help you with it. Of course, there are a lot of ways to tokenize a text depending on the text type you are dealing with. We have presented you with one of the simplest ways to do it.

Conclusion

In this topic, we've seen examples of how regular expressions can help us with string processing. You can experiment further and implement other regular expressions to check whether a string is a valid email address or tokenize more complicated sentences.

To sum up, even simple regular expression patterns can be used to write powerful programs for real-life applications, including textual data extraction and substitution, web scraping, and many other natural language processing tasks.
