Regular expressions are very versatile. They can be used to automate many tedious text-related tasks, such as input text validation or data collection. In this topic, we will have a look at two examples of simple yet powerful programs that employ regular expressions.
Email validation program
Let's have a look at a basic program that checks whether a text contains email addresses and, if it does, prints them in the order they appear:
import re

def find_emails(string):
    # Here we compile our simple pattern that will match email addresses
    pattern = re.compile(r'[\w\.-]+@[\w\.-]+')
    # Remember that re.findall() returns a list of all matched email strings
    emails = re.findall(pattern, string)
    # To print the matched strings one by one
    for email in emails:
        print(email)

The program above carries out a rather simple check: the @ character must be preceded and followed by one or more word characters, dots, or hyphens. Mind that \w is equal to [A-Za-z0-9_].
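To see that equivalence in action, here is a quick check on a made-up sample string; both patterns find exactly the same characters:

print(re.findall(r'\w', 'a.B_9!'))            # ['a', 'B', '_', '9']
print(re.findall(r'[A-Za-z0-9_]', 'a.B_9!'))  # ['a', 'B', '_', '9']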
Let's test our program:
# Suppose we have a text with various email addresses
string = '''cat mary_liu@example.com, dog 456
alice.w@mail.net bob-ross@paint.org'''
find_emails(string)
# mary_liu@example.com
# alice.w@mail.net
# bob-ross@paint.org

The downside is that our program will also match strings like _@._, which obviously cannot be considered email addresses.
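Indeed, feeding such a string to the function confirms the problem:

find_emails('_@._')
# _@._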
Email validation 2.0
If usernames and domain names are allowed to be arbitrarily short, our pattern will keep accepting such junk. We can set some restrictions when compiling our pattern to avoid this:
# Let's say we want the username to be at least 5 characters long
# and the domain zone after the dot to be 2 to 4 characters long
pattern = re.compile(r'[\w\.-]{5,}@[\w-]+\.\w{2,4}')

Let's break it down piece by piece:

- [\w\.-]{5,} matches five or more characters, each of which can be a letter, a digit, an underscore, a dot, or a hyphen;
- @ matches the @ sign;
- [\w-]+\. matches one or more word characters or hyphens, followed by a dot;
- \w{2,4} matches 2 to 4 word characters.
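Before putting the pattern into the function, we can sanity-check the restrictions with re.fullmatch on a few made-up addresses:

print(bool(pattern.fullmatch('mary_liu@example.com')))  # True
print(bool(pattern.fullmatch('abc@example.com')))       # False: the username is shorter than 5
print(bool(pattern.fullmatch('mary_liu@example.c')))    # False: only one character after the dot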
Here's our final program:
def find_emails(string):
    # Here we compile our stricter pattern that will match email addresses
    pattern = re.compile(r'[\w\.-]{5,}@[\w-]+\.\w{2,4}')
    # Remember that re.findall() returns a list of all matched email strings
    emails = re.findall(pattern, string)
    # To print the matched strings one by one
    for email in emails:
        print(email)

Let's test it:
string = '''_@._ mary_liu@abc._ mary_liu@example.com, dog 456
alice.w@mail.net bob-ross@paint.org dog@my.com'''
find_emails(string)
# mary_liu@example.com
# alice.w@mail.net
# bob-ross@paint.org

As you can see, simple restrictions can make our pattern more powerful. You can also make your pattern match more specific strings by, for example, adding the boundary shorthand \b at the beginning and the end of the pattern, as well as lookaround assertions.
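Here is a sketch of the \b variant; the address mary_liu@mail.technology is made up to show the difference. Without the boundaries, the pattern happily settles for a partial match:

import re

text = 'mary_liu@mail.technology'
loose = re.compile(r'[\w\.-]{5,}@[\w-]+\.\w{2,4}')
strict = re.compile(r'\b[\w\.-]{5,}@[\w-]+\.\w{2,4}\b')

print(re.findall(loose, text))   # ['mary_liu@mail.tech'] -- a partial match
print(re.findall(strict, text))  # [] -- the whole address is rejected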
Keep in mind that the regex we have created is for experimentation and learning purposes only, so it cannot verify all potential email addresses.
Tokenization
When working with text, one of the first things you often need to do is split it into pieces so you can look at or process each word separately. This process is sometimes called tokenization in technical fields, but really, it just means breaking a sentence into smaller parts.
The most straightforward approach to tokenization is to split a text by whitespaces. Let's see how it works:
import re

def tokenize(string):
    tokens = re.split(r'\s+', string)
    return tokens

string = "This is a sample string. (And here's another one!!)"

tokenize(string)
# ['This', 'is', 'a', 'sample', 'string.', '(And', "here's", 'another', 'one!!)']

After giving the result a thorough look, you can spot the elephant in the room: punctuation marks. Let's get rid of them before we split our sentence:
import re

def tokenize(string):
    # Let's create a pattern that contains punctuation marks
    punctuation = re.compile(r'[\.,\?!\*:;()]')
    # Substitute the punctuation marks with empty strings
    no_punct = re.sub(punctuation, '', string)
    print(no_punct)
    # This is a sample string And here's another one
    # Split the sentence by whitespaces
    tokens = re.split(r'\s+', no_punct)
    return tokens

tokenize(string)
# ['This', 'is', 'a', 'sample', 'string', 'And', "here's", 'another', 'one']

Note that we have deliberately left the apostrophe ' out of the punctuation list. This is quite important: we do not want words like Let's, here's, or Mary's to lose their apostrophes and change their meaning.
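A quick call with another made-up sentence confirms that such words stay intact:

tokenize("Mary's cat isn't here!")
# Mary's cat isn't here
# ["Mary's", 'cat', "isn't", 'here']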
As you can see, tokenization can be a bit tricky, but regex can help you with it. Of course, there are many ways to tokenize a text, depending on the kind of text you are dealing with; we have shown you one of the simplest.
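For instance, instead of deleting punctuation and then splitting, you can collect runs of word characters and apostrophes directly with re.findall. The pattern and the function name below are just an illustration:

import re

def tokenize_alt(string):
    # Grab runs of word characters and apostrophes, skipping everything else
    return re.findall(r"[\w']+", string)

tokenize_alt("This is a sample string. (And here's another one!!)")
# ['This', 'is', 'a', 'sample', 'string', 'And', "here's", 'another', 'one']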
Conclusion
In this topic, we've seen examples of how regular expressions can help us with string processing. You can experiment further and implement other regular expressions to check whether a string is a valid email address or tokenize more complicated sentences.
To sum up, even simple regular expression patterns can be used to write powerful programs for real-life applications, including textual data extraction and substitution, web scraping, and many natural language processing tasks.