Let's talk about digital data. Every day, we use the internet: we communicate with other people, buy things, work, play, and generally do many interesting activities. All our actions generate a lot of data. It can look like a primitive string like Hello! or a formatted string like <b>Hello!</b>. It can also be placed into a query string: ?message=Hello! or into a request body: {"message": "Hello!"}. And from all of these different options, you need to get some information.
Let's look at a simple task: you need to retrieve the information contained between HTML tags. Of course, you can try to calculate the distance between tags, get the indexes where the message starts and ends, and finally extract the message. But what if you have to deal with other tags? Or the message has changed? Or you get a tree of HTML tags? There are too many variations of this task to examine, and writing code that handles them all would be even harder. But there is a solution. Let's dive into the wonderland of regular expressions!
The pattern of words
Let's get back to the HTML tags examples later. Now, let's look at a more common task with files.
Digital data has a lot of formats to represent information, and these formats have a lot of rules on how to keep or transfer information. For example, open any folder on your computer, and you will see that a lot of files have a name and an extension separated by a dot. Let's try to create a string template of a filename.
From a human point of view, we can write it as <name>.<extension>. Using this pattern, we can say that a given string is a filename. But for a computer to reach the same conclusion, we need to formalize this string. The rules for representing patterns are provided by regular expressions. Consider one of the most common tasks in searching files: you have a slice of filenames and need to print groups of them under specific conditions:
package main

import "fmt"

func main() {
	files := []string{
		"test.txt",
		"passwords.json",
		"notes.doc",
		"test2.txt",
		"dont't forget!!!.txt",
		"app.cfg",
		"send.rtf",
	}
	for _, f := range files {
		fmt.Println(f)
	}
}

// Output:
// test.txt
// passwords.json
// notes.doc
// test2.txt
// dont't forget!!!.txt
// app.cfg
// send.rtf
Initialization of a regular expression
Before using the regexp package, you need to know how it works in Go.
In Go, a compiled regular expression is a value of the regexp.Regexp type, created from a string with special syntax rules. You can create one with the Compile or MustCompile functions. Both parse the pattern, but only Compile returns an error; MustCompile panics instead. It makes sense to use MustCompile when you are sure about your regular expression; in the worst case, you will get a panic at runtime. Using Compile is safer: you can check the error and handle it while the program keeps running.
The MustCompile function is illustrated below:
package main

import (
	"fmt"
	"regexp"
)

func main() {
	mustCompileRe := regexp.MustCompile("**") // panic!
	fmt.Println(mustCompileRe)                // won't print because of the panic above
}

// Output:
// panic: regexp: Compile(`**`): error parsing regexp: missing argument to repetition operator: `*`
Now look at the Compile function:
package main

import (
	"fmt"
	"regexp"
)

func main() {
	compileRe, err := regexp.Compile(`**`)
	if err != nil {
		fmt.Println("Compile error:", err)
	} else {
		fmt.Println(compileRe)
	}
}

// Output:
// Compile error: error parsing regexp: missing argument to repetition operator: `*`
The error message is the same, but the way it is reported has changed. In the first example, the error is raised as a panic, which means that the code after the MustCompile call doesn't run. The second example, in turn, runs the program to the end: Compile returns the error as a value, so you can handle it yourself.
To write a regular expression pattern, it's better to use backticks `` so you won't need to escape backslash characters. For example, instead of "\\w+", you can simply write `\w+`.
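As a small sketch of the backtick advice, the program below compares the two ways of writing the same pattern. The helper name samePattern is made up for this illustration; it relies only on the String method of a compiled expression, which returns the source text of the pattern.

```go
package main

import (
	"fmt"
	"regexp"
)

// samePattern reports whether two pattern strings compile to expressions
// with identical source text. The name is illustrative, not part of regexp.
func samePattern(a, b string) bool {
	return regexp.MustCompile(a).String() == regexp.MustCompile(b).String()
}

func main() {
	interpreted := "\\w+" // interpreted literal: the backslash must be doubled
	raw := `\w+`          // raw literal: the backslash is kept as written

	fmt.Println(interpreted == raw)            // true
	fmt.Println(samePattern(interpreted, raw)) // true
}
```

Both literals produce the same three-character string, so the choice between them is purely about readability.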
Match the pattern
The following methods help you answer the question: is the pattern contained in the string? There are two of them: Match and MatchString.
The Match method takes a slice of bytes ([]byte) as an argument, and the MatchString method takes a string. Both return a boolean value: true if the pattern has been found anywhere in the input, and false otherwise.
package main

import (
	"fmt"
	"regexp"
)

func main() {
	re := regexp.MustCompile(`go`)
	fmt.Println(re.Match([]byte("Golang")))
	fmt.Println(re.Match([]byte("golang")))
	fmt.Println(re.MatchString("It has returned false"))
	fmt.Println(re.MatchString("You've got it right!"))
}

// Output:
// false
// true
// false
// true
The pattern of regexp is case-sensitive. You can see it in the first and the second lines of the output in the example above.
You probably think that regular expressions can only return errors and search for substrings. Well, let's not rush to judgment. Next, you will learn about the true power of regular expressions.
The dot
Let's get back to the filenames problem.
The first condition to search is the length of a filename. It's not as simple a task as it may seem at first glance. You could try to use Go's len() function, but it counts the bytes of the string, so if the filename contains a symbol encoded with more than one byte, you will get a wrong answer. Alternatively, you could use string-handling functions, but then you would need to take a lot of possible variations into account. Instead of complicating the task this much, let's concentrate on regular expressions.
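The difference between bytes and symbols can be seen in a short sketch. The helper byteAndRuneLen and the accented filename are invented for this illustration; utf8.RuneCountInString comes from the standard unicode/utf8 package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneLen returns the length of s in bytes and in Unicode code points.
func byteAndRuneLen(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	b, r := byteAndRuneLen("notes.doc")
	fmt.Println(b, r) // 9 9

	// \u00e9 is the letter "e" with an acute accent; UTF-8 encodes it with two bytes.
	b, r = byteAndRuneLen("r\u00e9sum\u00e9.doc")
	fmt.Println(b, r) // 12 10
}
```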
For example, let's try to print the filenames that contain the symbols d and t with exactly two symbols between them. A filename can be made up of many symbols (almost all available ones, except for some service ones). Thus, the template of the filename would, in the "human" way, look like this: d<any-symbol-1><any-symbol-2>t.
The regexp has a metacharacter that can fill the spaces between d and t! It is the dot (.). It stands for any single symbol in a string (by default, any symbol except a newline). The expression for the case above looks like d..t. Let's compile the pattern and use it in a conditional expression:
package main

import (
	"fmt"
	"regexp"
)

func main() {
	files := []string{
		"test.txt",
		"passwords.json",
		"notes.doc",
		"test2.txt",
		"dont't forget!!!.txt",
		"app.cfg",
		"send.rtf",
	}
	re := regexp.MustCompile(`d..t`)
	for _, f := range files {
		if re.MatchString(f) {
			fmt.Println(f)
		}
	}
}

// Output:
// dont't forget!!!.txt
// send.rtf
The pattern can occur anywhere within the string, and the program will find it. For example, the first line of the output above was triggered by "dont" placed at the beginning of a string (and because the word was written with a mistake). The second line of the output was triggered by d.rt, and that pattern was placed in the middle of the string.
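To check which part of a filename triggered the match, you could ask the expression for the matched substring itself. This sketch uses the FindString method of the regexp package, which is not covered in this topic; the helper name matchedPart is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

var dotRe = regexp.MustCompile(`d..t`)

// matchedPart returns the leftmost substring of s that matches d..t,
// or an empty string when nothing matches.
func matchedPart(s string) string {
	return dotRe.FindString(s)
}

func main() {
	fmt.Printf("%q\n", matchedPart("dont't forget!!!.txt")) // "dont"
	fmt.Printf("%q\n", matchedPart("send.rtf"))             // "d.rt"
	fmt.Printf("%q\n", matchedPart("notes.doc"))            // ""
}
```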
The star
The next popular file search task is finding files with some pattern inside the filename.
Consider the following task: find filenames with a certain symbol, considering that the symbol can repeat an unlimited number of times in a row. Let it be the "!" symbol.
The regexp mark * means that the element before it may repeat any number of times, including zero. It works only when something is placed before it; on its own, it causes a compile error. The pattern !* therefore matches a run of "!" symbols of any length. Because that run may also be empty, !* alone matches every string, so it has to be attached to some fixed text. Let's put it into practice and find the needed filenames with the regular expression forget!*.
package main

import (
	"fmt"
	"regexp"
)

func main() {
	files := []string{
		"test.txt",
		"passwords.json",
		"notes.doc",
		"test2.txt",
		"dont't forget!!!.txt",
		"app.cfg",
		"send.rtf",
	}
	re := regexp.MustCompile(`forget!*`)
	for _, f := range files {
		if re.MatchString(f) {
			fmt.Println(f)
		}
	}
}

// Output:
// dont't forget!!!.txt
Combination of marks
To combine the marks, you need to follow the rules of regexp. So far, we've followed only one strict rule: before the star, we need to place a symbol. Let's follow it and place the dot before the star: .*. This pattern matches any sequence of symbols of any length, including an empty one!
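A short sketch of how .* behaves on a few made-up inputs follows. The helper name anyPart is hypothetical, and FindString is a regexp method not covered in this topic. Note that, by default, the dot does not match a newline, so the match stops at the end of a line.

```go
package main

import (
	"fmt"
	"regexp"
)

var anyRe = regexp.MustCompile(`.*`)

// anyPart returns the leftmost substring of s matched by .*
func anyPart(s string) string {
	return anyRe.FindString(s)
}

func main() {
	fmt.Println(anyRe.MatchString(""))    // true: an empty sequence also counts
	fmt.Printf("%q\n", anyPart("app.cfg")) // "app.cfg"

	// By default the dot does not match "\n", so the match stops there.
	fmt.Printf("%q\n", anyPart("app\ncfg")) // "app"
}
```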
The next task is to find filenames with any sequence between two t symbols. The pattern looks like t<some_sequence_or_empty>t, or t.*t in regexp.
package main

import (
	"fmt"
	"regexp"
)

func main() {
	files := []string{
		"test.txt",
		"passwords.json",
		"notes.doc",
		"test2.txt",
		"dont't forget!!!.txt",
		"app.cfg",
		"send.rtf",
	}
	re := regexp.MustCompile(`t.*t`)
	for _, f := range files {
		if re.MatchString(f) {
			fmt.Println(f)
		}
	}
}

// Output:
// test.txt
// test2.txt
// dont't forget!!!.txt
The backslash
The last symbol we need to learn about is the backslash \. It's a magic symbol because it can switch the abilities of other symbols! For example, . is the filler for any single symbol, but \. is just a dot: the pattern \. searches for a single literal dot. It works in reverse as well: remember from common knowledge about strings that \n is the newline character, while n is just a letter.
Finally, let's figure out how to find filenames by extension using the regexp. Let it be the json extension. The pattern must include the dot, while the name of the file can be anything: .*\.json.
package main

import (
	"fmt"
	"regexp"
)

func main() {
	files := []string{
		"test.txt",
		"passwords.json",
		"notes.doc",
		"test2.txt",
		"dont't forget!!!.txt",
		"app.cfg",
		"send.rtf",
	}
	re := regexp.MustCompile(`.*\.json`)
	for _, f := range files {
		if re.MatchString(f) {
			fmt.Println(f)
		}
	}
}

// Output:
// passwords.json
The problem
Let's get back to the HTML problem and answer the question: does the string contain the div HTML tag?
First of all, you need to know what the tag should look like. Let's write the tag pattern: <tagname attributes>content</tagname>. Now you need to represent that pattern with a regular expression.
A few minutes ago, you learned that a regexp can contain plain symbols and special marks. Try to separate the HTML tag into two parts: passive and active. The passive part includes the symbols that must always be in the string. In our case, these are <div></div>.
The active part consists of the symbols that change. The attributes and the content can be strings of various lengths and symbols. The dot combined with the star can help us! The sequence .* is suitable in both cases. The final expression is <div.*>.*<\/div.*>. The backslash before / is optional in Go, because / has no special meaning in its regexp syntax.
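One caveat is worth checking before relying on this expression: because .* accepts anything, the pattern is quite permissive. The sketch below uses a hypothetical helper named hasDiv and a made-up divider tag to show that a tag whose name merely starts with div is accepted too.

```go
package main

import (
	"fmt"
	"regexp"
)

var divRe = regexp.MustCompile(`<div.*>.*<\/div.*>`)

// hasDiv reports whether s matches the div pattern from the text.
func hasDiv(s string) bool {
	return divRe.MatchString(s)
}

func main() {
	fmt.Println(hasDiv("<div>Accept!</div>")) // true
	fmt.Println(hasDiv("<div>Reject!<div>"))  // false

	// ".*" after "<div" also swallows the rest of a longer tag name.
	fmt.Println(hasDiv("<divider>text</divider>")) // true
}
```

So the expression answers "does the string look like it contains a div tag?" rather than validating HTML.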
package main

import (
	"fmt"
	"regexp"
)

func main() {
	re := regexp.MustCompile(`<div.*>.*<\/div.*>`)
	fmt.Println(re.Match([]byte("<div>Accept!</div>")))
	fmt.Println(re.Match([]byte("<div>Reject!<div>")))
	fmt.Println(re.MatchString("It's return false"))
	fmt.Println(re.MatchString("<div hidden class='h1'>With attributes!</div>"))
}

// Output:
// true
// false
// false
// true
Conclusion
This topic has been only the first step into the deep dungeon of regular expressions. Let's consolidate the knowledge:
regular expressions are a flexible way to search for patterns in text;
to compile a string into a regular expression, use the Compile or MustCompile functions;
to test a string for a pattern, use the Match or MatchString methods;
use backticks ` to write the regexp string;
the dot . and the star * are regexp marks;
the backslash \ switches the power of symbols.