Seek, and you will find.
Matt. 7:7
At one point or another, most of us have had to search for specific words on a webpage or within a document. It can be pretty tasking to do by hand, particularly when dealing with large volumes of text such as a thick book or a lengthy list of data including things like employee names, plane tickets, stock prices, and more. This search function isn't just useful in web browsers but is also essential in many software applications. Here, the challenge is to find a specific substring within a longer string—this is typically how computers store textual data.
Today, we are going to learn about the efficient and powerful Aho-Corasick algorithm that solves the above problem in almost linear time. A significant feature of this algorithm is the so-called pattern-matching machine, which is an example of a finite-state machine that is constructed from the initial set of given keywords. This machine iterates over the entire text to reveal the positions of found substrings.
Algorithm description
In this section, we will create such a pattern-matching machine from scratch. Remember that each finite-state machine consists of a set of states, transition functions, and an input. The set of states has one prominent starting state and at least one accept state where the machine ends its operating cycle. The transition functions control how the automaton behaves, and the input is simply the sequence of characters from the text string.
For the Aho-Corasick search machine, we will focus on three main components: the goto graph, failure function, and output function. We will examine each of them in detail and illustrate how they work. Let's say, our keywords are "he, she, his, hers" and the text string is "sushis".
Constructing a Trie of Search Patterns
The "Goto" function consists of a directed graph with vertices labeled by the number of the FSM state and edges representing the state transition based on the symbol input. This data structure is known as a trie or prefix tree and is generally used for searching specific keys within a set. As a function, goto maps a pair of a state and an input symbol either to a state or to the fail message. All possible main cases include:
-
If the state is zero and the input character doesn't match any of the first keyword letters, the
gotoreturns0and the machine remains in the initial state. So, unless the symbol input is "h" or "s", the machine stays in its initial zero state, performing a zero-to-zero loop. This condition is shown in the scheme by the formula "", expressing "NOT an element of the set {h,s}".
-
If the next input character matches that of the keyword, the function moves to the appropriate next state according to its directed edge labeled with that symbol. For instance,
goto(0,h) = 1, goto(1,e) = 2.
-
First, we add all the different words, like "he" and "she", to the "goto" trie because they begin with different letters.
-
Then, we start adding sub-branches at the states where the keywords have an overlap, like "h-e" and "h-i-s", "h-e" and "h-e-r-s".
-
After that, we identify all the accept states of the pattern-matching machine that will trigger the
outputfunction. Specifically, in this case, they are states 2, 5, 7, and 9. Notice that in state 5, the machine will print two found words: "she" and "he". -
Otherwise, when the input symbol doesn't match that of the keyword, our search iteration returns the
failmessage. As we can see in the picture, for example,goto(4,p) = fail. Basically, any input that doesn't fit the above directed graph triggers thefailmessage.
Making a Finite-State Machine out of the Trie
The "Failure" function is a state-to-state map that instructs the machine on which state to choose after getting a fail message. It's invoked whenever goto returns fail and it adds failure edges to the pattern-matching graph for those transitions, hence making a genuine finite-state machine out of the "goto" trie. The function essentially helps the algorithm keep track of where it should restart the search when a mismatch occurs, without starting all over.
Let's use the search pattern "he, she, his, hers" and the text string "sushis" as an example. At the letter "h" in "sushis" our automaton is in the fourth state (you might want to try writing down a list of intermediate states and transitions on your own!), since we nearly matched the keyword "she" from the set. Unfortunately, that’s not the case and goto(4,i) = fail. But then, the failure function allows us to continue where we can match the word "his": failure(4) = 1, and we perform a transition following by the red arrow.
The final piece is the output function. It maps a state to the corresponding subset of the pattern:
-
If the automaton reaches any of the accept states,
outputreturns the entire keyword underlying that specific branch of trie and all possible sub-keywords, like "she, he" in state 5. -
It can also be just one word:
output(7) = "his"andoutput(2) = "he". -
In all other states, the function returns an empty string as no keyword has been matched.
Scanning Text Using the FSM
Once we have understood and prepared the three components, we use them to process the string. Initially, the current state of the machine is the start state assigned to the zeroth vertex, and the first symbol of the text string is the current input. The machine then analyzes the text string using iterative operating cycles.
Each iteration begins with the next character of the string as input. Depending on whether it matches the appropriate character of the keyword or not, the finite-state machine goes forwards or backwards via the goto or failure function, respectively. If the last character of the substring search prompt is reached, then the output function gives the position of the keyword's first symbol in the ambient string where it was found.
Let's dissect these steps by detailing one operating cycle of the pattern-matching FSM. Let n be the current state of the machine, and c be the current input character of the string.
If goto(n,c) = m, the machine makes a "goto" transition and enters state m, consuming the next symbol of the string. Additionality, if output(m) ≠ empty, the machine emits the set of found keywords along with the position of the current input symbol. The operating cycle is complete.
If goto(n,c) = fail, the machine consults the failure function and performs a "failure" transition. If failure(n) = k, the machine repeats the cycle with k as the current state and c as the current input symbol.
Let's summarize all these steps by moving along the string "ushers" with the specific pattern-matching machine we just built. We start with state 0 and letter "u", which is neither "h" nor "s". So, a zero-to-zero transition happens: goto(0,u) = 0, output(0) = empty.
Next, we move to the second letter "s" in "ushers", and we see that goto(0,s) = 3 so a "goto" transition takes place, landing us in state 3. The output is output(3) = empty as none of the keywords has been found yet.
Then, we progress with the correct letters "h" and "e", reaching the accept state 5 and triggering the output function:
goto(3,h) = 4
output(4) = empty
goto(4,e) = 5
output(5) = "she", "he"
Now, every input character will invoke the failure function, as the FSM is in its accept state. In particular, goto(5,r) = fail (remember, we're still moving along the string "ushers", and had just reached its fifth letter, "r"). Moreover, failure(5) = 2 and the machine executes a failure transition to state 2 with the same current symbol "r".
Testing again: goto(2,r) = 8. Great! We found the correct place to resume the search. Checking the output: output(8) = empty. Now, we're entering the last character of the string, "s", and achieve our goal: goto(8,s) = 9, output(9) = "hers". Congratulations! We have identified every single occurrence of all the appropriate keywords in "ushers".
Conclusion
The Aho-Corasick algorithm efficiently scans the text and provides you with the positions of all encountered patterns. It's a noteworthy approach to simultaneously finding multiple patterns in a given text. We will delve deeper into this topic in following sections, give detailed examples, illustrate the algorithm's steps through pseudocode, analyze its time complexity, and discuss the significant conclusions about its role in text pattern matching. So, let's continue our journey in understanding the Aho-Corasick algorithm and its capabilities.