
Deeper into Aho-Corasick algorithm


"Begin at the beginning," the King said, very gravely, "and go on till you come to the end: then stop."

Lewis Carroll, Alice's Adventures in Wonderland

Introduction: brief recap on the algorithm

Let's recall what we know so far. If you're trying to find a set of keywords in a text, which is a string of characters, you want to do it as efficiently as possible. This means you don't want to start over from the beginning every time a keyword stops matching at the current character. Instead, you should shift to the most relevant state and continue matching other possible keywords. The best way to achieve this is to build a finite-state machine that performs different state transitions depending on whether there is a match or not.

The pattern-matching machine uses a "goto" function, essentially a directed tree graph. Let's consider the set of patterns S[5] = {"abac", "ab", "ba", "cac", "a"}. From these keywords, we get the following trie:

Moreover, if the current character doesn't match the keyword being processed, "goto" returns a "fail" message, triggering the failure function. This tells the finite-state machine which state it should check next for other potential matches. In our example, the pattern-matching machine consists of the "goto" and failure functions (green arrows show failure transitions):

If everything is okay and the machine reaches the last character of a keyword, then the output function activates and prints the found keywords and their positions in the original string. Remember, the output function is consulted after every successful return of "goto". However, unless there is a full match, it only gives an empty expression (you can think of it as an empty string, specifically "").

Today, you will learn how to write the code that builds such a finite-state machine, and how long the algorithm takes to run.

Pseudocode implementation

To gain a full understanding of how the algorithm works in real-life search engines, take a look at its pseudocode. This describes the core of the pattern-matching machine search process. We assume that the input is string[n], an array of n characters. The return values fail and empty are pre-defined macros recognized by the system. For example, fail can be thought of as -1, a typical error value; note that it doesn't overlap with the finite-state machine state values, which are non-negative integers. Similarly, empty represents a string containing one symbol that cannot occur in the text string, such as the null character.

First, let's look at the main part of the substring search process, namely the Aho-Corasick pattern-matching machine operating cycle:

//algorithm core: search process
begin
  state = 0
  for i in [1, n]:
    begin
      while goto(state, string[i]) == fail:
        do state = failure(state)
      state = goto(state, string[i])
      if output(state) != empty then
        begin
          print(i)
          print(output(state))
        end
    end
end

Each iteration of the for loop represents one operating cycle of the finite-state machine. The idea closely resembles the well-known Knuth-Morris-Pratt algorithm. The key difference is that the latter can search for only one keyword at a time, while Aho-Corasick handles any finite set of them.
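The operating cycle above can be run directly in Python. The sketch below hardcodes the goto, failure, and output tables for the example set S = {"abac", "ab", "ba", "cac", "a"} from the introduction (the state numbering is one possible assignment, not canonical); instead of the pseudocode's goto(0, char) = 0 convention, missed root transitions simply fall back to state 0 via dict.get.

```python
# Hand-built tables for S = {"abac", "ab", "ba", "cac", "a"};
# illustrative only -- normally these are constructed by the algorithm.
goto = {(0, 'a'): 1, (1, 'b'): 2, (2, 'a'): 3, (3, 'c'): 4,
        (0, 'b'): 5, (5, 'a'): 6,
        (0, 'c'): 7, (7, 'a'): 8, (8, 'c'): 9}
failure = {1: 0, 5: 0, 7: 0, 2: 5, 6: 1, 8: 1, 3: 6, 9: 7, 4: 7}
output = {1: {"a"}, 2: {"ab"}, 3: {"ba", "a"}, 4: {"abac"},
          6: {"ba", "a"}, 8: {"a"}, 9: {"cac"}}

def ac_search(text):
    """One operating cycle per character, mirroring the pseudocode."""
    matches = []
    state = 0
    for i, ch in enumerate(text):
        # follow failure links until some state has a goto transition on ch
        while state != 0 and (state, ch) not in goto:
            state = failure[state]
        state = goto.get((state, ch), 0)   # goto(0, ch) = 0 on a miss
        for word in output.get(state, set()):
            matches.append((i - len(word) + 1, word))
    return matches
```

Note that the pseudocode prints the current index i, i.e. the position where a match ends; this sketch reports the start position of each match instead.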

Now we can break down the above pseudocode to understand what's behind it. You may be wondering how to actually build the "goto" trie in a programming language. We discussed its structure as a directed graph when we covered the introduction to this algorithm. It turns out that all you need is the set of keywords, represented as an array of strings keyword[k]! How so? Let's see:

//constructing goto trie
begin
  new_state = 0
  for i in [1, k]:
    do enter(keyword[i])
  for all char such that goto(0, char) == fail:
    do goto(0, char) = 0
end

Looks good, right? But you might be curious about enter(keyword[i]). Don't worry, the pseudocode below fully defines this function's behavior. It's an auxiliary function introduced to keep the previous snippet clean and easy to read. It takes a string (which we treat as an array of characters, char_array[m]) and adds it to the "goto" trie. It doesn't return any value, so we call it a "procedure" to emphasize that its role is purely to perform actions on the "goto" graph data structure.

//auxiliary procedure of adding keywords to trie
procedure enter(char_array):
begin
  state = 0
  j = 1
  while goto(state, char_array[j]) != fail:
    begin
      state = goto(state, char_array[j])
      j = j + 1
    end
  for p in [j, m]:
    begin
      new_state = new_state + 1
      goto(state, char_array[p]) = new_state
      state = new_state
    end
  output(state) = char_array
end

Notice two interesting points:

  1. The array keyword[k] is a set of k elements keyword[i] which are strings. For instance, keyword[4] = {"Paris", "London", "Amsterdam", "Madrid"}. Viewing strings as arrays of characters, char_array[m] is a string of length m, and each element char_array[j] is a symbol. For instance, char_array[4] = "Nice" = {'N', 'i', 'c', 'e'}. Keep in mind the type of fixed-size array you're dealing with at every step.

  2. The penultimate line of code tells us that we're filling in the "goto" prefix tree and partially constructing the output function at the same time! This is quite efficient: adding a keyword leaves us in its accept state, where the output function will later be consulted, so we record right away what it should print there.
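The two construction snippets above translate into Python almost line for line. In this sketch (the names build_goto and enter mirror the pseudocode, not any library API), goto is a dictionary from (state, char) pairs to states, and output maps each accept state to the keyword recognized there:

```python
def build_goto(keywords):
    """Build the goto trie and the partial output function."""
    goto, output = {}, {}
    new_state = 0

    def enter(word):
        nonlocal new_state
        state, j = 0, 0
        # follow existing transitions while the trie already matches
        while j < len(word) and (state, word[j]) in goto:
            state = goto[(state, word[j])]
            j += 1
        # create fresh states for the remaining suffix of the keyword
        for ch in word[j:]:
            new_state += 1
            goto[(state, ch)] = new_state
            state = new_state
        output[state] = word        # partial output function

    for word in keywords:
        enter(word)
    return goto, output
```

The pseudocode's final loop, which sets goto(0, char) = 0 for every unmatched character, has no direct counterpart here: a dictionary version can fall back to state 0 at search time with goto.get((0, ch), 0).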

We have done almost all the work. Finally, we need to cover the construction of the failure function, which will make our pattern-matching machine complete and let it move smoothly along the optimal state-transition routes. I recommend refreshing your understanding of the queue data structure before diving into the code; all you need is its core definition and basic behavior.

//constructing failure function
begin
  create queue //we assume queue is created empty by default, length(queue) = 0
  for each char such that goto(0, char) == s and s != 0:
    begin
      add s to queue
      failure(s) = 0
    end
  while length(queue) != 0:
    begin
      r = next state in queue
      delete r from queue
      for each char such that goto(r, char) == s and s != fail:
        begin
          add s to queue
          state = failure(r)
          while goto(state, char) == fail:
            do state = failure(state)
          failure(s) = goto(state, char)
          output(s) = output(s) + output(failure(s))
        end
    end
end

What's happening here? We create a queue and first store in it every depth-one state, that is, every state directly reachable from the root; at the same time, we point the failure links of these states at the initial state. Then, taking states from the queue in breadth-first order, we determine where the machine should go when a transition fails: the inner while loop walks along the failure links of shallower, already-processed states until it finds one with a valid goto transition on the current character. Each newly discovered state s gets both its failure link and its final output value assigned this way.
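Here is the same construction as a runnable Python sketch, using collections.deque for the queue. The goto trie and the partial output function are built inline (a compressed form of enter), and outputs are stored as sets so that output(s) and output(failure(s)) can be merged; the linear scan over goto.items() to find the children of r is for clarity, not efficiency.

```python
from collections import deque

keywords = ["abac", "ab", "ba", "cac", "a"]

# build the goto trie and partial output (compressed form of enter)
goto, output, new_state = {}, {}, 0
for word in keywords:
    state = 0
    for ch in word:
        if (state, ch) not in goto:
            new_state += 1
            goto[(state, ch)] = new_state
        state = goto[(state, ch)]
    output.setdefault(state, set()).add(word)

# construct the failure function breadth-first
failure = {}
queue = deque()
for (s, ch), t in goto.items():
    if s == 0:                         # depth-one states fail to the root
        failure[t] = 0
        queue.append(t)
while queue:
    r = queue.popleft()
    for (s, ch), t in goto.items():    # children of r (linear scan for clarity)
        if s != r:
            continue
        queue.append(t)
        state = failure[r]
        # walk failure links until some state has a goto transition on ch
        while state != 0 and (state, ch) not in goto:
            state = failure[state]
        failure[t] = goto.get((state, ch), 0)
        # finalize the output function for the new state
        output[t] = output.get(t, set()) | output.get(failure[t], set())
```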

Time Complexity

The runtime of the Aho-Corasick algorithm is critical to its efficiency for substring search. Generally, it has O(n + m + q) complexity, where n is the text string length, m is the total length of all the keywords, and q is the number of all keyword occurrences in the text. To understand this bound, let's analyze the steps involved in the algorithm.

Firstly, the algorithm core search process takes O(n) state transitions to process a text string of length n. In each operating cycle the pattern-matching machine makes zero or more failure transitions followed by exactly one "goto" transition. Each "goto" transition increases the depth of the current state by at most one, and each failure transition decreases it by at least one. So if the FSM is in a state d nodes away from the zero root, it can never make more than d failure transitions in one cycle: it either stops at a node with a relevant suffix to continue the search or, in the worst case, ends up in the zero state to start again from scratch. Over the whole run, the total number of failure transitions therefore cannot exceed the total number of "goto" transitions, and there are exactly n "goto" transitions in processing a text string of length n; hence we have the desired O(n) estimate.

Next, building the "goto" trie requires O(m) time because we insert every symbol of every keyword into the trie with the enter procedure. Common prefixes among the patterns can save some of this work, but only by a constant factor, so the bound stays O(m). At this point we already know that the algorithm takes at least O(n + m) time to run.

Now, the last significant part of the algorithm is its output. If our set of keywords is something like keywords[N] = {"blah", "blahblah", "blahblahblah", ... , "blahblah...blah"}, where the last keyword consists of "blah" repeated N times, and our text string is string[1000*N] = "blahblah...blah", then there will be a huge number of output invocations during the string iteration. These calls can make up the majority of the runtime, which is why they must appear in the time complexity of the algorithm. Without further assumptions about the alphabet, the set of keywords, or the text string, the only reasonable bound on the number of output calls is O(q), where q is the total number of occurrences of all the keywords in the text.
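A quick sanity check of this effect, under hypothetical small parameters: with keywords "blah" repeated 1 through N times and a text of repeated "blah"s, the total number of (overlapping) occurrences q already exceeds the text length n. The helper occurrences is a brute-force counter written for this illustration.

```python
def occurrences(pattern, text):
    # count overlapping occurrences by brute force
    return sum(text.startswith(pattern, i) for i in range(len(text)))

N = 10
keywords = ["blah" * i for i in range(1, N + 1)]   # "blah", "blahblah", ...
text = "blah" * (10 * N)                           # n = 400 characters

# keyword i (length 4*i) occurs at every 4-character offset: 101 - i times
q = sum(occurrences(w, text) for w in keywords)
```

Here q = 955 while n = 400, so for this keyword family the output already dominates the text-scanning work.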

From the pseudocode, you can see that there should be further deliberation about the time the "goto" and failure functions themselves take. However, there are several canonical implementations of these functions that allow O(1) transition time, usually at the cost of extra memory. For example, the "goto" trie can be stored as a two-dimensional array, or as linked lists partially combined with direct-access tables for the most frequently visited states. So, the time complexity of the Aho-Corasick algorithm is linear in the text length, the total keyword length, and the total number of keyword occurrences in the text. This linear time complexity makes the algorithm highly efficient for substring search tasks, even for large pattern sets and texts.
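As a sketch of the two-dimensional-array idea, the dictionary-based goto and failure tables can be compiled into a dense table delta[state][character], so that every transition is a single array lookup with no failure chasing at search time. The tables below are the example machine for S = {"abac", "ab", "ba", "cac", "a"} over a toy three-letter alphabet; compile_table is an illustrative name, not a standard routine.

```python
from collections import deque

goto = {(0, 'a'): 1, (1, 'b'): 2, (2, 'a'): 3, (3, 'c'): 4,
        (0, 'b'): 5, (5, 'a'): 6,
        (0, 'c'): 7, (7, 'a'): 8, (8, 'c'): 9}
failure = {1: 0, 5: 0, 7: 0, 2: 5, 6: 1, 8: 1, 3: 6, 9: 7, 4: 7}
ALPHABET = "abc"

def compile_table(goto, failure, n_states):
    """Merge goto and failure into one dense transition table."""
    delta = [[0] * len(ALPHABET) for _ in range(n_states)]
    children = {}
    for (s, ch), t in goto.items():
        children.setdefault(s, []).append(t)
    # visit states in breadth-first order so that delta[failure[s]]
    # (a strictly shallower state) is always filled in before delta[s]
    queue = deque([0])
    while queue:
        s = queue.popleft()
        queue.extend(children.get(s, []))
        for ci, ch in enumerate(ALPHABET):
            if (s, ch) in goto:
                delta[s][ci] = goto[(s, ch)]
            elif s != 0:
                delta[s][ci] = delta[failure[s]][ci]   # failure resolved
    return delta
```

This trades memory (one row per state, one column per alphabet symbol) for guaranteed O(1) transitions, which is exactly the kind of space/time trade-off mentioned above.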

Conclusion

The Aho-Corasick algorithm offers several advantages over other substring search algorithms, making it a popular choice in various applications. Let's summarize its key aspects and compare it to other algorithms.

  1. Efficient Multiple Pattern Matching: The Aho-Corasick algorithm efficiently handles multiple pattern matching. By constructing a trie data structure, it can simultaneously search for multiple patterns in the given text. This makes it suitable for applications where a large number of patterns need to be matched simultaneously, such as DNA sequencing in bioinformatics or virus scanning.

  2. Implementation: As we have just seen, this algorithm can be practically implemented in every programming language by following four crucial steps. Namely, building the "goto" trie, adding keywords using the "enter" procedure, building the failure function (while constructing the output simultaneously), and finally, coding an operating cycle in an iterative loop.

  3. Linear Time Complexity: The running time of the Aho-Corasick algorithm is a significant advantage. With a time complexity of O(n + m + q), it ensures fast performance, even for large pattern sets and texts. Alternatives such as naïve string matching, or running the Knuth-Morris-Pratt algorithm once per pattern, scale worse when many patterns must be matched at once.

  4. Additional Space Requirement: One limitation of the Aho-Corasick algorithm is its need for additional space. The trie data structure used by the algorithm requires extra memory to store the nodes and transitions. In memory-constrained environments, this can be a limitation, especially for large pattern sets. Other algorithms like the Boyer-Moore algorithm have lower space requirements.

To sum up, understanding the Aho-Corasick algorithm provides you with a powerful tool for substring search. Its efficient multiple pattern matching capabilities, combined with a linear time complexity, make it suitable for a wide range of applications. However, keep in mind the space requirements when using the algorithm in memory-constrained environments. By leveraging the strengths of the Aho-Corasick algorithm, you can effectively search for multiple patterns and tackle complex substring search tasks.
