find() under the hood

You already know how to find a substring in a string; that's good. But do you know how Python does it? In this topic, we will provide you with a detailed explanation of what happens when you call the find() method.

The find() method

Before we dive into the details of the algorithm, let's refresh our memory. find() is a str method that returns the lowest index at which a substring occurs in a string. Here's an example:

print('wonderland'.find('n'))  # 2

There are two occurrences of n in this string: at indexes 2 and 8. find() returns only the index of the first one. If the substring is not found, it returns -1.
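
To make the return values concrete, here is a small sketch that uses only the built-in method. The variable name text is ours, chosen for illustration:

```python
text = 'wonderland'

print(text.find('n'))     # 2: index of the first occurrence only
print(text.find('land'))  # 6: index where the substring starts
print(text.find('z'))     # -1: the substring is not in the string
```

Note that -1 is also a valid index in Python (the last character), so it is safer to compare the result with -1 explicitly than to use it for indexing right away.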

We can also specify a search interval by passing the start and end indexes to the method. This is how we can find the second n in wonderland:

print('wonderland'.find('n', 5, 9))  # 8
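
Two details of the search interval are easy to miss, so here is a short sketch (again, the name text is just for illustration). The returned index always refers to the whole string, not to the interval, and the end index itself is not included in the search:

```python
text = 'wonderland'

print(text.find('n', 5, 9))  # 8: position in the whole string
print(text[5:9].find('n'))   # 3: position inside the slice 'rlan'
print(text.find('n', 5, 8))  # -1: index 8 is outside the interval
print(text.find('n', 3))     # 8: the end index is optional
```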

Now that we have quickly gone through the syntax of this method, let's talk about the search algorithm.

The naive search algorithm

We start with an example of a very simple naive search algorithm. Let's say we need to find LAND in ALICEINWONDERLAND. Here's how this algorithm may operate:

First, it compares L (the first letter in the pattern) with A (the first letter in the string). It doesn't match, so the algorithm moves one position to the right and compares L with L (the second letter in the string). It's a match!
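Now, it compares A (the second letter in the pattern) with I (the next letter after the one that matched). They don't match, so the algorithm moves one position to the right and starts again from L. It keeps moving this way, one position at a time, until it reaches the L at index 13;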
Then it compares L with L, A with A, N with N, D with D, and sees that everything matches. It has found our substring!

It took us 18 comparisons to search a string of only 17 characters, and this is a lot. Surely the Python developers picked something faster as their default search algorithm.
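
To check the number of comparisons, we can sketch the naive search in a few lines. This is only an illustration of the steps above: the function name naive_find and the comparison counter are our own additions, not part of Python.

```python
def naive_find(text, pattern):
    """Return (index, comparisons); index is -1 if the pattern is absent."""
    comparisons = 0
    n, m = len(text), len(pattern)
    # Try every position where the pattern could still fit
    for start in range(n - m + 1):
        matched = 0
        # Compare left to right until a mismatch or a full match
        while matched < m:
            comparisons += 1
            if text[start + matched] != pattern[matched]:
                break
            matched += 1
        if matched == m:
            return start, comparisons
    return -1, comparisons


print(naive_find('ALICEINWONDERLAND', 'LAND'))  # (13, 18)
```

The result agrees with 'ALICEINWONDERLAND'.find('LAND'), which also returns 13.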

The Python search algorithm

The search algorithm behind find() in CPython is based on the Boyer-Moore-Horspool algorithm or, simply, the Horspool algorithm; the actual implementation adds a few optimizations on top of it. The algorithm was published by Nigel Horspool in 1980 as a simplification of the Boyer-Moore string search algorithm.

On typical input, this algorithm is much faster than the naive one. To see how it works, we'll get back to our example and find LAND in ALICEINWONDERLAND.

The first step is to construct a bad match table. It is a table where we assign certain values to the characters in the search pattern. Here's the formula for the values:

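$value = length\_of\_the\_pattern - index\_of\_the\_character - 1$

Our pattern is LAND, and its length is 4. L has index 0, so its value is $4 - 0 - 1 = 3$. The value for A is $4 - 1 - 1 = 2$, and the value for N is $4 - 2 - 1 = 1$. The last character is an exception: the formula would give 0, so its value is set to the length of the pattern instead, which is 4 in our case. If a character is repeated in the pattern, the value of its rightmost occurrence (not counting the last position) is used. Finally, * stands for any character that does not occur in the pattern. Its value is also equal to the length of the search pattern: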

Character	L	A	N	D	*
Value	3	2	1	4	4
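
The table can be sketched as a plain dictionary. The function name bad_match_table is hypothetical, and the sketch assumes a non-empty pattern; characters that are missing from the dictionary play the role of * and are looked up with a default value:

```python
def bad_match_table(pattern):
    """Build the shift table for a non-empty pattern (illustrative sketch)."""
    m = len(pattern)
    table = {}
    # Every character except the last one: value = length - index - 1.
    # A repeated character keeps the value of its rightmost occurrence.
    for index, char in enumerate(pattern[:-1]):
        table[char] = m - index - 1
    # The last character gets the full length unless it also occurs earlier
    table.setdefault(pattern[-1], m)
    return table


table = bad_match_table('LAND')
print(table)              # {'L': 3, 'A': 2, 'N': 1, 'D': 4}
print(table.get('C', 4))  # 4: any other character (*) maps to the length
```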

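Now, we can start the comparison. We align the pattern with the beginning of the string, but we compare from the last letter of the pattern (D, not L). Whenever an attempt fails, we look up the string character aligned with the last letter of the pattern in the table and move the pattern by its value.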

We have LAND and ALIC to compare. D and C don't match, so we move our pattern four positions to the right instead of one. Why four? Because C does not occur in the pattern, so it counts as * in our bad match table, and the value of * is 4;
Now we have LAND and EINW. D and W don't match, so we move four positions to the right again, as W is * in the table;
Now it's LAND and ONDE. D and E don't match; E is * in the table. We move four positions to the right again;
Next, we have LAND and RLAN. D and N don't match, and the N value is 1. So, we move 1 position to the right;
Now we have LAND and LAND. First, we compare D with D, and it works. So, we go to the previous character and compare N with N. It's a match, too. Then, it's A and A, L and L. We have found our substring!

It took only 8 comparisons instead of 18 with the naive search algorithm.
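
The walkthrough above can be sketched in Python as follows. This is a minimal illustration under our own assumptions, not CPython's actual C code: the name horspool_find and the comparison counter are hypothetical, and the table is a dictionary in which missing characters default to the pattern length.

```python
def horspool_find(text, pattern):
    """Return (index, comparisons); index is -1 if the pattern is absent."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0, 0  # like str.find, an empty pattern matches at index 0
    # Bad match table: all characters except the last one
    table = {char: m - index - 1 for index, char in enumerate(pattern[:-1])}
    comparisons = 0
    start = 0
    while start <= n - m:
        # Compare the pattern with the current window from right to left
        offset = m - 1
        while offset >= 0:
            comparisons += 1
            if text[start + offset] != pattern[offset]:
                break
            offset -= 1
        if offset < 0:
            return start, comparisons
        # Shift by the value of the character under the pattern's last letter
        start += table.get(text[start + m - 1], m)
    return -1, comparisons


print(horspool_find('ALICEINWONDERLAND', 'LAND'))  # (13, 8)
```

The shift is always at least 1, because every value in the table is at least 1, so the loop is guaranteed to finish.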

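The average-case complexity of the Horspool algorithm is $O(n)$, and the worst-case complexity is $O(nm)$, where $n$ is the length of the string and $m$ is the length of the search pattern. The best-case complexity is sublinear: when every attempt fails on the first comparison and the pattern moves by its full length, only about $n/m$ comparisons are needed.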
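
To get a feeling for the best and the worst case, here is a self-contained sketch that only counts comparisons. The function name count_comparisons and the test strings are made up for illustration:

```python
def count_comparisons(text, pattern):
    """Count character comparisons made by a Horspool-style scan."""
    n, m = len(text), len(pattern)
    shifts = {char: m - index - 1 for index, char in enumerate(pattern[:-1])}
    comparisons, start = 0, 0
    while start <= n - m:
        offset = m - 1
        while offset >= 0:
            comparisons += 1
            if text[start + offset] != pattern[offset]:
                break
            offset -= 1
        if offset < 0:
            break  # match found, stop counting
        start += shifts.get(text[start + m - 1], m)
    return comparisons


text = 'A' * 20
print(count_comparisons(text, 'XYZW'))  # 5: every shift is a full 4 positions
print(count_comparisons(text, 'BAAA'))  # 68: 4 comparisons at each of 17 positions
```

In the second call, the mismatch is found only at the first letter of the pattern and the shift is always 1, which gives $m(n - m + 1) = 4 \cdot 17 = 68$ comparisons.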

Conclusion

In this topic, we've discussed how the find() method works. Let's quickly go through the main points:

The default algorithm for substring search is the Horspool algorithm;
The Horspool algorithm first constructs a bad match table and then goes through the string, comparing it with the pattern according to the values in the table.
