You already know how to find a substring in a string; that's good. But do you know how Python does it? In this topic, we will provide you with a detailed explanation of what happens when you call the find() method.
The find() method
Before we dive into the details of the algorithm, let's refresh our memory. Find() is a str method that returns the lowest index of the substring in a string. Here's an example:
print('wonderland'.find('n')) # 2
There are two n occurrences in this string: at the index 2 and 8. Find() returns only the first occurrence. If the substring is not found, it returns -1.
We can also specify a search interval by passing the start and end indexes to the method. This is how we can find the second n in wonderland:
print('wonderland'.find('n', 5, 9)) # 8
Once we have quickly gone through the syntax of this method, let's talk about the search algorithm.
The naive search algorithm
We start with an example of a very simple naive search algorithm. Let's say we need to find LAND in ALICEINWONDERLAND. Here's how this algorithm may operate:
First, it compares
L(the first letter in the pattern) withA(the first letter in the string). It doesn't match, so the algorithm moves one position to the right and comparesLwithL(the second letter in the string). It's a match!Now, it compares
A(the second letter in the pattern) withI(the next letter after the one that matched). They don't match, so it keeps moving until it reaches theLat index 13;Then it compares
LwithL,AwithA,NwithN,DwithD, and sees that everything matches. It has found our substring!
It took us 18 comparisons, and this is a lot. The Python developers should've definitely picked something faster as their default search algorithm.
The Python search algorithm
The search algorithm that Python uses by default is the Boyer-Moore-Horspool algorithm or, simply, the Horspool algorithm. It was created by Nigel Horspool in 1980 to simplify the Boyer-Moore string search algorithm.
This algorithm is much faster than the naive one. To see how it works, we'll get back to our example and find LAND in ALICEINWONDERLAND.
The first step is to construct a bad match table. It is a table where we assign certain values to the characters in the search pattern. Here's the formula for the values:
.
Our pattern is LAND. The length is 4. The value for L is . is the index. The value for A is and so on. The value of the last character is the length of the pattern. In our case, it is 4. The same goes for the *. It can mean any other characters. Its value is also equal to the length of the search pattern:
Character | L | A | N | D | * |
Value | 3 | 2 | 1 | 4 | 4 |
Now, we can start the comparison. We begin with the last letter in the pattern (D, not L). However, we still start at the beginning of the string.
We have
LANDandALICto compare.DandCdon't match, so we move our pattern, not one position to the right but four. Why four? Because that's the value ofCthat is*in our bad match table;Now we have
LANDandEINW.DandWdon't match, so we move four positions to the right again, asWis*in the table;Now it's
LANDandONDE.DandEdon't match;Eis*in the table. We move four positions to the right again;Next, we have
LANDandRLAN.DandNdon't match, and theNvalue is 1. So, we move 1 position to the right;Now we have
LANDandLAND. First, we compareDwithD, and it works. So, we go to the previous character and compareNwithN. It's a match, too. Then, it'sAandA,LandL. We have found our substring!
It took only 8 comparisons instead of 18 with the naive search algorithm.
The average-case complexity of the Horspool algorithm is . is the worst case, where is the length of the search pattern, and is the length of the string. The best-case complexity is sublinear.
Conclusion
In this topic, we've discussed how the find() method works. Let's quickly go through the main points:
The default algorithm for substring search is the Horspool algorithm;
The Horspool algorithm first constructs a bad match table and then goes through the string, comparing it with the pattern according to the values in the table.