You're already familiar with the sort() method and the sorted() function. In this topic, we will take a look at these things in action. They use one algorithm, and the goal of this topic is to explain this algorithm so that you can employ it more intelligently.
Quick recap
We'll start with quickly going through the sort() and sorted() methods. Practice always makes perfect!
Sort() is a list method, so you cannot sort a dict, a str or anything else apart from a list. Let's see how it works:
a_list = [15, 18, 16, 17]
a_list.sort()
print(a_list) # [15, 16, 17, 18]
We can also sort this list in descending order by setting the reverse flag to True to get [18, 17, 16, 15], instead.
By default, a list of the int or float type is sorted in ascending order, and a list of the str type is in alphabetic one. If you need something else, you can use the key argument and pass a function to sort your list accordingly. Let's say we have a list of integers, and we want odd numbers to come first in the sorted version of that list. First, we define the even function that returns True if the num is even and False if it's odd. Then we can use it as a key for sorting:
def even(num):
return num % 2 == 0
a_list = [15, 18, 16, 17]
a_list.sort(key=even)
print(a_list) # [15, 17, 18, 16]
Now, let's refresh our memory about the differences between sort() and sorted(). Both sort a list in ascending order, but there are two important things to keep in mind:
Sort()modifies the input list whilesorted()creates a new one. The return value forsort()isNone; forsorted()it'slist;Sort()only works withlists.Sorted()can take any iterable as input, for example,strordict, and turn it into a sortedlist.
Otherwise, these methods are very similar; reverse and key arguments work the same with both.
Timsort
There are many different sorting algorithms out there. Bubble sort, Merge sort, and Insertion sort are just some of them. The latter two create a base for Timsort – a hybrid algorithm that is used every time you call sort() or sorted() in Python.
Timsort is a fast and efficient algorithm created by Tim Peters. The time complexity of this algorithm is as follows:
in the average and worst cases;
in the best case (if the list is already sorted).
In the table below, we compare Timsort to three other algorithms: Merge sort, Insertion sort, and Quicksort. Timsort is a smart combination of the first two. Quicksort is thought to be one of the fastest ones. The table below provides you with not big- only but also with Θ and Ω. Take a look at the article by CathyatSeneca if the concepts behind all these letters are not clear to you.
| Best-case | Average | Worst-case |
Timsort | |||
Quicksort | |||
Merge sort | |||
Insertion sort |
As you see, Timsort is actually faster than other algorithms.
It's also important to know that Timsort is a stable sorting algorithm. This means that it doesn't change the relative order of repeated elements. This feature is insignificant when equal elements are indistinguishable. In [2, 2, 4, 2], all 2 are the same. Sometimes it's important when sorting key-value pairs. For example, [[9, 2], [1, 4], [3, 4], [3, 1], [3, 8]]. Here we have three elements starting with 3: [3, 4], [3, 1], and [3, 8]. Our goal is to sort these lists by the first element. Basically, the algorithm considers [3, 4], [3, 1], and [3, 8] equal. Let's try to sort this list in Python. First, we define the zero function to sort the list based on the first elements of sub-lists only:
def zero(inp):
return inp[0]
Next, we set the zero function as the key for sorting:
a_list = [[9, 2], [1, 4], [3, 4], [3, 1], [3, 8]]
a_list.sort(key=zero)
print(a_list) # [[1, 4], [3, 4], [3, 1], [3, 8], [9, 2]]
Note that the elements [3, 4], [3, 1] and [3, 8] come in their initial order. That's because our algorithm is stable. An unstable algorithm may return [3, 1], [3, 4], [3, 8] or [3, 8], [3, 1], [3, 4]. It may also return a list where the order of the repeated elements is intact, but there's no guarantee that it would do so.
How Timsort works
Let's discuss what this algorithm does at length.
The first question you need to ask is Does the list has less than 64 elements in it? If the answer is yes, then the Insertion sort is used. It's a simple algorithm, and it works effectively with small lists. As you probably remember, Insertion sort works in two steps:
It first divides the list into two parts, a sorted and unsorted one. At first, only the element at index
0is considered as the sorted part;Then it looks at the elements in the unsorted part one by one and inserts them in the correct place in the sorted part.
Then these steps are repeated until the whole list is sorted.
What happens if there are 64 or more elements in the array? That's where things become a bit more complicated. First, the algorithm goes through the list looking for the natural runs, already-sorted subsequences that are present in most real-world datasets. These subsequences must be either ascending or descending (in this case, they are reversed). For example, in the list [4, 3, 2, 5, 7, 8, 1], 4, 3, and 2 are descending natural runs. It gets reversed, and we end up with 2, 3, 4. 5, 7, and 8 being an ascending natural run. The algorithm iterates over the list and collects all the elements into runs.
The minrun is the minimal size of a run. This size is determined based on the length of the input. In most cases, it's within the range of 32 to 64 inclusive. The minrun size is important because if it's too short, there will be too many runs to merge. If it's too long, it'd be ineffective to use Insertion sort on it.
If a run is smaller than the minrun, Insertion sort is used. It means that the algorithm adds elements to the run one by one taking them from the next run until the minimum size is reached. Do you remember that Insertion sort works best on short lists? Timsort takes advantage of this.
Once this step is completed, we end up with a lot of sorted sub-arrays (runs). Now, it's time to merge them! All runs are merged two at a time, preferably of equal size, as Merge sort is most effective on the lists of equal length. This is repeated until one sorted list remains.
This is a somewhat simplified explanation of how Timsort works. It has some more features that make it even faster (like the galloping mode), but the basic idea is as follows:
Timsort makes the best of two worlds by combining Merge sort with Insertion sort and takes advantage of already-sorted chunks of data, so-called natural runs.
Conclusion
In this topic, we've shown you how the sort() and sorted() methods work under the hood. Let's quickly go through the main points:
The default sorting algorithm in Python is Timsort, a stable hybrid sorting algorithm;
A hybrid algorithm is a combination of two or more algorithms. In this case, Insertion and Merge sort;
A stable algorithm doesn't change the order of equal elements;
If the length of the input array is less than 64 elements, simple insertion sort is used;
If the length of the array is more than 64, it's divided into runs. Each run is sorted with Insertion sort, and then they are all combined into one list using Merge sort.