Sorting and deduplication

A lot of new information can be overwhelming, so naturally, it helps to arrange the data in some specific order. Sorting makes the output more readable and easier to search, and it also helps you keep track of duplicates. The sort and uniq utilities solve these problems, each with its own set of options. In this topic, we'll learn how to use them. Let's start with sort.

Sort utility

Sort is a utility that outputs lines of text in a specific order. It can sort text from one or several files as well as the output of other Linux commands. For example, you may sort files by size or count how often each command appears in your history.

To start working with sort, you need to know the syntax. The exact form depends on what you want to sort. If your task is to sort the contents of files, the command looks like this: sort <options> <filename>. In the simplest case, without any options, it will look something like this:

$ sort test.txt
computer
data
data
DATA
database
programmer

Here we just created a test file with multiple lines and listed those lines in lexicographical order. If you want to sort the output of a command instead, you need to type <command> | sort <options>:

$ cat test.txt | sort
computer
data
data
DATA
database
programmer

Now let's get acquainted with the options of the sort command.

Sort options

There are a lot of options, so we will cover only some of them. For example, if you want to sort the lines of the file in reverse order, you need to add the -r (reverse) option:

$ sort -r test.txt
programmer
database
DATA
data
data
computer

As you can see, some words in the test.txt file are repeated several times. The sort command has a -u (unique) option that removes duplicates. However, it is case-sensitive, so if you want to remove all repetitions regardless of case, including those written in upper case, supplement it with the -f (fold case) option. Thus, you can display the contents of the test.txt file with the word 'data' appearing only once:

$ sort -uf test.txt
computer
data
database
programmer

Please note that this command does not change the contents of the file; it only prints the deduplicated data.
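
If you do want to keep the result, you can redirect the output to another file or use sort's -o option. Here is a quick sketch (sorted.txt is just a name picked for illustration):

$ sort -uf test.txt > sorted.txt    # shell redirection
$ sort -uf -o sorted.txt test.txt   # the same, using the -o option
$ cat sorted.txt
computer
data
database
programmer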

Now, let's see how sort works with numbers:

$ echo '10
5
210' | sort

10
210
5

It might not be what you expected when you thought of sorting numbers. By default, sort treats lines with numbers as plain strings and sorts them lexicographically. So, if you want a numeric sort, you can apply either the -n (numeric) or the -h (human-numeric) flag:

$ echo '10
5
210' | sort -n

5
10
210

All right, now we have our numbers from the smallest to the largest!
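
The -h flag goes a step further: in GNU sort it also understands human-readable size suffixes such as K, M, and G, which is handy for sorting the output of commands like du -h. A small illustration:

$ echo '2M
512K
1G' | sort -h

512K
2M
1G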

Sorting by key

One more option we are going to discuss here is -k, which stands for sorting by key (field). For example, if you want to display the contents of the current directory sorted by the day of the month each file was last modified, you can use field number 7 as -k7. Since this is numeric sorting, we need the -n option again.

In total, we get ls -l, which outputs a detailed list with the owner, group, modification date, size, and other parameters, plus a pipe | and the sort command with the options -nk7:

$ ls -l | sort -nk7
total 155472
-rw-rw-r-- 1 admin admin       66 Apr  2 08:41 test.txt
-rw-rw-r-- 1 admin admin 67037191 Oct  3  2018 habr_texts.txt
-rw-r--r-- 1 admin admin        0 Mar 12 16:22 corpus.txt
-rw-r--r-- 1 admin admin 68582461 Nov 26 14:42 wiki_data.txt
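
By default, sort splits lines into fields on blanks. If your data uses another delimiter, the -t option sets it. As a sketch, suppose we have a hypothetical comma-separated file ages.csv and want to sort it numerically by the second field:

$ cat ages.csv
alice,42
bob,7
carol,19
$ sort -t',' -nk2 ages.csv
bob,7
carol,19
alice,42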

However, the sort command is not all there is when working with duplicates. There is a separate uniq utility, which we will discuss below.

Uniq utility

The uniq command is designed to find repeated adjacent lines in a text. The user can then perform various actions on the found matches, such as deleting them from the output or, on the contrary, displaying only these matches.

The command syntax is similar to that of the previous utility: uniq <options> <file_source>. You can also add an optional parameter <file_target> to save the output to another file. For example, let's take the test.txt file and write all the unique lines from it to new.txt. Then, using the cat command, let's look at the contents of the new file:

$ uniq test.txt new.txt
$ cat new.txt
computer
data
DATA
database
programmer

As you can see, this command is case-sensitive. To display the word 'data' once, you need to use the additional option -i (ignore case):

$ uniq -i test.txt
computer
data
database
programmer

And if you need to count the number of repetitions, then you need to add the -c (count) flag:

$ uniq -ci test.txt 
      1 computer
      3 data
      1 database
      1 programmer

Thus, we see all the lines of the file and the number of their repetitions. If you need to display only duplicate lines, then you should use the -D (duplicates) option:

$ uniq -Di test.txt
data
data
DATA
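
There is also a related -d option, which prints each duplicated line only once rather than every occurrence:

$ uniq -di test.txt
data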

So, we've figured out how to work with duplicates and perform the most basic actions: deleting them, counting repetitions, and displaying only the duplicates themselves.

However, even though the name of the utility suggests that it filters out all duplicates, that's not literally so: remember that uniq only compares adjacent lines:

$ echo "yes
no
yes" | uniq

yes
no
yes

We still have one extra 'yes', but if we sort the sequence before calling the uniq command, the result will be the following:

$ echo "yes
no
yes" | sort | uniq
no
yes

Great, now we have only unique values!
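
By the way, the -u option of sort that we saw earlier produces the same result in a single step:

$ echo "yes
no
yes" | sort -u
no
yes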

However, there can be more complex tasks to solve with combinations of these commands. Let's have a look at the examples of such tasks below.

Pipelining

Imagine you have to find the most frequent lines in the text. How can you do it?

First, you need to count how many times each line occurs in the file, and then arrange the lines by these frequencies in descending order. This can be done as follows:

$ sort test.txt | uniq -ci | sort -nrk1
      3 data
      1 programmer
      1 database
      1 computer

So, the first step is to sort the data in test.txt so that identical lines end up next to each other. The second step is to count how many times each line is repeated. After that, we get the corresponding number at the beginning of every line, i.e. we now have two columns. Then we sort the data numerically by the first column in reverse order and get the final result.

So, all the lines are now ordered by frequency. But the task can be even more specific: to find only the most frequent elements, that is, to display a certain number of lines with the highest frequency. In our small example, let it be only the very first line, with a value of 3. To do this, we need to add the head -1 command to the pipe:

$ sort test.txt | uniq -ci | sort -nrk1 | head -1
      3 data
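
Similarly, you can change the argument of head to take the top N lines, or sort in ascending order to look at the rarest entries instead. Here the three lines with a count of 1 are tied, and sort breaks the tie by comparing the whole lines:

$ sort test.txt | uniq -ci | sort -nk1 | head -1
      1 computer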

Now you know how to combine the sort and uniq commands with each other and with any other commands.

Conclusion

To sum up, now you know that the sort utility can output lines from files or commands in an ordered manner. You also know that you can remove, count, and display duplicates using the uniq command. Seems like it's time to practice with some exercises!
