
Introduction to text processing: wc, cut, tr


In this topic, we will look at how to work with text data in the Unix terminal: how to display basic statistics and how to transform text. For this, we need three utilities: wc, cut, and tr. We will go through each of them below.

Text data counts

The wc command prints the newline, word, and byte counts, in that order, for each file, plus a total line if more than one file is specified. A word is a non-empty sequence of characters delimited by whitespace.

The command syntax is as follows: wc <file path>.

For example, for a test.txt file the result will look like this:

$ wc test.txt
 8  6 46 test.txt

It shows that the test.txt file has 8 lines, 6 words, and 46 bytes. If you need to process several files, specify them separated by a space:

$ wc test.txt data.csv
        8         6        46 test.txt
  1306143  16722950 124206772 data.csv
  1306151  16722956 124206818 total

If you do not specify any files and just run wc, the command reads from standard input. You may type as many lines of text as you need. When you finish, go to a new line and press Ctrl + D, which signals the end of input. wc then prints the result and exits:

$ wc
Hello
world
      2       2      12

You can also send the output of another command to wc through a pipe, for example: $ echo "Hello world" | wc. Thus, with this command you can get basic statistics on text from files or from standard input. For more command options, use wc --help. Next, we will look at how to transform text in the terminal using the cut and tr commands.
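When only one of the three counts is needed, wc accepts the -l (lines), -w (words), and -c (bytes) options. The following is a minimal sketch that uses printf to produce the same two lines as in the example above. The numbers in the comments are the counts themselves; some implementations pad them with leading spaces.

```shell
# Feed two lines to wc through a pipe and request one count at a time.
printf 'Hello\nworld\n' | wc -l   # newline count: 2
printf 'Hello\nworld\n' | wc -w   # word count: 2
printf 'Hello\nworld\n' | wc -c   # byte count: 12
```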

Cutting lines

The cut command is used when you need to cut out a part of each line of text, either from a file or from standard input. Generally, the command syntax is as follows: cut <options> <file path>.

The parts to cut may be selected by characters (-c), fields (-f), or bytes (-b). Fields are used most often, and the most frequent task is to split text by a delimiter. The delimiter is set with the -d option and is TAB by default.

So, we will pay more attention to the -f and -d options. To get information about the others, use cut --help.
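For completeness, here is a short sketch of selection by position with -c and -b. The input string is only an illustration. For plain ASCII text such as this, one character is one byte, so the two options select the same parts.

```shell
# Select by position in the line instead of by field.
echo "Hello world" | cut -c 1-5   # characters 1 through 5: Hello
echo "Hello world" | cut -c 7-    # character 7 to the end of the line: world
echo "Hello world" | cut -b 1,7   # bytes 1 and 7: Hw
```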

To begin with, one may need to extract a table column. For example, there is a seasons.txt file with the following content:

Winter: white: Weather: cold
Spring: green: Snow:melted
Summer: bright: Temperature: hot
Autumn: yellow: Leaves: cool

That is a table with 4 columns and : as a delimiter. In order to print the first column, one needs to type the following:

$ cut -d ':' -f 1 seasons.txt
Winter
Spring
Summer
Autumn

So, one should indicate the delimiter, the column number, and the file name.
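Note that cut splits exactly on the delimiter and does not trim anything around it. The sketch below recreates the sample file in the current directory (an assumption made so that the example is self-contained) and prints the second column, where every value keeps the space that follows the colon. Removing the spaces with tr, which is covered later in this topic, is one possible cleanup.

```shell
# Recreate the sample file from the text so the example runs on its own.
cat > seasons.txt <<'EOF'
Winter: white: Weather: cold
Spring: green: Snow:melted
Summer: bright: Temperature: hot
Autumn: yellow: Leaves: cool
EOF

# Second column: each value starts with the space that followed ':'.
cut -d ':' -f 2 seasons.txt

# One possible cleanup: delete all spaces from the extracted column.
cut -d ':' -f 2 seasons.txt | tr -d ' '
```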

To cut out part of a line that comes from standard input, pipe the text into cut:

$ echo "The sky is blue" | cut -d ' ' -f 1
The

Here the line uses a space as the delimiter, and we cut out the first word.

You can also specify a range of fields to cut. For instance, if you specify the range 1-2 in the example above, the result will be the string "The sky".
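The range syntax can be sketched on the same sentence. Besides a closed range such as 1-2, the field list may be an open range or a comma-separated list of field numbers.

```shell
# Closed range, open range, and a list of individual fields.
echo "The sky is blue" | cut -d ' ' -f 1-2   # fields 1 through 2: The sky
echo "The sky is blue" | cut -d ' ' -f 3-    # field 3 to the end: is blue
echo "The sky is blue" | cut -d ' ' -f 1,4   # fields 1 and 4: The blue
```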

So, this is how you can cut some text parts out of a file or standard input. In the next step, we will find out how to transform text characters using the tr command.

Transforming characters

The tr command can translate, squeeze, and delete characters from standard input, writing the result to standard output. The program processes the text character by character. Its general syntax is as follows: tr <options>... <set1> <set2>. Sets are specified as strings of characters.

Unlike wc and cut, tr does not take a file path: it reads only standard input, so it usually receives the output of another program through a pipe.

Let's look at some examples of how tr is used. To replace characters, for example, "a" with "o", you need to type the following:

$ echo lalala | tr a o
lololo
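Sets may contain more than one character, in which case each character of the first set is replaced with the character at the same position in the second set. A minimal sketch follows; the [:lower:] and [:upper:] character classes are used here for case conversion, and the input strings are only illustrations.

```shell
# Map each lowercase letter to the corresponding uppercase letter.
echo 'Linux Ubuntu' | tr '[:lower:]' '[:upper:]'   # LINUX UBUNTU

# Two-character sets: "l" becomes "p" and "a" becomes "o".
echo 'lalala' | tr 'la' 'po'                       # popopo
```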

To delete characters, use the -d option. For example, let's delete all the lowercase "u" characters (the uppercase "U" is a different character and stays):

$ echo 'Linux Ubuntu' | tr -d 'u'
Linx Ubnt

If instead you need to keep only that character and delete everything else, combine the -d option with -c, which makes tr operate on the complement of the set, that is, on every character except "u":

$ echo 'Linux Ubuntu' | tr -dc 'u'
uuu
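With -dc, the newline at the end of the line is deleted too, because it is not part of the set. A common way around this is to add \n to the set so that line breaks survive. The sketch below shows this, along with a second, made-up input string that keeps only the digits.

```shell
# Keep "u" and the newline; everything else is deleted.
echo 'Linux Ubuntu' | tr -dc 'u\n'                 # uuu

# Keep only digits and the newline (the input text is an illustration).
echo 'Order 66, table 7' | tr -dc '[:digit:]\n'    # 667
```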

Another useful option is -s, which squeezes each run of a repeated character from the set into a single occurrence. For example, with this option we can get rid of repeated spaces:

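$ echo 'Repeated  spaces in  line' | tr -s '[:space:]'
Repeated spaces in line

Here we worked with the predefined [:space:] character class, which matches whitespace characters such as spaces, tabs, and newlines. It is quoted so that the shell passes it to tr unchanged instead of treating the brackets as a filename pattern.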

To learn about other character classes and options, use tr --help.
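Squeezing can also be combined with translation: when -s is given together with two sets, tr first translates and then squeezes the repeated characters of the second set. The sketch below uses this to put each word on its own line and then counts the lines with wc. The count in the comment may be padded with spaces on some systems.

```shell
# Replace every run of spaces with a single newline: one word per line.
echo 'Repeated  spaces in  line' | tr -s ' ' '\n'

# Count the resulting lines, which here equals the number of words.
echo 'Repeated  spaces in  line' | tr -s ' ' '\n' | wc -l   # 4
```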

Conclusion

To sum up, in this topic we've learned the basics of how to work with text in the Unix terminal:

  • In order to display some basic text statistics, use the wc command.
  • Use the cut command to cut out parts of text.
  • To translate, squeeze, and delete characters, use the tr command.