Natural scienceBioinformaticsData and Tools

Bowtie2

Provided by: Edvancium

9 minutes read

Bowtie2 tool performs read mapping to a reference genome. It is an essential step in all bioinformatic pipelines utilizing DNA sequencing. Mapping is performed after trimming and quality control. In this topic, we will briefly discuss the bowtie2 algorithm and look at an example of its application.

Overview

Bowtie2 became a very popular short-read sequence aligner, because of its fast and accurate performance. It is commonly used to map sequencing reads to a reference genome. The user inputs sequencing reads and gets positions in the reference genome from which the reads originate. In general, read mapping to the genome means pair-wise alignment of the read and the genome. However, there are a few nuances. Direct alignment of each read to the genome is computationally expensive and takes a long time. To reduce computational time bowtie2 utilizes the seed-and-extend paradigm. It selects a short sequence — a seed in a read. The algorithm effectively searches for exact matches of the seed in the reference genome. Then it extracts genome regions containing the exact match and tries to conduct pair-wise alignment of the read and this region. Moreover, the algorithm does not search directly in the reference genome because this process would be computationally ineffective. Instead, bowtie2 preprocesses the reference genome to a special data structure called FM-index to search faster. Hence, the Bowtie2 workflow consists of two main steps: reference genome preprocessing (or FM-index construction) and read mapping to the reference.

FM-index

To speed up the search, each Burrows-Wheeler-based aligner, such as bowtie2, requires indexed genome reference. Firstly, the algorithm makes the sequence more compressed by calculating the Burrows-Wheeler transformation (BWT) of the reference genome. Let's consider a simple example of the BWT. Let "acgat" be our reference genome. Next, add the "$" symbol to indicate the end of the string. Then write down all circular shifts — that is moving the last symbol ("$") to the first position, while shifting all other symbols to the successive positions. After sorting the first column in lexicographic order (the dollar sign is considered the first), the last column will be the BWT of the string.

A character string is transformed by the BWT. The transform is done by sorting all the circular shifts of a text in lexicographic order.

Burrows-Wheeler matrix has an LF mapping property, where L stands for the last column and F — for the first one. The point is that, if we number letters of each type by their occurrence in the first column, this ranking will preserve in the last column. In our toy example, $a_{2}$ goes after $a_{1}$ in the L column and the same order is observed in the F column. We can restore the first column by taking BWT in lexicographic order. Moreover, there is an LF function that finds the letter's position in the first column based on its location in the BWT.

For the LF function, bowtie2 pre-calculates two essential matrixes — C and Occ. C[c] matrix represents the number of occurrences of lexicographically smaller characters for the letter c. Occ[c, i] matrix shows the number of occurrences of the letter "c" before position i+1 in the BWT. These two matrixes are utilized by the LF function. This function takes an index of a character in the last column and calculates its index in the first column. Together Burrows-Wheeler transformation, C and Occ matrixes make up FM-index.

Calculated C and Cc matrixes

Read mapping

As we said before, bowtie2 operates in the seed-and-extend paradigm. The first step of read mapping is searching for seed. The algorithm takes some fragments (seeds) from a read and performs an accurate search in the FM index. Bowtie2 default seed size is 20 bp. Let's return to the previous example. We already have BWT, C, and Occ matrixes and can restore the first column. Let "cga" be the seed we're looking for. The search for the exact match starts with its last character. Firstly, find the range of indexes with "a" in the first column using the C table. Now we get 2 queries, with "$" and "g" before the "a" symbols. Only the second variant suits us. The next step is to seek rows beginning with "g". LF-function tells us that "g" at the end of row 3 corresponds to index 5 in the first column. Lookup for the symbol in the last column gives us "c". Well, we find the seed and its position in the genome.

Pattern search from right to left by the last-to-first mapping

Normally, the algorithm searches for the exact match of the seed. However, Bowtie2 also supports modes that allow 1 or 2 mismatches for seed. But these regimes are rarely used. Keep in mind the tool extracts several seeds from a read. A seed can be found in multiple places in the genome. In this case, all findings will undergo an extension step. The extension could succeed in several areas and, therefore, read would be mapped to multiple positions which is called multi-mapping.

After a seed is found, bowtie2 starts the extension step. The tool performs pair-wise local alignment of the read and reference area where the seed was discovered. Bowtie2 utilizes Smith–Waterman algorithm. The tool allows two alignment modes: end-to-end and local. End-to-end approach searches for alignments involving whole-length sequences. The local mode supports clipping an alignment from one or both ends to maximize the alignment score. Local alignment can be helpful in various tasks, for instance, trimming adapters from reads.

Note that FM-index provides a fast search for seeds, but its construction could take a long time, especially for large genomes. There are read-mapping tools that exploit other data structures for searching in the reference sequence.

Install bowtie2

Now we are going to discuss bowtie2 usage. If you have questions not addressed below, feel free to check out the bowtie manual for more information.

If you have Bioconda installed, run this

conda install bowtie2

You can also download Bowtie 2 source and binary zip packs from the Download section section of the Sourceforge site.

Run bowtie2

Bowtie2 is easy-to-use. At first, we create FM-index (or index) from the reference genome with the following command:

bowtie2-build reference.fasta bowtie2_index

reference.fasta contains reference genome sequence in fasta format. bowtie2_index is a basename for all output index files.

The second step is read mapping performed by the following command:

bowtie2 -x bowtie2_index -1 read1.fastq -2 read2.fastq -S result.sam

The table below describes the parameters used here.

-x	The basename of index files.
-1	The file in fastq format, contains the first reads in pairs.
-2	The file in fastq format, contains the second reads in pairs.
-S	File to write SAM alignments to.

Bowtie2 works with both pair-ended and single-end reads. To map single-end reads we should specify -U flag, instead -1 and -2. The above-mentioned parameters are a required minimum. There are many additional variables, which can be adjusted by users depending on their biological task.

Except for the .sam file with read alignments, Bowtie2 also creates a report. There is an example of such a report for a pair-ended dataset below.

10000 reads; of these:
  10000 (100.00%) were paired; of these:
    650 (6.50%) aligned concordantly 0 times
    8823 (88.23%) aligned concordantly exactly 1 time
    527 (5.27%) aligned concordantly >1 times
    ----
    650 pairs aligned concordantly 0 times; of these:
      34 (5.23%) aligned discordantly 1 time
    ----
    616 pairs aligned 0 times concordantly or discordantly; of these:
      1232 mates make up the pairs; of these:
        660 (53.57%) aligned 0 times
        571 (46.35%) aligned exactly 1 time
        1 (0.08%) aligned >1 times
96.70% overall alignment rate

There are two important terms — concordant and discordant reads. They are relevant for pair-ended datasets. As you remember, there is a gap between two reads in a pair and its length is specific to sequencing protocols. If the distance between reference genome positions, where the reads from the same pair were mapped, complies with gap length, then reads are concordant. If this distance is much higher or smaller then reads are discordant.

Concordant read pairs are aligned respecting the inner distance, while discordant read pairs have unexpected span size

Conclusion

Bowtie2 is a tool dedicated to read mapping. It finds positions in the reference genome, from which reads originate. Bowtie2 operates on the seed-and-extend paradigm, it extracts short sequences (seeds) from a read and searches for an exact match in the reference genome. The tool tries to extend these seeds by performing a local alignment. Bowtie2 runs directly in the console and its workflow consists of only two steps. The bowtie2-build command creates an FM index and bowtie2 performs read mapping.

How did you like the theory?

Report a typo