7 minutes read

The BLAST algorithm, which is one of the most used algorithms in the field of biology, allows us to find regions of local similarity between biological sequences, such as nucleotide sequences and amino acid sequences. A similarity search is often the first step of a bioinformatics study. It is therefore important to know how BLAST works and how to use it properly and correctly interpret algorithm results.

What is BLAST?

Imagine that you have determined a novel sequence in your laboratory. Perhaps you sequenced and assembled the genome from a study of the bacterial community in an exotic habitat (metagenomics). When you don't know much about your sequence, a good start is the common practice of finding the most similar sequences that already exist in biological sequence databases. We may hope that similar sequences are already stored and annotated.

BLAST (Basic Local Alignment Search Tool) has become a fundamental searching and alignment tool due to its accuracy and short execution time. BLAST performs comparisons between pairs of sequences, searching for regions of local similarity. Local alignment is more effective in the case of unknown sequences because it finds even short stretches of similarity between sequences.

To run the software, BLAST requires a query sequence to search for and a sequence database to search against. The algorithm calculates the score of all pairwise alignments and displays significant findings. Significance is defined as scores that are higher than those that could occur by chance. Aligning against a huge database is a very computationally intensive process. Effective algorithms are required to get reliable results for reasonable time.

What makes BLAST so fast?

BLAST identifies similar sequences using a heuristic method. Heuristics are strategies designed to solve a problem more quickly than an exact method, or to solve a problem when the algorithm is not known. They are useful, but their solutions might not be optimal. In contrast, the Smith-Waterman algorithm is an exact method that calculates optimal alignment but is too slow for searching against large sequence databases.

The BLAST algorithm is based on the assumption that good alignments contain short lengths of exact matches. Because of the theory behind the algorithm, BLAST does not guarantee the most precise results, so we might not find some relevant sequence pairs. Nevertheless, the algorithm performs very well in practice. What makes BLAST so fast?

1. Query preprocessing (seeding). In fact, BLAST doesn't align the whole query sequence at one time. Instead, the algorithm breaks the query sequence into substrings, or words, with a particular number of letters. For instance, if a query sequence is PROTEIN with word length 3, the searched words are PRO, ROT, OTE, TEI, EIN.

The query sequence is split into overlapping words of length 3

In addition to finding exact words, BLAST also creates a table of similar words, or neighborhood words. Not all similar words are equal. Depending on the physicochemical properties of amino acids, certain amino acid substitutions 'cost' more than others. These relationships are described in a scoring matrix, which is default BLOSUM62 for protein BLAST. Neighborhood words are generated when the word score exceeds the threshold, T. As a result, we preprocess our query sequence and prepare a dictionary of exact and slightly modified words.

A table of neighborhood words that passed the threshold

2. Database preprocessing. The enormous speed achieved by BLAST is due to its use of an indexed table of database words. BLAST stores words with indexes — positions at which the word occurs. Using an index, it is easy to find every word occurrence in a sequence database.

BLAST indexed table of words: for each word BLAST stores sequence ID and the position at which the word occurs

3. Extension. An alignment begins with 'anchors' — two word matches between query and database sequence. The distance between two anchors must not be more than window_size parameter (the default value is 40) in both query and target sequences.

Anchors then extend letter-by-letter in both directions. Each time an alignment is extended, an alignment score increases or decreases. It should be noted that BLAST doesn't allow gaps in primary anchor alignments. When the alignment score drops below a predetermined threshold X, the ungapped extension is terminated. Next, if the maximum score is above threshold S, the alignment will be extended with gaps. The extension continues until the score drops below maximum by an amount greater than xdrop_gap parameter (15 by default). The result is that primary anchors are connected in the longer alignment.

Alignment extension graph: extension continues until the score drops below the threshold

An example of anchor sequence extension, exact and similar letters are shown

Understanding the alignment significance

BLAST often suggests too many results that should then be prioritized by their significance. We will consider significance as a degree of similarity between two sequences, or uniqueness of our finding. Imagine you are studying newly discovered small RNA and you decide to search its sequence against the rat genome. In the following picture, we can see an alignment with 34 matches and 4 mismatches. Intuitively, when checking a short sequence against a big genome (the rat genome is 2.7 billion bases long) we have a good chance of having well-aligned sequences just by chance. When is the score high enough to assume the finding is significant?

Local alignment of short query sequence: Score, E-value and other metrics are shown

A possible approach is shuffling all the letters in database sequences and calculating the score distribution in this 'random' database with the same query sequence. As you may know from statistics classes, the probability that the random score x is higher than the score for our sequence S is called the P-value:

P-value(S)=P(xS){\operatorname{P-value } (S)} = {P(x \geq S)}

P-values should be corrected due to multiple testing. In the case of multiple pairwise comparisons without appropriate correction, we get a high proportion of false positive results. Expect value (E-value), a key metric in BLAST output, is just a correction of P-value for multiple testing.

E-value(S)=P-value(S)×n×m{\operatorname{E-value } }(S) = {\operatorname{P-value } }(S) \times n \times m

where n is the length of the query,

m is the sum of length of all sequences in the database,

S is an observed alignment score

Formally, E-value is defined as the expected number of hits (alignments) you expect to see by chance, with the observed score or higher. In our example, E-value (Expect) 0.001 means that 1000 random shuffling of the database is needed to find 1 alignment with as good as or better score. Obviously, shuffling of letters is not applied when calculating E-value. Mathematically, E-value can be calculated using the following equation:

E-value(S)=K×n×m×eλ×S{\operatorname{E-value } }(S) = K \times n \times m \times \mathrm{e}^{-\lambda \times S}

K and λ are positive constants that depend on details of the scoring matrix and the composition of your sequence. Once experimentally calculated for sets of parameters, K and λ are stored and used for E-value calculation.

Conclusions

  • BLAST is a fast and sensitive local alignment search tool that allows scientists to compare nucleotide and protein sequences to large databases.

  • BLAST algorithm is based on heuristics and performs very well in practice, although it may not find optimal alignments.

  • One of the notable BLAST innovations is the statistical significance for each sequence alignment result (E-value).

2 learners liked this piece of theory. 0 didn't like it. What about you?
Report a typo