Running BLAST

In this topic, we will discuss how to make BLAST requests and interpret algorithm output. Running the default search may not produce optimal results, but BLAST helpfully suggests different options to make the search more restrictive or inclusive. It is important to choose the right BLAST algorithm and adjust parameters to get the best possible alignments to your query sequence.

Choosing BLAST tool

You can run BLAST queries using the command line or the NCBI BLAST web page. In both cases, first you need to determine which BLAST tool to run. You will probably want to find similar sequences on both the nucleotide and amino acid levels. Due to the degeneracy of the genetic code, not all DNA mutations result in amino acid changes. For this reason, a DNA sequence typically evolves more rapidly than the protein sequence it encodes.

There are five all-purpose BLAST programs, each with specialized features: blastn, blastp, blastx, tblastn, and tblastx. The first one, blastn, compares a nucleotide query sequence with a nucleotide database. Another tool, blastp, compares the user's protein sequence with a database of protein sequences. The remaining three programs are translated BLAST tools (see details in the following table).

Program   Query type   Database type   Comparison
blastn    Nucleotide   Nucleotide      Nucleotide-Nucleotide
blastp    Protein      Protein         Protein-Protein
blastx    Nucleotide   Protein         Protein-Protein
tblastn   Protein      Nucleotide      Protein-Protein
tblastx   Nucleotide   Nucleotide      Protein-Protein
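
The table above can be read as a lookup from query and database types to a program. The following is a minimal sketch of that lookup; the names BLAST_PROGRAMS and candidate_programs are hypothetical helpers introduced here for illustration, not part of BLAST itself.

```python
# Lookup built directly from the table above:
# (query type, database type) -> programs that accept that combination.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],
    ("protein", "protein"): ["blastp"],
    ("nucleotide", "protein"): ["blastx"],
    ("protein", "nucleotide"): ["tblastn"],
}


def candidate_programs(query_type, database_type):
    """Return the BLAST programs that fit the given query/database types."""
    key = (query_type.lower(), database_type.lower())
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"unknown combination: {key}")
    return BLAST_PROGRAMS[key]


print(candidate_programs("nucleotide", "protein"))     # ['blastx']
print(candidate_programs("nucleotide", "nucleotide"))  # ['blastn', 'tblastx']
```

For a nucleotide query against a nucleotide database there are two candidates: blastn compares the sequences directly, while tblastx compares their translations.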

The blastx program translates the user's nucleotide sequence into protein sequences and queries them against a protein database. The query nucleotide sequence is translated in all six possible reading frames: three overlapping reading frames in the forward direction and three on the complementary strand in the reverse direction. Each of the translated sequences is then compared to the protein sequences in the database. Blastx is recommended when you want to find out whether your novel nucleotide sequence is a protein-coding gene and to locate an open reading frame (a nucleotide sequence that potentially encodes a protein).

The translation of double-stranded DNA in all possible reading frames
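
To make the six reading frames concrete, here is a small sketch of six-frame translation. It assumes the standard genetic code and is only an illustration of the idea, not the code BLAST uses; the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Translate three forward (+) and three reverse (-) reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translation("ATGGCCTAA").items():
    print(frame, protein)
```

Each of the six translated strings would then be searched against the protein database; "*" marks a stop codon.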

Another type, tblastn, compares a protein query sequence to a nucleotide database that has been translated in all six reading frames. It is useful when you already know the query protein and want to compare it with all hypothetical proteins encoded in the database, for example in a draft genome record that remains unannotated. Finally, tblastx translates both the query sequence and the target sequences from the database, which makes it the most computationally expensive type.

Running BLAST

Using web BLAST, you can choose an algorithm on the BLAST Main page or on the BLAST search page. For a simple search, you only need to enter a query sequence in FASTA format (see FASTA topic) or an accession ID in the input box. Additionally, you can restrict the search to a particular organism or sequence database.

BLAST search page

Database. You should specify the type of database, which will probably be "Standard database (nr etc.)" for the vast majority of tasks. The most comprehensive standard database is the non-redundant database (nr), which combines sequences from different curated and non-curated databases. You may instead choose a specific database that meets your needs or contains higher-quality data. For example, you can query your protein sequence only against the Swiss-Prot database, which is curated. When working with a non-default database, always get acquainted with its contents first. BLAST documentation provides extensive help on database selection.

Program selection. In addition to choosing the "core" tool, you need to choose the most suitable program. BLAST programs are suited for different tasks and have their own benefits (see the following table).

Nucleotide programs       Preferred tasks
megablast                 Long alignments between very similar sequences. Intra-species alignments.
blastn                    Short alignments between distantly related sequences. Cross-species searches.
discontiguous megablast   Middle option: more dissimilar sequences than megablast. Intra- and cross-species.

Protein programs          Preferred tasks
blastp                    "Default" protein blast.
Quick blastp              Very similar sequences. Fast search against the non-redundant (nr) protein database.
PSI-BLAST                 Distantly-related protein sequences.
PHI-BLAST                 Searches for a pattern in an input sequence.
DELTA-BLAST               Distantly-related protein sequences. Sequences with known conserved domains.

Advanced searching

For more advanced searching, BLAST offers a number of different options in the Algorithm parameters section, which can be changed to obtain the best possible result. Extensive help is available for each option. Let's discuss some of them.

Algorithm parameters section

Word size. Changing the word size, W, can significantly change search results. BLAST starts by deriving a list of words of length W from the query; by default, W = 6 for amino acid sequences and 11 for nucleotide sequences (the exact default depends on the program; megablast, for example, uses a much larger word size). Increasing W speeds up the search but reduces sensitivity, so significant hits may be missed.
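
As an illustration of what "a list of words with length W" means, the sketch below splits a query into all overlapping words of a given size. It is a simplification: real BLAST also extends protein word lists with similar "neighborhood" words, which is omitted here. The function name query_words is a hypothetical helper.

```python
def query_words(sequence, word_size):
    """Return every overlapping word of length word_size in the sequence."""
    if word_size < 1 or word_size > len(sequence):
        raise ValueError("word_size must be between 1 and len(sequence)")
    return [
        sequence[i:i + word_size]
        for i in range(len(sequence) - word_size + 1)
    ]


print(query_words("PQGEFG", 3))  # ['PQG', 'QGE', 'GEF', 'EFG']

# A longer word size yields fewer, more specific words to look up.
query = "ATGGCCATTGTAATGGGCCGCTG"
for w in (7, 11, 15):
    print(w, len(query_words(query, w)))
```

A sequence of length L contains L - W + 1 words, and a longer word is less likely to match a database sequence by chance, which is why larger W makes the search faster but less sensitive.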

Gap penalty. The gap opening cost (existence) is the price of introducing a gap in the alignment, and the gap extension cost is the price of each additional position by which the gap is extended. If the gap penalties are too high, gaps are avoided and sequences with insertions or deletions can't be aligned properly. If the gap penalties are too low, gaps are inserted everywhere in place of mismatches.
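
The effect of separate opening and extension costs can be sketched with a few lines of arithmetic. This example assumes the convention in which a gap of length k costs existence + k × extension, with 11 and 1 as illustrative values (the usual blastp settings with BLOSUM62); other tools may charge the extension only from the second gap position.

```python
def gap_cost(length, existence=11, extension=1):
    """Affine gap cost: one opening charge plus a charge per gap position."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return existence + extension * length


# One gap of three positions is much cheaper than three separate gaps.
print(gap_cost(3))      # 14
print(3 * gap_cost(1))  # 36
```

Because opening a gap is expensive and extending it is cheap, the alignment prefers a few long gaps over many scattered ones.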

Sequences with biased letter composition. Low-complexity regions have a much simpler composition than typical sequences and may cause problems during a BLAST search. A low-complexity region of a protein may look like PGQQQQQPGQQQQQQ, which is an example of a polyglutamine repeat. By default, BLAST filters low-complexity regions and applies compositional adjustment of the scoring matrices. Low-complexity stretches are shown as lowercase gray letters in the alignment. Note that repetitive and low-complexity regions can yield an extremely large number of statistically significant hits that are not biologically relevant.

An example of low-complexity region, the upper sequence contains repeats of single amino acids
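
One simple way to see why such a region counts as "low complexity" is to measure the Shannon entropy of short windows. The sketch below is a simplified stand-in and not the filter BLAST actually applies; the window size and threshold are arbitrary assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the letter composition of a window."""
    total = len(window)
    return sum(
        (count / total) * math.log2(total / count)
        for count in Counter(window).values()
    )


def low_complexity_windows(sequence, window=6, threshold=1.0):
    """Return start positions of windows whose entropy is below threshold."""
    return [
        start
        for start in range(len(sequence) - window + 1)
        if shannon_entropy(sequence[start:start + window]) < threshold
    ]


sequence = "PGQQQQQPGQQQQQQ"
for start in low_complexity_windows(sequence):
    print(start, sequence[start:start + 6])
```

Windows dominated by a single residue, such as runs of Q, have entropy close to zero, while windows from a typical protein region use many different residues and score much higher.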

BLAST is a very popular program, so the NCBI server is often overloaded with requests. Sometimes you may wait tens of minutes, or even longer, for your search results. Fortunately, BLAST stores the results of all queries for 36 hours, and each query is assigned a special identifier, the Request ID. You can even close the search window and retrieve the results later by ID on the Recent results page.

Standalone BLAST. The web version of NCBI BLAST has a convenient graphical interface. However, it becomes limiting if you run high volumes of BLAST searches or want to search against your own custom database. For these purposes, you can install the BLAST+ command-line tools. Assume you need to find sequences similar to the protein from the file protein.fasta in a proteome stored in proteome.fasta. To do this, first build a BLAST database from proteome.fasta with the makeblastdb command, which converts the FASTA file into the indexed database files that BLAST searches. Next, run blastp with the options -query, -db, and -out.
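
The two steps described above can be sketched as a small Python wrapper that assembles the command lines. The database name proteome_db, the output file name results.txt, and the helper build_blast_commands are assumptions made for this example; actually running the commands requires BLAST+ to be installed, so the call to subprocess is left commented out.

```python
import shlex


def build_blast_commands(query_fasta, database_fasta, output_file,
                         db_name="proteome_db"):
    """Return the makeblastdb and blastp argument lists for a local search."""
    makeblastdb_cmd = [
        "makeblastdb",
        "-in", database_fasta,  # FASTA file to turn into a database
        "-dbtype", "prot",      # protein database
        "-out", db_name,        # base name of the database files
    ]
    blastp_cmd = [
        "blastp",
        "-query", query_fasta,
        "-db", db_name,
        "-out", output_file,
    ]
    return makeblastdb_cmd, blastp_cmd


commands = build_blast_commands("protein.fasta", "proteome.fasta", "results.txt")
for command in commands:
    print(shlex.join(command))
    # With BLAST+ installed, the command could be executed like this:
    # import subprocess; subprocess.run(command, check=True)
```

Keeping the arguments as lists avoids shell quoting problems when file names contain spaces.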

BLAST output

BLAST output has several parts:

  1. Heading with running details

  2. A graphical display (in web BLAST NCBI)

  3. List of hits

  4. Individual alignments with calculated parameters

The web interface of NCBI BLAST shows a graphical summary of the best hits. The top segment displays the query sequence and a color key for scores. The colored bars represent alignments, and the position of each bar indicates the region of the query sequence that the alignment covers. A thin gray line (marked with *) indicates that two alignment blocks come from the same database sequence.

Modules of BLAST output: graphic summary of the alignments, a hit list and individual alignments

Each individual alignment is accompanied by the length of the hit sequence and other metrics:

Score is a number indicating overall quality of an alignment. The score depends on the substitution matrix and penalties for gaps. Higher scores correspond to higher similarity.

E-value. Briefly, the lower the E-value, the more significant the match: a low E-value indicates a hit that is unlikely to arise by chance. Note that E-values cannot be compared directly between searches against databases of different sizes; even identical alignments receive different E-values from different databases. To compare different runs, use the normalized score (Bit Score), which does not depend on query length or database size. The higher the Bit Score, the better the sequence similarity. Remember that 8e-34 means 8 × 10^(-34).

$$\text{E-value}(S) = n \times m \times 2^{-\text{Bit-Score}}$$

where n is the length of the query and m is the total length of all sequences in the database.
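
The formula can be checked with a few lines of code. The sketch below evaluates it for the same Bit Score against two databases of different total length; the numbers are made up for illustration.

```python
def e_value(bit_score, query_length, database_length):
    """E-value = n * m * 2^(-bit score), following the formula above."""
    return query_length * database_length * 2.0 ** (-bit_score)


bit_score = 50
query_length = 300

# The same alignment (same Bit Score) searched against a small and a large database.
for database_length in (1_000_000, 1_000_000_000):
    print(database_length, e_value(bit_score, query_length, database_length))
```

A database that is a thousand times larger gives an E-value that is a thousand times larger for the identical alignment, which is why E-values from different databases are not directly comparable while Bit Scores are.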

Identities (%) is the percentage of aligned positions at which the two sequences have identical letters. The higher the percent identity, the closer the match.

Positives (%) is the percentage of aligned positions at which the two letters are identical or similar, meaning that the residue pair has a positive value in the substitution matrix.

Query Cover (%) is the percentage of the query sequence length that is covered by the alignment with the database sequence.
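
As a rough illustration, Identities and Query Cover can be computed from a single gapped alignment as follows. This is a sketch under simplifying assumptions: gaps are written as "-", identities are counted over all alignment columns, and only one alignment per hit is considered. Positives are left out because they require a substitution matrix.

```python
def alignment_metrics(aligned_query, aligned_subject, query_length):
    """Percent identity and query cover for one pairwise alignment."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    columns = len(aligned_query)
    identical = sum(
        q == s and q != "-"
        for q, s in zip(aligned_query, aligned_subject)
    )
    # Query residues that take part in the alignment (gaps excluded).
    covered = sum(ch != "-" for ch in aligned_query)
    return {
        "identities_pct": 100.0 * identical / columns,
        "query_cover_pct": 100.0 * covered / query_length,
    }


# Hypothetical alignment of part of a 10-residue query.
metrics = alignment_metrics("MKT-LV", "MKSALV", query_length=10)
print(metrics)
```

Here four of six columns are identical, and five of the ten query residues take part in the alignment.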

Additionally, the Descriptions section provides more metrics. Max Score is the highest Bit Score among the alignments of the query sequence with a given hit, while Total Score is the sum of the Bit Scores of all alignments with that hit.
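
When a database sequence aligns to the query in several separate blocks, the two summary scores differ. A minimal sketch, with made-up Bit Scores and a hypothetical helper name:

```python
def summarize_hit(alignment_bit_scores):
    """Max Score and Total Score for one hit with one or more alignments."""
    if not alignment_bit_scores:
        raise ValueError("a hit needs at least one alignment")
    return {
        "max_score": max(alignment_bit_scores),
        "total_score": sum(alignment_bit_scores),
    }


print(summarize_hit([210.0]))              # single alignment: both scores equal
print(summarize_hit([210.0, 95.5, 40.0]))  # three alignment blocks
```

If the two values are equal, the hit consists of a single alignment; a Total Score well above the Max Score signals several alignment blocks with the same database sequence.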

Interpretation of BLAST results

Note that BLAST itself does not establish sequence homology (the inference that sequences derive from a common ancestral sequence). We can only infer biologically meaningful relationships from alignment parameters, such as a low E-value. At the same time, a large E-value doesn't necessarily mean that the hit is irrelevant; it means that an unrelated sequence has a good chance of reaching a score just as high. Another important point is that the E-value depends strongly on the database size and the length of the query sequence. Short sequences tend to have larger E-values, so additional evidence is needed to conclude homology.

Conclusion

You can access the NCBI BLAST service from a web browser or download standalone BLAST+ to your local computer. Before running a search, it is essential to adjust the parameters for your query sequence to get the best possible result. Among the BLAST output metrics, the E-value and the Bit Score evaluate the significance of hits. The lower the E-value, the less likely the hit is to have arisen by chance. However, BLAST does not decide on sequence homology: researchers must draw conclusions about the biological meaning of alignments on their own.
