In this topic, we will extend pairwise sequence alignment to three or more sequences. Notably, multiple sequence alignment provides us with new biological information that we can't infer from pairwise alignments. Multiple sequence alignment tells us about the evolution of organisms and important sequence features. However, dynamic programming, which we used in pairwise alignment, is too expensive for more than a handful of sequences. We will discuss methods that make multiple alignments fast, but still accurate.
Why do we need multiple alignment?
Multiple sequence alignment (MSA) is the process of aligning three or more biological sequences (DNA, RNA, or protein). Mathematically, MSA is sequence composition in which as many similar positions as possible are aligned with each other with minimum number of gaps. However, from a biological point of view, the definition of optimal MSA is not that simple. For example, we might want to test a hypothesis that a set of sequences are related and originated from a common ancestor — in other words, make a hypothesis of sequence homology. For this case, we would want to align the sequences column-wise to find homologous positions, which are derived from the same position in the hypothetical ancestral sequence.
In addition to analysis of evolutionary relationships, MSA construction is an important procedure in other sequence research. Looking at protein MSA, we may find sequence positions with high prevalence of some residues. Frequently observed residues in particular positions of proteins are said to be conserved. For some reason, mutations in these positions are not biologically allowed — they are likely to be deleterious, or critical for protein structure and function, so organisms with these mutations would not survive or produce offspring, preventing the mutations from joining the larger gene pool. Conserved position may give insight about the biological importance of a residue for the molecule's function and can suggest that we're working with:
components of the active center of an enzyme
important positions that form intramolecular contacts and maintain molecule structure
participants in intermolecular interactions, such as protein-protein interactions
alignment and sequencing errors
For example, in the following picture, we can see MSA of serine protease enzymes from several species. Residue Asp102, which is known to be the catalytic residue, shows 100% identity in a column. We would not have been able to distinguish this position from an occasional match in any pairwise alignment. Multiple alignment provides new information and can be constructed to improve results of pairwise alignment.
Why do we need multiple alignment? It allows us to find common patterns in different sequences that are not detectable when comparing only two sequences. With known conserved positions, we can make predictions of functionally critical residues in proteins. MSA is also an essential prerequisite to reconstructing evolutionary relationships between sequences and predicting macromolecule structure. Let's now discuss how we can construct MSA.
Scoring function
How can we choose which MSA is the best one? First, we should define an alignment score to compare different MSA. This score is typically calculated as a sum of scores of each column. Column score is defined as the sum of all-pairs (SP). SP-score for individual column with gaps can be calculated as follows:
There are other methods of score computation that are faster or more relevant for some biological tasks. In the Center star method, the center sequence is chosen and we align all other sequences with respect to the central one. The Phylogenetic-tree methods imply the construction of the tree of "evolutionary events." This method is accurate, but constructing trees is a hard task. The purpose of most multiple sequence alignment algorithms is to maximize the scoring function. It should be noted that the MSA with the highest score will not necessarily produce a biologically relevant result. Let's discuss algorithms that can be used for MSA construction.
Dynamic programming
In pairwise alignment, we operated with two sequences and used 2-dimensional dynamic programming to find the best-scoring alignment. With three sequences, we would build a 3-dimensional cubic structure as illustrated in the picture. However, the direct application of dynamic programming with more than a few sequences is expensive in both time and memory. All programs that are routinely used sacrifice accuracy and use heuristics. Heuristics might not suggest an optimal solution, but it is much faster than an exact method.
Heuristics in multiple sequence alignment
Progressive methods. The most common heuristic is referred to as "progressive alignment."
For all sequence pairs, we build pairwise alignments. Alignment scores are then converted in distances. As a result, we calculate the distance matrix between all sequence pairs by rules: (1) the higher the score, the less distance between sequences; (2) distance between the same sequences equals 0.
There are algorithms that convert the distance matrix to a guide tree. It should be noticed that a guide tree is an approximate tree, not a formally constructed phylogenetic tree. Each sequence is a tree leaf, and they are connected together with branches. The higher the percent of identity of sequences, the closer they are to each other in the tree.
Next, we combine pairwise alignments one by one to create multiple alignment. The tree dictates the order in which sequences can be merged — beginning with the most similar pair and progressing to the most distantly related. On this step, a leaf can be both a single sequence or sequences collapsed in alignment. Alignments are combined using 2-dimensional dynamic programming. However, instead of aligning two single residues, we align two blocks of residues.
The major disadvantage of progressive alignment is that the order of alignments matters. Errors can arise at any stage and propagate through the rest of the MSA process. The result highly depends on the quality of the initial alignment. An alternative method is iterative MSA, which is also based on heuristics.
Iterative methods. The iterative approach is based on the idea that an optimal solution can be achieved by repeated refinement of draft solutions produced by progressive alignment. Steps of the algorithm include re-aligning of subalignments (which can be chosen randomly or by rule) until there are no more possible improvements. The iterations correct any errors that may have occurred in the alignment process.
Block-based methods. Actually, progressive and iterative methods are largely global alignment methods. To find conserved regions in divergent sequences of varying lengths, block-based methods are used. The result of this approach is conserved, ungapped blocks shared by all sequences.
Multiple alignment tools
A very popular progressive global alignment method is the Clustal family, especially the weighted variant ClustalW, or modern version ClustalO. One of the most important features of Clustal is the flexibility of using substitution matrices. Clustal applies different matrices depending on degree of similarity. For example, Clustal prefers BLOSUM62 for closely related sequences and BLOSUM45 for more divergent sequences according to guide tree. Another improvement of the method is adjustable gap penalties. As an example, more gaps are allowed outside conserved regions than within conserved blocks. Another common progressive alignment method is T-Coffee algorithm. Compared with Clustal family, T-Coffee combines both global and local pairwise alignment. Moreover, the initial alignment is chosen from a pool of alternative alignments, which minimizes the error of not starting from an optimal priming alignment. T-Coffee outperforms Clustal algorithms when aligning moderately divergent sequences, but it is slower. The tools mentioned here can be accessed by GenomeNet web potral or EBI web portal.
Muscle and MAFFT are popular iterative alignment methods. For large datasets, several (for example, the first two) iterations can be run. On average, this gives accuracy comparable to T-Coffee and speeds much faster than ClustalW. Block-based algorithms are implemented in DiAlign2 tool and Match-Box tool.
Multiple alignment visualization
Short conserved sequence blocks can be expressed as single line consensus sequences. A consensus sequence is defined as the most frequent residues or nucleotides found at each position in multiple sequence alignment. Notice that for nucleotide sequence alignment, the sequences being aligned contain only A, T, G, C, while the consensus sequence may contain other letters when nucleotide in uncertain. The additional characters are standardized to different instances of uncertainty and are known as IUPAC nucleotide codes. For example, if you meet both A and T in an individual column, write W to express ambiguity.
We can also visualize sequence trends and patterns with a sequence logo, where the frequent character on its position is drawn as the tallest and placed on the top. The following picture illustrates sequence consensus, nucleotide codes table, and sequence logo, which was obtained from Weblogo web portal.
Even good alignments may contain misaligned regions. It is a good practice to check MSA and edit alignment if necessary. Validation may include introducing or removing gaps to get a biologically relevant result. Widely used programs for MSA visualization and editing are Jalview program and GeneDoc program.
Conclusion
Multiple sequence alignment is an essential technique in many bioinformatics applications, including phylogenetic tree estimation, structure prediction, and identification of important gene regions. Results of multiple alignment are more accurate and can be used to correct the results of pairwise alignment. Exact algorithms, such as dynamic programming, suggest optimal solutions, but are too computationally expensive to run on large datasets. Instead, heuristics are applied. These can be divided in three branches: progressive alignment, iterative alignment, and block-based alignment. Programs such as ClustalW, T-Coffee, Muscle, and MAFFT are the most widely used for constructing multiple alignments. Researchers should check and edit resulting alignments if necessary.