As you know from previous topics, alignment is a technique used to compare biological sequences when you are interested in the similarity between them. When comparing two sequences, it is called pairwise sequence alignment (PSA). In this topic, you will learn what algorithms are frequently used in PSA and what a substitution model is.
What is pairwise sequence alignment?
PSA can be global or local. Global PSA compares sequences across their whole length, and the lengths of both sequences need to be the same. Local alignment compares parts of the sequences and is suitable for sequences of different length.
Two terms need to be distinguished: sequence similarities and sequence identities. These terms are almost equivalent for nucleotide sequences because nucleotide mutation frequencies are almost equal. However, they differ for protein sequences because amino acids have different sizes and physicochemical properties, like acidity and hydrophobicity.
Alignment algorithms
The algorithms for global and local alignment are similar. Both types of algorithms can be based on the dot matrix method and the dynamic programming method.
Dot matrix method
The dot matrix method is based on the graphical representation of two sequences in a two-dimensional matrix. Sequences are written on the horizontal and vertical axes. If a residue in one sequence matches with a residue in the second sequence, you draw a dot in the grid space. Otherwise, the position is left blank. If two sequences share a high identity, dots will form long diagonal lines (red lines in the following figure). This method can be used to identify chromosomal repeats between two closely related genomes.
The problem with this method is that it is hard to illustrate alignment of long sequences and it only works for comparing two sequences. You can increase efficiency by combining several residues together in a "window" and comparing them simultaneously. However, the sensitivity of the alignment can be lost if the window size is too long. Another disadvantage is that the method doesn't provide enough data to assess the quality of the alignment. The next method we will present finds an optimal alignment in a more quantitative way.
Dynamic programming
The dynamic programming method can be divided into three steps: (1) constructing a two-dimensional matrix, (2) filling the matrix with scores, and (3) tracing back through the score matrix in reverse order to find the maximum scoring path. Let's see how it works in the example of global pairwise alignment, or Needleman-Wunsch algorithm. This algorithm is suitable only for closely-related sequences of similar length.
The first step is similar to the dot matrix method, in which the sequences are placed along the horizontal and vertical axes in a matrix.
The second step is filling the score and traceback matrices. Every cell in the score matrix is calculated as depicted below. In the figure, "g.p." means gap penalties, which represent insertions or deletions. The cost of these events varies and depends on whether it's an opening gap or extending gap (affine gap penalties; for instance, the cost of opening gap is -5, the cost of extending gap is -1). Scores of different substitutions also vary and depend on a substitution model that you can use.
The traceback matrix stores the direction from which the maximum score comes. The matrix is filled at the same time as the score matrix.
Filling both matrices is continued while at least one cell is blank. Once all cells are full, the third step starts. To get the optimal alignment, you trace back through the traceback matrix from bottom right corner to top left corner (blue path in the following figure). "Diag" means that the residues from the two sequences are aligned, "up" is a gap added in the top sequence (x-axis), "left" is a gap added in the left sequence (y-axis). The final alignment score is derived from the score matrix. If there is more than one path, the best alignment can be chosen arbitrarily. The whole process is depicted below.
The Smith-Waterman algorithm is a local pairwise alignment algorithm based on dynamic programming that finds local similar patterns in a pair of sequences. This means that the algorithm can find a few optimally aligned segments. The alignment path starts from the highest scoring position and ends when it reaches a zero. So, the path can begin and end internally. An example of a tool that uses a local alignment procedure is the BLAST tool that helps to find local similarities between two sequences very quickly.
The difference between the Smith-Waterman method and the Needleman-Wunsch method is that cells with negative values in the scoring matrix are set to zero for Smith-Waterman. This makes local alignments visible. This method is recommended when a pair of sequences has an unknown or high level of dissimilarity.
Substitution models
During the alignment procedure, you should use a scoring system for quantifying the likelihood of residue substitution. This system is called a substitution model. Such models create matrices of residue substitution frequencies that are often called scoring matrices or substitution matrices.
There are two types of models: mechanistic and empirical. Mechanistic models are mostly used in nucleotide sequences, and empirical models are used for amino acid sequences. The mechanistic model generally uses ten parameters: 4 nucleotide frequencies and 6 rates of substitutions. In contrast, empirical models are based on real-world amino acid frequencies observed in a large dataset and applied to constructed matrices on different datasets.
The simplest DNA model assumes equal frequency for each nucleotide (1/4) and equal rate of substitutions (the only model parameter, , in the following figure). This model was proposed in 1969 and is known as the Jukes-Cantor or JC69 model. In 1980, Motoo Kimura proposed a two-parameter DNA model (also known as the K80 or K2P model; and parameters in the following figure), where the rate of substitutions between purines and purines or between pyrimidines and pyrimidines (called transitions) is higher than the rate of substitutions between purines and pyrimidines (called transversions). Purines are A and G nucleotides, pyrimidines are C and T nucleotides. Since then, many additional DNA models have been constructed, and we cover these in topic BI: Substitution models.
There are two main empirical substitution matrices derived from actual alignment of closely relevant sequences: PAM ("point accepted mutation"; Dayhoff et al, "A model of evolutionary change in proteins", 1978) and BLOSUM ("blocks substitution matrix"). PAM matrices (or Dayhoff matrices) were constructed using a group of very closely related sequences, thus the matrices are not suitable for dissimilar sequences. To work with more divergent sequences, BLOSUM matrices were made. You can find additional information about amino acid substitution models in topic BI: Substitution models.
Conclusion
Pairwise alignment algorithms can be global or local, and these algorithms are fundamentally similar. You can use the dot matrix method for short sequences or identifying large chromosomal repeats in a genome. The dynamic programming approach can be used to search for the optimal alignment. The method can be divided into three steps: (1) locating sequences on matrix axes, (2) filling the matrix with scores, and (3) finding the highest scoring path. When filling the matrix, you should apply a substitution model that is used for more delicate scoring of substitution frequency.
In general, all substitution models can be divided into mechanistic and empirical models. Empirical models are based on real observations of amino acid frequencies, for which substitution matrices are constructed for use with different datasets. Mechanistic models assume nucleotide frequencies and substitution rates and can have different numbers of parameters depending on complexity (e.g. rate of transitions and transversions).