Natural scienceBioinformaticsBioinformatics algorithmsAlignment algorithms

Substitution models

7 minutes read

As you know from the topic "Pairwise alignment," alignment algorithms can be divided into three steps, one of which is filling the scoring matrix. During this step, you should take into account the likelihood of residue substitution. There are different substitution models for DNA and amino acid sequences that help to quantify such likelihoods. In this topic, we cover a few models for DNA and amino acid sequences and how to choose a model that fits your goals for sequence analysis.

Substitution models for DNA sequences

The first and simplest DNA model was introduced in 1969 by Jukes and Cantor. It is called JC or JC69. In this model, nucleotide frequencies are assumed to be equal (p(A)=p(C)=p(G)=p(T)=1/4), and the rate for each type of substitution is also assumed to be the same (μ\mu). So the only parameter here is the substitution rate, μ\mu.

In 1980, Motoo Kimura improved the JC69 model and made a two-parameter model (called K80 or K2P). K80 also assumes equivalent nucleotide frequencies (=1/4), but divided the set of all possible substitutions into two categories: transitions and transversions. Transitions are substitutions between purines and purines or between pyrimidines and pyrimidines. Transversions are substitutions between purines and pyrimidines. Adenine and guanine are purines, cytosine and thymine are pyrimidines (these terms are depicted in the following figure).

Transitions and transversions

In the following year, Kimura went further, breaking transversions into two categories: β\beta, that conserve the strong/weak properties of nucleotides (A-T, C-G), and γ\gamma, that conserve the amino/keto properties of nucleotides (A-C, G-T). This model is called K81 or K3P.

In the same year, Joseph Felsenstein proposed the F81 model in which nucleotide frequencies may vary from 1/4 and the substitution rate is the same for all substitutions (μ\mu).

In 1985, the HKY85 model by Hasegawa et al was introduced. HKY85 is essentially a mix of the K80 and F81 models, meaning nucleotide frequencies vary (not all equal to 1/4) and substitutions are divided into transitions and transversions.

All these DNA models were combined to form the most general model possible, called generalized time reversible model (GTR), introduced in 1986 by Simon Tavare. It assumes 4 unknown nucleotide frequencies and 6 unknown nucleotide substitution rates.

The following table summarize the described models.

Model name

Nucleotide frequencies

Substitution rate

JC or JC69

πA=πC=πG=πT=0.25\pi_A=\pi_C=\pi_G=\pi_T=0.25

μ\mu

K80 or K2P

πA=πC=πG=πT=0.25\pi_A=\pi_C=\pi_G=\pi_T=0.25

transitions, transversions

K81 or K3P

πA=πC=πG=πT=0.25\pi_A=\pi_C=\pi_G=\pi_T=0.25

transitions, β\beta-transversions, γ\gamma-transversions

F81

πAπCπGπT\pi_A\neq\pi_C\neq\pi_G\neq\pi_T

μ\mu

HKY85

πAπCπGπT\pi_A\neq\pi_C\neq\pi_G\neq\pi_T

transitions, transversions

GTR

πAπCπGπT\pi_A\neq\pi_C\neq\pi_G\neq\pi_T

all substitutions are not equal

Substitution models for amino acid sequences

Let's focus on classical amino acid substitution models that you will definitely encounter when performing phylogenetic reconstructions or alignment procedures.

There are two main empirical substitution matrices derived from actual alignment of closely relevant sequences: PAM ("point accepted mutation" from Dayhoff et al, "A model of evolutionary change in proteins," 1978) and BLOSUM ("blocks substitution matrix"). Every matrix is 20 by 20, where cell value represents the likelihood of substitution. A positive score means that the substitution frequency found in a dataset of homologous sequences is higher than randomly occurring substitution. A negative value means the opposite, that a randomly occurring substitution is more likely than the substitution in the homologous datasets. A zero score means that the frequencies found in a dataset of homologous sequences and randomly occurring substitution are equal.

These matrices convert frequencies using log-odds scores – log(observed frequency/expected frequency by random chance) of base 2 or 10.

PAM

PAM matrices (or Dayhoff matrices) were constructed using a group of very closely related sequences. Authors found 1572 accepted point mutations (accepted means "accepted by natural selection") and constructed a matrix of substitution frequencies, PAM1. This matrix contains substitutions that arise in a short evolutionary interval – only 1 accepted point mutation occurred per 100 amino acids residues.

Then the authors multiplied the PAM1 matrix by itself N times to extrapolate the results on more dissimilar or divergent sequences. They constructed the PAM250 (N=250) matrix where 250 accepted point mutations occurred per 100 amino acid residues. Interestingly, in this matrix, asparagine was found to be extremely mutable (only 6% of asparagines remained unchanged) and 55% of tryptophans remained unchanged. This is understandable: asparagine is known to be highly mutable, whereas tryptophan is large and has distinctive chemistry, so substitution is uncommon.

BLOSUM

The PAM matrices were constructed on a rather small dataset and included very similar protein sequences. To gain statistics on more divergent sequences, BLOSUM matrices were constructed.

Scientists collected more than 2000 locally aligned blocks obtained from hundreds of protein groups. Every block has no gaps and is less than 60 amino acids in length. A cutoff of identity (N) was established in every block. For instance N=62: if sequence A and sequence B in one block share at least 62 % of identity, such sequences are merged and residues are averaged. If sequence C in the same block shares more than 62 % of identity with either A or B, C will be merged with A and B. This example illustrates construction of the BLOSUM62 matrix.

How to choose what model I need?

The commonly used nucleotide substitution models are the JC69, Kimura, and F81 models. If you encounter problems during phylogenetic reconstruction using such models, it is possible you will need a more complex and generalized model, for example GTR. However, all these models assume that different positions in the sequence evolve at the same rate. In a series of experiments on globin genes and mitochondrial DNA genes, Ziheng Yang showed that γ\gamma-distribution is an adequate approximation of substitution rates at different DNA positions. So, you frequently encounter GTR+G or HKY85+G models that assume γ\gamma-distribution of rates for sites. In real-world applications, you may use special algorithms that evaluate different models and find the best one for your data.

PAM and BLOSUM matrices were constructed using sequences with different divergence levels. So, if you know that the sequences you are analyzing are very closely related, you should try a PAM matrix with a low PAM number (PAM100 or lower) or BLOSUM with high number like BLOSUM90.

Conclusion

There are different nucleotide substitution models ranging from the simple JC69 to the complex GTR. In essence, they all parameterize nucleotide frequencies and substitution rates. Using one of the models does not guarantee you the best result, and adding the assumption of γ\gamma-distributed substitution rates can improve your results. Such models have "+G" notation.

Among all amino acid substitution models, PAM and BLOSUM are the most frequently used classical models. PAM was initially constructed based on evolutionary assumptions and is widely used in phylogenetic reconstruction. However, it was based on highly similar sequences and might be less realistic for dissimilar sequences. That is why you should try to use BLOSUM for more divergent sequences.

How did you like the theory?
Report a typo