Natural scienceBioinformaticsBioinformatics algorithmsPhylogenetics

Phylogenetic tree algorithms

Provided by: Edvancium

10 minutes read

We previously introduced phylogenetic trees by defining some basic phylogenetic terms and explaining how to read trees. We then discussed the steps to construct a phylogenetic tree. Recall them: (1) choosing data type (i.e. nucleotide or protein) on which a tree will be constructed, (2) applying multiple alignment, (3) selecting tree construction method and substitution model, and (4) assessing the result. In this topic, we will cover algorithms used to complete the fourth step.

Tree construction methods

Tree construction methods can be divided into two big categories: distance-based and character-based. The first computes distance between whole-length sequences, while the second one treats sequences character-wise. The distance-based methods are computationally fast, but can potentially lose information.

Distance-based methods

Distance-based methods can be subdivided into clustering and optimality based algorithms. Clustering methods build a tree based on a distance matrix starting from the most similar pair of sequences. Examples of clustering methods are the unweighted pair group method using arithmetic average (UPGMA; Sokal and Michener, 1958) and neighbor-joining (NJ). Optimality-based methods search through many trees and select one based on a specific assumption. An example is the minimum evolution (ME; Rzhetsky and Nei, 1992) method.

The simplest clustering method is called UPGMA. First, a distance matrix is built and the pair of sequences with the smallest distance is grouped. This pair is the first clade on the tree. The distance to a common ancestor is averaged. For example, given sequences A and B and common ancestor U, the distances AU and BU are the same and equal to half of AB. The distance matrix is recalculated, and the next pair of sequences with the smallest distance is grouped. This process continues until all sequences are on the tree. UPGMA is based on the assumption that all sequences evolve at a constant rate, producing what is called the "molecular clock." This assumption often produces erroneous tree topologies. UPGMA produces dendrograms because of how the distances are averaged distance to the common ancestors.

To correct the errors that the assumptions made in the UPGMA method can cause, neighbor-joining (NJ) method was introduced. The first step is identical to UPGMA — construction of the distance matrix. Then, the matrix is corrected by placing all sequences on a "star tree" as shown in the following figure. The pair of sequences with the smallest corrected distance is grouped. The corrected distance to a newly formed common ancestor is calculated. So, this step helps identify branch lengths to a new common ancestor. The corrected distance matrix is recalculated, and a new pair of most similar sequences is chosen. The process continues until the whole "star tree" is resolved. The NJ method produces an unrooted tree, so the outgroup approach or midpoint rooting should be considered to root the resulted tree (BI: Phylogenetic tree). Instead of dendrograms like UPGMA, NJ produces phylograms.

Resolution of a star tree

These two methods produce one tree as an output.

In contrast, the minimum evolution (ME) algorithm searches for the best, or optimal, tree through many trees. Let's see how this method works. The first step is to construct an NJ tree. Then, an NJ neighborhood is made using branch swapping technique. This approach is called close-neighbor-interchange (CNI) and generates many tree topologies based on a starting tree. The last step is choosing the optimal or ME tree based on overall minimum branch length (ME criterion). ME method uses NJ as a first step, so the resultant tree is unrooted phylogram.

UPGMA and NJ can be used for large datasets, but they aren't as accurate as the ME method. However, ME is suitable for a maximum of a dozen sequences, because the construction of many tree topologies is an exhaustive process. Also, all these methods reduce sequence information into a single value, so a lot of information is lost and ancestral sequences at internal nodes can't be deduced.

Character-based methods

The two most popular character-based methods are maximum parsimony (MP) and maximum likelihood (ML).

The maximum parsimony (MP) method is similar to ME — MP searches through many tree topologies and finds the final tree based on the assumption that the tree with the minimum number of mutations is the optimum. The first step is to retain only informative columns in an alignment, which are columns with at least two types of residues each occurring at least twice. Other, noninformative columns in the alignment are filtered out. The second step is to build many tree topologies. To do this, a tree of three sequences is built, then a fourth sequence is added to form three different tree topologies and so on. The process continues until all sequences are added. The final step is to find an MP tree by searching for the tree with minimum number of mutations. MP produces unrooted cladograms, so the result represents only evolutionary relationships among organisms.

You can see that the MP method is computationally expensive, so it is suitable for ten sequences or so. Additionally, you can't apply any substitution models here. Also, it is very sensitive to the "long-branch attraction" (LBA) effect. The LBA effect is an artifact that occurs because rapidly evolving sequences (i.e. sequences on long branches) tend to attract each other even though they aren't similar.

Correct tree topology is compared with incorrect topology caused by long-branch attraction

The maximum likelihood approach isn't as sensitive to the LBA effect as the MP method, and it is quite popular among character-based methods. Let's discuss how it works on the following example. Given three sequences and $n$ columns in an alignment, let's build all three possible tree topologies. Then, for the first tree topology ( $A$ ) we take characters from the first column and calculate the most likely topology ( $L_1^A$ ). $L_1^A$ is just a sum of log likelihoods of each branch $L_1^A=\sum_{branches} logP_{branch}$ with a given substitution model. The procedure continues for all $n$ columns, so the final likelihood for the first tree topology is $L_A = \sum_{i=1}^{n} L_i^A$ . Then, we calculate the overall likelihood of the other two tree topologies ( $L_B, L_C$ ). The final tree, or ML tree, is a tree with a maximum overall likelihood, $max(L_A, L_B,L_C)$ . The ML method produces unrooted phylograms.

Possible tree topologies from an alignment of three sequences

Calculation of the most likely tree

The ML approach is exhaustive, but is less sensitive to LBA artifacts than MP. However, with increasing number of sequences, exact ML starts to consume a lot of computational resources, so several heuristics and alternative approaches to overcome the problem were made. Some examples are quartet puzzling, genetic algorithms, and Bayesian approach.

Summary of the observed methods is shown below.

Algorithm name	Type	Rooted/unrooted	Branch length	Tree type
UPGMA	DB	Rooted	Yes	Dendrogram
Neighbor-joining	DB	Unrooted	Yes	Phylogram
Minimum evolution	DB	Unrooted	Yes	Phylogram
Maximum parsimony	CB	Unrooted	No	Cladogram
Maximum likelihood	CB	Unrooted	Yes	Phylogram

Conclusion

There are two sets of tree construction methods: distance-based methods and character-based methods. Distance-based methods rely on a distance matrix, and some of them, like UPGMA and NJ, work efficiently for large datasets. Character-based methods work with sequences column-wise. Two of the most popular approaches are MP and ML. MP has a well-known artifact called LBA, so you should double check your MP tree using another method, for example ML.

How did you like the theory?

Report a typo

Phylogenetic tree algorithms

Tree construction methods

Distance-based methods

Character-based methods

Conclusion

Related topics