We have already covered basic phylogenetics terminology, how to read trees, and how trees can be expressed in graphical and text formats in the topic Phylogenetic tree. In this topic, you will learn how to construct a phylogenetic tree.
Tree construction steps
Tree construction can be divided into four major steps: (1) choosing the molecular data (DNA or protein) to analyze; (2) building a multiple alignment; (3) selecting a tree building method and, if applicable, a substitution model; (4) evaluating the resulting tree.
Choosing a multiple alignment tool and molecular markers
The choice of molecular markers depends heavily on the purpose of the study. If you study very closely related organisms like humans and other primates, you can use the nucleotide sequences of coding regions. If you want to delineate more divergent groups of organisms like mammals, amphibians, and insects, protein sequences are a better choice. Each amino acid is encoded by three nucleotides (a codon), and most amino acids are encoded by more than one codon, so a change in a codon does not necessarily cause a change in the amino acid. In other words, a protein sequence can remain the same even if the corresponding DNA sequence changes, which means protein sequences diverge more slowly than the DNA that encodes them. Another option is to use sequences that do not code for proteins, like ribosomal RNA or promoter/enhancer DNA sequences. With highly conserved sequences of this kind, such as ribosomal RNA, you can differentiate evolutionarily distant groups of organisms like archaea, eukaryotes, and bacteria [see details in the paper].
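The point about codons can be illustrated with a minimal sketch. The codon table below is only a small excerpt of the standard genetic code, and the three sequences and the helper names `translate` and `count_differences` are made up for illustration.

```python
# Excerpt of the standard genetic code; only the codons used below.
CODON_TABLE = {
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",  # alanine
    "AAA": "K", "AAG": "K",                          # lysine
    "GAT": "D", "GAC": "D",                          # aspartic acid
    "GAA": "E", "GAG": "E",                          # glutamic acid
}


def translate(dna):
    """Translate a coding DNA string codon by codon using CODON_TABLE."""
    usable = len(dna) - len(dna) % 3  # ignore an incomplete trailing codon
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


def count_differences(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


seq_a = "GCTAAAGAT"  # Ala-Lys-Asp
seq_b = "GCCAAGGAC"  # third position of every codon changed
seq_c = "GCTAAAGAA"  # last codon now encodes Glu

# seq_a and seq_b differ at three nucleotides but give the same protein.
print(count_differences(seq_a, seq_b), translate(seq_a), translate(seq_b))
# seq_a and seq_c differ at one nucleotide and give different proteins.
print(count_differences(seq_a, seq_c), translate(seq_a), translate(seq_c))
```

At the DNA level `seq_a` and `seq_b` look quite different, while at the protein level they are identical. This is why nucleotide data resolve close relatives better, and protein data stay informative for more distant ones.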
After you choose your molecular data, you need to choose a multiple alignment tool. We covered several state-of-the-art multiple alignment programs, like ClustalW, T-Coffee, Muscle, and MAFFT, in the topic Multiple alignment. They are all valid choices for this step. However, keep in mind that you should often, if not always, check the resulting alignment manually. The basic rule is to check whether the conserved regions contain residues with similar physicochemical properties in the same column, like lysine and arginine or glutamic acid and aspartic acid.
You will often observe gap regions in the resulting alignment. Let's look at an example to show what you should do with them. If you see gaps at the start or at the end of the alignment, they are probably noise that you can get rid of. However, if some sequences have gaps in a region where other sequences have characters in the corresponding columns, and these gaps form continuous bands, they probably carry meaningful information, like an insertion. These two scenarios are depicted in the figure below (red color highlights conserved blocks in alignments).
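The first scenario can be sketched as follows. This is a deliberately simple heuristic, not the method of any particular trimming program: it removes columns from both ends for as long as any sequence has a gap there, and leaves internal gaps untouched. The function name `trim_terminal_gaps` and the toy alignment are assumptions.

```python
def trim_terminal_gaps(alignment, gap="-"):
    """Drop leading and trailing columns that contain a gap in any sequence.

    Internal gap bands (possible insertions or deletions) are preserved.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    # Move the left border right until a gap-free column is found.
    start = 0
    while start < length and any(seq[start] == gap for seq in alignment):
        start += 1

    # Move the right border left in the same way.
    end = length
    while end > start and any(seq[end - 1] == gap for seq in alignment):
        end -= 1

    return [seq[start:end] for seq in alignment]


alignment = [
    "--MKV--LTA-",
    "AMMKVGGLTAR",
    "-MMKV--LTAR",
]
for seq in trim_terminal_gaps(alignment):
    print(seq)
```

In the toy alignment, the ragged ends are removed, while the two-column gap band in the middle stays, because it may reflect an insertion in the second sequence.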
Tree building methods and tree evaluation
After performing the multiple alignment, you need to choose a tree building method, a substitution model, and a model of among-site rate variation, if applicable. There are two major groups of tree building algorithms: distance-based and character-based. A distance-based algorithm takes whole-length sequences and computes the distances between all pairs of sequences. All the distances are put into a distance matrix, in which every cell holds the distance between a pair of sequences. Such algorithms are quite fast, but they can lose information because they reduce the alignment between a pair of sequences to a single number. Character-based algorithms treat sequences character-wise, that is, they work with the individual alignment columns. Because the number of possible trees grows very quickly with the number of sequences, such algorithms can typically evaluate all trees exhaustively for no more than about 10 sequences in a reasonable time frame. However, there are several heuristics and fast preliminary steps that can cut the computational time of a character-based algorithm.
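As a minimal sketch of the first step of a distance-based method, the code below builds a distance matrix from an alignment. It assumes the simplest possible distance, the proportion of differing sites (often called the p-distance), and skips columns where either sequence has a gap. The function names and the toy alignment are made up for illustration.

```python
def p_distance(a, b, gap="-"):
    """Proportion of differing sites among columns without gaps."""
    compared = 0
    differing = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return differing / compared


def distance_matrix(alignment):
    """Symmetric matrix of pairwise p-distances with zeros on the diagonal."""
    n = len(alignment)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = p_distance(alignment[i], alignment[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


alignment = ["ACGTACGTAC", "ACGTACGTTC", "ACGAACCTTC"]
for row in distance_matrix(alignment):
    print(row)
```

Each pair of ten-character sequences is reduced to one number here, which shows both why distance-based methods are fast and where they lose information: the matrix no longer records which columns differ.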
The rate of amino acid or nucleotide substitution can vary between different pairs of residues. This information is represented in a substitution matrix (see the topic Substitution models). In that topic, we covered DNA models like Jukes-Cantor, in which all substitution rates are assumed to be equal, and the Kimura model, in which transitions are more frequent than transversions. We also covered protein models like PAM and BLOSUM, where PAM is more relevant for closely related sequences. Additionally, we talked about a model of among-site rate variation called the gamma (Γ) distribution model.
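To show how a substitution model changes a distance, here is a small sketch of the standard distance corrections for the two DNA models mentioned above: the Jukes-Cantor formula d = -3/4 ln(1 - 4p/3), where p is the proportion of differing sites, and the Kimura two-parameter formula d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions. The function names are assumptions made for this example.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def jukes_cantor(p):
    """Jukes-Cantor distance from the proportion of differing sites p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to exist")
    return -0.75 * math.log(1 - 4 * p / 3)


def transition_transversion_fractions(a, b):
    """Return (P, Q): proportions of transitions and transversions."""
    sites = transitions = transversions = 0
    for x, y in zip(a, b):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguous characters
        sites += 1
        if x == y:
            continue
        pair = {x, y}
        # A<->G and C<->T are transitions; everything else is a transversion.
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    return transitions / sites, transversions / sites


def kimura_2p(P, Q):
    """Kimura two-parameter distance from transition/transversion fractions."""
    if 1 - 2 * P - Q <= 0 or 1 - 2 * Q <= 0:
        raise ValueError("sequences are too divergent for this correction")
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


print(jukes_cantor(0.3))
P, Q = transition_transversion_fractions("AACC", "GACA")
print(P, Q, kimura_2p(P, Q))
```

Both corrected distances are larger than the raw proportion of differences, because the models account for multiple substitutions that happened at the same site and are invisible in the alignment.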
The reliability of the resulting phylogenetic tree should be tested. Bootstrapping is a widely used approach for this. Alignment columns are sampled at random with replacement, so some columns are duplicated at the expense of others, and the new alignment has the same length as the original one (in the following figure, red is used to highlight new duplicated columns). A new tree is constructed from the resampled alignment. This procedure is repeated from 100 to 1000 times, and the final consensus tree with bootstrap values on its branches summarizes the trees from all repetitions. The bootstrap value on a branch shows how frequently that branch was observed in the new trees.
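The resampling step of bootstrapping can be sketched like this. Tree building itself is left out: the sketch only produces the resampled alignments and shows how a support value would be computed once every replicate tree is available. Representing a tree as a set of clades (each clade a frozenset of sequence names) is an assumption made for this example, as are all function names.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Sample columns with replacement; the result keeps the original length."""
    length = len(alignment[0])
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[c] for c in columns) for seq in alignment]


def bootstrap_replicates(alignment, n_replicates=100, seed=0):
    """Generate reproducible bootstrap replicates of an alignment."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


def branch_support(clade, replicate_clades):
    """Fraction of replicate trees that contain the given clade.

    Each replicate tree is represented as a set of frozensets of names.
    """
    clade = frozenset(clade)
    hits = sum(clade in clades for clades in replicate_clades)
    return hits / len(replicate_clades)


alignment = ["ACGTACGT", "ACGTTCGT", "ACCTTCGA"]
replicates = bootstrap_replicates(alignment, n_replicates=3, seed=42)
for replicate in replicates:
    print(replicate)

# Hypothetical clades from four replicate trees built elsewhere.
trees = [{frozenset("AB")}, {frozenset("AC")}, {frozenset("AB")}, {frozenset("AB")}]
print(branch_support({"A", "B"}, trees))
```

Whole columns are resampled, not individual characters, so the relationship between the sequences within each column is preserved in every replicate.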
Bootstrapped trees with support values on branches are frequently used because the support values show how much confidence you can place in each branch, which a non-bootstrapped tree does not tell you. A similar method, called jackknifing, is used too. In this approach, half of the sites in an alignment are randomly removed, so the new alignment is half as long as the original one. The next steps are the same as in the bootstrapping approach.
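For comparison, here is a minimal sketch of the jackknife resampling step as described above: half of the columns are kept, chosen at random without replacement, and their original order is preserved. The function name and the `keep_fraction` parameter are assumptions.

```python
import random


def jackknife_alignment(alignment, rng, keep_fraction=0.5):
    """Keep a random subset of columns, without replacement, in original order."""
    length = len(alignment[0])
    n_keep = max(1, int(length * keep_fraction))
    kept = sorted(rng.sample(range(length), n_keep))
    return ["".join(seq[c] for c in kept) for seq in alignment]


alignment = ["ACGTACGT", "ACGTTCGT", "ACCTTCGA"]
rng = random.Random(7)
for seq in jackknife_alignment(alignment, rng):
    print(seq)
```

Unlike a bootstrap replicate, a jackknife replicate never contains the same column twice, and it is shorter than the original alignment.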
Application
In programs, some of these steps are merged into one. For example, in the MEGA software you can choose the tree building algorithm, the substitution model, and the among-site rate variation where applicable. You can also turn bootstrapping on or off. All these procedures then run with one click in a graphical interface. In MEGA, you can even build a multiple alignment or give a distance matrix as input. The RAxML program uses character-based algorithms only and also lets you choose the substitution model and whether to bootstrap or not. All these steps are run from a command-line interface.
Conclusion
The whole tree construction process can be divided into a few steps, which include choosing the type of data, the multiple alignment program, the tree building method, and the substitution model, and assessing the resulting tree. All or some of these steps are often merged into one, as in MEGA or RAxML, respectively.