Similarities in biological sequences can be visualized as phylogenetic trees that show the evolutionary relationships among organisms. In this topic, we cover basic phylogenetic terms and discuss how to read trees.
Phylogenetics terminology
Phylogenetics is the study of the evolutionary history of organisms using tree-like diagrams. Each tree is made up of the following components: leaves (also called taxa, sequences, or terminal nodes), branches, internal nodes, and an optional root. Let's read the following example, assuming that each capital letter at a terminal node represents an organism. The tree starts from the root, which can be an extinct organism or any common ancestor of all the specified organisms. The tree then splits into two branches. This bifurcation (a split into two branches) happens because of some major change in the genome. Here, the change produces wingless organisms on the right and organisms with wings on the left. The right side then splits again into species D (blue-tailed) and E (red-tailed). The left side splits into species A, which has no tail, and a group of species B and C, which have tails but different eye colors.
A group of taxa (biological units) composed of one ancestor and all of its descendants is called a monophyletic group, or a clade. A group that includes an ancestor but only some of its descendants is called paraphyletic. For example, in the picture above, organisms B and C form a monophyletic group, whereas a group made of A and B without C would be paraphyletic, because it leaves out one descendant of their common ancestor. Now let's assume that species C has not only red eyes but also a red tail. We can then group organisms C and E into what is called a polyphyletic group: they share a feature (a red tail), but they did not inherit it from a common ancestor.
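The definition of a monophyletic group can be checked mechanically. Here is a minimal Python sketch, assuming the example tree has the shape ((A,(B,C)),(D,E)), which is our reading of the description above rather than of the figure itself. The nested-tuple encoding and the helper names `leaves` and `is_monophyletic` are made up for illustration.

```python
# A tuple is an internal node, a string is a leaf.
# The shape ((A,(B,C)),(D,E)) is an assumption based on the text description.
TREE = (("A", ("B", "C")), ("D", "E"))


def leaves(node):
    """Return the set of leaf names found below a node."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaves(child)
    return result


def is_monophyletic(tree, group):
    """True if some node of the tree has exactly `group` as its leaf set."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


print(is_monophyletic(TREE, {"B", "C"}))  # True: B and C form a clade
print(is_monophyletic(TREE, {"A", "B"}))  # False: descendant C is left out
print(is_monophyletic(TREE, {"C", "E"}))  # False: C and E sit on different lineages
```

Note that the check only tells you whether a group is a clade. Telling paraphyletic groups from polyphyletic ones also requires knowing whether the shared feature was inherited from the common ancestor, which the tree shape alone does not record.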
Trees can be rooted or unrooted. The example above is a rooted tree, because we can clearly see the primary ancestor, or root. Often, however, you know nothing about the common ancestor, only the relative positions of the leaves. In the following figure, the unrooted tree can be rooted on different branches. Depending on the branch where you place the root, the inferred evolutionary relatedness among the organisms changes. In the first example, C is more closely related to the group of A and B than to D and E. In the second example, you see the opposite relationship.
There are two major techniques that help to identify the position of a common ancestor. The first one uses an outgroup. An outgroup is a homologous sequence that diverged from the lineage of your sequences before they diverged from one another. Let's say you are analyzing human sequences of a mucosal protein. If you find a homologous sequence in a reptile or a mouse, you can use it as an outgroup to root the resulting tree. The second technique is useful when you can't choose a good outgroup. It is called midpoint rooting. You find the midpoint of the path between the two most divergent leaves (the pair separated by the greatest total branch length) and assign this midpoint as the root. This approach assumes that all sequences evolve at a constant rate, so the number of accumulated mutations is proportional to evolutionary time. This assumption is called the "molecular clock", and it rarely holds exactly in real data.
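To make midpoint rooting concrete, here is a rough Python sketch. It assumes a hypothetical unrooted tree stored as a dictionary of edges: the leaves are A to E, the internal nodes "x", "y", and "z" are unnamed in reality, and all branch lengths are invented. The function name `midpoint_root` is also ours. The sketch only locates the root position; it does not rebuild the tree.

```python
from itertools import combinations

# Hypothetical unrooted tree: (node, node) -> branch length. All values are made up.
EDGES = {
    ("A", "x"): 1.5,
    ("B", "x"): 1.0,
    ("x", "y"): 1.0,
    ("C", "y"): 2.0,
    ("y", "z"): 3.0,
    ("D", "z"): 1.0,
    ("E", "z"): 2.0,
}


def build_adjacency(edges):
    """Turn the edge dictionary into a symmetric neighbor map."""
    adj = {}
    for (u, v), length in edges.items():
        adj.setdefault(u, {})[v] = length
        adj.setdefault(v, {})[u] = length
    return adj


def find_path(adj, start, goal, seen=None):
    """Return the unique path between two nodes of a tree as a list of nodes."""
    if seen is None:
        seen = {start}
    if start == goal:
        return [start]
    for nxt in adj[start]:
        if nxt not in seen:
            seen.add(nxt)
            rest = find_path(adj, nxt, goal, seen)
            if rest is not None:
                return [start] + rest
    return None


def midpoint_root(edges):
    """Return (edge, offset from the edge's first node, longest leaf-to-leaf distance)."""
    adj = build_adjacency(edges)
    leaf_names = [node for node, nbrs in adj.items() if len(nbrs) == 1]
    best_path, best_dist = None, -1.0
    # Compare every pair of leaves and keep the most distant pair.
    for a, b in combinations(leaf_names, 2):
        path = find_path(adj, a, b)
        dist = sum(adj[u][v] for u, v in zip(path, path[1:]))
        if dist > best_dist:
            best_path, best_dist = path, dist
    # Walk along the longest path until half of its length is used up.
    remaining = best_dist / 2
    for u, v in zip(best_path, best_path[1:]):
        if remaining <= adj[u][v]:
            return (u, v), remaining, best_dist
        remaining -= adj[u][v]


print(midpoint_root(EDGES))  # (('y', 'z'), 1.25, 7.5)
```

With these invented lengths, the most distant leaves are A and E (7.5 apart), so the root lands on the branch between y and z, 1.25 units from y. Rooted there, C falls on the same side as A and B.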
Tree representation
There are many ways to draw the same tree. Let's practice with the tree in the following example. Branches can be rotated freely around an internal node without changing the relationships among the leaves. So, you can see that the first and the last trees are the same in terms of the relationships among sequences.
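One way to convince yourself that rotation does not change a tree is to bring every drawing to a canonical form and compare the results. The Python sketch below does this by sorting the children of every node. The nested-tuple trees are hypothetical examples, not the ones from the figure, and the function name `canonical` is ours.

```python
def canonical(node):
    """Return a rotation-independent form of a nested-tuple tree."""
    if isinstance(node, str):
        return node
    # Sort children by their text form so that leaves and subtrees can be compared.
    return tuple(sorted((canonical(child) for child in node), key=repr))


tree_1 = (("A", ("B", "C")), ("D", "E"))
tree_2 = (("E", "D"), (("C", "B"), "A"))  # tree_1 with every node rotated
tree_3 = ((("A", "B"), "C"), ("D", "E"))  # a different grouping of A, B, and C

print(canonical(tree_1) == canonical(tree_2))  # True: only the drawing differs
print(canonical(tree_1) == canonical(tree_3))  # False: the relationships differ
```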
A tree can be represented in an unscaled format, or cladogram, where the branch lengths carry no meaning. This representation shows only the evolutionary relationships among sequences. A tree can also be shown as a dendrogram or a phylogram. A dendrogram, or ultrametric tree, shows evolutionary relationships and relative divergence times (e.g., in millions of years); all sequences are equidistant from the root. A phylogram, or metric tree, represents evolutionary relationships and the amount of evolutionary change, such as the number of mutations; branch lengths can differ from leaf to leaf.
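The difference between a dendrogram and a phylogram comes down to one property: whether all leaves are at the same distance from the root. Here is a small Python sketch of that check. The encoding (an internal node is a list of `(child, branch_length)` pairs), the two example trees, and the function names are all assumptions made for illustration.

```python
# An internal node is a list of (child, branch_length) pairs; a leaf is a string.
# Both trees and their branch lengths are hypothetical.
DENDROGRAM = [("A", 3), ([("B", 1), ("C", 1)], 2)]
PHYLOGRAM = [("A", 1), ([("B", 1), ("C", 2)], 1)]


def leaf_depths(node, depth=0.0):
    """Map every leaf name to its total distance from the root."""
    if isinstance(node, str):
        return {node: depth}
    depths = {}
    for child, length in node:
        depths.update(leaf_depths(child, depth + length))
    return depths


def is_ultrametric(tree, tol=1e-9):
    """True if all leaves are equidistant from the root (within a tolerance)."""
    depths = leaf_depths(tree).values()
    return max(depths) - min(depths) <= tol


print(leaf_depths(DENDROGRAM))     # {'A': 3.0, 'B': 3.0, 'C': 3.0}
print(is_ultrametric(DENDROGRAM))  # True
print(is_ultrametric(PHYLOGRAM))   # False
```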
In the three examples above, we see rooted trees with the same topology. But how can we deduce from these trees which organism is more closely related to C: A or B? The simple rule is to find the most recent common ancestor of each pair, tracing back from the leaves toward the root. In these examples, organisms B and C share a more recent common ancestor than C and A do. So, we can conclude that C is more closely related to B than to A. If branch lengths are meaningful, you can calculate the distance between two organisms by summing all the branch lengths on the path between them. For example, on the phylogram you can see that the distance between A and C is 4, and the distance between A and B is 3.
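Both rules, finding the most recent common ancestor and summing branch lengths, can be sketched in a few lines of Python. The branch lengths below are an assumption: they were chosen to agree with the distances quoted above (A to C is 4, A to B is 3), and the original figure may differ. The node labels "root" and "BC" and the function names are ours.

```python
# child -> (parent, branch length). The lengths are assumed, not taken from the figure.
PARENT = {
    "A": ("root", 1),
    "BC": ("root", 1),
    "B": ("BC", 1),
    "C": ("BC", 2),
}


def ancestors_with_distance(node):
    """Map the node and each of its ancestors to the distance from the node."""
    distances = {node: 0}
    total = 0
    while node in PARENT:
        parent, length = PARENT[node]
        total += length
        distances[parent] = total
        node = parent
    return distances


def mrca_and_distance(a, b):
    """Return the most recent common ancestor of a and b and their path distance."""
    up_a = ancestors_with_distance(a)
    up_b = ancestors_with_distance(b)
    # Dictionaries keep insertion order, so this walks from `a` toward the root.
    for node in up_a:
        if node in up_b:
            return node, up_a[node] + up_b[node]
    raise ValueError("the two nodes are not in the same tree")


print(mrca_and_distance("B", "C"))  # ('BC', 3)
print(mrca_and_distance("A", "C"))  # ('root', 4)
print(mrca_and_distance("A", "B"))  # ('root', 3)
```

B and C meet at the internal node "BC", which lies below the root, so their common ancestor is more recent than the one C shares with A.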
Newick format
The information described by a phylogenetic tree can be expressed in Newick format. It's a linear representation of a tree in which monophyletic groups are nested in parentheses. Species or groups of species inside a pair of parentheses are separated by commas. Let's look at the example above. In the case of the cladogram, all species are aligned in a row, and the Newick string expresses only the evolutionary relationships.
However, we can add information about branch lengths by appending a colon and the branch length after each species or group. For example, on the phylogram, species A has acquired 1 mutation since its bifurcation from the group of B and C. In turn, B has acquired 1 mutation since it last shared a common ancestor with C. And so on.
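To see how the parentheses, commas, and colons fit together, here is a minimal Python sketch that writes a tree as a Newick string, with or without branch lengths. The tree encoding and the function name `to_newick` are ours, and the lengths of the C branch and the internal branch are assumptions (only the lengths for A and B are stated above). Newick strings conventionally end with a semicolon, so the sketch adds one.

```python
# An internal node is a list of (child, branch_length) pairs; a leaf is a string.
# Lengths for A and B follow the text; the other two lengths are assumed.
TREE = [("A", 1), ([("B", 1), ("C", 2)], 1)]


def to_newick(tree, with_lengths=True):
    """Serialize a nested tree into a Newick string."""

    def render(node):
        if isinstance(node, str):
            return node
        parts = []
        for child, length in node:
            text = render(child)
            if with_lengths:
                text += f":{length}"  # colon, then the length of the branch above
            parts.append(text)
        return "(" + ",".join(parts) + ")"

    return render(tree) + ";"


print(to_newick(TREE, with_lengths=False))  # (A,(B,C));
print(to_newick(TREE))                      # (A:1,(B:1,C:2):1);
```

The first string corresponds to a cladogram, where only the grouping matters. The second one carries the extra information that a phylogram shows.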
You may also notice that these trees are drawn in a squared form, unlike the angled form used in the branch rotation example above. These two forms are the most common, but you can also draw a tree in a round form or in some more complex form. It's a matter of personal preference and of making the tree clearly express the relevant information.
Conclusion
Phylogenetics is the study of evolutionary relationships among organisms using tree-like diagrams. Every tree stores information about the evolutionary history of sequences or organisms. Dendrograms show relative divergence time, which is usually expressed in units of time. On a dendrogram, all leaves are equidistant from the root. Phylograms have branches of different lengths that express the amount of evolutionary change, such as the number of substitutions.