
When analyzing sequencing data, you might want to compare two (or more) biological sequences that are similar overall but differ at particular positions. To find which parts of the sequences are the same and which are different, bioinformaticians use alignment algorithms.

In this topic we will cover two main aspects:

  • What is alignment in bioinformatics and how is it interpreted?

  • How many types of alignment algorithms are there and what's the difference between them?

Alignment 101

Alignment, or sequence alignment, is a way of arranging multiple sequences to find similarities in them. You can align any text sequences: in bioinformatics it can be DNA, RNA, or protein sequences, but alignments are also used in historical and comparative linguistics. The goal of word alignment, for example, is to find relationships between words or phrases in sentences written in different languages. In the example below, we're comparing a sentence in German with a sentence in English and trying to find similar regions with similar meanings.

two sentences with the same meaning in different languages, arrows connect words that mean the same thing

The same principle is applied to nucleotide or protein alignment – but instead of words we have nucleotides or amino acids. Let's dive into sequence comparison and interpretation!

Matches, mismatches and gaps

During sequence alignment, we compare characters position by position, with three possible outcomes: the characters are the same (match), the characters are different (mismatch), or one sequence is missing a character that the other has (gap).

Compare two sequences: AAG and AC-. A and A - a match, A and C - a mismatch, G and a dash - a gap.
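The three outcomes can be sketched in a few lines of Python. This is only an illustration: the function name `classify_columns` is hypothetical, and the code assumes the two sequences are already aligned (equal length, with `-` marking a gap).

```python
def classify_columns(seq_a, seq_b, gap="-"):
    """Label each column of two already-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    labels = []
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            labels.append("gap")       # one sequence lacks a character here
        elif a == b:
            labels.append("match")     # identical characters
        else:
            labels.append("mismatch")  # different characters
    return labels


print(classify_columns("AAG", "AC-"))
# ['match', 'mismatch', 'gap']
```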

A mismatch or a gap suggests that a mutation occurred during the evolution of these sequences. A mismatch corresponds to a substitution: one nucleotide has been replaced by another. A gap can correspond to either a deletion of nucleotides from one sequence or an insertion of nucleotides into the other.

There are several ways to classify alignments: by the number of sequences compared (two or more) or by how much of each sequence is aligned (fragments or the whole length). Let's start with the first!

Pairwise and multiple alignments

When we're comparing two sequences to find similar and dissimilar regions, we use pairwise sequence alignment (PSA). It helps determine the functional, structural, or evolutionary relationships between two sequences.

Multiple sequence alignment (MSA) is used to compare three or more sequences. This allows us to discover gene or protein families, enzyme active sites, functional and evolutionary relationships, and many other interesting biological features. The algorithms for MSA and PSA are closely related, because multiple alignment is typically built on top of pairwise alignment. For example, one of the most popular programs for multiple alignment, ClustalW (ClustalX is its graphical version), first runs a pairwise alignment for all pairs of sequences and then uses those results to progressively combine the sequences into one alignment of the full set.
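The "all pairs" step can be illustrated with a small Python sketch. It is not how ClustalW is implemented: the helper names `percent_identity` and `all_pairs_identity` are hypothetical, the toy sequences are assumed to have equal length, and simple percent identity stands in for a real pairwise alignment score.

```python
from itertools import combinations


def percent_identity(seq_a, seq_b):
    """Share of positions with identical characters (equal-length toy input)."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    same = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * same / len(seq_a)


def all_pairs_identity(sequences):
    """Compare every pair of named sequences exactly once."""
    return {
        (name_a, name_b): percent_identity(sequences[name_a], sequences[name_b])
        for name_a, name_b in combinations(sequences, 2)
    }


toy = {"s1": "ACGT", "s2": "ACGA", "s3": "TCGA"}
for pair, score in all_pairs_identity(toy).items():
    print(pair, score)
# ('s1', 's2') 75.0
# ('s1', 's3') 50.0
# ('s2', 's3') 75.0
```

For n sequences there are n(n-1)/2 pairs, so the number of pairwise comparisons grows quadratically with the size of the set.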

Pairwise alignment: we compare two sequences with each other (each position can be a match, a mismatch, or a gap). Multiple alignment: we compare three or more sequences, the first with the second, the second with the third, and so on.

Global and local alignments

Global alignment compares sequences along their whole length to find as many matching characters as possible. The best candidates for global alignment are sequences of roughly the same length. But what if the two sequences are very different, for example, if one of them is much shorter than the other? How can you find the similar regions while ignoring the different ones? In this situation, you should choose local alignment, which looks for the regions of highest similarity between the sequences and ignores everything outside them.

Global alignment: Sequences start together and end together. Local alignment: compare only small fragments of sequences, not paying attention to what is outside the aligned fragment.
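The difference can be made concrete with a small dynamic-programming sketch. The code below borrows the ideas behind two classic algorithms that this topic does not cover, Needleman-Wunsch (global) and Smith-Waterman (local), and returns only the best score, not the alignment itself. The function name `alignment_score` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen purely for illustration.

```python
def alignment_score(seq_a, seq_b, local=False, match=1, mismatch=-1, gap=-1):
    """Best alignment score of two sequences (global by default)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    table = [[0] * cols for _ in range(rows)]

    if not local:
        # Global: leading gaps are penalized, so the first row/column accumulate them.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            cell = max(
                table[i - 1][j - 1] + pair,  # align the two characters
                table[i - 1][j] + gap,       # gap in seq_b
                table[i][j - 1] + gap,       # gap in seq_a
            )
            if local:
                # Local: a score never drops below 0, so an alignment can start anywhere.
                cell = max(cell, 0)
                best = max(best, cell)
            table[i][j] = cell

    # Global: score of aligning both sequences end to end.
    # Local: best score found anywhere in the table.
    return best if local else table[-1][-1]


long_seq, short_seq = "AAAACGTAAAA", "CGT"
print(alignment_score(long_seq, short_seq))              # -5
print(alignment_score(long_seq, short_seq, local=True))  # 3
```

With these toy scores, the global alignment is penalized for the eight positions of the long sequence that have no partner, while the local alignment reports only the shared CGT fragment.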

Application

Now that we know what sequence alignment is, the remaining question is: why do we need it in bioinformatics?

First, by aligning newly sequenced genes with sequences already present in the database, researchers can predict the function of their newly discovered genes.

Alignments can also be used to:

  • Identify new gene family members;

  • Find evolutionary relationships or reconstruct phylogeny to determine if two or more genes or proteins are related to one another in similar species;

  • Make predictions about the location and function of protein-coding and transcription-regulation regions in genomic DNA. Since regulatory regions in the genome are usually conserved, conserved regions revealed by an alignment make it easier to identify regulatory sites in sequenced genes;

  • Identify protein regions that are similar in structure or function.

Conclusion

Sequence alignment is an important part of modern bioinformatics. It is the comparison of biological sequences in order to detect similarities between them. There are many alignment algorithms for various biological applications, but to compare them and pick the best one for a task, we first need some way of scoring alignments. This will be discussed in the following topics.
