In the field of sequencing, Illumina is one of the most popular platforms: 90% of sequencing data worldwide is generated on Illumina instruments. However, like all sequencing platforms, it has its own set of limitations and errors: the average per-base error rate of Illumina is around 1 in 1,000. In this article, we will explore what these limitations and errors are and how they can affect your sequencing results.
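To put that error rate into perspective, here is a minimal sketch that converts a per-base error probability into a Phred quality score using the standard formula Q = -10 * log10(p), and estimates how many errors to expect in one read. The 150-base read length and the function names are assumptions chosen for illustration only.

```python
import math


def phred_from_error_prob(p: float) -> float:
    """Convert a per-base error probability into a Phred quality score."""
    return -10 * math.log10(p)


def expected_errors(read_length: int, p: float) -> float:
    """Expected number of wrong bases in a read, assuming a uniform error rate."""
    return read_length * p


ERROR_RATE = 1 / 1000   # the average per-base error rate quoted above
READ_LENGTH = 150       # assumed read length, for illustration only

quality = phred_from_error_prob(ERROR_RATE)
errors = expected_errors(READ_LENGTH, ERROR_RATE)
print(f"Phred quality: Q{quality:.0f}")
print(f"Expected errors per {READ_LENGTH}-base read: {errors:.2f}")
```

In real data the error rate is not uniform along the read, so treat these numbers as a rough average rather than a prediction for any single read.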
Illumina limitations
Even though specific limitations can vary from one type of Illumina sequencer to another, there are a few key limitations to Illumina sequencing that are important to consider:
- Illumina sequencing is limited by the size of the fragments that can be sequenced. This limitation comes from the chemistry of the sequencing process, which relies on short stretches of DNA being attached to the glass surface of the flow cell. Larger fragments of DNA cannot be attached as easily and therefore cannot be sequenced as accurately. Illumina might not be the ideal option if getting long reads (more than 300 base pairs) is crucial for your experiment;
- A sequencing run can take a long time, from several hours to several days depending on the instrument and the read length;
- There is no real-time data access: users have to wait until the sequencing run is complete before they can begin their analysis.
Let's look at the most common Illumina errors and how to deal with them!
Homopolymer errors
Homopolymers, also called mononucleotide microsatellites, are runs of identical nucleotides. "ATTTTTTGC", for example, contains a homopolymer of length 6 (made up of "T" bases).
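Homopolymers are easy to locate programmatically. The following is a minimal sketch that uses a regular expression with a backreference to find every run of a single base; the function name and the minimum run length of 3 are assumptions, not a community standard.

```python
import re


def find_homopolymers(seq: str, min_len: int = 3):
    """Return (base, start, length) for each run of one base at least min_len long."""
    # ([ACGTN]) captures one base; \1{k,} requires at least k more copies of it.
    pattern = re.compile(r"([ACGTN])\1{%d,}" % (min_len - 1))
    return [
        (match.group(1), match.start(), len(match.group(0)))
        for match in pattern.finditer(seq.upper())
    ]


runs = find_homopolymers("ATTTTTTGC")
print(runs)  # [('T', 1, 6)]: a run of six T bases starting at 0-based position 1
```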
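What types of errors occur in homopolymers? Mostly insertions or deletions (indels for short), where the run is reported as longer or shorter than it really is. For example, "ATTTTTTGC" might be read as "ATTTTTGC" (a deletion) or as "ATTTTTTTGC" (an insertion).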
Why does this happen on Illumina?
Illumina has trouble identifying a single nucleotide that differs from the bases surrounding it. Let's say we have the following sequence region: "CCCCCCGCCCCC". The signal from the nearby cytosines (C) interferes with the signal from the guanine (G), so this base can be lost, resulting in an indel in the final sequence.
The issue can, however, be resolved. Three widely used solutions to this problem are listed below:
- Using special tools: software like Pollux can correct the reads, fixing a miscalculated homopolymer length or the sudden loss of a nucleotide;
- Removing homopolymers from the non-coding regions of the sequence: this might sound harsh, but it speeds up the sequence analysis, which is often important;
- Cross-validating the sequence: increasing the coverage or resequencing with another platform is also a possible solution, although certainly the most expensive one.
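To illustrate what a homopolymer length error looks like when a read is compared with a trusted sequence, here is a small sketch based on run-length encoding. It is not how Pollux or any other correction tool works internally; the function names are hypothetical, and the comparison only handles the simple case where the two sequences contain the same bases in the same order.

```python
from itertools import groupby


def run_length_encode(seq: str):
    """Collapse a sequence into (base, run_length) pairs."""
    return [(base, len(list(group))) for base, group in groupby(seq)]


def homopolymer_length_diffs(reference: str, read: str):
    """List runs whose length differs between reference and read.

    Returns None when the two sequences differ in more than run lengths.
    """
    ref_runs = run_length_encode(reference)
    read_runs = run_length_encode(read)
    if [base for base, _ in ref_runs] != [base for base, _ in read_runs]:
        return None
    return [
        (index, base, ref_len, read_len)
        for index, ((base, ref_len), (_, read_len)) in enumerate(zip(ref_runs, read_runs))
        if ref_len != read_len
    ]


# One T is missing from the read: run 1 has length 6 in the reference, 5 in the read.
print(homopolymer_length_diffs("ATTTTTTGC", "ATTTTTGC"))  # [(1, 'T', 6, 5)]
```

A lost base such as the "G" in "CCCCCCGCCCCC" changes the order of runs, so this sketch reports None for it; detecting that case needs a proper alignment.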
Phasing and prephasing
In sequencing results, quality often declines noticeably as the base position in the read increases. The main reason for that is phasing. Phasing occurs when a nucleotide's blocker is not properly removed after signal detection. Because no new nucleotide can bind to this fragment of DNA, the old nucleotide is detected once more in the following cycle. Its fluorescence signal can therefore differ from the synchronous signal of the other fragments (this is shown in the picture below). This DNA fragment is now one cycle out of phase with the other DNA fragments, contaminating the light signal that the sequencer's camera should read.
Sometimes a similar thing happens when something is wrong with the nucleotide's terminator cap. This process is called prephasing: two nucleotides bind during a single cycle, which leaves the DNA fragment one cycle ahead of the others.
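The cumulative effect of phasing and prephasing can be sketched with a very simplified model: assume that in every cycle a fixed fraction of fragments falls behind and a fixed fraction jumps ahead, and that a fragment never gets back in sync. The rates below are hypothetical values chosen only for illustration, not measurements from any instrument.

```python
def in_phase_fraction(cycle: int, phasing_rate: float, prephasing_rate: float) -> float:
    """Fraction of fragments in a cluster still in sync after `cycle` cycles."""
    # Each cycle, a fragment stays in sync with probability 1 - phasing - prephasing.
    return (1.0 - phasing_rate - prephasing_rate) ** cycle


# Hypothetical per-cycle rates (assumptions for this sketch).
PHASING_RATE = 0.002
PREPHASING_RATE = 0.001

for cycle in (1, 50, 100, 150, 300):
    fraction = in_phase_fraction(cycle, PHASING_RATE, PREPHASING_RATE)
    print(f"cycle {cycle:>3}: {fraction:.1%} of fragments in phase")
```

Even small per-cycle rates compound over a long read, which is consistent with quality dropping towards the end of the read.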
Phasing and prephasing can seriously lower the overall quality score. These problems mostly arise from faulty or improperly used reagents or, in the case of phasing only, from high GC content. The two main solutions are rerunning the experiment or trimming the parts of the reads with a low quality score.
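Trimming the low-quality end of a read can be sketched as follows. This is a deliberately simple approach that removes bases from the 3' end until it meets one at or above the threshold; dedicated trimming tools use more robust strategies such as sliding windows. The Phred+33 encoding and the threshold of 20 are assumptions, so check them against your own data.

```python
def phred_scores(quality_string: str, offset: int = 33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality_string]


def trim_low_quality_tail(seq: str, qual: str, min_quality: int = 20):
    """Remove bases from the 3' end while their quality is below min_quality."""
    scores = phred_scores(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return seq[:end], qual[:end]


# 'I' encodes Q40, '#' encodes Q2 and '!' encodes Q0 in Phred+33.
trimmed_seq, trimmed_qual = trim_low_quality_tail("ACGTACGTAC", "IIIIIII#!#")
print(trimmed_seq, trimmed_qual)  # ACGTACG IIIIIII
```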
Read through adapters
As we already know, during Illumina sequencing adapters are added to the ends of each fragment to be sequenced. The read length can sometimes be larger than the fragment itself, in which case the adapter is sequenced as well. Adapter sequence left in the reads causes numerous issues when your data is processed.
But rest assured, fellow researcher! Fortunately, this problem can be identified and resolved. You should suspect the presence of adapters in reads if, for example, the reads aren't mapping properly to the reference genome.
You can also run your FASTQ file through specialized software that scans the sequences for adapters. FastQC, for example, has a section dedicated to detecting the presence of adapters. After locating read-through adapters, we should get rid of them by trimming the reads, which we covered in the Trimmomatic topic. Of course, other tools can be used for that as well, such as Cutadapt or Skewer (for paired-end data).
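To show the idea behind adapter trimming, here is a minimal sketch that removes an adapter from the 3' end of a read using exact matching. It assumes the adapter starts with "AGATCGGAAGAGC", the commonly used Illumina adapter prefix; check which adapter your library kit actually uses. The function name and the minimum overlap of 3 bases are also assumptions. Real tools such as Trimmomatic or Cutadapt additionally tolerate sequencing errors in the adapter, which this sketch does not.

```python
ADAPTER = "AGATCGGAAGAGC"  # assumed adapter prefix; verify for your library kit


def trim_adapter(read: str, adapter: str = ADAPTER, min_overlap: int = 3) -> str:
    """Remove a 3' adapter (full or partial) from a read using exact matching."""
    # Case 1: the whole adapter occurs inside the read.
    index = read.find(adapter)
    if index != -1:
        return read[:index]
    # Case 2: the read ends with only the first few bases of the adapter.
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[: len(read) - overlap]
    return read


print(trim_adapter("ACGTTGCAAGATCGGAAGAGCACACGT"))  # ACGTTGCA (full adapter removed)
print(trim_adapter("ACGTTGCAAGATCG"))               # ACGTTGCA (partial adapter removed)
print(trim_adapter("ACGTTGCA"))                     # ACGTTGCA (nothing to trim)
```

A very small minimum overlap removes a few genuine bases by chance, while a large one leaves short adapter remnants behind, so the value is a trade-off.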
Conclusion
Even the best sequencing technologies have their limitations and errors, and the Illumina platform is no different. While it is very accurate, it can sometimes make mistakes. These errors can be due to a number of factors, including the quality of the DNA sample, the type of sequencer used, and even the user's expertise. Despite these potential errors, Illumina sequencing is still the best option for many applications: it is highly accurate and can produce large amounts of data per run at a low cost. For these reasons, it remains the gold standard in sequencing technology.