
Illumina quality control


Now you have enough theoretical knowledge to start your own analysis of sequencing data. Regardless of the nature of your data (DNA-seq, RNA-seq, ChIP-seq), further analysis is impossible without assessing the reliability of the data and filtering out irrelevant information. Here we will focus on the Illumina sequencing pipeline, but the basic principles taught in this step apply to data from other second-generation sequencers as well.

Retrieving raw reads

During base calling, an Illumina sequencer generates a binary BCL file that is converted to a FASTQ file. FASTQ files contain raw reads: nucleotide sequences, each accompanied by per-base quality scores. Retrieving raw reads from repositories (e.g. the Sequence Read Archive, SRA) is usually the first step of a bioinformatics analysis. Don't forget to review the experimental design and sequencing protocol before working with raw reads (a sample description is shown in the picture below).

Experiment description from SRA database
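To make the FASTQ layout concrete, here is a minimal Python sketch that reads records from a FASTQ file and decodes the Phred+33 quality string into integer scores. Each record occupies four lines: a header starting with @, the sequence, a separator line starting with +, and the quality string. The file name reads.fastq is just a placeholder.

```python
def read_fastq(path):
    """Yield (read_id, sequence, quality) tuples from a FASTQ file."""
    with open(path) as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:          # end of file
                break
            sequence = handle.readline().rstrip()
            handle.readline()       # the '+' separator line
            quality = handle.readline().rstrip()
            yield header[1:], sequence, quality


def phred_scores(quality):
    """Convert a Phred+33 (standard Illumina) quality string to integer scores."""
    return [ord(char) - 33 for char in quality]


for read_id, seq, qual in read_fastq("reads.fastq"):
    print(read_id, seq[:20], phred_scores(qual)[:20])
    break  # look at the first record only
```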

Paired-end reads. With paired-end sequencing, the fragments are sequenced from both sides. This approach results in two reads per fragment, with the first read in forward orientation and the second read in reverse-complement orientation. This technique gives us more information about each DNA fragment than single-end sequencing does.

Paired-end sequencing of a fragment: the inner distance between Read 1 and Read 2, insert size, and fragment size are shown

In paired-end sequencing, we expect two FASTQ files, as illustrated in the picture below. The order of reads matters: reads at the same position in both files are derived from the same fragment.

Files with forward or reverse reads and read length information stored in SRA database
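As a sketch of how the two files stay in step, the snippet below walks through both files in parallel and checks that the IDs of each pair agree. It reuses the read_fastq() helper from the previous sketch, and the file names are placeholders.

```python
def base_id(read_id):
    """Drop a trailing /1 or /2 mate suffix used by the older naming scheme."""
    name = read_id.split()[0]
    return name[:-2] if name.endswith(("/1", "/2")) else name


def paired_records(forward_path, reverse_path):
    """Yield matching (read1, read2) records from two synchronized FASTQ files."""
    for read1, read2 in zip(read_fastq(forward_path), read_fastq(reverse_path)):
        # Reads at the same position in both files come from the same fragment,
        # so their IDs must agree (up to the mate suffix).
        assert base_id(read1[0]) == base_id(read2[0])
        yield read1, read2


for read1, read2 in paired_records("sample_R1.fastq", "sample_R2.fastq"):
    print(read1[0], "<->", read2[0])
    break
```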

Read quality control

As you know, the sequencing process is always subject to errors; thus, once sequencing reads are obtained from the sequencing machine, they need to be pre-processed. The pre-processing steps include a quality check and data preparation to avoid mistakes in downstream analysis. Let's discuss the best practices in read quality control.

1. Initial quality control. FastQC is a commonly used tool for performing quality assessments on sequencing reads. It calculates statistics about the composition and quality of raw reads. Additionally, the MultiQC tool can generate a single report from many FastQC reports across multiple samples. It is important to remember that FastQC only highlights potential problems; it doesn't fix them.
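The following is not FastQC itself, only a toy illustration of one statistic it reports: the mean Phred quality at each base position across all reads. It reuses the read_fastq() helper from above, and the file name is a placeholder.

```python
def per_base_mean_quality(path):
    """Mean Phred quality at each base position across all reads in a file."""
    totals, counts = [], []
    for _, _, quality in read_fastq(path):
        for i, char in enumerate(quality):
            if i == len(totals):      # first time we see this position
                totals.append(0)
                counts.append(0)
            totals[i] += ord(char) - 33
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]


for position, mean in enumerate(per_base_mean_quality("reads.fastq"), start=1):
    print(f"base {position}: mean Q = {mean:.1f}")
```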

2. Checking and filtering contaminants. One of the steps of a sequencing experiment is the amplification of a very small amount of genetic material, which may include contamination from any source. The program FastQ Screen was designed to confirm that reads are derived from the expected organism. It uses alignment algorithms to check the reads against a set of genomes from different species that the user specifies in a target list. If you work with human samples and your lab mate does experiments with E. coli, you know which genome to check for contaminants! However, researchers often don't know the contamination source in advance. The tool Kraken, which was primarily developed for metagenomic studies, compares your sequences against a large pre-built database of bacterial, archaeal, and viral genomes. Moreover, Kraken is not restricted to its standard database: other genomes, such as eukaryotic genomes, can be added.

In addition, it is possible to keep only the reads that map to one genome and filter out unwanted reads. The following picture is an example of contamination filtering with FastQ Screen.

Filtering of human contamination in mouse samples
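Neither FastQ Screen nor Kraken works exactly like this, but the sketch below illustrates the underlying idea of k-mer-based screening: build an index of k-mers from a suspected contaminant sequence and flag reads that share many of them. The contaminant sequence, file name, and thresholds are placeholders, and the read_fastq() helper from above is assumed.

```python
K = 21  # k-mer length, an arbitrary illustrative choice


def kmers(sequence, k=K):
    """Return the set of all k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


# Stand-in for the k-mer index of a real contaminant genome (e.g. E. coli).
contaminant_kmers = kmers("ACGT" * 100)


def looks_contaminated(read_seq, threshold=0.5):
    """Flag a read if more than `threshold` of its k-mers hit the contaminant index."""
    read_kmers = kmers(read_seq)
    if not read_kmers:
        return False
    hits = sum(1 for kmer in read_kmers if kmer in contaminant_kmers)
    return hits / len(read_kmers) > threshold


flagged = [rid for rid, seq, _ in read_fastq("reads.fastq") if looks_contaminated(seq)]
print(f"{len(flagged)} reads look like contamination")
```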

3. Trimming and filtering. Based on the results of the quality check, you may want to trim or filter reads. Filtering usually refers to the removal of an entire read, while trimming makes reads shorter. These procedures often include:

  • Trimming of technical sequences. Technical adapters include flow cell binding sites, primer binding sites, and index sequences. An adapter appears on the 3'-end of a read when the insert is shorter than the read length and sequencing continues through the technical sequence.
  • Filtering primers and adapter dimers
  • Trimming low-quality bases from the end of a read
  • Filtering sequences that are too short

Low-quality bases are removed by trimming, and sequences that are too short are removed by the filtering step

Several tools can be used for read trimming: Trimmomatic, fastp, and Cutadapt. To perform quality trimming, you need to specify the quality threshold and the minimum read length (see the scheme above). There is a trade-off between having good-quality reads and having enough full-length reads. All in all, you may start with gentle trimming and check the result with FastQC.
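Trimmomatic, fastp, and Cutadapt offer far more options, but the simplified sketch below shows the basic idea of quality trimming: cut bases back from the 3' end while their quality is below a threshold, then drop reads that end up too short. The thresholds are illustrative only, and the read_fastq() helper from above is assumed.

```python
QUALITY_THRESHOLD = 20  # Phred score below which 3' bases are trimmed (illustrative)
MIN_LENGTH = 36         # reads shorter than this after trimming are dropped (illustrative)


def trim_3prime(sequence, quality, threshold=QUALITY_THRESHOLD):
    """Trim low-quality bases from the 3' end of a read."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - 33 < threshold:
        end -= 1
    return sequence[:end], quality[:end]


def trim_and_filter(records, min_length=MIN_LENGTH):
    """Yield quality-trimmed reads that are still long enough to keep."""
    for read_id, sequence, quality in records:
        sequence, quality = trim_3prime(sequence, quality)
        if len(sequence) >= min_length:
            yield read_id, sequence, quality


kept = sum(1 for _ in trim_and_filter(read_fastq("reads.fastq")))
print(f"{kept} reads kept after trimming and length filtering")
```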

Note that in paired-end sequencing, reads from one spot are treated together: if one of the reverse reads is removed, its corresponding forward read should be removed too, and vice versa. As a result, you will use only high-quality trimmed reads in the FASTQ files for further analysis.
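Building on the sketches above (paired_records(), trim_3prime(), MIN_LENGTH), keeping the two files synchronized might look like this: a pair is kept only if both mates still pass the length filter after trimming. The file names are placeholders.

```python
def trim_pairs(forward_path, reverse_path, min_length=MIN_LENGTH):
    """Trim both mates of each pair and keep the pair only if both survive."""
    for read1, read2 in paired_records(forward_path, reverse_path):
        seq1, qual1 = trim_3prime(read1[1], read1[2])
        seq2, qual2 = trim_3prime(read2[1], read2[2])
        if len(seq1) >= min_length and len(seq2) >= min_length:
            yield (read1[0], seq1, qual1), (read2[0], seq2, qual2)


kept_pairs = sum(1 for _ in trim_pairs("sample_R1.fastq", "sample_R2.fastq"))
print(f"{kept_pairs} read pairs kept")
```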

Short read error correction

Trimming reads unfortunately results in a loss of information. Therefore, there is motivation to correct short-read imperfections, such as substitution errors, which are one of the major sources of error in Illumina sequencing technology. Fortunately, the low sequencing cost allows experiments to produce many reads and obtain highly redundant coverage. Coverage is defined as the number of reads that 'cover', or align to, a particular genome region. This redundancy makes it possible to catch mistakes. The concept behind overlap-based error correction is illustrated below. However, the error correction procedure is intensive in both computation and memory usage due to the large number of short reads. The Musket algorithm is an example of a corrector for Illumina short-read data.

Most reads aligning to a particular genomic region have guanine (G) at a given position, so an individual read with cytosine (C) at that position can be corrected
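Musket is a k-mer-spectrum-based corrector; the heavily simplified sketch below shows only the general idea behind this family of tools, not Musket's actual algorithm: k-mers that occur many times in the data are considered trusted, and a base whose substitution turns an untrusted k-mer into a trusted one is a candidate for correction. The k-mer size and count threshold are illustrative, and the read_fastq() helper from above is assumed.

```python
from collections import Counter

KMER_SIZE = 15  # k-mer length (illustrative)
MIN_COUNT = 3   # k-mers seen fewer times than this are considered untrusted


def count_kmers(reads, k=KMER_SIZE):
    """Count every k-mer across a collection of (id, sequence, quality) reads."""
    counts = Counter()
    for _, sequence, _ in reads:
        for i in range(len(sequence) - k + 1):
            counts[sequence[i:i + k]] += 1
    return counts


def correct_read(sequence, counts, k=KMER_SIZE, min_count=MIN_COUNT):
    """Try single-base substitutions that turn untrusted k-mers into trusted ones."""
    bases = list(sequence)
    for i in range(len(bases) - k + 1):
        kmer = "".join(bases[i:i + k])
        if counts[kmer] >= min_count:
            continue                         # this k-mer is already trusted
        fixed = False
        for pos in range(k):
            for base in "ACGT":
                candidate = kmer[:pos] + base + kmer[pos + 1:]
                if candidate != kmer and counts[candidate] >= min_count:
                    bases[i + pos] = base    # accept the single-base fix
                    fixed = True
                    break
            if fixed:
                break
    return "".join(bases)


reads = list(read_fastq("reads.fastq"))
kmer_counts = count_kmers(reads)
corrected = [correct_read(sequence, kmer_counts) for _, sequence, _ in reads]
```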

Conclusion

In this topic, we discussed the quality assessment of read files and methods for improving the reliability of raw data. Raw reads are stored in FASTQ files in repositories. First, we check a report of overall quality (FastQC) and make a decision on filtering and trimming. Contamination can be checked with FastQ Screen (known contamination source) or Kraken (unknown contamination source) and then filtered out. Next, we may trim low-quality ends and adapters and filter out short sequences with Trimmomatic or a similar tool. An alternative to read trimming is an error correction procedure. However, correctors are resource-intensive programs and they may not always produce high-quality results. All in all, after the pre-processing steps we end up with high-quality data that is relevant for our downstream analysis.
