Now you have a strong theoretical foundation to start your own analysis of sequencing data. Regardless of the nature of your data (DNA-seq, RNA-seq, ChIP-seq), further analysis is impossible without assessing the reliability of the data and filtering out irrelevant information. Here we will focus on the Illumina sequencing pipeline, but the basic principles taught in this step apply to data from other second-generation sequencers as well.
Retrieving raw reads
During base calling, an Illumina sequencer generates a binary BCL file that is converted to a FASTQ file. FASTQ files contain raw reads: nucleotide sequences accompanied by per-base quality scores. Retrieving raw reads from repositories (e.g. the Sequence Read Archive) is usually the first step of a bioinformatics analysis. Don't forget to review the experimental design and sequencing protocol before working with raw reads (a sample description is illustrated in the picture below).
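To make the format concrete, here is a minimal sketch of reading a single FASTQ record and decoding its Phred+33 quality string. The record itself is invented for illustration, not real data:

```python
# Minimal sketch: one FASTQ record (four lines: header, sequence,
# separator, quality string). The read below is made up.
record = (
    "@SRR000001.1 example_read length=10\n"
    "GATTACAGAT\n"
    "+\n"
    "IIIIIHHH##\n"
)

header, seq, sep, qual = record.strip().split("\n")

# Each quality character encodes Q = ord(char) - 33 (Phred+33 offset).
# Q = 30 means a 1-in-1000 chance that the base call is wrong.
scores = [ord(c) - 33 for c in qual]
print(scores)  # 'I' decodes to 40, 'H' to 39, '#' to 2
```

Note how the low-quality `#` characters sit at the 3' end of the read, which is the typical pattern for Illumina data.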
Paired-end reads. With paired-end sequencing, the fragments are sequenced from both ends. This approach yields two reads per fragment, with the first read in forward orientation and the second read in reverse-complement orientation. Compared to single-end sequencing, this technique gives us more information about each DNA fragment.
In paired-end sequencing, we expect two FASTQ files, as illustrated in the picture below. The order of reads matters: the records at the same position in both files are derived from the same fragment.
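The pairing can be checked programmatically. This is a small sketch that verifies the nth record in each file refers to the same fragment; the read IDs are invented, and real Illumina headers may mark mates differently (e.g. with a separate mate field rather than a `/1`/`/2` suffix):

```python
# Sketch: verifying that two paired FASTQ files are in sync.
# Headers are hypothetical examples with "/1" and "/2" mate suffixes.
r1_headers = ["@frag1/1", "@frag2/1", "@frag3/1"]
r2_headers = ["@frag1/2", "@frag2/2", "@frag3/2"]

def same_fragment(h1: str, h2: str) -> bool:
    # Strip the mate suffix and compare the fragment names.
    return h1.rsplit("/", 1)[0] == h2.rsplit("/", 1)[0]

in_sync = all(same_fragment(a, b) for a, b in zip(r1_headers, r2_headers))
print(in_sync)  # True: each pair of records comes from one fragment
```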
Read quality control
As you know, the sequencing process is always subject to errors, so once reads are obtained from the sequencing machine, they need to be pre-processed. The pre-processing steps include a quality check and data preparation to avoid mistakes in downstream analysis. Let's discuss the best practices in read quality control.
1. Initial quality control. FastQC is commonly used software for performing quality assessment of sequencing reads. It calculates statistics about the composition and quality of raw reads. Additionally, the MultiQC tool can combine FastQC reports from multiple samples into a single report. It is important to remember that FastQC only highlights potential problems: it doesn't fix them.
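One of the central statistics in a FastQC report is the per-base quality across all reads. The sketch below computes that statistic by hand for a few invented quality strings; it only illustrates the idea behind the report, not FastQC's actual implementation:

```python
# Sketch: mean Phred quality per sequencing cycle, the kind of statistic
# shown in FastQC's "Per base sequence quality" plot.
# Quality strings are invented, Phred+33 encoded, all the same length.
quals = ["IIIIHH##", "IIIHHH#!", "IIIIIH#!"]

def per_position_mean(quality_strings):
    means = []
    for i in range(len(quality_strings[0])):
        scores = [ord(q[i]) - 33 for q in quality_strings]
        means.append(sum(scores) / len(scores))
    return means

means = per_position_mean(quals)
# Quality typically degrades toward the 3' end, as it does here.
print(means)
```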
2. Checking and filtering contaminants. One of the steps of a sequencing experiment is the amplification of a very small amount of genetic material, so contamination from any source is amplified along with it. The program FastQ Screen was designed to confirm that reads are derived from the expected organism. It uses alignment to screen reads against a set of genomes that the user specifies in a target list. If you work with human samples and your lab mate does experiments with E. coli, you know what genome to check for contaminants! However, researchers often don't know the contamination source in advance. The tool Kraken, which was primarily developed for metagenomic studies, compares your reads against a large pre-built database of bacterial, archaeal, and viral genomes. Moreover, Kraken is not restricted to its standard database: other genomes, such as eukaryotic genomes, can be added.
In addition, it is possible to keep only the reads that map to one genome and filter out the unwanted ones. The following picture is an example of contamination filtering with FastQ Screen.
3. Trimming and filtering. Based on the results of the quality check, you may want to trim or filter reads. Filtering usually refers to the removal of an entire read, while trimming makes reads shorter. These procedures often include:
- Trimming of technical sequences. Technical adapters include flow cell binding sites, primer binding sites, and index sequences. An adapter appears on the 3'-end of a read when the insert is shorter than the read length and sequencing continues through the technical sequence.
- Filtering primers and adapter dimers
- Trimming bad-quality reads from the end of a sequence
- Filtering sequences that are too short
There are several tools for read trimming: Trimmomatic, fastp, and Cutadapt. To perform quality trimming, you need to specify a quality threshold and a minimum read length (see the scheme above). There is a trade-off between keeping only good-quality bases and keeping enough full-length reads. All in all, you may start with gentle trimming and check the result with FastQC.
Note that in paired-end sequencing, reads from the same fragment are treated together: if a reverse read is removed, its corresponding forward read should be removed too, and vice versa. As a result, only high-quality trimmed reads remain in the FASTQ files for further analysis.
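The trimming and pair-filtering logic can be sketched in a few lines. The threshold and minimum length below are illustrative choices, not the defaults of any particular tool, and the reads are invented:

```python
# Sketch of 3'-end quality trimming plus paired filtering, in the spirit
# of tools like Trimmomatic or fastp. Parameters are arbitrary examples.
Q_THRESHOLD = 20   # trim bases below Q20 from the 3' end
MIN_LENGTH = 5     # drop reads shorter than this after trimming

def trim_3prime(seq, qual):
    # Walk back from the 3' end while the Phred+33 quality is too low.
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - 33 < Q_THRESHOLD:
        end -= 1
    return seq[:end], qual[:end]

def keep_pair(r1, r2):
    # A pair survives only if BOTH mates pass the length filter.
    (s1, q1), (s2, q2) = trim_3prime(*r1), trim_3prime(*r2)
    if len(s1) >= MIN_LENGTH and len(s2) >= MIN_LENGTH:
        return (s1, q1), (s2, q2)
    return None  # discard both mates together

pair = (("GATTACAGG", "IIIIIII##"), ("CCTGTAATC", "IIIII####"))
result = keep_pair(*pair)
print(result)  # (('GATTACA', 'IIIIIII'), ('CCTGT', 'IIIII'))
```

Discarding both mates together, as `keep_pair` does, is what keeps the two output FASTQ files in sync.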
Short read error correction
Trimming reads unfortunately results in a loss of information. This motivates correcting short-read imperfections such as substitution errors, which are one of the major sources of error in Illumina sequencing technology. Fortunately, the low sequencing cost allows experiments to produce enough reads for highly redundant coverage. Coverage is defined as the number of reads that 'cover', or align to, a particular genome region. This redundancy makes it possible to catch mistakes. The concept behind overlap-based error correction is illustrated below. However, the error correction procedure is intensive in both computation and memory usage due to the large number of short reads. The Musket algorithm is an example of an error corrector for Illumina short-read data.
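The intuition behind redundancy-based correction can be shown with a toy pileup: reads covering the same region vote on each position, and a lone mismatch is outvoted. This is only the underlying idea; real correctors such as Musket work on k-mer spectra rather than an explicit pileup, and the aligned reads below are invented:

```python
# Toy illustration: five reads covering the same genome region, where
# read 3 carries a substitution error. A per-column majority vote
# recovers the correct base thanks to redundant coverage.
from collections import Counter

pileup = [
    "GATTACA",
    "GATTACA",
    "GATGACA",   # sequencing error at position 3: T -> G
    "GATTACA",
    "GATTACA",
]

consensus = "".join(
    Counter(column).most_common(1)[0][0]   # majority base per column
    for column in zip(*pileup)             # iterate over columns
)
print(consensus)  # GATTACA: the erroneous G is outvoted
```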
Conclusion
In this topic, we discussed quality assessment of read files and methods for improving the reliability of raw data. Raw reads are stored in FASTQ files in repositories. First, we check a report of overall quality (FastQC) and decide on filtering and trimming. Contamination can be checked with FastQ Screen (known contamination source) or Kraken (unknown contamination source) and then filtered out. Next, we may trim bad-quality ends and adapters, and filter out sequences that are too short, with Trimmomatic or a similar tool. An alternative to read trimming is an error correction procedure. However, correctors are resource-intensive programs and may not always produce high-quality results. All in all, after the pre-processing steps we end up with high-quality data suitable for downstream analysis.