FastQC

With high-throughput sequencers, such as Illumina instruments, tens of millions of sequences can be generated in one run, but the sequencing quality isn't perfect. For this reason, quality control should always be done before analyzing the sequences. It helps you detect errors or biases in the raw data that may affect its future use. For Illumina data, the quality check is usually performed with FastQC, a simple program that helps you spot problems originating either from the sequencer itself or from the starting library material.

About FastQC

FastQC is an open-source application developed to identify issues in datasets from high-throughput sequencing (mostly Illumina sequencing). It performs a series of analyses on one or more raw .fastq or .bam/.sam sequence files and then generates a QC report with a summary of the findings. You can find more information regarding FastQC on its official website.

First steps

To install FastQC, go to the project page and download the compiled package for Windows, macOS, or Linux.

There are two ways to run FastQC: interactively, for small-scale analysis of a few FASTQ files, or non-interactively, for integration into a larger pipeline that processes many files at once. The interactive way uses the FastQC graphical application, where you can produce a QC report and then save it as an .html file. The non-interactive way runs the program from your command line (in Terminal). In this topic, we will mostly describe how to use the interactive application and analyze the results, but here's an example of a non-interactive command:

fastqc sequence.fastq sequence2.fastq

When you run the program from the command line, it produces an .html report for each input file, which you can open in any web browser.
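For pipeline use, the command is usually assembled by a script rather than typed by hand. Below is a minimal Python sketch of that idea, not an official wrapper. It assumes that fastqc is on your PATH and uses the -o (output folder) and -t (threads) options; the function names and the qc_reports folder are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_fastqc_command(fastq_files, outdir, threads=1):
    """Return the argument list for one non-interactive FastQC run."""
    return ["fastqc", "-o", str(outdir), "-t", str(threads), *map(str, fastq_files)]


def run_fastqc(fastq_files, outdir, threads=1):
    """Run FastQC if it is installed; return True when the run happened."""
    if shutil.which("fastqc") is None:
        return False  # FastQC is not on PATH, so there is nothing to run
    # The output folder is created up front because FastQC expects it to exist.
    Path(outdir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_fastqc_command(fastq_files, outdir, threads), check=True)
    return True


# Hypothetical file names, matching the command shown above.
print(build_fastqc_command(["sequence.fastq", "sequence2.fastq"], "qc_reports", threads=2))
```

Keeping the command builder separate from the runner makes it easy to log or inspect the exact call before anything is executed.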

In the interactive application, to open one or more sequence files, start the program and select File > Open. Here is what it looks like:

FastQC welcome window

You can then select the files you want to analyze. If you have selected several files, each one will be opened in a separate tab at the top of the screen.

To make a permanent record of the analysis, you can create an HTML report (in non-interactive mode, it is created automatically). To do so, select File > Save Report from the main menu. This produces a report for the currently selected tab, named after the FASTQ file with _fastqc.html appended.

On the left side of the app (or at the top of the HTML report), you will find a summary of the analysis modules that were run, along with an indication of each module's result: green (tick), orange (exclamation point), or red (cross). A warning doesn't always mean your data is invalid; sometimes it just indicates that your dataset is a bit different from the standard one.

A summary of all of the FastQC modules

Alongside the HTML file, FastQC writes a zip file with the same base name. This file contains the graphs from the report and data files in case you want to perform your own evaluation of the raw data.
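The data files in the zip archive can be read by a script. The Python sketch below assumes a typical layout in which the archive contains a tab-separated summary.txt file with one line per module (status, module name, file name). Treat that layout, the function names, and the demo content as assumptions to check against your own output.

```python
import io
import zipfile


def read_fastqc_summary(zip_source):
    """Map each module name to its status (PASS, WARN or FAIL)."""
    statuses = {}
    with zipfile.ZipFile(zip_source) as archive:
        # Assumed layout: <name>_fastqc/summary.txt inside the archive.
        summary_name = next(n for n in archive.namelist() if n.endswith("summary.txt"))
        for line in archive.read(summary_name).decode().splitlines():
            if not line.strip():
                continue
            status, module, _filename = line.split("\t")
            statuses[module] = status
    return statuses


def make_demo_zip():
    """Build a tiny in-memory archive that mimics the assumed layout."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "sample_fastqc/summary.txt",
            "PASS\tBasic Statistics\tsample.fastq\n"
            "WARN\tPer base sequence content\tsample.fastq\n"
            "FAIL\tAdapter Content\tsample.fastq\n",
        )
    buffer.seek(0)
    return buffer


print(read_fastqc_summary(make_demo_zip()))
```

With a real report, you would pass the path of the zip file instead of the in-memory demo archive.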

Quality Check

Now, let's see what FastQC can do! We will briefly introduce the modules listed on the left side of the app screen.

Basic Statistics

Basic Statistics generates some simple statistics:

The Basic Statistics module

Most of the provided variables are quite clear, but not all of them. For example, in Sequence Length, the shortest and longest sequences are listed (if all sequences are the same length, only one value is reported), and %GC shows the overall guanine-cytosine content of all bases in all sequences.
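To make these values concrete, here is a rough Python sketch that derives the same kind of numbers from a FASTQ string. It assumes the simple four-lines-per-record layout; the function name and the demo reads are made up for illustration, and FastQC's own calculation may differ in detail.

```python
def basic_statistics(fastq_text):
    """Summarise a FASTQ string with four lines per record."""
    lines = fastq_text.strip().splitlines()
    sequences = lines[1::4]  # the second line of every record is the read
    lengths = [len(seq) for seq in sequences]
    gc_bases = sum(seq.upper().count("G") + seq.upper().count("C") for seq in sequences)
    total_bases = sum(lengths)
    shortest, longest = min(lengths), max(lengths)
    return {
        "total_sequences": len(sequences),
        # A single value is reported when all reads share one length.
        "sequence_length": str(shortest) if shortest == longest else f"{shortest}-{longest}",
        "percent_gc": round(100 * gc_bases / total_bases),
    }


DEMO_FASTQ = """@read1
GGCC
+
IIII
@read2
ATATGC
+
IIIIII
"""

print(basic_statistics(DEMO_FASTQ))
```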

Per Base Sequence Quality

Per Base Sequence Quality displays an overview of the range of quality values across all reads at each position in the FASTQ file:

Quality scores across all bases

A box plot with whiskers is drawn for each position. The Phred Quality Scores are shown on the graph's y-axis, and the x-axis shows the position in the read. Plot elements include:

  • Yellow boxes represent interquartile ranges (25-75%)

  • Whiskers at the bottom and top represent the 10th and 90th percentiles, respectively

  • Red line indicates median value

  • Blue line indicates the mean quality

The graph's background divides the y-axis into calls of very good quality (green), calls of reasonable quality (orange), and calls of low quality (red). On most platforms, call quality decreases as the run goes on, so base calls frequently fall into the orange zone at the end of a read. Researchers frequently choose to trim the parts of reads whose quality falls into the red zone.
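The numbers behind such a plot can be sketched in a few lines of Python. The example assumes the common Phred+33 encoding, in which each quality character maps to a score of its ASCII code minus 33; the function names and the three demo quality strings are illustrative only.

```python
import statistics


def phred_scores(quality_string, offset=33):
    """Decode one FASTQ quality line into Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality_string]


def per_base_quality(quality_strings):
    """Return (position, mean, median) for each 1-based read position."""
    decoded = [phred_scores(q) for q in quality_strings]
    longest = max(len(scores) for scores in decoded)
    summary = []
    for pos in range(longest):
        # Reads shorter than the current position simply do not contribute.
        column = [scores[pos] for scores in decoded if pos < len(scores)]
        summary.append((pos + 1, statistics.mean(column), statistics.median(column)))
    return summary


# Quality drops towards the end of these made-up reads.
DEMO_QUALITIES = ["II5", "I5#", "I?+"]
for position, mean, median in per_base_quality(DEMO_QUALITIES):
    print(position, mean, median)
```

The mean corresponds to the blue line and the median to the red line; quartiles and the 10th and 90th percentiles could be added in the same per-position loop.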

Per Sequence Quality Scores

Using this module, you can find out whether a subset of your sequences has universally low quality values. Phred Quality Scores (the mean for each read) are plotted on the x-axis, and the number of sequences is plotted on the y-axis. A low-quality subset is often the result of poorly imaged sequences (at the edge of the field of view, for example), but these should represent only a small portion of the total. A large number of low-quality sequences could indicate a systematic issue with the run or with part of it. The graph below shows a distribution with decent quality; typically, the highest peak should not fall below a Quality Score of 27:

Quality score distribution over all sequences
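The same distribution can be approximated by hand. This Python sketch averages the Phred scores of each read (Phred+33 encoding assumed) and counts reads per mean score; truncating the mean to an integer is a simplification of this sketch, and the names and demo data are hypothetical.

```python
from collections import Counter


def mean_quality_histogram(quality_strings, offset=33):
    """Count reads by their mean Phred score, truncated to an integer."""
    histogram = Counter()
    for quality in quality_strings:
        scores = [ord(char) - offset for char in quality]
        histogram[int(sum(scores) / len(scores))] += 1
    return histogram


def most_common_quality(histogram):
    """Return the mean quality shared by the largest number of reads."""
    return histogram.most_common(1)[0][0]


DEMO_QUALITIES = ["IIII", "IIII", "5555", "II55"]
histogram = mean_quality_histogram(DEMO_QUALITIES)
print(sorted(histogram.items()))
print("peak at quality", most_common_quality(histogram))
```

Comparing the peak position with the value of 27 mentioned above gives a quick pass or warn style check.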

Per Base Sequence Content

Per Base Sequence Content plots the proportion of each of the four DNA bases (guanine, adenine, thymine, cytosine) at each position in the read. Here's an example of a plot:

Sequence content across all bases

In a random library, the bases are expected to be distributed roughly equally across all read positions, so the lines in this plot should run parallel to each other rather than diverge.
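As a rough illustration of what is being plotted, the Python sketch below computes the percentage of each base at every read position for a small, made-up list of reads. The function name is hypothetical, and other characters such as N are ignored in the per-base percentages.

```python
def per_base_content(sequences):
    """Return, per position, the percentage of A, C, G and T calls."""
    longest = max(len(seq) for seq in sequences)
    table = []
    for pos in range(longest):
        # Only reads that are long enough contribute to this position.
        column = [seq[pos].upper() for seq in sequences if pos < len(seq)]
        table.append({base: 100 * column.count(base) / len(column) for base in "ACGT"})
    return table


DEMO_READS = ["ACGT", "ACGA", "TCGA", "GCGT"]
for position, content in enumerate(per_base_content(DEMO_READS), start=1):
    print(position, content)
```

In this toy input, positions 2 and 3 consist of a single base, which is exactly the kind of imbalance that would show up as diverging lines in the plot.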

Per Sequence GC Content

Here, each sequence's GC content is measured over its entire length, and the resulting distribution is compared to a modelled normal distribution of GC content. DNA with low GC-content is less stable than DNA with high GC-content, and a sequence with high GC-content will have a higher melting point. Many sequencers (including Illumina) have trouble reading high GC-content sequences. A broadly normal (bell-shaped) distribution of GC-content would be expected, with the central peak reflecting the GC content of the genome.

Here are two examples showing a good (first) and a bad (second) GC-content plot.

An example of an acceptable GC-content plot

An example of an unacceptable GC-content plot

How can we interpret an unusually-shaped distribution?

  • In some cases, bias could be due to a contaminated library; or, if we have DNA reads from different species with different average GC-content, we would see several peaks (which would be ok in that case);

  • Sharp, jagged peaks generally indicate overrepresented sequences or a small library;

  • Shifts in the normal distribution indicate a systematic bias that is independent of base position.
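The underlying distribution is easy to reproduce on a small scale. The following Python sketch computes a whole-number GC percentage for each read and counts how many reads share each value; the names and demo reads are illustrative, and the comparison with a fitted normal curve is left out.

```python
from collections import Counter


def gc_percent(sequence):
    """GC content of one read as a whole-number percentage."""
    seq = sequence.upper()
    return round(100 * (seq.count("G") + seq.count("C")) / len(seq))


def gc_histogram(sequences):
    """Count how many reads fall on each GC percentage."""
    return Counter(gc_percent(seq) for seq in sequences)


DEMO_READS = ["ATGC", "GGCC", "ATAT", "AGCT"]
print(sorted(gc_histogram(DEMO_READS).items()))
```

Plotting the histogram of a real file and looking for extra peaks or a shifted center mirrors the interpretation rules listed above.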

Per Base N Content

In this module, you can see the percentage of base calls at each position for which N (a base marked as unknown by the sequencer) was called. It is important to note that it's totally fine to have a small proportion of unknown bases in a sequence (at the end of it, for example). However, when this proportion rises above a few percent, it's possible that the analysis pipeline was not able to interpret the data well enough to make valid base calls. The plot below is an example of a standard outcome:

N content across all bases
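A small Python sketch of the same calculation, using made-up reads and a hypothetical function name, could look like this:

```python
def per_base_n_content(sequences):
    """Percentage of N calls at each read position."""
    longest = max(len(seq) for seq in sequences)
    percentages = []
    for pos in range(longest):
        # Only reads that are long enough contribute to this position.
        column = [seq[pos].upper() for seq in sequences if pos < len(seq)]
        percentages.append(100 * column.count("N") / len(column))
    return percentages


DEMO_READS = ["ACGN", "ACGT", "ACNN", "ACGT"]
print(per_base_n_content(DEMO_READS))
```

Here the unknown calls are concentrated at the end of the reads, which is the harmless pattern described above, although the percentages are exaggerated by the tiny input.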

Sequence Length Distribution

Here, the program creates a graph that displays the distribution of sequence fragment sizes in the file under analysis. The resulting graph will often show only a peak at one size:

Distribution of sequence lengths over all sequences

For FASTQ files with variable read lengths, the graph shows the relative amounts of each fragment size, so warnings (but not errors!) can be ignored here. Additionally, this module can be quite helpful in evaluating the results of trimming.
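The effect of trimming on this distribution can be sketched in Python. The trimmer below is a toy stand-in that only removes trailing N calls, not a real quality trimmer, and all names and reads are invented for the example.

```python
from collections import Counter


def length_distribution(sequences):
    """Return sorted (length, count) pairs for a list of reads."""
    return sorted(Counter(len(seq) for seq in sequences).items())


def trim_trailing_n(sequence):
    """Toy trimmer: drop unknown bases (N) from the 3' end only."""
    return sequence.rstrip("N")


raw_reads = ["ACGTNN", "ACGTAC", "ACNNNN"]
trimmed_reads = [trim_trailing_n(read) for read in raw_reads]
print(length_distribution(raw_reads))      # a single peak: every read has length 6
print(length_distribution(trimmed_reads))  # lengths are now spread over 2, 4 and 6
```

Before trimming there is one peak; afterwards the lengths vary, which is why a warning from this module is expected for trimmed data.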

Sequence Duplication Levels

This module determines the degree of sequence duplication in a library and plots the relative number of sequences at each duplication level.

The plot consists of two lines. The blue line depicts the distribution of duplication levels for the full sequence set. The red line shows the de-duplicated set: the proportions of distinct sequences that come from each duplication level in the original data. In the final plot, any sequences with more than 10 duplicates are grouped into bins, which gives a clear idea of the total duplication level without showing each individual duplication value.

Sequence duplication level

The majority of sequences in a library should appear on the left of the plot (on both lines). Spikes (usually on the blue line) tend to appear towards the right of the plot because of the larger number of duplicates per bin. Sometimes they can be removed with trimming, but if these peaks remain, they indicate that there are numerous highly duplicated sequences, which may be a sign of a contaminated set or of very serious technical duplication. For instance, a high level of duplication at the start or end of a sequence may indicate the presence of adapter reads or some difficulties with library preparation.

The module also determines the predicted overall loss of sequence in the case of deduplication of the library. It is shown at the top of the plot: "Percent of seqs remaining if deduplicated".
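The idea behind the duplication levels and the "remaining if deduplicated" figure can be sketched with exact counts on a small list of reads. FastQC's own procedure is more elaborate, so the Python code below is only an illustration with hypothetical names and demo data.

```python
from collections import Counter


def duplication_levels(sequences):
    """Map duplication level -> percentage of all reads at that level."""
    copies = Counter(sequences)          # read -> number of copies
    reads_per_level = Counter()
    for count in copies.values():
        reads_per_level[count] += count  # every copy counts towards its level
    total = len(sequences)
    return {level: 100 * n / total for level, n in sorted(reads_per_level.items())}


def percent_remaining_if_deduplicated(sequences):
    """Share of reads left after keeping one copy of each distinct read."""
    return 100 * len(set(sequences)) / len(sequences)


DEMO_READS = ["AAAA", "AAAA", "AAAA", "CCCC", "GGGG", "GGGG", "TTTT", "ACGT"]
print(duplication_levels(DEMO_READS))
print(percent_remaining_if_deduplicated(DEMO_READS))
```

The first function corresponds to the view over the full sequence set; counting distinct reads per level instead of copies would give the de-duplicated view.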

Overrepresented Sequences

Usually, there will be a wide variety of sequences in a library. If you find that a single sequence is significantly over-represented in the set, it is either highly biologically significant, the library is contaminated, or, perhaps, the library is not as diverse as you had thought. In this module, all sequences that account for more than 0.1 percent of the total are presented. The program will check for matches in a database of common contaminants for each over-represented sequence and report the best hit it finds. Finding a hit doesn't prove that this is where the contamination originates. Additionally, it's important to keep in mind that many adapter sequences are really similar to one another, which means you can have a hit reported that isn't technically accurate but has a very similar sequence to the actual match.

Here's an example of what it could look like:

A list of overrepresented sequences
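The 0.1 percent rule can be illustrated with a short Python sketch. It counts identical reads in a synthetic library and reports those above the threshold. The function names and the demo data are hypothetical, and the lookup against a contaminant database is not reproduced here.

```python
from collections import Counter
from itertools import islice, product


def overrepresented_sequences(sequences, threshold_percent=0.1):
    """Return (sequence, count, percent) for reads above the threshold."""
    total = len(sequences)
    hits = []
    for seq, count in Counter(sequences).most_common():
        percent = 100 * count / total
        if percent > threshold_percent:
            hits.append((seq, count, percent))
    return hits


def demo_reads():
    """2000 reads: 1995 distinct 6-mers plus one sequence repeated 5 times."""
    background = ["".join(p) for p in islice(product("ACGT", repeat=6), 1995)]
    return background + ["AGATCGGAAGAGC"] * 5


print(overrepresented_sequences(demo_reads()))
```

Each background read makes up 0.05 percent of the demo library and stays below the threshold, while the repeated sequence reaches 0.25 percent and is reported.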

Adapter Content

In the last module, the program checks whether the reads in the FASTQ file contain a significant amount of adapter sequences (technical sequences needed for the sequencing process). The outcome is displayed as a graph, with the amount of adapter content represented on the y-axis. Typically, an adapter present in more than 5% of all reads is considered a significant amount (it will produce a warning). There shouldn't be any adapters in the data you plan to analyze further.

Here's an example of a bad case: the red curve clearly shows that a large share of the reads contains adapter sequence:

The presence of common adapter sequences
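A curve of this kind can be sketched as a cumulative count: for every position, the share of reads in which the adapter has already been seen. The Python example below is a simplified illustration with invented reads and a hypothetical function name; the adapter string used in the demo, AGATCGGAAGAGC, is assumed here to be the Illumina universal adapter prefix.

```python
def adapter_content(sequences, adapter):
    """Cumulative % of reads in which the adapter starts at or before each position."""
    longest = max(len(seq) for seq in sequences)
    first_seen = [0] * longest
    for seq in sequences:
        start = seq.find(adapter)  # -1 when the adapter is absent
        if start != -1:
            first_seen[start] += 1
    percentages, running = [], 0
    for count in first_seen:
        running += count
        percentages.append(100 * running / len(sequences))
    return percentages


ADAPTER = "AGATCGGAAGAGC"  # assumed adapter prefix, used for illustration only
DEMO_READS = [
    "ACGT" + ADAPTER,
    "AC" + ADAPTER + "AA",
    "ACGTACGTACGTACGTA",
    "TTTTTTTTTTTTTTTTT",
]
print(adapter_content(DEMO_READS, ADAPTER))
```

In the demo, half of the reads contain the adapter, so the curve climbs far above the 5% level mentioned above.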

Conclusion

In this article, we covered the major features of FastQC: how to install it, how to open files and save reports, and what each analysis module shows. You should now be able to run it and interpret the results. Remember that every dataset is different, so FastQC results always have to be judged in the context of the data and the situation.
