
Trimmomatic


As we already know, data from sequencing experiments varies in quality and can contain contamination and other sequencing artifacts. Fortunately, it is possible to identify contaminants and remove them while leaving valid data intact. Here we introduce Trimmomatic, a widely used bioinformatics tool for filtering and trimming sequencing data.

What is Trimmomatic?

Trimmomatic is an open-source (meaning its source code is public and it is free for anyone to use) console app (meaning that you run it from a terminal). Trimmomatic is usually run after a bioinformatician has already done a quality check with FastQC or another QC analysis program. As its name suggests, Trimmomatic's main job is to trim technical sequences such as adapters, along with low-quality bases. The program has several trimming modes, one of which is designed specifically to detect "palindromes". This comes in handy with paired-end Illumina sequencing data, where we have separate FASTQ files for forward and reverse reads. The preprocessing results can then be checked again in a QC analysis program.

Installation

A note before we continue: if you have questions that go beyond the content discussed here, you are very welcome to check out the official Trimmomatic page.

You can download a binary zip package from the Trimmomatic website, or get the source code from GitHub and build it yourself.

If you have conda installed, run these commands:

conda create --name trimmomatic_env

conda activate trimmomatic_env

conda config --add channels bioconda

conda install --channel bioconda sra-tools trimmomatic

conda config --remove channels bioconda

Usage

Trimmomatic can trim both paired-end and single-end Illumina data. Let's use single-end data as an example.

The Trimmomatic website has a manual with a full description of its functionality. Here we cover only the most commonly used trimming steps:

  • LEADING:n – cut bases off the start of a read while their quality is below n (a user-defined threshold);

  • TRAILING:n – cut bases off the end of a read while their quality is below n (a user-defined threshold);

  • MINLEN:n – drop a read if it is shorter than n nucleotides after trimming;

  • SLIDINGWINDOW:n:m – scan the read from the 5' end with a window of n bases and take the average quality inside the window. While that average is m or more, move on and average the next n bases; as soon as the average drops below m, trim the read at that point.

Here's a picture of the trimming process:

Sliding window trimming process
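The sliding window idea can be sketched in a few lines of Python. This is a simplified illustration of the description above, not Trimmomatic's actual implementation: the function name sliding_window_trim and the toy quality values are made up for this example, and the real tool handles some edge cases differently.

```python
def sliding_window_trim(qualities, window, threshold):
    """Return how many bases to keep from the 5' end.

    Simplified reading of SLIDINGWINDOW:window:threshold: the read is cut
    at the start of the first window whose mean quality is below the
    threshold. Hypothetical helper, not Trimmomatic's exact code.
    """
    for start in range(0, len(qualities) - window + 1):
        chunk = qualities[start:start + window]
        if sum(chunk) / window < threshold:
            return start  # keep only the bases before the failing window
    return len(qualities)  # no window failed, so keep the whole read


# Toy per-base quality scores that decay towards the 3' end.
quals = [30, 32, 31, 29, 28, 20, 12, 10, 8, 5]
keep = sliding_window_trim(quals, window=4, threshold=15)
print(keep)          # number of bases kept
print(quals[:keep])  # qualities of the kept bases
```

With these toy values, the first window whose average falls below 15 starts at position 5, so the first five bases are kept.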

A note on adapter trimming: for various reasons (such as adapter dimer formation or short DNA fragments), adapter sequences can still appear in some reads. When a DNA fragment is shorter than the read length, both the forward and the reverse read run through the fragment and into the adapter at the other end. Trimmomatic calls this situation a "palindrome": the forward and reverse reads cover the same fragment, as reverse complements of each other, and each ends with part of an adapter. For adapter clipping, Trimmomatic has a special step, ILLUMINACLIP, which cuts adapters from the reads. Let's take a closer look at its arguments with an example:

ILLUMINACLIP:TruSeq3-PE.fa:2:30:10

  • TruSeq3-PE.fa – FASTA file with adapter sequences;

  • 2 – seed mismatches (the maximum mismatch count that still allows a full match to be performed);

  • 30 – palindrome clip threshold (how accurate the match between two 'adapter ligated' reads must be for paired-end palindrome read alignment);

  • 10 – simple clip threshold (how accurate the match between any adapter sequence and a read must be).

Both thresholds are alignment scores computed by Trimmomatic, not per-base Phred quality values.

There are also two optional parameters:

  • minAdapterLength – the minimum length of an adapter for it to be detected in palindrome mode;

  • keepBothReads – after a palindrome is detected and the adapter is removed, Trimmomatic drops the reverse read by default, because it contains the same information as the forward read; with this argument set to True, the reverse read is kept.
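To see how these arguments line up in the colon-separated step, here is a small sketch that assembles the ILLUMINACLIP string. The helper illuminaclip_step is hypothetical (Trimmomatic has no Python API), and the value 8 for minAdapterLength is only an example. Because the optional arguments are positional, keepBothReads can only be given when minAdapterLength is given too.

```python
def illuminaclip_step(fasta, seed_mismatches, palindrome_threshold,
                      simple_threshold, min_adapter_length=None,
                      keep_both_reads=None):
    """Assemble an ILLUMINACLIP step string (hypothetical helper)."""
    parts = ["ILLUMINACLIP", fasta, seed_mismatches,
             palindrome_threshold, simple_threshold]
    if min_adapter_length is not None:
        parts.append(min_adapter_length)
        if keep_both_reads is not None:
            parts.append("True" if keep_both_reads else "False")
    elif keep_both_reads is not None:
        # keepBothReads is the sixth field, so the fifth must be present.
        raise ValueError("keepBothReads requires minAdapterLength")
    return ":".join(str(p) for p in parts)


print(illuminaclip_step("TruSeq3-PE.fa", 2, 30, 10))
print(illuminaclip_step("TruSeq3-PE.fa", 2, 30, 10,
                        min_adapter_length=8, keep_both_reads=True))
```

The first call reproduces the step from the example above; the second appends the two optional fields.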

Example

Here's what an example command looks like (single-end data):

java -jar trimmomatic-0.35.jar SE -phred33 input.fq.gz output.fq.gz LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:70

Have to admit, it DOES look scary at first. Let's break it down! Basically, it says:

Use Trimmomatic (java -jar trimmomatic-0.35.jar; if you installed it with conda, run trimmomatic instead) and let it know that we are working with single-end (SE) sequence data. Our quality scores use Phred+33 encoding (this depends on how your FASTQ file was produced). Take input.fq.gz as the input file.

Next are the trimming steps:

  • For each read, cut bases with quality below 3 off the start (LEADING:3) and off the end (TRAILING:3);

  • Scan the read with a window of 4 bases and trim it where the average quality in the window drops below 15 (SLIDINGWINDOW:4:15);

  • After the trimming is done, check that the trimmed read is 70 nucleotides or longer (MINLEN:70); if it is not, drop the read from the data.

Finally, put the resulting reads into the output.fq.gz file.
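To make the Phred+33 encoding and the LEADING, TRAILING and MINLEN steps more concrete, here is a rough Python sketch on a made-up read. The function names and the toy sequence are assumptions for illustration only; this is not how Trimmomatic is implemented internally.

```python
def decode_phred33(quality_string):
    """Convert a Phred+33 FASTQ quality string to integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def trim_ends(qualities, leading, trailing):
    """Return (start, end) of the slice kept after LEADING and TRAILING."""
    start, end = 0, len(qualities)
    while start < end and qualities[start] < leading:
        start += 1  # drop low-quality bases from the 5' end
    while end > start and qualities[end - 1] < trailing:
        end -= 1    # drop low-quality bases from the 3' end
    return start, end


# Made-up read: '!' encodes quality 0, '#' encodes 2, 'I' encodes 40.
sequence = "NACGTACGTA"
quality = "#!IIIIII#!"

scores = decode_phred33(quality)
start, end = trim_ends(scores, leading=3, trailing=3)
trimmed = sequence[start:end]

print(scores)
print(trimmed)
print(len(trimmed) >= 70)  # MINLEN:70 check; False means the read is dropped
```

Here two low-quality bases are removed from each end, and the six bases that remain are far below the MINLEN:70 cutoff, so this toy read would be discarded.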

Results

The result is stored in the output.fq.gz file. With paired-end data, we get four output files instead, for example: output_forward_paired.fq.gz, output_forward_unpaired.fq.gz, output_reverse_paired.fq.gz, and output_reverse_unpaired.fq.gz. The paired files hold reads whose mate also survived trimming, while the unpaired files hold reads whose mate was dropped. You can then use any QC analysis program (such as FastQC) to check your result and compare it with the untrimmed data.
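For paired-end data, the command takes two input files followed by the four output files, in the order listed above. The sketch below only builds and prints such a command; it does not run Trimmomatic. The helper paired_end_command and the input file names input_forward.fq.gz and input_reverse.fq.gz are placeholders chosen for this example.

```python
import shlex


def paired_end_command(jar, forward, reverse, prefix, steps):
    """Build a Trimmomatic PE command as an argument list (hypothetical helper)."""
    outputs = [
        f"{prefix}_forward_paired.fq.gz",
        f"{prefix}_forward_unpaired.fq.gz",
        f"{prefix}_reverse_paired.fq.gz",
        f"{prefix}_reverse_unpaired.fq.gz",
    ]
    # Order: mode, encoding, two inputs, four outputs, then trimming steps.
    return ["java", "-jar", jar, "PE", "-phred33",
            forward, reverse, *outputs, *steps]


cmd = paired_end_command(
    "trimmomatic-0.35.jar",
    "input_forward.fq.gz",   # placeholder input names
    "input_reverse.fq.gz",
    "output",
    ["ILLUMINACLIP:TruSeq3-PE.fa:2:30:10", "LEADING:3", "TRAILING:3",
     "SLIDINGWINDOW:4:15", "MINLEN:70"],
)
print(shlex.join(cmd))
```

To actually run the command from Python, the same list could be passed to subprocess.run(cmd, check=True), assuming Java and the jar file are available on your machine.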

Conclusion

Now we know how helpful Trimmomatic can be in preparing our data for further analysis by trimming sequences. Trimmomatic is easy to install and runs directly in the console on single-end and paired-end datasets. The results are stored in a compressed file (gzip in our example), which can then be checked with FastQC.
