
Bioinformatic pipelines

6-minute read

In this topic we will discuss the characteristics of bioinformatics pipelines.

What is a pipeline?

In programming, a pipeline refers to a set of sequential steps that automatically processes data. The output of one step is the input of the following step. A step is usually a program that has one specific function.
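As a minimal illustration, standard Unix tools can be chained into a pipeline with the `|` operator, where each program's output becomes the next program's input. The file name `reads.txt` and its contents are made up for this sketch:

```shell
#!/bin/sh
# Toy input data (made up): one "read" per line.
printf 'ACGT\nACGT\nTTGA\n' > reads.txt

# A three-step pipeline, each step a small program with one function:
# sort the lines, count how many times each distinct line occurs,
# then order the counts from most to least frequent.
sort reads.txt | uniq -c | sort -rn
```

Note that `uniq -c` only counts *adjacent* duplicate lines, which is exactly why the `sort` step must come before it: the order of the steps matters.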

The main advantage of a pipeline is throughput: different steps can work on different pieces of data in parallel, which greatly speeds up processing. It's like an assembly line in a factory, where workers simultaneously work on different parts of the manufacturing process. The assembly time of one specific product (for example, a phone) is the sum of the work times of all workers on the line, but each new phone comes off the line in the work time of a single worker (the slowest one).

Along with the advantages come disadvantages: all stages must be synchronized to produce such a result; a single error at any step may lose the data of all running processes; and, finally, operating a pipeline requires much more memory and computing power than executing the steps sequentially.

Bioinformatics pipelines

When you write a pipeline for your bioinformatics project, you will often include steps that someone has already implemented as a tool. Your task can then be reduced to writing a script that runs the existing tools in the right order. As a result, an important function of a bioinformatics pipeline is to correctly invoke the required tools and to coordinate their work.

Characteristics of bioinformatics tools

Biotools are most often written by scientists, because programs for bioinformatics are not usually commercialized: the number of users is very limited and their requests are very specific. There are several features of academic programming tools that make them both easy and difficult to work with:

  • tools are generally permissive-licensed software or even public-domain software (i.e. free)

  • it's often open source software so you can examine the code in detail and modify it however you want in your local copy

  • different tools can be written in different programming languages

  • authors often neglect their products and don't provide any support (but are often very open to answering questions via email)

  • tools can use different data formats, including custom ones

  • the algorithm can be based on very specific scientific knowledge, so it may take you a considerable amount of time to understand the concept

The above list of features leads to the following particularities of work with biotools:

  • bioinformatics pipelines often require steps that convert formats

  • each time you add a new tool to your pipeline, you need to check compatibility and dependencies, create environments, rewrite legacy code, etc.

  • to combine tools written in different languages in one pipeline, bioinformaticians often write shell scripts

  • tool manuals can be very brief and often require an understanding of the relevant biological area

  • sometimes you actually have to debug source code

  • you may get support not from the author, but from fellow users on forums
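The format-conversion steps mentioned above are often tiny shell snippets. Here is a sketch of a FASTQ-to-FASTA converter using `awk`; the file names and the toy read record are made up for this example:

```shell
#!/bin/sh
# Toy FASTQ record (4 lines per read): header, sequence, '+', qualities.
printf '@read1\nACGTACGT\n+\nIIIIIIII\n' > sample.fastq

# FASTQ -> FASTA: keep the header line (replacing '@' with '>') and the
# sequence line; drop the '+' separator and the quality line.
awk 'NR % 4 == 1 { sub(/^@/, ">"); print } NR % 4 == 2 { print }' \
    sample.fastq > sample.fasta

cat sample.fasta
```

A converter like this could sit between a tool that emits FASTQ and a downstream tool that only accepts FASTA.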

How to choose the right bioinformatics tool?

As you have probably guessed, you must choose the tools for your pipeline carefully. The main factors to consider are:

  • make sure that the tool does the exact task you expect it to do

  • don't forget to check the formats of incoming and outgoing data

  • tool quality is important. The indicators of high-quality tools are:

    • popularity (the more users that use the tool, the more it is reviewed in the process)

    • articles that are published by the authors along with the tool release (where they demonstrate its performance)

    • benchmark studies where scientists compare the performance of several tools that fulfill the same function

    • your own favorable opinion, if your experience and expertise allow you to form one

    • price (paid tools generally have reasonable quality)

  • whether the tool is implemented as local software or as a web application (a local tool is much easier to integrate into a pipeline, while a web application does not consume the memory and processing power of your local machine)

  • programming language that a tool is written in (in case you plan to modify its code)
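Once you have chosen your tools, it is worth verifying up front that each one is actually installed and on your `PATH`, so the pipeline fails fast instead of crashing halfway through. `check_tool` below is a hypothetical helper name; `sort` and `awk` stand in for real biotools such as `fastqc` or `spades.py`:

```shell
#!/bin/sh
# Fail fast if a required tool is missing from PATH.
check_tool() {
    command -v "$1" >/dev/null 2>&1 || {
        echo "ERROR: required tool '$1' not found on PATH" >&2
        return 1
    }
}

check_tool sort && check_tool awk && echo "all tools found"
```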

Example of bioinformatics pipeline

Let's take a look at a typical bioinformatics pipeline that performs genome assembly from Illumina data. The task is as follows: you are given raw Illumina reads, and you need to assemble them into a complete genome.

An example of a bioinformatics pipeline: various steps and tools are required to assemble a complete genome from raw Illumina reads

A basic Illumina pipeline starts with FastQC for a quality control check on the raw data. It helps you decide how to trim technical sequences and poor-quality stretches from the reads using Trimmomatic. Now the reads are ready for assembly (for example, with SPAdes). Then another quality control check follows, this time on the genome assembly (check out QUAST, and certainly don't forget about BUSCO). Finally, you can annotate the genome (build a gene map) with Prokka and load the result into a genome browser (such as IGV) to visualize it. Voila!
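The whole pipeline above can be glued together with a shell script. The sketch below only *prints* the commands it would run (a dry run), since the tools must be installed separately; the file names are illustrative assumptions, the Trimmomatic arguments are deliberately elided, and you should check each tool's manual before running for real:

```shell
#!/bin/sh
# Dry-run driver for the Illumina assembly pipeline described above.
# 'run' echoes each command instead of executing it; remove the echo
# (and install the tools) to run for real.
run() { echo "+ $*"; }

R1=reads_R1.fastq.gz   # forward reads (illustrative file name)
R2=reads_R2.fastq.gz   # reverse reads (illustrative file name)

run fastqc "$R1" "$R2"                          # QC on the raw reads
run trimmomatic PE "$R1" "$R2" ...              # trim adapters / low quality
run spades.py -1 trimmed_R1.fastq.gz -2 trimmed_R2.fastq.gz -o assembly
run quast.py assembly/contigs.fasta             # QC on the assembly
run busco -i assembly/contigs.fasta -m genome   # completeness check
run prokka assembly/contigs.fasta               # annotation
```

Echoing the planned commands first is a cheap way to review the whole pipeline before committing hours of compute to it.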

Conclusion

Now you have an idea of what's inside a bioinformatics pipeline and what to pay attention to when choosing tools for it. You can also choose a ready-made pipeline based on the same principles. Most likely, you can and will need to modify both the pipeline and the tools inside it according to your needs and the type of your raw data. All in all, bioinformatics provides you with complete freedom, but with freedom comes responsibility.
