Read Quality Control. Stage 1/6

The FASTQ format

Description

We see the enthusiasm in your eyes. Let's start digging into bioinformatics data!

Bioinformaticians obtain their data from a device called a sequencer. A sequencer is a magic machine that transforms genetic material into text. For instance, it can turn your DNA into a sequence of letters: A, T, C, and G. In this project, we will work with bacterial DNA sequencing data.

The first data type you need to deal with is lightly processed sequencing data from an Illumina sequencer. Such data is stored in text files in the FASTQ format.

A FASTQ file is a plain text file, often a huge one, made up of four-line parts. To get things straight, let's take a look at one such part. Each four-line part is called a read:

part of the FASTQ file

The first line starts with @ and identifies the read. In Illumina output, it collects information about the instrument ID, run number, flowcell ID, lane, tile, x-pos (the X-coordinate of a cluster), y-pos (the Y-coordinate), UMI, read number, filtering status, control number, and index of the read. Files from other sources may carry a simpler identifier, as in the Example section. The second line is the sequence itself, built from nucleotides (letters A, T, C, and G). The third line starts with a + and either repeats the data from the first line or stores additional information about the read. Finally, the last line encodes the quality of each nucleotide, one character per nucleotide, so it has the same length as the sequence. If you feel curious, the Wikipedia article on the FASTQ format can help you delve deeper.
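To make the four-line layout concrete, here is a minimal Python sketch that splits one read into named fields and checks the markers described above. The names `Read` and `parse_read` are hypothetical helpers introduced only for illustration; they are not part of the project or of any library.

```python
from typing import NamedTuple


class Read(NamedTuple):
    """One four-line FASTQ part (illustrative structure, not a library type)."""
    header: str     # line 1, starts with '@'
    sequence: str   # line 2, the nucleotides
    separator: str  # line 3, starts with '+'
    quality: str    # line 4, one quality character per nucleotide


def parse_read(lines):
    """Build a Read from exactly four lines; raise ValueError otherwise."""
    if len(lines) != 4:
        raise ValueError("a read consists of exactly four lines")
    # Drop the trailing newline that file iteration leaves on each line.
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("unexpected FASTQ line markers")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return Read(header, sequence, separator, quality)
```

Passing the four lines of any read from a FASTQ file to `parse_read` gives you a `Read` whose `sequence` and `quality` fields always have the same length.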

Objectives

  • Take the path to the FASTQ file as input;
  • Open the file and print only the four lines that belong to the first read (see the Example section for the output format).

Example

The greater-than symbol followed by a space (> ) represents the user input. Note that it's not part of the input.

Example 1: contents of the test.fastq file

@SRR16506265.1071862 1071862 length=76
CGCTGGGTCTGTCGCTGGTCACCCTGTTGTTTATGACTACCGCCCTGCTGGGCTGGTACTACGTTTTGCCGTTCGT
+SRR16506265.1071862 1071862 length=76
C9CCC>CGEG9ECEDFGGEFCD@8FGFGGGCGEEGFFGFFCFGGGGGFFGGGCDGGDG9F998FF@EECF@C:FGF
@SRR16506265.1071863 1071863 length=76
GTGTGGGCCAGCGTCATATCAACCTGTGACGCCCCCAAATCACCAAAACGGAATACACGCACTGGCTGCCAGCGTT
+SRR16506265.1071863 1071863 length=76
CCCC@--CF;-@:FC@EAFFFGGGGGGGGGGGEEFGGGGGGGGFGGGGGGEGGFGGGGGGGGGFGGGGGGGGFGFD

Program output

> test.fastq
@SRR16506265.1071862 1071862 length=76
CGCTGGGTCTGTCGCTGGTCACCCTGTTGTTTATGACTACCGCCCTGCTGGGCTGGTACTACGTTTTGCCGTTCGT
+SRR16506265.1071862 1071862 length=76
C9CCC>CGEG9ECEDFGGEFCD@8FGFGGGCGEEGFFGFFCFGGGGGFFGGGCDGGDG9F998FF@EECF@C:FGF
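
One possible way to get the output shown above is sketched below in Python. It assumes the file is well-formed, so the first read is simply the first four lines; `first_read` and `print_first_read` are hypothetical names chosen for this sketch, and your own solution may be organized differently.

```python
from itertools import islice


def first_read(path):
    """Return the first four lines of the file at `path`, without newlines."""
    with open(path, encoding="utf-8") as fastq:
        # islice stops after four lines, so a huge file is never read in full.
        return [line.rstrip("\n") for line in islice(fastq, 4)]


def print_first_read(path):
    """Print the first read, one line of the read per output line."""
    print(*first_read(path), sep="\n")
```

Calling `print_first_read(input())` and typing `test.fastq` would then print the four lines of the first read. Reading lazily with `islice` is a design choice that matters for real sequencing files, which can be far too large to load into memory at once.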