
FASTA, FASTQ

Provided by: Edvancium


FASTA and FASTQ are text-based formats for storing biological sequences. The FASTA format originated with the FASTA alignment software and has since become a standard in bioinformatics. FASTA files store nucleotide or amino acid sequences, while FASTQ files store sequencing reads together with a quality score for each base.

In this topic, we'll take a closer look at the formats themselves and discuss their features.

FASTA

The FASTA format is easy to parse with any text-processing tool. You can open a .fasta file in any text editor available on your computer (for example, WordPad). The extension varies because there is no universal standard; .fa, .fasta, .faa and .ffn are common. A typical file looks something like this:

Nucleotide FASTA:

>Seq3 [organism=Phalaenopsis equestris var. leucaspis]
CCTATACCTAATTTTCGGCGCATGAGCCGGAATGGTGGGTACCGCTCTAAGCCTCCTCATTCGAGCAGAA
CTAGGCCAACCCGGAGCCCTTCTGGGAGACGACCAAGTCTACAACGTGGTTGTCACGGCCCATGCCTTCG

Amino acid FASTA:

>sp|P01308|INS_HUMAN Insulin OS=Homo sapiens OX=9606 GN=INS PE=1 SV=1
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN

One .fasta file can contain several FASTA sequences, but each sequence in the file starts with a Definition Line, which has:

"greater-than" (">") sign as the first symbol;
a sequence description which, depending on the database or software you use, can include a sequence number, gene name, organism name, etc.;
an optional comment.

Any line that does not start with the ">" symbol is a Sequence Line, whose characters are IUPAC codes for nucleotides or amino acids.

So there are only two types of lines: definition lines and sequence lines. To facilitate indexing, every sequence line of a record except the last one should contain the same number of characters, usually between 50 and 80. This is a recommendation rather than a strict rule.
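The wrapping recommendation can be illustrated with a short Python sketch. The function name `format_fasta_record` and the default width of 60 are assumptions made for this example; they are not part of any standard.

```python
def format_fasta_record(header: str, sequence: str, width: int = 60) -> str:
    """Return one FASTA record with the sequence wrapped at `width` characters."""
    if width <= 0:
        raise ValueError("width must be positive")
    # Cut the sequence into fixed-size chunks; only the last one may be shorter.
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    # The definition line is the header prefixed with ">".
    return "\n".join([">" + header] + chunks) + "\n"


# A made-up record name and a short sequence, wrapped at 10 characters
# purely to make the line breaks visible.
print(format_fasta_record("example_seq", "CCTATACCTAATTTTCGGCGCATGAGCC", width=10), end="")
```

Fixed-width lines let an indexing tool compute where a given position sits in the file without reading every preceding character, which is why the recommendation exists.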

To avoid parsing problems, there are certain formatting rules: for example, you should place the ">" symbol only at the beginning of Definition Lines and you should NEVER put spaces between characters in sequence lines. :)
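Because there are only two kinds of lines, a reader for this format can be very small. The following is a minimal sketch rather than a production parser; `parse_fasta` is a hypothetical name, and the sequences in the example are shortened versions of the ones shown earlier.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each definition line (without the leading '>') to its full sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            # A definition line opens a new record.
            header = line[1:]
            records[header] = ""
        elif header is None:
            raise ValueError("sequence line found before any definition line")
        else:
            # Sequence lines of one record are joined back together.
            records[header] += line
    return records


example = """>Seq3 [organism=Phalaenopsis equestris var. leucaspis]
CCTATACC
TAATTTTC
>sp|P01308|INS_HUMAN Insulin
MALWMRLL
"""
for name, seq in parse_fasta(example.splitlines()).items():
    print(name, len(seq))
```

This sketch keeps the whole definition line as the key and silently overwrites a record if the same definition line appears twice; a real tool would likely handle both cases more carefully.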

However, FASTA isn't the only format used in bioinformatics nowadays. The FASTQ format, developed in the early 2000s, is also widely used.

FASTQ

High-throughput sequencing instruments (such as those from Illumina) commonly store their output in the FASTQ format. Each FASTQ record takes four lines and differs from regular FASTA in the following ways:

The first line begins with the '@' character instead of '>' and is followed by a sequence identifier and, sometimes, a description; everything from the '@' to the first whitespace character is the sequence identifier, and everything after it is the sequence description;
The second line contains the sequence itself;
The third line starts with '+' and can optionally repeat the same sequence identifier;
The fourth line encodes the quality scores as ASCII characters, one character per base.

Here's an example (from official documentation):

@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
;;;;;;;;;;;9;7;;.7;393333
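The four-line layout can be read with a few lines of Python. This is a minimal sketch that assumes every record occupies exactly four non-empty lines (no wrapped sequences); `FastqRecord` and `parse_fastq` are hypothetical names chosen for the example.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    description: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text written in the simple four-line layout."""
    # Drop blank lines so that records line up in groups of four.
    rows = [line.strip() for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("expected a multiple of four non-empty lines")
    for i in range(0, len(rows), 4):
        title, sequence, separator, quality = rows[i:i + 4]
        if not title.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at row {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality strings differ in length")
        # Everything up to the first whitespace is the identifier.
        parts = title[1:].split(None, 1)
        identifier = parts[0] if parts else ""
        description = parts[1] if len(parts) > 1 else ""
        yield FastqRecord(identifier, description, sequence, quality)


example = """\
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
;;;;;;;;;;;9;7;;.7;393333
"""
for record in parse_fastq(example.splitlines()):
    print(record.identifier, len(record.sequence))
```

The sketch counts lines in groups of four instead of searching for '@', because '@' and '+' are themselves valid quality characters and can appear at the start of a quality line.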

But why has FASTQ become so popular? In terms of sequencing, FASTQ solves a specific problem: since different sequencing technologies work differently, the confidence in each base call (the probability that a nucleotide was identified correctly) varies. This confidence is expressed as a Phred Quality Score. FASTA has no way of encoding it; a FASTA file can only store the sequence and the sequence name.

Phred Quality Scores

The fourth line of each FASTQ record contains information about sequence quality, the so-called Phred Quality Scores. Quality scores assign a confidence to each base call: every Phred Quality Score (Q) reflects the probability (P) of a sequencing error, meaning that the base call was incorrect. Using the equation below, we can calculate the Phred Quality Score:

$Q = -10\log_{10} P$

To calculate the probability of error from a Quality Score, we divide both sides of the equation by -10 and then, using the rules of logarithms, raise 10 to the power of the result:

$P=10^{\frac{-Q}{10}}$

For example, a Phred Quality Score of 40 means that the chance of this base being called incorrectly is 1 in 10000, so the accuracy is 99.99%.

Phred quality score	Probability of incorrect base call	Base call accuracy
10	1 in 10	90%
20	1 in 100	99%
30	1 in 1000	99.9%
40	1 in 10000	99.99%
50	1 in 100000	99.999%
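The two formulas translate directly into Python. The sketch below uses hypothetical function names and prints the probability and accuracy for each score listed in the table.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Q = -10 * log10(P): Phred Quality Score for an error probability."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in the interval (0, 1]")
    return -10 * math.log10(p)


for q in (10, 20, 30, 40, 50):
    p = phred_to_error_probability(q)
    print(f"Q={q}: P={p:g}, accuracy={(1 - p) * 100:g}%")
```

Floating-point arithmetic may introduce tiny rounding differences, so compare results with a tolerance (for example `math.isclose`) rather than with `==`.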

As we already know, quality scores are encoded with ASCII characters on the fourth line of a FASTQ record. What is ASCII, and why do we use it here?

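ASCII (American Standard Code for Information Interchange) is a popular character encoding that assigns a numerical value to each letter, digit, and punctuation mark. Because every quality score can then be written as a single character, the quality line stays as short as the sequence itself, which reduces the file size.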

Here's a decimal ASCII chart as an example:

Decimal ASCII chart

Although there are multiple ways to encode Phred Quality Scores with ASCII characters, the two most popular are called Phred+33 and Phred+64. The names sound quite strange until you understand how this works.

Phred+33

In this case, you add 33 to the Phred Quality Score, then use an ASCII Table to find what character corresponds to that sum. For example, a Quality Score of 35 would be represented as letter "D": 35 plus 33 equals 68, the ASCII character for which is "D." Try practicing the conversion between Quality Scores and ASCII characters and vice versa to get more comfortable with it.
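The conversion can be practiced in Python with the built-in `ord` and `chr` functions. This is a minimal sketch; the function and constant names are chosen for the example, and the upper limit of 93 is an assumption that follows from '~' (ASCII 126) being the last printable ASCII character.

```python
PHRED33_OFFSET = 33
MAX_PHRED33_SCORE = 126 - PHRED33_OFFSET  # '~' (126) is the last printable ASCII character


def encode_phred33(scores):
    """Turn a list of integer Quality Scores into a Phred+33 quality string."""
    chars = []
    for q in scores:
        if not 0 <= q <= MAX_PHRED33_SCORE:
            raise ValueError(f"score {q} cannot be encoded as Phred+33")
        chars.append(chr(q + PHRED33_OFFSET))
    return "".join(chars)


def decode_phred33(quality):
    """Turn a Phred+33 quality string back into a list of integer scores."""
    scores = []
    for ch in quality:
        q = ord(ch) - PHRED33_OFFSET
        if not 0 <= q <= MAX_PHRED33_SCORE:
            raise ValueError(f"character {ch!r} is not valid Phred+33")
        scores.append(q)
    return scores


print(encode_phred33([35]))   # the example from the text: 35 + 33 = 68
print(decode_phred33(";;3"))  # first three quality characters of the first read
```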

Phred+64

With this type of encoding, the rules are the same except that you add 64 to the Quality Score to determine the ASCII character. However, this encoding isn't really popular nowadays: most platforms (Illumina 1.8+, Sanger) have adopted Phred+33 encoding.
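Since the two encodings differ only in the offset, one decoding function can serve both. The sketch below also includes a simple heuristic for telling them apart, which is an assumption of this example rather than an established rule: any character below '@' (ASCII 64) cannot occur in Phred+64, so its presence points to Phred+33. When no such character is present, the string alone does not settle the question. All names are hypothetical.

```python
from typing import List, Optional


def decode_quality(quality: str, offset: int) -> List[int]:
    """Decode a quality string using the given offset (33 or 64)."""
    if offset not in (33, 64):
        raise ValueError("offset must be 33 or 64")
    scores = [ord(ch) - offset for ch in quality]
    if any(q < 0 for q in scores):
        raise ValueError("quality string contains characters below the offset")
    return scores


def guess_phred_offset(quality: str) -> Optional[int]:
    """Return 33 if the string can only be Phred+33, otherwise None (undecided)."""
    if not quality:
        return None
    # Characters below '@' would give negative scores under Phred+64.
    if min(ord(ch) for ch in quality) < 64:
        return 33
    return None


read_quality = ";;3;;;;;;;;;;;;7;;;;;;;88"
print(guess_phred_offset(read_quality))
print(decode_quality("hhh", 64))  # 'h' is ASCII 104, so each score is 40
```

In practice the guess is more reliable when it is made over many reads rather than a single quality string.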

Conclusion

FASTA and FASTQ are two standard formats for storing sequences. While FASTA files store only the sequence and its description, which keeps processing simple, FASTQ files contain additional information about each read, including Phred Quality Scores. These scores can be converted to error probabilities and ASCII characters using the formulas above and an ASCII table. Understanding how nucleotide sequences are stored and evaluated is quite important in the world of bioinformatics.
