Natural scienceBioinformaticsData and Tools

SAM & BAM formats

11 minutes read

In this topic, we will discuss the Sequence Alignment Map (SAM) formatted file and its equivalent BAM file. These are file formats commonly used to store information after aligning reads to a reference sequence. The SAM/BAM file format is a standard for storing alignments and is the expected output of all widely used alignment programs. Additionally, many tools require it as an input. Let's explore the format specification and handy tools for manipulating alignments in the SAM/BAM format.

SAM format at a glance

As you know, alignment (or mapping) is a key method in bioinformatics that is essential for comparing sequence data. With next-generation sequencing data, researchers deal with pieces of sequence (e.g. Illumina reads), which can be aligned back to the reference sequence. Researchers need effective ways to store millions of mapped reads. Before the 1000 Genomes Project Consortium sought to standardize alignment format, sequence aligners used disparate outputs that were specific to the researcher and hard to work with. Now, Sequence Alignment Map (SAM) is an accepted text-based standard for storing biological sequences alignments. Most often it is generated as a human-readable version of its sister BAM format, which stores the same data in a compressed, binary form.

General file structure. A SAM file consists of header section followed by the alignment section, as illustrated in the picture below. The header is optional and may contain general information about the reference sequence, alignment program, and other technical details. The alignment section contains the information for each sequence about where and how it aligns to the reference sequence. It consists of multiple lines (generally, one line for each read) and 11 required tab-separated columns. Before starting, arm yourself with SAM format specification.

SAM format: header and an alignment section

Header. All lines in the header start with the symbol "@" followed by a 2-letter code. For instance, the "@SQ" line stores information about names (SN) and lengths (SL) of reference sequences to which reads were aligned. For example, for a hypothetical organism with 1 chromosome of length 1.000 bp we would see the following line: @SQ SN:chr1 LN:1000.

Alignment section. As mentioned above, the alignment section has 11 mandatory columns and possibly some optional ones. The following table is a brief overview of the alignment section fields. Let's go through each field in turn.

The description of alignment section fields

SAM file content

Col 1, QNAME: A read identifier. Paired-end (PE) reads will both have the same QNAME, as they came from the same piece of DNA. Note, that a QNAME '*' indicates the information is unavailable.

Col 2, FLAG: A number that reflects features of the alignment (if this read mapped, if this read paired etc.). Although it appears as a single number, it actually contains several pieces of information. Let's look at how to code multiple attributes in a compact way. Imagine an 11-bit number: 00000000000, where the rightmost bit is the first one. Each bit is a binary (True-False) answer to a particular question (see the table below from the tutorial). Let's construct the value by the rule: if the flag is false, add nothing; if it's true, add 2 raised to the power of the bit position. Here is an example of the calculation of flag value for the read pair below. First, reads are paired, so 'zero' statement (flag) is true — add 2 to the zero power. The first flag is also true — add 2 to the power one. The next flag is false — skip it, and so on.

An example of FLAG calculation

An overview of the different properties that can be encoded in the FLAG field

In single-end (SE) experiments, flags asking about mate read are not valid, therefore single-end reads yield only three possible values — 0, 4, or 16. Although the flag value is unambiguous, the decoding is not obvious at first sight. The picard tool is a very useful website to quickly translate FLAG integers into plain English explanations.

Col. 3, RNAME: Name of the reference sequence. It is often a chromosome ID.

Col. 4, POS: Number that indicates the leftmost position, from which the alignment starts. In other words, the position with the lowest numeric coordinate. You can see an example on the picture referred to Col.6. Note, that a value 0 means that the read has not been mapped.

Col. 5, MAPQ: Quality value of the mapping process. In theory, it should be a Phred-scaled probability that the position of the read is incorrect:

MAPQ=10×log10Pr(Mapping position is wrong)MAPQ = {-10 \times \log_{10}{Pr(\textmd{Mapping\ position\ is\ wrong})}}

In practice, MAPQ scale and its meaning are completely dependent on the aligner used and range from 0 to 37 — 42. The intuition behind the quality score is that a uniquely and exactly mapped read should receive the highest score. In contrast, reads with multiple mapping locations and/or a bunch of mismatches should have score closer to 0. You may want to filter out reads with low MAPQ, but what is the correct threshold for this procedure? Actually, threshold is very dependent on a score calculation and your desired 'strictness'. For example, when you work with cancerous tissue, it may be rich with deletions, mismatches, and other biological variations in addition to sequencing errors. In this case, MAPQ >= 20 or even 10 may be too strict, while these thresholds may work well in other cases. Note, that here, a MAPQ value of 255 indicates that the score is unavailable.

Col. 6, CIGAR string: As we mentioned above, mapped reads may not be an exact piece of a reference sequence. CIGAR nomenclature is widely used to describe all of the differences between the read and reference in a compressed form. A CIGAR string looks like a chain of operations (single-letter), each proceeded with a number of nucleotides. Let's explore operations from the illustration below.

CIGAR representation of an alignment

1. 'M' stands for alignment of two letters — no matter whether it is a match or mismatch.

2. 'I' and 'D' are reverse operations relative to each other. You will see 'D' when the letter is present in the reference sequence, but is absent in the mapped one. Otherwise, when an additional letter is present only in the mapped sequence — 'I' is used.

3. 'S' and 'H' describe clipped alignment, when subsequences at the ends are not aligned to reference. However, hard clipping, unlike soft clipping, actually removes bases from the sequence. Thus, hard-masked bases do not appear in Col.10, SEQ, while soft-masked do.

4. 'N' operation represents an intron (see the topic about Transcription) and is used only for mRNA-to-genome alignment.

While you know alignment start position (Col.4, POS), end position is not explicitly present in the table. To determine the end coordinate of the aligned sequence, you need to parse CIGAR string (Col.6). Length of read equals the sum of the M, I, and S operations.

Col. 7, RNEXT: Mate's reference sequence if dealing with paired-end (PE) reads. It is a mirror of Col.3 RNAME for the second read in pair.

Col. 8, PNEXT: The leftmost position, from which mate's alignment starts. Same, it is a mirror of Col.4 POS for the second read in pair.

Col. 9, TLEN: Insert size (for PE-data) — distance from the leftmost mapped base and the rightmost mapped base of the read pair. The TLEN field is positive for the leftmost read and negative for the rightmost one in the read pair. On the below picture, you can witness how TLEN is connected with reads coordinates.

POS, PNEXT, and TLEN values of reads mapping to forward and reverse strands

Col.10, SEQ: Sequence of the read and its mate, if present. It should be noted that both PE-reads are being presented in forward orientation in SAM file (see the picture above). So, if the read aligned reverse-complement (this fact is stored in flag value), SEQ would be reverse complement counterpart of the original read sequence.

Col.11, QUAL: The quality string for the query read and its mate if present. Comes from FASTQ file.

We have now discussed all mandatory columns, but there are some optional columns that can also be present. Information in optional columns is stored in the form of TAGs. Each field looks like TAG:TYPE:VALUE, where TAG is two-character string, TYPE indicates value type (e.g. an integer, or a word). For example, additional metadata may include the edit distance between the read and the reference, e.g.NM:i:2.

An example of edit distance calculation

In real life, SAM output would look like the picture below. Since there is a one-off educational example, you may just find column value by its order in tab-separated line. When perform real tasks, you need to parse file.

SAM file content

Manipulating SAM/BAM files

Samtools are useful tools designed to work with SAM/BAM files (download current release). It is always good practice to convert SAM to BAM. In fact, we want to perform several operations during file conversion: (1) Compression, or size reduction (2) Sorting by position on reference sequence (3) Indexing for quick extraction of alignments overlapping particular genomic regions.

Here is the chain of commands:

samtools view -S -b sample.sam > sample.bam
samtools sort sample.bam -o sample.sorted.bam
samtools index sample.sorted.bam

After running, we should see two files — sample.sorted.bam and sample.sorted.bai, or the BAM index file. Examples of other useful operations are merging two BAM files, filtering by MAPQ and FLAG values, or searching for intersections with a genomic region. We will not cover other samtools commands in this topic, but there are great sources where samtools commands are clearly explained. Moreover, you can always consult documentation to find a command that suits your task in the best way.

Conclusion

SAM and its binary equivalent, BAM, are flexible formats to store all the alignment information generated by various alignment programs. We can outline SAM and BAM formats strengths: (1) Compressed (2) Indexed — fast viewing, slicing, etc. (3) Relatively simple to generate (4) Store both single-end and paired-end data (5) Can be manipulated with handy toolkits.

How did you like the theory?
Report a typo