
Genome assembly


DNA sequencing technologies enable us to determine an organism's nucleotide sequence. However, these technologies can process only short stretches of DNA. To determine the whole genome sequence, one has to break genomic DNA into pieces, sequence them, and then put the resulting reads into the right order. This last step is called genome assembly. In this topic, we will discuss the main ideas behind genome assembly.

Main ideas

Here, we are talking about de novo genome assembly, which means there is no reference genome sequence available for an organism. A reference genome is a representation of a common species' genome sequence, which is assembled from the DNA of multiple individual donors.

The primary challenge of genome assembly is that we cannot read a whole genome in one shot. Existing state-of-the-art sequencing technologies read only short DNA pieces at a time. The typical read size for Illumina sequencing is 50-300 bp, while, for instance, the human genome is 3 × 10^9 bp long. Therefore, de novo genome assembly resembles putting together a big puzzle without knowing what the picture looks like.

Generally, assembly consists of two steps: contig construction and joining contigs into scaffolds.

  1. In the first step, similar to assembling a puzzle, algorithms group reads into longer sequences based on short overlapping regions between reads. These longer sequences are called contigs (from contiguous).

  2. Some genome regions are hard to sequence and may be absent from the pool of reads. Such regions create gaps, so not all reads can be assembled into one contig. This is why, in the second step, we determine the relative positions of contigs and arrange them into scaffolds. A scaffold represents the right order of contigs separated by gaps. As a result of genome assembly, we obtain a set of contigs or scaffolds. For instance, the GRCh38 version of the human genome consists of 473 scaffolds, or 999 contigs.
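The first step can be illustrated with a toy sketch that merges reads into a contig via exact suffix-prefix overlaps. Real assemblers use far more sophisticated graph-based approaches; the reads and the minimum overlap length below are made up for illustration.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of `a` that matches a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_merge(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:  # no overlaps left: the remaining reads stay separate contigs
            break
        merged = reads[i] + reads[j][k:]
        reads = [r for n, r in enumerate(reads) if n not in (i, j)] + [merged]
    return reads

# Three toy reads that overlap pairwise assemble into a single contig
print(greedy_merge(["ATGGCGT", "GCGTGCA", "TGCAATG"]))  # → ['ATGGCGTGCAATG']
```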

The difference between contigs and scaffolds

Contig and scaffold construction

The construction of contigs is the most challenging step of genome assembly. Assembly algorithms for short reads exploit de Bruijn graphs. These approaches break reads into smaller pieces of the same length and place them into the nodes of a graph, then create a directed link between two nodes if the suffix of one node equals the prefix of the other. After building the graph, the algorithm tries to visit every edge exactly once and writes down the sequences it encounters in the nodes during its walk. Fitted together, these sequences make a contig. Building a de Bruijn graph requires a large number of overlapping regions, which in turn requires redundant reads. Genome coverage is used as an indicator of such redundancy: it is the average number of reads covering any position of the genome. Short-read assembly requires a coverage of around thirty to fifty reads per position.

Gaps between contigs in a scaffold originate from either poorly sequenced or repetitive regions. Repeats are widespread in the genome and cannot be accurately positioned in the assembled sequence; as a result, they create gaps during assembly. Information about the relative positions of contigs can be derived from either paired-end or mate-pair sequencing. Paired-end and mate-pair reads differ in the sample preparation protocol and the insert length between the reads in a pair: the insert length is around 100 nt for paired-end reads and several kilobases for mate-pair reads. In both cases, reads from the same pair may overlap with different contigs, indicating that these contigs are close.
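The idea of linking contigs through read pairs can be sketched as follows. The mapping results and contig names below are entirely hypothetical: for each pair, we record which contig each mate aligned to, and pairs spanning two different contigs vote for those contigs being adjacent.

```python
from collections import Counter

# Hypothetical mapping results: for each read pair, the contig each of
# its two mates aligned to (None = mate unmapped). Names are made up.
pair_hits = [
    ("contig1", "contig2"),
    ("contig1", "contig2"),
    ("contig2", "contig3"),
    ("contig1", None),
    ("contig2", "contig2"),  # both mates in the same contig: no link
]

links = Counter()
for a, b in pair_hits:
    if a and b and a != b:
        links[tuple(sorted((a, b)))] += 1

# Requiring at least two supporting pairs filters out spurious links
confident = {pair: n for pair, n in links.items() if n >= 2}
print(confident)  # → {('contig1', 'contig2'): 2}
```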

Genome assembly from short paired-end reads and mate-pair, long-insert, paired-end reads

Assembly quality metrics

Several metrics help assess genome assembly quality. The N50 metric is the minimum contig length within the set of the longest contigs covering 50% of the genome. The true genome length is usually unknown, so the aggregated length of all contigs is used instead. To get the N50 value, sort all contigs by length, go from the longest to the shortest until the total length of the selected contigs reaches at least 50% of the genome, and take the length of the shortest contig in the obtained set.
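The procedure above translates directly into code; the contig lengths are a made-up toy assembly:

```python
def n50(contig_lengths):
    """N50: length of the shortest contig among the longest contigs that
    together cover at least half of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length

# Toy assembly of total length 100: 40 + 25 = 65 >= 50, so N50 = 25
print(n50([40, 25, 15, 10, 6, 4]))  # → 25
```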

The same metric can be applied to scaffolds. Keep in mind that comparing N50 values between different assemblies only makes sense if the assemblies are the same size, in other words, if the total lengths of their contigs are equal. The L50 metric is the smallest number of contigs whose summed length accounts for 50% of the genome size. The procedure for calculating L50 is similar to that for N50, but instead of the minimum length, take the number of contigs in the set. Other common quality metrics are N90 and L90; they mean the same as N50 and L50, but for 90% of the total genome length.
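Since N50, L50, N90, and L90 all follow the same procedure, they can be computed by one function parameterized by the genome fraction; the toy contig lengths are invented for illustration:

```python
def nx_lx(contig_lengths, fraction=0.5):
    """Return (Nx, Lx) for a given fraction of the total assembly length,
    e.g. fraction=0.5 gives (N50, L50) and fraction=0.9 gives (N90, L90)."""
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest, counting how many are needed
    for count, length in enumerate(sorted(contig_lengths, reverse=True), 1):
        running += length
        if running >= total * fraction:
            return length, count

lengths = [40, 25, 15, 10, 6, 4]      # toy assembly, total length 100
print(nx_lx(lengths, 0.5))  # → (25, 2): N50 = 25, L50 = 2
print(nx_lx(lengths, 0.9))  # → (10, 4): N90 = 10, L90 = 4
```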

Ideally, an optimal genome assembly would consist of a few contigs representing entire chromosome sequences. Such an assembly would have a high N50 value and a small L50. In contrast, a low-quality assembly would consist of many tiny contigs, leading to a low N50 and a high L50 (relative to the total number of contigs). Moreover, N50 is a measure of sequence contiguity, not assembly correctness. In the modern era of long-read genome assemblies, an N50 value over 1 Mb is generally considered good. Keep in mind that N50 can be biased if short contigs are excluded; in this case, the observed N50 value would be higher than the real one.

Given a set of contigs, N50 metrics is calculated as the sequence length of the shortest contig at 50% of the total assembly length

The BUSCO tool evaluates assembly completeness in terms of gene content. The tool uses a set of genes that are highly conserved in a taxonomic group and have remained in a single copy. Ideally, these genes should be present in a genome assembly in one copy. BUSCO has built-in gene sets for several taxonomic groups, including vertebrates, fungi, eukaryotes, etc. The user chooses a taxonomic group and inputs an assembly; BUSCO then searches for genes from the chosen set in the assembled sequence and annotates them. In the best case, BUSCO finds a full-length gene and classifies it as "complete" [C]. "Complete" genes found in more than one copy are called "duplicated" [D]; they occur due to biological reasons or assembly inaccuracies. Genes only partially recovered are referred to as "fragmented" [F]. "Missing" [M] genes are not found in the assembly. A high proportion of missing genes indicates assembly incompleteness. Take a look at the BUSCO assessment of some assemblies.
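BUSCO reports these categories as a compact one-line summary. A small sketch of reading such a line follows; the format shown matches recent BUSCO versions, but the percentages themselves are invented for illustration:

```python
import re

# Invented example of a BUSCO one-line summary (C = complete,
# S = single-copy, D = duplicated, F = fragmented, M = missing,
# n = number of genes in the chosen set)
summary = "C:93.1%[S:92.0%,D:1.1%],F:3.2%,M:3.7%,n:255"

scores = {key: float(val)
          for key, val in re.findall(r"([CSDFM]):([\d.]+)%", summary)}
n_genes = int(re.search(r"n:(\d+)", summary).group(1))

# "Complete" is the sum of single-copy and duplicated genes
assert abs(scores["C"] - (scores["S"] + scores["D"])) < 0.1
print(scores, n_genes)
```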

BUSCO assessment for five species

Generally, an assembly with more than 90-95% complete single-copy BUSCO genes is of good quality. For example, the Earth BioGenome Project criteria for a reference-quality assembly require a complete and single-copy BUSCO score above 90%. A score below 75% indicates a poor assembly. However, BUSCO results depend on the chosen taxonomic group. Therefore, in rare cases, an assembly may have a high fraction of missing genes not because of its poor quality, but because it truly lacks some genes from the chosen set.

Genome assembly tools

Algorithms for assembling Illumina short reads utilize the de Bruijn graph; this approach is described in detail in the next topic. Here we consider only some de Bruijn graph-based assemblers, although there are tools exploiting other assembly algorithms.

The Velvet tool is a standard tool for assembling small and medium-sized, mostly bacterial, genomes. The SPAdes genome assembler is also aimed at bacterial genome assembly. Since Velvet and SPAdes were initially designed for small genomes, it is not recommended to use them for large genomes: they would require a large amount of memory and could crash. For this reason, assemblers capable of efficiently processing large genomes were developed, such as ABySS and SOAPdenovo2. These two assemblers also work well on small genomes.

Most tools expect input parameters defined by the user. These parameters affect the assembly results, so it is reasonable to try several values. For instance, there is a dedicated tool for the Velvet assembler: VelvetOptimiser performs numerous Velvet assemblies to find the best one. It reports the parameters of the best assembly and some statistics about all constructed assemblies, such as N50, maximum contig length, number of contigs, etc.

Conclusion

Genome assembly comprises two main steps: creating contigs from sequence overlaps between reads and putting contigs into the right order. The result of genome assembly is a set of scaffolds. The N50 and L50 metrics technically evaluate assembly quality, while BUSCO assesses the gene content of an assembly. The Velvet and SPAdes assemblers are suited for small genomes, whereas ABySS and SOAPdenovo2 can handle genomes of all sizes.
