
In the field of genome assembly, scaffolding techniques allow for the construction of a more comprehensive and contiguous reference genome, which is the foundation of genomic research. In this topic, we will explore the role of scaffolding in genome assembly, including its challenges, and provide examples of scaffolding software.

What is scaffolding?

Recall that in the process of assembling the genome, we first have short overlapping sections of the genome, called reads. Further, using various genomic assemblers, we can get longer sections — contigs. Together, the contigs make up the desired genome sequence. However, the order in which the contigs are in the genome is unknown to us by default, so we must take some further steps.

Scaffolding in bioinformatics refers to techniques used to improve the completeness and contiguity of draft assemblies, which is a crucial step in the assembly of the genome. To be more specific, scaffolding approaches are used to infer contig orientations and positioning and to generate longer sequences — scaffolds.

The difference between contigs and scaffolds

Contig assembly is a fairly routine, algorithmic process, whereas scaffolding is less mechanical and depends heavily on the particular data available. In theory, scaffolding should reconstruct the organism's chromosomes, giving one sequence per chromosome. For instance, a human diploid cell carries 46 chromosomes (23 pairs), so a fully resolved diploid assembly would contain 46 sequences, while a standard haploid reference has 24 (22 autosomes plus X and Y).

Reads produced by different sequencing technologies differ considerably, and so does their ability to overcome the difficulties of scaffolding. Protein sequences and related reference genomes can also be used for scaffolding. Because of this, most existing scaffolding methods are designed around one type of input data, so let's divide them into 4 categories based on data type.

Paired-end reads

How can we tell whether two contigs are "genome neighbors"? Suppose we know that the genome contains a sequence that includes the end of contig A followed by the beginning of contig B. Then we can be reasonably sure that these contigs follow each other in the genome.

Paired-end reads provide additional information about contigs' positioning

How can we check this? We can cut the studied DNA into pieces of roughly known length and then read each piece from both ends. Although the part of the sequence between the two ends remains unknown, we get pairs of reads with a roughly known distance between them.

After that, we can align these read pairs to the contig sequences we already have. Note that pairing information is not used while assembling the contigs, so the contigs and the read pairs are two independent sources of information about the genome. Accordingly, if the two reads of a pair align to different contigs, we can join those contigs into a single scaffold. Moreover, because the two reads of a pair have a known relative orientation and an approximately known distance between them, we can infer not just the order of the contigs but also their orientation, and roughly estimate the size of the gap between them.
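The linking step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the method of any particular scaffolder: the input format (a list of tuples naming the contig each mate aligned to), the function name `count_contig_links`, and the `min_links` threshold are all assumptions made for the example.

```python
from collections import Counter

def count_contig_links(pair_alignments, min_links=2):
    """Count read pairs whose two mates align to different contigs.

    pair_alignments: iterable of (contig_of_mate1, contig_of_mate2) tuples,
    a hypothetical simplified stand-in for real alignment records.
    Returns a dict mapping a sorted contig pair to its number of linking
    read pairs, keeping only pairs supported by at least min_links pairs.
    """
    links = Counter()
    for contig_a, contig_b in pair_alignments:
        if contig_a != contig_b:  # mates on the same contig carry no linking signal
            links[tuple(sorted((contig_a, contig_b)))] += 1
    return {pair: n for pair, n in links.items() if n >= min_links}

# Toy data: three pairs link A and B, one pair links B and C.
pairs = [("A", "B"), ("B", "A"), ("A", "A"), ("B", "C"), ("A", "B")]
supported = count_contig_links(pairs)
print(supported)  # {('A', 'B'): 3}
```

Requiring several independent read pairs before joining two contigs is a simple guard against a single misaligned or chimeric pair creating a false join.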

In this method, the distance between the two reads of a pair (the insert size) is an important parameter: the larger it is, the larger the sections of the genome we can span. Mate-pair sequencing, a technique that produces read pairs with long inserts, therefore facilitates the assembly process, but it is more complicated and expensive.
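To see how the insert size translates into a gap estimate, consider the simplest layout: contig A is followed by contig B, both in forward orientation, with the first mate aligned near the end of A and the second mate near the start of B. The sketch below is an illustration under these assumptions only; the function names and the 0-based coordinate convention are hypothetical.

```python
from statistics import median

def estimate_gap(insert_size, len_a, mate1_start, mate2_end):
    """Estimate the gap between contig A and contig B from one read pair.

    insert_size: expected outer distance between the two mates
    len_a:       length of contig A
    mate1_start: leftmost position of mate 1 on contig A (0-based)
    mate2_end:   end position of mate 2 on contig B (0-based, exclusive)
    """
    covered_on_a = len_a - mate1_start  # part of the insert lying on A
    covered_on_b = mate2_end            # part of the insert lying on B
    return insert_size - covered_on_a - covered_on_b

def estimate_gap_from_pairs(insert_size, len_a, pairs):
    """Combine per-pair estimates with the median to dampen outliers."""
    return median(estimate_gap(insert_size, len_a, s, e) for s, e in pairs)

# Three linking pairs given as (mate1_start on A, mate2_end on B).
linking_pairs = [(9000, 500), (9200, 300), (8900, 700)]
gap = estimate_gap_from_pairs(3000, 10_000, linking_pairs)
print(gap)  # 1500
```

Because the insert size is only roughly known, a single pair gives a noisy estimate; combining many pairs narrows it down.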

Sequencing technologies have since advanced to the point where much longer reads (up to hundreds of thousands of base pairs) can be obtained. Naturally, it is much easier to assemble a genome from longer pieces. However, long reads usually contain a large number of errors. In this case, a hybrid strategy that uses both short and long reads is efficient: we first assemble contigs from short reads only, as usual, and then align the resulting contigs to the long reads. A long read that spans the ends of two contigs plays the same role as a read pair in the paired-end method.

Subcloning-based method

Scaffolding methods based on subcloning involve breaking up the genome into discrete fragments that can be cloned using vectors such as plasmids. Fragments of the studied genome are inserted into a larger DNA molecule, for example, into a bacterial artificial chromosome (BAC) grown in Escherichia coli.

After inserting the studied fragment into a plasmid, the resulting hybrid molecule can be sequenced. The sequence data can then be used to assemble a larger DNA molecule, with the inserted fragment serving as a scaffold that helps connect adjacent contigs. Thus, the assembly process is first performed for each fragment separately, and the resulting assemblies can be combined together to reconstruct the complete genome sequence.

Newer technologies perform the subcloning process in vitro. For example, in the technology from 10x Genomics, large DNA fragments are placed in separate droplets and split into smaller pieces. The pieces in each droplet are tagged with a barcode that indicates which large DNA fragment they came from. The barcoded DNA molecules are then sequenced, and an algorithm that receives the read sequences together with their barcodes can group the reads that came from the same large DNA fragment. Reads grouped in this way can be assembled together to reconstruct individual large fragments.
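The grouping step can be illustrated with a short Python sketch. It assumes, purely for the example, that each read arrives as a (barcode, sequence) tuple; real pipelines work with their own file formats and also have to handle barcode errors, which are ignored here.

```python
from collections import defaultdict

def group_reads_by_barcode(reads):
    """Group read sequences that share a barcode.

    reads: iterable of (barcode, sequence) tuples (a hypothetical input format).
    Returns a dict mapping each barcode to the list of its read sequences,
    in the order in which they were seen.
    """
    groups = defaultdict(list)
    for barcode, sequence in reads:
        groups[barcode].append(sequence)
    return dict(groups)

# Toy data: two barcodes, standing for two different large DNA fragments.
reads = [
    ("AACG", "TTGACC"),
    ("GGTA", "CCATGA"),
    ("AACG", "GACCTA"),
]
groups = group_reads_by_barcode(reads)
print(groups["AACG"])  # ['TTGACC', 'GACCTA']
```

Each group can then be passed to an assembler separately, which is a much smaller problem than assembling all reads at once.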

Chromosomal contact data

In the previous methods, we treated DNA simply as a nucleotide sequence. However, in reality DNA is "packed" in a complex way in the nucleus, and knowledge about its three-dimensional structure can be useful to us. For simplicity, we can imagine that the DNA molecule is like a thread wound around many spools (histone proteins act as such spools). The resulting structure of DNA and proteins is called chromatin.

For scaffolding, the useful property is that the closer two regions of the genome are to each other in the DNA sequence, the more often they are in contact in the 3D structure of chromatin. As a result, if we know which regions of the genome are physically close to each other, we can assume that they are likely to be neighbors in the genomic sequence as well. We can therefore group fragmented genomic sequences based on the frequency of contacts between them. This reduces the number of contig pairs we need to check for adjacency, which increases both the speed and the accuracy of genome assembly.

The question remains: how do we get data on the spatial organization of the genome? One research method for this is Hi-C. Let us briefly describe its main ideas. First, cells are cross-linked with formaldehyde, which forms bonds between DNA and proteins and fixes the three-dimensional structure of chromatin in place. The chromatin is then cleaved by a restriction enzyme, which cuts the DNA into fragments of varying lengths. Next, the DNA fragments are ligated, and the fragments most likely to be joined are those that were close to each other in three-dimensional space.

Hi-C workflow: genomic DNA crosslinking, fragmentation, ligation events, and sequencing of chimeric fragments

The ligated DNA is then cut into smaller fragments and sequenced. To infer the spatial proximity of different regions of the genome, we count how often particular DNA fragments are found ligated together: the closer they were to each other, the more likely they were to join.
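The counting idea can be sketched as follows. This is a simplified illustration, not the algorithm of any specific Hi-C scaffolder: the input format, the function name `contact_density`, and in particular the normalization by the product of contig lengths are assumptions chosen to keep the example short.

```python
from collections import Counter

def contact_density(hic_pairs, contig_lengths):
    """Compute a length-normalized contact score for each pair of contigs.

    hic_pairs:      iterable of (contig_of_read1, contig_of_read2) tuples,
                    one per sequenced chimeric fragment (hypothetical format).
    contig_lengths: dict mapping contig name to its length in base pairs.
    Longer contigs collect more contacts by chance, so each raw count is
    divided by the product of the two contig lengths (an assumed, crude
    normalization).
    """
    counts = Counter()
    for a, b in hic_pairs:
        if a != b:  # contacts within one contig say nothing about neighbors
            counts[tuple(sorted((a, b)))] += 1
    return {
        (a, b): n / (contig_lengths[a] * contig_lengths[b])
        for (a, b), n in counts.items()
    }

lengths = {"A": 1000, "B": 1000, "C": 2000}
hic_pairs = [("A", "B")] * 4 + [("A", "C")] * 2 + [("B", "C")]
density = contact_density(hic_pairs, lengths)

# The pair with the highest score is the best candidate for adjacency.
best_pair = max(density, key=density.get)
print(best_pair)  # ('A', 'B')
```

Contig pairs with high scores become candidates for being placed next to each other, while pairs with few contacts can be skipped.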

Physical mapping

The main idea of this group of methods is to determine where specific DNA segments are located along the chromosomes, which reduces the number of possible arrangements of the regions we already know. The segments of interest may be, for example, restriction sites for specific enzymes. If we know exactly where such sites are located in the genome, we get additional information that helps us assemble it.

Indeed, in this case we know into which segments the entire DNA molecule will be cut. By comparing the segments produced by restriction with the contigs obtained from sequencing, we can again determine the order of the contigs in the genome.

An example of an optical map: it shows the physical location of restriction enzyme sites

One of the common physical mapping methods used in scaffolding is optical mapping. This method uses high-resolution microscopy to create a physical genome map based on the location and size of specific DNA fragments. In optical mapping, the DNA molecule is stained with fluorescent dyes, allowing researchers to take images of the DNA. The images are then analyzed to determine the size and location of the restriction fragments in the genome, which can be used to build a physical map. Once the physical map is built, it can be compared to the assembly of the genome to reveal overlaps between contigs. These overlaps can then be used to combine contigs into scaffolds.
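To compare a physical map with assembled contigs, each contig can be "digested" in silico: we find the enzyme's recognition sites in the contig sequence and compute the expected fragment lengths, which can then be matched against the fragment sizes measured on the map. The sketch below assumes the EcoRI recognition site GAATTC with a cut after the first base; the function name and the toy sequence are hypothetical, and the matching step itself is not shown.

```python
def digest_fragment_lengths(sequence, site="GAATTC", cut_offset=1):
    """Return the fragment lengths from an in silico restriction digest.

    sequence:   contig sequence as a string
    site:       recognition site of the enzyme (EcoRI by default)
    cut_offset: position within the site where the enzyme cuts
    """
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = sequence.find(site, pos + 1)
    # Fragment boundaries: start of contig, every cut, end of contig.
    bounds = [0] + cuts + [len(sequence)]
    return [end - start for start, end in zip(bounds, bounds[1:])]

contig = "AAAA" + "GAATTC" + "CCCCC" + "GAATTC" + "TT"
print(digest_fragment_lengths(contig))  # [5, 11, 7]
```

A contig whose predicted fragment lengths match a run of consecutive fragments on the physical map can be placed at that position, within the measurement error of the map.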

Conclusion

Scaffolding is the process that finalizes genome assembly: we combine contigs into longer scaffolds based on additional information. We can get this information in various ways, such as paired-end reads, subcloning-based methods, chromosomal contact data, and physical mapping.

It is a rather complicated process, so we have to be critical of our own and other people's results, and even then we are never fully protected from mistakes. However, with a good choice of methods (or a combination of methods) and careful handling of possible errors, we can be reasonably confident that the resulting assembly will be a sound basis for studying the genome.
