
Sequencing is the process by which one determines the sequence of nucleotides in a DNA molecule. Remember, the nucleotides are like links that make up the chain of the DNA molecule. Sequencing is an effective tool in the modern researcher's toolkit. It provides key insight into the principles that govern all life forms and their functions, from Escherichia coli to cancer cells. DNA sequencing results are used in many areas of biology and medicine such as biological systematics, evolutionary biology, disease diagnosis and prognosis, and genetic engineering.

History of DNA sequencing

Since the middle of the 20th century, we have understood that DNA is the carrier of genetic information (see the Avery-MacLeod-McCarty experiment and the Hershey-Chase experiments). Since then, our understanding of DNA has improved continually, and scientists have worked out how to read DNA sequences and interpret the four-letter language that governs all life forms.

Pioneer sequencing methods

Chemical sequencing method:

One of the first methods to define DNA sequences was the chemical cleavage method or the Maxam-Gilbert method. It consists of the following steps:

  • Mark one end of the DNA chain with a radioactive label.

  • Divide the DNA into 4 samples.

  • In each sample, use specific chemical reactions to break each DNA molecule at letter-specific positions: C, C+T, G, and G+A. Because only a limited amount of the cleaving chemicals is provided, the breaks are limited to about one per DNA molecule in the sample. Thus, each DNA molecule is cut into two pieces, and the position of the cut, and hence the length of the labeled piece, varies from molecule to molecule.

  • Separate the DNA fragments in the resulting mixture by length using polyacrylamide gel electrophoresis, then visualize them with X-ray film, which captures an image of the gel.

  • Piece together the original sequence based on the four sets of fragment lengths, going nucleotide by nucleotide and looking at all four lanes of the gel to see at which letter each sequential break happened along the DNA chain.
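The read-out logic of the last step can be sketched in a few lines of Python. This is a toy model, not laboratory software: it assumes the label sits at the start of the written sequence, that every possible break is observed, and that a fragment's length equals the number of bases before the broken position. The function names `cleave`, `read_lanes`, and `run_maxam_gilbert` are made up for this illustration.

```python
def cleave(sequence, targets):
    """Lengths of the labeled fragments produced when breaks occur at bases in `targets`."""
    return {i for i, base in enumerate(sequence) if base in targets}


def read_lanes(lanes, length):
    """Rebuild the sequence by checking, length by length, which lanes show a band."""
    result = []
    for i in range(length):
        if i in lanes["G"]:
            result.append("G")      # band in G (and therefore also in G+A)
        elif i in lanes["G+A"]:
            result.append("A")      # band in G+A only
        elif i in lanes["C"]:
            result.append("C")      # band in C (and therefore also in C+T)
        elif i in lanes["C+T"]:
            result.append("T")      # band in C+T only
        else:
            result.append("N")      # no band at this length: unknown base
    return "".join(result)


def run_maxam_gilbert(sequence):
    """Simulate the four cleavage reactions, then read the sequence back."""
    lanes = {
        "C": cleave(sequence, "C"),
        "C+T": cleave(sequence, "CT"),
        "G": cleave(sequence, "G"),
        "G+A": cleave(sequence, "GA"),
    }
    return read_lanes(lanes, len(sequence))


print(run_maxam_gilbert("GATTACA"))
```

Note how A and T are never cleaved alone: they are inferred from a band that appears in the combined lane but not in the single-letter lane.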

Sanger sequencing method:

The second method, called the chain termination method or Sanger sequencing, was proposed almost simultaneously with the first one. It is very similar fundamentally, but the four sets of DNA fragments are synthesized by enzymes called DNA polymerases, not by chemically cleaving the initial DNA molecules. Here are the main steps:

  • Divide the DNA into 4 samples, one for each nucleotide.

  • Add DNA polymerase and a mixture of all 4 nucleotides to each sample.

  • To each solution, add a small amount of one of four modified nucleotides. These modified nucleotides resemble the A, T, C, and G bases we are already familiar with, but they have slight chemical modifications that cause the polymerization to stop when they are added to the chain. These modified nucleotides are also labeled with radioactive markers.

  • Use gel electrophoresis to visualize the lengths of the obtained fragments and determine the sequence.
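The steps above can be mimicked with a small Python sketch. It is a simplified model under stated assumptions: the template is written 5' to 3', every possible termination product is present, and the gel resolves fragments that differ by a single base. The names `synthesize`, `sanger_fragments`, and `read_gel` are hypothetical helpers introduced only for this example.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def synthesize(template):
    """Strand built by the polymerase, written 5'->3' (reverse complement of the template)."""
    return "".join(COMPLEMENT[base] for base in reversed(template))


def sanger_fragments(template):
    """For each of the four reactions, list the lengths of the terminated fragments."""
    new_strand = synthesize(template)
    reactions = {base: [] for base in "ACGT"}
    for i, base in enumerate(new_strand):
        # In the reaction containing the modified version of `base`,
        # some chains stop right after this base is added.
        reactions[base].append(i + 1)
    return reactions


def read_gel(reactions):
    """Read the bands from shortest to longest fragment across all four lanes."""
    bands = sorted(
        (length, base)
        for base, lengths in reactions.items()
        for length in lengths
    )
    return "".join(base for _, base in bands)


fragments = sanger_fragments("TGTAATC")
print(fragments)
print(read_gel(fragments))
```

The sequence read from the gel is that of the newly synthesized strand, which is complementary to the template.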

Key steps of the Sanger sequencing method: the incorporation of chain-terminating nucleotides and the separation of the DNA fragments using gel electrophoresis

The chain termination method had several advantages that made it more suitable for optimization and scaling. Instead of labeling all modified nucleotides with the same radioactive label, later versions labeled each modified nucleotide with a different fluorophore, a fluorescent dye that can be visualized using a laser. Because the four nucleotides can be identified by their fluorescence, there is no need to split the DNA sample into four different samples: the reaction can occur in one tube. Eventually, all stages of Sanger sequencing were automated, and new machines called sequencers were developed. These devices are widely used today, and sequencing has become a standardized, routine procedure in a modern biology lab. The sequencer output is a chromatogram: a trace of the successive fluorophore signals, which correspond to the sequence of nucleotides in the studied DNA.

Assigning bases to chromatogram peaks
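Assigning a base to each chromatogram peak can be illustrated with a minimal Python sketch. It assumes the peaks have already been detected and that each peak is given as four fluorescence intensities, one per base. The function `call_bases`, the `min_ratio` threshold, and the sample values are illustrative assumptions, not the algorithm of any particular sequencer.

```python
def call_bases(peaks, min_ratio=2.0):
    """Call one base per peak; emit 'N' when the strongest signal is not clearly dominant."""
    called = []
    for peak in peaks:
        # Sort the four channels from strongest to weakest signal.
        ranked = sorted(peak.items(), key=lambda item: item[1], reverse=True)
        best_base, best = ranked[0]
        _, second = ranked[1]
        if best <= 0 or best < min_ratio * second:
            called.append("N")          # ambiguous or empty peak
        else:
            called.append(best_base)    # one channel clearly dominates
    return "".join(called)


peaks = [
    {"A": 900, "C": 40, "G": 30, "T": 20},   # clean A peak
    {"A": 50, "C": 60, "G": 800, "T": 10},   # clean G peak
    {"A": 400, "C": 30, "G": 350, "T": 20},  # A and G overlap: ambiguous
]
print(call_bases(peaks))
```

Reporting an ambiguous position as N rather than guessing mirrors the common convention of marking low-confidence calls.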

Second generation methods

In 2005, the era of high throughput sequencing or next-generation sequencing (NGS) began. Improvements in tools and data processing made it possible to carry out tens and hundreds of thousands of reactions simultaneously instead of one. This, in turn, significantly accelerated the sequencing process and greatly reduced its cost. This family of methods is called second-generation sequencing.

Principle:

The central second-generation method is sequencing-by-synthesis. Just as in Sanger sequencing, it is based on an enzymatic reaction that stops because of a nucleotide modification, but here the stop is reversible. The original DNA molecules are attached at one end to a solid surface, fixing them at one point in space. All nucleotides added to the reaction are modified, so chain synthesis pauses after every addition the DNA polymerase makes. The nucleotides are fluorescently labeled, and at each pause, the color of the fluorophore at each chain's spot indicates which nucleotide has joined the chain. Then, the fluorophore and the modification are cleaved off from the last attached nucleotide, and the process is repeated.

Each of the four bases has a unique emission, and after the addition of each new nucleotide, the machine records which base was added
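The cycle of "add one nucleotide, take an image, cleave, repeat" can be sketched as a loop in Python. This is a toy simulation: each spot on the surface holds a template written in the order the polymerase reads it, the color assignments in `COLOR` are arbitrary, and the function name `sequence_by_synthesis` is invented for this example.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
# Arbitrary color assignment, chosen only for this illustration.
COLOR = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
BASE_OF_COLOR = {color: base for base, color in COLOR.items()}


def sequence_by_synthesis(clusters, cycles):
    """Run `cycles` rounds of incorporation and imaging over all spots in parallel."""
    reads = {name: [] for name in clusters}
    for cycle in range(cycles):
        # 1. Every chain incorporates exactly one blocked, labeled nucleotide.
        image = {}
        for name, template in clusters.items():
            if cycle < len(template):
                incorporated = COMPLEMENT[template[cycle]]
                image[name] = COLOR[incorporated]
        # 2. The image is decoded: the color at each spot gives the added base.
        for name, color in image.items():
            reads[name].append(BASE_OF_COLOR[color])
        # 3. The label and the block are cleaved off; the next cycle can start.
    return {name: "".join(bases) for name, bases in reads.items()}


clusters = {"spot1": "TACG", "spot2": "GGCA"}
print(sequence_by_synthesis(clusters, cycles=3))
```

All spots advance by one base per cycle, which is what allows a very large number of reads to be recorded in parallel.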

The main drawback of the second-generation methods is their short read length: no more than a thousand bases. In the case of sequencing-by-synthesis, the problem stems from its technical implementation. The fluorescence signal at each spot is generated by several hundred identical chains growing in sync, and as their synthesis gradually falls out of phase, the quality of the signal decreases. With a human genome size of 3 billion letters, reconstructing the sequence from reads that are only several hundred bases long requires complex algorithms and high computing power.
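A rough way to see why dephasing limits read length is a simple probabilistic model, sketched below in Python. It assumes that each chain independently falls out of step with a fixed probability `p_slip` per cycle, so the in-phase fraction after n cycles is (1 - p_slip)^n. Both the model and the numbers used are illustrative assumptions, not specifications of any instrument.

```python
import math


def in_phase_fraction(cycles, p_slip):
    """Fraction of chains in a cluster that are still in step after `cycles` additions."""
    return (1.0 - p_slip) ** cycles


def max_usable_cycles(p_slip, min_fraction):
    """Largest number of cycles for which at least `min_fraction` of chains stay in step."""
    return math.floor(math.log(min_fraction) / math.log(1.0 - p_slip))


# Hypothetical slip probability of 0.5% per cycle.
for n in (50, 100, 200, 400):
    print(n, round(in_phase_fraction(n, 0.005), 3))
print(max_usable_cycles(0.005, 0.5))
```

Because the decay is exponential, even a small per-cycle error rate caps the number of useful cycles, and with it the read length.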

Third generation methods

In the 2010s, in an effort to solve the problem of short reads, the third generation of sequencing methods appeared.

Principle:

The two main methods, single-molecule real-time sequencing and nanopore sequencing, are based on different principles. However, both detect the signal from a single DNA molecule and allow reads of tens and even hundreds of thousands of bases. Single-molecule real-time sequencing is similar to second-generation sequencing in that it uses an enzymatic reaction and detects the signal of fluorophores attached to nucleotides, but there is no need to interrupt the synthesis process: the fluorophore is attached in a position where it naturally cleaves off from the nucleotide when the nucleotide is added to the chain. Nanopore sequencing is fundamentally different from all the methods described so far: the read-out relies on neither chemical nor enzymatic reactions and is based on simple electronics. A DNA molecule is passed through a nanosized protein hole (nanopore) that is placed in an electric field. As each nucleotide passes through the nanopore, the ion flux inside the hole changes, and a miniature transistor detects this change. Because nucleotides differ in shape and size, the signals they generate differ, which makes it possible to tell them apart.

Single-stranded DNA passes through a nanopore protein that causes alteration in ionic current
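The idea of telling nucleotides apart by the current they produce can be shown with a deliberately simplified Python sketch. It assumes one current sample per nucleotide and one characteristic level per base. The values in `LEVELS` are made up for illustration, and real signals are noisier and influenced by several neighboring bases at once.

```python
# Hypothetical current levels (arbitrary units), one per base.
LEVELS = {"A": 50.0, "C": 60.0, "G": 70.0, "T": 80.0}


def decode_current(samples):
    """Assign each current sample to the base whose reference level is closest."""
    return "".join(
        min(LEVELS, key=lambda base: abs(LEVELS[base] - sample))
        for sample in samples
    )


print(decode_current([51.2, 78.9, 69.5, 60.4]))
```

The closer two reference levels are to each other, the more easily noise causes a wrong call, which gives some intuition for the lower raw accuracy of this approach.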

Conclusion

To date, there is no universal sequencing method. All three generations of methods are in use and have unique benefits and drawbacks:

  • 1st generation Sanger sequencing remains the medical standard, as it produces sufficiently long reads of maximum accuracy.

  • The 2nd generation sequencing-by-synthesis method provides high sequence yield at a low price, but with tighter limits on read length. When its short reads are complemented by long reads from 3rd generation sequencing, the algorithms required to reconstruct the sequence become less complex and the final assembly is of higher quality.

  • The development of 3rd generation methods is not yet complete. Initially, their reading accuracy was significantly inferior to that of the 2nd generation sequencing-by-synthesis method, but the error rate has decreased considerably over the past decade. The Nanopore method allows the detection of natural modifications of nucleotides, and the read length has increased by orders of magnitude.

Major DNA sequencing platforms

| Method (platform) | Read length, bases | Accuracy | Reads per run | Time per run | Cost per 1 billion nucleotides, $ |
|---|---|---|---|---|---|
| Chain termination (Sanger sequencing) | 400-900 | 99.9% | 1 | 20 minutes to 3 hours | 2,400,000 |
| Sequencing-by-synthesis (Illumina) | up to 600 | 99.9% | up to 3 billion | 1 to 11 days | 5 to 150 |
| Single-molecule real-time sequencing (Pacific Biosciences) | up to more than 100,000 | 99% | 100-200 billion | 30 minutes to 20 hours | 7.2 to 43.3 |
| Nanopore sequencing (Oxford Nanopore) | up to 2,272,580 reported | 87-98% | dependent on read length | 1 minute to 72 hours | 7 to 100 |
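To put the cost column into perspective, the following Python sketch scales the listed price per 1 billion nucleotides to a genome of 3 billion letters, the human genome size mentioned earlier. The figures are copied from the table above. The `coverage` parameter (how many times each position is read) and the function name `cost_range` are assumptions added for this example.

```python
# (low, high) cost in $ per 1 billion nucleotides, taken from the table above.
COST_PER_BILLION = {
    "Sanger": (2_400_000, 2_400_000),
    "Illumina": (5, 150),
    "Pacific Biosciences": (7.2, 43.3),
    "Oxford Nanopore": (7, 100),
}


def cost_range(platform, genome_bases=3_000_000_000, coverage=1):
    """Estimated (low, high) cost in $ of reading `genome_bases` letters `coverage` times."""
    billions = genome_bases * coverage / 1_000_000_000
    low, high = COST_PER_BILLION[platform]
    return low * billions, high * billions


for platform in COST_PER_BILLION:
    print(platform, cost_range(platform))
```

Under these assumptions, a single pass over a human-sized genome would cost about 7.2 million dollars with Sanger sequencing, compared with 15 to 450 dollars with sequencing-by-synthesis.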
