The human genome reference is the starting point for any human genomics or genetics study. For more than 20 years, its sequence has been gradually refined by correcting errors, closing gaps, resolving complicated regions such as centromeres, and capturing population diversity. In this topic, we focus on human genome assemblies, from the first draft of the human genome to the most accurate and complete version of the reference.
Why do we need a human genome reference?
A reference genome is an accepted representation of the genome sequence of a given species that researchers use as a standard. To determine the sequence of a large DNA molecule, the genetic material is first broken into many fragments; the fragments are then sequenced separately and combined to reconstruct the original sequence. Genome assembly is a complicated and computationally expensive process in itself, and complex genomes, such as the human genome, pose additional challenges. First, assembly complexity depends on the number of DNA pieces and the length of the sequencing reads. Second, the assembly problem is complicated by genomic repeats, that is, nearly identical DNA stretches found throughout the genome. In humans, examples of complicated regions include:
- The telomere and subtelomere regions that cap the ends of chromosomes;
- Centromeres that are essential for cell division;
- Acrocentric arms — short and highly repetitive parts of some chromosomes.
In fact, roughly half of the human genome consists of repetitive DNA.
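The fragment-and-reassemble idea, and the way repeats break it, can be illustrated with a toy greedy assembler. This is only a sketch under strong assumptions: reads are error-free, come from one strand, and are merged by exact suffix-prefix overlap. The function names and both example sequences are made up for illustration; real assemblers are far more sophisticated.

```python
def fragment(sequence, read_len, step):
    """Cut a sequence into overlapping reads of fixed length."""
    return [sequence[i:i + read_len]
            for i in range(0, len(sequence) - read_len + 1, step)]


def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of contigs with the longest overlap."""
    # Identical reads cannot be told apart, so duplicates are dropped.
    contigs = list(dict.fromkeys(reads))
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no remaining pair overlaps, stop merging
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]
    return contigs


unique_genome = "ATGCGTACCTGA"
repeat_genome = "ATGCGTACCTTACGTACCTGA"  # "CGTACC" occurs twice

unique_contigs = greedy_assemble(fragment(unique_genome, 6, 3))
repeat_contigs = greedy_assemble(fragment(repeat_genome, 6, 3))

print(unique_contigs)                   # the original sequence is recovered
print(repeat_genome in repeat_contigs)  # False: the repeat breaks the assembly
```

In the second case the two copies of the repeat yield identical reads, so the assembler has no way to place them correctly. Longer reads that span the whole repeat would remove the ambiguity.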
Although assembling large genomes remains challenging, for many tasks there is no need to assemble the genome each time a new individual is sequenced. To date, research groups have obtained high-quality reference genomes for many model organisms, and human is no exception. To keep studies reproducible across the world, standard human genome sequences, known as human genome references, were adopted. These sequences and their assembly statistics are publicly available.
When human genomes are sequenced in clinical and other studies, the sequencing reads are nearly always aligned to the reference genome for comparison. This task is referred to as re-sequencing: sequencing the genome of a new individual for which a reference sequence is already known. As the reference genome serves as a standard, it should be error-free and as complete as possible. Any inaccuracies in the reference will affect the conclusions drawn from data compared against it. These sequence "imperfections" include unknown bases (gaps), low-quality stretches of sequence, and so on. Gaps are problematic because reads originating from those regions remain unaligned, while reads derived from repetitive regions may align to several genome regions at once. What is more, many positions in the human genome vary between human populations. No single genome can represent population-specific and individual variability; therefore, efforts are made to represent genomic diversity.
Let's now consider how the human genome reference evolved over the years, from the first draft to the current version.
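The effect of gaps and repeats on read alignment can be sketched with exact string matching. This is a deliberate simplification: real aligners tolerate mismatches and use indexed references. The toy reference, the reads, and the function names below are all invented for illustration.

```python
def find_positions(reference, read):
    """Return every 0-based position where `read` matches exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


def classify_read(reference, read):
    """Label a read by the number of places it can be aligned to."""
    hits = find_positions(reference, read)
    if not hits:
        return "unaligned"
    return "unique" if len(hits) == 1 else "multi-mapped"


# Toy reference: a gap of unknown bases (N) and a repeated "GATTACA".
reference = "ACGTTGCANNNNGATTACACCCGATTACAT"

for read in ["GTTGC", "GATTACA", "TTTTT"]:
    print(read, classify_read(reference, read), find_positions(reference, read))
```

Here `TTTTT` stands in for a read that comes from a region missing from the reference, so it has nowhere to align, while `GATTACA` matches both copies of the repeat equally well.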
The first human genome
The Human Genome Project (HGP) was a 13-year international project launched in 1990 with the objective of generating the first sequence of the human genome. Sequencing of the human genome built on insights gained from model organisms, such as yeasts and worms. However, the volume of human genetic material (human DNA is 3.3 × 10^9 base pairs long) was unprecedented. Additionally, every base in the genome had to be covered several times to ensure that it was properly identified.
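The requirement to cover every base several times translates into simple arithmetic: average coverage equals the number of reads times the read length, divided by the genome size. The sketch below uses the genome size given above; the read length of 500 bases and the fourfold target are assumed values chosen only for illustration.

```python
def coverage(num_reads, read_length, genome_size):
    """Average number of times each base is covered by a read."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads reaching the target coverage (integer inputs)."""
    total_bases = target_coverage * genome_size
    return -(-total_bases // read_length)  # ceiling division on integers


GENOME_SIZE = 3_300_000_000  # base pairs, from the text
READ_LENGTH = 500            # assumed read length
TARGET = 4                   # assumed fourfold coverage

n = reads_needed(TARGET, READ_LENGTH, GENOME_SIZE)
print(n)                                      # 26400000
print(coverage(n, READ_LENGTH, GENOME_SIZE))  # 4.0
```

Even at this modest coverage, tens of millions of reads are needed, which gives a sense of the scale of the project.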
In 1998, the privately funded company Celera Genomics, headed by J. Craig Venter, emerged and began to compete with the publicly funded HGP. Celera Genomics and the HGP used different approaches to genome sequence assembly.
Hierarchical shotgun assembly. The HGP relied mainly on a hierarchical shotgun approach. In the first step, human chromosomes were divided into overlapping DNA fragments, and these fragments were cloned into bacterial artificial chromosomes (BACs). The BACs could then be shipped to different labs all over the world. The next step was shotgun sequencing itself: the DNA insert from each BAC clone was fragmented and sequenced an average of four times. Local sequence assemblies were thus built at the level of individual BACs, and the final sequence was obtained by merging the BAC sequences. The chromosomal locations of the BAC clones were known thanks to mapping technologies such as physical mapping, genetic mapping, and FISH.
Whole-genome shotgun assembly. The method developed and preferred by Celera is called whole-genome shotgun (WGS) sequencing. In this method, the entire genome is broken directly into random reads, which eliminates the BAC library construction step of the HGP's approach. This approach is faster and cheaper, but the risk of incorrect assembly is potentially higher.
In 2000, both Celera and the HGP announced "draft" sequences of the human genome. The reports appeared in February 2001: the HGP consortium published the first human genome in the journal Nature, followed one day later by a Celera publication in Science. Both draft genomes covered about 90% of the human genome and were still missing certain sequences. However, it is doubtful whether Celera's assembly was independent of the HGP assembly: Celera had access to the HGP data, but not vice versa. Afterwards, only the HGP chose to convert its draft into a finished genome sequence. In April 2003, the International Human Genome Sequencing Consortium announced that the draft sequence had been substantially improved: the new version accounted for 92% of the human genome and had approximately 400 gaps.
Why didn't the HGP and parallel projects produce a complete genome sequence? They were limited by the DNA sequencing technologies available at the time. The Sanger sequencing method was mainly used, which has low throughput and produces reads of only ~500-800 bases. Moreover, both the private and the public efforts relied on a shotgun approach, in which only relatively short fragments can be sequenced, so technically challenging and repetitive chromosomal regions remained unresolved.
The human genome reference sequence does not represent any one person's genome. The first genome sequenced by the HGP was a mixture of the genomes of several donors, largely from Buffalo, New York. In detail, 93% of the sequence came from 11 donors, and 70% from just one donor.
The path from HGP to T2T
The HGP produced a highly accurate sequence of the vast majority of the human genome. Its assembly has been continually updated over the years through the efforts of the Genome Reference Consortium (GRC). Overcoming the limitations of Sanger sequencing, newer technologies referred to as next-generation sequencing (NGS) produced larger data volumes in less time. However, some complicated genomic regions, such as highly repetitive telomeric and centromeric zones and multicopy genes, remained unresolved for years because of technological limitations.
Finally, in 2022, the Telomere-to-Telomere (T2T) consortium announced that it had filled in the remaining gaps and produced the most complete human genome sequence to date. The telomere mentioned in the name, a region of repetitive sequences at the ends of chromosomes, was part of the mysterious 8% of the genome uncovered by T2T.
The T2T-CHM13 genome finally resolved the gaps and incorrect regions of previous human genome versions. It provides a full representation of every autosome and the X chromosome, but not of the Y chromosome or some ribosomal DNA arrays. More recently, a complete T2T human Y chromosome was also published. These achievements became possible thanks to long-read sequencing, such as PacBio HiFi and Oxford Nanopore ultralong-read sequencing, and other advanced techniques.
Human genome reference builds
Since the original draft from HGP, the reference genome has been revised and updated regularly as new information emerges. Newer versions of the genome are more complete and better represent the population diversity found in many parts of the human genome.
Assembly updates are reflected in the versions of the reference assembly, or builds. The latest build of the human reference genome is GRCh38 (Genome Reference Consortium human build 38), or hg38 for short. But it's not that simple: reference human genome updates come as major and minor releases. Examples of major releases are the GRCh38 (hg38) build and the previous one, GRCh37 (hg19).
In contrast, a patch (p) is a minor release of the current build. For instance, GRCh37.p3 is the third minor release of hg19. There are two types of patches:
- Fix patches are assembly corrections that will replace the primary sequence in the next major release;
- Novel patches represent population sequence variants or alternate loci.
A new patch adds information without changing chromosome coordinates: patches are simply sequences aligned to the current assembly, and they do not alter the primary sequence. An example is the transition from the second minor release, GRCh37.p2, to the third one, GRCh37.p3. In contrast, major releases lead to coordinate changes, so the coordinates of your gene may not be the same in the next major release of the assembly. On this page, you can view the assembly patches and find more information about them.
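The naming scheme can be made concrete with a small parser that separates the major build from the patch number. This is a minimal sketch that assumes names of the form `GRCh<build>` or `GRCh<build>.p<patch>` as used in this topic; the helper names are made up for illustration.

```python
import re

# Matches names such as "GRCh38" or "GRCh37.p3".
ASSEMBLY_NAME = re.compile(r"^(GRCh\d+)(?:\.p(\d+))?$")


def parse_assembly_name(name):
    """Split an assembly name into (major build, patch number)."""
    match = ASSEMBLY_NAME.match(name)
    if match is None:
        raise ValueError(f"not a GRC human assembly name: {name!r}")
    build, patch = match.groups()
    return build, int(patch) if patch else 0


def same_coordinates(name_a, name_b):
    """Patches keep primary coordinates, so only the major build matters."""
    return parse_assembly_name(name_a)[0] == parse_assembly_name(name_b)[0]


print(parse_assembly_name("GRCh37.p3"))           # ('GRCh37', 3)
print(same_coordinates("GRCh37.p2", "GRCh37.p3")) # True
print(same_coordinates("GRCh37.p13", "GRCh38"))   # False
```

A check like `same_coordinates` is a quick way to decide whether two datasets can be compared directly or whether coordinates must be converted first.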
Let's sum up: the current version of the human reference genome is GRCh38, which was released in 2013 and most recently patched in 2022 (GRCh38.p14). In the picture below, you can see a visual representation of GRCh38.p13 from the Ensembl database. Compared with previous versions, the GRCh38 build contains fewer sequencing errors and more sequence from complicated regions. Moreover, GRCh38 expanded the repertoire of so-called alternate haplotypes (ALT), regions that are specific to different populations. The original linear sequence is thus supplemented with alternative sequences to represent allelic diversity. The better the representation of ALT contigs, the greater our ability to detect genomic variation in specific populations. The T2T-CHM13 reference genome, the most complete one, is now a candidate to replace GRCh38, the currently used reference.
Which human genome reference should you use in your study? In most cases, the currently used GRCh38 assembly is recommended, and the use of older versions should be justified. For example, you may want to use some external data in your research (such as the annotation of unusual transcripts in some cell lines) whose coordinates are available only for hg19. First, double-check whether the annotation has already been converted to a newer version. If not, you can try to convert genome coordinates and annotation files between assemblies. This requires lift-over tools (e.g. the LiftOver tool), which may work imperfectly. Another possible obstacle comes from bioinformatics processing tools: some of them may not be compatible with specific genome builds.
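The idea behind coordinate conversion, and the reason it can fail, can be sketched with a toy block map. This is not how the LiftOver tool is implemented (it works from chain files and also handles strand); the blocks, their coordinates, and the function name below are entirely hypothetical.

```python
# Hypothetical aligned blocks between an old and a new build:
# (old_start, old_end, new_start), 0-based, end-exclusive.
BLOCKS = [
    (0, 1000, 0),        # unchanged region
    (1000, 5000, 1200),  # shifted by 200 bases added upstream in the new build
    # old positions 5000-5499 have no counterpart in the new build
    (5500, 9000, 5600),
]


def lift_position(position, blocks=BLOCKS):
    """Convert an old-build position, or return None if it cannot be mapped."""
    for old_start, old_end, new_start in blocks:
        if old_start <= position < old_end:
            return new_start + (position - old_start)
    return None


for pos in [500, 1500, 5200, 6000]:
    print(pos, "->", lift_position(pos))
```

Positions that fall outside every aligned block come back as `None`, which mirrors the imperfect conversions mentioned above: some features simply have no counterpart in the other build.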
Download the reference human genome
The official source for the latest major release is the GRC website. From there, it can be downloaded over FTP: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38. The sequences are in GCA_000001405.15_GRCh38_genomic.fna.gz.
The assembly contains:
- Assembled chromosomes: chromosomes 1-22 (autosomes, chr1-22), chrX, chrY, and the mitochondrial genome (chrM);
- Unlocalized sequences, which are known to belong to a specific chromosome, but with unknown order or orientation (_random suffix);
- Unplaced sequences, for which the chromosome of origin is unknown (chrUn_ prefix).
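These categories can be told apart by sequence name alone. The sketch below assumes UCSC-style names, in which unlocalized sequences end in `_random` and unplaced ones start with `chrUn_`; the function name and the example names are for illustration only, and files from other sources may label sequences differently (for example, by accession).

```python
import re

# chr1-chr22, chrX, chrY and chrM (assumed UCSC-style naming).
ASSEMBLED = re.compile(r"chr(?:[1-9]|1[0-9]|2[0-2]|X|Y|M)")


def classify_sequence(name):
    """Assign a sequence name to one of the assembly categories."""
    if ASSEMBLED.fullmatch(name):
        return "assembled"
    if name.endswith("_random"):
        return "unlocalized"
    if name.startswith("chrUn_"):
        return "unplaced"
    return "other"  # e.g. alternate loci or patches


names = ["chr7", "chrM", "chr1_KI270706v1_random",
         "chrUn_KI270302v1", "chr1_KI270762v1_alt"]
for name in names:
    print(name, classify_sequence(name))
```

A filter like this is handy when an analysis should be restricted to the assembled chromosomes only.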
The reference genomes (GRCh37, GRCh38) can also be found at multiple other resources, such as NCBI. At NCBI, you can additionally download the reference genome annotation (RefSeq) and other files related to the human genome. Other widely used whole-genome annotations include the GENCODE sets and Ensembl. The latest human genome sequence, T2T, can be found on the NIH website.
Post-HGP projects
First, human genome sequencing initiated the discovery and cataloging of most human genes. For the first time, researchers revealed that only ~2% of the human genome encodes proteins. The HGP led to the development of the ENCODE (Encyclopedia Of DNA Elements) Project, which aims to understand the functional parts of the genome. The HGP has also inspired subsequent large-scale data collection initiatives such as the International HapMap Project, 1000 Genomes, and The Cancer Genome Atlas. Large genomic studies dedicated to specific human populations were launched as well; one example is The Cyprus Genome Project.
The latest T2T reference genome sequences can help to reveal more important genetic information about diseases, aging, evolution, and other important life processes. Now, the T2T Consortium has teamed up with the Human Pangenome Reference project to make a collection of the most complete, high-quality assemblies, which will together comprise the human "pangenome" reference (over 300 genomes).
Conclusion
The size and complexity of the human genome have presented many challenges for sequence assembly. The human reference genome produced by the HGP served as the backbone for newer genome versions. As genomics technologies improved, sequencing artifacts were corrected and challenging regions (centromeres, telomeres, ribosomal DNA, etc.) were resolved. GRCh38 is the currently used reference genome, while T2T is the most complete human genome sequence to date.