
Ensembl


Genome annotations comprise a vast amount of information of various types (gene coding regions, polymorphic sites, regulatory regions, epigenetic marks, etc.). A genome browser is an important tool for organizing and accessing such a volume of data. In this topic, we are going to explore the Ensembl genome browser.

Ensembl overview

The Ensembl project aims to provide up-to-date, high-quality, comprehensive genome annotations for many eukaryotic genomes. It was created in 1999 by the EMBL European Bioinformatics Institute (EMBL-EBI) and the Wellcome Trust Sanger Institute.

Genome annotation is a description of the functional elements along a genome sequence. First of all, it includes the identification of protein-coding genes and their transcripts, together with a description of the functions of the encoded proteins. Annotation also involves the identification of non-coding genes and the corresponding non-coding RNAs. More advanced annotations describe regulatory sequences (for example, promoters and enhancers), gene variants, repeats, and other genome elements.

All data on the Ensembl website is freely available to the scientific community and is organized as a genome browser. The project stores annotations for organisms from different taxa, which are split across individual Ensembl sites: Ensembl for vertebrate genomes, Ensembl Bacteria, Ensembl Fungi, Ensembl Plants, and Ensembl Protists. In this topic, we focus only on Ensembl for vertebrates, since it is the main site. Today the project supports evidence-based gene annotations for over 300 vertebrates. A selected subset of these genomes also has additional data such as variation, comparative, evolutionary, functional, and regulatory annotation. To keep the information up to date, the Ensembl website and the underlying databases are updated every 2-3 months. For example, at the time of writing the current release was Ensembl 110, released in July 2023. Previous releases are kept in the Ensembl Archive.

Ensembl also provides a set of tools for data analysis and processing. For instance, the BioMart tool exports custom datasets from Ensembl, BLAST/BLAT searches the Ensembl genome databases for sequences of interest, and the Variant Effect Predictor (VEP) predicts the functional consequences of a given variant.

Exploring Ensembl pages

Now let's explore the Ensembl website starting from the main page.

Ensembl main page

First, you can find the Ensembl release number and a link to all available tools in the top white block. Next, there is a blue block with the main search bar. It is a portal into all of the Ensembl data. Here you can search for gene names, gene IDs, genomic coordinates, variant IDs, phenotypes, etc. in the species of interest.

The blue block is followed by a white block, which contains links to the genome annotations of all available vertebrates. If we choose human, we will see the following page.

Human genome annotation page

It contains 6 blocks. The top blue block has a search bar, where you can enter a gene name, ID, coordinates, etc. The remaining 5 blocks each provide information about one feature of the annotation.

  1. Genome assembly. The section provides the DNA sequence of the latest genome assembly and its basic statistics.
  2. Gene annotation. The section describes the positions and functions of protein-coding genes, various non-coding genes, gene transcripts, pseudogenes, and proteins. It provides their sequences in FASTA format as well as annotation files in GFF/GTF format. Gene annotation is a complicated process that can be performed manually by a specialist or by an automatic annotation pipeline. You can learn more about it by clicking the "More about this gene build" link in the "Gene annotation" box.
  3. Variation. The data includes single nucleotide polymorphisms (SNPs), short nucleotide insertions and deletions, and structural variants. The primary sources of this information are databases such as dbSNP, EVA, DGVa, etc. The Ensembl team checks the quality of the variants and excludes "suspicious" ones. The team then imports linked information about allele frequencies, associated phenotypes, diseases, and publications from other databases. Unfortunately, variation data is currently available for only 23 species.
  4. Regulation. The section contains information about promoters, enhancers, repressors, histone modifications, DNA methylation, and transcription factor binding sites. The Ensembl team derives this information from various genomic assays: ChIP-seq, ATAC-seq, and microarrays. Like variation data, regulation data is available for only a few species.
  5. Comparative genomics. Ensembl offers a wide range of functions to perform comparative genomic analyses: gene tree construction, homology predictions, whole genome alignments, etc.
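
The annotation files mentioned in the "Gene annotation" item use the tab-separated GTF layout with nine columns (sequence name, source, feature, start, end, score, strand, frame, attributes). The following is a minimal sketch of reading one such line in Python. The function name `parse_gtf_line`, the `GtfRecord` type, and the example line (including its gene ID and name) are hypothetical illustrations, not content copied from an Ensembl release file, and the attribute parsing is simplified: it assumes no semicolons inside quoted values.

```python
from typing import Dict, NamedTuple


class GtfRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    strand: str
    attributes: Dict[str, str]


def parse_gtf_line(line: str) -> GtfRecord:
    """Split one GTF line into its nine tab-separated columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    seqname, source, feature, start, end, _score, strand, _frame, attr_text = cols

    # The last column holds pairs of the form: key "value"; key "value";
    attributes: Dict[str, str] = {}
    for item in attr_text.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')

    return GtfRecord(seqname, source, feature, int(start), int(end), strand, attributes)


# Hypothetical example line: the values are made up for illustration.
example = "\t".join([
    "13", "ensembl_havana", "gene", "32315086", "32400268", ".", "+", ".",
    'gene_id "ENSG00000000001"; gene_name "EXAMPLE"; gene_biotype "protein_coding";',
])
record = parse_gtf_line(example)
print(record.feature, record.seqname, record.start, record.end, record.attributes["gene_id"])
```

For real files, a dedicated parser library is usually a safer choice; the sketch only shows how the columns map to the annotation described above.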

Every entity stored in the database has its own Ensembl stable ID. Stable IDs have a fixed format that depends on the data type and the species. For human genome entities, the format is as follows.

  • ENSG########### – Ensembl Gene ID
  • ENST########### – Ensembl Transcript ID
  • ENSP########### – Ensembl Protein ID
  • ENSE########### – Ensembl Exon ID
  • ENSR########### – Ensembl Regulatory region ID

For other species, a code indicating the species is inserted after the ENS prefix. For instance,

  • MUS (Mus musculus) for mouse: ENSMUSG###
  • DAR (Danio rerio) for zebrafish: ENSDARG###
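
The stable ID format described above can be checked with a short regular expression. This is a minimal sketch, assuming the species code is exactly three uppercase letters (as in the MUS and DAR examples) and that the numeric part has 11 digits; the names `parse_stable_id`, `StableId`, and `FEATURE_TYPES` are hypothetical helpers introduced for illustration.

```python
import re
from typing import NamedTuple, Optional

# Feature letters taken from the list of human stable ID types above.
FEATURE_TYPES = {
    "G": "gene",
    "T": "transcript",
    "P": "protein",
    "E": "exon",
    "R": "regulatory region",
}

# ENS + optional three-letter species code + feature letter + 11 digits.
STABLE_ID = re.compile(r"^ENS(?P<species>[A-Z]{3})?(?P<feature>[GTPER])(?P<number>\d{11})$")


class StableId(NamedTuple):
    species_code: Optional[str]  # None means no species code, i.e. human
    feature_type: str
    number: int


def parse_stable_id(text: str) -> StableId:
    """Split an Ensembl-style stable ID into species code, type, and number."""
    match = STABLE_ID.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognized stable ID: {text!r}")
    return StableId(
        species_code=match["species"],
        feature_type=FEATURE_TYPES[match["feature"]],
        number=int(match["number"]),
    )


print(parse_stable_id("ENSG00000139618"))
print(parse_stable_id("ENSMUSG00000041147"))
```

An ID without a species code is treated as human here, which follows the convention described in the text.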

Genome browser

You are already familiar with the genome browser concept from a previous topic. So let's take a closer look at the Ensembl genome browser using the example of a human genome region located on chromosome 13 between positions 32315086 and 32400268 (counted in nucleotides from the start of the chromosome). Remember, coordinates should be entered in the following format: chromosome:start position-stop position. Thus, in our case, the entry should be formatted as 13:32315086-32400268. To get to the genome browser page, choose human, enter 13:32315086-32400268 in the search bar on the main page, and press "Go".
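
The coordinate format can be made concrete with a small parser. This is a minimal sketch, assuming the chromosome:start-stop layout described above with optional spaces around the separators; `parse_region` and `Region` are hypothetical names, and the length is computed with both end positions included.

```python
import re
from typing import NamedTuple

# chromosome : start - end, tolerating spaces around ":" and "-".
REGION = re.compile(r"^\s*(?P<chrom>[^:\s]+)\s*:\s*(?P<start>\d+)\s*-\s*(?P<end>\d+)\s*$")


class Region(NamedTuple):
    chromosome: str
    start: int
    end: int

    @property
    def length(self) -> int:
        # Both the start and the end position belong to the region.
        return self.end - self.start + 1


def parse_region(text: str) -> Region:
    """Parse a region string such as '13:32315086-32400268'."""
    match = REGION.match(text)
    if match is None:
        raise ValueError(f"cannot parse region: {text!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError("start position must not exceed end position")
    return Region(match["chrom"], start, end)


region = parse_region("13: 32315086-32400268")
print(region.chromosome, region.start, region.end, region.length)
# prints: 13 32315086 32400268 85183
```

The region in our example is therefore about 85 kilobases long, which is why it occupies only a small part of the one-megabase overview.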

The main search bar

You will get to the following page.

Genome browser page

The page is split into 3 separate views, and each view is more detailed than the previous one. At the top, there is a chromosome overview. Underneath, there is a one-megabase overview, where a red box marks our region of interest. As in other genome browsers, data is organized in tracks. The basic tracks are contigs and genes (or transcripts). Other tracks can be added if available. The legend explains the color coding used in the tracks.

At the very bottom, there is the last view, where we can see the region in detail from start coordinates to end coordinates.

Detailed view in the genome browser

There are two gene tracks: one for the forward strand and the other for the reverse strand. All tracks related to the forward strand have a blue bar on the left side, while data on the reverse strand is marked by a yellow bar. In our example, a gene is present only on the forward strand. Genes are shown in the tracks as transcripts. A transcript is drawn as follows: a line is an intron and a block is an exon. Filled (colored) blocks are the coding parts of exons, while unfilled blocks are the non-coding parts (untranslated regions).

Transcript representation

Notice that introns are much longer than exons. In the example, transcripts vary in color. Gold and red transcripts are both protein-coding but differ in annotation method. A transcript is colored gold if the automated and manual annotations are identical. The remaining protein-coding transcripts are red. Blue is used for non-coding transcripts.
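
The relationship between exon blocks and intron lines can be expressed in a few lines of code: introns are simply the gaps between consecutive exons of a transcript. The sketch below assumes exons are given as inclusive (start, end) pairs; the coordinates are hypothetical values chosen for illustration and do not come from the example region, and `introns_from_exons` is a hypothetical helper name.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) coordinates


def introns_from_exons(exons: List[Interval]) -> List[Interval]:
    """Return the gaps between consecutive exons of one transcript."""
    ordered = sorted(exons)
    introns: List[Interval] = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Only a real gap between two exons counts as an intron.
        if next_start > prev_end + 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


def total_length(intervals: List[Interval]) -> int:
    """Sum the lengths of inclusive intervals."""
    return sum(end - start + 1 for start, end in intervals)


# Hypothetical transcript with three exons.
exons = [(100, 199), (1000, 1099), (5000, 5149)]
introns = introns_from_exons(exons)
print(introns)                # [(200, 999), (1100, 4999)]
print(total_length(exons))    # 350
print(total_length(introns))  # 4700
```

Even in this made-up transcript, the introns cover far more sequence than the exons, which mirrors what the detailed view shows.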

You can show or hide additional tracks using the "Configure this page" button on the left side of the page. Annotation features in additional tracks will be displayed as colored blocks. For example, there are 7 enhancers in our region of interest (yellow blocks in the "Regulatory Build" track).

Conclusion

The Ensembl genome browser has proven to be an invaluable tool for researchers studying genomics. Its functionality and user-friendly interface allow users to navigate and explore vast amounts of genomic data efficiently. Thanks to regular releases, the annotations stay up to date, which helps researchers discover and understand the complexities of the genome.
