Introduction

Scientific knowledge has grown so rapidly over the last several decades that individual or analog collections are no longer sufficient. To keep up with the rate of discovery and to give researchers around the world access to the most up-to-date data, the scientific community created shared databases to manage massive amounts of data. In this topic, you will learn about one of the largest collections of biological databases and its structure.

What is NCBI?

In 1988, the National Center for Biotechnology Information (NCBI) was created within the National Library of Medicine at the National Institutes of Health. Its main mission is to develop information systems for molecular biology (link to the paper).

NCBI manages a collection of databases that are divided into six groups: literature, genomes, genes, proteins, chemicals, and clinical databases. Many of the NCBI databases are critically important to scientists conducting research in the life sciences. One example is PubMed, one of the most widely used tools for searching biomedical articles. Being indexed in PubMed makes a paper much easier for other researchers to find.

Information can be retrieved via the website, the Python package Bio.Entrez (part of Biopython), or APIs (Application Programming Interfaces). Website searches support Boolean logic (e.g. "promoters OR response elements") to improve the relevance of the results. Programmatic access is provided by E-utilities, a set of nine programs that return database statistics (EInfo), search text (ESearch), download data records (EFetch), and so on. The list of NCBI databases in use as of 4 September 2021 is provided in the paper (table 1, paper).
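As a rough illustration of programmatic access, the sketch below only builds request URLs for two of the E-utilities named above (ESearch and EInfo) and does not send anything over the network. The helper name build_eutils_url is made up for this example. The base address and the db, term, retmax, and retmode parameters follow the public E-utilities conventions, but check the official documentation before relying on them.

```python
from urllib.parse import urlencode

# Base address of the E-utilities service (public NCBI endpoint).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_eutils_url(tool, **params):
    """Return the request URL for one E-utility, e.g. 'esearch' or 'einfo'.

    Hypothetical helper for illustration; it only formats the URL.
    """
    return f"{EUTILS_BASE}/{tool}.fcgi?{urlencode(params)}"


# ESearch: text search in PubMed using the Boolean query from the text.
search_url = build_eutils_url(
    "esearch",
    db="pubmed",
    term="promoters OR response elements",
    retmax=5,          # return at most five record identifiers
    retmode="json",    # ask for a JSON response
)

# EInfo: statistics and field descriptions for one database.
info_url = build_eutils_url("einfo", db="pubmed", retmode="json")

print(search_url)
print(info_url)
```

The identifiers returned by an ESearch request can then be passed to EFetch to download the matching records. Bio.Entrez wraps the same calls in Python functions, so in practice you rarely need to build URLs by hand.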

Overview of NCBI databases

Literature databases

The PubMed database contains detailed records of scientific and medical papers: abstracts, identifiers (such as DOI and PubMed ID), author names and affiliations, references, journal name, and information about grant support. PubMed can suggest similar articles based on those you have read. NCBI also stores free, full-text journal articles in the PubMed Central database (PMC). This repository contains not only the article text and figures but also supplemental information and online methods, which makes it especially convenient for analyzing and reusing articles.

Genomes databases

Nucleotide is a collection of DNA and RNA sequences from GenBank and the Reference Sequence (RefSeq) [links to GenBank about-page and RefSeq about-page]. GenBank is an annotated collection of all publicly available DNA sequences. Its file format includes fields such as molecule type (e.g. DNA, RNA), sequence data, author information, and features of the sequence (e.g. location, organism information, and protein sequence if applicable). RefSeq is a nucleotide sequence database with a file format very similar to GenBank. RefSeq tends to contain fewer similar or nearly identical sequences. Its records are often validated, and sequence annotation in RefSeq is more consistent than in GenBank.

Sequence Read Archive (SRA) is an archival repository of next-generation sequencing (NGS) data. It has both publicly available content and controlled-access data (dbGaP). Any user from a research institution or company can request access to data in dbGaP. SRA has a hierarchical structure. The top-level layer is called "Study," for example SRP250911 (every Study identifier starts with "SRP"). Every Study includes one or more Experiments ("SRX"), and each Experiment includes one or more Runs ("SRR"). Each biological sample has an annotation called BioSample.
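The accession prefixes make it easy to tell which level of the hierarchy an identifier belongs to. Here is a minimal sketch: the function name sra_level is made up for this example, Run accessions are assumed to carry the "SRR" prefix, and accessions issued by partner archives (which use other prefixes) are deliberately not handled.

```python
import re

# Prefix of an SRA accession -> level in the SRA hierarchy.
SRA_LEVELS = {"SRP": "Study", "SRX": "Experiment", "SRR": "Run"}

# An accession is a known prefix followed by one or more digits.
_ACCESSION = re.compile(r"^(SRP|SRX|SRR)\d+$")


def sra_level(accession):
    """Return 'Study', 'Experiment', or 'Run', or None if not recognized."""
    match = _ACCESSION.match(accession.strip().upper())
    if match is None:
        return None
    return SRA_LEVELS[match.group(1)]


print(sra_level("SRP250911"))  # Study
```

A check like this is handy before sending identifiers to a download tool, because a Study accession expands to many Runs while a Run accession points to a single set of reads.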

Genes databases

The Gene database is a collection of gene reports. It contains information about the function, location, and expression of genes in various organisms. Gene locations are linked with current and previous genome assemblies to support clinical users who have not yet updated to the current reference assembly.

Gene Expression Omnibus (GEO) is a repository for gene expression and NGS data. It has two modules: GEO DataSets and GEO Profiles. GEO DataSets stores information about samples, the platform, the experimental design, and the publication if applicable. GEO Profiles is a tool used to check the expression level of a gene of interest across all samples in a DataSet.

Proteins databases

The Protein database is a collection of protein sequences obtained from several sources, including translations from GenBank and RefSeq. The Protein file format has the same fields as Nucleotide, but the sequence consists of amino acids rather than nucleotides.

On NCBI, you can also find structural information about proteins, such as crystallographic and nuclear magnetic resonance (NMR) data, which is useful for protein analysis and visualization. This data is stored in the Structure database, also called the Molecular Modeling Database (MMDB). MMDB obtains information from the Protein Data Bank (PDB) and allows visualization of protein structures using the NCBI 3D viewer, iCn3D.

Clinical databases

The Database of Single Nucleotide Polymorphisms (dbSNP) is a collection of human genomic variations (single-nucleotide variations and other small-scale variations) and their frequencies in different populations. The repository includes common variants (found in at least 1% of the population) and rare variants (found in less than 1%).
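The 1% threshold translates directly into code. The sketch below is only an illustration: the function name classify_variant and the frequencies passed to it are made up, and the input is assumed to be a population frequency expressed as a fraction between 0 and 1.

```python
# Threshold from the text: a variant is "common" at 1% or more.
COMMON_THRESHOLD = 0.01


def classify_variant(frequency):
    """Label a variant 'common' or 'rare' from its population frequency."""
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must be a fraction between 0 and 1")
    return "common" if frequency >= COMMON_THRESHOLD else "rare"


# Hypothetical frequencies, not taken from dbSNP.
print(classify_variant(0.25))   # common
print(classify_variant(0.002))  # rare
```

Because dbSNP reports frequencies per population, the same variant may be labeled common in one population and rare in another.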

The ClinVar database is an archive of human variations of clinical significance with supporting evidence (e.g. submitted or published papers). "Clinical significance" means that the variation is linked with a phenotype alteration or a specific medical condition.

ClinicalTrials.gov is a database of ongoing and completed clinical research (human studies). Every clinical trial has an identification number that starts with "NCT," for instance, NCT03637543. Each clinical trial has a page containing study description, study design, current progress (i.e. clinical trial phase), as well as the results of the study.
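When collecting trial identifiers from text or spreadsheets, a quick format check helps catch typos. This is a minimal sketch assuming the identifier is "NCT" followed by exactly eight digits, as in the example above. The function name is_nct_id is made up, and the check says nothing about whether a trial with that number actually exists.

```python
import re

# Assumed format: the literal prefix "NCT" followed by exactly eight digits.
_NCT_PATTERN = re.compile(r"NCT\d{8}")


def is_nct_id(text):
    """Return True if text looks like a ClinicalTrials.gov identifier."""
    return _NCT_PATTERN.fullmatch(text) is not None


print(is_nct_id("NCT03637543"))  # True
print(is_nct_id("NCT0363754"))   # False: only seven digits
```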

Chemicals databases

The PubChem database provides chemical information for millions of compounds and substances. PubChem Substance stores information on chemical substances submitted to PubChem by depositors. PubChem Compound includes unique and validated structures. It stores information about structure, chemical safety, molecular formula and weight, and patents if applicable. PubChem also has a separate module about the effects of chemicals on biological systems (PubChem BioAssay).

Conclusion

NCBI is a meta-database containing dozens of databases that catalog information ranging from scientific literature to sequence repositories. Information can be retrieved using the NCBI website, the Python package Bio.Entrez, or APIs. NCBI provides access to nucleotide and protein sequences in GenBank, RefSeq, and other formats. SRA and GEO are the two main databases for storing and downloading expression and NGS data. Both dbSNP and ClinVar are used for variant annotation. MMDB and iCn3D are used to analyze and visualize protein structures. PubChem is essential for finding information on the properties and biological effects of chemicals.
