Biological science has advanced rapidly over the last 10-15 years. The European Bioinformatics Institute (EBI) documented a so-called "data explosion" beginning in 2008 and attributed it to the arrival of inexpensive sequencing platforms. This volume of data is why bioinformatics relies on databases to store and organize information.
Types of bioinformatic databases
Biological databases can be classified in different ways. The most general classification is based on how the data were obtained and distinguishes primary from secondary databases.
Primary databases are filled with data obtained from experiments, such as nucleotide or protein sequences or structures of macromolecules. The data are deposited by researchers.
Secondary databases are populated with data aggregated from primary databases. In other words, secondary databases add layers of information to nucleotide or protein sequences. Such databases are curated manually or computationally. A secondary database can draw on several sources, which may themselves be primary or secondary databases. For instance, UniProt Knowledgebase (UniProtKB) combines two protein databases, Swiss-Prot and TrEMBL.
However, there is a more intuitive way to describe the variety of biological databases, which the journal Nucleic Acids Research (NAR) uses in its annual database issue. NAR classifies databases by the type of biological information they hold (e.g. nucleic acids, proteins, macromolecule structure, metabolism and signaling, microbes and viruses, pharmacology). The 2022 database issue contains 185 papers, and the online database collection has more than 1500 entries.
Bioinformatic databases overview
Nucleotide sequence databases
The International Nucleotide Sequence Database Collaboration (INSDC) combines three large databases: GenBank from NCBI (see the "NCBI databases" topic), the European Nucleotide Archive (ENA), and the DNA Data Bank of Japan (DDBJ). INSDC defines submission standards for qualifiers in nucleotide entries, such as "/country" and "/type_material", and these standards are the same for all three databases. ENA is hosted by EMBL-EBI, which also maintains several other large databases, including BioStudies. BioStudies collects information about study design (often found in the "Experimental Design" or "Description" section), sample count, organism, sequencing protocol (if applicable), etc. It also stores FTP links to raw sequencing files (e.g. *.fastq.gz) in the *sdrf.txt file.
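As a rough illustration of how the FTP links in an *sdrf.txt file might be collected, the sketch below reads tab-delimited text and groups links by sample. The sample rows, the example.org URLs, and the column names "Source Name" and "Comment[FASTQ_URI]" are assumptions made for this example; the header of a real file should be checked before relying on them.

```python
import csv
import io

# Hypothetical SDRF excerpt: sample names and URLs are made up for illustration.
# The column name "Comment[FASTQ_URI]" is an assumption; inspect the real header.
SAMPLE_SDRF = (
    "Source Name\tCharacteristics[organism]\tComment[FASTQ_URI]\n"
    "sample_1\tHomo sapiens\tftp://example.org/run1_1.fastq.gz\n"
    "sample_1\tHomo sapiens\tftp://example.org/run1_2.fastq.gz\n"
    "sample_2\tMus musculus\tftp://example.org/run2.fastq.gz\n"
)


def fastq_links(sdrf_text, uri_column="Comment[FASTQ_URI]"):
    """Group FASTQ links by sample name from tab-delimited SDRF text."""
    links = {}
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    for row in reader:
        uri = (row.get(uri_column) or "").strip()
        if uri:  # skip rows without a link
            links.setdefault(row["Source Name"], []).append(uri)
    return links


links_by_sample = fastq_links(SAMPLE_SDRF)
for sample, uris in links_by_sample.items():
    print(sample, uris)
```

Paired-end runs typically appear as two rows for the same sample, which is why the links are grouped into a list per sample.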
Genomes databases
The Encyclopedia of DNA Elements (ENCODE) is a database of functional elements in the human and mouse genomes. It stores information at different levels: genomes, genes, and regulatory elements. One can inspect GC content at genome scale, untranslated regions in genes, or promoter-like and enhancer-like regions. The UCSC Genome Browser provides access to a database of annotated genomic sequences of different organisms, including human and mouse. When visualizing a genome, users can add tracks such as expression, epigenetics, phenotype, and disease association data.
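GC content, mentioned above, is simply the fraction of G and C bases in a sequence. The following is a minimal sketch of how it could be computed over consecutive windows; the function names and the toy sequence are illustrative and are not taken from ENCODE or the UCSC Genome Browser.

```python
def gc_content(sequence):
    """Return the fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at  # ambiguous bases such as N are ignored
    return gc / total if total else 0.0


def windowed_gc(sequence, window):
    """GC content for consecutive, non-overlapping windows of the given size."""
    return [
        gc_content(sequence[start:start + window])
        for start in range(0, len(sequence), window)
    ]


toy_sequence = "ATGCGCGCATATATATNNGC"  # made-up sequence for illustration
print(windowed_gc(toy_sequence, 10))
```

Ignoring ambiguous bases is a design choice of this sketch; other tools may count them in the denominator.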
Protein sequence databases
UniProt Knowledgebase (UniProtKB) combines two protein databases: Swiss-Prot and Translated EMBL (TrEMBL). TrEMBL is an unreviewed collection of protein sequences that consists of automated translations of open reading frames. Swiss-Prot is a curated database: curators take entries from TrEMBL, check them, and deposit the results in Swiss-Prot. Every UniProtKB entry includes metadata (e.g. review status, protein length), sequence annotation (e.g. organism name, article title, identifiers in other databases, feature table), and the sequence itself. Feature tables describe secondary structure, binding sites, and modified residues. The InterPro database is a collection of functionally analyzed proteins grouped into protein families. It provides a resource for protein domain analysis, and researchers can find sets of similar proteins based on domain architecture. Pfam, one of InterPro's member databases, no longer has a standalone website; its data are now available through InterPro.
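The review status of a UniProtKB entry is also visible in its FASTA header, where the prefix "sp" marks a Swiss-Prot entry and "tr" a TrEMBL entry. The sketch below parses such a header with a regular expression. The function name and the returned field names are choices made for this example, and the pattern covers only the leading part of the header.

```python
import re

# Matches the leading part of a UniProtKB FASTA header:
# >db|accession|entry_name protein name OS=organism OX=taxon_id ...
HEADER_RE = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)\s+"
    r"(?P<protein_name>.+?)\s+OS=(?P<organism>.+?)\s+OX=(?P<taxon_id>\d+)"
)


def parse_uniprot_header(header):
    """Extract basic fields and review status from a UniProtKB FASTA header."""
    match = HEADER_RE.match(header)
    if match is None:
        raise ValueError("not a UniProtKB-style FASTA header")
    fields = match.groupdict()
    fields["reviewed"] = fields["db"] == "sp"  # sp = Swiss-Prot, tr = TrEMBL
    return fields


example_header = (
    ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha "
    "OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2"
)
entry = parse_uniprot_header(example_header)
print(entry["accession"], entry["organism"], entry["reviewed"])
```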
Structure databases
The Protein Data Bank (PDB) stores the coordinates of three-dimensional macromolecular structures for standalone proteins, protein complexes, and nucleic acid-protein complexes. The database also collects information about experimental procedures and has an integrated web tool for 3D visualization.
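To make "coordinates" concrete: in the legacy PDB text format, each atom is described by a fixed-width ATOM or HETATM record. The sketch below reads the main fields of one such record by column position. The sample record uses illustrative values rather than a real entry, and it is assembled from column-width pieces only to make the layout visible.

```python
# One ATOM record assembled from fixed-width pieces (illustrative values,
# not taken from a real PDB entry).
SAMPLE_ATOM = (
    "ATOM  "      # columns 1-6: record name
    "    1"       # columns 7-11: atom serial number
    " "           # column 12: blank
    " N  "        # columns 13-16: atom name
    " "           # column 17: alternate location indicator
    "MET"         # columns 18-20: residue name
    " "           # column 21: blank
    "A"           # column 22: chain identifier
    "   1"        # columns 23-26: residue sequence number
    "    "        # columns 27-30: insertion code and blanks
    "  27.340"    # columns 31-38: x coordinate
    "  24.430"    # columns 39-46: y coordinate
    "   2.614"    # columns 47-54: z coordinate
)


def parse_atom(line):
    """Read the main fields of an ATOM/HETATM record by column position."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM or HETATM record")
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }


atom = parse_atom(SAMPLE_ATOM)
print(atom)
```

For real work, an established parser is usually a safer choice than hand-written slicing, since actual files also contain occupancy, temperature factors, and many other record types.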
Chemical databases
PubChem provides chemical information for millions of compounds and substances, including molecular formula, weight, structure, and the effects of chemicals on biological systems. The DrugBank database collects information about drugs and active substances. It includes drug classification, molecular targets, and clinical trials associated with a drug.
Metabolic and signaling databases
The Reactome database stores biologically relevant entities such as biological pathways, chemical reactions, chemical compounds, and drugs related to biological molecules. A biological pathway entry depicts a graph of molecular interactions and shows how the pathway fits into cell metabolism. A chemical reaction entry stores information about reaction participants and catalytic activity. A drug entry shows the PubChem compound identifier and the biological process affected by the drug. The Kyoto Encyclopedia of Genes and Genomes (KEGG) is another database that stores information about biological pathways, biochemical reactions, enzymes, and drugs. KEGG PATHWAY provides access to metabolic pathways (e.g. photosynthesis), cellular processes (e.g. cell cycle), and organismal systems (e.g. platelet activation).
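The idea of a pathway as a graph can be sketched with a toy example: compounds are nodes, and each reaction is an edge labeled with the enzyme that catalyzes it. The data structure below is only an illustration built from the first steps of glycolysis; it is not the schema used by Reactome or KEGG.

```python
from collections import deque

# Toy model of the first steps of glycolysis (illustrative structure only).
REACTIONS = [
    {"substrate": "glucose",
     "product": "glucose 6-phosphate",
     "enzyme": "hexokinase"},
    {"substrate": "glucose 6-phosphate",
     "product": "fructose 6-phosphate",
     "enzyme": "glucose-6-phosphate isomerase"},
    {"substrate": "fructose 6-phosphate",
     "product": "fructose 1,6-bisphosphate",
     "enzyme": "phosphofructokinase"},
]


def build_graph(reactions):
    """Map each substrate to a list of (product, enzyme) edges."""
    graph = {}
    for reaction in reactions:
        edge = (reaction["product"], reaction["enzyme"])
        graph.setdefault(reaction["substrate"], []).append(edge)
    return graph


def downstream(graph, start):
    """Return compounds reachable from start, in breadth-first order."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        compound = queue.popleft()
        for product, _enzyme in graph.get(compound, []):
            if product not in seen:
                seen.add(product)
                order.append(product)
                queue.append(product)
    return order


pathway = build_graph(REACTIONS)
print(downstream(pathway, "glucose"))
```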
Human genes and diseases databases
The 1000 Genomes Project is a catalog of human genetic variation (e.g. single-nucleotide variants and chromosomal translocations) across different populations. All entries were obtained from healthy individuals. The catalog gives scientists a reliable reference for looking up variant frequencies. NCBI also maintains several databases that describe human genomic variation (dbSNP, dbVar) and human variants of clinical significance (ClinVar) [see "NCBI databases" topic].
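Variant catalogs of this kind are commonly distributed as VCF files, in which each data line lists one variant followed by per-sample genotypes. As a minimal sketch, the frequency of non-reference alleles can be computed from the genotype columns as shown below. The data line is hypothetical (position and genotypes are made up), and the function name is a choice made for this example.

```python
# Hypothetical VCF data line with four phased diploid genotypes.
# Columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT, then one per sample.
SAMPLE_VCF_LINE = "1\t12345\t.\tA\tC\t100\tPASS\t.\tGT\t0|0\t0|1\t1|1\t0|1"


def alt_allele_frequency(vcf_line):
    """Fraction of called alleles that differ from the reference (allele 0)."""
    fields = vcf_line.rstrip("\n").split("\t")
    gt_index = fields[8].split(":").index("GT")  # position of GT in FORMAT
    alt_count = 0
    total = 0
    for sample in fields[9:]:
        genotype = sample.split(":")[gt_index]
        # Genotypes use "|" (phased) or "/" (unphased) as separators.
        for allele in genotype.replace("|", "/").split("/"):
            if allele == ".":  # missing call
                continue
            total += 1
            if allele != "0":
                alt_count += 1
    return alt_count / total if total else 0.0


print(alt_allele_frequency(SAMPLE_VCF_LINE))
```

This sketch pools all non-reference alleles together; a multi-allelic site would need per-allele counts instead.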
Microarray data and other gene expression databases
NCBI manages more than 30 databases, including Gene Expression Omnibus (GEO), a collection of microarray and NGS data. For more information about NCBI databases, see the "NCBI databases" topic. Expression Atlas provides information about gene expression levels in different tissues and cells, at different developmental stages and cell cycle phases, across species.
Biological data accessibility
Many scientists work to keep access to biological data free. For instance, GenBank, GEO, UniProtKB, ENCODE, InterPro, and PDB are free for any user. However, some datasets are under controlled access because of grant restrictions or institutional policies. Such datasets are often deposited in the Database of Genotypes and Phenotypes (dbGaP, an NCBI database) and the European Genome-Phenome Archive (EGA). This means that a researcher has to request access before downloading data from these databases. Access is often limited to non-commercial use.
Conclusion
The past decade gave biologists an enormous quantity of data, including sequencing results, dataset annotations, publications, clinical trial results, and macromolecule structures. As a result, new, specialized biological databases became necessary. Keep in mind that several databases may hold the same type of biological data and still offer different functionality to users. The major task for a bioinformatician is therefore to choose the database that best fits the goals and objectives at hand.