Metagenomics is a fascinating branch of biology that allows us to explore and understand the hidden world of microorganisms. It captures a snapshot of the collective genetic material from all the microorganisms present in a sample, providing insights into community structure, diversity, and functions.
Let's take a deep dive into this field!
Setting up a metagenomic study
Metagenomics is a technology that includes sequencing of the total DNA of microbial communities and its subsequent analysis. The process of community analysis starts with the sampling of material (e.g., soil, water, saliva), followed by DNA extraction and sequencing.
There are several approaches to sequencing a metagenome:
- Targeted sequencing: the sequencing of a single region or gene (typically, bacterial 16S rDNA).
- Shotgun sequencing: the sequencing of the whole DNA of a community using short-read sequencing.
- Nanopore sequencing: the sequencing of the whole DNA using nanopores in order to assemble metagenomes (or MAGs: Metagenome-Assembled Genomes).
In general, the metagenomic study plan is independent of the sequencing approach. Before sequencing your samples, you have to formulate the hypothesis to test, collect samples (positive and negative controls as well as experimental samples), extract DNA, and finally sequence it. After that, the sequencing data requires quality control and processing, followed by data analysis.
Sample preparation is the most crucial step: impure DNA or an insufficient amount of DNA may lead to ineffective sequencing. Errors during PCR enrichment of the library lead, in particular, to a smaller library size and an increased proportion of chimeric sequences (discussed later in this topic).
In this topic, we will focus on a targeted sequencing approach: amplicon metagenomics.
Amplicon metagenomics aspects
Amplicon sequencing involves the targeted amplification and sequencing of specific genetic regions:
- 16S rRNA gene for bacteria and archaea.
- The internal transcribed spacer (ITS) region for fungi.
- 18S rRNA gene for other eukaryotes.
By designing primers specific to these regions, researchers can selectively amplify and sequence DNA fragments from the diverse microbial populations in the sample.
Amplicon sequencing offers several advantages, including high-throughput capabilities, cost-effectiveness, and the ability to generate large-scale datasets for studying microbial diversity across various environmental samples. It has become a fundamental tool in studying the structure, dynamics, and interactions of microbial communities in diverse ecosystems, providing insights into the roles and functions of microorganisms in their respective habitats. However, the results you acquire with this approach are highly influenced by the markers you choose to sequence.
16S rRNA gene
The 16S rRNA gene is one of the most popular targets for metagenomics. Comparison of this gene across bacterial species has revealed conserved and variable regions. Typically, one or several variable regions (V3-V4 is the most common choice) are used in metagenomic research because of two main advantages: the relatively low cost of the experiment and relatively high taxonomic resolution.
Taxonomic resolution is high when we annotate samples at the species level, and it is low when precise taxonomic identification is available only at the phylum or class level. In most cases, variable regions of 16S rDNA are unlikely to yield species-level annotation; typically, the finest level this approach reaches is the genus.
Sequencing the full-length 16S rDNA enables taxonomic identification at the species level; however, achieving this resolution requires long-read technologies such as PacBio.
18S rRNA gene
Like the 16S rRNA gene, the 18S rRNA gene has variable and conserved regions. Commonly targeted variable regions include V4, V5, and V9, although other regions such as V1-V3 and V7-V8 may also be used in certain studies. These regions are selected based on their ability to provide sufficient sequence variability to distinguish between different eukaryotic taxa while still allowing for efficient and accurate sequencing.
ITS region
The Internal Transcribed Spacer (ITS) region is a commonly targeted marker for amplicon sequencing in metagenomic studies, particularly for fungal communities. It is located between the highly conserved small subunit (SSU) and large subunit (LSU) rRNA genes in the ribosomal DNA (rDNA) repeat unit. The ITS region consists of two variable regions, ITS1 and ITS2, separated by the highly conserved 5.8S rRNA gene. The ITS region exhibits higher sequence diversity than the rRNA genes that flank it, making it suitable for distinguishing between fungal taxa at various taxonomic levels.
Targeted vs shotgun metagenomics
Shotgun metagenomics is one of the more advanced metagenomics techniques. It may be used to confirm findings obtained with amplicon sequencing or to perform functional gene analysis. The small table below summarizes the basic differences between these two approaches.
| | Amplicon Metagenomics | Shotgun Metagenomics |
|---|---|---|
| Target | Amplifies and sequences specific genetic regions (16S/18S/ITS) | Sequences all DNA present in a sample |
| Resolution | Provides detailed taxonomic information up to genus level, sometimes up to species level | Allows for taxonomic (up to species level) and functional analysis |
| Cost | $ | $$$ |
| Functional Analysis | Limited functional gene analysis | Allows for functional gene analysis and metabolic pathway exploration |
In the following sections, we will discuss the processing of reads obtained by amplicon sequencing of a 16S rDNA gene region.
Processing of metagenomic sequencing data
Alright, you have sequenced your samples and received your sequencing results: raw reads. What should you do with them?
Demultiplexing
Demultiplexing is the process of sorting the raw sequencing reads and assigning them to their respective samples based on sample-specific barcodes that were introduced during library preparation. With barcodes, it is easy to identify which sample a read belongs to. In practice, it looks like this: you have one file (with non-demultiplexed reads), and you split it into several files according to the barcodes of the reads.
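The splitting step can be sketched in a few lines of Python. This is a minimal illustration, not a production demultiplexer: it assumes that the barcode sits at the very start of each read and must match exactly, and the function name `demultiplex`, the barcodes, and the reads are all hypothetical. In real data, barcodes are often stored in separate index reads or in the read headers, and dedicated tools allow for sequencing errors in the barcode.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample, barcode_length):
    """Group reads by the barcode found at the start of each read.

    reads: iterable of (read_id, sequence) tuples.
    barcode_to_sample: dict mapping a barcode string to a sample name.
    Returns a dict: sample name -> list of (read_id, sequence without barcode).
    Reads whose barcode is unknown are kept unchanged under "unassigned".
    """
    bins = defaultdict(list)
    for read_id, sequence in reads:
        sample = barcode_to_sample.get(sequence[:barcode_length])
        if sample is None:
            bins["unassigned"].append((read_id, sequence))
        else:
            # Strip the barcode so that only the biological sequence remains.
            bins[sample].append((read_id, sequence[barcode_length:]))
    return dict(bins)


# Made-up barcodes and reads, for illustration only.
barcodes = {"ACGT": "soil", "TTAG": "water"}
reads = [
    ("read1", "ACGTGGCCTTAA"),
    ("read2", "TTAGCCGGAATT"),
    ("read3", "GGGGAAAACCCC"),
]
demultiplexed = demultiplex(reads, barcodes, barcode_length=4)
```

Here `demultiplexed` holds one list of reads per sample, which corresponds to the "several files" mentioned above.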
Reads processing
The first step that you may need to perform after quality control of the reads is read merging (or pairing). Merging paired-end reads produces longer, more informative sequences that reconstruct the amplicon. These longer sequences can provide better insights into the taxonomic composition and functional potential of the microbial community targeted by the amplicon analysis.
During the pairing step, the forward and reverse reads of each pair are aligned to each other (the reverse read is reverse-complemented first). The alignment identifies the overlapping region, that is, the portion of the DNA fragment that was sequenced by both reads. Joining the two reads across this overlap generates a longer, contiguous sequence and improves the accuracy and completeness of the reconstructed amplicon sequences.
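The idea of overlap-based merging can be sketched as follows. This is a simplified illustration that assumes error-free reads and requires an exact overlap; the function names and the 40-base amplicon are made up for the example. Real read mergers tolerate mismatches in the overlap and use base quality scores to resolve them.

```python
def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(sequence))


def merge_pair(forward, reverse, min_overlap=10):
    """Merge a read pair using the longest exact overlap.

    The reverse read is reverse-complemented, then the end of the forward
    read is compared with the start of the reverse-complemented read.
    Returns the merged sequence, or None if no overlap of at least
    min_overlap bases is found.
    """
    reverse_rc = reverse_complement(reverse)
    max_overlap = min(len(forward), len(reverse_rc))
    for overlap in range(max_overlap, min_overlap - 1, -1):
        if forward[-overlap:] == reverse_rc[:overlap]:
            return forward + reverse_rc[overlap:]
    return None


# A made-up 40-base amplicon sequenced from both ends with 25-base reads,
# so the two reads overlap by 10 bases in the middle.
amplicon = "ATGCCGTAAGCTTGACCATGGTACGATTCAGGCTAACGTC"
forward_read = amplicon[:25]
reverse_read = reverse_complement(amplicon[15:])
merged = merge_pair(forward_read, reverse_read)
```

In this toy case `merged` reconstructs the full amplicon, while a pair without a sufficient overlap yields `None` and would be discarded or handled separately.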
Detection and removal of chimeric sequences
Chimeras are hybrid products between multiple parent sequences that can be misinterpreted as novel organisms. This, consequently, may skew downstream analysis, such as taxonomic assignment, gene prediction, and diversity estimation.
Chimeras may appear during PCR amplification: cross-contamination or template switching between DNA fragments can lead to the creation of chimeric sequences. Similarly, during sequencing library preparation, the ligation of DNA fragments from different sources can result in chimeric artifacts. Chimeric sequences can be detected with tools such as USEARCH, which compares sequences against a reference database and detects abnormal alignment patterns using the UCHIME algorithm.
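To make the notion of a two-parent chimera concrete, here is a toy reference-based check. It is not the UCHIME algorithm: it assumes that all sequences have the same length and are already aligned, and it only recognizes a query that is an exact copy of the start of one reference joined to the end of another. The function name, taxon labels, and sequences are hypothetical.

```python
def find_chimera_parents(query, references):
    """Look for two references that together explain the query.

    references: dict mapping a name to a sequence of the same length as query.
    Returns (left_parent, right_parent, split_position) if the query equals
    the first part of one reference followed by the last part of another,
    otherwise None. A query identical to a reference is not a chimera.
    """
    if query in references.values():
        return None
    for left_name, left_seq in references.items():
        for right_name, right_seq in references.items():
            if left_name == right_name:
                continue
            for split in range(1, len(query)):
                if (query[:split] == left_seq[:split]
                        and query[split:] == right_seq[split:]):
                    return left_name, right_name, split
    return None


# Made-up reference sequences, for illustration only.
references = {
    "taxon_A": "AAAACCCCGGGG",
    "taxon_B": "TTTTGGGGCCCC",
}
# First 6 bases come from taxon_A, last 6 bases from taxon_B.
chimera = "AAAACCGGCCCC"
result = find_chimera_parents(chimera, references)
```

Without such a check, the hybrid sequence would look like a novel organism related to both parents, which is exactly the misinterpretation described above.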
After merging and chimera removal, the resulting sequences can be filtered by length: for further analysis, we keep only the sequences that approximately match the size of the target gene region.
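A length filter of this kind is simple to express in code. In the sketch below, the function name, the expected length of 253 bases, and the tolerance of 5 bases are hypothetical values chosen only for the example; the appropriate numbers depend on the primers and the region you amplified.

```python
def filter_by_length(sequences, expected_length, tolerance):
    """Keep sequences whose length is within tolerance of expected_length."""
    return [
        sequence
        for sequence in sequences
        if abs(len(sequence) - expected_length) <= tolerance
    ]


# Made-up sequences of different lengths (the base content is irrelevant here).
sequences = ["A" * 250, "A" * 253, "A" * 120, "A" * 400]
kept = filter_by_length(sequences, expected_length=253, tolerance=5)
```

Only the two sequences close to the expected amplicon size remain in `kept`; the very short and very long ones are likely artifacts and are dropped.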
Analysis of metagenomic data
Assignment of taxonomy
Assigning taxonomy to the amplicon sequences is an essential step in understanding the composition of microbial communities. Various tools and databases, such as the SILVA database, Greengenes, or NCBI database, can be utilized to match the sequences against reference sequences and assign taxonomic labels at different hierarchical levels (e.g., genus, species). The accuracy and resolution of taxonomy assignment methods can vary, and it is important to consider the limitations and potential biases associated with each method.
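The matching step can be illustrated with a deliberately simple classifier that scores a query by the fraction of its k-mers shared with each reference sequence. This is only a sketch of the general idea, not the method of any particular tool: the function names, the threshold, the placeholder lineage labels, and the short reference sequences are all made up, and they are not entries from SILVA, Greengenes, or NCBI.

```python
def kmers(sequence, k):
    """Return the set of all substrings of length k in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def assign_taxonomy(query, reference, k=4, min_shared_fraction=0.5):
    """Assign the label of the reference that shares the most k-mers.

    reference: dict mapping a taxonomic label to a reference sequence.
    Returns (label, score), where score is the fraction of query k-mers
    found in the best reference. If the best score is below
    min_shared_fraction, the label is "Unassigned".
    """
    query_kmers = kmers(query, k)
    best_label, best_score = "Unassigned", 0.0
    for label, ref_sequence in reference.items():
        shared = len(query_kmers & kmers(ref_sequence, k))
        score = shared / len(query_kmers) if query_kmers else 0.0
        if score > best_score:
            best_label, best_score = label, score
    if best_score < min_shared_fraction:
        return "Unassigned", best_score
    return best_label, best_score


# Made-up reference database with placeholder lineages.
reference = {
    "Bacteria;Phylum_X;Genus_A": "ACGTACGTTGCAAGCT",
    "Bacteria;Phylum_Y;Genus_B": "TTGGCCAATTGGCCAA",
}
label, score = assign_taxonomy("ACGTACGTTGCA", reference)
unknown_label, unknown_score = assign_taxonomy("GGGGGGGGGGGG", reference)
```

The threshold illustrates one of the limitations mentioned above: a sequence that is absent from the reference database can only be reported as unassigned, or assigned at a coarser level in more elaborate schemes.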
Clustering methods
After taxonomy assignment, clustering methods can be employed to group similar sequences into operational taxonomic units (OTUs). OTUs serve as proxies for taxonomic units and provide a measure of the diversity of organisms. Clustering reduces dataset complexity by collapsing similar sequences into representative OTUs. This step simplifies downstream analysis and allows for the calculation of population statistics.
Note: There are multiple approaches to grouping amplicon sequences, so you may encounter not only OTUs but also ASVs (amplicon sequence variants), zOTUs (zero-radius OTUs), and others.
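One common way to form OTUs is greedy centroid-based clustering with an identity threshold. The sketch below shows the principle under strong simplifying assumptions: sequences are equal in length and already aligned, so identity is just the fraction of matching positions, and the input order decides which sequences become centroids. The function names, the 20-base sequences, and the 0.9 threshold are chosen only for the example.

```python
def identity(first, second):
    """Fraction of matching positions between two equal-length sequences."""
    if len(first) != len(second):
        raise ValueError("this sketch assumes equal-length, aligned sequences")
    matches = sum(a == b for a, b in zip(first, second))
    return matches / len(first)


def greedy_cluster(sequences, threshold=0.97):
    """Group sequences into OTUs around greedily chosen centroids.

    Each sequence joins the first centroid it matches at or above the
    threshold; otherwise it becomes the centroid of a new OTU.
    Returns a dict: centroid sequence -> list of member sequences.
    """
    clusters = {}
    for sequence in sequences:
        for centroid in clusters:
            if identity(sequence, centroid) >= threshold:
                clusters[centroid].append(sequence)
                break
        else:
            # No existing centroid is similar enough: start a new OTU.
            clusters[sequence] = [sequence]
    return clusters


# Made-up 20-base sequences; the second differs from the first at one position.
seq_a = "ACGTACGTACGTACGTACGT"
seq_b = "ACGTACGTACGTACGTACGA"
seq_c = "TTTTGGGGCCCCAAAATTTT"
otus = greedy_cluster([seq_a, seq_b, seq_c], threshold=0.9)
```

Here the first two sequences collapse into one OTU and the third forms its own. ASV-style methods, in contrast, try to resolve exact sequence variants instead of applying a fixed identity threshold.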
Data analysis
During the previous steps, we taxonomically identified the sequenced organisms, which allows us to answer questions such as:
- Does the taxonomic composition of the positive control sample correspond to the expected abundances?
- Is the microbial composition of two samples different or similar? Which sample is richer in species?
- Does the community change over time?
- Are there any pathogenic bacteria that we were looking for? Are there any other bacteria of the same genera?
To answer these questions, you will work with a feature table and representative sequences obtained in the previous steps.
A feature table usually consists of rows and columns, which represent samples and OTUs, respectively.
Representative sequences in this context are processed reads that have been merged and taxonomically annotated.
In the next topic, we will cover basic statistical metrics that are actively used in metagenomics.
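As a small illustration, a feature table can be represented as a nested dictionary with samples as rows and OTUs as columns. The counts, sample names, and helper functions below are made up for the example; they show how two of the questions above (which sample is richer, and how abundant each OTU is) translate into simple computations.

```python
# Made-up read counts: rows are samples, columns are OTUs.
feature_table = {
    "sample_1": {"OTU_1": 120, "OTU_2": 30, "OTU_3": 0},
    "sample_2": {"OTU_1": 10, "OTU_2": 45, "OTU_3": 45},
}


def richness(counts):
    """Number of OTUs observed (count above zero) in one sample."""
    return sum(1 for count in counts.values() if count > 0)


def relative_abundance(counts):
    """Convert raw counts of one sample into fractions of the sample total."""
    total = sum(counts.values())
    if total == 0:
        return {otu: 0.0 for otu in counts}
    return {otu: count / total for otu, count in counts.items()}


richness_per_sample = {
    sample: richness(counts) for sample, counts in feature_table.items()
}
abundance_sample_1 = relative_abundance(feature_table["sample_1"])
```

In this toy table, `sample_2` contains more OTUs than `sample_1`, while `sample_1` is dominated by a single OTU.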
Metagenomics applications
Using the study plan that we have just defined, you can now analyze a community of organisms with a targeted metagenomic approach. Here are some applications of metagenomic studies that may serve as a source of inspiration:
- Microbial Diversity and Community Composition: Metagenomics enables the study of microbial diversity and community composition in various environments such as soil, oceans, human gut, and more. By analyzing the genetic material in these samples, researchers can identify and quantify different microbial species present, gaining insights into the richness and structure of microbial communities.
- Functional Annotation and Gene Discovery: Metagenomics allows for the annotation of functional genes and the discovery of novel genes within microbial communities. By analyzing the DNA sequences, researchers can identify genes associated with specific metabolic pathways, antibiotic resistance, virulence factors, and other functional traits, contributing to our understanding of the potential functions and activities of microorganisms.
- Bioprospecting and Biotechnology: Metagenomics offers the opportunity to discover new enzymes, biomolecules, and natural products with potential applications in biotechnology, pharmaceuticals, and bioremediation. By exploring the genetic repertoire of diverse microbial communities, researchers can uncover novel bioactive compounds and enzymes that could be used for drug development, biofuel production, waste management, and other biotechnological applications.
- Ecological and Environmental Studies: Metagenomics provides valuable insights into the ecological roles of microorganisms and their interactions within ecosystems. By studying microbial communities in different habitats, researchers can assess the impact of environmental factors on community dynamics, nutrient cycling, and ecosystem functioning. Metagenomic analysis can also aid in monitoring environmental changes, identifying indicator species, and evaluating ecosystem health.
- Microbiome Research and Human Health: Metagenomics plays a crucial role in understanding the human microbiome and its influence on health and disease. By analyzing microbial communities in the human gut, skin, oral cavity, and other body sites, researchers can identify specific microbial signatures associated with health conditions, such as inflammatory bowel disease, obesity, and even mental health disorders. Metagenomic analysis of the human microbiome has the potential to contribute to personalized medicine, diagnostics, and the development of microbiome-based therapies.
Conclusion
This topic has covered the basics of metagenomic analysis, from the experiment plan to hypothesis testing. We got acquainted with the amplicon sequencing approach and learned the term OTU, along with the kinds of questions that are typically addressed during metagenome analysis.