In this topic, we will discuss what SNP stands for, what SNP-calling is, and why and how to perform it, using a basic pipeline as an example.

Definition

SNP stands for single nucleotide polymorphism, the minimal unit of genetic variation. It is a one-base difference at the same position between almost identical aligned DNA sequences. The term polymorphism implies that the difference is studied not between two DNA sequences but across a set of sequences. As a rule, an SNP is described by the percentage of these sequences that carry a certain nucleotide at a given position, while the rest carry another nucleotide at the same position. Usually, a nucleotide substitution found in more than 1% of the genomes in a population is considered common enough to be called an SNP.
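The definition can be illustrated with a small Python sketch. It assumes a toy set of already aligned, equal-length sequences; the function names and the `MIN_MINOR_FREQ` constant are illustrative and do not come from any library.

```python
from collections import Counter

MIN_MINOR_FREQ = 0.01  # the conventional 1% threshold from the definition


def allele_frequencies(aligned_seqs, position):
    """Return {base: fraction} for one 0-based column of an alignment."""
    column = [seq[position] for seq in aligned_seqs]
    counts = Counter(column)
    total = len(column)
    return {base: n / total for base, n in counts.items()}


def is_snp(aligned_seqs, position, min_freq=MIN_MINOR_FREQ):
    """A column counts as polymorphic if a second allele reaches min_freq."""
    freqs = sorted(allele_frequencies(aligned_seqs, position).values(), reverse=True)
    return len(freqs) > 1 and freqs[1] >= min_freq


population = ["ACGTA", "ACGTA", "ACTTA", "ACGTA"]  # toy data, not real genomes
print(allele_frequencies(population, 2))  # G in 3 of 4 sequences, T in 1 of 4
print(is_snp(population, 2))
```

With only four sequences, any variant is far above 1%; on real data the threshold matters because the population contains thousands of genomes or reads.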

The biological meaning of SNP

SNPs are searched for in data describing the genetic diversity of populations of organisms whose genomes differ by fractions of a percent. Examples would be a viral population descending from the first viral particle that infected the host, or the gene pool of modern mankind (the sequenced part). Some SNPs have a direct effect on an organism's functioning, and the mechanism of action is clear: SNPs can form a stop codon inside a gene sequence, alter the spatial structure of RNAs, disrupt a conserved regulatory region, etc. Other SNPs are located in regions without any known function, and the mechanism of their impact on phenotype is not obvious. Finally, there are SNPs that do not affect any known biological process, and their impact is the subject of further studies.

The SNP profile (the aggregation of all SNPs inside the genome) is unique for each organism; it is like a fingerprint imprinted in the genome. It reflects the evolutionary path of a given organism and all its ancestors. Certain combinations of SNPs can correlate with complex behavior in higher organisms, but most of the time the reasons for the correlation remain unclear.

SNP-calling pipeline

NGS technologies have opened up wide opportunities for the study of SNPs. The search for SNPs in NGS data is called SNP-calling. Generally, it means that you have a reference genome and NGS data on the population of interest. SNPs are searched for in reads aligned to the reference. Let's break down the process using an example SNP-calling pipeline.

Figure: genomes containing SNPs at some positions.

Data preparation. The input data in the SNP-calling pipeline are a reference genome (as a whole or a part of it) and NGS data, e.g. RNA-seq reads or WGS "re-sequencing" reads. In a biological sense, it could be the known genome sequence of some virus and the deep sequencing data obtained from a sample isolated from a patient infected with the same virus, but with atypical features such as an unusually high concentration of viral particles in the sample. In this case, we might suspect that the patient is infected with an evolved virus variant, and further analysis would require an SNP profile of the suspected new variant.

The reference genome can be obtained from DNA and RNA sequence databases such as NCBI Nucleotide. NGS data in our example would be collected by your colleagues, but for practice purposes, you can find some raw NGS data in the NCBI Sequence Read Archive (SRA). In any case, the first thing you need to do is to assess the quality of the NGS reads using FastQC. We discussed the basic approach in the Bioinformatic pipelines topic.
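To give a feeling for what a read-quality check looks at, here is a minimal Python sketch that decodes the quality line of FASTQ records and reports the mean Phred score per read. It is not a replacement for FastQC; it assumes Phred+33 encoding and well-formed four-line records, and the function names and toy reads are made up for illustration.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality_line]


def mean_quality(quality_line):
    """Average Phred score of one read."""
    scores = phred_scores(quality_line)
    return sum(scores) / len(scores)


def read_fastq(lines):
    """Yield (name, sequence, quality) from FASTQ lines, four lines per record."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)  # skip the '+' separator line
        qual = next(it).strip()
        yield header.strip()[1:], seq, qual


toy_fastq = [
    "@read1", "ACGT", "+", "IIII",  # 'I' encodes Phred 40
    "@read2", "ACGT", "+", "!!II",  # '!' encodes Phred 0
]
for name, seq, qual in read_fastq(toy_fastq):
    print(name, mean_quality(qual))
```

A read with a low mean score, such as read2 here, would be a candidate for trimming or removal before alignment.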

Alignment. The next step is to map the reads to the reference sequence. Again, the process was discussed earlier in the Alignment topic. Standard tools for this purpose are BWA (Burrows-Wheeler Aligner) and Bowtie. The important part is to filter the mapping results: non-uniquely mapped reads should be removed, and identical reads should be deduplicated (such duplicates typically arise from PCR amplification during library preparation). The resulting SAM file should be converted to its binary analog, BAM, and sorted by coordinate on the reference (use the samtools package).
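The filtering logic can be sketched in plain Python on SAM text lines. The flag bits used below (0x4 unmapped, 0x100 secondary, 0x400 duplicate, 0x800 supplementary) are defined in the SAM specification. Two assumptions apply: a MAPQ threshold of 20 is used as a stand-in for "uniquely mapped", which depends on the aligner, and the duplicate bit is only set if a duplicate-marking tool has been run beforehand. In practice this job is done by samtools; the function names here are illustrative.

```python
UNMAPPED, SECONDARY, DUPLICATE, SUPPLEMENTARY = 0x4, 0x100, 0x400, 0x800
EXCLUDE = UNMAPPED | SECONDARY | DUPLICATE | SUPPLEMENTARY


def keep_alignment(sam_line, min_mapq=20):
    """Return True if a SAM alignment line passes the flag and MAPQ filters."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, mapq = int(fields[1]), int(fields[4])  # columns 2 and 5 of SAM
    return (flag & EXCLUDE) == 0 and mapq >= min_mapq


def filter_sam(lines, min_mapq=20):
    """Keep header lines (starting with '@') and alignments that pass."""
    return [ln for ln in lines if ln.startswith("@") or keep_alignment(ln, min_mapq)]


toy_sam = [
    "@SQ\tSN:ref\tLN:100",
    "r1\t0\tref\t1\t60\t4M\t*\t0\t0\tACGT\tIIII",     # kept
    "r2\t0\tref\t1\t3\t4M\t*\t0\t0\tACGT\tIIII",      # low MAPQ
    "r3\t1024\tref\t1\t60\t4M\t*\t0\t0\tACGT\tIIII",  # marked as duplicate
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",         # unmapped
]
kept = filter_sam(toy_sam)
print([ln.split("\t")[0] for ln in kept])
```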

Variant calling. The next step is to generate a VCF (Variant Call Format) file that contains the following information about positions of the reference sequence: the reference base at the position, the number of reads covering it, and the variants of mapped bases with their supporting evidence. The standard instrument for this step is mpileup. In current releases, VCF/BCF output is written by the bcftools mpileup command, while samtools mpileup produces the plain-text pileup format.
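What a pileup computes can be shown with a few lines of Python. This is a simplified sketch, not the mpileup algorithm: it assumes gapless alignments given as hypothetical (start, sequence) pairs with 0-based coordinates and ignores base qualities, indels, and strand.

```python
from collections import Counter, defaultdict


def pileup(reads):
    """Count bases per reference position.

    `reads` is an iterable of (start, sequence) pairs, where `start` is the
    0-based reference coordinate of the first base of the read.
    """
    columns = defaultdict(Counter)
    for start, seq in reads:
        for offset, base in enumerate(seq):
            columns[start + offset][base] += 1
    return columns


reference = "ACGTAC"
toy_reads = [(0, "ACGT"), (1, "CTTA"), (2, "GTAC"), (2, "TTAC")]

columns = pileup(toy_reads)
for pos in sorted(columns):
    counts = columns[pos]
    depth = sum(counts.values())
    # position, reference base, coverage, observed bases with counts
    print(pos, reference[pos], depth, dict(counts))
```

Position 2 is covered by four reads, two of which carry T instead of the reference G; such columns are the raw material for the SNP-calling step.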

SNP-calling. SNP-calling is about finding which varying positions in the VCF file are actual SNPs. Some of the most popular SNP-calling tools are BCFtools, the VarScan variant caller, and the GATK variant discovery toolkit. At this step, the key basic parameter is the variant frequency cut-off: the minimum proportion of non-reference bases at a position required to call it an SNP. A more advanced user can apply more complex statistical filters to achieve more reliable results.
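The frequency cut-off can be expressed as a short Python sketch. It is a naive threshold caller, not the statistical model used by BCFtools, VarScan, or GATK; the default values of `min_freq` and `min_depth` and the toy counts are arbitrary assumptions chosen for illustration.

```python
def call_snps(reference, columns, min_freq=0.05, min_depth=10):
    """Return a list of (position, ref_base, alt_base, alt_fraction) calls."""
    calls = []
    for pos in sorted(columns):
        counts = columns[pos]
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to estimate a frequency
        ref_base = reference[pos]
        for base, n in sorted(counts.items()):
            if base != ref_base and n / depth >= min_freq:
                calls.append((pos, ref_base, base, n / depth))
    return calls


reference = "ACGT"
toy_columns = {
    0: {"A": 100},
    1: {"C": 90, "T": 10},  # 10% non-reference bases
    2: {"G": 99, "A": 1},   # 1% non-reference bases, below the cut-off
    3: {"T": 3, "C": 2},    # depth too low to trust
}
print(call_snps(reference, toy_columns))
```

Lowering `min_freq` to 0.01 would also report position 2, which shows the trade-off: a lower cut-off finds rarer variants but lets more sequencing errors through.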

Filtering. When searching for rare SNPs, one must be able to distinguish them from false positive SNPs, which are caused by technical errors during library preparation and sequencing, as well as by inaccuracies in raw data processing. The simplest method would be to estimate the background level of such false positive SNPs by running the same pipeline on biological and technical replicates of the NGS data.
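One simple way to use replicates, sketched below in Python, is to keep only the SNPs that are reproduced across call sets and to treat the rest as likely technical noise. This is an assumption about how the replicate comparison might be done, not a prescribed method; the function name, the (position, ref, alt) tuples, and the coordinates are hypothetical.

```python
from collections import Counter


def reproducible_snps(replicate_calls, min_replicates=None):
    """Keep SNPs (pos, ref, alt) seen in at least `min_replicates` call sets."""
    if min_replicates is None:
        min_replicates = len(replicate_calls)  # default: require all replicates
    seen = Counter(snp for calls in replicate_calls for snp in set(calls))
    return sorted(snp for snp, n in seen.items() if n >= min_replicates)


rep1 = [(120, "A", "G"), (305, "C", "T"), (411, "G", "A")]
rep2 = [(120, "A", "G"), (305, "C", "T")]
rep3 = [(120, "A", "G"), (777, "T", "C")]

print(reproducible_snps([rep1, rep2, rep3]))                    # strict: all three
print(reproducible_snps([rep1, rep2, rep3], min_replicates=2))  # relaxed: two of three
```

The variants seen in a single replicate only, here at positions 411 and 777, give a rough idea of the background level of false positives.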

Interpretation. The resulting file can be visualized using a standard genome browser such as IGV. Then you can start the interpretation of the SNPs found: whether they change any protein or RNA sequences, alter regulatory sites such as promoters, are associated with any phenotypic features, or have any effect at all. Annotation tools such as ANNOVAR and databases such as dbSNP, the GWAS Catalog, and ClinVar accumulate knowledge about SNPs, that is, statistics on their occurrence and relationships between SNPs and phenotype variants, if any. These resources will greatly strengthen your research conclusions.
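The first interpretation question, whether an SNP changes a protein sequence, can be sketched in Python with the standard genetic code. The sketch assumes the codon is given on the coding DNA strand and that the reading frame is already known; the function name and the returned labels are illustrative, and real annotation tools consider much more context.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_snp(codon, index, alt_base):
    """Classify a substitution at `index` (0-2) of a coding-strand DNA codon."""
    mutated = codon[:index] + alt_base + codon[index + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if before == after:
        return "synonymous"  # same amino acid
    if after == "*":
        return "nonsense"    # new stop codon
    if before == "*":
        return "stop-lost"   # stop codon replaced by an amino acid
    return "missense"        # different amino acid


print(classify_snp("CTT", 2, "C"))  # Leu -> Leu
print(classify_snp("TGG", 2, "A"))  # Trp -> stop
print(classify_snp("GAG", 1, "T"))  # Glu -> Val
```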

SNP applications

SNP-calling finds applications across all kingdoms of living organisms. In microbiology, SNPs reveal the composition of microbial populations. When observed over time, these data become indispensable for studying natural and laboratory evolutionary processes and the interaction of pathogens with host cells, and for tracking outbreaks in host populations. Similarly, SNPs in crop genomes may indicate the geographical spread of their ancestors and the process of adaptation to different ecological niches. New crop varieties are developed based on these data. In human and veterinary medicine, genome-wide association studies search for associations between SNP patterns and the manifestation of complex diseases.

These results are used to predict a person's susceptibility to a particular disease. The SNP profile can also be associated with a person's response to environmental factors and different types of therapy. Therefore, studies of SNPs in human genomes advance personalized medicine. SNPs also help in the search for cancer cells in transcriptome data and are used in genealogical tests.

Conclusion

The overall pipeline, from the biological sample to the interpretation of results, consists of numerous stages, each of which requires a careful and thoughtful approach to get reliable results. At the same time, some SNPs can have biological significance only in combination with others, so the prevention of errors becomes even more important. Depending on the goals of your research, it can be important to assess whether false positives or false negatives can be tolerated. It is in our hands to assess the quality of raw data and conduct a thorough bioinformatics analysis according to established best practices.
