Variant call format

In this topic, we will discuss the Variant Call Format (VCF), a standard file format for storing gene sequence variants. Producing a VCF file is a required step in any SNP-calling pipeline. Let's study the features of the format and the tools that use it.

Variant call format

Variant Call Format (VCF) is a text file format used in bioinformatics to store genetic variation data. It contains information about genomic variants, such as single nucleotide polymorphisms (SNPs), insertions, deletions, and structural variations. VCF files are widely used in genetic research to analyze and compare genomic variation across different individuals or populations. They are also the standard output of variant calling, which is the process of identifying and classifying genetic variants from sequencing data. The format is flexible: a file can declare its own additional fields in the meta-information section and use them in the variant records.

VCF is a standard bioinformatics file format. This topic follows the VCF version 4.2 specification. Documentation for later VCF versions is published in the samtools/hts-specs repository on GitHub.

Let's take a closer look at the VCF format using the following example. If you open a VCF file, you will see something like this. It may look complicated, but we are going to learn how to read it.

##fileformat=VCFv4.2
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	NA00002	NA00003
20	14370	rs6054257	G	A	29	PASS	NS=3;DP=14;AF=0.5;DB;H2	GT:GQ:DP:HQ	0|0:48:1:51,51	1|0:48:8:51,51	1/1:43:5:.,.
20	17330	.	T	A	3	q10	NS=3;DP=11;AF=0.017	GT:GQ:DP:HQ	0|0:49:3:58,50	0|1:3:5:65,3	0/0:41:3
20	1110696	rs6040355	A	G,T	67	PASS	NS=2;DP=10;AF=0.333,0.667;AA=T;DB	GT:GQ:DP:HQ	1|2:21:6:23,27	2|1:2:0:18,2	2/2:35:4
20	1230237	.	T	.	47	PASS	NS=3;DP=13;AA=T	GT:GQ:DP:HQ	0|0:54:7:56,60	0|0:48:4:51,51	0/0:61:2
20	1234567	microsat1	GTC	G,GTCT	50	PASS	NS=3;DP=9;AA=G	GT:GQ:DP	0/1:35:4	0/2:17:2	1/1:40:3

A VCF file consists of three parts: meta-information, the header, and the body (the variant data).

Meta-information

Meta-information lines start with ##. Each line describes one entity and has the form key=value. The section must contain a file format line that specifies the VCF version number. Our example file uses VCF version 4.2, as stated in its first line: ##fileformat=VCFv4.2. Meta-information also describes the entries used in the body of the VCF file. It is recommended to declare the INFO, FILTER, and FORMAT entries that appear in the body. An INFO line has the following structure:

##INFO=<ID=id,Number=number,Type=type,Description="description",Source="source",Version="version">

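  • Type is the data type of the values of INFO entities with ID=id in the body of the file: Integer, Float, Flag, Character, or String;
  • Number is an integer giving the number of values the entity holds. It can also be A (one value per alternate allele), R (one value per allele, including the reference), G (one value per possible genotype), or . (the number varies or is unknown). Flag entities have Number=0;
  • Description explains the meaning of the ID. It must be surrounded by double quotes;
  • Source and Version fields are optional. They specify the source and version of the variant annotation. Double quotes are also required.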

FILTER and FORMAT meta-information lines are structured as follows:

##FILTER=<ID=ID,Description="description">

##FORMAT=<ID=ID,Number=number,Type=type,Description="description">

The meta-information section may also describe other entries and the reference genome used, as the ##reference and ##contig lines in our example do.
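The structure of these lines can be illustrated with a small parser. This is a minimal sketch rather than a full implementation of the specification: the function name parse_meta_line is made up for this example, and the code assumes that values are either double-quoted strings or comma-free tokens.

```python
import re

def parse_meta_line(line):
    """Split a '##key=value' line; expand a '<...>' value into a dict."""
    if not line.startswith("##"):
        raise ValueError("meta-information lines must start with '##'")
    key, _, value = line[2:].partition("=")
    if not (value.startswith("<") and value.endswith(">")):
        return key, value  # plain key=value, e.g. fileformat=VCFv4.2
    # A field value is either a double-quoted string (which may contain
    # commas) or a run of characters without commas.
    pairs = re.findall(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)', value[1:-1])
    return key, {name: val.strip('"') for name, val in pairs}

key, fields = parse_meta_line(
    '##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">'
)
print(key, fields)
```

Note that the description contains a comma, which is why the line cannot simply be split on commas.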

Meta-information is followed by the header and the body. The latter two form a table where all lines are tab-delimited.
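Since the three parts differ only in their line prefixes, they can be separated with a few lines of code. The sketch below is illustrative: split_vcf is a hypothetical helper, and the input is a shortened copy of our example without sample columns.

```python
def split_vcf(text):
    """Separate VCF text into meta lines, header names, and body rows."""
    meta, header, body = [], None, []
    for line in text.splitlines():
        if line.startswith("##"):      # meta-information
            meta.append(line)
        elif line.startswith("#"):     # the single header line
            header = line[1:].split("\t")
        elif line:                     # tab-delimited variant record
            body.append(line.split("\t"))
    return meta, header, body

example = "\n".join([
    "##fileformat=VCFv4.2",
    '##FILTER=<ID=q10,Description="Quality below 10">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2",
    "20\t17330\t.\tT\tA\t3\tq10\tNS=3;DP=11;AF=0.017",
])
meta, header, body = split_vcf(example)
print(len(meta), header, len(body))
```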

The header has 8 mandatory items representing columns in the body of the VCF file:

  • #CHROM contains the ID of the chromosome where the variant is located;
  • POS refers to the position on the chromosome, counted from 1;
  • ID represents the variant identification name;
  • REF contains reference alleles, and ALT includes alternative/mutated alleles;
  • QUAL field has a variant's quality. $QUAL = -10 \log_{10}(prob(A))$, where $prob(A)$ is a probability that the call in ALT is wrong;
  • FILTER describes whether a variant passed the applied filters: it contains PASS or the IDs of the filters that the variant failed;
  • INFO contains additional information about a variant.
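To get a feeling for the QUAL scale, the formula can be inverted to recover the error probability from a score. The two helper names below are made up for this sketch.

```python
import math

def qual_to_error_prob(qual):
    """Invert QUAL = -10 * log10(prob): probability that the call is wrong."""
    return 10 ** (-qual / 10)

def error_prob_to_qual(prob):
    """Phred-scaled quality for a given error probability."""
    return -10 * math.log10(prob)

# QUAL values taken from the first two records of the example file
for qual in (29, 3):
    print(qual, qual_to_error_prob(qual))
```

A QUAL of 29 corresponds to an error probability of roughly 0.0013, while a QUAL of 3 means the call is wrong about half of the time, which is why the second record of our example fails the q10 filter.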

The table can also include a FORMAT column followed by one column per sample. Each sample column contains information about the genotype in a particular sample and is named after the sample's ID. FORMAT specifies the types and order of the data presented in the sample columns.

Body

Finally, we get to the body of the VCF file. A row here represents one genomic position and contains information about a variant in the format corresponding to the header. Let's try to read the first row in our example.

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	NA00002	NA00003
20	14370	rs6054257	G	A	29	PASS	NS=3;DP=14;AF=0.5;DB;H2	GT:GQ:DP:HQ	0|0:48:1:51,51	1|0:48:8:51,51	1/1:43:5:.,.

This variant is located on chromosome 20 at position 14370, i.e. it is the 14,370th nucleotide of the chromosome. Its identifier is rs6054257. The reference genome has G at this position, while some samples contain A. The variant quality is 29, and this SNP passed the applied filters. The next two columns (INFO and FORMAT) contain many abbreviations. You can find their descriptions either in the meta-information section or in the VCF specification.
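The same reading can be done programmatically by pairing the header names with the values of a row. This is a minimal sketch: parse_row is a hypothetical helper, and only POS, ALT, and QUAL are converted from text.

```python
header_line = ("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
               "\tNA00001\tNA00002\tNA00003")
row_line = ("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2"
            "\tGT:GQ:DP:HQ\t0|0:48:1:51,51\t1|0:48:8:51,51\t1/1:43:5:.,.")

def parse_row(header_line, row_line):
    """Map header names to the tab-delimited values of one record."""
    names = header_line.lstrip("#").split("\t")
    record = dict(zip(names, row_line.split("\t")))
    record["POS"] = int(record["POS"])          # 1-based position
    record["ALT"] = record["ALT"].split(",")    # several ALT alleles are comma-separated
    if record["QUAL"] != ".":                   # '.' marks a missing value
        record["QUAL"] = float(record["QUAL"])
    return record

record = parse_row(header_line, row_line)
print(record["CHROM"], record["POS"], record["REF"], record["ALT"])
```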

The INFO column of the rs6054257 variant contains the following information: NS=3;DP=14;AF=0.5;DB;H2. This record consists of several entities delimited by semicolons. Each entity has an ID; in our example, these are NS, DP, AF, DB, and H2. To understand their meanings, we can go to the meta-information section. Let's decipher NS=3. Since it is in the INFO column, we should search for lines starting with ##INFO in the meta-information. Within these lines, we look for the one with ID=NS. Here it is.

##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">

Thus, NS=3 means that 3 samples in the file have data for the rs6054257 variant. Repeating this procedure for the other entities deciphers the rest of the column. NS=3;DP=14;AF=0.5;DB;H2 should be read as follows: 3 samples have data at this site (NS=3), the total depth is 14 (DP=14), the allele frequency of the alternate allele is 0.5 (AF=0.5), and the variant is a member of dbSNP (DB) and of the HapMap2 project (H2), two databases of genetic variants.
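The INFO column is easy to split into its entities in code. The sketch below uses a hypothetical parse_info helper; it keeps all values as lists of strings and does not consult the types declared in the meta-information.

```python
def parse_info(info):
    """Turn 'NS=3;DP=14;DB' into a dict; entities without '=' are flags."""
    if info == ".":          # '.' means the column is empty
        return {}
    result = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        # key=value entities may hold several comma-separated values
        result[key] = value.split(",") if sep else True
    return result

info = parse_info("NS=3;DP=14;AF=0.5;DB;H2")
print(info)
```

Flags such as DB and H2 carry no value, so the sketch stores them as True.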

The FORMAT column also contains a list of IDs, this time delimited by colons: GT:GQ:DP:HQ. To understand them, we check the meta-information lines starting with ##FORMAT and look for the GT, GQ, DP, and HQ IDs. The record means that the data for each sample is structured as follows: genotype (GT) : genotype quality (GQ) : read depth (DP) : haplotype quality (HQ). The genotype is encoded as allele values separated by either / (unphased genotype) or | (phased genotype). The allele values are 0 for the reference allele, 1 for the first allele listed in ALT, 2 for the second, and so on.

In our example, the FORMAT column is followed by 3 columns: NA00001, NA00002, and NA00003. These are the IDs of 3 samples, and each column contains genotype data structured as described above. For instance, the rs6054257 variant has the record 0|0:48:1:51,51 in the NA00001 column. The genotype (GT) is 0|0, i.e. it is phased and both alleles match the reference allele G. The genotype has a quality score (GQ) of 48, the read depth (DP) at this position is 1, and both haplotypes have a quality (HQ) of 51 (51,51).
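Reading a sample column amounts to pairing the FORMAT keys with the colon-separated values and translating allele indexes into bases. The helper names parse_sample and decode_genotype are invented for this sketch.

```python
def parse_sample(format_field, sample_field):
    """Pair FORMAT keys with the values of one sample column."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # Trailing values may be omitted (e.g. 0/0:41:3), so zip stops early
    return dict(zip(keys, values))

def decode_genotype(gt, ref, alts):
    """Return (phased, bases) for a GT string such as '1|0' or '0/1'."""
    phased = "|" in gt
    alleles = [ref] + alts   # index 0 is REF, 1.. are the ALT alleles
    indexes = gt.replace("|", "/").split("/")
    bases = [None if i == "." else alleles[int(i)] for i in indexes]
    return phased, bases

sample = parse_sample("GT:GQ:DP:HQ", "0|0:48:1:51,51")
print(sample)
print(decode_genotype(sample["GT"], "G", ["A"]))
```

For the rs6040355 record, decode_genotype("1|2", "A", ["G", "T"]) yields a phased genotype with the bases G and T.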

The variant rs6054257 is a simple SNP, but our example contains more complicated cases. For instance, the microsat1 variant has the reference allele GTC and two alternate alleles: a deletion (G, which lacks TC) and an insertion (GTCT, which gains a T). The record at position 1230237 is a monomorphic reference site. This means there is no alternate allele at the position, i.e. all the samples have the same genotype, which matches the reference genome. This case is denoted by a dot in the ALT column.
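These cases can be told apart by comparing REF with each ALT allele. The classifier below is a deliberately simplified sketch with a made-up name: it labels everything it does not recognize, including symbolic alleles, as "other".

```python
def classify_allele(ref, alt):
    """Rough variant type for one REF/ALT pair (simplified)."""
    if alt == ".":
        return "monomorphic reference"
    if not set(alt.upper()) <= set("ACGTN"):
        return "other"                      # symbolic or special alleles
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(alt) < len(ref) and ref.startswith(alt):
        return "deletion"                   # ALT is REF with bases removed
    if len(alt) > len(ref) and alt.startswith(ref):
        return "insertion"                  # ALT is REF with bases added
    return "other"

# REF/ALT pairs taken from the example file
for ref, alt in [("G", "A"), ("GTC", "G"), ("GTC", "GTCT"), ("T", ".")]:
    print(ref, alt, classify_allele(ref, alt))
```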

Generating VCF

Raw VCF files are generated from aligned reads and a reference genome. A raw VCF contains information about every position of the genome, so the majority of its records are monomorphic sites. The mpileup command from bcftools is the most popular tool for producing raw VCF files. It is based on the mpileup command from samtools, which used to perform this task.

The raw VCF should then be reduced to the positions that actually vary. The call command from bcftools performs this task on the output of mpileup. Other popular variant callers, such as HaplotypeCaller from GATK and freebayes, work directly from the aligned reads and the reference genome. You can perform variant calling with bcftools as follows:

bcftools mpileup -f reference.fa alignments.bam | bcftools call -mv -Ov -o calls.vcf

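The command performs 2 sequential actions. bcftools mpileup writes the raw per-position data to its output, which is then passed to bcftools call through the pipe character |. reference.fa stores the reference genome, alignments.bam contains the aligned reads, and calls.vcf is the output VCF file with the actual variants. In bcftools call, -m selects the multiallelic caller, -v keeps variant sites only, -Ov requests uncompressed VCF output, and -o names the output file.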

It is good practice to run several variant callers and compare their results to obtain a more reliable set of variants. These variants can be analyzed further with other tools, for instance, PLINK.

PLINK is a command-line toolset for analyzing genetic variant data. It can perform various tasks such as quality control, association analysis, and linkage disequilibrium analysis. The tool is widely used in genetic research to identify genetic variants associated with diseases or traits. It supports various input file formats and provides options for data manipulation and filtering.

You can download a PLINK executable for your operating system from the PLINK page and run it from the command line. Let's look at the basic structure of a PLINK command.

plink --<format> <input> --command(s) --out <output>

<format> specifies input file format, for example, --vcf, --bfile etc. <input> is the name of the input file with gene variants. <output> is a prefix of the output file.

PLINK operates with its own PLINK 1 binary format, which consists of the binary fileset .bed + .bim + .fam. You can convert a .vcf file to PLINK 1 binary format with --make-bed command as follows:

plink --vcf infile.vcf --make-bed --out output/variants

Filtering is one of the most useful PLINK functions. For instance, you can remove all variants genotyped in fewer than 95% of the samples using the following command.

plink --bfile input --geno 0.05 --make-bed --out output

Here, --bfile gives the prefix of the input PLINK 1 binary fileset, --geno filters out all variants with missing call rates exceeding 0.05, and --make-bed creates a PLINK 1 binary fileset for the remaining variants.
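The idea behind this filter can be shown on a toy dataset. The sketch below is not how PLINK is implemented: the function names, variant IDs, and genotypes are invented, and a genotype counts as missing whenever it contains a dot.

```python
def missing_call_rate(genotypes):
    """Fraction of samples without a genotype call for one variant."""
    missing = sum(1 for gt in genotypes if "." in gt)
    return missing / len(genotypes)

def filter_by_geno(variants, threshold=0.05):
    """Keep variants whose missing call rate does not exceed threshold."""
    return {vid: gts for vid, gts in variants.items()
            if missing_call_rate(gts) <= threshold}

# Hypothetical genotypes of 10 samples for two variants
variants = {
    "var_a": ["0/0", "0/1", "1/1", "0/0", "0/1", "0/0", "0/0", "1/1", "0/1", "0/0"],
    "var_b": ["0/0", "./.", "1/1", "0/0", "0/1", "0/0", "0/0", "1/1", "0/1", "0/0"],
}
kept = filter_by_geno(variants)
print(list(kept))
```

Here var_b has one missing call out of 10 samples (a rate of 0.1), so it is removed, while var_a is kept.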

PLINK has a tremendous toolbox; you can continue to explore it on the PLINK page.

Conclusion

Variant Call Format is a flexible format that stores all the necessary information about genetic variants, including reference and alternative alleles, quality, coverage, etc. VCF files are produced by a variety of variant-calling tools and can be processed by many programs, such as PLINK.
