Exome SNP Calling Accuracy

A head-to-head evaluation of GATK vs Google DeepVariant Exome SNP calling accuracy for Illumina vs BGISEQ-500 data

We evaluate the SNP calling accuracy of two NGS variant calling pipelines for exome sequencing: GATK HaplotypeCaller and Google DeepVariant. GATK HaplotypeCaller takes an assembly-based approach to identify variants, while DeepVariant calls genetic variants with a deep convolutional neural network. Variant calling accuracy depends on a number of factors, including the SNP calling algorithm, the sequencing platform and the exome capture kit. Here we compare not only the two popular variant calling pipelines but also two sequencing technologies, Illumina and BGISEQ, to evaluate which platform is better suited to SNP calling.

We compare our exome-seq variant calls against the high-confidence gold-standard variant lists for NA12878 and NA24149 that were experimentally validated by the Genome in a Bottle Consortium (NIST-GIAB). NA12878 (ERR1831353) was sequenced on the BGISEQ-500 and NA24149 (SRR2962692) on the HiSeq 2500 platform; the Agilent SureSelect v5 protocol was used for exome capture for both datasets. According to the NIST-GIAB lists, NA12878 has 35,237 SNPs and NA24149 has 33,356 SNPs within the regions defined by the exome BED file.
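The comparison against the gold-standard list described above can be illustrated with a minimal Python sketch. It is not the method used by the tools in this study: it matches variants naively by site and allele, ignores genotypes and haplotype-aware matching, and all names (`in_regions`, `classify`) and the toy data are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Variant = Tuple[str, int, str, str]         # (chrom, 1-based VCF position, ref, alt)
Regions = Dict[str, List[Tuple[int, int]]]  # chrom -> BED intervals (0-based start, exclusive end)


def in_regions(variant: Variant, regions: Regions) -> bool:
    """Return True if the variant position falls inside any BED interval."""
    chrom, pos, _ref, _alt = variant
    # A BED interval (start, end) covers 1-based positions start+1 .. end.
    return any(start < pos <= end for start, end in regions.get(chrom, []))


def classify(truth: Set[Variant], calls: Set[Variant], regions: Regions):
    """Split calls into TP, FP and FN sets, restricted to the target regions."""
    truth_in = {v for v in truth if in_regions(v, regions)}
    calls_in = {v for v in calls if in_regions(v, regions)}
    tp = truth_in & calls_in   # called and present in the truth list
    fp = calls_in - truth_in   # called but absent from the truth list
    fn = truth_in - calls_in   # in the truth list but not called
    return tp, fp, fn


# Toy data for illustration only (not taken from NA12878 or NA24149).
regions = {"1": [(100, 200)]}
truth = {("1", 150, "A", "G"), ("1", 180, "C", "T"), ("1", 500, "G", "A")}
calls = {("1", 150, "A", "G"), ("1", 190, "T", "C"), ("1", 500, "G", "A")}

tp, fp, fn = classify(truth, calls, regions)
print(f"TP={len(tp)} FP={len(fp)} FN={len(fn)}")
```

The variant at position 500 is ignored on both sides because it lies outside the toy target region, which mirrors restricting the evaluation to the exome BED file.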

The GATK and DeepVariant pipelines were run in-house on both datasets, and parameters were optimized to obtain the best SNP calls. A VCF comparator tool was used to evaluate accuracy. Here are the key takeaways from the head-to-head comparison:

DeepVariant outperforms GATK HaplotypeCaller overall, though the gap between the two SNP calling pipelines is marginal.
GATK HaplotypeCaller called more false positive SNVs than DeepVariant (103 vs 25 on Illumina HiSeq, 228 vs 73 on BGISEQ-500).
Illumina and BGISEQ-500 appear to generate data of similar quality, but Illumina-based exome-seq variant calling accuracy is higher for both SNP calling pipelines.
The best SNP calling accuracy achieved is 99.925% (PPV), with the DeepVariant pipeline on Illumina HiSeq data.

Comparison of True Positive (TP) and False Positive (FP) SNP variant calls by GATK and DeepVariant on Illumina HiSeq sequencing platform.

Comparison of True Positive (TP) and False Positive (FP) SNP variant calls by GATK and DeepVariant on BGISeq-500 sequencing platform.

SNP calling pipeline accuracy comparison across Illumina and BGISeq-500 sequencing data using GATK and DeepVariant.

Dataset	SNP caller	True positives	False Positives	False Discovery Rate (FDR)	True Positive Rate (TPR)	False Positive Rate (FPR)	Positive Predictive Value (PPV)
NA24149 – Illumina HiSeq	GATK	33314	103	0.0030822635	0.99874085	0.0030879	0.9969177
NA24149 – Illumina HiSeq	DeepVariant	33308	25	0.0007500075	0.99856097	0.0007494903	0.99925

NA12878 – BGISEQ 500	GATK	35188	228	0.006437768	0.9960653	0.006453987	0.9935622
NA12878 – BGISEQ 500	DeepVariant	35161	73	0.0020718623	0.99530107	0.002066408	0.99792814
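As a quick check, the FDR and PPV columns can be recomputed from the TP and FP counts alone, using FDR = FP/(FP+TP) and PPV = TP/(TP+FP). The sketch below is illustrative rather than the script used to produce the table; the function and variable names are hypothetical. TPR is shown only for NA24149, taking TP+FN to be the 33,356 truth SNPs stated in the text.

```python
# (dataset, caller) -> (true positives, false positives), copied from the table above.
counts = {
    ("NA24149 Illumina HiSeq", "GATK"): (33314, 103),
    ("NA24149 Illumina HiSeq", "DeepVariant"): (33308, 25),
    ("NA12878 BGISEQ-500", "GATK"): (35188, 228),
    ("NA12878 BGISEQ-500", "DeepVariant"): (35161, 73),
}


def fdr(tp: int, fp: int) -> float:
    """False discovery rate: fraction of calls that are false positives."""
    return fp / (fp + tp)


def ppv(tp: int, fp: int) -> float:
    """Positive predictive value (precision): fraction of calls that are correct."""
    return tp / (tp + fp)


def tpr(tp: int, truth_total: int) -> float:
    """True positive rate (recall), where truth_total = TP + FN."""
    return tp / truth_total


for (dataset, caller), (tp, fp) in counts.items():
    print(f"{dataset:24s} {caller:12s} FDR={fdr(tp, fp):.7f} PPV={ppv(tp, fp):.7f}")

# Assumption: TP + FN for NA24149 equals the 33,356 truth SNPs given in the text.
NA24149_TRUTH_SNPS = 33356
for caller in ("GATK", "DeepVariant"):
    tp, _fp = counts[("NA24149 Illumina HiSeq", caller)]
    print(f"NA24149 {caller:12s} TPR={tpr(tp, NA24149_TRUTH_SNPS):.7f}")
```

Note that FDR and PPV always sum to 1, so either column determines the other.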

True Positive (TP): SNP identified by the pipeline being tested that exists in the NIST-GIAB SNP list.
False Positive (FP): SNP identified by the pipeline being tested that is absent from the NIST-GIAB list of known SNPs.
True Negative (TN): SNP not detected by the pipeline being tested and absent from the NIST-GIAB list.
False Negative (FN): SNP not detected by the pipeline being tested that exists in the NIST-GIAB list.

FDR = FP/(FP+TP)
TPR = TP/(TP+FN)
FPR = FP/(FP+TN)
PPV = TP/(TP+FP)

Notes: A VCF comparator tool was used for the SNP calling accuracy comparison. The hap.py tool was used for evaluation. Reference genome: hs37d5.fa

References

Krusche P et al.; Global Alliance for Genomics and Health Benchmarking Team. Best Practices for Benchmarking Germline Small Variant Calls in Human Genomes. Nat Biotechnol. 2019 May; 37(5):555-560. doi: 10.1038/s41587-019-0054-x. Epub 2019 Mar 11.
Cornish A, Guda C. A Comparison of Variant Calling Pipelines Using Genome in a Bottle as a Reference. BioMed Research International. Volume 2015.
Zook JM, Chapman B, Wang J, Mittelman D, Hofmann O, Hide W, Salit M. Integrating human sequence data sets provides a resource of benchmark SNP and Indel genotype calls. Nat Biotechnol. 2014 Mar; 32(3):246-51. doi: 10.1038/nbt.2835. Epub 2014 Feb 16.
Poplin R et al. Creating a universal SNP and small indel variant caller with deep neural networks.
Van der Auwera GA et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Current Protocols in Bioinformatics. 2013; 43(1110):11.10.1-33.
