Comparison of Somatic Variant Calling Pipelines on DNAnexus

The detection of somatic mutations in sequenced cancer samples has become increasingly standard in research and clinical settings, as these mutations provide insights into genomic regions that can be targeted by precision medicine therapies. Because tumors are heterogeneous, somatic variant calling is challenging, especially for variants at low allele frequencies. Researchers commonly use somatic variant calling tools, including MuTect, MuSE, Strelka, and Somatic Sniper, that detect somatic mutations by conducting paired comparisons between sequenced normal and tumor tissue samples. Each of these variant callers differs in its algorithms, filtering strategies, recommendations, and output. We therefore set out to compare how the individual apps perform on the DNAnexus Platform, evaluating each for recall, precision, cost, and time to complete.

To benchmark some of the common somatic variant calling tools available on the DNAnexus Platform, our team of scientists simulated synthetic cancer datasets at varying sequencing depths. DNA samples were obtained from the European Nucleotide Archive and mapped to the hs37d5 reference with the BWA-MEM FASTQ read mapper on DNAnexus.

These samples were then merged into a single BAM file representing the normal sample. To obtain the tumor sample, synthetic variants were inserted into each individual sample with the BAMSurgeon app on DNAnexus. All simulated samples were then merged into one BAM file constituting the tumor sample. Both the synthetic tumor and normal BAM files had approximately 250X sequencing depth.

The synthetic tumor BAM file was then downsampled to a range of sequencing depths. Using sambamba through the Swiss Army Knife application, we reduced it to 5X, 10X, 15X, 20X, 30X, 40X, 50X, 60X, 90X, and 120X coverage files. The file representing the normal sample was downsampled to 30X sequencing depth. Once the synthetic cancer dataset was created, the somatic variant calling tools MuTect, MuSE, Strelka, and Somatic Sniper were run to detect single nucleotide variants. Upon completion, each VCF was filtered for high-quality variants.
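The downsampling step described above comes down to simple arithmetic: to go from roughly 250X to a target depth, a subsampling tool needs to keep about target/250 of the reads. The sketch below computes those fractions for the depths listed in the text. The function names are illustrative, and the exact commands the team ran are not given in this post, so treat this as a planning aid rather than the actual pipeline.

```python
# Sketch: fraction of reads to keep when downsampling a ~250X BAM.
# SOURCE_DEPTH is the approximate depth stated in the text; the helper
# names below are hypothetical and not part of any DNAnexus app.
SOURCE_DEPTH = 250
TUMOR_TARGETS = [5, 10, 15, 20, 30, 40, 50, 60, 90, 120]
NORMAL_TARGET = 30


def subsample_fraction(source_depth, target_depth):
    """Fraction of reads to retain so the expected depth equals target_depth."""
    if not 0 < target_depth <= source_depth:
        raise ValueError("target depth must be in (0, source depth]")
    return target_depth / source_depth


def plan(source_depth, targets):
    """Map each target depth to its retention fraction."""
    return {t: subsample_fraction(source_depth, t) for t in targets}


tumor_plan = plan(SOURCE_DEPTH, TUMOR_TARGETS)
normal_fraction = subsample_fraction(SOURCE_DEPTH, NORMAL_TARGET)

for depth, fraction in tumor_plan.items():
    print(f"tumor  {depth:>3}X -> keep {fraction:.3f} of reads")
print(f"normal {NORMAL_TARGET:>3}X -> keep {normal_fraction:.3f} of reads")
```

Because subsampling is random, the realized depth will only approximate the target, which is worth checking after each run.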

Results:

Recall

MuTect achieved the highest recall, followed by Strelka, MuSE, and Somatic Sniper. This ranking was consistent across allele frequency thresholds of 0.1, 0.2, 0.3, 0.4, and 0.5.

Coverage and Recall

One interesting finding: across the callers investigated, recall followed a similar pattern as coverage increased. Each caller discovers more of the variants as coverage rises, then plateaus at a recall ceiling once a certain coverage is reached. Variants at lower allele frequencies require more coverage before a caller's recall saturates: 30-fold coverage was required to reach the plateau for 0.5 allele frequency variants, while 40-fold coverage was required for 0.1 allele frequency variants. Reliable detection of variants at still lower frequencies presumably requires even more coverage to reach a recall plateau.
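One way to build intuition for why lower allele frequencies need more coverage is a toy sampling model. The sketch below assumes that the number of variant-supporting reads at a site follows a binomial distribution with the given depth and allele frequency, and that a caller needs at least `min_alt_reads` supporting reads. Both the model and the threshold of 3 are illustrative assumptions; this is not how MuTect, MuSE, Strelka, or Somatic Sniper decide on a call, and it will not reproduce the plateaus reported here.

```python
from math import comb


def detection_probability(depth, allele_fraction, min_alt_reads=3):
    """P(at least min_alt_reads variant reads) under Binomial(depth, allele_fraction).

    min_alt_reads is a hypothetical evidence threshold chosen for illustration.
    """
    upper = min(min_alt_reads, depth + 1)  # cannot observe more reads than depth
    p_fewer = sum(
        comb(depth, k) * allele_fraction**k * (1 - allele_fraction) ** (depth - k)
        for k in range(upper)
    )
    return 1.0 - p_fewer


for af in (0.5, 0.1):
    row = ", ".join(
        f"{d}X: {detection_probability(d, af):.2f}" for d in (5, 10, 20, 30, 40)
    )
    print(f"allele frequency {af}: {row}")
```

Under this simplification, the probability of seeing enough supporting reads climbs toward 1 much sooner at allele frequency 0.5 than at 0.1, which matches the qualitative pattern described above.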

Precision

All tools performed well at identifying relevant variants (>95% precision) regardless of tumor sequencing depth.
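For readers who want to reproduce this kind of scoring, the sketch below computes recall, precision, and their harmonic mean (the F-score) by comparing a set of called variants with a truth set. The variant tuples and the `score_calls` helper are made up for illustration; they are not data or code from this benchmark.

```python
def score_calls(called, truth):
    """Compare called variants with a truth set; both are iterables of hashable keys."""
    called, truth = set(called), set(truth)
    true_pos = len(called & truth)    # called and real
    false_pos = len(called - truth)   # called but not real
    false_neg = len(truth - called)   # real but missed

    precision = true_pos / (true_pos + false_pos) if called else 0.0
    recall = true_pos / (true_pos + false_neg) if truth else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f_score": f_score}


# Toy example: variants keyed by (chromosome, position, ref, alt).
truth = {("1", 1000, "A", "T"), ("1", 2000, "G", "C"),
         ("2", 500, "C", "T"), ("3", 42, "T", "G")}
called = {("1", 1000, "A", "T"), ("1", 2000, "G", "C"), ("2", 900, "G", "A")}

scores = score_calls(called, truth)
print({name: round(value, 3) for name, value in scores.items()})
```

Keying variants by chromosome, position, reference allele, and alternate allele is one reasonable choice for single nucleotide variants; other variant types would need normalization before comparison.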

To get a more accurate view of the interplay between precision and recall, the harmonic mean of precision and recall (F-score) was computed for each output VCF by depth. MuTect had the best performance overall, followed by Strelka, MuSE, and Somatic Sniper.

Runtime & Cost

Out of all the apps, Strelka finished most rapidly for the lowest cost. Compared to MuTect, Strelka did not score as high for precision or recall, but completed the analysis of single nucleotide variants in a fraction of the time.

For a more detailed comparison between MuTect and Strelka, a 3-way Venn diagram compares the calls from these two tools with the truth set. Note that the false negatives from MuTect are likely due to noise in the dataset.

To better visualize the differences between the callers, we converted the output of each caller into a high-dimensional vector in which each variant call in any of the samples is one of the dimensions. This format allows us to calculate the distances between each of the programs and between each program and the truth set. It also allows us to use standard methods such as multidimensional scaling to convert these distances into positions in 2-D space (axis units are arbitrary; only relative positions matter in the resulting graph).
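The vector encoding described above can be sketched in a few lines. In the example below, every variant seen in any call set becomes one dimension, each call set becomes a 0/1 vector, and pairwise distances are computed between them. The post does not say which distance metric was used, so Hamming distance is an assumption here, and the call sets (`caller_a`, `caller_b`) are toy data rather than results from the benchmark.

```python
from itertools import combinations


def to_vector(calls, dimensions):
    """Encode a call set as a 0/1 vector over the shared list of variants."""
    return [1 if variant in calls else 0 for variant in dimensions]


def hamming(u, v):
    """Number of dimensions (variants) on which two call vectors disagree."""
    return sum(x != y for x, y in zip(u, v))


# Toy call sets; real keys would be variant records from each VCF.
call_sets = {
    "truth": {"v1", "v2", "v3", "v4"},
    "caller_a": {"v1", "v2", "v3"},
    "caller_b": {"v1", "v5"},
}

# One dimension per variant observed in any call set.
dimensions = sorted(set().union(*call_sets.values()))
vectors = {name: to_vector(calls, dimensions) for name, calls in call_sets.items()}

distances = {
    (a, b): hamming(vectors[a], vectors[b])
    for a, b in combinations(call_sets, 2)
}
for pair, distance in distances.items():
    print(pair, distance)
```

A symmetric matrix built from these pairwise distances is the kind of input a multidimensional scaling implementation that accepts precomputed distances would take to produce the 2-D layout.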

Valid variant calling results are crucial as next-generation sequencing data is increasingly applied to the development of targeted cancer therapeutics. Our analysis of MuTect, MuSE, Strelka, and Somatic Sniper found that the best results with respect to precision and recall can be achieved by using MuTect. Strelka was also a top performer, and simultaneously reduced runtime and cost.

Need to detect variants in your dataset? Get started using these tools on DNAnexus today.

This research was performed by Nicholas Hill and Victoria Wang as part of their internship with DNAnexus. The project was supervised by Naina Thangaraj, Arkarachai Fungtammasan, Yih-Chii Hwang, Steve Osazuwa, and Andrew Carroll.

Introducing htsget, a new GA4GH protocol for genomic data delivery

DNAnexus is here in Orlando for the fifth plenary meeting of the Global Alliance for Genomics and Health (GA4GH), the standards-making body advancing interoperability and data sharing for genomic medicine. We’re especially pleased this year to join in launching version 1.0 of htsget, a new protocol for the secure web delivery of large genomic datasets, especially whole-genome sequencing reads which can exceed 100 gigabytes per person. 

Htsget complements the incumbent BAM and CRAM file formats for reads, which GA4GH also stewards, and their ecosystem of tools. It adds a standardized protocol for accessing such data over the web, securely, reliably, efficiently, and even in a federated fashion when needed. Retrieval with htsget is now built into the ubiquitous samtools via its underlying htslib library, allowing bioinformaticians to leverage htsget with most existing tools via a familiar Unix pipe. At the same time, htsget’s streaming parallelism enables scalable ETL into cluster environments like Apache Spark, providing a gradual transition path from incumbent file-based toolchains toward modern “big data” platforms. Lastly, htsget simplifies data access for interactive genome browsers by unifying authentication and removing the need for index files.
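To give a feel for what a client does, here is a minimal sketch based on our reading of the htsget 1.0 specification: the server answers a request with a JSON "ticket" listing URLs (optionally with headers to send), and the client fetches each block in order and concatenates the bytes. The helper names are hypothetical, the example ticket is hand-built with inline `data:` URIs so it runs offline, and the payload bytes are placeholders, not real BAM content.

```python
import base64
import json
from urllib.parse import unquote_to_bytes
from urllib.request import Request, urlopen


def read_data_uri(uri):
    """Decode an inline data: URI into bytes (base64 or percent-encoded)."""
    header, _, payload = uri.partition(",")
    if header.endswith(";base64"):
        return base64.b64decode(payload)
    return unquote_to_bytes(payload)


def fetch_block(entry):
    """Retrieve one ticket entry, passing along any headers the server asked for."""
    url = entry["url"]
    if url.startswith("data:"):
        return read_data_uri(url)
    request = Request(url, headers=entry.get("headers", {}))
    with urlopen(request) as response:
        return response.read()


def assemble(ticket_json, fetch=fetch_block):
    """Concatenate the blocks named in a ticket, in the order given."""
    ticket = json.loads(ticket_json)["htsget"]
    return b"".join(fetch(entry) for entry in ticket["urls"])


def data_uri(payload):
    """Build a base64 data: URI for the offline example below."""
    encoded = base64.b64encode(payload).decode("ascii")
    return "data:application/octet-stream;base64," + encoded


# Hand-built ticket with placeholder payloads (assumption: not real server output).
example_ticket = json.dumps({
    "htsget": {
        "format": "BAM",
        "urls": [{"url": data_uri(b"HEADER")}, {"url": data_uri(b"BODY")}],
    }
})

result = assemble(example_ticket)
print(result)
```

Keeping `fetch` injectable makes it straightforward to swap in parallel or retrying downloads, which is where the streaming parallelism mentioned above would come into play.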

On the server side, htsget has been deployed at the Sanger Institute and the European Genome-phenome Archive; DNAnexus operates a multi-cloud htsget server, which we call htsnexus, indexing data within Amazon S3 and Azure Blob storage; and Google Cloud Platform has open-sourced their own implementation. Clients can speak a uniform protocol that abstracts the diverse authentication and storage schemes of these service providers.

These groups, and others, have all shaped the htsget specification through the GA4GH’s highly collaborative process. But it started in large part with a contribution from DNAnexus, drawing on our experience optimizing how our systems utilize cloud object stores in the huge genome projects we’ve served, such as CHARGE, 1000 Genomes Project, TCGA, and HiSeq X Series data production. Through htsget and other work streams under the new GA4GH Connect framework announced today, DNAnexus looks forward to further contributing from our experience and network to advance the GA4GH’s essential mission.

For more information about how DNAnexus is working with htsget, please contact us at info@dnanexus.com.

DNAnexus at ASHG: Accelerating Your Path from Genomic Data to Insight

We are looking forward to attending the annual American Society of Human Genetics (ASHG) meeting next week in Orlando.  

We’re excited to share updates on recent projects, including our new data analysis and management solution for NovaSeq™ instruments, our collaborative microbiome informatics platform, and the latest software tools available on DNAnexus from our partners at Edico Genome and PacBio.

If you’re headed to ASHG, stop by DNAnexus booth 811 to learn about the broad research and clinical applications of the DNAnexus Platform. Can’t make it to any of our events? Stop by booth 811 anytime during the conference, or email us to schedule a meeting with a member of our team.

Lunchtime Talk 

Optimize Your Path to Variant Production: Real World Examples

Friday, October 20th, 1:00pm-2:15pm
Hilton Orlando Hotel, Lake George Room, Lobby Level

Join our lunchtime discussion to learn about DNAnexus CloudSeq, a powerful solution for rapidly scaling cloud-enabled bioinformatics infrastructure for research and clinical sequencing applications. You will hear case studies from Baylor College of Medicine’s Human Genome Sequencing Center and Rady Children’s Hospital about navigating the complexities of integrating large multi-omic datasets, and developing pipelines to analyze and share data and insights across global R&D organizations.

Guest Speakers:

  • Will Salerno, PhD, Director of Genome Informatics, Human Genome Sequencing Center at Baylor College of Medicine
    • Talk: Translation of NIH Data in Discovery Commons
  • Narayanan Veeraraghavan, PhD, Director of IT at Rady Children’s Institute for Genomic Medicine
    • Talk: Creating a Critical Nexus: Making Rapid Whole-Genome (rWGS) Based Precision Medicine Accessible to NICUs and PICUs Across the Country

RSVP today; lunch will be provided.

Booth Activities

Debuting DNAnexus CloudSeq
Stop by to learn about our powerful data analysis and management solution for the NovaSeq™ series of sequencing systems.

  • Wednesday, October 18th, 1:00pm in DNAnexus Booth #811

Edico Genome’s DRAGEN on DNAnexus
See a demo of DRAGEN, Edico Genome’s ultra-rapid, accurate, and cost-efficient genomic data analysis pipeline on DNAnexus. Sign up here or in our booth to take advantage of limited-time promotional pricing on DNAnexus.

  • Wednesday, October 18th, 10:40am in Edico Genome Booth #710
  • Wednesday, October 18th, 2:30pm in DNAnexus Booth #811

PacBio SMRT Analysis Suite 5.0 Available Now on DNAnexus
Test drive PacBio’s SMRT Analysis software on DNAnexus. The suite of SMRT tools includes a comprehensive set of applications for genomic analysis, including de novo assembly, variant calling, transcriptome analysis, epigenomics, and more.

  • Thursday, October 19th, 1:00pm, DNAnexus Booth #811
  • Friday, October 20th, 11:00am, PacBio Booth #722

Join the Microbiome Research Community  
Stop by to learn how to get involved in a series of community challenges aimed at increasing the understanding of the human microbiome and its relation to disease.

  • Thursday, October 19th, 2:30pm, DNAnexus Booth #811

Posters Featuring DNAnexus  

PgmNr 745: Access, visualize and analyse pediatric genomic data on St Jude Cloud.

  • Speaker: Scott Newman, St Jude Children’s Research Hospital
  • Time: Wednesday, October 18th, 2:00pm-3:00pm

PgmNr 2563: Improved molecular tracking of individual genomes for clinical whole-genome sequencing.

  • Speaker: Sergey Batalov, Senior Bioinformaticist, Rady Children’s Institute for Genomic Medicine
  • Time: Wednesday, October 18th, 2:00pm-3:00pm

PgmNr 1951: Exome-wide association study of kidney function in 55,041 participants of the DiscovEHR cohort.

  • Speaker: Claudia Schurmann, Statistical Geneticist, Regeneron Genetic Center, Regeneron Pharmaceuticals
  • Time: Wednesday, October 18th, 2:00pm-3:00pm

PgmNr 763: Whole genome sequencing signatures for early detection of cancer via liquid biopsy.

  • Speaker: Bahram Kermani, Founder & CEO, Crystal Genetics
  • Time: Wednesday, October 18th, 2:00pm-3:00pm

PgmNr 1281: Cloud-based quality measurement of whole-genome cohorts.

  • Speaker: Will Salerno, Human Genome Sequencing Center, Baylor College of Medicine
  • Time: Friday, October 20th, 11:30am-12:30pm
