PrecisionFDA Receives FDA Commissioner’s Award for Outstanding Achievement

Today, the precisionFDA Next Generation Sequencing (NGS) Team received the FDA Commissioner’s Special Citation Award for Outstanding Achievement and Collaboration in the development of the precisionFDA platform promoting innovative regulatory science research to modernize regulation of NGS-based genomic tests. This award recognizes superior achievement of the Agency’s mission through teamwork, partnership, shared responsibility, and fostering collaboration to achieve the FDA goals.

 

PrecisionFDA is an online, cloud-based, virtual research space where members of the genomics community can experiment, share data and tools, collaborate, and define standards for evaluating and validating analytical pipelines. This open-source community platform, which has become a global reference standard for variant comparison, includes members from academia, industry, healthcare, and government, all working together to further innovation and develop regulatory standards for NGS-based drugs and devices. Launched in December 2015, the precisionFDA community includes nearly 5,000 users across 1,200 organizations, with more than 38 terabytes of genomic data stored.

To date, the precisionFDA NGS Team has engaged the genomics community through a series of community challenges:

  • The Consistency Challenge (Feb-Apr 2016): Invited participants to manipulate datasets with their software pipelines and conduct performance comparisons.
  • The Truth Challenge (Apr-May 2016): Gave participants the unique opportunity to test their NGS pipelines on an uncharacterized sample (HG002) and publish results for subsequent evaluation against a newly-revealed ‘truth’ dataset.
  • App-a-thon in a Box (Aug-Dec 2016): Invited the community to contribute NGS software to the precisionFDA app library, enabling the community to explore new tools.
  • Hidden Treasures Competition (Jul-Sep 2017): Participants beta-tested the in-silico analyses of NGS datasets for the purpose of determining the reliability and accuracy of different NGS tests.
  • CFSAN Pathogen Detection Challenge (Feb-Apr 2018): Participants helped to improve bioinformatics pipelines for detecting pathogens in samples sequenced using metagenomics.

We are thrilled that precisionFDA has been recognized for its efforts in fostering shared responsibility for the evaluation and validation of analytical pipelines. PrecisionFDA’s proven success has inspired other scientific communities to establish their own collaborative ecosystems where members can contribute and innovate, such as St. Jude Cloud, which promotes pediatric cancer research, and the Mosaic microbiome platform, which advances microbial strain analysis. DNAnexus is proud to be the platform that powers precisionFDA and other community portals to advance scientific research through a secure and collaborative online environment.

To learn more about DNAnexus community portals please visit: http://go.dnanexus.com/community-portals.

SMRT Leiden Assembly Grant

Submit your unique plant or animal genome proposal for a chance to win free de novo assembly services on PacBio SMRT Sequencing data. See details below.

We are excited to participate in our partner PacBio’s annual SMRT Leiden Conference in Leiden, Netherlands, from June 12th – 14th. This back-to-back conference will include the SMRT Scientific Symposium on June 12th & 13th, featuring presentations from key experts and opinion leaders sharing their scientific discoveries and latest achievements from a variety of fields. The SMRT Informatics Developers Conference will follow on June 14th, focused on developing and improving analysis tools for PacBio SMRT Sequencing data. Software developers and bioinformaticians will spend the day advancing new and existing tools for de novo assembly, genome phasing, structural variation, base modifications, and Iso-Seq analysis.

During the SMRT Informatics Developers Conference on June 14th, DNAnexus will be presenting “Evaluating haplotype phasing from FALCON Unzip” at 10:30am in the session titled “DE NOVO ASSEMBLY.” In this talk, we evaluate the performance of FALCON Unzip in forming phased haplotypes by assembling and phasing the genomes of an artificial human. By examining SNPs that are known to be unique to one of the parents, we show that FALCON Unzip is able to produce impressive phasing information while requiring nothing more than a little additional compute time to process the data.

DNAnexus is also honored to be a sponsor of PacBio’s Leiden Conference by providing the “SMRT Leiden Grant powered by DNAnexus,” offering free de novo assembly for the most unique plant or animal genome in the world. One lucky winner will be selected for DNAnexus de novo assembly services on PacBio SMRT Sequencing data. Participants can submit proposals on the SMRT Leiden Grant website, with information on organism type and its impact on the scientific community. Proposals should be approximately 250 words in length, and the genome size should be up to 1.5 Gbp (genomes larger than 1.5 Gbp will be considered under special circumstances). Please note that de novo assembly services will only be applied to data generated through PacBio SMRT Sequencing, and that sequencing is not included in the SMRT Leiden Grant.

Deadline for submission is June 29th, and the winner will be announced the week of July 9th.

Genome assembly requires massive computational resources to assemble reads or run structural variation detection across datasets, and it is made even more challenging by high levels of genetic diversity, repetitive elements, and duplicated genomic regions. Our bioinformatics expertise and computational power enable the delivery of high-quality results, leveraging multi-omics data and tools in a collaborative and secure ecosystem. You can learn more on our de novo assembly website about our fast, accurate, and cost-efficient reference-quality assembly services, which enable complex genome assembly, structural variation analysis, and physical mapping to achieve complete and accurate views of all types of genomic variation.

Questions about DNAnexus de novo assembly or the SMRT Leiden Grant? Email us!

Training and Applying Genomic Deep Learning Models

The application of Deep Learning methods has created dramatically stronger solutions in many fields, including genomics (as a recent review from the Greene Lab details). In this blog, we focus on a different aspect – the ability of deep learning to empower those with domain insight to rapidly create methods for new technologies or problems.

Last weekend, two teams with members from DNAnexus participated in an SV.AI hackathon chaired by Ben Busby, who leads a series of NIH Hackathons. NCI and GDC also helped a great deal.

For this hackathon, a patient with renal cell carcinoma donated both sequence data from their tumor and their corresponding whole blood sequence. This sequence data was >90x coverage from the BGI-SEQ 500 instrument.

The hackathon itself was stellar, and all presentations and data are available on GitHub including those from both DNAnexus teams, CNN and RNN. Another group was able to provide a very comprehensive analysis of the tumorigenesis.

Because the BGI-SEQ is such a new instrument, and because public deep-coverage WGS data from it is difficult to find, the RNN team decided to use the hackathon to evaluate the analytical methods being used on BGI-SEQ data and to explore the possibility of improving tools for this data type.

This work strongly validated that premise: in only a single weekend, we achieved significant improvements to two separate deep learning methods.

Evaluation of BGI-SEQ

BGI-SEQ Evaluation

First, we attempted to assess what issues may exist in applying tools developed for Illumina technology to the BGI-SEQ 500. To do this on a genome which did not have a truth set, we developed a method called DOMA (Drop Out of Mutual Agreement).

A more detailed description is in our presentation. We plan to release a further blog post on this method later. Values closer to 0 (higher on the chart) reflect better performance.

BGI-SEQ Evaluation Doma

As a high-level summary, we found error rates were about 3x higher in SNP calling and 10x higher in indel calling. False negatives in indel calling were the dominant error mode, consistent with what is seen in the only available 30x BGI-SEQ Genome in a Bottle sample.

Some methods performed relatively worse in this evaluation. For example, it is unusual to see GATK depart so greatly from Sentieon.

Minding the Gap: Why Performances Differ

The implication here is that components of the BGI-SEQ technology are different enough from Illumina that the methods built with Illumina in mind either can’t take advantage of new information in BGI-SEQ, don’t know how to mitigate BGI-SEQ weaknesses, or bring biases that don’t apply to BGI-SEQ data. Even a quick visual inspection hints at this:

BGI-SEQ Visualization

In this GC-rich region, there appear to be some systematic A/T > C errors that occur only in reads sequenced from 5’ to 3’. While this may not be prevalent, it could cause false positive variant calls, especially when the sequence coverage is low.
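One way to check whether a pattern like this is systematic rather than anecdotal is to tally mismatches separately for each read strand. The sketch below is a minimal illustration and not the method used at the hackathon: the `observations` records and the `tally_by_strand` helper are hypothetical, and in practice the tuples would have to be extracted from the alignment (BAM) file at the positions of interest.

```python
from collections import Counter

def tally_by_strand(observations):
    """Count (reference base, read base) mismatches separately per strand.

    `observations` is an iterable of (strand, ref_base, read_base) tuples,
    where strand is "+" or "-". This input format is an assumption made
    purely for illustration.
    """
    counts = {"+": Counter(), "-": Counter()}
    for strand, ref_base, read_base in observations:
        if ref_base != read_base:
            counts[strand][(ref_base, read_base)] += 1
    return counts

# Toy data: A>C and T>C mismatches appear only on the forward strand.
observations = [
    ("+", "A", "C"), ("+", "A", "C"), ("+", "T", "C"), ("+", "G", "G"),
    ("-", "A", "A"), ("-", "T", "T"), ("-", "A", "A"), ("-", "G", "G"),
]

counts = tally_by_strand(observations)
forward_mismatches = sum(counts["+"].values())
reverse_mismatches = sum(counts["-"].values())
print(forward_mismatches, reverse_mismatches)
```

A strongly lopsided tally like the one in this toy data would suggest a strand-specific artifact rather than a real variant, since true variants are generally expected to be supported by reads from both strands.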

There are currently two published variant calling tools that use deep learning methods: DeepVariant and Clairvoyante. We were able to train both of them on BGI-SEQ data.

Training a BGI-SEQ model for Clairvoyante

Clairvoyante was posted on bioRxiv just a couple of weeks ago. The work was led by Ruibang Luo, an assistant professor at the University of Hong Kong who developed Clairvoyante as a postdoc in Michael Schatz’s lab.

Clairvoyante Diagram

Clairvoyante uses a less complex architecture than DeepVariant, allowing it to be trained using less hardware and to make variant calls quickly. Though its accuracy on Illumina data is not as strong, the framework can call variants on long-read data. It is likely the strongest variant caller for PacBio and Oxford Nanopore. Ruibang graciously cites Jason’s VariantNet blog post, which discusses many of the concepts for deep learning on genomic data with similar architectures.

To extend Clairvoyante, Jason generated a workflow and applets that run both training-data preparation and variant calling as reproducible pipelines on the DNAnexus platform.

The end-to-end process, from training to calling with Clairvoyante, can be decomposed into three stages:

(1)   Building training data: the workflow is configured with WDL and Docker. We use dxWDL to transform the WDL into a DNAnexus workflow for deployment. The WDL workflow (shown below) is slightly more involved than the other steps, as it includes some complicated variant file filtering.

WDL Workflow

(2)   Model training: a native DNAnexus applet running the Clairvoyante convolutional neural network to generate trained models.

(3)   Variant calling: a WDL task-based DNAnexus applet taking the trained model and alignment BAM files to generate variant calls.

With this setup, we can run each step using simple dx-toolkit commands or the DNAnexus platform UI to fully reproduce the deep learning variant calling pipeline for the three technologies used as examples in the preprint and for the BGI-SEQ data. Within the DNAnexus project environment, all of this can be done without explicit knowledge of how to set up the working environments for the different steps. A user only has to provide the proper input files for the workflow.


Command Line

Launching the training data preparation step with the command line

Platform Analysis

Launching the training data preparation step with the DNAnexus platform UI

All data files, trained models, variant calls, workflow and applets are available on the DNAnexus public project “clairvoyante_dnanexus_demo” to those with DNAnexus accounts. If you’d like to try running these steps yourself, you can copy the contents of the public project to a new project and execute the code in the “job_scripts” directory of the GitHub repo using the dx-toolkit command line environment.

Jason’s work on re-training Clairvoyante on the DNAnexus platform is available in a GitHub repo. While the code targets deployment on the DNAnexus platform, the workflow and environment setup could be adapted to a generic computation platform or cluster as well.

The charts below indicate the performance of Clairvoyante on Illumina and BGI-SEQ data. Clairvoyante is able to achieve a high overall accuracy on BGI-SEQ data. Its slightly lower accuracy relative to Illumina suggests that BGI-SEQ data is somewhat more challenging to use.

Clairvoyante Comparison

Thinking more broadly, genomic deep-learning model developers can use DNAnexus as a platform for continuous integration and testing. The platform integrates genomic data management with cloud job execution, which makes pipeline execution tests possible with even larger and more realistic datasets. A production center that added every validation run to a DNAnexus project could continuously build new models and create a constantly improving method tuned to the production conditions of its operation.

Training DeepVariant for BGI-SEQ

We have already discussed DeepVariant in several blogs evaluating the method, its robustness, and Google Brain’s continued improvements to it. Pi-Chuan Chang from Google Brain demonstrated how to retrain DeepVariant.

DeepVariant Google Brain

Google Brain recently released Nucleus, a framework which provides hooks from common genomic data formats into TensorFlow. Pi-Chuan’s work used only this public-facing resource; anyone can replicate these steps to extend DeepVariant to their data or problem of choice.

The document “Improve DeepVariant for BGISEQ germline variant calling” gives detailed, step-by-step instructions on this process (with screenshots).

Returning to indel recall as the main error mode, the chart below shows the improvement from even a small amount of training (chromosome 1 of one genome). Because this amount of data is much less than that used in the full DeepVariant training, and because we use the Illumina-trained DeepVariant as a base, we call this “fine-tuning”.

Method                                          | Data Type | SNP F1 | Indel F1
GATK4 Best Practices                            | BGI-SEQ   | 99.74% | 87.49%
DeepVariant – ILMN trained                      | BGI-SEQ   | 99.83% | 94.28%
DeepVariant – ILMN trained + BGI-SEQ fine-tuned | BGI-SEQ   | 99.89% | 98.10%
DeepVariant Baseline                            | Illumina  | 99.96% | 99.72%
GATK HC Baseline                                | Illumina  | 99.87% | 98.75%

For context, an Indel F1 of 94% (before training) is a number you might expect from the early days of GATK UnifiedGenotyper. With an Indel F1 of 98% at the end of this tuning, DeepVariant on BGI-SEQ is about as accurate as the current version of GATK HaplotypeCaller on Illumina.
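To read the table in terms of errors rather than accuracy, it can help to look at the residual 1 - F1. The short sketch below uses only the BGI-SEQ Indel F1 values from the table above; the function names are ours and purely illustrative, and 1 - F1 is used here as a rough proxy for error, not as a formal error rate. F1 is the harmonic mean of precision and recall, so a shortfall in recall (false negatives) pulls it down directly.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def error_reduction(f1_before, f1_after):
    """Factor by which the residual (1 - F1) shrinks."""
    return (1 - f1_before) / (1 - f1_after)

# Indel F1 values on BGI-SEQ data, taken from the table above.
gatk4 = 0.8749
deepvariant_ilmn = 0.9428
deepvariant_finetuned = 0.9810

print(round(error_reduction(gatk4, deepvariant_finetuned), 1))
print(round(error_reduction(deepvariant_ilmn, deepvariant_finetuned), 1))
```

By this measure, fine-tuning leaves roughly one third of the indel residual of the Illumina-trained model, and between one sixth and one seventh of that of GATK4 Best Practices on the same data.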

That means that in a single weekend, we could make the equivalent of several years of methods development progress. And while we don’t think that means we can take a few years off to sit on the beach, it does mean we are excited about how much further we can push this technology in the future.

Acknowledgements

We want to specifically thank the patient who donated their data for this hackathon, and whose presence and discussion at the hackathon inspired so many teams. We believe their contribution will increase the benefit of genomics for all. We thank the SV.AI organizers and all of the participants, mentors, judges, and sponsors who created so much positive energy and an environment for collaboration, learning, discovery, and innovation. We would also like to thank Ruibang for sharing the training data and insights that allowed us to successfully reproduce the Clairvoyante work.