SMRT APAC Genome Assembly Grant & Upcoming Webinar

Submit your unique plant or animal genome proposal for a chance to win free de novo assembly services on PacBio SMRT Sequencing data. See details below.

De novo genome assembly is a complex task that can require massive computational resources to weave long reads into a final, polished assembly. Plants and some species of animals can prove particularly challenging due to their high levels of genetic diversity, repetitive elements, and duplicated genomic regions. Bringing together multiple technologies (long-read, short-read, next-generation mapping) can improve contigs and scaffolds to produce more complete and accurate genome assemblies.

At DNAnexus we have a team of scientists who provide fast, accurate, and cost-efficient reference-quality assembly services. Our key experience includes the 3,000 Rice Genomes Project and the Vertebrate Genomes Project, along with many complex individual assemblies, including the tobacco genome (4.5 billion base pairs, tetraploid, and highly repetitive) and the immune recognition regions of the rhesus macaque.

APAC Genome Webinar

Join us, in collaboration with our partners PacBio and Microsoft, for a webinar – Best Practices for Rapid Reference-Quality Genome Assembly – taking place September 5th at an APAC-friendly time, 10:00 AM China Standard Time. Register today.

In this webinar, you’ll learn:

  • Why de novo assembly is important
  • How we define a reference quality genome
  • Best practices with PacBio’s FALCON, FALCON-Unzip, & NHGRI’s Canu assemblers
  • Layering next-gen mapping for more accurate assemblies
  • Case study: Rhesus macaque
  • Benefits of DNAnexus assembly services

For the webinar case study, the La Jolla Institute for Allergy and Immunology and the University of Louisville used DNAnexus de novo assembly services to address the research community's need for a higher-quality rhesus macaque reference genome. The rhesus macaque genome provides a valuable reference for vaccine trials, where it is used to understand both efficacy and antibody response. Examining this response, however, proved challenging because the research community lacked a full picture of the genome. Although primate genome assembly is not especially difficult in general, the regions involved in immune recognition are, because of extensive repeats, segmental duplications, and haplotype variation between individuals. DNAnexus scientists assembled 2.8 billion base pairs of the genome, with a contig N50 of 8.4 million base pairs and a haplotype contig N50 of 451,000 base pairs.
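For readers less familiar with the metric, N50 is the contig length at which contigs of that size or longer contain half of the assembled bases. A minimal sketch of computing it from a FASTA index (the file name is illustrative; requires samtools):

$ samtools faidx assembly.fa
$ cut -f2 assembly.fa.fai | sort -rn | awk '{ len[NR] = $1; total += $1 }
    END { for (i = 1; i <= NR; i++) { run += len[i]; if (run >= total / 2) { print "N50 =", len[i]; exit } } }'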

SMRT APAC GRANT

DNAnexus is excited to announce the SMRT APAC Grant powered by DNAnexus, through which we are offering free de novo assembly for the most unique plant or animal genome in the world. One lucky winner will be selected to receive DNAnexus de novo assembly services on PacBio SMRT Sequencing data. To enter, simply submit a 250-word proposal that includes a short description of your plant or animal genome (up to 1.5 Gbp in size) and how its sequencing and de novo assembly would benefit the larger scientific community. We are now accepting submissions on the SMRT APAC Grant website. Please note that de novo assembly services will only be applied to data generated through PacBio SMRT Sequencing, and sequencing itself is not included in the SMRT APAC Grant.

The deadline for submission is September 30th, with the winner announced the week of October 8th.

Do you have a genome that needs to be assembled? Learn more about our de novo assembly services or request a quote.

Breaking Down Crumble: A New Method to Significantly Reduce NGS Data Footprint while Preserving Results

Authors: Alessandro, Andrew

Crumble is a newly published NGS compression method by James Bonfield, Shane McCarthy, and Richard Durbin of the Sanger Institute. In this post, we demonstrate that Crumble provides size savings of 20-70% for BAM and CRAM files from HiSeq2500, HiSeqX, and NovaSeq whole genome sequencing (WGS) and exomes. We profile the resulting variant calls and show that Crumble produces generally similar results, in some cases even slightly improving variant-calling accuracy on the same sequencing data.

Introduction

Among computational fields, genomics is especially data-heavy. For a 30x WGS sample, the size of the BAM file can easily exceed 100 GB. The collective size of genomics projects is growing at a tremendous rate. One highly cited work by Stephens et al. projects that, with current trends, the volume of NGS data in 2025 will exceed the data footprint of YouTube, Twitter, and astronomy.

Most of the data footprint is in BAM files. In this format, the data footprint is divided between read names, the sequence itself, quality values (QVs), and tags. The CRAM format uses lossless reference-based compression for the sequence, pushing the relative data footprint toward storing the quality values (see our Readshift blog for background on QVs).

In an uncompressed (SAM) file, each quality value occupies a bin with a fixed amount of space (see the figure on the right, where each quality value is represented as a different shape). The BAM format compresses this data with a standard method similar to gzip. Greatly simplified, such compression methods identify recurring patterns within the data that can be used to pack elements together; more irregular data (data with higher entropy) is harder to compress. This blog post from Dropbox contains interesting examples of how compression methods can work.

As Illumina has progressively increased instrument throughput, it has fought the growing friction of managing the data footprint by decreasing the number of possible QVs (HiSeq2500: 40, HiSeqX: 8, NovaSeq: 4). The figures below illustrate how packing a stream of QV pieces together is easier when there are fewer types of QVs and those types are more similar. These changes are generally synergistic with other advanced compression technologies such as PetaGene.

QV Compressed Comparison
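As a toy illustration of why this helps (synthetic, uniformly random streams, so the absolute sizes are not representative of real reads), a stream drawn from roughly 40 distinct quality symbols compresses far less well than one drawn from only 4:

$ tr -dc '!-I' < /dev/urandom | head -c 1000000 > qv40.txt   # ~40 distinct "QV" symbols
$ tr -dc 'ADGJ' < /dev/urandom | head -c 1000000 > qv4.txt   # 4 distinct symbols, as on NovaSeq
$ gzip -k qv40.txt qv4.txt
$ ls -l qv40.txt.gz qv4.txt.gz   # the 4-symbol stream is several times smaller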

The Crumble paper calls this strategy “horizontal” because it applies generally to all bases. Crumble additionally applies a “vertical” approach to consider the context of surrounding reads, preserving more quality values in regions of lower coverage or in signals of interest (like variant positions) and compressing more in regions where reducing entropy is unlikely to impact results. The figure below attempts to conceptually represent some of the conditions under which Crumble bins more or less aggressively.

Crumble Results

Results: Crumble Saves Significant Space for Both Whole Genomes and Exomes

To assess Crumble, we first applied it to several WGS and exome BAM and CRAM files produced across different machines. As Figure 1 illustrates, Crumble saves significant space across all types of WGS files, with larger savings on HiSeqX than on NovaSeq. In these charts, Crumble is run at its default settings. Lower numbers are better (i.e., a BAM/CRAM that takes less space).
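For reference, a default run is a single command. The basic usage below follows the Crumble README, but confirm the options with crumble -h for your version; the file names are illustrative:

$ crumble sample.bam sample.crumble.bam   # BAM in, BAM out, default (Level 9) quality binning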

Crumble Compressions

Crumble Compression 2

The average savings in size relative to the starting state are:

Sample Type    Percent Size Savings (BAM)   Percent Size Savings (CRAM)
WGS HiSeqX     55%                          70%
WGS NovaSeq    20%                          34%
Exome          47%                          60%

Applying Crumble Improves Variant Calling Results in Almost All Conditions

Crumble’s intelligent application of binning allows it to minimize possible negative consequences. Interestingly, Crumble has a positive impact on variant calling results overall. We suppose this occurs because Crumble bins more aggressively in regions of unusually high coverage or poor mapping. QVs indicate the probability that a given base call is in error, but accounting for the additional probability that a read is mismapped is a more complex calculation, one that variant callers tend to be overconfident about. Crumble's binning may correct for this by reducing unfounded confidence.

Crumble improved accuracy on NovaSeq WGS for almost every pipeline, with substantial decreases in errors seen in several. In all charts that follow, lower values on the y-axis (that is, fewer errors) are better.

Crumble NovaSeq

For exomes, the changes in accuracy were less pronounced, with only GATK showing a substantial difference: an improvement of about 10%.

Crumble Exome

The following table summarizes how Crumble impacts variant calling across these use cases. Larger negative numbers indicate a larger improvement from Crumble. Crumble improves results in 73% of investigated cases, with substantial (~10% or more) improvements in 33% of cases. In the cases where Crumble made a pipeline worse, the impact was small. GATK4, in particular, was consistently helped by Crumble.

Change in total errors by using Crumble v0.8.1 relative to baseline (non-Crumble).

              Strelka2   DeepVariant   Freebayes   Sentieon   GATK4
HiSeqX WGS    -1.3%      -19.2%        -5.2%       +3.1%      -27.1%
NovaSeq WGS   -1.2%      +3.5%         -2.3%       -10.3%     -22.6%
Exome         -3.0%      +1.4%         +1.3%       +0.1%      -9.9%

The Crumble Level Parameter Is More Important for Exomes; Level 9 (the Default) Behaves Differently for WGS and Exomes

Crumble comes with presets for parameter combinations that correspond to increasingly aggressive binning strategies (Levels 1, 3, 5, 7, 8, and 9, with 9 as the default). We applied each of these settings to 4 WGS samples (HiSeqX HG001, HG002, and HG005, plus NovaSeq HG001) and 3 exomes. In every case, our results were consistent among the WGS samples and among the exome samples, so we present only one representative of each below.

Our initial investigation of the level parameter used Crumble v0.8. Version 0.8.1 was released two weeks later, adding a new Level 9 and renaming the old Level 9 to Level 8 without otherwise changing settings, so we have re-run only the new modes in these charts.
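For concreteness, the presets are selected with single-digit flags in the Crumble usage text (verify with crumble -h for your version; file names are illustrative):

$ crumble -7 sample.bam sample.level7.bam   # a less aggressive preset
$ crumble -9 sample.bam sample.level9.bam   # the most aggressive preset, and the v0.8.1 default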

Crumble and Output

Crumble and Output 2

Crumble and Sentieon

Crumble and Sentieon 2

Crumble Is Fast and Cheap to Run

By running Crumble, a user chooses to pay a compute cost in order to achieve persistent savings in storage. This cost-benefit decision depends on the speed of the method. Fortunately, Crumble is extremely fast, especially when compared to methods for mapping samples or calling variants. We are generally able to run Crumble on a 35X WGS in just 8 CPU-hours.

Unfortunately, Crumble does not currently support multi-threading. However, users can restrict it to running on a genomic region. So, to run Crumble quickly and efficiently, we do the following:

  • Generate a contig list from the BAM header.
  • Set the unplaced reads aside.
  • Use gnu-parallel to apply Crumble in parallel to each chromosome.
  • Use samtools cat to combine everything.

(the following image is a code snippet from our app code):

Crumble App Code
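For readers who cannot view the image, here is a minimal sketch of the same approach. It is not our exact app code: the file names, thread count, and use of samtools for the per-region extraction are illustrative choices.

set -euo pipefail
IN=input.bam                        # illustrative input name
export IN
samtools index "$IN"

# 1. Contig list from the BAM header
samtools view -H "$IN" | awk '$1 == "@SQ" { sub("SN:", "", $2); print $2 }' > contigs.txt

# 2. Set the unplaced (no-coordinate) reads aside
samtools view -b -o unplaced.bam "$IN" '*'

# 3. Apply Crumble to each chromosome in parallel
parallel -j 8 '
  samtools view -b -o {}.bam "$IN" {} &&
  crumble {}.bam {}.crumble.bam &&
  rm {}.bam
' < contigs.txt

# 4. Recombine the per-chromosome pieces and the unplaced reads
samtools cat -o output.crumble.bam $(awk '{ print $1 ".crumble.bam" }' contigs.txt) unplaced.bam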

With this, we can execute Crumble in about 1 hour of wall-clock time on an 8-core machine (downloading the input to the cloud worker and uploading the result back both take a bit of additional time).

The break-even point at which the storage savings justify the extra compute cost depends on your relative costs for compute and storage. However, under most realistic scenarios, running Crumble will pay back the investment within a few weeks to months and will accrue lasting savings well beyond that. Reducing data size has other benefits as well: the data is easier for collaborators to access, and it takes less time to move between cloud storage and cloud workers.
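As a purely illustrative back-of-the-envelope calculation (the prices and sizes below are placeholders, not quotes for any provider): a 100 GB BAM reduced by 20%, at $0.02 per GB-month of storage, against 8 CPU-hours of compute at $0.05 per CPU-hour, breaks even in about a month.

$ awk 'BEGIN {
    compute  = 8 * 0.05          # one-time compute cost, dollars
    saved_gb = 100 * 0.20        # GB of storage saved
    monthly  = saved_gb * 0.02   # storage dollars saved per month
    printf "break-even after %.1f months\n", compute / monthly
  }'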

Crumble on DNAnexus

An efficiently wrapped app to run Crumble is available to all users in the DNAnexus app library at https://platform.dnanexus.com/app/crumble (platform login required to access this link). The app can take input and produce output in either BAM or CRAM format (CRAM output requires a matching reference genome).

Crumble workflow

Crumble App
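From the command line, running the app might look like the following; the app name and input field shown here are our assumptions, so check dx run app-crumble --help for the actual input spec:

$ dx run app-crumble -i input_bam=project-xxxx:/sample.bam -y --watch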

Final Notes

These tests were all performed on germline sequencing data; it is unknown how Crumble interacts with somatic methods. There are other operations that can shrink BAM/CRAM files further without impacting results, such as lossy compression of read names. We hope to explore these in a later post.

Conclusion

We hope we have demonstrated that Crumble is a useful tool capable of achieving significant space savings on diverse data. Under most conditions, Crumble has a negligible or positive impact on downstream processes. It can be run at Level 7 to effectively compress both exomes and WGS. Crumble's speed also makes it favorable to run from a cost-benefit perspective.

As genomics continues to grow in scale, we will need methods such as Crumble to ensure that, as a field, we have solutions that can scale to future needs. The value of genomics should be judged not by the magnitude of its data footprint, but by the value the field brings to society.

We thank the authors of Crumble, James Bonfield, Shane McCarthy, and Richard Durbin of the Sanger Institute, for their hard work in making this tool available to the community, and for providing helpful comments and suggestions on this blog.

New features for managing workflows and releasing them to your global network of collaborators

DNAnexus Blog Authors


Computational genomics workflows are not only used to rapidly accelerate R&D in the field of genomics; they are also increasingly used to make clinical diagnoses tailored to individual genomes. As DNAnexus has grown to support a large network of industries and collaborators, we have noticed that these workflows are often developed and shared across users and organizations on a global scale.

DNAnexus workflows currently provide the core functionality of allowing users to create and execute a computational workflow within a DNAnexus project. However, for users and organizations collaborating across multiple private or public projects, these local workflows may be less suitable for long-term maintenance, particularly in the context of larger organizations and consortia. In recognition of the need to manage workflows with a truly global network of collaborators, we are excited to introduce an additional suite of features that apply to objects we call ‘global workflows’.

As with DNAnexus applications, global workflows are published to a global space accessible by authorized users across projects. Like a GitHub or Docker repository, global workflows are versioned and updated under a globally unique name. Global workflows can be tagged, associated with broad categories (e.g. ‘read mapping’, ‘germline variant calling’, ‘somatic variant calling’, ‘tumor-normal variant calling’), and defined to run across cloud regions and providers. They can also be developed by a specified set of users and subsequently published, or released, to a larger set of authorized users who can run but not modify the workflow. Together, these features empower workflow developers to better share and advertise their workflows to a broad set of users and organizations across multiple regions and clouds.

A user creates a global workflow in essentially the same way as a regular workflow (see this tutorial for more details on how to create a global workflow). In fact, existing workflows on the DNAnexus platform can be converted to global workflows in a straightforward way. Since workflows written in CWL or WDL can be directly converted into workflows on our platform, these workflows can also be easily converted to global workflows. As a result, portable workflows can be imported to our platform and used in a way that meets organizational needs for access control and collaboration at scale.
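As a rough sketch of what this can look like from the CLI (this reflects our understanding of the current dx-toolkit options; the directory name is illustrative, the publish syntax is an assumption, and the tutorial linked above is the authoritative reference):

$ dx build --globalworkflow my_workflow_dir/     # build a global workflow from a dxworkflow.json source directory
$ dx publish gatk4_germline_bp_fq_hs38dh/1.0.0   # assumed syntax; releases a version so authorized users can run it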

To illustrate the use of global workflows, we have published a public workflow available to all users of our platform. For example, from our CLI, you can run:

$ dx find globalworkflows
GATK4 Germline Best Practice FASTQ to VCF (hs38DH) (gatk4_germline_bp_fq_hs38dh), v1.0.0

Here, you can see that a GATK4 best-practice pipeline is available for you to use. You can treat this workflow name like that of any other global application on the platform. Examples of how to use these features are described in more detail here.
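For example, a run might look like the following; the input names here are hypothetical, so check the workflow's input spec (e.g. via dx run gatk4_germline_bp_fq_hs38dh --help) before running:

$ dx run gatk4_germline_bp_fq_hs38dh \
    -i fastq_r1=sample_R1.fastq.gz \
    -i fastq_r2=sample_R2.fastq.gz \
    --destination project-xxxx:/results \
    -y --watch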

Workflow release management features were built by the Developer Experience team at DNAnexus. Thanks to the DNAnexus Science team for contributions to the design of this feature. Please see our documentation for a tutorial on how to use these features and contact support@dnanexus.com if you have any feedback or questions.