DNAnexus Detectives: Using Amazon Web Services to Help Solve a Medical Mystery

Authors: Jason Chin, Chiao-Feng Lin, and Arkarachai Fungtammasan

Our mission, and we chose to accept it, was to join more than 100 researchers and engineers to look for answers and create insights into a real patient’s mystery medical condition.

In this case, that patient was John, aka “Undiagnosed-1”, a 33-year-old Caucasian male suffering from undiagnosed disease(s) with gastrointestinal symptoms that started from birth. In his 20s, John’s GI issues became more severe as he began to have daily lower abdominal pain characterized by burning and nausea. He developed chronic vomiting, sometimes as often as 5 times per day. Now 5’10” tall and 109 lbs, he is easily fatigued due to his limited muscle mass and low weight.

Armed with more than 350 pages of PDFs containing scanned images of John’s medical records plus a range of genetic data — from Invitae’s testing panel to whole genome shotgun sequencing from multiple technologies (Illumina, Oxford Nanopore and PacBio) — could we generate ideas for diagnosis, new treatment options or symptom management? Perhaps we could interpret variants of unknown significance or identify off-label therapeutics through mutational homogeneity.

This was the challenge set for us as part of a three-day event in June organized by SV.AI, a non-profit community designed to bring together bright minds from AI, machine learning and biology backgrounds to solve real-world problems. This was its third event. Last year, we helped apply Google’s DeepVariant to a new kind of sequencing data for a rare kidney cancer case.

Off to a flying start

At DNAnexus, our main mission is to help our customers to process large amounts of genomic data with cloud computing. It is straightforward for us to do the initial processing of the genomic data. In this case, our customers were the event’s community of genomic “hackers.” We decided to pre-process John’s genomic data, so that our fellow participants would not have to spend extra effort to go through variant calling.

Chai generated SNP and structural variant calls before the event, and made the information available to everyone who might need it.

However, genomic data was only half of the picture. Clearly, the clinical information gleaned from John’s medical records would offer some clues. But how could we get through hundreds of pages of scanned images of his medical records (shared under an MIT licence with invited participating scientists)? There had to be a smarter way to turn the records into something that we, and others, could process with scripts and programs.

Luckily, there is: Amazon Textract and Amazon Comprehend Medical.

Developed by Amazon Web Services (AWS) using modern machine learning technology, Amazon Textract is a service that automatically extracts text and data from scanned documents (think OCR on steroids). While there are many OCR software applications on the market these days, Textract provides more: it detects a document’s layout and the key elements on the page, understands the data relationships in any embedded form or table, and extracts everything with its context intact. This makes it possible for a developer to write code that further processes the output data, in text or JSON format, to extract important information more efficiently. It is important to note that at the time of this blog post, Amazon Textract is not a HIPAA eligible service. We were able to use it in this case because the patient data being analyzed was de-identified. Amazon Textract should not be used to process documents with protected health information until it has achieved HIPAA eligibility. Please check the AWS list of HIPAA eligible services to determine whether Amazon Textract is HIPAA eligible.
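
As a rough illustration of the kind of post-processing this enables, the sketch below pulls the detected lines of text out of a Textract-style JSON response. It is not the code used at the event. The response is a tiny hand-written stand-in rather than anything from John’s records; the Blocks, BlockType, Text and Confidence fields follow the general shape of Textract’s text-detection output, while the function name and the confidence threshold are illustrative assumptions.

```python
def extract_lines(response, min_confidence=80.0):
    """Return the text of LINE blocks at or above a confidence threshold."""
    lines = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE and WORD blocks; we only want whole lines
        if block.get("Confidence", 0.0) >= min_confidence:
            lines.append(block["Text"])
    return lines


# Hand-written stand-in for a Textract response (hypothetical content).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Chief complaint: chronic vomiting", "Confidence": 99.1},
        {"BlockType": "WORD", "Text": "Chief", "Confidence": 99.5},
        {"BlockType": "LINE", "Text": "illegible scribble", "Confidence": 41.0},
    ]
}

page_text = extract_lines(sample_response)
print(page_text)
```

In a real run the response would come from the Textract API itself, for example through the AWS SDK for Python, and low-confidence lines might be set aside for manual review rather than dropped.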

Another technology developed by AWS, Amazon Comprehend Medical, was used to process the output from Textract for John’s medical records. Amazon Comprehend Medical is a natural language processing service that makes it easy to use machine learning to extract relevant medical information from unstructured text. Using Amazon Comprehend Medical, you can quickly and accurately gather information, such as medical condition, medication, dosage, strength, and frequency from a variety of sources like doctors’ notes, clinical trial reports, and patient health records. Using it, we were able to extract medications, historical symptoms, and medical conditions from John’s doctors’ notes and testing/diagnostics reports.
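
To give a feel for what that structured output looks like downstream, here is a minimal sketch that groups entities from a Comprehend Medical-style response by category. The Entities, Category, Text, Score and Attributes fields follow the general shape of the service’s entity-detection response; the sample entities, the function name and the score cut-off are made up for illustration and are not taken from John’s records or from the event’s actual pipeline.

```python
from collections import defaultdict


def group_entities(response, min_score=0.8):
    """Group entity text by category, keeping any attached attributes."""
    grouped = defaultdict(list)
    for entity in response.get("Entities", []):
        if entity.get("Score", 0.0) < min_score:
            continue  # drop low-confidence detections
        # Attributes carry details such as dosage or frequency for a medication.
        attributes = {a["Type"]: a["Text"] for a in entity.get("Attributes", [])}
        grouped[entity["Category"]].append((entity["Text"], attributes))
    return dict(grouped)


# Hand-written stand-in for an entity-detection response (hypothetical content).
sample_entities = {
    "Entities": [
        {"Text": "nausea", "Category": "MEDICAL_CONDITION", "Score": 0.97},
        {
            "Text": "ibuprofen",
            "Category": "MEDICATION",
            "Score": 0.99,
            "Attributes": [
                {"Type": "DOSAGE", "Text": "200 mg"},
                {"Type": "FREQUENCY", "Text": "twice daily"},
            ],
        },
        {"Text": "xyz", "Category": "MEDICATION", "Score": 0.30},
    ]
}

summary = group_entities(sample_entities)
for category, items in summary.items():
    print(category, items)
```

Grouping by category like this makes it straightforward to hand a list of conditions or medications to another script, which is the sort of reuse the structured output is meant to allow.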

With more structured patient information than merely a collection of medical records as images, combined with the variant calls generated from the genomics data, the participants of the hackathon were able to jump right into solving the medical and genomic puzzles for John to help him and relieve his symptoms.

The results

We were happy that a couple of teams from the hackathon were able to use both the variant call set and the processed medical data that we provided.

Genomics Info Word Cloud

A nice word cloud generated from John’s medical record and genomics information by the “Too Big to Fail John” team. (Origin: https://github.com/SVAI/Undiagnosed-1/tree/master/TooBigToFail)

  • The “Thrive” team used a similar approach to find potential variant candidates by identifying variants that are commonly seen with John’s medical conditions and predicted deleterious variant calls.

Thrive Variant Calling Schema

The schema of the Thrive team approach to analyzing both the medical records and variant call set.

  • Team Crucigrama extended the scope by including other public ’omics data, such as metabolic profiles, and by applying NLP to public genomics data.

Crucigrama Problem Statement

Team Crucigrama’s problem statement to extend the traditional approach to finding new leads for Undiagnosed-1.

  • The “Beyond Undiagnosed” team also utilized the medical record in their workflow so they could gather key symptoms and diagnoses fast, to provide future care recommendations according to their findings.

Beyond Undiagnosed Extracted Symptoms

The “Beyond Undiagnosed” team used the extracted symptoms from the medical notes in their workflow for providing recommendations. All information was de-identified and did not contain PHI.

We found it very inspiring that John, who attended the event, was willing to share his medical records with all of the participants in order to help us help him — and we hope the work we did will ultimately do so.

For more information about John’s case, visit the SV.AI site, which will provide all data so that any researchers can continue working on it. We would also like to thank SV.AI for organizing this event, and Mark Weiler and Lee Black from Amazon for helping us process the data through Amazon Textract and Amazon Comprehend Medical.

Coming Soon! DNAnexus Navigation and UI Changes

To keep the DNAnexus platform easy enough for anyone to use and powerful enough for expert users, our team has made some layout and user interface (UI) changes. While these updates are relatively minor, they make the platform look different than you might be used to, so we’ve outlined them below to help you find your way around and see what’s new.

New Look

The first thing you’ll probably notice is that the user interface looks cleaner, flatter, and more modern. The visual updates to the interface make it easier to look at and faster to find what you’re looking for. Primary actions will be on the right of the screen and be immediately noticeable. Certain areas of the site won’t have this updated look yet, but every part of the site is getting revamped in the next few months, so if it hasn’t changed yet, it will soon.


Projects UI

We have redesigned the Project list page with an easily filterable list of all your projects. A new “pin” feature allows you to mark your favorite projects so that they remain at the top of the list!

Project List

Projects now display a line of summary text in the main list. You can add even longer text in the Descriptions section of the Info panel.

Reference Data List

We have added a new “info” panel which allows you to quickly inspect any project when you select its row. The info panel can be opened by clicking the “i” icon in the upper right. This shows information (metadata, project settings, project size, etc.) which is also available in the Project Settings. Now you can access this information directly from the Project list page. The info panel also lets you easily copy the project ID.

Pin Project

The context (three-dots) menu in each row gives you shortcuts to leave or delete a project (depending on your access level), share or pin it, and view project settings.

To view the contents of a project, click the project name and you will enter the Data Manager section.

Data Manager: Manage Tab 

The Manage section has many enhancements. Next to the Project name there is now a menu for quick access to common tasks such as sharing projects. Tasks such as leaving or deleting a project (formerly found in Settings) are also in the menu.

Data Manager

Project navigation has been enhanced with a collapsible tree as well as fully functional breadcrumbs.

Project Folder Tree

All action buttons (New Folder, New Workflow, Upload Data, Add Data, Copy data from project, Start Analysis) have been consolidated to the right side of the screen.

New Action Buttons

The filter bar has been redesigned and now defaults to searching within a project. 

Project Folders

Data Manager: Manage Tab: New “info” panel

We have added a new “info” panel which allows you to quickly inspect any file. The info panel can be opened by clicking the “i” icon in the upper right. You can then select any item, or multiple items, to display their properties. This shows information previously found in the info pop-up window.

You can easily copy a file or path ID from this side panel by clicking the copy icon. 

Project Info Panel

Tables are now paginated.

New folders are now created and shown at the top of the list. After you enter a name, the folder moves to its appropriate place in the sort order.

Project Folder Rename

Projects can be renamed in the Settings tab or in the Info panel on the main project list page.

Data Uploading

The data uploading dialog has been split into three discrete functions.

Data Uploader

The ability to apply tags and properties in the dialog has been removed. Tags and properties can easily be added in the new Info panel. Select multiple items to apply the same tag or property to multiple objects at once.

Note: Settings & Visualize tabs are the same with minor visual updates.

Start Analysis 

The Start Analysis dialog has been enhanced to improve usability. It has the familiar filtering mechanism to easily locate an app, applet, workflow or global workflow. You can now see the category and the author of the analysis tool. By selecting a row and opening the “i” Info panel, you can still view the inputs and outputs.

Starting Analysis

The version can be toggled in the dropdown:

Allele Frequency Calculator

Open the tool’s details to see the full information provided by the tool developer.

Tool Details Button

Tool Runner

The Tool Runner has been enhanced with a graphical representation of the analysis process. Each app or workflow has three areas that you can configure: Settings, Analysis Inputs, and Stage Settings.

Settings includes execution name, project location, output folder and optional advanced features.

Analysis Inputs is where you can select appropriate inputs and toggle batch mode. You can also now view all inputs in one location.

Stage Settings contains information about each stage of a workflow, including app version, instance type and output folder. You can change these as desired.

Tool Runner

Data Manager: Monitor Tab

The Monitor Tab also has a fresh look, with an updated filter bar UI and action buttons consolidated to the right side.

Monitor Tab

On the monitor details page, several actions have been moved or temporarily removed.

  • View Info is not currently visible (coming in our next release).
  • View Input has been removed, as the inputs and outputs are all shown on the details page.
  • Save as New Workflow has been removed.
  • The Monitor tab does not currently indicate that a job is running (coming in the next release).
  • Tags are not currently shown in the Monitor table (coming in the next release as their own column).

You can copy an execution ID to the clipboard by clicking the icon next to the ID.

Logs have a new Download button:

Logs Download Feature

By George! DNAnexus CTO George Asimenos Recognized as Top Voice in Precision Medicine

George Asimenos

With his background in comparative metagenomics, silicon compilers, and elliptic curve cryptanalysis, George Asimenos was well poised to pioneer the bioinformatics platforms needed to make precision medicine a reality. He has now been recognized alongside other genomics luminaries from academia and industry as one of the Top 25 Voices in Precision Medicine.

The Top 25 were culled from a list of 200 nominations from around the world. The BIS Research and Insight Monk initiative showcases and celebrates the diversity and talent among the interdisciplinary leaders of the healthcare industry and highlights their influence on the industry.

“We went through a rigorous process of analysing the initial pool of nominations and shortlisting the Top 25 Voices through an iterative process based on eight core parameters including product developments, publications, entrepreneurial achievements, and years of experience, among others,” said Wahid Khan, principal analyst at BIS Research.

As the Chief Technical Officer of DNAnexus, George has played a critical role in building the company’s scientific and engineering foundation and exploring ways in which our technology can be used to craft novel experiences that transcend traditional genomics boundaries.

In the Insight Monk report, he spoke about genetic testing.

“Clinical sequencing is increasingly applied in oncology to advance personalized treatment of cancer, identification and treatment of Mendelian diseases, and in prenatal genetic testing,” he said.  

Top Voices Award

Prior to DNAnexus, George conducted research at Stanford University, where he participated in early efforts to analyze the human genome as part of the ENCODE Pilot Project. He has been at the forefront of precision medicine ever since, most recently hosting a panel on the challenges in applying AI on biomedical data at the 2019 Precision Medicine World Conference.

Congratulations to George and his fellow awardees, who also include George D. Yancopoulos, President and Chief Scientific Officer of Regeneron Pharmaceuticals, Inc., one of our collaborators for the UK Biobank project.