A wealth of medical knowledge is locked away in millions of academic articles. If Natural Language Processing is the key to unlocking it, literature reviews are the key-makers.  

This post will show how data obtained from literature reviews can seamlessly inform both models and actionable applications – using the real-world examples of the Gene Hunter Project and WhichGenesMatter.com.

When exploring medical literature, it's helpful to have a goal. In the case of the Gene Hunter Project, our goal was to automatically correlate specific genes with specific medical terms. Step one was to create the training data. Using Sysrev, reviewers were paid $0.50 per task to read medical abstracts and identify genes. Paid reviewers annotated 10,000 sentences in just two weeks.

Epigenetic Silencing of the mutL homolog 1 ( MLH1 GENE ) Promoter in Relation to the Development of Gastric Cancer (GC) and its use as a Biomarker for Patients with Microsatellite Instability.
An example gene annotation visualized using spaCy visualizers. Reviewers at sysrev.com/p/3144 were asked to identify genes in medical abstracts.

Because all of the annotation data generated by the reviewers is open access, it can easily be used to train text annotation models. If you are interested, this blog post describes how to build gene annotation models in under 10 minutes using the Gene Hunter Project data. By giving our users 100% flexibility in defining their labels (thereby customizing the resulting exported database), we hope to support many machine learning projects.
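To make the training step a little more concrete, here is a minimal sketch of how an annotated sentence could be turned into character-offset spans, the `(text, {"entities": [...]})` shape that spaCy's NER training examples are commonly built from. The function name `to_training_example` and the input layout (a sentence plus a list of annotated gene strings) are assumptions for illustration; the actual Sysrev export format may differ.

```python
from typing import Dict, List, Tuple

# (text, {"entities": [(start, end, label), ...]})
Example = Tuple[str, Dict[str, List[Tuple[int, int, str]]]]


def to_training_example(
    text: str, gene_mentions: List[str], label: str = "GENE"
) -> Example:
    """Locate each annotated gene string in the text and record its offsets.

    Hypothetical helper: uses plain substring search, so it assumes the
    annotated string appears verbatim in the sentence.
    """
    spans = []
    for mention in gene_mentions:
        start = text.find(mention)
        while start != -1:
            end = start + len(mention)
            spans.append((start, end, label))
            # Keep searching after this match to catch repeated mentions.
            start = text.find(mention, end)
    return text, {"entities": sorted(spans)}


sentence = (
    "Epigenetic Silencing of the mutL homolog 1 (MLH1) Promoter "
    "in Relation to the Development of Gastric Cancer"
)
example = to_training_example(sentence, ["MLH1"])

# Each span should slice back to the annotated gene string.
for start, end, span_label in example[1]["entities"]:
    print(sentence[start:end], span_label)
```

Substring search is a simplification: it would also match a gene symbol embedded inside a longer token, so a real pipeline would more likely carry over the offsets recorded by the annotation tool.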

Back to our example – by running the NER models on the abstracts returned by PubMed queries, gene counts can be associated with medical concepts. Put another way, we can see which genes are most statistically 'relevant' to any given medical term. We've implemented this process in our toy application: WhichGenesMatter.com. Simply type in a medical query and see which genes are detected in the resulting abstracts.
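The counting step described above can be sketched in a few lines. This is not the code behind WhichGenesMatter.com: `detect_genes` is a hypothetical stand-in for the trained NER model (here a simple lookup against a tiny hand-picked set), the abstracts are invented toy sentences rather than PubMed results, and counting each gene once per abstract is an assumption about how 'relevance' might be scored.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical stand-in vocabulary; a real system would use the NER model.
KNOWN_GENES = {"HER2", "BRCA1", "BRCA2", "TP53"}


def detect_genes(abstract: str) -> List[str]:
    """Placeholder for the NER model: return known gene symbols in the text."""
    tokens = re.findall(r"[A-Za-z0-9]+", abstract)
    return [token for token in tokens if token in KNOWN_GENES]


def rank_genes(abstracts: Iterable[str], top_n: int = 20) -> List[Tuple[str, int]]:
    """Count how many abstracts mention each gene and return the top genes."""
    counts: Counter = Counter()
    for abstract in abstracts:
        # set() ensures a gene is counted at most once per abstract.
        counts.update(set(detect_genes(abstract)))
    return counts.most_common(top_n)


# Invented toy abstracts standing in for the results of a 'breast cancer' query.
abstracts = [
    "HER2 amplification defines an aggressive breast cancer subtype.",
    "BRCA1 and BRCA2 mutations raise breast cancer risk; HER2 status guides therapy.",
    "TP53 is frequently mutated, and HER2-positive tumours respond to targeted therapy.",
]

ranking = rank_genes(abstracts)
for gene, n_abstracts in ranking:
    print(gene, n_abstracts)
```

Counting abstracts rather than raw mentions keeps a single abstract that repeats one gene many times from dominating the ranking; raw mention counts would be an equally simple alternative.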

Searching for breast cancer on whichgenesmatter.com identifies HER2 as an important gene. HER2-positive is an important type of breast cancer.

One of my favorite medical topics is longevity. Let's see if we're picking up the right genes for aging:

Top 20 genes when querying 'longevity' at whichgenesmatter.com

These don't look bad. In fact, many of these genes are mentioned in Cynthia Kenyon's (calicolabs.com) awesome TED Talk on longevity research.

WhichGenesMatter.com is our proof of concept and a first step towards the automated analysis of medical literature.

If you like this work, please subscribe to this blog or check out sysrev.com - it's free! We use this blog to post about new features (blog.sysrev.com/features), invite Sysrev users to talk about their work (blog.sysrev.com/SRG-safety), and all other Sysrev topics.

@misc{Luechtefeld2019SysrevWhichGenes,
  title={Which genes matter? Putting NER to use.},
  author={Luechtefeld,Thomas},
  year={2019 (accessed <your date>)},
  url="https://blog.sysrev.com/which-genes-matter/"
}