In our previous blog post, we explored a case study in data extraction, where we used Sysrev's machine learning capabilities to optimize both the screening and data extraction phases of a strategic review of the substance Mangiferin.  

This post will discuss analyses of the produced data. But first, some quick context.  

The purpose of the strategic review was to inform a decision: whether Mangiferin warranted further research into its potential use as a non-food additive. For that reason, our objective was to extract data from in vivo studies on Mangiferin. In the end, over 3500 labels were extracted from 292 articles.

You can see exactly which pieces of information were extracted below.

Label Definition for Project 21696 - Mangiferin Data Extraction

Sysrev supports three types of labels (the pieces of information to be extracted): Boolean, Categorical, and String. Shown below are the answer counts for each boolean and categorical label from the Mangiferin review. Beyond showing the counts, each bar is a link to the Articles Dashboard filtered to that specific label value, which is a useful way to sort articles for further review.

While boolean and categorical labels allow for quick analyses, string labels take a bit more work. The issue is that string label answers can differ slightly in wording while conveying the same information.

For that reason, we recommend using categorical labels wherever possible, because they force a standardization of results. That said, there are always situations where only string labels will work, especially early in research when the overall scope is still being explored.
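
To illustrate why free-text answers are harder to count than categorical ones, here is a minimal Python sketch. The answers and the `normalize` helper are hypothetical and are not taken from the Mangiferin review; the point is only that trivially different spellings count as separate values until they are normalized.

```python
from collections import Counter

def normalize(answer: str) -> str:
    """Lowercase, turn hyphens into spaces, and collapse repeated whitespace."""
    return " ".join(answer.lower().replace("-", " ").split())

# Hypothetical reviewer answers for a string label (illustration only).
raw_answers = ["Wistar rat", "wistar  rat", "Wistar-rat ", "C57BL/6 mouse"]

# Without normalization every spelling is its own value.
raw_counts = Counter(raw_answers)

# After normalization the three "Wistar rat" variants collapse into one.
clean_counts = Counter(normalize(a) for a in raw_answers)
```

Normalization like this only handles surface differences. Answers that use different words for the same concept need the similarity-based grouping discussed later in this post.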

The Mangiferin review contained 8 string labels. Some string labels existed to give additional context to categorical labels. For example, Species was a categorical label with preset options, whereas Species Detailed was a string label. This let reviewers record additional detail, such as a specific mouse strain when one was stated in the literature. For quick analysis and grouping, the data can be filtered simply by "mouse," yet the additional information still exists should it become relevant.

In other cases, there was simply too much variability in potential answers to use anything other than a string label.  This was the case with three important labels: Disease, Outcomes, and Dose.

Dose was the easiest String label to analyze as we had stipulated that doses be given in mg/kg.  Were that not the case, we would have had to perform unit conversions before comparing.  

The histogram below shows the dose distribution, binned into 10 mg/kg increments, for the 292 articles. The majority of the studies focused on doses of 100 mg/kg or less.
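
The binning step can be sketched in a few lines of Python. This is a minimal sketch, not the code used to produce the figure: it assumes the dose answers have already been parsed into numbers in mg/kg, and the `doses` list and `bin_doses` helper are hypothetical.

```python
from collections import Counter

BIN_WIDTH = 10  # mg/kg, the bin size described above

def bin_doses(doses_mg_per_kg, width=BIN_WIDTH):
    """Count doses per half-open bin [start, start + width), keyed by bin start."""
    counts = Counter(int(d // width) * width for d in doses_mg_per_kg)
    return dict(sorted(counts.items()))

# Hypothetical doses in mg/kg (illustration only, not the review data).
doses = [5, 10, 12.5, 20, 40, 40, 100, 200]
histogram = bin_doses(doses)

# Print a simple text histogram, one row per non-empty bin.
for start, n in histogram.items():
    print(f"{start:>4}-{start + BIN_WIDTH:<4} mg/kg  {'#' * n}")
```

Because the bins are half-open, a dose of exactly 100 mg/kg lands in the 100-110 bin here. Whether a boundary value belongs to the lower or upper bin is a convention worth stating alongside any such chart.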

Dose Distribution for Project 21696

Disease and Outcomes required more effort, again simply because of variation in input. All told, there were 191 unique disease descriptions. To discover which diseases were most prevalent, we clustered the disease descriptions using a text similarity tool built on word embeddings, in which descriptions with similar meanings tend to receive higher similarity scores and therefore cluster together.
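
The general idea of grouping descriptions by pairwise similarity can be sketched as follows. This is not the tool used in the review: to stay self-contained it swaps word embeddings for character-level similarity from Python's standard `difflib` module, and it uses simple single-linkage grouping with a cutoff. The descriptions, the `cluster` helper, and the 0.6 threshold are all assumptions for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for embedding similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster(descriptions, threshold=0.6):
    """Single-linkage grouping: link every pair scoring at or above the threshold."""
    parent = list(range(len(descriptions)))

    def find(i):
        # Follow parent pointers to the group's root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(descriptions)), 2):
        if similarity(descriptions[i], descriptions[j]) >= threshold:
            parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i, text in enumerate(descriptions):
        groups.setdefault(find(i), []).append(text)
    return list(groups.values())

# Hypothetical disease descriptions (illustration only).
descriptions = [
    "diabetes mellitus",
    "type 2 diabetes mellitus",
    "diabetes mellitus type 2",
    "hepatic fibrosis",
    "liver fibrosis",
]
clusters = cluster(descriptions)
```

Character-level similarity only groups "hepatic fibrosis" with "liver fibrosis" because they share the word "fibrosis". It would miss true synonyms with no shared spelling, which is the gap an embedding-based similarity score is meant to close.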

Left: 191 unique descriptions, ordered in a dendrogram based on similarity score. Right: Zoomed in

After clustering the diseases, we counted how often each cluster appeared in the reviewed articles. The results give a quick glimpse into which Mangiferin uses have already been researched. As shown below, the "diabetes mellitus" cluster, which contained 13 unique disease descriptions, appeared most often in the reviewed literature.
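
Once each description is assigned to a cluster, the counting step is straightforward. The sketch below is a minimal illustration under stated assumptions: the article IDs, descriptions, and the `cluster_of` mapping are hypothetical, and it counts each cluster at most once per article, which is one reasonable convention among several.

```python
from collections import Counter

# Hypothetical mapping from raw description to cluster name (illustration only).
cluster_of = {
    "diabetes mellitus": "diabetes mellitus",
    "type 2 diabetes mellitus": "diabetes mellitus",
    "streptozotocin-induced diabetes": "diabetes mellitus",
    "liver fibrosis": "liver fibrosis",
}

# Hypothetical article ID -> disease descriptions extracted from that article.
article_diseases = {
    101: ["type 2 diabetes mellitus"],
    102: ["diabetes mellitus", "liver fibrosis"],
    103: ["streptozotocin-induced diabetes", "diabetes mellitus"],
}

def count_clusters(article_diseases, cluster_of):
    """Count the number of articles in which each cluster appears at least once."""
    counts = Counter()
    for descriptions in article_diseases.values():
        # A set ensures an article with two descriptions from the same
        # cluster contributes only one count to that cluster.
        counts.update({cluster_of[d] for d in descriptions})
    return counts

cluster_counts = count_clusters(article_diseases, cluster_of)
top_cluster, top_count = cluster_counts.most_common(1)[0]
```

Counting per article rather than per label answer keeps a single study that lists several closely related descriptions from inflating its cluster.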

Left: Dendrogram of unique disease descriptions ordered by similarity. Right: Count of Disease Clusters 

We performed an equivalent exercise for the Outcomes.  As one might expect, "anti-diabetic effect" was the most common outcomes cluster – corresponding quite well to the most common disease cluster: "diabetes mellitus."

Count of Outcomes Cluster for Project 21696

When extracted and analyzed properly, data is the greatest informer of the modern age.  Extracting the correct data requires two things: 1) asking the right questions, and 2) having the tools necessary to review whichever documents contain the data.  

Sysrev is a platform specifically designed for document review and data extraction. Its greatest strength is its versatility, both in the types of information that can be extracted and in its unique ability to integrate with any data source. To learn more about custom integrations, contact us at