AIMedTech

Automating the Assessment of Quality of Medical Evidence

Core team:

Simon Suster, Timothy Baldwin, Karin Verspoor

Extended team:

Yulia Otmakhova, Jey Han Lau, Antonio Jimeno Yepes, David Martinez Iraola, Artem Shelmanov, Xudong Han

Assessing Medical Evidence with Predictive Natural Language Processing

Systematic reviews are essential for evidence-based decision-making in medicine, as they synthesise all relevant published evidence on a specific clinical question. However, the process is labour-intensive and time-consuming, often requiring over 1,000 hours of manual work per review. Predictive NLP models, such as those built on transformer architectures like SciBERT, show promise in automating parts of this process, including the assessment of risk of bias (RoB) and of the overall quality of evidence. Previous work, such as RobotReviewer and Trialstreamer, has demonstrated the potential of NLP for the critical appraisal of randomised controlled trials based on their report texts. Our models go further by processing larger bodies of evidence, using the GRADE framework to formalise the problem. They handle heterogeneous inputs (numerical, categorical, and textual data), allowing for a comprehensive evaluation of the evidence. By automating these tasks, NLP can substantially reduce the workload of reviewers, speed up the review process, and help keep systematic reviews up to date, ultimately supporting better-informed clinical decisions.
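
The sketch below illustrates, in broad strokes, how a text encoder such as SciBERT can be combined with numerical and categorical evidence features for quality-of-evidence classification. It is a minimal, hypothetical example: the feature names, embedding sizes, and classification head are assumptions for illustration, not the exact EvidenceGRADEr architecture.

# Minimal sketch: SciBERT text encoding combined with numerical and categorical
# features for GRADE-style quality classification (illustrative assumptions only).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class EvidenceQualityClassifier(nn.Module):
    def __init__(self, text_model="allenai/scibert_scivocab_uncased",
                 n_numeric=4, n_categories=10, n_grades=4):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(text_model)
        hidden = self.encoder.config.hidden_size
        self.cat_emb = nn.Embedding(n_categories, 32)   # e.g. medical area (assumed feature)
        # concatenate the textual, categorical, and numerical views of the evidence
        self.head = nn.Linear(hidden + 32 + n_numeric, n_grades)

    def forward(self, input_ids, attention_mask, category_id, numeric_feats):
        text_vec = self.encoder(input_ids=input_ids,
                                attention_mask=attention_mask).last_hidden_state[:, 0]
        cat_vec = self.cat_emb(category_id)
        features = torch.cat([text_vec, cat_vec, numeric_feats], dim=-1)
        return self.head(features)   # logits over GRADE levels (high ... very low)

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
batch = tokenizer(["Population, intervention and outcome summary ..."],
                  return_tensors="pt", truncation=True)
model = EvidenceQualityClassifier()
logits = model(batch["input_ids"], batch["attention_mask"],
               category_id=torch.tensor([3]),
               numeric_feats=torch.tensor([[12.0, 1450.0, 0.8, 2.1]]))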

Publications

Automating Quality Assessment of Medical Evidence in Systematic Reviews: Model Development and Validation Study (EvidenceGRADEr)

Talks

Automated quality assessment of medical evidence to support systematic reviewing (MBZUAI, UMass BioNLP, UAntwerpen)
Using Machine Learning and Natural Language Processing to Structure Medical Evidence and Grade its Quality (MCBK, YouTube)

Acquiring Data

EvidenceGRADEr (Zenodo)
RobotReviewer test data (Zenodo)

Accessing Source Code

EvidenceGRADEr (BitBucket)
RobotReviewer (GitHub)


Advances with Large Language Models (LLMs)

Generative large language models (LLMs) hold significant potential for automating the assessment of risk of bias (RoB) in clinical trials. Traditional methods rely on supervised learning models, require extensive annotated datasets, and have become increasingly outdated with the introduction of the revised RoB2 guidelines. Our research investigates whether LLMs can accurately predict RoB using prompts based on RoB2 guidance, without extensive task-specific training data. We evaluated the performance of general and biomedical LLMs across various bias domains and found that it seldom surpasses trivial baselines, indicating that LLMs currently fall short on this task. We highlight the complexity of RoB assessment and suggest that more extensive problem decomposition and task-specific adaptation could lead to more accurate predictors, making LLMs viable tools in systematic reviews.
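
For illustration, a zero-shot prompting setup might look like the sketch below. The prompt wording, model name, and label set are assumptions, not the exact prompts or models used in our study.

# Minimal sketch of zero-shot RoB2 prompting via an OpenAI-style chat API
# (illustrative only; assumes the openai Python package and an API key).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ROB2_DOMAIN = "bias arising from the randomisation process"
LABELS = ["low risk", "some concerns", "high risk"]

def assess_rob(trial_report: str) -> str:
    prompt = (
        f"You are assessing a randomised controlled trial for {ROB2_DOMAIN} "
        f"following the RoB 2 guidance.\n\n"
        f"Trial report:\n{trial_report}\n\n"
        f"Answer with exactly one of: {', '.join(LABELS)}."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",          # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

# Example usage:
# print(assess_rob("Participants were allocated by alternation of admission order ..."))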

Ongoing work suggests that task-specific fine-tuning of LLMs using Low-Rank Adaptation (LoRA) can, for certain RoB domains, even surpass existing supervised systems based on pretrained representations. We aim to determine whether fine-tuning allows substantially less annotated data to be used while maintaining performance comparable to traditional supervised predictors. Additionally, leveraging the more plentiful RoB1 data in a transfer-learning setup could improve performance on RoB2 assessments.
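
A minimal LoRA configuration with the PEFT library is sketched below. The base model, hyperparameters, and label set are placeholders chosen for illustration, not the settings used in our experiments.

# Minimal sketch of LoRA fine-tuning for RoB classification with PEFT
# (illustrative assumptions throughout).
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base = "meta-llama/Llama-3.2-1B"   # placeholder; any suitable base model works
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=3)
tokenizer = AutoTokenizer.from_pretrained(base)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections in Llama-style models
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # only the low-rank adapters are updated

# The wrapped model can then be trained with transformers.Trainer on annotated
# (trial report, RoB label) pairs, typically with far fewer examples than full fine-tuning.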

Publications

Zero- and Few-Shot Prompting of Generative Large Language Models Provides Weak Assessment of Risk of Bias in Clinical Trials

Talks

Automating risk-of-bias assessment with generative AI (Global Evidence Summit)

Acquiring Data

RoB version 2 (Zenodo)

Accessing Source Code

Zero-Shot Prompting, In-Context Learning, and Task-Specific Fine-Tuning for Risk-of-Bias Assessment


Can Automated Evidence Assessment Be Trusted?

Our research explores the interplay between trust, reliability, fairness, and debiasing in machine learning models, with a specific focus on systematic reviewing. We found that while debiasing methods are crucial for enhancing fairness by reducing biases linked to socio-economic attributes, they can negatively impact model reliability, particularly in selective classification and out-of-distribution detection. The trade-off between fairness and reliability is strongly influenced by the distribution of target classes and protected attributes in the test set. For medical evidence assessment specifically, evidence is unequally distributed in both quantity (e.g., across different medical areas) and quality (e.g., the prevalence of high-quality evidence varies considerably across medical areas). Taking equal opportunity as the fairness principle, all evidence should be assessed with the same predictive performance, regardless of the protected category to which it belongs. Our results show that data rebalancing and training-based methods (e.g., adversarial training) are the most effective at balancing fairness and reliability. The effects of debiasing should be examined not only through aggregate measures of improvement but also on each individual protected group, to ensure that a fairer model does not come at the expense of any single group. For systematic reviewing, this means that when aiming for unbiased quality assessments, it is essential to maintain the predictive performance and reliability of those assessments so as to preserve the integrity and trustworthiness of the review process.
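
As a simple illustration of the per-group evaluation principle, the sketch below compares recall on high-quality evidence across protected groups (here, medical areas) rather than reporting only an aggregate score. The data values and group names are invented for the example.

# Minimal sketch of an equal-opportunity check: recall on the positive class
# (high-quality evidence) is computed separately for each protected group.
import numpy as np

def per_group_recall(y_true, y_pred, groups, positive=1):
    recalls = {}
    for g in np.unique(groups):
        mask = (groups == g) & (y_true == positive)
        if mask.sum() == 0:
            continue
        recalls[g] = (y_pred[mask] == positive).mean()
    return recalls

# Illustrative toy data
y_true = np.array([1, 1, 0, 1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 0, 1, 1])
areas  = np.array(["cardiology", "cardiology", "oncology", "oncology",
                   "psychiatry", "psychiatry", "cardiology", "oncology"])

recalls = per_group_recall(y_true, y_pred, areas)
print(recalls)                                        # recall per medical area
print(max(recalls.values()) - min(recalls.values()))  # equal-opportunity gap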

Publications

Analysis of Predictive Performance and Reliability of Classifiers for Quality Assessment of Medical Evidence Revealed Important Variation by Medical Area
Promoting Fairness in Classification of Quality of Medical Evidence
Uncertainty Estimation for Debiased Models: Does Fairness Hurt Reliability?

Talks

When to trust a classifier for quality assessment of medical evidence? (ICASR)