Contributed by
Jose Marcano Belisario, PhD
Associate Director, Synthesis and Statistics
RTI Health Solutions
The Potential of AI in Evidence Synthesis
Artificial intelligence (AI) tools hold immense potential to address one of the most significant challenges in evidence synthesis: the sheer volume of work involved and its impact on the reliability and quality of synthesis products. Traditional methods of evidence synthesis require extensive manual effort to review, select, analyze, and integrate vast amounts of data from multiple sources. AI tools can, in theory, automate some of the manual and repetitive tasks in evidence synthesis, such as searching, screening, and data extraction. Automation would enable researchers to spend more time on the interpretation and synthesis of data. Recent improvements in large language models (LLMs) could further assist researchers with these tasks, making evidence synthesis more comprehensive and timely.
Ensuring Accuracy and Reliability of AI Tools
An important issue to consider in this context is the accuracy and reliability of AI tools in evidence synthesis. Given that systematic literature reviews (SLRs) and meta-analyses (MAs) form the cornerstone of decision-making in healthcare (e.g., policy, reimbursement, therapeutic pathways), it is crucial to ensure that they are reliable and accurate. While AI can automate many of the manual and repetitive tasks involved in evidence synthesis, there is a risk that it may introduce error and bias into SLRs and MAs; for example, by improperly excluding relevant studies, extracting data incompletely or inaccurately, or misinterpreting study results. Therefore, it is important to validate these tools and to maintain human oversight of their use.
At RTI Health Solutions, we are committed to evaluating AI tools that could be used in our evidence synthesis projects. The obvious benefit of this approach is that we can improve the efficiency of our processes without compromising the quality of our work. A less obvious benefit is the lessons we have learned about how the validity of AI tools can influence human oversight in practice. Human oversight of AI tools, or the human-in-the-loop approach, allows researchers to decide on the level of involvement of AI tools and to carefully vet the decisions made by these tools before they are incorporated into the review.
Evaluating AI Tools for Evidence Synthesis: Metrics and Practical Implications
Our initial evaluation efforts centered on the performance of AI tools compared with human decisions across key evidence synthesis workflows: searching, screening, and data extraction. For this, we used metrics such as recall and precision, which can give us a good indication of the theoretical performance of AI tools. But what do they mean in practice? For example, we recently evaluated the impact of increasing the size of the training dataset on the performance of an AI-assisted screening tool.
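In a screening context, recall is the proportion of records included by human reviewers that the tool also flags for inclusion, and precision is the proportion of records flagged by the tool that human reviewers also included. The sketch below illustrates one way to compute both against reference human decisions; the records and field names are hypothetical.

```python
# Illustrative only: computing recall and precision for an AI-assisted
# screening tool against human (reference) screening decisions.
# The records and field names below are hypothetical.

def screening_metrics(records):
    """Each record holds a human decision and an AI suggestion (True = include)."""
    tp = sum(1 for r in records if r["human_include"] and r["ai_include"])
    fp = sum(1 for r in records if not r["human_include"] and r["ai_include"])
    fn = sum(1 for r in records if r["human_include"] and not r["ai_include"])

    recall = tp / (tp + fn) if (tp + fn) else 0.0      # relevant records the AI kept
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # AI inclusions that were correct
    return recall, precision

# Hypothetical title/abstract screening decisions
records = [
    {"human_include": True,  "ai_include": True},
    {"human_include": True,  "ai_include": False},  # a relevant study the AI would miss
    {"human_include": False, "ai_include": True},
    {"human_include": False, "ai_include": False},
]

recall, precision = screening_metrics(records)
print(f"Recall: {recall:.2f}, Precision: {precision:.2f}")
```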
We then assessed whether different levels of recall could be translated into decision thresholds based on the probability of inclusion, with the aim of reducing the volume of manual screening. (For example, once recall reaches 0.80, is it possible to exclude all studies with a probability of inclusion below 0.01?) We were not able to identify such a threshold: even at very low probabilities of inclusion, the AI tool excluded a small number of relevant studies (no more than 2 per use case). Therefore, at least for an SLR, it would not be appropriate to rely solely on AI recommendations. Nonetheless, we learned a great deal about the factors that can influence the performance of AI tools and their implementation in practice. One factor in particular caught our attention: the type and complexity of a project.
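The sketch below illustrates the kind of threshold check described above: given each record's AI-assigned probability of inclusion and the human reference decision, it counts how many relevant studies would be lost, and how much manual screening would be avoided, if every record below a candidate cutoff were excluded automatically. The records and cutoff values are hypothetical.

```python
# Illustrative only: what happens if records below a probability-of-inclusion
# cutoff are excluded without manual review? Data and cutoffs are hypothetical.

def threshold_impact(records, cutoff):
    """records: list of (probability_of_inclusion, human_include) tuples."""
    auto_excluded = [r for r in records if r[0] < cutoff]
    missed_relevant = sum(1 for prob, relevant in auto_excluded if relevant)
    workload_saved = len(auto_excluded) / len(records)
    return missed_relevant, workload_saved

# Hypothetical screening output: (AI probability of inclusion, human decision)
records = [(0.002, False), (0.004, True), (0.03, False),
           (0.20, False), (0.65, True), (0.91, True)]

for cutoff in (0.01, 0.05):
    missed, saved = threshold_impact(records, cutoff)
    print(f"Cutoff {cutoff}: {missed} relevant record(s) lost, "
          f"{saved:.0%} of manual screening avoided")
```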
The Impact of Project Type and Complexity
The type and complexity of a project refer to the following:
- Methods (i.e., SLR vs. targeted literature review [TLR])
- Type of review (i.e., clinical vs. economic vs. humanistic)
- Therapeutic area
- Population, intervention, comparator, and outcomes (PICO) complexity
- Type of search strategies (i.e., single search covering multiple questions vs. separate searches resulting in individual datasets for screening)
- Data sources (i.e., bibliographic databases, clinical trial registries, conference websites, regulatory agency websites)
- Synthesis approach (i.e., narrative vs. quantitative).
These characteristics may or may not align with current AI tool capabilities. For instance, LLMs perform better with explicit prompts for structured data extraction, such as extracting safety data from peer-reviewed journal articles. Similarly, terminology in therapeutic areas such as oncology tends to be more standardized than terminology in other therapeutic areas. Working on a project related to mental health or certain chronic conditions, or a project that requires reviewing health technology assessment (HTA) reports from a wide range of agencies, may push AI tools beyond their current limits.
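As a minimal sketch of what an explicit prompt for structured extraction might look like, the example below requests safety data in a fixed JSON structure. The field names and wording are hypothetical illustrations rather than a production prompt.

```python
# Illustrative only: an explicit, structured prompt for LLM-assisted extraction
# of safety data. Field names and wording are hypothetical, not a production prompt.

EXTRACTION_PROMPT = """You are extracting safety data from a peer-reviewed article.
Return a JSON object with exactly these fields:
- "study_id": first author and publication year
- "n_randomized": total number of randomized participants (integer or null)
- "serious_adverse_events": participants with >=1 serious adverse event, per arm
- "discontinuations_due_to_ae": discontinuations due to adverse events, per arm
If a field is not reported in the article, use null. Do not infer or calculate values."""

def build_extraction_prompt(article_text: str) -> str:
    # Keep the instructions and the source text clearly separated for the model.
    return EXTRACTION_PROMPT + "\n\nArticle text:\n" + article_text

print(build_extraction_prompt("Methods: We randomized 120 participants...")[:200])
```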
Project complexity also influences AI training requirements. We have found that random selection of titles and abstracts for a training dataset works well for clinical review questions that focus on a limited number of interventions. However, when working on an economic SLR that has economic evaluation, costs and resource use, and utility components, researchers need to ensure that all of these components are represented in the training dataset (i.e., purposive rather than random selection of titles and abstracts). Moreover, we have found that, when screening for an SLR, it is best to use AI to sort titles and abstracts according to the probability of inclusion; when screening for a TLR, however, it is relatively safe to rely on the AI suggestions for inclusion or exclusion.
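As an illustration of this kind of purposive sampling, the sketch below draws a fixed number of pilot-labelled titles and abstracts from each component of a hypothetical economic SLR. The component labels, counts, and function names are illustrative assumptions, not our actual workflow.

```python
# Illustrative only: building a screening training set so that every component
# of an economic SLR (economic evaluations, costs and resource use, utilities)
# is represented, rather than sampling titles/abstracts purely at random.
# Component labels and counts are hypothetical.
import random

def build_training_set(records, per_component=50, seed=42):
    """records: list of dicts with a 'component' label assigned during piloting."""
    rng = random.Random(seed)
    training = []
    for component in sorted({r["component"] for r in records}):
        pool = [r for r in records if r["component"] == component]
        training.extend(rng.sample(pool, min(per_component, len(pool))))
    return training

# Hypothetical pilot-labelled records
records = (
    [{"id": i, "component": "economic evaluation"} for i in range(200)]
    + [{"id": 200 + i, "component": "costs and resource use"} for i in range(300)]
    + [{"id": 500 + i, "component": "utilities"} for i in range(60)]
)

training = build_training_set(records)
print({c: sum(1 for r in training if r["component"] == c)
       for c in ("economic evaluation", "costs and resource use", "utilities")})
```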
Conclusions
Based on our evaluations of AI tools in evidence synthesis to date, we are more comfortable using AI to assist with the extraction of key study characteristics from relevant sources than using it to assist with the extraction of patient characteristics and endpoint data. When deciding on the use of AI tools in evidence synthesis, researchers should have realistic expectations about what these tools can achieve given the characteristics of the project. They can then ensure that the relevant processes and procedures are in place before beginning work.
The responsible use of AI tools in evidence synthesis is paramount. Thorough evaluation before deployment ensures that these tools enhance efficiency without compromising the quality of work. Continuous evaluation is essential, given the rapid evolution of AI technologies. Human oversight remains crucial to mitigate error and bias, ensuring the reliability and accuracy of synthesis products. The effectiveness of AI tools currently varies based on the type and complexity of the project, necessitating realistic expectations and tailored implementation. By embracing these principles, we can harness the full potential of AI to advance evidence synthesis while maintaining the highest standards of quality and reliability.