Marcano Belisario J, Lunan M, Hawe E, Farrelly G. Evaluation of machine learning assisted title and abstract screening in 5 clinical systematic literature reviews. Poster to be presented at ISPOR 2025; May 14, 2025; Montréal, Canada.


OBJECTIVES: Many systematic literature review (SLR) programs offer built-in machine learning-assisted screening, which calculates the probability that each citation will be advanced at the title and abstract stage. These advancement probabilities could reduce manual screening workload if a threshold for making bulk exclusions could be identified. The project objective was to assess whether an appropriate threshold can be determined.

METHODS: Screening decisions from 5 SLRs were compared with the advancement probabilities generated by Nested Knowledge across 4 training scenarios, in which the algorithm was trained on 20% (T1), 30% (T2), 40% (T3), or 50% (T4) of randomly selected citations. These SLRs covered systemic lupus erythematosus (SLE), non-small cell lung cancer, breast cancer, amyotrophic lateral sclerosis, and allergic rhinoconjunctivitis (ARC). For each scenario, the cross-validation metrics obtained after training the machine learning algorithm were recorded: recall, area under the curve (AUC), precision, F1 score, and accuracy.
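For readers less familiar with these metrics, the following is a minimal sketch of how recall, precision, F1 score, and accuracy follow from paired human and model decisions. It is not the Nested Knowledge implementation; the function name `screening_metrics` and the example data are hypothetical, and AUC is omitted because it requires the ranked probabilities rather than binary decisions.

```python
def screening_metrics(human_included, model_advanced):
    """Compute recall, precision, F1, and accuracy from paired binary decisions.

    human_included: list of bool, True if human reviewers advanced the citation.
    model_advanced: list of bool, True if the model would advance the citation.
    Both inputs are hypothetical illustrations, not data from the poster.
    """
    pairs = list(zip(human_included, model_advanced))
    tp = sum(1 for h, m in pairs if h and m)          # advanced by both
    fp = sum(1 for h, m in pairs if not h and m)      # model only
    fn = sum(1 for h, m in pairs if h and not m)      # human only (missed)
    tn = sum(1 for h, m in pairs if not h and not m)  # excluded by both

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"recall": recall, "precision": precision,
            "f1": f1, "accuracy": accuracy}


# Made-up example: 8 citations, 3 of which humans advanced.
human = [True, True, True, False, False, False, False, False]
model = [True, True, False, True, False, False, False, False]
metrics = screening_metrics(human, model)
```

In a screening context, recall is usually the metric of most concern, since a false negative corresponds to a relevant citation that would be missed.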

RESULTS: Across scenarios, recall ranged from 0.67 (ARC T1) to 0.92 (SLE T3 and SLE T4) and generally increased with larger training sets (correlation = 0.62). AUC, precision, F1 score, and accuracy were not strongly correlated with training set size (correlations of 0.43, 0.23, 0.32, and 0.29, respectively). In every SLR and scenario, at least 1 article that had been included by human reviewers was assigned a low advancement probability (from 0.00 to 0.15). Reasons included records with no abstract; trial registry records that differed in formatting from the standard journal entries in the training set; and standard journal entries that covered aspects of the SLR question not captured in the training set.
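The trade-off behind a bulk-exclusion threshold can be sketched as follows. This is an illustration under stated assumptions, not the analysis performed for the poster: the function name `bulk_exclusion_summary`, the choice to exclude citations at or below the threshold, and the example probabilities are all hypothetical.

```python
def bulk_exclusion_summary(probabilities, human_included, threshold):
    """Summarize what bulk-excluding citations at or below a threshold would do.

    probabilities: advancement probability per citation (0.0 to 1.0).
    human_included: True where human reviewers advanced the citation.
    threshold: hypothetical cut-off; citations with probability <= threshold
               are treated as bulk exclusions (an assumption of this sketch).
    """
    below = [inc for p, inc in zip(probabilities, human_included)
             if p <= threshold]
    bulk_excluded = len(below)
    missed = sum(1 for inc in below if inc)  # human includes that would be lost
    total_included = sum(1 for inc in human_included if inc)
    return {
        "bulk_excluded": bulk_excluded,
        "workload_reduction": (bulk_excluded / len(probabilities)
                               if probabilities else 0.0),
        "missed_includes": missed,
        "recall_retained": ((total_included - missed) / total_included
                            if total_included else 1.0),
    }


# Made-up example: one human-included citation sits at probability 0.10.
probs = [0.02, 0.10, 0.12, 0.40, 0.75, 0.90]
included = [False, True, False, False, True, True]
summary = bulk_exclusion_summary(probs, included, threshold=0.15)
```

In this made-up example, a threshold of 0.15 would remove half of the citations from manual screening but would also discard 1 of the 3 human-included citations, which mirrors the kind of low-probability include reported above.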

CONCLUSIONS: This project provided practical examples of the impact that cross-validation measures have on the reliability of SLR findings. It also identified the complexity of SLR questions and the representativeness of training sets as factors that can influence advancement probabilities, and thus decisions about thresholds for bulk exclusions.
