Predicting the sample size of randomized controlled trials using natural language processing

P Windisch, F Dennstädt, C Koechli, R Förster, C Schröder, DM Aebersold, DR Zwahlen
JAMIA Open, 2024, academic.oup.com
Objectives
Extracting the sample size from randomized controlled trials (RCTs) remains an obstacle to developing better search functionality and automating systematic reviews. Most current approaches rely on the sample size being explicitly mentioned in the abstract. The objective of this study was, therefore, to develop and validate additional approaches.
Materials and Methods
847 RCTs from high-impact medical journals were tagged with 6 different entities that could indicate the sample size. A named entity recognition (NER) model was trained to extract the entities and then deployed on a test set of 150 RCTs. The entities’ performance in predicting the actual number of trial participants who were randomized was assessed and possible combinations of the entities were evaluated to create predictive models. The test set was also used to evaluate the performance of GPT-4o on the same task.
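As an illustration of how extracted entities might be combined into a predictive model, the following is a minimal sketch and not the authors' method. The entity labels `TOTAL_RANDOMIZED` and `ARM_SIZE` and the function `predict_sample_size` are hypothetical; the abstract does not name the 6 entities or the combination rules that were evaluated. The sketch assumes the NER output has already been parsed into integers, and it abstains (returns `None`) when no rule applies.

```python
from typing import Dict, List, Optional

# Hypothetical entity labels, chosen for illustration only.
TOTAL_RANDOMIZED = "TOTAL_RANDOMIZED"  # assumed: an explicit "N patients were randomized" mention
ARM_SIZE = "ARM_SIZE"                  # assumed: the number of participants in a single trial arm


def predict_sample_size(entities: Dict[str, List[int]]) -> Optional[int]:
    """Combine extracted entities into one prediction, or abstain with None."""
    # Rule 1: trust an explicit total if exactly one such mention was found.
    randomized = entities.get(TOTAL_RANDOMIZED, [])
    if len(randomized) == 1:
        return randomized[0]

    # Rule 2: otherwise fall back to summing the arm sizes, if at least two arms were found.
    arms = entities.get(ARM_SIZE, [])
    if len(arms) >= 2:
        return sum(arms)

    # No rule applies: make no prediction rather than guess.
    return None
```

Stricter rules of this kind tend to answer for fewer trials but to be right more often when they do answer, which is one way combinations of entities could yield models with different characteristics.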
Results
The most accurate model could make predictions for 64.7% of trials in the test set, and 93.8% of those predictions were equal to the ground truth. GPT-4o was able to make a prediction for 94.7% of trials, and 90.8% of those predictions were equal to the ground truth.
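The two figures reported for each approach measure different things: the share of trials for which a prediction was made, and the share of those predictions that were correct. A minimal sketch of how such metrics could be computed is shown below. The function name `coverage_and_accuracy` and the convention of using `None` for "no prediction" are assumptions made for illustration, not details taken from the study.

```python
from typing import Optional, Sequence, Tuple


def coverage_and_accuracy(
    predictions: Sequence[Optional[int]],
    truths: Sequence[int],
) -> Tuple[float, float, float]:
    """Return (coverage, accuracy among answered trials, overall exact-match rate)."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")

    total = len(truths)
    # Keep only the trials for which the model made a prediction.
    answered = [(p, t) for p, t in zip(predictions, truths) if p is not None]
    correct = sum(1 for p, t in answered if p == t)

    coverage = len(answered) / total if total else 0.0
    accuracy = correct / len(answered) if answered else 0.0
    overall = correct / total if total else 0.0
    return coverage, accuracy, overall


# Toy data, not from the study: four trials, one abstention, one wrong prediction.
coverage, accuracy, overall = coverage_and_accuracy(
    [100, None, 250, 40],
    [100, 80, 250, 44],
)
```

Multiplying the two reported percentages gives the share of all test trials predicted exactly: roughly 0.647 × 0.938 ≈ 61% for the most accurate model and 0.947 × 0.908 ≈ 86% for GPT-4o.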
Discussion
This study presents an NER model that extracts several entities from the abstract of an RCT, each of which can be used to predict the sample size. The entities can be combined in different ways to obtain models with different characteristics.
Conclusion
Training an NER model to predict the sample size from RCTs is feasible. Large language models can deliver similar performance without prior training on the task, although at a higher cost due to proprietary technology and/or the required computational power.
Oxford University Press