Authors
Ali Soroush, Benjamin S Glicksberg, Eyal Zimlichman, Yiftach Barash, Robert Freeman, Alexander W Charney, Girish N Nadkarni, Eyal Klang
Publication date
2024/4/25
Journal
NEJM AI
Volume
1
Issue
5
Pages
AIdbp2300040
Publisher
Massachusetts Medical Society
Description
Background
Large language models (LLMs) have attracted significant interest for automated clinical coding. However, early data show that LLMs are highly error-prone when mapping medical codes. We sought to quantify and benchmark LLM medical code querying errors across several available LLMs.
Methods
We evaluated GPT-3.5, GPT-4, Gemini Pro, and Llama2-70b Chat performance and error patterns when querying medical billing codes. We extracted 12 months of unique International Classification of Diseases, 9th edition, Clinical Modification (ICD-9-CM), International Classification of Diseases, 10th edition, Clinical Modification (ICD-10-CM), and Current Procedural Terminology (CPT) codes from the Mount Sinai Health System electronic health record (EHR). Each LLM was provided with a code description and prompted to generate a billing code. Exact match accuracy and other performance metrics …
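The exact match metric named in the Methods can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the whitespace and case normalization step, and the sample codes are all assumptions made for illustration only.

```python
def normalize(code: str) -> str:
    """Assumed normalization: trim whitespace and uppercase the code."""
    return code.strip().upper()


def exact_match_accuracy(predicted: list[str], reference: list[str]) -> float:
    """Fraction of generated codes that equal the reference code exactly."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predicted, reference))
    return hits / len(reference)


# Illustrative inputs only; these are not data from the study.
reference_codes = ["E11.9", "I10", "99213"]
generated_codes = ["E11.9", "i10 ", "99214"]
accuracy = exact_match_accuracy(generated_codes, reference_codes)
```

Under this assumed normalization, two of the three illustrative codes match, so `accuracy` is 2/3. A stricter variant would skip normalization and compare the raw strings.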