Large language models are poor medical coders—benchmarking of medical code querying

Authors

Ali Soroush, Benjamin S Glicksberg, Eyal Zimlichman, Yiftach Barash, Robert Freeman, Alexander W Charney, Girish N Nadkarni, Eyal Klang

Publication date

2024/4/25

Journal

NEJM AI

Pages

AIdbp2300040

Publisher

Massachusetts Medical Society

Description

Background

Large language models (LLMs) have attracted significant interest for automated clinical coding. However, early data show that LLMs are highly error-prone when mapping medical codes. We sought to quantify and benchmark LLM medical code querying errors across several available LLMs.

Methods

We evaluated GPT-3.5, GPT-4, Gemini Pro, and Llama2-70b Chat performance and error patterns when querying medical billing codes. We extracted 12 months of unique International Classification of Diseases, 9th edition, Clinical Modification (ICD-9-CM), International Classification of Diseases, 10th edition, Clinical Modification (ICD-10-CM), and Current Procedural Terminology (CPT) codes from the Mount Sinai Health System electronic health record (EHR). Each LLM was provided with a code description and prompted to generate a billing code. Exact match accuracy and other performance metrics …
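The exact match metric named in the Methods can be illustrated with a minimal sketch. The function name, the whitespace and case normalization, and the sample predictions below are assumptions made for illustration only; they are not taken from the study, whose own scoring procedure is not described in this record.

```python
from typing import Sequence


def exact_match_accuracy(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Return the fraction of predicted codes that equal the reference code.

    Assumption: codes are compared after trimming whitespace and ignoring
    case. The study may have applied a different (or no) normalization.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one code pair is required")
    matches = sum(
        p.strip().upper() == r.strip().upper()
        for p, r in zip(predicted, reference)
    )
    return matches / len(reference)


# Hypothetical example: three ICD-10-CM reference codes and three
# made-up model outputs, one of which differs in the final character.
reference_codes = ["E11.9", "I10", "J45.909"]
generated_codes = ["E11.9", "I10", "J45.901"]

accuracy = exact_match_accuracy(generated_codes, reference_codes)
print(f"Exact match accuracy: {accuracy:.2f}")
```

Exact matching is deliberately strict: a generated code that differs from the reference by a single character counts as an error, even when it falls in the same code family.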

Total citations

Cited by 8

2024: 8

Scholar articles

Large language models are poor medical coders—benchmarking of medical code querying

A Soroush, BS Glicksberg, E Zimlichman, Y Barash… - NEJM AI, 2024

A Soroush, BS Glicksberg, E Zimlichman, Y Barash… - medRxiv, 2023

Cited by 2 · Related articles · All 2 versions