human-level fluency suggests many opportunities in healthcare to reduce administrative
burden and improve quality of care. However, evaluating LLMs on realistic text generation
tasks for healthcare remains challenging. Existing question-answering datasets for
electronic health record (EHR) data fail to capture the complexity of information needs and
documentation burdens experienced by clinicians. To address these challenges, we …