DataSifterText: Partially Synthetic Text Generation for Sensitive Clinical Notes

N Zhou, Q Wu, Z Wu, S Marino, ID Dinov - Journal of medical systems, 2022 - Springer
Abstract
Petabytes of health data are collected annually across the globe in electronic health records (EHRs), including significant information stored as unstructured free text. However, the lack of effective mechanisms to securely share clinical text has inhibited its full utilization. We propose a new method, DataSifterText, to generate partially synthetic clinical free text that can be safely shared between stakeholders (e.g., clinicians, STEM researchers, engineers, analysts, and healthcare providers), limiting the re-identification risk while preserving utility significantly better than suppressing or generalizing sensitive tokens. The method creates partially synthetic free-text data that inherits the joint population distribution of the original data and disguises the locations of true and obfuscated words. Under certain obfuscation levels, the resulting synthetic text was sufficiently altered, with different choices, orders, and frequencies of words compared to the original records. The differences were comparable to those of machine-generated (fully synthetic) text reported in previous studies. We applied DataSifterText to two medical case studies. In the CDC work injury application, with privacy protection applied, 60.9–86.5% of the synthetic descriptions belonged to the same cluster as the original descriptions, demonstrating better utility preservation than the naïve content-suppression method (45.8–85.7%). In the MIMIC III application, the generated synthetic data maintained over 80% of the original information regarding patients' overall health conditions. The reported DataSifterText statistical obfuscation results indicate that the technique provides sufficient privacy protection (low identification risk) while preserving population-level information (high utility).
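The same-cluster percentages quoted for the CDC work injury application can be read as a record-level agreement rate. The sketch below is only an illustration of that reading, not the authors' evaluation code: the function name `same_cluster_rate` is hypothetical, the abstract does not say which clustering procedure produced the labels, and the label values are made up.

```python
from typing import Sequence


def same_cluster_rate(original_labels: Sequence[int],
                      synthetic_labels: Sequence[int]) -> float:
    """Fraction of records whose synthetic text falls in the same cluster
    as the corresponding original text.

    Hypothetical helper: assumes both sequences are aligned record by
    record and that labels come from one shared clustering of the texts.
    """
    if len(original_labels) != len(synthetic_labels):
        raise ValueError("label sequences must align record by record")
    if not original_labels:
        raise ValueError("need at least one record")
    matches = sum(o == s for o, s in zip(original_labels, synthetic_labels))
    return matches / len(original_labels)


# Made-up cluster labels for five records (illustrative, not data from the paper).
original = [0, 1, 1, 2, 0]
synthetic = [0, 1, 2, 2, 0]

rate = same_cluster_rate(original, synthetic)
print(f"{rate:.1%}")  # prints 80.0%
```

Under this reading, a higher rate means the obfuscated descriptions stay closer to the originals at the population level, which is how the abstract frames utility preservation.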