Authors: Xavier Tannier, Perceval Wajsbürt, Alice Calliger, Basile Dura, Alexandre Mouchet, Martin Hilka, Romain Bey
The objective of this study is to address the critical issue of de-identification of clinical reports in order to allow access to data for research purposes, while ensuring patient privacy. The study highlights the difficulties faced in sharing tools and resources in this domain and presents the experience of the Greater Paris University Hospitals (AP-HP) in implementing a systematic pseudonymization of text documents from its Clinical Data Warehouse. We annotated a corpus of clinical documents according to 12 types of identifying entities, and built a hybrid system that merges the results of a deep learning model with those of manual rules. Our results show an overall F1-score of 0.99. We discuss implementation choices and present experiments to better understand the effort involved in such a task, including the effect of dataset size, document types, language models, and rule addition. We share guidelines and code under a 3-Clause BSD license.
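The abstract does not say how the model output and the rule output are merged, so the following is only a minimal sketch of one plausible strategy: keep every span predicted by the model, then add any rule match that overlaps none of them. The two regular expressions, the label names, the function names, and the placeholder-style masking are all hypothetical illustrations, not the AP-HP implementation; the model predictions are passed in as plain tuples rather than produced by a real model.

```python
import re
from typing import List, Tuple

# A span is (start, end, label) with an exclusive end offset.
Span = Tuple[int, int, str]

# Hypothetical rules covering two entity types; real rule sets would be far richer.
RULES = {
    "DATE": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "PHONE": re.compile(r"\b0\d(?:[ .]?\d{2}){4}\b"),
}


def rule_spans(text: str) -> List[Span]:
    """Return every rule match in the text as a labeled span."""
    return [
        (m.start(), m.end(), label)
        for label, pattern in RULES.items()
        for m in pattern.finditer(text)
    ]


def overlaps(a: Span, b: Span) -> bool:
    """True when the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def merge_spans(model: List[Span], rules: List[Span]) -> List[Span]:
    """Keep all model spans, then add rule spans that overlap none already kept.

    Assumption: the model takes priority on conflicts. The source does not
    state which side wins.
    """
    merged = list(model)
    for span in rules:
        if not any(overlaps(span, kept) for kept in merged):
            merged.append(span)
    return sorted(merged)


def mask_entities(text: str, spans: List[Span]) -> str:
    """Replace each span with a <LABEL> placeholder (a stand-in for real surrogates)."""
    pieces, cursor = [], 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])
        pieces.append(f"<{label}>")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

In use, the model spans would come from a named-entity recognizer; here a single hand-written span stands in for it, for example `merge_spans([(25, 31, "NAME")], rule_spans(text))` followed by `mask_entities(text, merged)`.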
Paper link: http://arxiv.org/pdf/2303.13451v1