Generating Synthetic Training Data for Supervised De-Identification of Electronic Health Records

Christin Seifert | Claudia Alessandra Libbi | Dolf Trieschnigg | Jan Trienes

University of Twente, Netherlands

Synthetic data is a promising solution to privacy concerns, provided that it has utility comparable to real data and preserves patient privacy. However, generating synthetic text alone is not useful for NLP because the text lacks annotations. In this work, neural language models (LSTM and GPT-2) are used to generate artificial EHR text jointly with annotations for named-entity recognition. The results show that neural language models can successfully generate artificial text with in-line annotations. Despite varying syntactic and stylistic properties, as well as topical incoherence, the generated texts are of sufficient utility for training downstream machine learning models. The synthetic data can be used as a replacement for real data when real data are unavailable or cannot be shared, and as a special form of data augmentation to generate additional training examples for ML models.
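To make the idea of text with in-line annotations concrete, the following minimal sketch converts a generated sentence with entity markers into token-level BIO labels of the kind a named-entity recognition model is trained on. The `<TypeSTART>` / `<TypeEND>` tag format, the function name `inline_to_bio`, and the example sentence are assumptions made for illustration; they are not taken from the authors' code.

```python
import re

# Hypothetical in-line markup: <TypeSTART> ... <TypeEND>.
# The tag format is an assumption for illustration only.
TAG = re.compile(r"<(\w+?)(START|END)>")


def inline_to_bio(text):
    """Turn whitespace-tokenised, in-line annotated text into (token, label) pairs."""
    pairs = []
    current = None   # entity type that is currently open, if any
    begin = False    # True until the first token of the open entity is emitted
    for tok in text.split():
        match = TAG.fullmatch(tok)
        if match:
            kind, edge = match.groups()
            if edge == "START":
                current, begin = kind, True
            else:
                current = None
            continue  # tags themselves are not emitted as tokens
        if current is None:
            pairs.append((tok, "O"))
        else:
            pairs.append((tok, ("B-" if begin else "I-") + current))
            begin = False
    return pairs


# Invented example sentence (not real patient data).
sample = "Patient <NameSTART> Jan Jansen <NameEND> seen on <DateSTART> 12-05-2020 <DateEND> ."
for token, label in inline_to_bio(sample):
    print(f"{token}\t{label}")
```

Because the annotations travel inside the text, a language model that learns to emit the markers produces training labels as a by-product of generation, with no separate annotation step.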

Input variables : Raw Textual EHR Data
Output Variables : Synthetic and Annotated Text

Metrics to Monitor

Statistical	:	Somers' D | Accuracy | Precision and Recall | Confusion Matrix | F1 Score | ROC and AUC | Prevalence | Detection Rate | Balanced Accuracy | Cohen's Kappa | Concordance | Gini Coefficient | KS Statistic | Youden's J Index
Infrastructure	:	Log Bytes | Logging/User/IAMPolicy | Logging/User/VPN | CPU Utilization | Memory Usage | Error Count | Prediction Count | Prediction Latencies | Private Endpoint Prediction Latencies | Private Endpoint Response Count
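Several of the statistical metrics listed above can be derived from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions; the function name `binary_metrics` and the example counts are hypothetical, and the code assumes every denominator is non-zero.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute common metrics from binary confusion-matrix counts.

    Assumes all denominators are non-zero (no empty class or prediction set).
    """
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also called sensitivity
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / n
    # Agreement expected by chance, used by Cohen's kappa.
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "prevalence": (tp + fn) / n,
        "detection_rate": tp / n,
        "balanced_accuracy": (recall + specificity) / 2,
        "cohens_kappa": (accuracy - p_chance) / (1 - p_chance),
        "youdens_j": recall + specificity - 1,
    }


# Invented counts, for illustration only.
for name, value in binary_metrics(tp=40, fp=10, fn=20, tn=30).items():
    print(f"{name}: {value:.3f}")
```

For a de-identification tagger these counts would typically be taken per entity type, with one confusion matrix for each class treated as "positive" against all others.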

Visit Model : github.com

Additional links : mdpi.com

Model Category	:	Public
Date Published	:	May, 2021
Healthcare Domain	:	Payer Provider
Code	:	github.com
