Abstract
Background: Identifying local recurrences in breast cancer from patient data sets is important for clinical research and practice. A model that uses natural language processing and machine learning to identify local recurrences in breast cancer patients can reduce the time-consuming work of manual chart review. Methods: We design a novel concept-based filter and a prediction model to detect local recurrences using electronic health records (EHRs). In the training dataset, we manually review a development corpus of 50 progress notes and extract partial sentences that indicate breast cancer local recurrence. We process these partial sentences with MetaMap to obtain a set of Unified Medical Language System (UMLS) concepts, which we call the positive concept set. We then apply MetaMap to patients' progress notes and retain only the concepts that fall within the positive concept set. These features, combined with the number of pathology reports recorded for each patient, are used to train a support vector machine (SVM) to identify local recurrences. Results: We compared our model with three baseline classifiers using either full MetaMap concepts, filtered MetaMap concepts, or bag of words. Our model achieved the best AUC (0.93 in cross-validation, 0.87 in held-out testing). Conclusions: Compared to a labor-intensive chart review, our model provides an automated way to identify breast cancer local recurrences. We expect that, with minimal adaptation of the positive concept set, this study can be replicated at other institutions with a moderately sized training dataset.
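The concept-based filter described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that MetaMap has already been run and that each note is available as a list of concept identifiers, it uses per-concept counts as features (the abstract does not say whether counts or binary indicators were used), and the names `filter_concepts`, `build_features`, and the placeholder identifiers are hypothetical.

```python
from collections import Counter


def filter_concepts(note_concepts, positive_concepts):
    """Keep only the concepts that belong to the positive concept set."""
    return [c for c in note_concepts if c in positive_concepts]


def build_features(patient_notes, positive_concepts, n_pathology_reports):
    """Build one feature vector per patient.

    The vector holds one count per positive concept (in sorted order),
    followed by the number of pathology reports for that patient.
    Using counts is an assumption made for this sketch.
    """
    vocabulary = sorted(positive_concepts)
    counts = Counter()
    for note_concepts in patient_notes:
        counts.update(filter_concepts(note_concepts, positive_concepts))
    return [counts[c] for c in vocabulary] + [n_pathology_reports]


# Placeholder identifiers for illustration only; these are not real UMLS CUIs.
positive = {"CUI_A", "CUI_B"}
notes = [["CUI_A", "CUI_X"], ["CUI_B", "CUI_A", "CUI_Y"]]

features = build_features(notes, positive, n_pathology_reports=3)
print(features)  # [2, 1, 3]
```

Vectors of this shape, one per patient, could then be passed to any SVM implementation for training; that step is omitted here because the abstract does not specify the kernel or hyperparameters.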
Original language | English (US) |
---|---|
Article number | 498 |
Journal | BMC Bioinformatics |
Volume | 19 |
DOIs | |
State | Published - Dec 28 2018 |
Funding
Research reported in this paper was supported in part by grants R21LM012618, R01LM011663, and R01LM011962 awarded by the National Library of Medicine of the National Institutes of Health. The publication fee is covered by NIH grant R21LM012618.
Keywords
- Breast cancer local recurrence
- EHR
- NLP
- SVM
ASJC Scopus subject areas
- Applied Mathematics
- Molecular Biology
- Structural Biology
- Biochemistry
- Computer Science Applications
Datasets
- Additional file 1: of Using natural language processing and machine learning to identify breast cancer local recurrence
Zeng, Z. (Contributor), Espino, S. (Contributor), Roy, A. (Contributor), Li, X. (Contributor), Khan, S. A. (Creator), Clare, S. E. (Creator), Jiang, X. (Creator), Neapolitan, R. (Creator) & Luo, Y. (Creator), figshare, 2018
DOI: 10.6084/m9.figshare.7528163, https://springernature.figshare.com/articles/dataset/Additional_file_1_of_Using_natural_language_processing_and_machine_learning_to_identify_breast_cancer_local_recurrence/7528163
Dataset
- Additional file 2: of Using natural language processing and machine learning to identify breast cancer local recurrence
Zeng, Z. (Contributor), Espino, S. (Contributor), Roy, A. (Contributor), Li, X. (Contributor), Khan, S. A. (Creator), Clare, S. E. (Creator), Jiang, X. (Creator), Neapolitan, R. (Creator) & Luo, Y. (Creator), figshare, 2018
DOI: 10.6084/m9.figshare.7528166, https://springernature.figshare.com/articles/dataset/Additional_file_2_of_Using_natural_language_processing_and_machine_learning_to_identify_breast_cancer_local_recurrence/7528166
Dataset