Identifying Breast Cancer Distant Recurrences from Electronic Health Records Using Machine Learning

Zexian Zeng, Liang Yao, Ankita Roy, Xiaoyu Li, Sasa Espino, Susan E. Clare, Seema A. Khan, Yuan Luo*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

30 Scopus citations

Abstract

Accurately identifying distant recurrences of breast cancer from electronic health records (EHRs) is important for both clinical care and secondary analysis. Although multiple applications have been developed for computational phenotyping in breast cancer, distant recurrence identification still relies heavily on manual chart review. In this study, we aim to develop a model that identifies distant recurrences of breast cancer using clinical narratives and structured data from the EHR. We applied MetaMap to extract features from clinical narratives and retrieved structured clinical data from the EHR. Using these features, we trained a support vector machine model to identify distant recurrences in breast cancer patients. We trained the model on 1396 double-annotated subjects and validated it on a held-out test set of 599 double-annotated subjects. In addition, we validated the model on a set of 4904 single-annotated subjects as a generalization test. In the held-out test and the generalization test, we obtained F-measure scores of 0.78 and 0.74 and area under the curve (AUC) scores of 0.95 and 0.93, respectively. To explore the representation learning utility of deep neural networks, we designed multiple convolutional neural networks and multilayer neural networks to identify distant recurrences. Using the same held-out test set and generalization test set, we obtained F-measure scores of 0.79 ± 0.02 and 0.74 ± 0.004 and AUC scores of 0.95 ± 0.002 and 0.95 ± 0.01, respectively. Our model can accurately and efficiently identify distant recurrences of breast cancer by combining features extracted from unstructured clinical narratives with structured clinical data.
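The abstract reports two evaluation metrics, the F-measure and the AUC. As a reminder of what each one measures, here is a minimal Python sketch that computes both from scratch. It is not the authors' evaluation code. The labels, scores, and the 0.5 decision threshold are made-up toy values, and the function names `f_measure` and `roc_auc` are hypothetical helpers introduced only for illustration.

```python
from typing import Sequence


def f_measure(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 score: harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive case outscores a random
    negative case, with ties counted as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (assumption): 1 = distant recurrence, 0 = no distant recurrence.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.6, 0.1, 0.8, 0.3, 0.05]

# Assumed decision threshold of 0.5 to turn scores into hard predictions.
preds = [1 if s >= 0.5 else 0 for s in scores]

print("F-measure:", round(f_measure(labels, preds), 3))
print("AUC:", round(roc_auc(labels, scores), 3))
```

The F-measure depends on the chosen threshold, while the AUC depends only on how the scores rank positives against negatives. This is one reason a model can show a high AUC alongside a more modest F-measure, particularly when the positive class is rare.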

Original language: English (US)
Pages (from-to): 283-299
Number of pages: 17
Journal: Journal of Healthcare Informatics Research
Volume: 3
Issue number: 3
State: Published - Sep 15 2019

Funding

This project is supported in part by NIH grant R21LM012618-01.

Keywords

  • Breast cancer
  • Computational phenotyping
  • Convolutional neural networks
  • Distant recurrence
  • Metastasis
  • Multilayer perceptron
  • NLP
  • EHR

ASJC Scopus subject areas

  • Artificial Intelligence
  • Information Systems
  • Health Informatics
  • Computer Science Applications
