TY - JOUR
T1 - Subgraph augmented non-negative tensor factorization (SANTF) for modeling clinical narrative text
AU - Luo, Yuan
AU - Xin, Yu
AU - Hochberg, Ephraim
AU - Joshi, Rohit
AU - Uzuner, Ozlem
AU - Szolovits, Peter
N1 - Funding Information:
The work described was supported in part by Grant Number U54LM008748 from the National Library of Medicine and by the Scullen Center for Cancer Data Analysis.
Publisher Copyright:
© The Author 2015.
PY - 2015/9
Y1 - 2015/9
N2 - Objective: Extracting medical knowledge from electronic medical records requires automated approaches to combat scalability limitations and selection biases. However, existing machine learning approaches are often regarded by clinicians as black boxes. Moreover, training data for these automated approaches are often sparsely annotated at best. The authors target unsupervised learning for modeling clinical narrative text, aiming at improving both accuracy and interpretability. Methods: The authors introduce a novel framework named subgraph augmented non-negative tensor factorization (SANTF). In addition to relying on atomic features (e.g., words in clinical narrative text), SANTF automatically mines higher-order features (e.g., relations of lymphoid cells expressing antigens) from clinical narrative text by converting sentences into a graph representation and identifying important subgraphs. The authors compose a tensor using patients, higher-order features, and atomic features as its respective modes. We then apply non-negative tensor factorization to cluster patients, and simultaneously identify latent groups of higher-order features that link to patient clusters, as in clinical guidelines where a panel of immunophenotypic features and laboratory results are used to specify diagnostic criteria. Results and Conclusion: SANTF demonstrated over 10% improvement in averaged F-measure on patient clustering compared to widely used non-negative matrix factorization (NMF) and k-means clustering methods. Multiple baselines were established by modeling patient data using patient-by-features matrices with different feature configurations and then performing NMF or k-means to cluster patients. Feature analysis identified latent groups of higher-order features that lead to medical insights. We also found that the latent groups of atomic features help to better correlate the latent groups of higher-order features.
AB - Objective: Extracting medical knowledge from electronic medical records requires automated approaches to combat scalability limitations and selection biases. However, existing machine learning approaches are often regarded by clinicians as black boxes. Moreover, training data for these automated approaches are often sparsely annotated at best. The authors target unsupervised learning for modeling clinical narrative text, aiming at improving both accuracy and interpretability. Methods: The authors introduce a novel framework named subgraph augmented non-negative tensor factorization (SANTF). In addition to relying on atomic features (e.g., words in clinical narrative text), SANTF automatically mines higher-order features (e.g., relations of lymphoid cells expressing antigens) from clinical narrative text by converting sentences into a graph representation and identifying important subgraphs. The authors compose a tensor using patients, higher-order features, and atomic features as its respective modes. We then apply non-negative tensor factorization to cluster patients, and simultaneously identify latent groups of higher-order features that link to patient clusters, as in clinical guidelines where a panel of immunophenotypic features and laboratory results are used to specify diagnostic criteria. Results and Conclusion: SANTF demonstrated over 10% improvement in averaged F-measure on patient clustering compared to widely used non-negative matrix factorization (NMF) and k-means clustering methods. Multiple baselines were established by modeling patient data using patient-by-features matrices with different feature configurations and then performing NMF or k-means to cluster patients. Feature analysis identified latent groups of higher-order features that lead to medical insights. We also found that the latent groups of atomic features help to better correlate the latent groups of higher-order features.
KW - Natural language processing
KW - Non-negative tensor factorization
KW - Subgraph mining
KW - Unsupervised learning
UR - http://www.scopus.com/inward/record.url?scp=84953343267&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84953343267&partnerID=8YFLogxK
U2 - 10.1093/jamia/ocv016
DO - 10.1093/jamia/ocv016
M3 - Article
C2 - 25862765
AN - SCOPUS:84953343267
SN - 1067-5027
VL - 22
SP - 1009
EP - 1019
JO - Journal of the American Medical Informatics Association
JF - Journal of the American Medical Informatics Association
IS - 5
ER -