SCH: INT: Collaborative Research: High-throughput Phenotyping on Electronic Health Records using Multi-Tensor Factorization

Project: Research project

Project Details


As the adoption of electronic health records (EHRs) increases, the complexity of EHR data is growing dramatically. EHR data now cover diverse information about patients, including diagnosis, medication, lab results, genomic information, and clinical notes. EHR data are becoming rich data sources for clinical research such as predictive modeling and risk stratification. However, such large volumes of heterogeneous, longitudinal, and noisy EHR data do not provide an intuitive and robust representation of patients. In particular, interconnections among those different information sources are difficult to model, which creates tremendous challenges in using EHRs for research. The fundamental question is how to transform complex, interconnected EHR data into concise and meaningful clinical concepts (phenotypes) about patients. Such a transformation is referred to as phenotyping. Most of the existing work on phenotyping using EHRs has been done in an ad hoc manner, where specific data processing algorithms are implemented to extract a specific phenotype on a specific dataset (e.g., type 2 diabetes, hypothyroidism, and atrial fibrillation). There is an urgent need for a scalable way to guide the development of phenotypes from EHRs. The transformation from data (EHRs) to knowledge (phenotypes) poses unique challenges: a) patient representation, b) high-throughput phenotype generation from EHRs, c) expert-guided phenotype refinement, and d) phenotype adaptation across institutions. To address those challenges, the PIs propose a general computational framework based on multi-tensor factorization for transforming EHR data into meaningful phenotypes with expert guidance.
Intellectual Merits: The PIs propose to model the EHR data of patients as multiple interconnected relations represented as tensors, such as tuples of patient-medication-diagnosis, patient-lab, and patient-symptoms. Then, the PIs propose to develop a suite of algorithms over those tensors to derive hidden concepts (phenotype candidates). Finally, the PIs propose methods to refine those candidates into interpretable phenotypes with feedback from clinical experts. Under this framework, many existing algorithms such as dimensionality reduction, topic modeling, and co-clustering can be viewed as special cases. In addition, more powerful and general factorization and refinement algorithms will be proposed, with appropriate regularization based on medical knowledge and expert feedback. The algorithms for phenotype adaptation provide novel and principled ways of addressing the phenotyping challenges that arise when dealing with multiple institutions. This project will lay the foundation for large-scale phenotype discovery research using an interdisciplinary approach from computer science and medical informatics, and enable future clinical discoveries using EHR data.
Broader Impacts: The PIs will demonstrate the broader impacts of the resulting phenotypes in diverse clinical applications, including: a) cohort construction, where case and control patients are identified with respect to a specific phenotype or phenotype combinations; b) genome-wide association study (GWAS), where target phenotypes of patients are tested against their single nucleotide polymorphisms (SNPs) based on statistical association; c) clinical predictive modeling, where a model is developed to predict a target phenotype acquired in the future using other phenotypes of the patients from the past. The proposed framework is expected to have major impacts on clinical research and operation, including clinical trial design, predictive models, epidemiology studies, and clinical decision support.
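To make the tensor view described under Intellectual Merits concrete, the following is a minimal sketch of a single patient-medication-diagnosis count tensor factorized into nonnegative components. It is an illustration only, not the PIs' algorithm: the function names (build_tensor, nonneg_cp, reconstruct), the toy records, and the choice of a standard nonnegative CP decomposition with multiplicative updates are all assumptions made for this example, and the coupling across multiple tensors, the medical-knowledge regularization, and the expert feedback loop are left out.

```python
import numpy as np

def build_tensor(records, shape):
    """Count co-occurrences of (patient, medication, diagnosis) index triples."""
    X = np.zeros(shape)
    for p, m, d in records:
        X[p, m, d] += 1.0
    return X

def khatri_rao(B, C):
    """Column-wise Kronecker product; row j*K + k holds B[j, :] * C[k, :]."""
    rank = B.shape[1]
    return np.einsum('jr,kr->jkr', B, C).reshape(-1, rank)

def unfold(X, mode):
    """Matricize X along `mode`, keeping the other axes in their original order."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def reconstruct(factors):
    """Rebuild the 3-way tensor from its CP factor matrices."""
    A, B, C = factors
    return np.einsum('ir,jr,kr->ijk', A, B, C)

def nonneg_cp(X, rank, n_iter=200, seed=0, eps=1e-9):
    """Nonnegative CP decomposition of a 3-way tensor via multiplicative updates."""
    rng = np.random.default_rng(seed)
    factors = [rng.random((dim, rank)) for dim in X.shape]
    for _ in range(n_iter):
        for mode in range(3):
            others = [factors[i] for i in range(3) if i != mode]
            kr = khatri_rao(others[0], others[1])
            numer = unfold(X, mode) @ kr
            # Gram matrix of the Khatri-Rao product is the Hadamard
            # product of the two individual Gram matrices.
            gram = (others[0].T @ others[0]) * (others[1].T @ others[1])
            denom = factors[mode] @ gram + eps  # eps avoids division by zero
            factors[mode] *= numer / denom
    return factors

# Hypothetical toy data: 4 patients, 3 medications, 3 diagnoses.
records = [(0, 0, 0), (0, 0, 0), (1, 0, 0), (2, 1, 2), (3, 1, 2), (3, 2, 2)]
X = build_tensor(records, shape=(4, 3, 3))
patients, medications, diagnoses = nonneg_cp(X, rank=2)
```

Under these assumptions, each column of the medication and diagnosis factors groups items that tend to co-occur and can be read as a phenotype candidate, while each row of the patient factor gives that patient's weight on every candidate. Thresholding those weights is one simple way to approach the cohort construction use case mentioned under Broader Impacts.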
Effective start/end date: 9/1/14 to 8/31/18


  • National Science Foundation (IIS-1417819)

