Modeling the Incompleteness and Biases of Health Data

Project: Research project

Project Details


Researchers are increasingly working to “mine” health data to derive new medical knowledge. Unlike experimental data that are collected per a research protocol, the primary role of clinical data is to help clinicians care for patients, so the procedures for its collection are not often systematic. Thus, missing and/or biased data can hinder medical knowledge discovery and data mining efforts. Existing efforts for missing health data imputation often focus on only cross-sectional correlation (e.g., correlation across subjects or across variables) but neglect autocorrelation (e.g., correlation across time points). Moreover, they often focus on modeling incompleteness but neglect the biases in health data. Modeling both the incompleteness and bias may contribute to better understanding of health data and better support clinical decision making. We propose a novel framework of Bias-Aware Missing data Imputation with Cross-sectional correlation and Autocorrelation (BAMICA), and leverage clinical notes to better inform the methods that will otherwise rely on structured health data only. In addition to evaluating its imputation accuracy, we will apply the proposed framework to assist in downstream tasks such as predictive modeling for multiple outcomes across a diverse range of clinical and cohort study datasets. Aim 1 introduces the MICA framework to jointly consider cross-sectional correlation and auto-correlation. In Aim 2, we will augment MICA to be bias-aware (hence BAMICA) to account for biases stemmed from multiple roots such as healthcare process and use them as features in imputing missing health data. This augmentation is achieved by a novel recurrent neural network architecture that keeps track of both evolution of health data variables and bias factors. In Aim 3, we will supplement unstructured clinical notes to structured health data for modeling incompleteness and biases using a novel architecture of graph neural network on top of memory network. We will apply graph neural networks to process clinical notes in order to learn proper representations as input to the memory networks for imputation and downstream predictive modeling tasks. Depending on the clinical problem and data availability, not all modules may be needed. Thus our proposed BAMICA framework is designed to be flexible and consists of selectable modules to meet some or all of the above needs. In summary, our proposal bridges a key knowledge gap in jointly modeling incompleteness and biases in health data and utilizes unstructured clinical notes to supplement and augment such modeling in order to better support predictive modeling and clinical decision making. We will demonstrate generalizability by experimenting on four large clinical and cohort study datasets, and by scaling up to the eMERGE network spanning 11 institutions nationwide. We will disseminate the open-source framework. The principled and flexible framework generated by this project will bring significant methodological advancement and have a direct impact on enhancing discovery from health data.
Effective start/end date6/1/203/31/24


  • National Library of Medicine (5R01LM013337-03)


Explore the research topics touched on by this project. These labels are generated based on the underlying awards/grants. Together they form a unique fingerprint.