Enormous healthcare resources are devoted to compiling electronic medical record (EMR) databases, which are used to identify disease risk factors by fitting statistical models that capture the relationship between a binary response variable Y (1 if a patient contracted the disease; 0 otherwise) and a set of predictor variables x that represent clinical and demographic data for the patient. Although the predictor data are often recorded reliably, the response data often have a shockingly high percentage of errors, because poorly trained personnel are employed to enter the ICD-9 codes for the diseases. For example, in our testbed database of 23,041 cases in the Northwestern medical system, a random sample of 20 cases that had been recorded as sudden cardiac arrest events was reviewed, and only 5 of these were found to be true events, which translates to a 75% error rate in that sample. To take these errors into account and avoid making unreliable risk assessments, it is imperative to have a doctor review a validation sample of cases to determine their true Y values. However, because doctors’ time is costly, validation sample sizes are limited. Our research objective is to develop a methodology for validation sampling and reliable risk assessment (VSRRA) with error-prone EMR data that judiciously and efficiently selects validation cases for maximum information content.
Effective start/end date: 9/1/14 → 8/31/18
- National Science Foundation (CMMI-1436574)