TY - JOUR
T1 - Comparison of variable selection methods for clinical predictive modeling
AU - Sanchez-Pinto, L. Nelson
AU - Venable, Laura Ruth
AU - Fahrenbach, John
AU - Churpek, Matthew M.
N1 - Funding Information:
Dr. Churpek has a patent pending (ARCD. P0535US.P2) for risk stratification algorithms for hospitalized patients, and he is supported by a career development award from the National Heart, Lung, and Blood Institute (K08 HL121080). All other authors report no competing interests or sources of funding.
Publisher Copyright:
© 2018 Elsevier B.V.
PY - 2018/8
Y1 - 2018/8
N2 - Objective: Modern machine learning-based modeling methods are increasingly applied to clinical problems. One such application is in variable selection methods for predictive modeling. However, there is limited research comparing the performance of classic and modern methods for variable selection in clinical datasets. Materials and Methods: We analyzed the performance of eight different variable selection methods: four regression-based methods (stepwise backward selection using p-value, stepwise backward selection using AIC, Least Absolute Shrinkage and Selection Operator, and Elastic Net) and four tree-based methods (Variable Selection Using Random Forest, Regularized Random Forests, Boruta, and Gradient Boosted Feature Selection). We used two clinical datasets of different sizes, a multicenter adult clinical deterioration cohort and a single-center pediatric acute kidney injury cohort. Method evaluation included measures of parsimony, variable importance, and discrimination. Results: In the large, multicenter dataset, the modern tree-based Variable Selection Using Random Forest and the Gradient Boosted Feature Selection methods achieved the best parsimony. In the smaller, single-center dataset, the classic regression-based stepwise backward selection using p-value and AIC methods achieved the best parsimony. In both datasets, variable selection tended to decrease the accuracy of the random forest models and increase the accuracy of logistic regression models. Conclusions: The performance of classic regression-based and modern tree-based variable selection methods is associated with the size of the clinical dataset used. Classic regression-based variable selection methods seem to achieve better parsimony in clinical prediction problems in smaller datasets, while modern tree-based methods perform better in larger datasets.
AB - Objective: Modern machine learning-based modeling methods are increasingly applied to clinical problems. One such application is in variable selection methods for predictive modeling. However, there is limited research comparing the performance of classic and modern methods for variable selection in clinical datasets. Materials and Methods: We analyzed the performance of eight different variable selection methods: four regression-based methods (stepwise backward selection using p-value, stepwise backward selection using AIC, Least Absolute Shrinkage and Selection Operator, and Elastic Net) and four tree-based methods (Variable Selection Using Random Forest, Regularized Random Forests, Boruta, and Gradient Boosted Feature Selection). We used two clinical datasets of different sizes, a multicenter adult clinical deterioration cohort and a single-center pediatric acute kidney injury cohort. Method evaluation included measures of parsimony, variable importance, and discrimination. Results: In the large, multicenter dataset, the modern tree-based Variable Selection Using Random Forest and the Gradient Boosted Feature Selection methods achieved the best parsimony. In the smaller, single-center dataset, the classic regression-based stepwise backward selection using p-value and AIC methods achieved the best parsimony. In both datasets, variable selection tended to decrease the accuracy of the random forest models and increase the accuracy of logistic regression models. Conclusions: The performance of classic regression-based and modern tree-based variable selection methods is associated with the size of the clinical dataset used. Classic regression-based variable selection methods seem to achieve better parsimony in clinical prediction problems in smaller datasets, while modern tree-based methods perform better in larger datasets.
KW - Data interpretation
KW - Electronic health records
KW - Machine learning
KW - Models, Statistical
KW - Regression analysis
KW - Variable selection
UR - http://www.scopus.com/inward/record.url?scp=85047302217&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85047302217&partnerID=8YFLogxK
U2 - 10.1016/j.ijmedinf.2018.05.006
DO - 10.1016/j.ijmedinf.2018.05.006
M3 - Article
C2 - 29887230
AN - SCOPUS:85047302217
SN - 1386-5056
VL - 116
SP - 10
EP - 17
JO - International Journal of Medical Informatics
JF - International Journal of Medical Informatics
ER -