Large regression data sets are now commonplace, with so many predictors that they cannot or should not all be included individually. In practice, derived predictors are relevant as meaningful features or, at the very least, as a form of regularized approximation of the true coefficients. We consider derived predictors that are the sum of some groups of individual predictors, which is equivalent to predictors within a group sharing the same coefficient. However, the groups of predictors are usually not known in advance and must be discovered from the data. In this paper we develop a coefficient tree regression algorithm for generalized linear models to discover the group structure from the data. The approach results in simple and highly interpretable models, and we demonstrated with real examples that it can provide a clear and concise interpretation of the data. Via simulation studies under different scenarios we showed that our approach performs better than existing competitors in terms of computing time and predictive accuracy.
- group structure
- homogeneous regression coefficients
- supervised clustering
ASJC Scopus subject areas
- Information Systems
- Computer Science Applications