The prevalence of large observational databases offers potential for identifying predictive relationships among variables of interest, although observational data are generally far less informative and less reliable than experimental data. We consider the problem of selecting a subset of records from a large observational database, for the purpose of designing a small but powerful experiment involving the selected records. It is assumed that the database contains the predictor variables but is missing the response variable, and that the purpose is to fit a logistic regression model after the response is obtained via the experiment. Active learning methods, which treat a similar problem, usually select records sequentially and focus on the single objective of classification accuracy. In contrast, many emerging applications require batch sample designs and have a variety of objectives that may include classification accuracy or accuracy of the estimated parameters, the latter being more in line with the optimal design of experiments (DOE) paradigm. The aim of this paper is to explore batch sampling from databases from a DOE perspective, particularly regarding the configuration, performance, and robustness of the designs that result from the different criteria. Through extensive simulation, we show that DOE-based batch sampling methods can substantially outperform random sampling and the entropy method that is popular in active learning. We also provide insight and guidelines for selecting appropriate design criteria and modeling assumptions.
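The contrast between DOE-based and entropy-based batch selection can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a guessed parameter vector (`beta_guess`, i.e. a locally D-optimal criterion) and a simple greedy forward selection, and every function name here is hypothetical. The greedy rule adds the record that most increases the determinant of the logistic Fisher information, while the entropy rule takes the records whose predicted probability is closest to 0.5.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fisher_information(X, rows, beta_guess):
    """Logistic Fisher information sum_i w_i x_i x_i^T with w_i = p_i (1 - p_i)."""
    p = len(beta_guess)
    info = [[0.0] * p for _ in range(p)]
    for i in rows:
        prob = sigmoid(dot(X[i], beta_guess))
        w = prob * (1.0 - prob)
        for r in range(p):
            for c in range(p):
                info[r][c] += w * X[i][r] * X[i][c]
    return info


def greedy_d_optimal_batch(X, beta_guess, batch_size, ridge=1e-6):
    """Greedy forward selection for a D-criterion (an assumed heuristic).

    Starts from the information matrix ridge * I and repeatedly adds the
    record u = sqrt(w) * x that maximizes u^T M^{-1} u, which by the matrix
    determinant lemma maximizes det(M + u u^T).
    """
    p = len(beta_guess)
    m_inv = [[(1.0 / ridge if r == c else 0.0) for c in range(p)] for r in range(p)]
    scaled = []
    for x in X:
        prob = sigmoid(dot(x, beta_guess))
        s = math.sqrt(prob * (1.0 - prob))
        scaled.append([s * v for v in x])
    chosen, remaining = [], set(range(len(X)))
    for _ in range(batch_size):
        best, best_gain, best_mu = None, -1.0, None
        for i in sorted(remaining):
            mu = [dot(m_inv[r], scaled[i]) for r in range(p)]  # M^{-1} u
            gain = dot(scaled[i], mu)                          # u^T M^{-1} u
            if gain > best_gain:
                best, best_gain, best_mu = i, gain, mu
        # Sherman-Morrison update of M^{-1} after adding u u^T to M.
        for r in range(p):
            for c in range(p):
                m_inv[r][c] -= best_mu[r] * best_mu[c] / (1.0 + best_gain)
        chosen.append(best)
        remaining.remove(best)
    return chosen


def entropy_batch(X, beta_guess, batch_size):
    """Pick the records with predicted probability closest to 0.5 (maximum entropy)."""
    def distance_from_half(i):
        return abs(sigmoid(dot(X[i], beta_guess)) - 0.5)
    return sorted(range(len(X)), key=distance_from_half)[:batch_size]


# Toy database: an intercept plus one predictor on a grid (illustrative data only).
xs = [i / 10 for i in range(-30, 31)]
X = [[1.0, x] for x in xs]
beta_guess = [0.0, 1.0]  # assumed prior guess of the parameters

d_rows = greedy_d_optimal_batch(X, beta_guess, 4)
e_rows = entropy_batch(X, beta_guess, 4)
print("D-criterion x values:", sorted(xs[i] for i in d_rows))
print("Entropy x values:    ", sorted(xs[i] for i in e_rows))
```

In this toy setup the entropy rule clusters its picks around x = 0, where the guessed model is least certain, whereas the greedy D-criterion spreads records on both sides of the boundary, which yields a larger information determinant for parameter estimation. A sketch like this depends heavily on `beta_guess`, which is one reason robustness to modeling assumptions matters.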
Keywords
- active learning
- logistic regression
- optimal design of experiments
- sampling from databases
ASJC Scopus subject areas
- Safety, Risk, Reliability and Quality
- Management Science and Operations Research