Batch Sample Design from Databases for Logistic Regression

Liwen Ouyang, Daniel W. Apley*, Sanjay Mehrotra

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

The prevalence of large observational databases offers potential for identifying predictive relationships among variables of interest, although observational data are generally far less informative and less reliable than experimental data. We consider the problem of selecting a subset of records from a large observational database, for the purpose of designing a small but powerful experiment involving the selected records. It is assumed that the database contains the predictor variables but is missing the response variable, and that the purpose is to fit a logistic regression model after the response is obtained via the experiment. Active learning methods, which treat a similar problem, usually select records sequentially and focus on the single objective of classification accuracy. In contrast, many emerging applications require batch sample designs and have a variety of objectives that may include classification accuracy or accuracy of the estimated parameters, the latter being more in line with the optimal design of experiments (DOE) paradigm. The aim of this paper is to explore batch sampling from databases from a DOE perspective, particularly regarding the configuration, performance, and robustness of the designs that result from the different criteria. Through extensive simulation, we show that DOE-based batch sampling methods can substantially outperform random sampling and the entropy method that is popular in active learning. We also provide insight and guidelines for selecting appropriate design criteria and modeling assumptions.
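
As a rough illustration of the DOE viewpoint described in the abstract, the Python sketch below selects a batch of records by greedily maximizing the log-determinant of the logistic-regression Fisher information (a locally D-optimal criterion), and includes an entropy-based selection as a baseline. This is not the authors' algorithm. The greedy exchange, the prior parameter guess `beta_guess`, the small `ridge` regularizer, and all function names are assumptions introduced here for illustration only.

```python
import numpy as np

def fisher_weights(X, beta):
    """Per-record information weights p(1 - p) at an assumed parameter vector."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return p * (1.0 - p)

def greedy_d_optimal(X, beta_guess, batch_size, ridge=1e-3):
    """Greedily pick rows of X that most increase log det of the information matrix.

    Assumption: beta_guess is a prior guess of the coefficients, since the
    information matrix of a logistic model depends on the unknown parameters.
    """
    n, d = X.shape
    w = fisher_weights(X, beta_guess)
    M = ridge * np.eye(d)              # regularized start so M is invertible
    available = np.ones(n, dtype=bool)
    chosen = []
    for _ in range(batch_size):
        # Matrix determinant lemma: adding record i multiplies det(M)
        # by 1 + w_i * x_i^T M^{-1} x_i, so we maximize that gain.
        gains = w * np.sum(X * np.linalg.solve(M, X.T).T, axis=1)
        gains[~available] = -np.inf    # sample without replacement
        i = int(np.argmax(gains))
        chosen.append(i)
        available[i] = False
        M += w[i] * np.outer(X[i], X[i])
    return chosen

def entropy_batch(X, beta_guess, batch_size):
    """Baseline: pick the records with the highest predictive entropy."""
    p = 1.0 / (1.0 + np.exp(-X @ beta_guess))
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    entropy = -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))
    return np.argsort(-entropy)[:batch_size].tolist()

def log_det_information(X, beta, idx, ridge=1e-3):
    """Log-determinant of the (regularized) information matrix for a batch."""
    Xs = X[idx]
    w = fisher_weights(Xs, beta)
    M = (Xs * w[:, None]).T @ Xs + ridge * np.eye(X.shape[1])
    return np.linalg.slogdet(M)[1]
```

Here `X` would hold the predictor columns available in the database (with a leading column of ones if an intercept is wanted). Comparing `log_det_information` across the greedy, entropy, and random batches gives a quick check on the parameter-estimation criterion only; it says nothing about classification accuracy, which the paper treats as a separate objective.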

Original language: English (US)
Pages (from-to): 87-101
Number of pages: 15
Journal: Quality and Reliability Engineering International
Volume: 33
Issue number: 1
State: Published - Feb 1 2017

Keywords

  • active learning
  • logistic regression
  • optimal design of experiment
  • sampling from databases

ASJC Scopus subject areas

  • Safety, Risk, Reliability and Quality
  • Management Science and Operations Research
