TY - JOUR
T1 - On Asymptotic Distributions and Confidence Intervals for LIFT Measures in Data Mining
AU - Jiang, Wenxin
AU - Zhao, Yu
N1 - Funding Information:
Wenxin Jiang is Taishan Scholar Overseas Distinguished Specialist Adjunct Professor, Shandong University in China, and Professor of Statistics, Northwestern University, 2006 Sheridan Rd, Evanston, IL 60208 (E-mail: wjiang@northwestern.edu). Yu Zhao is Statistician at Amazon (E-mail: yuzhaonwu@gmail.com). This article is partially based on the PhD thesis of the second author. The first author was partially supported by the "111" project, grant No. B12023, at Qilu Securities Institute for Financial Studies, Shandong University in China. The authors thank Professors Tom Severini and Hongmei Jiang and the Associate Editor and the referees for helpful comments.
Publisher Copyright:
© 2015 American Statistical Association.
PY - 2015/10/2
Y1 - 2015/10/2
N2 - A LIFT measure, such as the response rate, lift, or the percentage of captured response, is a fundamental measure of effectiveness for a scoring rule obtained from data mining, which is estimated from a set of validation data. In this article, we study how to construct confidence intervals of the LIFT measures. We point out the subtlety of this task and explain how simple binomial confidence intervals can have incorrect coverage probabilities, due to omitting variation from the sample percentile of the scoring rule. We derive the asymptotic distribution using some advanced empirical process theory and the functional delta method in the Appendix. The additional variation is shown to be related to a conditional mean response, which can be estimated by a local averaging of the responses over the scores from the validation data. Alternatively, a subsampling method is shown to provide a valid confidence interval, without needing to estimate the conditional mean response. Numerical experiments are conducted to compare these different methods regarding the coverage probabilities and the lengths of the resulting confidence intervals.
AB - A LIFT measure, such as the response rate, lift, or the percentage of captured response, is a fundamental measure of effectiveness for a scoring rule obtained from data mining, which is estimated from a set of validation data. In this article, we study how to construct confidence intervals of the LIFT measures. We point out the subtlety of this task and explain how simple binomial confidence intervals can have incorrect coverage probabilities, due to omitting variation from the sample percentile of the scoring rule. We derive the asymptotic distribution using some advanced empirical process theory and the functional delta method in the Appendix. The additional variation is shown to be related to a conditional mean response, which can be estimated by a local averaging of the responses over the scores from the validation data. Alternatively, a subsampling method is shown to provide a valid confidence interval, without needing to estimate the conditional mean response. Numerical experiments are conducted to compare these different methods regarding the coverage probabilities and the lengths of the resulting confidence intervals.
KW - %response
KW - Empirical process
KW - Functional delta method
KW - Subsampling
KW - Validation data
UR - http://www.scopus.com/inward/record.url?scp=84954446428&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84954446428&partnerID=8YFLogxK
U2 - 10.1080/01621459.2014.993080
DO - 10.1080/01621459.2014.993080
M3 - Article
AN - SCOPUS:84954446428
VL - 110
SP - 1717
EP - 1725
JO - Journal of the American Statistical Association
JF - Journal of the American Statistical Association
SN - 0162-1459
IS - 512
ER -