CAREER: An Integrated Inferential Framework for Big Data Research and Education


Description

1 Overview
This proposal aims to develop novel inferential methods for assessing the uncertainty (e.g., constructing confidence
intervals or testing hypotheses) of modern statistical procedures unique to Big Data analysis. It
will develop innovative inferential tools for a variety of machine learning methods that have not yet been
equipped with inferential power. It will also train the next generation of academic leaders with the inferential
skills needed to be competitive in the modern sciences.
This proposal is set in the context of a new generation of statistical methods developed to handle
the increasing complexity of modern data. Most of these methods, however, give only point estimates
of parameters, while practitioners typically require more sophisticated inferential statements to assess uncertainty.
For instance, in genomics, the p-value of a significance test for a biomarker is scientifically more
informative than simply reporting whether the marker is selected or not. A substantial gap therefore exists
between the newly developed methods and their scientific applications.
Classical inferential theory has lagged behind the rapid development of these new methods because of several
challenges unique to Big Data. First, high-dimensional data motivate estimators that perform model selection
and parameter estimation simultaneously, yet most classical inferential methods do not take model-selection
uncertainty into account. Second, massive data motivate heterogeneous modeling and divide-and-conquer
estimators, whereas classical inferential theory generally assumes that the data are homogeneous and stored
in a central database. Third, complex data (e.g., heavy-tailed or missing data) motivate highly robust
estimators, for which inferential theory is much less developed.
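The divide-and-conquer idea mentioned above can be illustrated with a minimal sketch: split the data into shards, estimate on each shard locally, and average the local estimates. The sketch below uses the sample mean as a deliberately simple stand-in for the more complex estimators the proposal targets; the function name is hypothetical and not part of the proposal.

```python
import random
import statistics

def divide_and_conquer_mean(data, k):
    """Illustrative divide-and-conquer estimator: split the data into k
    shards, compute a local estimate (here, the sample mean) on each shard,
    then aggregate the local estimates by averaging them."""
    shard_size = len(data) // k
    local_estimates = [
        statistics.fmean(data[i * shard_size:(i + 1) * shard_size])
        for i in range(k)
    ]
    return statistics.fmean(local_estimates)

random.seed(0)
data = [random.gauss(2.0, 1.0) for _ in range(10_000)]

full = statistics.fmean(data)            # estimate on the pooled data
dc = divide_and_conquer_mean(data, k=10)  # aggregated shard estimates
print(full, dc)
```

For the sample mean with equal-sized shards, the aggregated estimate coincides with the full-sample estimate up to rounding; for nonlinear estimators (e.g., penalized regression), the two generally differ, which is precisely what makes inference for such aggregated estimators nontrivial.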
The proposed research puts forward new inferential methods that handle the above challenges
in a general, abstract fashion. The theory and methods developed in this career development plan will
serve as a foundation for modern Big Data research and education.
2 Intellectual Merit
This proposal addresses several fundamental challenges in modern inferential analysis and will lead to the
creation of a new research area, Big Data Inference. The current literature on Big Data research focuses
mainly on developing new estimators for complex data, yet most of these estimators still lack
systematic inferential methods for uncertainty assessment. The proposed research bridges this gap by
developing a new generation of inferential methods for modern estimators unique to Big Data analysis. In
addition, the proposal will push the frontiers of modern statistical science by developing new technical tools,
ranging from nonasymptotic concentration inequalities to asymptotic limit theorems, for many complex
estimators. The current literature on Big Data education focuses mostly on teaching ‘formal’ statistical inference,
which consists of estimating population parameters with confidence intervals and testing conjectures about
parameters with hypothesis tests. The novelty of the proposed education plan lies in its introduction of
‘informal’ inferential reasoning to complement the formal kind. Such a hybrid approach allows
an easier integration of research and education under a unified framework.
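The ‘formal’ inference described above can be made concrete with a small sketch: a large-sample (normal-approximation) confidence interval for a mean, together with a two-sided p-value for a point null hypothesis. This is an illustrative example only, assuming a normal approximation; the function name is hypothetical and not from the proposal.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, stdev

def normal_ci_and_pvalue(sample, null_mean=0.0, level=0.95):
    """Large-sample confidence interval for the mean, plus a two-sided
    p-value for H0: mean == null_mean, both via a normal approximation."""
    n = len(sample)
    m = fmean(sample)
    se = stdev(sample) / sqrt(n)               # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)  # ~1.96 for a 95% interval
    ci = (m - z * se, m + z * se)
    z_stat = (m - null_mean) / se
    pval = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    return ci, pval

random.seed(1)
sample = [random.gauss(0.3, 1.0) for _ in range(400)]  # true mean 0.3

(lo, hi), p = normal_ci_and_pvalue(sample, null_mean=0.0)
print((lo, hi), p)
```

The proposal's point is that such textbook recipes do not carry over directly to the modern estimators described earlier (post-selection, distributed, or robust), which is what the proposed Big Data Inference framework is meant to address.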
3 Broader Impact
This career proposal will push the integration of statistical research and education.
Status: Active
Effective start/end date: 9/1/17 to 6/30/20

Funding

  • National Science Foundation (DMS‐1841569-001)
