# Project Details

### Description

My research focuses on developing a new field named Combinatorial Statistics, along with its application in Brain

Science. Combinatorial Statistics studies both sampling complexity (amount of data) and computational complexity

(running time) for inferring from high dimensional distributions parameterized by discrete structures such as graphs,

partitions, permutations or multicomplexes. Such distributions play an important role in a number of important scientific

domains including computational neuroscience, genomics, social networks, coding theory and molecular and evolutionary

biology. Success on this research will lead to a new mathematical theory of finding complex structures from data, which

integrates ideas from computation, combinatorics, statistics and high dimensional probability in a unified paradigm.

To achieve this goal, I devoted my early career in establishing a new combinatorial inference theory for graphical

models. Graphical models parameterize high dimensional distributions by graphs, which encode complex conditional

independence relationships among many random variables. Combinatorial inference aims to test or assess uncertainty of

some global structural properties of the underlying graph based on data (e.g., Whether the graph is disconnected? What

is the maximum degree of the graph? Whether the graph is triangle-free?). It holds a lot of promise for modern scientific

data analysis since such global structural properties carry important scientific meanings (e.g., in studying brain networks,

the maximum degree of a node reflects the activity level of the corresponding brain region). However, until recently, no

systematic combinatorial inference theory exists in the literature. To bridge this gap, my research addresses two basic

questions: (i) What are the fundamental limits of a combinatorial inference algorithm under a computational budge? (ii).

Does an understanding of these limits direct us towards the construction of better combinatorial inference methods?

My first contribution is to establish a systematic framework which characterizes the information-theoretic lower bound

of a combinatorial inference problem under a computational budget. Unlike the classical complexity theory which mainly

exploits the Turing machine to quantify computation, my approach quantifies computation by the number of interactions

an algorithm accesses the data through a computational oracle (e.g., a gradient calculation or stochastic sampling step),

creating an oracle complexity. Compared to the Turing machine approach, this new framework is more systematic (not

case-by-case), general (analyzes a wide class of problems) and rigorous (does not rely on unproven hardness conjectures).

It also revealed several fundamental computational barriers for many combinatorial inference problems (e.g., detecting a

clique in the graph): that the computationally tractable algorithms cannot achieve the minimal statistical error rates. My

effort on this direction has led to a new research area named ‘oracle computational lower bounds’.

To match the obtained lower bounds, my second contribution is to prove that all the combinatorial inference problems

can be reduced into a multiple testing problem where the number of hypotheses is usually larger than the sample size. In

this setting, the most widely used statistical tool, the Central Limit Theorem, is no longer suitable. To make progress, we

need a fundamentally new approach that leverages the combinatorial structure of the problem. For this, I have developed

a unified methodol

Science. Combinatorial Statistics studies both sampling complexity (amount of data) and computational complexity

(running time) for inferring from high dimensional distributions parameterized by discrete structures such as graphs,

partitions, permutations or multicomplexes. Such distributions play an important role in a number of important scientific

domains including computational neuroscience, genomics, social networks, coding theory and molecular and evolutionary

biology. Success on this research will lead to a new mathematical theory of finding complex structures from data, which

integrates ideas from computation, combinatorics, statistics and high dimensional probability in a unified paradigm.

To achieve this goal, I devoted my early career in establishing a new combinatorial inference theory for graphical

models. Graphical models parameterize high dimensional distributions by graphs, which encode complex conditional

independence relationships among many random variables. Combinatorial inference aims to test or assess uncertainty of

some global structural properties of the underlying graph based on data (e.g., Whether the graph is disconnected? What

is the maximum degree of the graph? Whether the graph is triangle-free?). It holds a lot of promise for modern scientific

data analysis since such global structural properties carry important scientific meanings (e.g., in studying brain networks,

the maximum degree of a node reflects the activity level of the corresponding brain region). However, until recently, no

systematic combinatorial inference theory exists in the literature. To bridge this gap, my research addresses two basic

questions: (i) What are the fundamental limits of a combinatorial inference algorithm under a computational budge? (ii).

Does an understanding of these limits direct us towards the construction of better combinatorial inference methods?

My first contribution is to establish a systematic framework which characterizes the information-theoretic lower bound

of a combinatorial inference problem under a computational budget. Unlike the classical complexity theory which mainly

exploits the Turing machine to quantify computation, my approach quantifies computation by the number of interactions

an algorithm accesses the data through a computational oracle (e.g., a gradient calculation or stochastic sampling step),

creating an oracle complexity. Compared to the Turing machine approach, this new framework is more systematic (not

case-by-case), general (analyzes a wide class of problems) and rigorous (does not rely on unproven hardness conjectures).

It also revealed several fundamental computational barriers for many combinatorial inference problems (e.g., detecting a

clique in the graph): that the computationally tractable algorithms cannot achieve the minimal statistical error rates. My

effort on this direction has led to a new research area named ‘oracle computational lower bounds’.

To match the obtained lower bounds, my second contribution is to prove that all the combinatorial inference problems

can be reduced into a multiple testing problem where the number of hypotheses is usually larger than the sample size. In

this setting, the most widely used statistical tool, the Central Limit Theorem, is no longer suitable. To make progress, we

need a fundamentally new approach that leverages the combinatorial structure of the problem. For this, I have developed

a unified methodol

Status | Active |
---|---|

Effective start/end date | 9/15/17 → 9/14/21 |

### Funding

- Alfred P. Sloan Foundation (FG-2017-9862)