RI: Small: A Unified Compositional Model for Explainable Video-based Human Activity Parsing

Project: Research project

Project Details


The key issue in visual recognition is to cope with the uncertainty in visual patterns that generally
induce enormous variations in their visual appearances. Deep network-based methods have recently reported
impressive results on many visual recognition tasks, forcefully learning a highly non-linear mapping from
the visual space to the semantic targets in an end-to-end fashion. They do not explicitly explore and exploit
the structure of the visual space, but rather depending on the coverage of the huge amount of training data and
on computationally intensive training. However, in practice, it is not always possible to collect “sufficient”
training data that is able to well cover the complicated visual variations.
The diversity of the visual appearances is caused by many reasons. An important one is the variation
of structural composition. A larger visual pattern can be decomposed by a set of its component smaller
patterns. It is the structural composition of the smaller patterns that produce the complex variations in the
larger patterns. This is not well investigated and understood, and thus deserves more studies.
Intellectual Merit: The major goal of this research is to develop a novel end-to-end visual compositional
model that is able to effectively learn complex semantic concepts through “small data”, while achieving
good generalizability and providing explainable learning results. It is focused on:
� A principled model and its theoratical foundation. The two critical issues for visual structure include
structure modeling and visual appearance modeling. The proposed model addresses this two issues
through one unified approach in an end-to-end fashion. To handle the structural diversity and uncertainty,
it designs a stochastic grammar based on the probabilistic And-Or graph to model the structural
composition. In addition, this structural decomposition is grounded to images via deep networks to
handle the visual appearance uncertainty.
� Pattern mining for structual learning. To enable effective structural learning, this new model exploits
data-driven pattern mining to discover significant structural components, so as to generate structural
proposals in learning. This is different from existing approaches that either rely on the prior domain
knowledge to design the compositional model, or pursue the huge solution space greedily.
� Solid case studies on video human activity understanding. The research is substantialized via a challenging
task of video human activity recognition and understanding. Human activities are complex
compositions of human actions, body movements, and interactions with the environment. This incurs
almost limitless variations in their visual evidence. This research aims to more generalizable and
explainable analysis of video human activities.
� Tools and prototype systems. Effective and efficient tools are developed for human detection, articulated
body pose estimation and tracking, and contextual object discovery that many computer vision
tasks. Prototype systems are developed for video-based human perception and activity analysis.
Exploiting the structural composition of the visual patterns, this new model is able to cope with the
enormous visual variability by recursively dividing-and-conquering the uncertainties of its components,
so that the learning can be achieved via much smaller data. In addition, the inference based on such a
compositional model produces semantic parsing of visual data, and thus providing explainable interpretation
and leading to understanding. More
Effective start/end date9/1/188/31/21


  • National Science Foundation (IIS-1815561)


Explore the research topics touched on by this project. These labels are generated based on the underlying awards/grants. Together they form a unique fingerprint.