The key issue in visual recognition is to cope with the uncertainty in visual patterns that generally induce enormous variations in their visual appearances. Deep network-based methods have recently reported impressive results on many visual recognition tasks, forcefully learning a highly non-linear mapping from the visual space to the semantic targets in an end-to-end fashion. They do not explicitly explore and exploit the structure of the visual space, but rather depending on the coverage of the huge amount of training data and on computationally intensive training. However, in practice, it is not always possible to collect “sufficient” training data that is able to well cover the complicated visual variations. The diversity of the visual appearances is caused by many reasons. An important one is the variation of structural composition. A larger visual pattern can be decomposed by a set of its component smaller patterns. It is the structural composition of the smaller patterns that produce the complex variations in the larger patterns. This is not well investigated and understood, and thus deserves more studies. Intellectual Merit: The major goal of this research is to develop a novel end-to-end visual compositional model that is able to effectively learn complex semantic concepts through “small data”, while achieving good generalizability and providing explainable learning results. It is focused on: � A principled model and its theoratical foundation. The two critical issues for visual structure include structure modeling and visual appearance modeling. The proposed model addresses this two issues through one unified approach in an end-to-end fashion. To handle the structural diversity and uncertainty, it designs a stochastic grammar based on the probabilistic And-Or graph to model the structural composition. In addition, this structural decomposition is grounded to images via deep networks to handle the visual appearance uncertainty. � Pattern mining for structual learning. To enable effective structural learning, this new model exploits data-driven pattern mining to discover significant structural components, so as to generate structural proposals in learning. This is different from existing approaches that either rely on the prior domain knowledge to design the compositional model, or pursue the huge solution space greedily. � Solid case studies on video human activity understanding. The research is substantialized via a challenging task of video human activity recognition and understanding. Human activities are complex compositions of human actions, body movements, and interactions with the environment. This incurs almost limitless variations in their visual evidence. This research aims to more generalizable and explainable analysis of video human activities. � Tools and prototype systems. Effective and efficient tools are developed for human detection, articulated body pose estimation and tracking, and contextual object discovery that many computer vision tasks. Prototype systems are developed for video-based human perception and activity analysis. Exploiting the structural composition of the visual patterns, this new model is able to cope with the enormous visual variability by recursively dividing-and-conquering the uncertainties of its components, so that the learning can be achieved via much smaller data. In addition, the inference based on such a compositional model produces semantic parsing of visual data, and thus providing explainable interpretation and leading to understanding. More
|Effective start/end date||9/1/18 → 8/31/22|
- National Science Foundation (IIS-1815561)
Explore the research topics touched on by this project. These labels are generated based on the underlying awards/grants. Together they form a unique fingerprint.