Human gesture recognition is one of the most active and promising research areas in computer vision and has attracted significant attention over the last few years. It has a wide range of applications, including human-computer interaction, motion-sensing gaming, video surveillance, rehabilitation, and smart homes. The recent introduction of cost-effective depth cameras has opened a new way to study motion analysis and gesture recognition. However, it also brings new challenges: how to generate gesture representations from depth data that properly characterize spatio-temporal structure, and how to extract and represent meaningful features from those representations. This thesis addresses these challenges with effective and efficient solutions and evaluates the proposed approaches on four challenging gesture recognition benchmarks. Extensive experiments demonstrate the superior performance of the proposed approaches.

The thesis consists of three parts, in which the proposed approaches are self-contained but closely related. In the first part, three novel frameworks are introduced for recognizing human gestures from depth maps. Firstly, an effective approach based on DMHT-PHOG is presented to recognize human gestures in depth videos. For gesture representation, a depth motion history template (DMHT) is proposed to encode temporal motion together with structural information in a compact and discriminative way. Pyramid histograms of oriented gradients (PHOG) are then computed at different levels of detail according to the selected pyramid levels. Secondly, a framework based on spatio-temporal pyramid cuboid matching (STPCM) is put forward for gesture recognition using discriminative motion information from both the spatial and temporal aspects. To retain the inherent 3D spatial information, a novel cuboid fusion scheme is developed that groups spatially dependent grids from the projected planes of the pyramid DMHT to construct spatio-temporal pyramid cuboids. Thirdly, to overcome the difficulty of discovering correlations between multiple views of depth maps, a novel method, specificity and latent correlation learning (SLCL), is proposed to learn view-specific dictionaries (specificity) and the latent information shared between views (latent correlation) for multi-view gesture recognition. The combination of specificity and latent correlation represents a gesture consistently across views for classification.

In the second part, skeletal joints are used in addition to depth maps to learn a part-based skeleton representation for gesture recognition. The human body is represented as a set of body parts, each of which consists of multiple skeletal joints. Part-based skeleton features are proposed for each body part to deal with four types of variation: viewpoint, anthropometry, execution rate, and personal style. Given the part-based features, a dictionary learning approach is proposed to learn a sub-dictionary for each body part and the correlations between them.

In the last part of the thesis, multi-modal fusion schemes are presented for gesture recognition. Because different modalities have their own relative strengths, fusing them yields a multi-modal semantic representation and improves gesture recognition performance. Various multi-modal fusion schemes are investigated at the representation level and the classifier level. To achieve stable and robust performance, a weight-learning classifier-level fusion method is proposed.
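The idea of classifier-level fusion with learned weights can be illustrated with a minimal sketch. It is not the method proposed in the thesis, whose weight-learning procedure is not described in this abstract. It assumes each modality (here hypothetically "depth" and "skeleton") has its own classifier producing per-class scores, fuses them by a weighted sum, and picks the weights by a simple grid search on validation accuracy. All function names and the toy scores are illustrative assumptions.

```python
from itertools import product


def fuse_scores(scores_per_modality, weights):
    """Weighted sum of per-modality class scores for one sample."""
    num_classes = len(scores_per_modality[0])
    return [
        sum(w * scores[c] for w, scores in zip(weights, scores_per_modality))
        for c in range(num_classes)
    ]


def predict(scores_per_modality, weights):
    """Return the index of the class with the highest fused score."""
    fused = fuse_scores(scores_per_modality, weights)
    return max(range(len(fused)), key=fused.__getitem__)


def accuracy(samples, labels, weights):
    """Fraction of samples whose fused prediction matches the label."""
    hits = sum(predict(s, weights) == y for s, y in zip(samples, labels))
    return hits / len(labels)


def learn_weights(samples, labels, num_modalities, steps=10):
    """Grid search over non-negative weights that sum to 1.

    Assumption: this stands in for whatever weight-learning rule a real
    system would use; it keeps the first weight vector with the best
    validation accuracy.
    """
    best_weights, best_acc = None, -1.0
    for combo in product(range(steps + 1), repeat=num_modalities):
        if sum(combo) != steps:
            continue
        weights = [k / steps for k in combo]
        acc = accuracy(samples, labels, weights)
        if acc > best_acc:
            best_weights, best_acc = weights, acc
    return best_weights, best_acc


# Toy validation set: each sample holds [depth_scores, skeleton_scores]
# over three gesture classes. Values are made up for illustration.
val_samples = [
    [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]],  # depth is right, skeleton is wrong
    [[0.4, 0.5, 0.1], [0.8, 0.1, 0.1]],  # depth is wrong, skeleton is right
    [[0.1, 0.2, 0.7], [0.2, 0.3, 0.5]],  # both are right
    [[0.2, 0.6, 0.2], [0.1, 0.8, 0.1]],  # both are right
]
val_labels = [0, 0, 2, 1]

depth_only_acc = accuracy(val_samples, val_labels, [1.0, 0.0])
skeleton_only_acc = accuracy(val_samples, val_labels, [0.0, 1.0])
best_weights, best_acc = learn_weights(val_samples, val_labels, num_modalities=2)

print("depth only:", depth_only_acc)
print("skeleton only:", skeleton_only_acc)
print("fused:", best_acc, "with weights", best_weights)
```

On this toy data each single modality misclassifies one sample that the other gets right, so a suitable weighting recovers both. In practice the weights would be learned on held-out data rather than on the test set.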
|Qualification|Doctor of Philosophy|
|Award date|01 Sep 2016|
|Place of Publication|Australia|
|Publication status|Published - 2016|