Abstract
This thesis presents solutions for human action analysis by learning local visual features as structured data. A human action in video, represented as a set of local features, is a structured data type. Hence, learning and inference on human actions can be regarded as structured analysis tasks, in which either the input or the output domain is non-vectorial. Existing approaches to human action analysis rely on either orderless or holistic representations of human action. Both have drawbacks: orderless representations convey limited information about action characteristics, while holistic representations require impractical prior assumptions about the action context. The work in this thesis approaches human action analysis from a local feature perspective to build a framework invariant to local context, while also integrating the overall action structure to produce a generic model for each action type. These two goals are accomplished using different instances of structured learning on local invariant space-time features.