Speech-Based Emotion Recognition: Linguistic and Saliency-Based Systems

Access & Terms of Use
open access
Copyright: Wataraka Gamage, Kalani
Abstract
Speech-based emotion recognition is a research field of growing interest, which aims to identify human emotions from speech. The main contributions of this thesis revolve around the use of verbal and non-verbal vocalisation cues for speech-based emotion recognition, cues that complement the widely used acoustic features in both emotion classification and continuous emotion prediction tasks. The thesis initially explores supra-segmental feature representations generated by vectorising frame-level distribution models of Mel-frequency cepstral coefficients, as an alternative to the default acoustic supra-segmental features for emotion classification. Next, it develops approaches for incorporating the emotional saliency and pronunciation of verbal cues (lexical features) into emotion classification. Beyond lexical features, non-verbal vocal events such as laughter and sighs, expressions such as “grrr!” and “oh!”, and disfluency patterns including filled pauses such as “hmm” are also identified within the linguistic feature domain. These elements of speech are instrumental in portraying both voluntary and involuntary emotions in human communication. Despite this, they have not been exploited for emotion recognition in a fully automatic manner, and their effect on emotion recognition has not yet been adequately analysed. This thesis proposes and develops several models that implicitly utilise emotionally salient linguistic cues, including non-verbal gestures and disfluencies, for emotion classification and continuous emotion prediction, without the need for tagged, time-aligned non-verbal vocalisation labels. The proposed approaches allow emotion recognition systems to utilise linguistic information independently of manual transcripts or automatic speech recognition. Inspired by the analysis of the influence of non-verbal vocalisations on continuous emotion prediction, as well as by emotion psychology concepts related to the symbolic reference function of such expressions, the thesis further proposes a novel view of continuous emotion prediction, leading to a transparent prediction framework modelled as a time-invariant filter array, distinct from the pointwise regression mapping adopted by traditional approaches. All proposed approaches are extensively evaluated on state-of-the-art emotion databases.
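The abstract gives no implementation details, but the filter-array view it describes can be illustrated with a minimal sketch: each linguistic cue is treated as a sparse activation signal over time, passed through its own time-invariant filter, and the filter outputs are summed into a continuous emotion trajectory, in contrast to pointwise regression from frame-level features. Everything below (names, signal shapes, the hand-picked decaying-exponential responses) is an illustrative assumption, not taken from the thesis; in the thesis framework the filters would be learned from data.

```python
import numpy as np

# Illustrative sketch only: continuous emotion prediction as a
# time-invariant filter array over linguistic cue activations.
def predict_emotion(cue_activations, filters):
    """Sum of each cue signal convolved with its impulse response.

    cue_activations: dict mapping cue name -> 1-D activation signal
                     over time (e.g. 1.0 at frames where laughter occurs).
    filters:         dict mapping cue name -> 1-D finite impulse response
                     (the 'time-invariant filter' for that cue).
    Returns a predicted continuous emotion trajectory (e.g. arousal).
    """
    length = max(len(sig) for sig in cue_activations.values())
    prediction = np.zeros(length)
    for cue, signal in cue_activations.items():
        # Convolve the cue activation with its filter, trim to length.
        prediction += np.convolve(signal, filters[cue])[:length]
    return prediction

# Toy usage: laughter triggers a slowly decaying rise in arousal,
# a filled pause ("hmm") a small dip.
T = 100
laughter = np.zeros(T); laughter[20] = 1.0
filled_pause = np.zeros(T); filled_pause[60] = 1.0
filters = {
    "laughter": 0.8 * np.exp(-np.arange(30) / 10.0),      # positive, decaying
    "filled_pause": -0.3 * np.exp(-np.arange(20) / 5.0),  # small negative
}
arousal = predict_emotion(
    {"laughter": laughter, "filled_pause": filled_pause}, filters
)
print(arousal[18:25])  # arousal rises just after the laughter event
```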
Author(s)
Wataraka Gamage, Kalani
Supervisor(s)
Ambikairajah, Eliathamby
Sethu, Vidhyasaharan
Publication Year
2018
Resource Type
Thesis
Degree Type
PhD Doctorate
Files
public version.pdf (5.21 MB, Adobe Portable Document Format)