Spoofing countermeasures for secure and robust voice authentication system: Feature extraction and modelling

Access & Terms of Use
open access
Copyright: Sriskandaraja, Kaavya
Abstract
Automatic speaker verification systems can be used without face-to-face contact, which makes them more prone to spoofing attacks than other biometric systems. The study of spoofing countermeasures has therefore become a critical area of research, and it is the principal objective of this thesis. As preliminary work, this thesis also aims to make automatic speaker verification robust to adverse noise conditions by proposing a self-adaptive voice activity detector that combines cepstral modelling and smoothed energy with effective post-processing stages. The overarching goal is thus to significantly advance the state of the art in automatic speaker verification by making such systems more secure and robust.
Spoofing attacks fall into four types: impersonation, replay, voice conversion (VC) and speech synthesis (SS). Among these, speech synthesis, voice conversion and replay attacks have been identified as the most effective and accessible. Accordingly, this thesis investigates and develops a framework for extracting discriminative features to deflect these three attacks. The discrimination between spoofed and genuine speech is analysed as a function of frequency band across the speech bandwidth, and the findings inform some novel filter bank designs for spoofing detection. To capture a richer representation of the spectral content of speech, novel features based on a hierarchical scattering decomposition technique are proposed as effective front-ends for stand-alone spoofing detection. The results showed that the proposed scattering features were superior to all other front-ends that had previously been benchmarked on the VC, SS and replay corpora. Consequently, a hybrid network consisting of a scattering network followed by a convolutional network is also investigated.
Finally, a novel approach to detecting replayed speech is proposed, in which the similarity between pairs of speech samples is evaluated using a suitable embedding learned by deep Siamese architectures. Siamese networks are particularly suited to this task and have been shown to be effective in problems where intra-class variability is large and the number of training samples per class is relatively small. The proposed Siamese architecture achieves state-of-the-art performance when evaluated on the ASVspoof 2017 challenge corpus.
Author(s)
Sriskandaraja, Kaavya
Supervisor(s)
Eliathamby, Ambikairajah
Sethu, Vidhyasaharan
Publication Year
2018
Resource Type
Thesis
Degree Type
PhD Doctorate
UNSW Faculty
Files
download public version.pdf 4.48 MB Adobe Portable Document Format