Innovative methods for the analysis of complex and non-standard data

Access & Terms of Use
open access
Copyright: Whitaker, Thomas
Symbolic Data Analysis (SDA) is an emerging branch of statistics that addresses issues arising in the analysis of non-standard (symbolic) datasets, such as intervals, histograms and lists. Datasets of this nature are useful for preserving the privacy of individual observations and for reducing the size and dimension of big datasets, which leads to significant computational benefits if an appropriate symbolic analysis can be derived. Rapidly increasing and readily available computational power has also made non-standard datasets increasingly common. Data arriving in a non-standard form often possess internal variation not seen in pointwise classical observations, so existing classical methods of analysis are unsuitable when results with an underlying classical interpretation are desired. Currently, most developed SDA methods focus on an exploratory analysis of the data. Their results are useful only at the symbolic level and are not directly comparable to the complete analysis of the true latent underlying dataset unless specific assumptions about the uniformity of the data within each symbol are met. A common existing symbolic methodology is to perform a classical analysis of features of the non-standard data, such as interval end-points. In this thesis, methods of analysis for non-standard data are developed that are interpretable at the underlying classical level. Further, if enough information is retained during the aggregation process, the methods derived for non-standard datasets obtain results comparable to the complete classical analysis of the underlying latent dataset. As a result, big datasets that pose computational problems can be analysed using the proposed symbolic methodologies instead of the classical analyses, at a lower computational cost.
These methods are highly flexible: they do not rely on a uniformity assumption within each symbol, and they can be applied to a range of symbolic data. The utility of each symbolic method is demonstrated via simulation studies that illustrate the convergence of the results towards the complete analysis as more information is retained during the aggregation process. Each derived method is then applied to a real dataset to demonstrate its practical use.
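The idea of recovering a classical-level result from aggregated data can be illustrated with a small sketch. It is not the thesis's method. It assumes simulated normal data, an equal-width histogram as the symbol, and a standard grouped-data (binned) likelihood maximised by a simple compass search. All function names (`to_histogram`, `binned_neg_loglik`, `fit_normal_from_histogram`) and all parameter values are hypothetical choices made for illustration.

```python
import math
import random
import statistics


def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def to_histogram(data, n_bins):
    """Aggregate pointwise data into an equal-width histogram symbol."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for x in data:
        k = min(int((x - lo) / width), n_bins - 1)  # clamp x == hi into last bin
        counts[k] += 1
    return edges, counts


def binned_neg_loglik(mu, log_sigma, edges, counts):
    """Negative log-likelihood of bin counts under a latent normal model."""
    sigma = math.exp(log_sigma)
    total = 0.0
    for k, c in enumerate(counts):
        if c == 0:
            continue
        p = normal_cdf(edges[k + 1], mu, sigma) - normal_cdf(edges[k], mu, sigma)
        total -= c * math.log(max(p, 1e-300))  # guard against log(0)
    return total


def fit_normal_from_histogram(edges, counts, iters=200):
    """Maximise the binned likelihood with a basic compass (pattern) search."""
    n = sum(counts)
    mids = [(edges[k] + edges[k + 1]) / 2 for k in range(len(counts))]
    # Starting values only: moments of the bin midpoints.
    mu = sum(c * m for c, m in zip(counts, mids)) / n
    var = sum(c * (m - mu) ** 2 for c, m in zip(counts, mids)) / n
    log_sigma = 0.5 * math.log(max(var, 1e-12))
    best = binned_neg_loglik(mu, log_sigma, edges, counts)
    step = 0.5
    for _ in range(iters):
        improved = False
        for d_mu, d_ls in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            cand = binned_neg_loglik(mu + d_mu, log_sigma + d_ls, edges, counts)
            if cand < best:
                mu, log_sigma, best = mu + d_mu, log_sigma + d_ls, cand
                improved = True
        if not improved:
            step /= 2.0  # shrink the search pattern when no move helps
    return mu, math.exp(log_sigma)


# Simulated "latent" classical dataset (assumed parameters, for illustration).
rng = random.Random(0)
data = [rng.gauss(2.0, 1.5) for _ in range(5000)]

# Complete classical analysis: maximum likelihood estimates from the raw data.
full_mean = statistics.fmean(data)
full_sd = statistics.pstdev(data)

# Symbolic analysis: the same parameters estimated from bin counts alone.
results = {}
for n_bins in (4, 8, 32):
    edges, counts = to_histogram(data, n_bins)
    results[n_bins] = fit_normal_from_histogram(edges, counts)

print("full data:", full_mean, full_sd)
for n_bins, (mu_hat, sd_hat) in results.items():
    print(n_bins, "bins:", mu_hat, sd_hat)
```

In this sketch the fitted parameters refer to the latent normal model rather than to the histogram itself, and no uniformity within bins is assumed. Under these assumptions, the histogram-based estimates would be expected to move closer to the full-data estimates as the number of bins grows, while each fit only ever touches the bin edges and counts.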
Whitaker, Thomas
Sisson, Scott
Beranger, Boris
Degree Type
PhD Doctorate