Information theoretic methods for clustering with applications to microarray data

Access & Terms of Use
open access
Copyright: Nguyen, Xuan Vinh
Abstract
This thesis addresses selected aspects of cluster analysis, mainly for microarray data: distance measures for clustering, measures for clustering comparison, estimation of the number of clusters, and generation of multiple clustering solutions for a given data set. The primary contribution of this thesis is a comprehensive investigation of the class of information theoretic measures for clustering comparison. These measures are widely employed in the clustering literature, but in our observation their use has been somewhat scattered. As clustering comparison plays a very important role in contemporary clustering research, our work provides insight into how to choose a measure suited to particular needs. We propose the Normalized Information Distance, a normalized, true metric on the space of clusterings, as a general clustering comparison measure, and the Adjusted Mutual Information, a corrected-for-chance version of the popular Normalized Mutual Information, as a measure for data with few data items relative to the number of clusters, as in the problem of microarray sample clustering. We then demonstrate the usefulness of the proposed measures in estimating the number of clusters in microarray data via a novel index that we have developed, the Consensus Index, which assesses the stability of the clustering structure obtained for each candidate number of clusters. Additionally, this thesis provides theoretical and empirical justification for using the information theoretic Kullback-Leibler (KL) divergence for microarray data clustering, complementing previous research that showed interesting connections between the KL divergence and biological phenomena. This was accomplished by comparing the KL divergence to the more popular normalized squared Euclidean distance, within the frameworks of Bregman clustering and GlobalRSC, a novel shared-neighbor similarity clustering formulation that we develop.
We also present minCEntropy, a novel information theoretic clustering formulation for discovering alternative clusterings of a given data set, which performs competitively with existing methods for alternative clustering on a variety of data sets.
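As a rough illustration of the kind of clustering comparison measure discussed in the abstract, the sketch below computes a Normalized Information Distance between two labelings of the same items. The abstract does not state a formula, so this assumes the commonly cited form NID(U, V) = 1 - I(U; V) / max(H(U), H(V)). The function names are hypothetical, and the chance correction used by the Adjusted Mutual Information is not attempted here.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a clustering given as a list of labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """Mutual information (in nats) between two labelings of the same items."""
    if len(u) != len(v):
        raise ValueError("both clusterings must label the same items")
    n = len(u)
    count_u, count_v = Counter(u), Counter(v)
    joint = Counter(zip(u, v))  # contingency table, nonzero cells only
    return sum(
        (nij / n) * log(n * nij / (count_u[a] * count_v[b]))
        for (a, b), nij in joint.items()
    )


def normalized_information_distance(u, v):
    """Assumed form: 1 - I(U;V) / max(H(U), H(V))."""
    h_max = max(entropy(u), entropy(v))
    if h_max == 0.0:
        # Both clusterings put every item in one cluster, so they coincide.
        return 0.0
    return 1.0 - mutual_information(u, v) / h_max


# Hypothetical toy labelings of four samples.
same_up_to_renaming = normalized_information_distance([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = normalized_information_distance([0, 0, 1, 1], [0, 1, 0, 1])
```

Under this assumed definition, two clusterings that differ only in how the clusters are named are at distance 0, while the two statistically independent labelings in the toy example are at distance 1, up to floating-point rounding.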
Author(s)
Nguyen, Xuan Vinh
Supervisor(s)
Epps, Julien
Ambikairajah, Eliathamby
Publication Year
2010
Resource Type
Thesis
Degree Type
PhD Doctorate