Information theoretic methods for clustering with applications to microarray data

Nguyen, Xuan Vinh

doi:10.26190/unsworks/23455

Access & Terms of Use

open access
Copyright: Nguyen, Xuan Vinh

CC BY-NC-ND 3.0

Abstract

This thesis addresses selected aspects of cluster analysis, mainly for microarray data: distance measures for clustering, measures for clustering comparison, estimation of the number of clusters, and generation of multiple clustering solutions for a given data set. The primary contribution of this thesis is a comprehensive investigation of the class of information theoretic measures for clustering comparison. These measures are widely employed in the clustering literature, but in our observation their applications have been somewhat scattered. As clustering comparison plays a very important role in contemporary clustering research, our work provides insight into how to choose a measure suited to particular needs. We propose the Normalized Information Distance, a normalized, true metric on the space of clusterings, as a general clustering comparison measure, and the Adjusted Mutual Information, a corrected-for-chance version of the popular Normalized Mutual Information, as a measure for data with few data items relative to the number of clusters, as in the problem of microarray sample clustering. We then demonstrate the usefulness of the proposed measures in estimating the number of clusters in microarray data via a novel index that we have developed, the Consensus Index, which assesses the stability of the clustering structure obtained for each candidate number of clusters. Additionally, this thesis provides theoretical and empirical justification for using the information theoretic Kullback-Leibler (KL) divergence for microarray data clustering, complementing previous research that showed interesting connections between the KL divergence and biological phenomena. This was accomplished by comparing the KL divergence to the more popular normalized squared Euclidean distance, within the frameworks of Bregman clustering and GlobalRSC, a novel shared-neighbor similarity clustering formulation that we develop.
We also present minCEntropy, a novel information theoretic clustering formulation for discovering alternative clusterings of a given data set, which performs competitively with existing alternative clustering methods on a variety of data sets.
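
As a rough illustration of the kind of clustering comparison measure the abstract refers to, the following minimal sketch computes a normalized information distance between two label assignments. It assumes the commonly cited form NID(U, V) = 1 - I(U; V) / max(H(U), H(V)); the abstract itself does not state a formula, so both this definition and the function names below are assumptions made for illustration, not the thesis's own implementation.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a clustering given as a label list."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """Mutual information (in nats) between two clusterings of the same items."""
    n = len(u)
    joint = Counter(zip(u, v))          # contingency table counts n_ij
    count_u, count_v = Counter(u), Counter(v)
    return sum(
        (nij / n) * log(nij * n / (count_u[a] * count_v[b]))
        for (a, b), nij in joint.items()
    )


def normalized_information_distance(u, v):
    """Assumed form: 1 - I(U;V) / max(H(U), H(V)); 0 means identical partitions."""
    h_max = max(entropy(u), entropy(v))
    if h_max == 0.0:
        # Both clusterings put every item in one cluster, so they coincide.
        return 0.0
    return 1.0 - mutual_information(u, v) / h_max


# Hypothetical toy labelings of four samples.
same_partition = normalized_information_distance([0, 0, 1, 1], [1, 1, 0, 0])
independent = normalized_information_distance([0, 0, 1, 1], [0, 1, 0, 1])
```

The measure depends only on how items are grouped, not on the label names, so relabeled copies of the same partition are at distance 0, while statistically independent partitions are at distance 1. The Adjusted Mutual Information additionally subtracts the mutual information expected by chance, which this sketch does not attempt.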

Persistent link to this record

http://hdl.handle.net/1959.4/50275

DOI

https://doi.org/10.26190/unsworks/23455

Author(s)

Nguyen, Xuan Vinh

Supervisor(s)

Epps, Julien

Ambikairajah, Eliathamby

Publication Year

2010

Resource Type

Thesis

Degree Type

PhD Doctorate

UNSW Faculty

Files

whole.pdf

1.92 MB

Adobe Portable Document Format
