Abstract
This thesis addresses selected aspects of cluster analysis, mainly for microarray data: distance measures for clustering, measures for clustering comparison, estimation of the number of clusters, and generation of multiple clustering solutions for a given data set.
The primary contribution of this thesis is a comprehensive investigation of the class of information-theoretic measures for clustering comparison. These measures are widely employed in the clustering literature but, in our observation, their use has been somewhat scattered. As clustering comparison plays an important role in contemporary clustering research, our work provides insight into how to choose a measure suited to particular needs. We propose the Normalized Information Distance, a normalized true metric on the space of clusterings, as a general clustering comparison measure, and the Adjusted Mutual Information, a corrected-for-chance version of the popular Normalized Mutual Information, as a measure for data sets with few data items relative to the number of clusters, as in microarray sample clustering. We then demonstrate the usefulness of the proposed measures for estimating the number of clusters in microarray data via a novel index that we have developed, the Consensus Index, which assesses the stability of the clustering structure obtained for each candidate number of clusters.
Additionally, this thesis provides theoretical and empirical justification for using the information-theoretic Kullback-Leibler (KL) divergence for microarray data clustering, complementing previous research that showed interesting connections between the KL divergence and biological phenomena. We do so by comparing the KL divergence to the more popular normalized squared Euclidean distance within the frameworks of Bregman clustering and GlobalRSC, a novel shared-neighbor similarity clustering formulation that we develop. We also present minCEntropy, a novel information-theoretic clustering formulation for discovering alternative clusterings of a given data set, which performs competitively with existing alternative clustering methods on a variety of data sets.