A Probabilistic Graphical Model for Structured Prediction over Heterogeneous Data

Access & Terms of Use
open access
Copyright: Ye, Pengjie
Abstract
Advances in sensor and instrumentation technology, together with cost reductions and capacity increases in computing and communication technologies, have led to the rapid accumulation of large amounts of data, in addition to that collected by traditional methods. Data from these sources is called heterogeneous because it does not conform to a single type of data structure. A notable example is Electronic Health Record (EHR) data. Given the size and complexity of heterogeneous data, there is a growing need to apply machine learning to predict, for example, patient outcomes from EHR data. Such data is inherently uncertain, so learning algorithms based on the framework of probabilistic graphical models for classification are appropriate. Despite the popularity of structured prediction, its ability to utilise domain knowledge and to model the source of structure is limited. This thesis identifies the connection between the mechanism of abstract domain knowledge and the structural setting of a graphical model. A clique-based mapping method is proposed to develop a structural-binding and knowledge-embedding set of feature functions. A general discriminatively-trained probabilistic graphical model, the transitional random field (TRF), is proposed for modelling heterogeneous input data without the locality-preserving property widely seen in conditional random field (CRF) problem settings. We also introduce a novel ontology-based probabilistic similarity measure for heterogeneous data that simplifies probabilistic computation in TRFs and enables efficient inference. The TRF framework identifies and maps information from the input structure to the non-isomorphic format determined by the output structure, while also utilising existing knowledge embedded implicitly in the structure of the input and output. 
This ability to represent dependencies as features denoting transitional relations between input and output gives TRF the potential to learn models from a wide range of heterogeneous data and make predictions about structured domain knowledge. Our experiments on a large real-world data set demonstrate that TRF can be successfully applied to a demanding structured prediction problem over heterogeneous EHR data, with the proposed TRF training and inference algorithms obtaining good accuracy and efficiency.
Author(s)
Ye, Pengjie
Supervisor(s)
Bain, Michael
Publication Year
2017
Resource Type
Thesis
Degree Type
PhD Doctorate