Analysis and optimisation of selected genomic algorithms

Access & Terms of Use
open access
Copyright: Bayat, Arash
Abstract
The importance of genomic applications in medicine, agriculture, the environment and other fields has focused attention on genomic computation over the last two decades. New technologies make it affordable to extract genomic information (sequencing) on a scale hitherto unknown. This has reduced the price of sequencing and increased the number of areas in which sequencing data is used. Thus there is a need to assemble more and more genomes. Significant computational effort is needed to process this sequenced data: to assemble it and to search for variations. It has been predicted that the volume of genomic data will exceed that of astronomical data in the near future. Growth in computational capacity, based on Moore’s law, cannot keep pace with this increased computational demand. This thesis is motivated by the extensive demand for processing sequenced data. The author identifies several important related processes and aims to improve each of them. First, a comprehensive review of recent assembly pipelines is carried out to evaluate them. The results of the study reveal important facts that are used to design an efficient assembly practice. Second, a novel assembly pipeline is introduced that successfully balances the trade-off between speed and accuracy. Third, a fast and accurate sequence alignment algorithm is proposed; sequence alignment is at the core of several steps in the assembly workflow, as well as a wide range of other related analyses. Finally, a new data normalisation method is designed. Due to the probabilistic nature of genome assembly, evaluating accuracy is critical, and normalisation is a vital part of the evaluation process. Along with the normalisation method, the author proposes a metric to measure how well the data is normalised; this is the first time such a metric has been proposed.
The proposed assembly pipeline is 6 times faster than SPAdes and achieves contiguity 100 times greater than that of SOAPdenovo2. The proposed alignment algorithm is 14 times faster than the Smith-Waterman algorithm, yet for 99.99% of input sequence pairs it produces the same alignment as Smith-Waterman. Finally, the proposed normalisation method is 949 times more accurate than vt-Normalize.
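For readers unfamiliar with the baseline named above, the following is a minimal sketch of the classic Smith-Waterman local alignment score. It is not the alignment algorithm proposed in the thesis. The function name `smith_waterman_score`, the scoring values (match 2, mismatch -1, gap -2) and the linear gap penalty are assumptions chosen only for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b.

    Textbook Smith-Waterman recurrence with a linear gap penalty.
    The scoring parameters are illustrative assumptions, not values
    taken from the thesis.
    """
    # Only the previous row of the dynamic-programming table is kept,
    # so memory is O(len(b)) while time stays O(len(a) * len(b)).
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero: a local alignment may
            # restart anywhere in either sequence.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

The quadratic running time of this recurrence gives a sense of why a faster aligner matters at sequencing scale. Recovering the alignment itself, rather than only its score, would require keeping the full table and tracing back from the best cell.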
Author(s)
Bayat, Arash
Supervisor(s)
Parameswaran, Sri
Ignjatovic, Aleksandar
Publication Year
2018
Resource Type
Thesis
Degree Type
PhD Doctorate