Analysis and optimisation of selected genomic algorithms

Bayat, Arash

doi:10.26190/unsworks/21174

Access & Terms of Use

open access
Copyright: Bayat, Arash

CC BY-NC-ND 3.0

Abstract

The importance of genomic applications in medicine, agriculture, the environment and other fields has focused attention on genomic computation over the last two decades. New technologies make it affordable to extract genomic information (sequencing) on a scale hitherto unknown. This has lowered the price of sequencing and increased the number of areas in which sequencing data is used, so there is a need to assemble more and more genomes. Significant computational effort is needed to process this sequenced data, that is, to assemble it and search for variations. It has been predicted that genomic data will exceed the amount of astronomical data in the near future. The growth in computational capacity described by Moore’s law cannot keep pace with this increased computational demand. This thesis is motivated by the extensive demand for processing of sequenced data. The author identifies several important related processes and aims to improve each of them. First, recent assembly pipelines are comprehensively reviewed and evaluated. The study reveals important facts that are used to design an efficient assembly practice. Second, a novel assembly pipeline is introduced that successfully balances the trade-off between speed and accuracy. Third, a fast and accurate sequence alignment algorithm is proposed; sequence alignment is the core of several steps in the assembly workflow, as well as of a wide range of other related analyses. Finally, a new data normalisation method is designed. Due to the probabilistic nature of genome assembly, evaluating accuracy is critical, and normalisation is a vital part of the evaluation process. Along with the normalisation method, the author proposes a metric to measure how well the data is normalised. This is the first time such a metric has been proposed.
The proposed assembly pipeline is 6 times faster than SPAdes and achieves 100 times greater contiguity than SOAPdenovo2. The proposed alignment algorithm is 14 times faster than the Smith-Waterman algorithm, yet for 99.99% of input sequence pairs it produces the same alignment as the Smith-Waterman algorithm. Finally, the proposed normalisation method is 949 times more accurate than vt-Normalize.

Persistent link to this record

http://hdl.handle.net/1959.4/61762

DOI

https://doi.org/10.26190/unsworks/21174

Author(s)

Bayat, Arash

Supervisor(s)

Parameswaran, Sri

Ignjatovic, Aleksandar

Publication Year

2018

Resource Type

Thesis

Degree Type

PhD Doctorate

UNSW Faculty

Files

public version.pdf

5.19 MB

Adobe Portable Document Format
