Protein Sequence Similarity Search using Locality-Sensitive Hashing and MapReduce

Sunarso, Freddie

doi:10.26190/unsworks/16936

Protein Sequence Similarity Search using Locality-Sensitive Hashing and MapReduce

Download files

Access & Terms of Use

open access
Copyright: Sunarso, Freddie

CC BY-NC-ND 3.0

Abstract

Metagenomic studies produce large datasets that are estimated to grow at a faster rate than the available computational capacity. A key step in the metagenomic workflows is protein sequence similarity search, which is computationally intensive over large datasets. Several techniques and tools have been developed for this purpose. However, these were designed and implemented for single-node computing infrastructure and are not able to handle the exponential increase in data. Several efforts, directed to improve the performance of protein sequence search through parallel and distributed computing, have enormously improved the performance of searching over large bioinformatic datasets. However, these are best suited for static deployment over large-scale computing infrastructure such as clusters or supercomputers. Access to such infrastructure is not available to most of the active scientists in the world. The recent arrival of cloud computing that enables availing computing as a utility, has resulted in broad access to distributed system infrastructure. Recently, popular protein sequence similarity search tools, such as BLAST, have been re-engineered to capitalise on cloud computing. Most of these efforts, however, follow a naive, static, ‘embarrassingly parallel’ approach and are not able to take advantage of scaling up through resource elasticity in cloud computing. This thesis presents a novel approach for protein sequence similarity search that builds on Locality-Sensitive Hashing (LSH), a well-known methodology for high-performance similarity search over large-scale, multi-dimensional sets of documents. LSH has been demonstrated to be easily adaptable for distribution over cloud infrastructure through Key-Value schemes such as MapReduce. Firstly, this thesis introduces a set of algorithms based on LSH that solves the challenges outlined previously. Secondly, the thesis details the design of ScalLoPS, a search tool similar to BLAST, that implements these and subsequent sequence alignment algorithms, using the MapReduce distributed programming paradigm. Finally, the performance, sensitivity, and scalability of ScalLoPS is compared to current-state tools such BLAST using datasets derived from actual metagenomic workflows, and on resources of a cloud provider. ScalLoPS is shown to be able to scale up in an inverse exponential distribution over the number of processing nodes, while approximating the quality of BLAST results.

Persistent link to this record

http://hdl.handle.net/1959.4/53681

DOI

https://doi.org/10.26190/unsworks/16936

Author(s)

Sunarso, Freddie

Supervisor(s)

Venugopal, Srikumar

Publication Year

2014

Resource Type

Thesis

Degree Type

Masters Thesis

UNSW Faculty

Files

public version.pdf

2.09 MB

Adobe Portable Document Format

View full record Show statistics

Library

Protein Sequence Similarity Search using Locality-Sensitive Hashing and MapReduce

Access & Terms of Use

Altmetric

Abstract

Persistent link to this record

DOI

Link to Publisher Version

Link to Open Access Version

Additional Link

Author(s)

Supervisor(s)

Creator(s)

Editor(s)

Translator(s)

Curator(s)

Designer(s)

Arranger(s)

Composer(s)

Recordist(s)

Conference Proceedings Editor(s)

Other Contributor(s)

Corporate/Industry Contributor(s)

Publication Year

Resource Type

Degree Type

UNSW Faculty

Files

Related dataset(s)