From organism diversity to micro-heterogeneity: confident assessment of fine-scale variation within metagenomic data

open access
Copyright: Amos, Timothy
The metagenome of a microbial community contains a large quantity of information about the inter-strain genetic variation present in that community. Genome assemblers whose algorithms were designed for isolate genomes obscure this inter-strain variation within metagenomic data. Analysis of the variation is further complicated by sequencing errors, which add noise by making base assignments ambiguous. To develop improved computational methods for metagenome analysis, simulations were performed using genome data of individual species. Simulated reads were generated with the MetaSim software. Assemblies of these reads were used to investigate the development of an error model for confidently identifying single nucleotide polymorphisms (SNPs). This approach proved limited by the nature of the MetaSim software and by the lack of consistent, well-documented data. As an alternative, a graphical analysis of unitigs (high-confidence contigs) was developed. This analysis accurately predicted whether each unitig in an assembly of simulated reads comprised one strain or more than one. It included a system of rules describing the relationship between the number and proportions of strains in an assembly and the positioning of clusters in scatter plots. Differences in cluster density were used to help distinguish between ambiguous cluster patterns. Idealised assemblies of simulated reads without sequencing errors were produced to examine how sequence quality affects the ability to make inferences about inter-strain variation. Computational clustering was investigated as a means of automating the analysis. Once the approach to analysing unitigs was established, it was applied to environmental metagenome data. The graphical analysis provided a well-supported and parsimonious interpretation of the number of strains, and their proportions, present in metagenome data from an Antarctic lake community.
Amos, Timothy
Cavicchioli, Ricardo
Degree Type: Masters Thesis