Precise document retrieval in structured domains

Access & Terms of Use
open access
Copyright: Tam, Wai Lap Vincent
Abstract
In this thesis, we propose a new approach for effective searching in websites that are domain-specific and highly structured (e.g. university course websites and documentation for software systems). The context for this work is information retrieval systems where we are given a collection of Web pages organized in a structurally meaningful way by their authors or through tools such as a content management system. While conventional Web search techniques could be used in this context, they are less effective than the proposed approach because they rely on assumptions that, while appropriate for effective Web-scale filtering and ranking, do not apply to more focused websites. The main thrust of our retrieval approach is to consider the relationships between pages within the site and to combine the content of a given page with the content of related pages, so that the original page is more likely to be retrieved. We find that the retrieval target can be better represented if we also consider content extracted from related pages and use it to supplement the content of the original page. The main research problem for the thesis is to define a suitable set of related pages for this purpose, and to derive a method that uses the content from these related pages to improve search quality. We achieve this by first considering the directory structure (and thus the URL hierarchy) of typical focused websites to determine relatedness. Then, for websites that do not have such a structure visible in their URL hierarchy, we consider hyperlinks between pages to infer such a structure. In addition to keyword queries, we also consider queries involving significant amounts of text, such as questions embedded in emails. These queries usually contain a large number of words, some of which are not relevant to the embedded query.
Experimental results show that our retrieval system can improve the interpretation of document context within a structured domain and thus yield better retrieval results.
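The idea of supplementing a page with content from related pages can be illustrated with a minimal sketch. This is not the method developed in the thesis: it assumes, purely for illustration, that two pages are related when they sit in the same URL directory, and that terms from related pages contribute at a fixed reduced weight. The function names, the `weight` parameter, and the `example.edu` URLs are all hypothetical.

```python
from collections import Counter
from posixpath import dirname
from urllib.parse import urlparse

# Hypothetical toy site: URL -> page text.
EXAMPLE_PAGES = {
    "http://example.edu/courses/cs101/index.html": "introduction to programming",
    "http://example.edu/courses/cs101/assessment.html": "assignment deadlines and exam",
    "http://example.edu/courses/cs202/index.html": "database systems",
}


def directory_of(url: str) -> str:
    """Return the directory part of the URL path, e.g. '/courses/cs101'."""
    return dirname(urlparse(url).path)


def related_pages(url: str, pages: dict[str, str]) -> list[str]:
    """Assumption: pages sharing a URL directory with `url` are related to it."""
    target = directory_of(url)
    return sorted(u for u in pages if u != url and directory_of(u) == target)


def supplemented_terms(url: str, pages: dict[str, str], weight: float = 0.5) -> Counter:
    """Term weights for `url`: its own terms count 1.0 each,
    terms from related pages count `weight` each (an assumed constant)."""
    terms: Counter = Counter()
    for word in pages[url].lower().split():
        terms[word] += 1.0
    for other in related_pages(url, pages):
        for word in pages[other].lower().split():
            terms[word] += weight
    return terms


def score(query: str, url: str, pages: dict[str, str]) -> float:
    """Sum the supplemented term weights of the query words for one page."""
    terms = supplemented_terms(url, pages)
    return sum(terms[word] for word in query.lower().split())
```

With this toy data, the query "programming exam" scores 1.5 for the `cs101/assessment.html` page even though the word "programming" appears only on its sibling `index.html`, while the unrelated `cs202` page scores 0. A hyperlink-based notion of relatedness, as mentioned in the abstract for sites without a visible URL hierarchy, would replace `related_pages` with a function over a link graph.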
Author(s)
Tam, Wai Lap Vincent
Supervisor(s)
Shepherd, John
Publication Year
2009
Resource Type
Thesis
Degree Type
PhD Doctorate