Publication:
Precise document retrieval in structured domains

dc.contributor.advisor Shepherd, John en_US
dc.contributor.author Tam, Wai Lap Vincent en_US
dc.date.accessioned 2022-03-21T10:50:00Z
dc.date.available 2022-03-21T10:50:00Z
dc.date.issued 2009 en_US
dc.description.abstract In this thesis, we propose a new approach for effective searching in websites which are domain-specific and highly structured (e.g. university course websites and documentation for software systems). The context for this work is information retrieval systems where we are given a collection of Web pages organized in a structurally meaningful way by their authors or through tools such as a content management system. While conventional Web search techniques could be used in this context, they are less effective than the proposed approach because they rely on assumptions that, while appropriate for effective Web-scale filtering and ranking, do not apply to more focused websites. The main thrust of our retrieval approach is to consider the relationships between pages within the site and combine the content of a given page with the content of related pages to improve the retrieval chances of the original page. We find that the retrieval target can be better represented if we also consider content extracted from related pages and use it to supplement the content of the original page. The main research problem for the thesis is to define a suitable set of related pages for such a purpose, and to derive a method to use the content from these related pages in order to improve search quality. We achieve this by first considering the directory structure (and thus URL hierarchy) of typical focused websites to determine relatedness. Then, for websites which do not have such a structure visible in their URL hierarchy, we consider hyperlinks between pages to infer such a structure. In addition to keyword queries, we also consider queries involving significant amounts of text, such as questions embedded in emails. These queries usually contain a large number of words, some of which are not relevant to the embedded query.
Experimental results show that our retrieval system can improve the interpretation of document context within a structured domain and thus yield better retrieval results. en_US
dc.identifier.uri http://hdl.handle.net/1959.4/51701
dc.language English
dc.language.iso EN en_US
dc.publisher UNSW, Sydney en_US
dc.rights CC BY-NC-ND 3.0 en_US
dc.rights.uri https://creativecommons.org/licenses/by-nc-nd/3.0/au/ en_US
dc.subject.other Document Relationships en_US
dc.subject.other Information Retrieval en_US
dc.subject.other Structured Document Retrieval en_US
dc.subject.other Link Analysis en_US
dc.subject.other Document Re-ranking en_US
dc.subject.other Web Search en_US
dc.subject.other URL Structure en_US
dc.subject.other Hyperlink Structure en_US
dc.title Precise document retrieval in structured domains en_US
dc.type Thesis en_US
dcterms.accessRights open access
dcterms.rightsHolder Tam, Wai Lap Vincent
dspace.entity.type Publication en_US
unsw.accessRights.uri https://purl.org/coar/access_right/c_abf2
unsw.identifier.doi https://doi.org/10.26190/unsworks/15321
unsw.relation.faculty Engineering
unsw.relation.originalPublicationAffiliation Tam, Wai Lap Vincent, Computer Science & Engineering, Faculty of Engineering, UNSW en_US
unsw.relation.originalPublicationAffiliation Shepherd, John, Computer Science & Engineering, Faculty of Engineering, UNSW en_US
unsw.relation.school School of Computer Science and Engineering *
unsw.thesis.degreetype PhD Doctorate en_US
Files
Original bundle
Name: whole.pdf
Size: 1.18 MB
Format: application/pdf