Precise document retrieval in structured domains

Tam, Wai Lap Vincent

doi:10.26190/unsworks/15321

dc.contributor.advisor	Shepherd, John	en_US
dc.contributor.author	Tam, Wai Lap Vincent	en_US
dc.date.accessioned	2022-03-21T10:50:00Z
dc.date.available	2022-03-21T10:50:00Z
dc.date.issued	2009	en_US
dc.description.abstract	In this thesis, we propose a new approach for effective searching in websites that are domain-specific and highly structured (e.g. university course websites and documentation for software systems). The context for this work is information retrieval systems where we are given a collection of Web pages organized in a structurally meaningful way by their authors or through tools such as a content management system. While conventional Web search techniques could be used in this context, they are less effective than the proposed approach because they rely on assumptions that, while appropriate for effective Web-scale filtering and ranking, do not apply to more focused websites. The main thrust of our retrieval approach is to consider the relationships between pages within the site and combine the content of a given page with the content of related pages to improve the retrieval chances of the original page. We find that the retrieval target can be better represented if we also consider content extracted from related pages and use it to supplement the content of the original page. The main research problem for the thesis is to define a suitable set of related pages for such a purpose, and to derive a method to use the content from these related pages in order to improve search quality. We achieve this by first considering the directory structure (and thus URL hierarchy) of typical focused websites to determine relatedness. Then, for websites which do not have such a structure visible in their URL hierarchy, we consider hyperlinks between pages to infer such a structure. In addition to keyword queries, we also consider queries involving significant amounts of text, such as questions embedded in emails. These queries usually contain a large number of words, some of which are not relevant to the embedded query.
Experimental results show that our retrieval system can improve the interpretation of document context within a structured domain and thus yield better retrieval results.	en_US
dc.identifier.uri	http://hdl.handle.net/1959.4/51701
dc.language	English
dc.language.iso	EN	en_US
dc.publisher	UNSW, Sydney	en_US
dc.rights	CC BY-NC-ND 3.0	en_US
dc.rights.uri	https://creativecommons.org/licenses/by-nc-nd/3.0/au/	en_US
dc.subject.other	Document Relationships	en_US
dc.subject.other	Information Retrieval	en_US
dc.subject.other	Structured Document Retrieval	en_US
dc.subject.other	Link Analysis	en_US
dc.subject.other	Document Re-ranking	en_US
dc.subject.other	Web Search	en_US
dc.subject.other	URL Structure	en_US
dc.subject.other	Hyperlink Structure	en_US
dc.title	Precise document retrieval in structured domains	en_US
dc.type	Thesis	en_US
dcterms.accessRights	open access
dcterms.rightsHolder	Tam, Wai Lap Vincent
dspace.entity.type	Publication	en_US
unsw.accessRights.uri	https://purl.org/coar/access_right/c_abf2
unsw.identifier.doi	https://doi.org/10.26190/unsworks/15321
unsw.relation.faculty	Engineering
unsw.relation.originalPublicationAffiliation	Tam, Wai Lap Vincent, Computer Science & Engineering, Faculty of Engineering, UNSW	en_US
unsw.relation.originalPublicationAffiliation	Shepherd, John, Computer Science & Engineering, Faculty of Engineering, UNSW	en_US
unsw.relation.school	School of Computer Science and Engineering	*
unsw.thesis.degreetype	PhD Doctorate	en_US

Files

Original bundle

Name: whole.pdf
Size: 1.18 MB
Format: application/pdf

Resource type

Thesis