Using Search Term Positions for Determining Document Relevance

The technological advancements in computer networks and the substantial reduction of their production costs have caused a massive explosion of digitally stored information. In particular, textual information is becoming increasingly available in electronic form. Finding text documents dealing wi...

全面介绍

Gespeichert in:
书目详细资料
主要作者: Galeas, Patricio
其他作者: Freisleben, Bernd (Prof. Dr.) (BetreuerIn (Doktorarbeit))
格式: Dissertation
语言:英语
出版: Philipps-Universität Marburg 2010
主题:
在线阅读:PDF-Volltext
标签: 添加标签
没有标签, 成为第一个标记此记录!
实物特征
总结:The technological advancements in computer networks and the substantial reduction of their production costs have caused a massive explosion of digitally stored information. In particular, textual information is becoming increasingly available in electronic form. Finding text documents dealing with a certain topic is not a simple task. Users need tools to sift through non-relevant information and retrieve only pieces of information relevant to their needs. The traditional methods of information retrieval (IR) based on search term frequency have somehow reached their limitations, and novel ranking methods based on hyperlink information are not applicable to unlinked documents. The retrieval of documents based on the positions of search terms in a document has the potential of yielding improvements, because other terms in the environment where a search term appears (i.e. the neighborhood) are considered. That is to say, the grammatical type, position and frequency of other words help to clarify and specify the meaning of a given search term. However, the required additional analysis task makes position-based methods slower than methods based on term frequency and requires more storage to save the positions of terms. These drawbacks directly affect the performance of the most user critical phase of the retrieval process, namely query evaluation time, which explains the scarce use of positional information in contemporary retrieval systems. This thesis explores the possibility of extending traditional information retrieval systems with positional information in an efficient manner that permits us to optimize the retrieval performance by handling term positions at query evaluation time. To achieve this task, several abstract representation of term positions to efficiently store and operate on term positional data are investigated. In the Gauss model, descriptive statistics methods are used to estimate term positional information, because they minimize outliers and irregularities in the data. The Fourier model is based on Fourier series to represent positional information. In the Hilbert model, functional analysis methods are used to provide reliable term position estimations and simple mathematical operators to handle positional data. The proposed models are experimentally evaluated using standard resources of the IR research community (Text Retrieval Conference). All experiments demonstrate that the use of positional information can enhance the quality of search results. The suggested models outperform state-of-the-art retrieval utilities. The term position models open new possibilities to analyze and handle textual data. For instance, document clustering and compression of positional data based on these models could be interesting topics to be considered in future research.
DOI:10.17192/z2010.0463