Using Search Term Positions for Determining Document Relevance


Bibliographic Details
Main Author: Galeas, Patricio
Other Authors: Freisleben, Bernd (Prof. Dr.) (Thesis advisor)
Format: Dissertation
Language: English
Published: Philipps-Universität Marburg 2010
Online Access: PDF Full Text
Description
Summary: The technological advancements in computer networks and the substantial reduction of their production costs have caused a massive explosion of digitally stored information. In particular, textual information is becoming increasingly available in electronic form. Finding text documents dealing with a certain topic is not a simple task. Users need tools to sift through non-relevant information and retrieve only pieces of information relevant to their needs. The traditional methods of information retrieval (IR) based on search term frequency have reached their limits, and novel ranking methods based on hyperlink information are not applicable to unlinked documents. The retrieval of documents based on the positions of search terms in a document has the potential to yield improvements, because other terms in the environment where a search term appears (i.e. the neighborhood) are considered. That is to say, the grammatical type, position and frequency of other words help to clarify and specify the meaning of a given search term. However, the additional analysis required makes position-based methods slower than methods based on term frequency, and storing the positions of terms requires more space. These drawbacks directly affect the performance of the most user-critical phase of the retrieval process, namely query evaluation time, which explains the scarce use of positional information in contemporary retrieval systems. This thesis explores the possibility of extending traditional information retrieval systems with positional information in an efficient manner that permits us to optimize the retrieval performance by handling term positions at query evaluation time. To achieve this task, several abstract representations of term positions that efficiently store and operate on term positional data are investigated.
In the Gauss model, descriptive statistics methods are used to estimate term positional information, because they minimize outliers and irregularities in the data. The Fourier model is based on Fourier series to represent positional information. In the Hilbert model, functional analysis methods are used to provide reliable term position estimations and simple mathematical operators to handle positional data. The proposed models are experimentally evaluated using standard resources of the IR research community (Text Retrieval Conference). All experiments demonstrate that the use of positional information can enhance the quality of search results. The suggested models outperform state-of-the-art retrieval utilities. The term position models open new possibilities to analyze and handle textual data. For instance, document clustering and compression of positional data based on these models could be interesting topics to be considered in future research.
DOI:10.17192/z2010.0463