Advanced Indexing and Query Processing for Multidimensional Databases
Many new applications, such as multimedia databases, employ the so-called feature transformation which transforms important features or properties of data objects into high-dimensional points. Searching for 'similar' or 'non-dominated' objects based on these features is thus a se...
|PDF Full Text
No Tags, Be the first to tag this record!
|Many new applications, such as multimedia databases, employ the so-called feature transformation which transforms important features or properties of data objects into high-dimensional points. Searching for 'similar' or 'non-dominated' objects based on these features is thus a search of points in this feature space. To support efficient query processing in these high dimensional databases, high-dimensional indexes are required to prune the search space and efficient query processing strategies employing these indexes have to be designed. Based on an analysis of typical advanced database systems - such as multimedia databases, electronic market places, and decision support systems - the following four challenging characteristics of complexity are detected: high-dimensionality nature of data, re-usability of existing index structures, novel (more expressive) query operators for advanced database systems and efficient analysis of complex high-dimensional data. Therefore, the general goal of this thesis is the improvement of the efficiency of index based query processing in high-dimensional data spaces and the development of novel query operators.
The first part of this thesis deals with similarity query processing techniques. We introduce a new approach to indexing multidimensional data that is particularly suitable for the efficient incremental processing of nearest neighbor queries. The basic idea is to split the data space vertically into multiple low- and medium-dimensional data spaces. The data from each of these lower-dimensional subspaces is organized by using a standard multi-dimensional index structure. In order to perform incremental NN-queries on top of index-striping efficiently, we first develop an algorithm for merging the results received from the underlying indexes. Then, an accurate cost model relying on a power law is presented that determines an appropriate number of indexes. Moreover, we consider the problem of dimension assignment, where each dimension is assigned to a lower-dimensional subspace, such that the cost of nearest neighbor queries is minimized. Furthermore, a generalization of the iDistance technique, called Multidimensional iDistance (MiD), for k nearest neighbor query processing is presented. Three main steps are performed for building MiD. In agreement with iDistance, firstly, data points are partitioned into clusters and, secondly, a reference point is determined for every cluster. However, the third step substantially differs from iDistance as a data object is mapped to a m-dimensional distance vector where m > 1 generally holds. The m dimensions are generated by splitting the original data space into m subspaces and computing the partial distance between the object and the reference point for every subspace. The resulting m-dimensional points can be indexed by an arbitrary point access method like an R-tree. Our crucial parameter m is derived from a cost model that is based on a power law. We present range and k-NN query processing algorithms for MiD.
The second part of this thesis deals with skyline query processing techniques. We first introduce the problem of Constrained Subspace Skyline Queries (CSSQ). We present a query processing algorithm which builds on multiple low-dimensional index structures. Due to the usage of well performing low dimensional indices, constrained subspace skyline queries for arbitrary large subspaces are efficiently supported. Effective pruning strategies are applied to discard points from dominated regions. An important ingredient of our approach is the workload-adaptive strategy for determining the number of indexes and the assignment of dimensions to the indexes. Furthermore, we introduce the concept of Reverse Skyline Queries (RSQ). Given a set of data points P and a query point q, an RSQ returns the data objects that have the query object in the set of their 'dynamic' skyline. Such kind of dynamic skyline corresponds to the skyline of a transformed data space where point q becomes the origin and all points are represented by their distance to q.In order to compute the reverse skyline of an arbitrary query point, we first propose a Branch and Bound algorithm (called BBRS), which is an improved customization of the original BBS algorithm. To reduce even more the computational cost of determining if a point belongs to the reverse skyline, we propose an enhanced algorithm (called RSSA), that is based on accurate pre-computed approximations of the skylines. These approximations are used to identify whether a point belongs to the reverse skyline or not.
The effectiveness and efficiency of all proposed techniques are discussed and verified by comparison with conventional approaches in versatile experimental evaluations on real-world datasets.