Efficient bulk-loading methods for temporal and multidimensional index structures

Nahezu alle naturwissenschaftlichen Bereiche profitieren von neuesten Analyse- und Verarbeitungsmethoden für große Datenmengen. Diese Verfahren setzten eine effiziente Verarbeitung von geo- und zeitbezogenen Daten voraus, da die Zeit und die Position wichtige Attribute vieler Daten sind. Die effizi...

Full description

Saved in:
Bibliographic Details
Main Author: Achakeev, Daniar
Contributors: Seeger, Bernhard (Prof. Dr.) (Thesis advisor)
Format: Doctoral Thesis
Language:English
Published: Philipps-Universität Marburg 2013
Subjects:
Online Access:PDF Full Text
Tags: Add Tag
No Tags, Be the first to tag this record!

The recent increase of spatial and temporal data requires efficient algorithms for index construction and for bulk updates. Many big data applications exhibit not only a high volume of static data but also inherit data growth. Moreover, some of them display high update rates. Incoming collected or produced data arrive in batches in order to reduce transportation and update costs. Therefore, updating an index using one record at a time is found to be inefficient for sufficiently large batch sizes. In this work, we investigate the problem of efficient bulk-loading of temporal and spatial index structures. We design novel loading strategies for multiversion B-tree (MVBT) and R-tree. We introduce a novel loading algorithm for MVBT with the asymptotic optimal I/O complexity. We show that the previously developed technique based on the buffer tree solves the loading problem only for the special case (if an input file consists only of insert operations). In this work, we propose the first loading algorithm for MVBT that meets the lower-bound of external sorting. We also proposed an efficient algorithm for bulk updates. These results are achieved by a combination of two algorithmic techniques: buffer tree and weight balancing. In this work, we designed a loading algorithm for the R-tree that optimizes it according to a widely used cost model. Extensive experimental results show that R-trees built using our novel approach exhibit substantially better query performance than sort-based counterparts. The novelty of our technique is that if a query profile is available the algorithm is able to build better R-trees. We proposed the following heuristic: first, we define the best sorting order using the average shape of the query rectangle. Second, we reduce the bulk-loading problem to a one-dimensional partitioning problem by sorting the input set. Third, we find an optimal solution for this problem according to the proposed cost models. The motivation of our heuristic is NP-hardness of the optimization problem. Based on the result obtained from query adaptive loading, we observed that construction of spatial histograms resembles the problem of optimal loading of R-trees. Therefore, we developed a spatial histogram construction method based on the partitioning framework developed for R-tree loading. Spatial histograms built using our novel method exhibit high accuracy for different data and query distributions. Our method has only one parameter, minimal page capacity. Moreover, our extensive experimental results show robustness of our method for different query and data distributions.