Robust Stream Indexing

Kontinuierliche Datenströme stehen im Zentrum von vielen anspruchsvollen und komplexen Anwendungen. Neben der Online-Verarbeitung durch ein Datenstromsystem müssen Datenströme auch langfristig in einer Datenbank gespeichert werden. Moderne Hardware kann Datenströme meist mit sehr hohem Durchsatz un...

Full description

Saved in:
Bibliographic Details
Main Author: Glombiewski, Nikolaus
Contributors: Seeger, Bernhard (Prof. Dr.) (Thesis advisor)
Format: Doctoral Thesis
Language:German
Published: Philipps-Universität Marburg 2023
Subjects:
Online Access:PDF Full Text
Tags: Add Tag
No Tags, Be the first to tag this record!

In many of today's most challenging use cases, data is continuously produced and processed in never-ending data streams. In addition to being processed online by a stream processing system, in most instances, data streams also need to be stored long-term in a database. Modern hardware exhibits excellent write speeds and thus can store streams with high throughput and low latency. However, parts of the stream must also be retrieved efficiently to extract knowledge from data. Decades of research have led to an incredible variety of index structures that can be used to query data for a large variety of use cases. While stream indexes have seen significant improvements in their efficiency, increasing their robustness is still a challenging and vital topic because streams result in never-ending index maintenance. This maintenance requires resources, leading to decreased or fluctuating performance for regular insertions and query operations. Thus, more robust stream indexing can significantly reduce operational costs and improve the user experience. As a result, this thesis's main objective is to improve the robustness of stream indexing. B-trees are well-researched and widely adopted index structures. Since they are a core part of many database systems, improving B-tree robustness has a wide-reaching effect. When used for streaming data, insertions into B-trees can result in many node splits. After bulk loading B-trees, these splits can occur in waves, affecting insertions and queries. However, this thesis shows that these waves of split activity can be reduced or eliminated by changing the bulk loading strategies used to create the B-tree. Specialized stream index structures, such as log-structured merge trees, stepped-merge forests, and their variants, avoid waves of node splits present in B-trees. These index structures consist of multiple components, which are occasionally merged to improve query performance. This results in periodic bursts of merge activity. As an alternative, this thesis presents continuous merging with staggered key ranges. The main idea is a perpetual mergesort algorithm, which results in a more robust performance for stream indexing. Stream indexes are usually part of a more complex database system. ChronicleDB is an event database system optimized for writing temporal streaming data. Improvements to B-trees and continuous merging with staggered key ranges will be related to the overall design of ChronicleDB. Furthermore, this thesis covers improvements made to ChronicleDB that use the unique characteristics of temporal data. The results lead to a more robust event database system.