Continuous Queries over Data Streams - Semantics and Implementation
Recent technological advances have driven the emergence of a new class of data-intensive applications that require continuous processing over sequences of transient data, called data streams, in near real-time. Examples of such applications range from online monitoring and analysis of sensor data for traffic management and factory automation to financial applications tracking stock ticker data. Traditional database systems are deemed inadequate to support high-volume, low-latency stream processing because queries are expected to run continuously and return new answers as new data arrives, without the need to store data persistently. The goal of this thesis is to develop a solid and powerful foundation for processing continuous queries over data streams. Resource requirements are kept within bounds by restricting the evaluation of continuous queries to sliding windows over the potentially unbounded data streams. This technique has the advantage that it emphasizes new data, which in the majority of real-world applications is considered more important than older data. Although the presence of continuous queries dictates rethinking the fundamental architecture of database systems, this thesis pursues an approach that adapts well-established database technology to the data stream computation model, with the aim of facilitating the development and maintenance of stream-oriented applications. Based on a declarative query language that inherits its basic syntax from the prevalent SQL standard, users are able to express and modify complex application logic in an easy and comprehensible manner, without requiring the use of custom code. The underlying semantics assigns an exact meaning to a continuous query at any point in time and is defined by temporal extensions of the relational algebra. By carrying over the well-known algebraic equivalences from relational databases to stream processing, this thesis prepares the ground for powerful query optimizations.
A unique time-interval-based stream algebra implemented with efficient online algorithms allows for processing data in a push-based fashion. A performance analysis, along with experimental studies, confirms the superiority of the time-interval approach over comparable approaches for the predominant set of continuous queries. Based upon this stream algebra, this thesis addresses architectural issues of an adaptive and scalable runtime environment that can cope with varying query workloads and fluctuating data stream characteristics arising from the highly dynamic and long-running nature of streaming applications. In order to control the resource allocation of continuous queries, novel adaptation techniques are investigated, trading off answer quality for lower resource requirements. Moreover, a general migration strategy is developed that enables the query processing engine to re-optimize continuous queries at runtime. Overall, this thesis outlines the salient features and operational functionality of the stream processing infrastructure PIPES (Public Infrastructure for Processing and Exploring Streams), which has already been applied successfully in a variety of stream-oriented applications.
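
As an illustration of the ideas summarized above, the following minimal Python sketch shows one way a time-based sliding window can be expressed with validity intervals, and how a count can then be maintained in a push-based fashion. It is not taken from the thesis or from the PIPES code base: the names `Element`, `time_window`, `snapshot`, and `SlidingCount` are hypothetical, timestamps are assumed to be integers arriving in non-decreasing order, and the window size is assumed to be constant.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class Element:
    """A stream element valid during the half-open interval [start, end)."""
    value: Any
    start: int
    end: int


def time_window(stream: Iterable[Tuple[Any, int]], size: int) -> Iterator[Element]:
    """Assign each (value, timestamp) pair the validity interval [t, t + size)."""
    for value, t in stream:
        yield Element(value, t, t + size)


def snapshot(elements: Iterable[Element], t: int) -> List[Any]:
    """Return the values valid at time t, i.e. the window content at that instant."""
    return [e.value for e in elements if e.start <= t < e.end]


class SlidingCount:
    """Push-based count of the elements valid when a new element arrives.

    Assumes elements are pushed in non-decreasing start order and share one
    window size, so their end timestamps are non-decreasing as well.
    """

    def __init__(self) -> None:
        self._ends: deque = deque()

    def push(self, element: Element) -> int:
        # Expire every element whose validity ended at or before the new start.
        while self._ends and self._ends[0] <= element.start:
            self._ends.popleft()
        self._ends.append(element.end)
        return len(self._ends)


# Hypothetical input: three readings with timestamps 1, 3 and 6, window size 4.
readings = [("a", 1), ("b", 3), ("c", 6)]
windowed = list(time_window(readings, size=4))

counter = SlidingCount()
counts = [counter.push(e) for e in windowed]
```

In this sketch, `snapshot(windowed, t)` gives the window content at any single point in time, which mirrors the idea of assigning a continuous query an exact meaning at every instant, while `SlidingCount` only keeps the end timestamps of currently valid elements, so its state stays bounded by the window size rather than by the length of the stream.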