Density Estimation over Data Streams
A growing number of real-world applications share the property that they have to deal with transient data arriving in massive volumes, so-called data streams. The characteristics of these data streams render their analysis by means of conventional techniques extremely difficult, in the majority of c...
|PDF Full Text
No Tags, Be the first to tag this record!
|A growing number of real-world applications share the property that they have to deal with transient data arriving in massive volumes, so-called data streams. The characteristics of these data streams render their analysis by means of conventional techniques extremely difficult, in the majority of cases even impossible. In fact, to be applicable to data streams, a technique has to meet rigid processing requirements such as a single pass over the stream and strict limitations on the available memory. With respect to these requirements, an adequate real-time mining and analysis of a data stream has turned out to be a highly demanding task, which can only be tackled by developing tailored techniques. In this work, we delve more deeply into an important building block of many data mining and analysis approaches, namely density estimation, with the objective to adapt it to the data stream scenario. Density estimation, a vital research topic in mathematical statistics, aims to capture the unknown distribution of a real-valued data set by estimating its underlying probability density function. A suitable density estimate can be exploited to gain insight into the main features of the data; it can also prepare the ground for a variety of further analysis tasks. Due to the importance of density estimation, various methods have been developed in recent decades. Among the most promising methods are the nonparametric ones. They stand out from the parametric approaches as they only make very weak assumptions on the data; they let the data speak for themselves. Two important members of the class of nonparametric methods are based on kernels and wavelets. Kernel as well as wavelet density estimators are theoretically founded and practically approved, but their computational burden severely impedes their applicability. In particular, neither kernel nor wavelet density estimators can be computed over real-valued data streams because their computational complexity collides with the processing requirements for streams. However, as density estimation can be a crucial component of data stream analysis, we devoted ourselves to the adaptation of kernel and wavelet density estimators to data streams in compliance with the processing requirements for streams. Our solution for wavelet density estimators, termed compressed-cumulative WDEs, relies on processing blocks of data of the stream, where each data block is associated with a separate estimator. While processing the stream, the block estimators are successively merged into one estimator. Our solution for kernel density estimators utilizes specific building blocks, termed Cluster Kernels, to derive an estimator for the stream on demand. These Cluster Kernels summarize already processed elements in terms of local statistics, which are used for an approximate resampling of elements. Both solutions exhibit a generic design to facilitate the integration of new strategies for their main processing steps. In order to assess them, we conducted an extensive experimental study, including a thorough comparison with competitive techniques. As the results of this study reveal, our techniques outperform all competitors. Overall, they succeed in estimating the unknown density of a data stream.