Table of Contents:
The development of many small and big software systems is not conceivable without state-of-the-art (object-) relational database management systems. But nowadays, a lot of interesting data is not completely structured and cannot efficiently be managed by existing systems. As a matter of fact, new standardized systems for unstructured and semi-structured data are strongly needed.
The existing gap recently became narrowed by the introduction of native XML database systems. Such systems are based on the standardized data format XML of the World Wide Web consortium (W3C). Native XML database systems support a lot of open standards like XSchema for grammars, XPath and XQuery as query languages, XSLT for transformations and DOM and SAX for connecting to applications.
This dissertation is concerned with the foundations of native XML database systems. Structures from literature are optimized and new structures are proposed. It is set a high value on a solid test-bed for the algorithms. Therefore, a framework has been developed which is publicly available in the Java-library XXL.
The XXL library provides a lot of high level components for a comfortable implementation of database systems. The most important components are a generic query processor and advanced index structures. During this work, a lot of new components have been implemented and integrated into XXL, for example a component for low level disk access, a freely configurable record manager, and a relational database framework.
The central concern of this work is the optimization of the storage level of native XML database systems. Therefore, it is important to keep the tree structure while mapping XML documents to external memory. By this means, query processing needs fewer hard disk accesses, which is most important for the performance of the database. Similar to R-trees, XML storage structures are based on split algorithms, which follow certain heuristics. The proposed so called OneCutSplit with Scaffold algorithm showed to be clearly superior to the well-known algorithms from literature.
For the fast insertion of documents into a native XML database, a bulk loading mechanism has been implemented. It could be shown that the structure of bulk loaded documents is much better than the structure of documents which were constructed by split algorithms. A better structure leads to a noticeable reduction of response times of queries.
The technology of XML database systems is still in its infancy and so, a lot of improvements can be achieved. For speeding up query processing, index structures are indispensable. For this purpose, a new signature index based on aggregates has been developed and also integrated into the XML storage engine. Tests have shown that this technique yields to a large benefit when evaluating XPath queries.
Through the development of the database framework inside XXL, tests comparing the performance of native XML storage with relational XML storage became possible. Currently, no other single system can perform similar fair runtime tests. It could be shown that native XML storage even has a good performance for simple XPath queries. When regarding navigational and update queries, the XML storage engine shows superior performance compared to relational engines.
Query processing on XML data is not only restricted to XPath and XQuery. For the management of large amounts of XML documents, new operators are necessary, which perform mappings from XML documents to new XML documents. This is analogous to the relational algebra, where the base type of the operators is tuple. Compared to the relational model, the reduction to a small set of operators is not possible in the context of XML.
In this thesis, a lot of new operators are presented, that are not specific for query processing inside XML database systems. They can also be used for querying the internet. Through the developed framework, it is possible to pose self-defined queries over internet sources in an easy manner.