Efficient Algorithms for Robust Pattern Mining on Structured Objects with Applications to Structure-Based Drug Design
Many complex real-world objects cannot be mapped onto “flat” feature vector representations without a significant loss of information. Recently, the use of graph-based models gained increasing attention in the field of data mining. This thesis presents efficient methods for fuzzy pattern mining in...
|PDF Full Text
No Tags, Be the first to tag this record!
|Many complex real-world objects cannot be mapped onto “flat” feature vector representations without a significant loss of information. Recently, the use of graph-based models gained increasing attention in the field of data mining. This
thesis presents efficient methods for fuzzy pattern mining in databases of such graph representations. It is assumed that relatively homogeneous sets of graphs originating from common classes or clusters are given. The class assignment can either result from background knowledge or from a previously performed cluster analysis. The aim of the developed methods is to derive patterns that are characteristic for a given cluster (i.e., that occur in all or most of the members of a cluster). Additionally, it is of interest to derive patterns that are discriminative among different clusters (i.e., that occur in most of the members of one cluster but are missing in most of the members of a different cluster). As a certain variability can be observed also among members of the same cluster, one cannot expect that a characteristic pattern is present in all members of the cluster in identical form. The methods presented in this thesis thus allow also for variations of a characteristic pattern in its different occurrences. This is achieved by arranging the different graph representations in the common framework of a multiple graph alignment. In a graph alignment, nodes with different labels can be assigned to each other (“mismatch”) and dummy nodes can be inserted into the graphs to compensate “missing” nodes. This increases the robustness of the approach against variations in the considered graphs. A scoring system is used to penalize such constellations as they should be allowed only if no “valid” match for a certain node exists in the remaining graph instances. The calculation of a graph alignment can thus be interpreted as an optimization problem and a number of different strategies for the calculation of optimal graph alignments are examined. Once an optimal graph alignment has been calculated, characteristic and discriminative patterns can be derived and a number of different analyses can be performed. In particular, the concept of multiple graph alignment yields a mechanism for mapping the aligned graphs onto “flat” feature vectors that preserves the correspondence among the individual features of the graph and vector representation.
Relibase+/Cavbase, a huge database of putative protein binding sites serves as the main motivation for the methods developed in this thesis. In Cavbase, the protein binding pockets are represented through pseudocenters that describe the spatial position of certain physicochemical properties. In a first-order approximation, graphs can be used as descriptors for the Cavbase entries. The methods presented in this thesis can therefore be used to analyze families of related protein binding pockets. The derived patterns can be used to guide the development of novel and selective inhibitors. Regions of a binding pocket that are common to the members of a protein family (i.e., characteristic patterns) could be addressed to gain affinity for the whole protein family. Regions that vary among the different members of a protein family (i.e., discriminative patterns) indicate features that can be addressed to gain selectivity within the examined protein family.