Dissection of Complex Genetic Correlations into Interaction Effects
Living systems are overwhelmingly complex and consist of many interacting parts. Already the quantitative characterization of a single human cell type on genetic level requires at least the measurement of 20000 gene expressions. It remains a big challenge for theoretical approaches to discover patte...
|PDF Full Text
No Tags, Be the first to tag this record!
|Living systems are overwhelmingly complex and consist of many interacting parts. Already the quantitative characterization of a single human cell type on genetic level requires at least the measurement of 20000 gene expressions. It remains a big challenge for theoretical approaches to discover patterns in these signals that represent specific interactions in such systems. A major problem is that available standard procedures summarize gene expressions in a hard-to-interpret way. For example, principal components represent axes of maximal variance in the gene vector space and thus often correspond to a superposition of multiple different gene regulation effects (e.g. I.1.4).
Here, a novel approach to analyze and interpret such complex data is developed (Chapter II). It is based on an extremum principle that identifies an axis in the gene vector space to which as many as possible samples are correlated as highly as possible (II.3). This axis is maximally specific and thus most probably corresponds to exactly one gene regulation effect, making it considerably easier to interpret than principle components. To stabilize and optimize effect discovery, axes in the sample vector space are identified simultaneously. Genes and samples are always handled symmetrically by the algorithm. While sufficient for effect discovery, effect axes can only linearly approximate regulation laws. To represent a broader class of nonlinear regulations, including saturation effects or activity thresholds (e.g. II.1.1.2), a bimonotonic effect model is defined (II.2.1.2). A corresponding regression is realized that is monotonic over projections of samples (or genes) onto discovered gene (or sample) axes. Resulting effect curves can approximate regulation laws precisely (II.4.1).
This enables the dissection of exclusively the discovered effect from the signal (II.4.2). Signal parts from other potentially overlapping effects remain untouched. This continues iteratively. In this way, the high-dimensional initial signal (II.2.1.1) can be dissected into highly specific effects.
Method validation demonstrates that superposed effects of various size, shape and signal strength can be dissected reliably (II.6.2). Simulated laws of regulation are reconstructed with high correlation. Detection limits, e.g. for signal strength or for missing values, lie above practical requirements (II.6.4). The novel approach is systematically compared with standard procedures such as principal component analysis. Signal dissection is shown to have clear advantages, especially for many overlapping effects of comparable size (II.6.3).
An ideal test field for such approaches is cancer cells, as they may be driven by multiple overlapping gene regulation networks that are largely unknown. Additionally, quantification and classification of cancer cells by their particular set of driving gene regulations is a prerequisite towards precision medicine. To validate the novel method against real biological data, it is applied to gene expressions of over 1000 tumor samples from Diffuse Large B-Cell Lymphoma (DLBCL) patients (Chapter III). Two already known subtypes of this disease (cf. I.1.2.1) with significantly different survival following the same chemotherapy were originally also discovered as a gene expression effect. These subtypes can only be precisely determined by this effect on molecular level. Such previous results offer a possibility for method validation and indeed, this effect has been unsupervisedly rediscovered (III.3.2.2).
Several additional biologically relevant effects have been discovered and validated across four patient cohorts. Multivariate analyses (III.2) identify combinations of validated effects that can predict significant differences in patient survival. One novel effect possesses an even higher predictive value (cf. III.2.5.1) than the rediscovered subtype effect and is genetically more specific (cf. III.3.3.1). A trained and validated Cox survival model (III.2.5) can predict significant survival differences within known DLBCL subtypes (III.2.5.6), demonstrating that they are genetically heterogeneous as well. Detailed biostatistical evaluations of all survival effects (III.3.3) may help to clarify the molecular pathogenesis of DLBCL.
Furthermore, the applicability of signal dissection is not limited to biological data. For instance, dissecting spectral energy distributions of stars observed in astrophysics might be useful to discover laws of light emission.