Strategies for Genome-Wide Association Analyses of Raw Copy Number Variation Data
Copy number variations (CNVs), a type of genetic variation in which a large sequence of nucleotides is repeated in tandem a number of times that varies among individuals of a population, have gained much attention with regard to human phenotypic diversity. Recent efforts to map human structural variation have shown that CNVs affect a significantly larger proportion of the human genome than single nucleotide polymorphisms (SNPs). This gave rise to the idea that CNVs play an important role in explaining part of the large proportion of phenotypic variance in a population that is due to genetic factors but could not yet be explained by common SNPs. Current data from SNP genotyping arrays were found to be useful not only for the genome-wide genotyping of SNPs, but also for the detection of CNVs. However, because the accuracy of CNV detection is mostly still inadequate and methods for association testing are scarce, designing a genome-wide CNV association study can be challenging.
This thesis explored four strategies for the genome-wide association analysis of raw CNV data derived from the Affymetrix Genome-Wide Human SNP Array 6.0. First, the two most commonly used strategies are presented and applied to real data examples for the phenotypes early-onset extreme obesity and childhood attention-deficit/hyperactivity disorder (ADHD). In the first strategy, raw intensity values reflecting individual copy numbers are tested directly for an association with the risk of disease, without providing or using any information about CNV genotypes. In the second strategy, genome-wide CNV analysis is performed as a two-step procedure: individual CNV genotypes are called first and are then used to test for CNV-phenotype associations. Second, two extensions of the standard strategies are introduced, each of which forms a strategy of its own and aims to overcome problems and weaknesses of the respective widely used strategy. The first proposed strategy accounts for the fact that thousands of the CNV markers provided on the array are located in genomic regions without underlying copy number variability, and thus proposes testing only a pre-selected set of relevant and informative intensity values for association in order to ease the multiple testing burden. The second proposed strategy addresses the known inaccuracy of CNV calling, especially in common CNV regions, which is often caused to some extent by the high population frequency of the CNV and the consequent inadequacy of estimating CNV genotypes relative to the sample's mean or median hybridization intensity values. Instead, the use of intensity reference values estimated in a Gaussian mixture model framework, called MCMR, is investigated using data examples for the HapMap and replicate samples as well as the previously analysed obesity data set.
This obesity sample was analysed with all four genome-wide CNV analysis strategies, which allowed a comparison of the strategies' applicability and performance.
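The motivation for mixture-based reference intensities can be illustrated with a minimal sketch. It is not the MCMR implementation from the thesis: the two-component EM fit, the function names `fit_gmm_1d` and `mixture_reference`, the simulated log2 intensities, and the rule that the highest-intensity component represents the two-copy state (plausible only for a deletion-only region) are all assumptions made for illustration. The sketch simulates a common deletion carried by 70% of the samples and compares the sample median with the mean of the highest-intensity mixture component as a candidate reference value.

```python
import math
import random
import statistics


def fit_gmm_1d(values, k=2, n_iter=100):
    """Fit a k-component 1D Gaussian mixture by EM.

    Returns (weights, means, sds) with components ordered by increasing mean.
    """
    xs = sorted(values)
    n = len(xs)
    # Initialise means at evenly spaced quantiles, with a shared starting spread.
    means = [xs[int((j + 0.5) * n / k)] for j in range(k)]
    sds = [max(statistics.pstdev(xs) / k, 1e-3)] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each value.
        resp = []
        for x in xs:
            dens = [
                w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
                for w, m, s in zip(weights, means, sds)
            ]
            total = sum(dens) or 1e-300  # guard against numerical underflow
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and spread of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-9:
                continue  # leave an empty component unchanged
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sds[j] = max(math.sqrt(var), 1e-3)  # floor avoids a collapsing component
    order = sorted(range(k), key=lambda j: means[j])
    return (
        [weights[j] for j in order],
        [means[j] for j in order],
        [sds[j] for j in order],
    )


def mixture_reference(values, k=2):
    """Mean of the highest-intensity component (ASSUMED to be the two-copy state)."""
    _, means, _ = fit_gmm_1d(values, k)
    return means[-1]


# Hypothetical log2 intensities at one marker in a common deletion region:
# 210 samples with one copy (around 9.3) and 90 samples with two copies (around 10.0).
rng = random.Random(0)
intensities = [rng.gauss(9.3, 0.1) for _ in range(210)]
intensities += [rng.gauss(10.0, 0.1) for _ in range(90)]

median_ref = statistics.median(intensities)
mixture_ref = mixture_reference(intensities)
print(f"median-based reference:  {median_ref:.2f}")
print(f"mixture-based reference: {mixture_ref:.2f}")
```

In this simulated setting the median falls inside the one-copy cluster, so intensity ratios relative to the median would make deletion carriers look normal and two-copy samples look like gains. A reference taken from the fitted mixture avoids that shift, provided the component assigned to the two-copy state is chosen correctly.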
The four strategies were observed to vary greatly in terms of computational effort and genetic results. Whereas one of the two standard strategies successfully identified rare CNVs at the PARK2 locus that were associated with ADHD in children at genome-wide statistical significance, neither of these two strategies detected any CNV-obesity association. In contrast, alternative MCMR reference intensity values improved the reliability of CNV calls compared to standard calling in terms of stability, reproducibility and false positive rates. As a consequence, a novel common CNV for early-onset extreme obesity on chromosome 11q11 was identified by applying the proposed analysis strategies. Moreover, a common deletion at chromosome 10q11.22, which was previously reported to be associated with body mass index (BMI), was replicated using one of the proposed strategies.
The results suggest that the choice of genome-wide CNV association analysis strategy may greatly influence genetic results. The strategic investigations presented here give an overview of aspects to consider when planning a genome-wide CNV analysis pipeline, but do not allow general recommendations towards an optimal design.