Yuanfang Guan, Cheryl L. Ackert-Bicknell, Braden Kell, Olga G. Troyanskaya, Matthew A. Hibbs. Functional Genomics Complements Quantitative Genetics in Identifying Disease-Gene Associations. PLoS Computational Biology, In press, 2010.
An ultimate goal of genetic research is to understand the connection between genotype and phenotype in order to improve the diagnosis and treatment of diseases. The quantitative genetics field has developed a suite of statistical methods to associate genetic loci with diseases and phenotypes, including quantitative trait loci (QTL) linkage mapping and genome-wide association studies (GWAS). However, each of these approaches has technical and biological shortcomings. For example, the amount of heritable variation explained by GWAS is often surprisingly small and the resolution of many QTL linkage mapping studies is poor. The predictive power and interpretation of QTL and GWAS results are consequently limited. In this study, we propose a complementary approach to quantitative genetics by interrogating the vast amount of high-throughput genomic data in model organisms to functionally associate genes with phenotypes and diseases. Our algorithm combines the genome-wide functional relationship network for the laboratory mouse and a state-of-the-art machine learning method. We demonstrate the superior accuracy of this algorithm through predicting genes associated with each of 1157 diverse phenotype ontology terms. Comparison between our prediction results and a meta-analysis of quantitative genetic studies reveals both overlapping candidates and distinct, accurate predictions uniquely identified by our approach. Focusing on bone mineral density (BMD), a phenotype related to osteoporotic fracture, we experimentally validated two of our novel predictions (not observed in any previous GWAS/QTL studies) and found significant bone density defects for both Timp2- and Abcg8-deficient mice. Our results suggest that the integration of functional genomics data into networks, which itself is informative of protein function and interactions, can successfully be utilized as a complementary approach to quantitative genetics to predict disease risks.
Supplemental Files from Manuscript:
- Figure S1 - Well-defined, high-level MP terms were obtained from MGI, which represent a wide sampling of phenotypes. Precisions at different levels of recall were calculated for both the summed weight method (left) and the network-based SVM method (right), where the latter shows significant improvement.
- Text S1 - Description of methods for integration of diverse data for constructing a functional relationship network
- Text S2 - Training SVMs on raw data as a baseline for performance evaluation
- Table S1 - Supporting evidence for top connectors to the candidate genes Timp2 and Abcg8
- Table S2 - Interaction weights (posteriors) of local networks surrounding Timp2 and Abcg8
- Table S3 - Top 100 genes predicted for association with 'abnormal bone mineralization'
Additional Data Files:
For both the SVM method and the summed weight method, we provide the original score and the estimated probabilities. The original score for the SVM method is the median SVM output across all bootstrap rounds. The original score for the summed weight method is the median of the sum of all links between the gene and all positive examples from the gold standard. The outputs from both the summed weight and SVM methods are on an arbitrary scale, and are consequently not intuitive to understand. To make these outputs more comprehensible, we also provide the probability of being annotated to a phenotype, obtained by fitting the output distributions of positive and negative examples with two normal distributions. See the manuscript for details.
- SVM Results: SVM_Raw_Output.tgz [175MB]; SVM_Probabilities.gz [99MB]
- Summed Weight Results: SumWt_Raw_Output.tgz [106MB]
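As a rough illustration of how a raw score can be turned into a probability by fitting two normal distributions, the following Python sketch fits one normal to the scores of positive examples and one to the scores of negative examples, then applies Bayes' rule. It is not the code used for the manuscript: the function names are hypothetical, the class prior is assumed to be the fraction of positive examples, and the toy scores are invented for demonstration only.

```python
import math
from statistics import mean, pstdev


def fit_normal(values):
    """Return (mean, standard deviation) of a list of scores."""
    return mean(values), pstdev(values)


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def annotation_probability(score, pos_scores, neg_scores):
    """Estimate P(annotated | score) from two fitted normals.

    Assumption: the prior is the fraction of positive examples;
    the manuscript may use a different prior.
    """
    mu_p, sd_p = fit_normal(pos_scores)
    mu_n, sd_n = fit_normal(neg_scores)
    prior_p = len(pos_scores) / (len(pos_scores) + len(neg_scores))
    num = prior_p * normal_pdf(score, mu_p, sd_p)
    den = num + (1.0 - prior_p) * normal_pdf(score, mu_n, sd_n)
    if den == 0.0:
        # Both densities underflowed; fall back to the prior.
        return prior_p
    return num / den


# Toy scores on an arbitrary scale (invented for illustration).
pos = [1.0, 1.5, 2.0, 2.5]
neg = [-2.0, -1.5, -1.0, -0.5]
high = annotation_probability(2.0, pos, neg)
low = annotation_probability(-2.0, pos, neg)
```

With these toy inputs, a score near the positive examples yields a probability close to 1 and a score near the negative examples yields a probability close to 0, which is the sense in which the probabilities are easier to interpret than the raw outputs.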
Evaluation of Predictions
Integrated Functional Networks (formatted as pair-wise posteriors)
- Network EXCLUDING phenotype data [567MB] (version used for prediction results)
- Network INCLUDING phenotype data [917MB]
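Because the networks are distributed as pair-wise posteriors, the summed weight score described above can be sketched directly from a list of weighted gene pairs. The Python example below is a minimal sketch under stated assumptions: edges are held in memory as (gene, gene, weight) tuples rather than read from the downloadable files, whose exact layout is not described here; the gene names and weights are placeholders; and the median over bootstrap rounds is omitted.

```python
from collections import defaultdict


def summed_weight_scores(edges, positives):
    """Sum, for each gene, the weights of its links to positive examples.

    edges: iterable of (gene_a, gene_b, weight) tuples (assumed format).
    positives: genes annotated to the phenotype in the gold standard.
    """
    positives = set(positives)
    scores = defaultdict(float)
    for gene_a, gene_b, weight in edges:
        if gene_a == gene_b:
            continue  # ignore self-links
        # The network is undirected, so credit both endpoints.
        if gene_b in positives:
            scores[gene_a] += weight
        if gene_a in positives:
            scores[gene_b] += weight
    return dict(scores)


# Placeholder network: g3 and g4 play the role of known positives.
toy_edges = [
    ("g1", "g3", 0.9),
    ("g1", "g4", 0.4),
    ("g2", "g3", 0.7),
    ("g3", "g4", 0.2),
]
toy_scores = summed_weight_scores(toy_edges, {"g3", "g4"})
# Rank candidate genes (non-positives) by descending score.
ranking = sorted(
    (g for g in toy_scores if g not in {"g3", "g4"}),
    key=toy_scores.get,
    reverse=True,
)
```

Here the candidate linked to both positives with high weights ranks first; in practice the score would be computed on the network that excludes phenotype data, so that predictions are not informed by the annotations being predicted.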
Raw Data and Source Code
- All network integration input data in pairwise format [tgz, 2.1GB]
- Source code, scripts, configuration files, and sample output [tgz, 568MB]
Results of Survey for Biological Specificity