Recent technological advances have expanded the breadth of available omic data, from whole-genome sequencing data, to extensive transcriptomic, methylomic and metabolomic data. A key goal of analyses of these data is the identification of effective models that predict phenotypic traits and outcomes, elucidating important biomarkers and generating important insights into the genetic underpinnings of the heritability of complex traits. There is still a need for powerful and advanced analysis strategies to fully harness the utility of these comprehensive high-throughput data, identifying true associations and reducing the number of false associations. In this Review, we explore the emerging approaches for data integration, including meta-dimensional and multi-staged analyses, which aim to deepen our understanding of the role of genetics and genomics in complex outcomes. With the use and further development of these approaches, an improved understanding of the relationship between genomic variation and human phenotypes may be revealed.
Scleroderma is a clinically heterogeneous disease with a complex phenotype. The disease is characterized by vascular dysfunction, tissue fibrosis, internal organ dysfunction, and immune dysfunction resulting in autoantibody production.
We analyzed the genome-wide patterns of gene expression with DNA microarrays in skin biopsies from distinct scleroderma subsets including 17 patients with systemic sclerosis (SSc) with diffuse scleroderma (dSSc), 7 patients with SSc with limited scleroderma (lSSc), 3 patients with morphea, and 6 healthy controls. 61 skin biopsies were analyzed in a total of 75 microarray hybridizations. Analysis by hierarchical clustering demonstrates nearly identical patterns of gene expression in 17 out of 22 of the forearm and back skin pairs of SSc patients. Using this property of the gene expression, we selected a set of 'intrinsic' genes and analyzed the inherent data-driven groupings. Distinct patterns of gene expression separate patients with dSSc from those with lSSc and both are easily distinguished from normal controls. Our data show three distinct patient groups among the patients with dSSc and two groups among patients with lSSc. Each group can be distinguished by unique gene expression signatures indicative of proliferating cells, immune infiltrates and a fibrotic program. The intrinsic groups are statistically significant (p<0.001) and each has been mapped to clinical covariates of modified Rodnan skin score, interstitial lung disease, gastrointestinal involvement, digital ulcers, Raynaud's phenomenon and disease duration. We report a 177-gene signature that is associated with severity of skin disease in dSSc.
Genome-wide gene expression profiling of skin biopsies demonstrates that the heterogeneity in scleroderma can be measured quantitatively with DNA microarrays. The diversity in gene expression demonstrates multiple distinct gene expression programs in the skin of patients with scleroderma.
With the abundance of information and analysis results being collected for genetic loci, user-friendly and flexible data visualization approaches can inform and improve the analysis and dissemination of these data. A chromosomal ideogram is an idealized graphic representation of chromosomes. Ideograms can be combined with overlaid points, lines, and/or shapes, to provide summary information from studies of various kinds, such as genome-wide association studies or phenome-wide association studies, coupled with genomic location information. To facilitate visualizing varied data in multiple ways using ideograms, we have developed a flexible software tool called PhenoGram which exists as a web-based tool and also a command-line program.
With PhenoGram researchers can create chromosomal ideograms annotated with lines in color at specific base-pair locations, or colored base-pair to base-pair regions, with or without other annotation. PhenoGram allows for annotation of chromosomal locations and/or regions with shapes in different colors, gene identifiers, or other text. PhenoGram also allows for creation of plots showing expanded chromosomal locations, providing a way to show results for specific chromosomal regions in greater detail. We have now used PhenoGram to produce a variety of different plots, and provide these as examples herein. These plots include visualization of the genomic coverage of SNPs from a genotyping array, highlighting the chromosomal coverage of imputed SNPs, copy-number variation region coverage, as well as plots similar to the NHGRI GWA Catalog of genome-wide association results.
PhenoGram is a versatile, user-friendly software tool fostering the exploration and sharing of genomic information. Through visualization of data, researchers can both explore and share complex results, facilitating a greater understanding of these data.
Phenome-wide association studies (PheWAS) are a high-throughput approach to evaluate comprehensive associations between genetic variants and a wide range of phenotypic measures. PheWAS has varying sample sizes for quantitative traits, and variable numbers of cases and controls for binary traits across the many phenotypes of interest, which can affect the statistical power to detect associations. The motivation of this study is to investigate the various parameters which affect the estimation of statistical power in PheWAS, including sample size, case-control ratio, minor allele frequency, and disease penetrance.
We performed a PheWAS simulation study, where we investigated variations in statistical power based on different parameters, such as overall sample size, number of cases, case-control ratio, minor allele frequency, and disease penetrance. The simulation was performed on both binary and quantitative phenotypic measures. Our simulation on binary traits suggests that the number of cases has more impact on statistical power than the case to control ratio; also, we found that a sample size of 200 cases or more maintains the statistical power to identify associations for common variants. For quantitative traits, a sample size of 1000 or more individuals performed best in the power calculations. We focused on common genetic variants (MAF > 0.01) in this study; however, in future studies, we will be extending this effort to perform similar simulations on rare variants.
This study provides a series of PheWAS simulation analyses that can be used to estimate statistical power for some potential scenarios. These results can be used to provide guidelines for appropriate study design for future PheWAS analyses.
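The kind of binary-trait power simulation described above can be illustrated with a minimal sketch. It is not the study's simulation framework: the allelic two-proportion z-test, the Hardy-Weinberg sampling scheme, the significance level of 0.05, and all function names and parameter values below are assumptions chosen only to show how power can be estimated empirically from the number of cases, the number of controls, the minor allele frequency, and the per-genotype penetrance.

```python
import math
import random
from statistics import NormalDist


def genotype_probs(maf, risks):
    """Genotype frequencies (0, 1, 2 minor alleles) under Hardy-Weinberg
    equilibrium, reweighted by a per-genotype probability (Bayes' rule)."""
    hwe = [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]
    weighted = [h * r for h, r in zip(hwe, risks)]
    total = sum(weighted)
    return [w / total for w in weighted]


def allelic_test_p(case_alleles, n_cases, ctrl_alleles, n_ctrls):
    """Two-sided pooled two-proportion z-test on minor-allele counts
    (each person contributes two alleles)."""
    n1, n2 = 2 * n_cases, 2 * n_ctrls
    pooled = (case_alleles + ctrl_alleles) / (n1 + n2)
    var = pooled * (1 - pooled) * (1 / n1 + 1 / n2)
    if var == 0:
        return 1.0  # no variation observed, so nothing to test
    z = (case_alleles / n1 - ctrl_alleles / n2) / math.sqrt(var)
    return 2 * NormalDist().cdf(-abs(z))


def estimate_power(n_cases, n_ctrls, maf, penetrance,
                   alpha=0.05, n_sims=500, seed=0):
    """Fraction of simulated case-control data sets with p < alpha.

    penetrance: P(disease | genotype) for 0, 1, 2 minor alleles (hypothetical values).
    """
    rng = random.Random(seed)
    case_probs = genotype_probs(maf, penetrance)
    ctrl_probs = genotype_probs(maf, [1 - f for f in penetrance])
    hits = 0
    for _ in range(n_sims):
        case_alleles = sum(rng.choices([0, 1, 2], weights=case_probs, k=n_cases))
        ctrl_alleles = sum(rng.choices([0, 1, 2], weights=ctrl_probs, k=n_ctrls))
        if allelic_test_p(case_alleles, n_cases, ctrl_alleles, n_ctrls) < alpha:
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    # Hypothetical scenario: vary the number of cases with a fixed control pool.
    for n_cases in (20, 50, 200):
        power = estimate_power(n_cases, 1000, maf=0.3,
                               penetrance=[0.05, 0.10, 0.20], n_sims=200)
        print(f"cases={n_cases:4d} controls=1000 estimated power={power:.2f}")
```

Holding the control pool fixed while varying the number of cases, as in the usage loop, is one simple way to look at the contribution of case count separately from the case-control ratio.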
Polycystic ovary syndrome is the most common endocrine disorder affecting women of reproductive age. A number of criteria have been developed for clinical diagnosis of polycystic ovary syndrome, with the Rotterdam criteria being the most inclusive. Evidence suggests that polycystic ovary syndrome is significantly heritable, and previous studies have identified genetic variants associated with polycystic ovary syndrome diagnosed using different criteria. The widely adopted electronic health record system provides an opportunity to identify patients with polycystic ovary syndrome using the Rotterdam criteria for genetic studies.
To identify novel associated genetic variants under the same phenotype definition, we extracted polycystic ovary syndrome cases and unaffected controls based on the Rotterdam criteria from the electronic health records and performed a discovery-validation genome-wide association study.
We developed a polycystic ovary syndrome phenotyping algorithm on the basis of the Rotterdam criteria and applied it to 3 electronic health record–linked biobanks to identify cases and controls for genetic study. In the discovery phase, we performed an individual genome-wide association study using the Geisinger MyCode and the Electronic Medical Records and Genomics cohorts, which were then meta-analyzed. We attempted validation of the significant association loci (P<1×10^-6) in the BioVU cohort. All association analyses used logistic regression, assuming an additive genetic model, and adjusted for principal components to control for population stratification. An inverse-variance fixed-effect model was adopted for meta-analysis. In addition, we examined the top variants to evaluate their associations with each criterion in the phenotyping algorithm. We used the STRING database to characterize protein-protein interaction networks.
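As a brief illustration of the inverse-variance fixed-effect model named above, the sketch below pools per-cohort log odds ratios by weighting each estimate by the inverse of its variance. The function name and the example effect sizes and standard errors are hypothetical and are not taken from the study; the study's own software and settings are not described here.

```python
import math
from statistics import NormalDist


def fixed_effect_meta(betas, ses):
    """Inverse-variance fixed-effect meta-analysis.

    betas: per-cohort effect estimates (e.g. log odds ratios)
    ses:   their standard errors
    Returns the pooled estimate, its standard error, and a two-sided p-value.
    """
    weights = [1.0 / se ** 2 for se in ses]      # w_i = 1 / SE_i^2
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)                  # SE of the pooled estimate
    z = beta / se
    p = 2.0 * NormalDist().cdf(-abs(z))          # two-sided normal p-value
    return beta, se, p


if __name__ == "__main__":
    # Hypothetical log odds ratios and standard errors from two discovery cohorts.
    beta, se, p = fixed_effect_meta([0.30, 0.35], [0.08, 0.10])
    lo, hi = beta - 1.96 * se, beta + 1.96 * se
    print(f"pooled OR={math.exp(beta):.2f} "
          f"[{math.exp(lo):.2f}, {math.exp(hi):.2f}], p={p:.2e}")
```

Because the weights are inverse variances, the larger (more precise) cohort dominates the pooled estimate, and the pooled standard error is never larger than that of the most precise cohort.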
Using the same algorithm based on the Rotterdam criteria, we identified 2995 patients with polycystic ovary syndrome and 53,599 population controls in total (2742 cases and 51,438 controls from the discovery phase; 253 cases and 2161 controls in the validation phase). We identified 1 novel genome-wide significant variant rs17186366 (odds ratio [OR]=1.37 [1.23, 1.54], P=2.8×10^-8) located near SOD2. In addition, 2 loci with suggestive association were also identified: rs113168128 (OR=1.72 [1.42, 2.10], P=5.2×10^-8), an intronic variant of ERBB4 that is independent from the previously published variants, and rs144248326 (OR=2.13 [1.52, 2.86], P=8.45×10^-7), a novel intronic variant in WWTR1. In the further association tests of the top 3 single-nucleotide polymorphisms with each criterion in the polycystic ovary syndrome algorithm, we found that rs17186366 (SOD2) was associated with polycystic ovaries and hyperandrogenism, whereas rs113168128 (ERBB4) and rs144248326 (WWTR1) were mainly associated with oligomenorrhea or infertility. We also validated the previously reported association with DENND1A1. Using the STRING database to characterize protein-protein interactions, we found both ERBB4 and WWTR1 can interact with YAP1, which has been previously associated with polycystic ovary syndrome.
Through a discovery-validation genome-wide association study on polycystic ovary syndrome identified from electronic health records using an algorithm based on Rotterdam criteria, we identified and validated a novel genome-wide significant association with a variant near SOD2. We also identified a novel independent variant within ERBB4 and a suggestive association with WWTR1. With previously identified polycystic ovary syndrome gene YAP1, the ERBB4-YAP1-WWTR1 network suggests involvement of the epidermal growth factor receptor and the Hippo pathway in the multifactorial etiology of polycystic ovary syndrome.
Pulmonary arterial hypertension (PAH) is a common complication for individuals with limited systemic sclerosis (lSSc). The identification and characterization of biomarkers for lSSc-PAH should lead to less invasive screening, a better understanding of pathogenesis, and improved treatment.
Forty-nine PBMC samples were obtained from 21 lSSc subjects without PAH (lSSc-noPAH), 15 lSSc subjects with PAH (lSSc-PAH), and 10 healthy controls; three subjects provided PBMCs one year later. Genome-wide gene expression was measured for each sample. The levels of 89 cytokines were measured in serum from a subset of subjects by Multi-Analyte Profiling (MAP) immunoassays. Gene expression clearly distinguished lSSc samples from healthy controls, and separated lSSc-PAH from lSSc-noPAH patients. Real-time quantitative PCR confirmed increased expression of 9 genes (ICAM1, IFNGR1, IL1B, IL13Ra1, JAK2, AIF1, CCR1, ALAS2, TIMP2) in lSSc-PAH patients. Increased circulating cytokine levels of inflammatory mediators such as TNF-alpha, IL1-beta, ICAM-1, and IL-6, and markers of vascular injury such as VCAM-1, VEGF, and von Willebrand Factor were found in lSSc-PAH subjects.
The gene expression and cytokine profiles of lSSc-PAH patients suggest the presence of activated monocytes, and show markers of vascular injury and inflammation. These genes and factors could serve as biomarkers of PAH involvement in lSSc.
Skin biopsy gene expression was analyzed by DNA microarray from 13 diffuse cutaneous systemic sclerosis (dSSc) patients enrolled in an open-label study of rituximab, 9 dSSc patients not treated with rituximab, and 9 healthy controls. These data recapitulate the patient “intrinsic” gene expression subsets described previously, including fibroproliferative, inflammatory, and normal-like groups. Serial skin biopsies showed consistent and non-progressing gene expression over time, and importantly, the patients in the inflammatory subset do not move to the fibroproliferative subset, and vice versa. We were unable to detect significant differences in gene expression before and after rituximab treatment, consistent with an apparent lack of clinical response. Serial biopsies from each patient stayed within the same gene expression subset, regardless of treatment regimen or the time point at which they were taken. Collectively, these data emphasize the heterogeneous nature of SSc and demonstrate that the intrinsic subsets are an inherent, reproducible, and stable feature of the disease that is independent of disease duration. Moreover, these data have fundamental importance for the future development of personalized therapy for SSc; drugs targeting inflammation are likely to benefit those patients with an inflammatory signature, whereas drugs targeting fibrosis are likely to benefit those with a fibroproliferative signature.
The MYC oncogene contributes to induction and growth of many cancers but the full spectrum of the MYC transcriptional response remains unclear.
Using microarrays, we conducted a detailed kinetic study of genes that respond to MYCN or MYCNDeltaMBII induction in primary human fibroblasts. In parallel, we determined the response to steady state overexpression of MYCN and MYCNDeltaMBII in the same cell type. An overlapping set of 398 genes from the two protocols was designated a 'Core MYC Signature' and used for further analysis. Comparison of the Core MYC Signature to a published study of the genes induced by serum stimulation revealed that only 7.4% of the Core MYC Signature genes are in the Core Serum Response and display similar expression changes to both MYC and serum. Furthermore, more than 50% of the Core MYC Signature genes were not influenced by serum stimulation. In contrast, comparison to a panel of breast cancers revealed a strong concordance in gene expression between the Core MYC Signature and the basal-like breast tumor subtype, which is a subtype with poor prognosis. This concordance was supported by the higher average level of MYC expression in the same tumor samples.
The Core MYC Signature has clinical relevance as this profile can be used to deduce an underlying genetic program that is likely to contribute to a clinical phenotype. Therefore, the presence of the Core MYC Signature may predict clinical responsiveness to therapeutics that are designed to disrupt MYC-mediated phenotypes.
The development of sequencing techniques and statistical methods provides great opportunities for identifying the impact of rare genetic variation on complex traits. However, there is a lack of knowledge on the impact of sample size, case numbers, and the balance of cases vs controls for both burden and dispersion based rare variant association methods. For example, Phenome-Wide Association Studies may have a wide range of case and control sample sizes across hundreds of diagnoses and traits, and with the application of statistical methods to rare variants, it is important to understand the strengths and limitations of the analyses.
We conducted a large-scale simulation of randomly selected low-frequency protein-coding regions using twelve different balanced samples with an equal number of cases and controls as well as twenty-one unbalanced sample scenarios. We further explored statistical performance of different minor allele frequency thresholds and a range of genetic effect sizes. Our simulation results demonstrate that using an unbalanced study design has an overall higher type I error rate for both burden and dispersion tests compared with a balanced study design. Regression has an overall higher type I error with balanced cases and controls, while SKAT has higher type I error for unbalanced case-control scenarios. We also found that both type I error and power were driven by the number of cases in addition to the case to control ratio under large control group scenarios. Based on our power simulations, we observed that a SKAT analysis with case numbers larger than 200 for unbalanced case-control models yielded over 90% power with relatively well controlled type I error. To achieve similar power in regression, over 500 cases are needed. Moreover, SKAT showed higher power to detect associations in unbalanced case-control scenarios than regression.
Our results provide important insights into rare variant association study designs by providing a landscape of type I error and statistical power for a wide range of sample sizes. These results can serve as a benchmark for making decisions about study design for rare variant analyses.
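To make the idea of an empirical type I error estimate concrete, the following sketch simulates a region of rare variants under the null hypothesis and counts how often a simple test rejects. It is a stand-in, not the study's method: it uses a collapsing (carrier versus non-carrier) burden test with a pooled two-proportion z-test rather than regression or SKAT, and the region size, minor allele frequency, sample sizes, and significance level are all hypothetical.

```python
import math
import random
from statistics import NormalDist


def two_proportion_p(x1, n1, x2, n2):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (x1 + x2) / (n1 + n2)
    var = pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2)
    if var == 0.0:
        return 1.0  # no carriers (or only carriers): nothing to test
    z = (x1 / n1 - x2 / n2) / math.sqrt(var)
    return 2.0 * NormalDist().cdf(-abs(z))


def burden_type1_error(n_cases, n_ctrls, n_variants=10, maf=0.005,
                       alpha=0.05, n_sims=500, seed=1):
    """Empirical type I error of a collapsing burden test under the null.

    Each person is a carrier if they hold at least one minor allele at any of
    n_variants independent rare sites; cases and controls share the same
    carrier probability, so every rejection is a false positive.
    """
    rng = random.Random(seed)
    carrier_p = 1.0 - (1.0 - maf) ** (2 * n_variants)
    false_pos = 0
    for _ in range(n_sims):
        case_carriers = sum(rng.random() < carrier_p for _ in range(n_cases))
        ctrl_carriers = sum(rng.random() < carrier_p for _ in range(n_ctrls))
        if two_proportion_p(case_carriers, n_cases,
                            ctrl_carriers, n_ctrls) < alpha:
            false_pos += 1
    return false_pos / n_sims


if __name__ == "__main__":
    # Hypothetical balanced and unbalanced designs.
    for n_cases, n_ctrls in [(500, 500), (50, 2000)]:
        rate = burden_type1_error(n_cases, n_ctrls, n_sims=300)
        print(f"cases={n_cases} controls={n_ctrls} type I error={rate:.3f}")
```

Replacing the shared carrier probability with a higher value in cases turns the same loop into a power estimate, which is how a grid of balanced and unbalanced scenarios could be explored.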