Genome-wide association studies (GWASs) have successfully uncovered thousands of robust associations between common variants and complex traits and diseases. Despite these successes, much of the heritability of these traits remains unexplained. Because low-frequency and rare variants are not tagged by conventional genome-wide genotyping arrays, they may represent an important and understudied component of complex trait genetics. In contrast to common variant GWASs, there are many different types of study designs, assays and analytic techniques that can be utilized for rare variant association studies (RVASs). In this review, we briefly present the different technologies available to identify rare genetic variants, including novel exome arrays. We also compare the different study designs for RVASs and argue that the best design will likely be phenotype-dependent. We discuss the main analytical issues relevant to RVASs, including the different statistical methods that can be used to test genetic associations with rare variants and the various bioinformatic approaches to predicting in silico biological functions for variants. Finally, we describe recent rare variant association findings, highlighting the unexpected conclusion that most rare variants have modest-to-small effect sizes on phenotypic variation. This observation has major implications for our understanding of the genetic architecture of complex traits in the context of the unexplained heritability challenge.
Next-generation sequencing technologies are quickly becoming the preferred approach for characterizing and quantifying entire genomes. Even though data produced from these technologies are proving to be the most informative of any thus far, very little attention has been paid to fundamental design aspects of data collection and analysis, namely sampling, randomization, replication, and blocking. We discuss these concepts in an RNA sequencing framework. Using simulations we demonstrate the benefits of collecting replicated RNA sequencing data according to well known statistical designs that partition the sources of biological and technical variation. Examples of these designs and their corresponding models are presented with the goal of testing differential expression.
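As an illustration of the randomization and blocking ideas mentioned above, the following is a minimal sketch of a randomized complete block layout, in which every block receives one replicate of each condition so that condition is not confounded with block. Treating a sequencing run or lane as the block is an assumption made here for illustration; the function name `blocked_assignment` and the sample labels are hypothetical and do not come from the abstract.

```python
import random


def blocked_assignment(samples_by_condition, seed=0):
    """Assign one replicate of every condition to each block.

    samples_by_condition maps a condition name to a list of sample IDs.
    The number of blocks is the smallest replicate count, so every block
    is complete (assumption: blocks are, e.g., sequencing lanes).
    """
    rng = random.Random(seed)
    n_blocks = min(len(s) for s in samples_by_condition.values())
    blocks = [[] for _ in range(n_blocks)]
    for condition, samples in samples_by_condition.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)  # randomize which replicate goes to which block
        for b in range(n_blocks):
            blocks[b].append((condition, shuffled[b]))
    for block in blocks:
        rng.shuffle(block)  # randomize processing order within a block
    return blocks


# Hypothetical sample IDs: three replicates per condition
design = blocked_assignment({
    "treated": ["T1", "T2", "T3"],
    "control": ["C1", "C2", "C3"],
})
for i, block in enumerate(design, start=1):
    print(f"block {i}: {block}")
```

Because each block contains every condition, block-to-block technical variation can be separated from the condition effect when differential expression is modelled.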
Analysis of imputed genotypes is an important and routine component of genome-wide association studies and the increasing size of imputation reference panels has facilitated the ability to impute and test low-frequency variants for associations. In the context of genotype imputation, the true genotype is unknown and genotypes are inferred with uncertainty using statistical models. Here, we present a novel method for integrating imputation uncertainty into statistical association tests using a fully conditional multiple imputation (MI) approach which is implemented using the Substantive Model Compatible Fully Conditional Specification (SMCFCS). We compared the performance of this method to an unconditional MI and two additional approaches that have been shown to demonstrate excellent performance: regression with dosages and a mixture of regression models (MRM).
Our simulations considered a range of allele frequencies and imputation qualities based on data from the UK Biobank. We found that the unconditional MI was computationally costly and overly conservative across a wide range of settings. Analyzing data with Dosage, MRM, or MI SMCFCS resulted in greater power, including for low-frequency variants, compared to unconditional MI while effectively controlling type I error rates. MRM and MI SMCFCS are both more computationally intensive than using Dosage.
The unconditional MI approach for association testing is overly conservative and we do not recommend its use in the context of imputed genotypes. Given its performance, speed, and ease of implementation, we recommend using Dosage for imputed genotypes with MAF ≥ 0.001 and Rsq ≥ 0.3.
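To make the dosage approach concrete, here is a minimal sketch assuming the usual definition of dosage as the expected alternate-allele count under the imputed genotype probabilities, followed by an ordinary least-squares fit of a quantitative phenotype on that dosage. The helper names `dosage` and `ols_fit` and all data values are hypothetical; this is not the authors' implementation, and a real analysis would also include covariates and a test statistic.

```python
def dosage(probs):
    """Expected alternate-allele count from (P(0), P(1), P(2))."""
    p0, p1, p2 = probs
    return p1 + 2.0 * p2


def ols_fit(x, y):
    """Return (slope, intercept) of the least-squares line y ~ x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up imputed genotype probabilities for five individuals
genotype_probs = [
    (0.90, 0.10, 0.00),
    (0.05, 0.90, 0.05),
    (0.00, 0.20, 0.80),
    (0.70, 0.30, 0.00),
    (0.10, 0.60, 0.30),
]
phenotype = [1.0, 2.1, 3.0, 1.2, 2.2]  # made-up trait values

dosages = [dosage(p) for p in genotype_probs]
slope, intercept = ols_fit(dosages, phenotype)
print(dosages, slope, intercept)
```

The dosage collapses imputation uncertainty into a single number per individual, which is why it is cheap compared with approaches that fit the model over multiple imputed data sets.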
Polyploidy is generally not tolerated in animals, but is widespread in plant genomes and may result in extensive genetic redundancy. The fate of duplicated genes is poorly understood, both functionally and evolutionarily. Soybean (Glycine max L.) has undergone two separate polyploidy events (13 and 59 million years ago) that have resulted in 75% of its genes being present in multiple copies. It therefore constitutes a good model to study the impact of whole‐genome duplication on gene expression. Using RNA‐seq, we tested the functional fate of a set of approximately 18 000 duplicated genes. Across seven tissues tested, approximately 50% of paralogs were differentially expressed and thus had undergone expression sub‐functionalization. Based on gene ontology and expression data, our analysis also revealed that only a small proportion of the duplicated genes have been neo‐functionalized or non‐functionalized. In addition, duplicated genes were often found in collinear blocks, and several blocks of duplicated genes were co‐regulated, suggesting some type of epigenetic or positional regulation. We also found that transcription factors and ribosomal protein genes were differentially expressed in many tissues, suggesting that the main consequence of polyploidy in soybean may be at the regulatory level.
C-reactive protein (CRP) is a systemic inflammation marker that predicts future cardiovascular risk. CRP levels are higher in African Americans and Hispanic Americans than in European Americans, but the genetic determinants of CRP in these admixed United States minority populations are largely unknown. We performed genome-wide association studies (GWASs) of 8,280 African American (AA) and 3,548 Hispanic American (HA) postmenopausal women from the Women's Health Initiative SNP Health Association Resource. We discovered and validated a CRP-associated variant of triggering receptors expressed by myeloid cells 2 (TREM2) in chromosomal region 6p21 (p = 10^-10). The TREM2 variant associated with higher CRP is common in Africa but rare in other ancestral populations. In AA women, the CRP region in 1q23 contained a strong admixture association signal (p = 10^-17), which appears to be related to several independent CRP-associated alleles; the strongest of these is present only in African ancestral populations and is associated with higher CRP. Of the other genomic loci previously associated with CRP through GWASs of European populations, most loci (LEPR, IL1RN, IL6R, GCKR, NLRP3, HNF1A, HNF4A, and APOC1) showed consistent patterns of association with CRP in AA and HA women. In summary, we have identified a common TREM2 variant associated with CRP in United States minority populations. The genetic architecture underlying the CRP phenotype in AA women is complex and involves genetic variants shared across populations, as well as variants specific to populations of African descent.
In this study, the asymptotic distributions of the likelihood ratio test (LRT), the restricted likelihood ratio test (RLRT), the F and the sequence kernel association test (SKAT) statistics for testing an additive effect of the expected familial relatedness (FR) in a linear mixed model (LMM) are examined based on an eigenvalue approach. First, the covariance structure for modeling the FR effect in an LMM is presented. Then, the multiplicity of eigenvalues for the log‐likelihood and restricted log‐likelihood is established under a replicate family setting and extended to a more general replicate family setting (GRFS) as well. After that, the asymptotic null distributions of LRT, RLRT, F and SKAT statistics under GRFS are derived. The asymptotic null distribution of SKAT for testing genetic rare variants is also constructed. In addition, a simple formula for sample size calculation is provided based on the restricted maximum likelihood estimate of the effect size for the expected FR. Finally, a power comparison of these test statistics for testing the expected FR effect is made via simulation. The four test statistics are also applied to a data set from the UK Biobank.
Familial relatedness (FR) and population structure (PS) are two major sources of genetic correlation. In the human population, both FR and PS can further break down into additive and dominant components to account for potential additive and dominant genetic effects. In this study, besides the classical additive genomic relationship matrix, a dominant genomic relationship matrix is introduced. A link between the additive/dominant genomic relationship matrices and the coancestry (or kinship)/double coancestry coefficients is also established. In addition, a way to separate the FR and PS correlations based on the estimates of coancestry and double coancestry coefficients from the genomic relationship matrices is proposed. A unified linear mixed model is also developed, which can account for both the additive and dominance effects of FR and PS correlations as well as their possible random interactions. Finally, this unified linear mixed model is applied to analyze two study cohorts from UK Biobank.
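For readers unfamiliar with the additive genomic relationship matrix, the sketch below computes one common estimator: genotypes coded 0/1/2 are centered by twice the sample allele frequency, and the cross-products are scaled by 2 Σ p(1 − p). Whether this exact scaling matches the matrix used in the study is an assumption, the dominant matrix introduced there is not reproduced, and the function name `additive_grm` and the toy genotypes are hypothetical.

```python
def additive_grm(genotypes):
    """Additive genomic relationship matrix from a 0/1/2 genotype table.

    genotypes is a list of rows (individuals), each a list of allele
    counts per variant. Allele frequencies are estimated from the sample.
    Assumes at least one variant is polymorphic (non-zero denominator).
    """
    n = len(genotypes)
    m = len(genotypes[0])
    freqs = [sum(row[j] for row in genotypes) / (2.0 * n) for j in range(m)]
    denom = 2.0 * sum(p * (1.0 - p) for p in freqs)
    # Center each genotype by its expected value 2p under the sample frequency
    centered = [[row[j] - 2.0 * freqs[j] for j in range(m)] for row in genotypes]
    return [
        [sum(a * b for a, b in zip(ci, ck)) / denom for ck in centered]
        for ci in centered
    ]


# Toy data: three individuals, three variants
toy = [
    [0, 1, 2],
    [2, 1, 0],
    [1, 1, 1],
]
grm = additive_grm(toy)
for row in grm:
    print([round(v, 3) for v in row])
```

Off-diagonal entries estimate genetic similarity between pairs of individuals relative to the sample average, which is the quantity a linear mixed model uses as the covariance structure of the additive random effect.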
Abstract
Motivation
The Cancer Genome Atlas (TCGA) has greatly advanced cancer research by generating, curating and publicly releasing deeply measured molecular data from thousands of tumor samples. In particular, gene expression measures, both within and across cancer types, have been used to determine the genes and proteins that are active in tumor cells.
Results
To more thoroughly investigate the behavior of gene expression in TCGA tumor samples, we introduce a statistical framework for partitioning the variation in gene expression due to a variety of molecular variables including somatic mutations, transcription factors (TFs), microRNAs, copy number alterations, methylation and germ-line genetic variation. As proof-of-principle, we identify and validate specific TFs that influence the expression of PTPN14 in breast cancer cells.
Availability and implementation
We provide a freely available, user-friendly, browseable interactive web-based application for exploring the results of our transcriptome-wide analyses across 17 different cancers in TCGA at http://ls-shiny-prod.uwm.edu/edge_in_tcga. All TCGA Open Access tier data are available at the Broad Institute GDAC Firehose and were downloaded using the TCGA2STAT R package. TCGA Controlled Access tier data are available via controlled access through the Genomic Data Commons (GDC). R scripts used to download, format and analyze the data and produce the interactive R/Shiny web app have been made available on GitHub at https://github.com/andreamrau/EDGE-in-TCGA.