In RNA-seq differential expression analysis, investigators aim to detect those genes with changes in expression level across conditions, despite technical and biological variability in the observations. A common task is to accurately estimate the effect size, often in terms of a logarithmic fold change (LFC).
When the read counts are low or highly variable, the maximum likelihood estimates for the LFCs have high variance, leading to large estimates not representative of true differences, and poor ranking of genes by effect size. One approach is to introduce filtering thresholds and pseudocounts to exclude or moderate estimated LFCs. Filtering may result in a loss of genes from the analysis with true differences in expression, while pseudocounts provide a limited solution that must be adapted per dataset. Here, we propose the use of a heavy-tailed Cauchy prior distribution for effect sizes, which avoids the use of filter thresholds or pseudocounts. The proposed method, Approximate Posterior Estimation for generalized linear models (apeglm), has lower bias than previously proposed shrinkage estimators, while still reducing variance for those genes with little information for statistical inference.
The apeglm package is available as an R/Bioconductor package at https://bioconductor.org/packages/apeglm, and the methods can be called from within the DESeq2 software.
Supplementary data are available at Bioinformatics online.
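To make the shrinkage idea concrete, here is a minimal sketch, not the apeglm implementation itself (apeglm fits a negative binomial GLM and approximates the posterior): a maximum a posteriori LFC estimate under a heavy-tailed Cauchy prior, using a simpler Poisson likelihood. The function name, prior scale, and counts below are illustrative assumptions.

```python
# Minimal sketch (not the apeglm implementation): MAP estimation of a
# log2 fold change under a heavy-tailed Cauchy prior, with a simple
# Poisson likelihood standing in for apeglm's negative binomial GLM.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import cauchy

def map_lfc(counts, condition, prior_scale=1.0):
    """MAP estimate of the log2 fold change (condition 1 vs 0); illustrative."""
    counts = np.asarray(counts, dtype=float)
    base = counts[condition == 0].mean() + 1e-8   # crude baseline mean

    def neg_log_posterior(lfc):
        mu = base * 2.0 ** (lfc * condition)       # per-sample Poisson mean
        nll = np.sum(mu - counts * np.log(mu))     # Poisson NLL up to a constant
        return nll - cauchy.logpdf(lfc, scale=prior_scale)

    return minimize_scalar(neg_log_posterior, bounds=(-10, 10), method="bounded").x

condition = np.array([0, 0, 0, 1, 1, 1])
print(map_lfc([5, 0, 3, 9, 14, 7], condition))               # low counts: shrunk
print(map_lfc([500, 430, 510, 990, 1100, 950], condition))   # high counts: ~1
```

With low, noisy counts the Cauchy prior pulls the estimate toward zero, while well-measured genes are left nearly unshrunk, which is the behavior the abstract describes.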
Ibrahim, Joseph G.; Chen, Ming-Hui; Gwon, Yeongjin, et al. The power prior: theory and applications. Statistics in Medicine, 10 December 2015, Volume 34, Issue 28. Journal article.
Carvalho, Luiz Max; Ibrahim, Joseph G. On the normalized power prior. Statistics in Medicine, 30 October 2021, Volume 40, Issue 24. Journal article; peer reviewed.
The power prior is a popular tool for constructing informative prior distributions based on historical data. The method consists of raising the likelihood to a discounting factor in order to control the amount of information borrowed from the historical data. However, one often wishes to assign this discounting factor a prior distribution and estimate it jointly with the parameters, which in turn necessitates the computation of a normalizing constant. In this article, we are concerned with how to approximately sample from the joint posterior of the parameters and the discounting factor. We first show a few important properties of the normalizing constant and then use these results to motivate a bisection-type algorithm for computing it on a fixed budget of evaluations. We give a large array of illustrations and discuss cases where the normalizing constant is known in closed form and where it is not. We show that the proposed method produces approximate posteriors that are very close to the exact distributions and also produces posteriors that cover the data-generating parameters with higher probability in the intractable case. Our results suggest that the proposed method is an accurate and easy-to-implement technique to include this normalization, being applicable to a large class of models. They also reinforce the notion that proper inclusion of the normalizing constant is crucial to the drawing of correct inferences and appropriate quantification of uncertainty.
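In the standard notation of this literature, with historical data D_0, initial prior pi_0, and discounting factor a_0, the construction the abstract describes can be written compactly:

```latex
% Power prior for a fixed discounting factor a_0 \in [0, 1]:
\pi(\theta \mid D_0, a_0) \;\propto\; L(\theta \mid D_0)^{a_0}\, \pi_0(\theta)

% Normalized power prior when a_0 is assigned a prior \pi(a_0):
% the normalizing constant
%   c(a_0) = \int L(\theta \mid D_0)^{a_0}\, \pi_0(\theta)\, d\theta
% must be included so that the conditional prior of \theta is proper:
\pi(\theta, a_0 \mid D_0) \;=\;
  \frac{L(\theta \mid D_0)^{a_0}\, \pi_0(\theta)}{c(a_0)}\, \pi(a_0)
```

The computational difficulty the article addresses is that c(a_0) must be evaluated across the range of a_0, which is tractable in closed form only for special models.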
Joint models for longitudinal and survival data are particularly relevant to many cancer clinical trials and observational studies in which longitudinal biomarkers (eg, circulating tumor cells, ...immune response to a vaccine, and quality-of-life measurements) may be highly associated with time to event, such as relapse-free survival or overall survival. In this article, we give an introductory overview on joint modeling and present a general discussion of a broad range of issues that arise in the design and analysis of clinical trials using joint models. To demonstrate our points throughout, we present an analysis from the Eastern Cooperative Oncology Group trial E1193, as well as examine some operating characteristics of joint models through simulation studies.
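As a point of reference, one widely used shared-parameter formulation (not necessarily the exact specification used in the E1193 analysis) pairs a mixed-effects longitudinal submodel with a proportional hazards submodel, linked through the subject-specific biomarker trajectory:

```latex
% Longitudinal submodel: observed biomarker = subject trajectory + error
y_i(t) = m_i(t) + \varepsilon_i(t), \qquad
m_i(t) = \mathbf{x}_i(t)^\top \boldsymbol\beta + \mathbf{z}_i(t)^\top \mathbf{b}_i

% Survival submodel: the hazard depends on the current trajectory value,
% with association parameter \alpha linking the two submodels
h_i(t) = h_0(t) \exp\!\left\{ \boldsymbol\gamma^\top \mathbf{w}_i
  + \alpha\, m_i(t) \right\}
```

The association parameter alpha quantifies how strongly the current (error-free) biomarker value drives the event hazard, which is the linkage the abstract refers to.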
Missing data are a prevailing problem in any type of data analysis. A participant variable is considered missing if the value of the variable (outcome or covariate) for the participant is not observed. In this article, various issues in analyzing studies with missing data are discussed. Particularly, we focus on missing response and/or covariate data for studies with discrete, continuous, or time-to-event endpoints in which generalized linear models, models for longitudinal data such as generalized linear mixed effects models, or Cox regression models are used. We discuss various classifications of missing data that may arise in a study and demonstrate in several situations that the commonly used method of throwing out all participants with any missing data may lead to incorrect results and conclusions. The methods described are applied to data from an Eastern Cooperative Oncology Group phase II clinical trial of liver cancer and a phase III clinical trial of advanced non-small-cell lung cancer. Although the main area of application discussed here is cancer, the issues and methods we discuss apply to any type of study.
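The failure mode the abstract warns about (complete-case analysis) is easy to reproduce in simulation. The sketch below, with made-up data rather than the article's clinical examples, estimates a population mean when the outcome is missing at random given an observed covariate: dropping participants with missing values is biased, while inverse-probability weighting by the (here, known) response model corrects it.

```python
# Illustrative simulation: complete-case bias under MAR missingness,
# where the probability of observing y depends on an observed covariate x.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)            # true mean of y is 2.0

p_obs = 1.0 / (1.0 + np.exp(-(-0.5 + 2.0 * x)))   # observation prob. rises with x
observed = rng.random(n) < p_obs

print("complete-case mean:", y[observed].mean())            # biased upward
print("IPW mean:          ", np.mean(observed * y / p_obs)) # close to 2.0
```

Because larger x makes observation more likely and also predicts larger y, the observed subsample over-represents large outcomes, which is exactly the kind of distortion the article's methods are designed to handle.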
We consider the problem of Bayesian sample size determination for a clinical trial in the presence of historical data that inform the treatment effect. Our broadly applicable, simulation-based ...methodology provides a framework for calibrating the informativeness of a prior while simultaneously identifying the minimum sample size required for a new trial such that the overall design has appropriate power to detect a non-null treatment effect and reasonable type I error control. We develop a comprehensive strategy for eliciting null and alternative sampling prior distributions which are used to define Bayesian generalizations of the traditional notions of type I error control and power. Bayesian type I error control requires that a weighted-average type I error rate not exceed a prespecified threshold. We develop a procedure for generating an appropriately sized Bayesian hypothesis test using a simple partial-borrowing power prior which summarizes the fraction of information borrowed from the historical trial. We present results from simulation studies that demonstrate that a hypothesis test procedure based on this simple power prior is as efficient as those based on more complicated meta-analytic priors, such as normalized power priors or robust mixture priors, when all are held to precise type I error control requirements. We demonstrate our methodology using a real data set to design a follow-up clinical trial with time-to-event endpoint for an investigational treatment in high-risk melanoma.
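The simulation logic can be sketched in a toy normal-mean setting (the article itself treats a time-to-event endpoint; every number and the 0.975 decision rule below are assumptions). A partial-borrowing power prior with discounting factor a0 turns historical data (n0 observations, mean mean0) into the prior theta ~ N(mean0, sigma^2/(a0*n0)); simulating new trials under theta = 0 then estimates the Bayesian type I error rate.

```python
# Toy sketch of simulation-based calibration of a power prior design.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma, n0, mean0, a0, n_new = 1.0, 200, 0.3, 0.5, 100

def reject_rate(theta_true, n_sims=20_000):
    """Fraction of simulated trials with P(theta > 0 | data) > 0.975."""
    ybar = rng.normal(theta_true, sigma / np.sqrt(n_new), size=n_sims)
    prec = a0 * n0 / sigma**2 + n_new / sigma**2        # posterior precision
    post_mean = (a0 * n0 * mean0 + n_new * ybar) / (sigma**2 * prec)
    post_sd = np.sqrt(1.0 / prec)
    return np.mean(1 - norm.cdf(0.0, post_mean, post_sd) > 0.975)

print("type I error at theta = 0:", reject_rate(0.0))   # inflated by borrowing
print("power at theta = 0.3:     ", reject_rate(0.3))
```

Sweeping a0 and the new-trial sample size over a grid, and averaging the type I error over a null sampling prior rather than a single point, is the essence of the calibration the abstract describes.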
A primary challenge in the analysis of RNA-seq data is to identify differentially expressed genes or transcripts while controlling for technical biases. Ideally, a statistical testing procedure should incorporate the inherent uncertainty of the abundance estimates arising from the quantification step. Most popular methods for RNA-seq differential expression analysis fit a parametric model to the counts for each gene or transcript, and a subset of methods can incorporate uncertainty. Previous work has shown that nonparametric models for RNA-seq differential expression may have better control of the false discovery rate, and adapt well to new data types without requiring reformulation of a parametric model. Existing nonparametric models do not take into account inferential uncertainty, leading to an inflated false discovery rate, in particular at the transcript level. We propose a nonparametric model for differential expression analysis using inferential replicate counts, extending the existing SAMseq method to account for inferential uncertainty. We compare our method, Swish, with popular differential expression analysis methods. Swish has improved control of the false discovery rate, in particular for transcripts with high inferential uncertainty. We apply Swish to a single-cell RNA-seq dataset, assessing differential expression between sub-populations of cells, and compare its performance to the Wilcoxon test.
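The core Swish idea can be sketched as follows (a conceptual toy, not the released Swish implementation): compute a rank statistic on each inferential replicate of a transcript's counts, average over replicates so that quantification uncertainty flattens the statistic, and assess significance by permuting condition labels. All data below are made up.

```python
# Conceptual sketch: rank statistic averaged over inferential replicates,
# with a permutation p-value. Not the Swish implementation itself.
import numpy as np
from scipy.stats import mannwhitneyu

def mean_rank_stat(infreps, condition):
    """Average Mann-Whitney U over inferential replicates of one transcript."""
    return np.mean([mannwhitneyu(rep[condition == 1], rep[condition == 0]).statistic
                    for rep in infreps])

rng = np.random.default_rng(42)
condition = np.array([0, 0, 0, 0, 1, 1, 1, 1])
# 20 inferential replicates x 8 samples: a true shift plus replicate-level noise
infreps = rng.poisson(50 + 30 * condition, size=(20, 8)) + rng.poisson(5, size=(20, 8))

obs = mean_rank_stat(infreps, condition)
perms = np.array([mean_rank_stat(infreps, rng.permutation(condition))
                  for _ in range(500)])
null_center = 8.0                      # U is centered at n1 * n2 / 2 = 8
pval = np.mean(np.abs(perms - null_center) >= np.abs(obs - null_center))
print(obs, pval)
```

Averaging over replicates is what penalizes transcripts whose apparent signal is unstable across quantification draws, which is how the method gains FDR control under high inferential uncertainty.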
Expression quantitative trait loci (eQTL) studies are used to understand the regulatory function of non-coding genome-wide association study (GWAS) risk loci, but colocalization alone does not demonstrate a causal relationship of gene expression affecting a trait. Evidence for mediation, that perturbation of gene expression in a given tissue or developmental context will induce a change in the downstream GWAS trait, can be provided by two-sample Mendelian Randomization (MR). Here, we introduce a new statistical method, MRLocus, for Bayesian estimation of the gene-to-trait effect from eQTL and GWAS summary data for loci with evidence of allelic heterogeneity, that is, containing multiple causal variants. MRLocus makes use of a colocalization step applied to each nearly-LD-independent eQTL, followed by an MR analysis step across eQTLs. Additionally, our method involves estimation of the extent of allelic heterogeneity through a dispersion parameter, indicating variable mediation effects from each individual eQTL on the downstream trait. Our method is evaluated against other state-of-the-art methods for estimation of the gene-to-trait mediation effect, using an existing simulation framework. In simulation, MRLocus often has the highest accuracy among competing methods, and in each case provides more accurate estimation of uncertainty as assessed through interval coverage. MRLocus is then applied to five candidate causal genes for mediation of particular GWAS traits, where gene-to-trait effects are concordant with those previously reported. We find that MRLocus's estimation of the causal effect across eQTLs within a locus provides useful information for determining how perturbation of gene expression or individual regulatory elements will affect downstream traits. The MRLocus method is implemented as an R package available at https://mikelove.github.io/mrlocus.
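A highly simplified, non-Bayesian analogue of the MR step conveys the intuition (MRLocus itself fits a hierarchical Bayesian model; the effect sizes below are invented): each nearly-LD-independent eQTL contributes a point (effect on expression, effect on trait), the slope through the origin estimates the gene-to-trait effect, and residual spread around that slope plays the role of the dispersion parameter.

```python
# Simplified analogue of the MR-across-eQTLs step; not the MRLocus model.
import numpy as np

beta_eqtl = np.array([0.42, 0.31, 0.58, 0.25, 0.49])      # made-up eQTL effects
beta_gwas = np.array([0.021, 0.018, 0.025, 0.010, 0.027]) # made-up GWAS effects

slope = np.sum(beta_eqtl * beta_gwas) / np.sum(beta_eqtl**2)  # origin-constrained LS
resid = beta_gwas - slope * beta_eqtl
dispersion = resid.std(ddof=1)   # crude analogue of the heterogeneity parameter
print(f"gene-to-trait effect ~ {slope:.4f}, dispersion ~ {dispersion:.4f}")
```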
We consider a functional linear Cox regression model for characterizing the association between time-to-event data and a set of functional and scalar predictors. The functional linear Cox regression model incorporates a functional principal component analysis for modeling the functional predictors and a high-dimensional Cox regression model to characterize the joint effects of both functional and scalar predictors on the time-to-event data. We develop an algorithm to calculate the maximum approximate partial likelihood estimates of unknown finite and infinite dimensional parameters. We also systematically investigate the rate of convergence of the maximum approximate partial likelihood estimates and a score test statistic for testing the nullity of the slope function associated with the functional predictors. We demonstrate our estimation and testing procedures by using simulations and the analysis of the Alzheimer's Disease Neuroimaging Initiative (ADNI) data. Our real data analyses show that high-dimensional hippocampus surface data may be an important marker for predicting time to conversion to Alzheimer's disease. Data used in the preparation of this article were obtained from the ADNI database (adni.loni.usc.edu).
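The two-stage pipeline can be sketched on simulated data (a sketch of the general FPCA-then-Cox strategy, not the authors' estimator or its asymptotic theory): extract functional principal component scores from densely observed predictor curves via an SVD, then fit a standard Cox model on the leading scores plus scalar covariates. The third-party `lifelines` package and all simulated quantities below are assumptions for illustration.

```python
# Sketch: FPCA scores from functional predictors, then Cox regression.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(7)
n, grid = 300, 50
curves = rng.normal(size=(n, 1)) * np.sin(np.linspace(0, np.pi, grid)) \
         + rng.normal(scale=0.2, size=(n, grid))    # functional predictor
age = rng.normal(70, 8, size=n)                     # scalar covariate

# Step 1: FPCA via SVD of the centered curves; keep the leading K = 3 scores.
centered = curves - curves.mean(axis=0)
U, S, _ = np.linalg.svd(centered, full_matrices=False)
scores = U[:, :3] * S[:3]

# Step 2: simulate event times driven by the first score, censor, fit Cox model.
risk = 0.8 * scores[:, 0] + 0.02 * (age - 70)
T = rng.exponential(np.exp(-risk))                  # higher risk -> earlier event
cutoff = np.quantile(T, 0.7)
E = (T < cutoff).astype(int)                        # ~30% administrative censoring
T = np.minimum(T, cutoff)

df = pd.DataFrame({"T": T, "E": E, "age": age,
                   "fpc1": scores[:, 0], "fpc2": scores[:, 1], "fpc3": scores[:, 2]})
CoxPHFitter().fit(df, duration_col="T", event_col="E").print_summary()
```

Truncating to a finite number of FPC scores is what reduces the infinite-dimensional slope function to a fittable Cox model, which is the approximation whose convergence rate the article analyzes.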