Joint estimation of causal effects from observational and intervention gene expression data
Rau, Andrea (corresponding author) (INRA, Jouy-En-Josas (France). UMR 1313 Génétique Animale et Biologie Intégrative); Jaffrezic, Florence (INRA, Jouy-En-Josas (France). UMR 1313 Génétique Animale et Biologie Intégrative); Nuel, Gregory (Centre National de la Recherche Scientifique, Sorbonne Paris Cité, Paris (France). UMR 8145, MAP5, Mathématiques appliquées à Paris 5)
http://www.biomedcentral.com/bmcsystbiol
2013
Publication
Background: In recent years, there has been great interest in using transcriptomic data to infer gene regulatory networks. So far, methodological development in this area has primarily made use of graphical Gaussian models for observational wild-type data, resulting in undirected graphs that cannot accurately highlight causal relationships among genes. In the present work, we seek to improve the estimation of causal effects among genes by jointly modeling observational transcriptomic data with arbitrarily complex intervention data obtained by performing partial, single, or multiple gene knock-outs or knock-downs. Results: Using the framework of causal Gaussian Bayesian networks, we propose a Markov chain Monte Carlo algorithm with a Mallows proposal model and analytical likelihood maximization to sample from the posterior distribution of causal node orderings, and in turn, to estimate causal effects. The main advantage of the proposed algorithm over previously proposed methods is its flexibility to accommodate any kind of intervention design, including partial or multiple knock-out experiments. Using simulated data as well as data from the Dialogue for Reverse Engineering Assessments and Methods (DREAM) 2007 challenge, the proposed method was compared to two alternative approaches: one requiring a complete, single knock-out design, and one able to model only observational data. Conclusions: The proposed algorithm was found to perform as well as, and in most cases better than, the alternative methods in terms of accuracy for the estimation of causal effects. In addition, multiple knock-outs proved to contribute valuable additional information compared to single knock-outs. Finally, the simulation study confirmed that it is not possible to estimate the causal ordering of genes from observational data alone.
In all cases, we found that the inclusion of intervention experiments enabled more accurate estimation of causal regulatory relationships than the use of wild-type data alone.
Should we abandon the t-test in the analysis of gene expression microarray data: a comparison of variance modeling strategies
Jeanmougin, Marine (Department of Biostatistics; Centre National de la Recherche Scientifique; Institut National de la Recherche Agronomique, PARIS (France). Department of Biostatistics, Pharnext - FR; MAP5 Mathématiques appliquées Paris 5 - UFR de Maths et informatique 45 rue des Saints Pères 75270 PARIS CEDEX 06 FR; SG Laboratoire Statistique et génomes - FR); De Reynies, Aurélien (Ligue contre le cancer (France). Ligue contre le cancer - FR); Marisa, Laetitia (Ligue contre le cancer (France). Ligue contre le cancer - FR) ...
http://www.plosone.org
2010
Publication
High-throughput post-genomic studies are now routinely and promisingly investigated in biological and biomedical research. The main statistical approach to select genes differentially expressed between two groups is to apply a t-test, which is the subject of criticism in the literature. Numerous alternatives have been developed based on different and innovative variance modeling strategies. However, a critical issue is that selecting a different test usually leads to a different gene list. In this context, and given the current tendency to apply the t-test, identifying the most efficient approach in practice remains crucial. To help answer this question, we conduct a comparison of eight tests representative of variance modeling strategies in gene expression data: Welch's t-test, ANOVA [1], Wilcoxon's test, SAM [2], RVM [3], limma [4], VarMixt [5] and SMVar [6]. Our comparison process relies on four steps (gene list analysis, simulations, spike-in data and re-sampling) to formulate comprehensive and robust conclusions about test performance, in terms of statistical power, false-positive rate, execution time and ease of use. Our results raise concerns about the ability of some methods to control the expected number of false positives at a desirable level. In addition, two tests (limma and VarMixt) show significant improvement compared to the t-test, in particular when dealing with small sample sizes. Moreover, limma presents several practical advantages, so we advocate its application to analyze gene expression data.
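As an illustration of the baseline test named in this abstract, the following is a minimal sketch of Welch's t statistic for a single gene, written in plain Python. The function name `welch_t` and the toy expression values are assumptions made for this example and do not come from the paper. It shows only the per-gene variance estimate that the other methods model differently, not any of the alternative tests.

```python
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    x, y: expression values of one gene in the two groups (hypothetical data).
    """
    nx, ny = len(x), len(y)
    # Per-group sample variance of the mean; with few replicates these
    # estimates are noisy, which is the weakness the abstract refers to.
    vx = variance(x) / nx
    vy = variance(y) / ny
    t = (mean(x) - mean(y)) / (vx + vy) ** 0.5
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df


# Toy example with five replicates per group (illustrative values only).
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(t_stat, dof)
```

Because each gene's variance is estimated from its own replicates alone, the statistic can be unstable when the groups are small, which is the situation in which the abstract reports gains for limma and VarMixt.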
Patterns with “unusual” frequencies are new functional candidate patterns. Their identification is usually achieved by considering a homogeneous m-order Markov model (m ≥ 1) of the sequence, allowing the computation of p-values. For practical reasons, stationarity of the model is often assumed. This approximation can result in some artifacts, especially when a large set of small sequences is considered. In this work, an exact method, able to take into account both nonstationarity and the fragmentary structure of sequences, is applied to a simulated and a real set of sequences. This illustrates that pattern statistics can be very sensitive to the stationarity assumption.
Background: The use of current high-throughput genetic, genomic and post-genomic data leads to the simultaneous evaluation of a large number of statistical hypotheses and, at the same time, to the multiple-testing problem. As an alternative to the overly conservative Family-Wise Error-Rate (FWER), the False Discovery Rate (FDR) has emerged over the last ten years as more appropriate to handle this problem. However, one drawback of the FDR is that it relates to a given rejection region for the considered statistics, attributing the same value to those that are close to the boundary and those that are not. As a result, the local FDR has recently been proposed to quantify the specific probability for a given null hypothesis to be true. Results: In this context we present a semi-parametric approach based on kernel estimators which is applied to different high-throughput biological data such as patterns in DNA sequences, gene expression and genome-wide association studies. Conclusion: The proposed method has the practical advantages, over existing approaches, of considering complex heterogeneities in the alternative hypothesis, taking into account prior information (from an expert judgment or previous studies) by allowing a semi-supervised mode, and dealing with truncated distributions such as those obtained in Monte-Carlo simulations. This method has been implemented and is available through the R package kerfdr via the CRAN or at http://stat.genopole.cnrs.fr/software/kerfdr.
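To make the FDR notion mentioned in this abstract concrete, here is a minimal sketch of the standard Benjamini-Hochberg adjustment in plain Python. It is not the kernel-based local FDR method of the paper and not the kerfdr API; the function name `benjamini_hochberg` and the example p-values are assumptions chosen for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping the adjusted values monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four tests.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(adjusted)
```

Note that the two smallest p-values receive the same adjusted value here. This reflects the drawback described in the abstract: the FDR is attached to a rejection region rather than to each individual hypothesis, which is what the local FDR is meant to address.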