Concerted examination of multiple collections of single-cell RNA sequencing (RNA-seq) data promises further biological insights that cannot be uncovered with individual datasets. Here we present scMerge, an algorithm that integrates multiple single-cell RNA-seq datasets using factor analysis of stably expressed genes and pseudoreplicates across datasets. Using a large collection of public datasets, we benchmark scMerge against published methods and demonstrate that it consistently provides improved cell type separation by removing unwanted factors; scMerge can also enhance biological discovery through robust data integration, which we show through the inference of development trajectory in a liver dataset collection.
Differences in cell-type composition across subjects and conditions often carry biological significance. Recent advancements in single cell sequencing technologies enable cell-types to be identified at the single cell level, and as a result, cell-type composition of tissues can now be studied in exquisite detail. However, a number of challenges remain with cell-type composition analysis: none of the existing methods can identify cell-types perfectly, and variability related to cell sampling exists in any single cell experiment. This necessitates the development of methods for estimating uncertainty in cell-type composition.
We developed a novel single cell differential composition (scDC) analysis method that performs differential cell-type composition analysis via bootstrap resampling. scDC captures the uncertainty associated with cell-type proportions of each subject via bias-corrected and accelerated bootstrap confidence intervals. We assessed the performance of our method using a number of simulated datasets and synthetic datasets curated from publicly available single cell datasets. In simulated datasets, scDC correctly recovered the true cell-type proportions. In synthetic datasets, the cell-type compositions returned by scDC were highly concordant with reference cell-type compositions from the original data. Since the majority of datasets tested in this study have only 2 to 5 subjects per condition, the addition of confidence intervals enabled better comparisons of compositional differences between subjects and across conditions.
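As an illustration of the bootstrap idea described above, the following is a minimal sketch of a bias-corrected and accelerated (BCa) confidence interval for the proportion of one cell type among the cells of a single subject. It is not the scDC implementation: the function name `bca_proportion_ci`, its parameters, and the handling of ties for discrete data are assumptions made here for illustration only.

```python
import random
from statistics import NormalDist


def bca_proportion_ci(labels, cell_type, n_boot=2000, alpha=0.05, seed=0):
    """BCa bootstrap CI for the proportion of `cell_type` in one subject's cells.

    Hypothetical helper for illustration; not part of scDC or scdney.
    """
    rng = random.Random(seed)
    norm = NormalDist()
    x = [1 if lab == cell_type else 0 for lab in labels]
    n = len(x)
    theta_hat = sum(x) / n

    # Bootstrap: resample cells with replacement and recompute the proportion.
    boot = sorted(sum(rng.choices(x, k=n)) / n for _ in range(n_boot))

    # Bias correction z0 from the share of bootstrap estimates below theta_hat.
    # Assumption: ties count as one half (the data are discrete), and the share
    # is clamped away from 0 and 1 so that inv_cdf is defined.
    below = sum(b < theta_hat for b in boot) + 0.5 * sum(b == theta_hat for b in boot)
    frac = min(max(below / n_boot, 1 / (n_boot + 1)), n_boot / (n_boot + 1))
    z0 = norm.inv_cdf(frac)

    # Acceleration from jackknife (leave-one-cell-out) estimates.
    total = sum(x)
    jack = [(total - xi) / (n - 1) for xi in x]
    jack_mean = sum(jack) / n
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6.0 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    accel = num / den if den > 0 else 0.0

    def adjusted_quantile(level):
        # Shift the nominal percentile using z0 and the acceleration.
        z = norm.inv_cdf(level)
        p = norm.cdf(z0 + (z0 + z) / (1 - accel * (z0 + z)))
        idx = min(max(int(p * n_boot), 0), n_boot - 1)
        return boot[idx]

    return theta_hat, adjusted_quantile(alpha / 2), adjusted_quantile(1 - alpha / 2)


# Toy subject with 30 cells of type "T" and 70 of type "B".
labels = ["T"] * 30 + ["B"] * 70
estimate, lower, upper = bca_proportion_ci(labels, "T")
```

With only a few subjects per condition, intervals of this kind give a sense of how much of an apparent compositional difference could be due to cell sampling alone.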
scDC is a novel statistical method for performing differential cell-type composition analysis for scRNA-seq data. It uses bootstrap resampling to estimate the standard errors associated with cell-type proportion estimates and performs significance testing through GLM and GLMM models. We have made this method available to the scientific community as part of the scdney (Single Cell Data Integrative Analysis) R package, available from https://github.com/SydneyBioX/scdney.
Abstract
Motivation
High parameter histological techniques have allowed for the identification of a variety of distinct cell types within an image, providing a comprehensive overview of the tissue environment. This allows the complex cellular architecture and environment of diseased tissue to be explored. While spatial analysis techniques have revealed how cell–cell interactions are important within the disease pathology, there remains a gap in exploring changes in these interactions within the disease process. Specifically, there are currently few established methods for performing inference on cell-type co-localization changes across images, hindering an understanding of how cellular environments change with a disease pathology.
Results
We have developed the spicyR R package to perform inference on changes in the spatial co-localization of cell types across groups of images. Application to simulated data demonstrates a high sensitivity and specificity. We demonstrate the utility of spicyR by applying it to a type 1 diabetes imaging mass cytometry dataset, revealing changes in cellular associations that were relevant to the disease progression. Ultimately, spicyR allows changes in cellular environments to be explored under different pathologies or disease states.
Availability and implementation
The R package is freely available at http://bioconductor.org/packages/release/bioc/html/spicyR.html and a Shiny app implementation at http://shiny.maths.usyd.edu.au/spicyR/.
Supplementary information
Supplementary data are available at Bioinformatics online.
To analyse the effects of maternal diabetes mellitus (DM) and body mass index (BMI) on central and peripheral fat accretion of large for gestational age (LGA) offspring.
This retrospective study included LGA fetuses (n = 595) with ultrasound scans at early (19.23 ± 0.68 weeks), mid (28.98 ± 1.62 weeks) and late (36.20 ± 1.59 weeks) stages of adipogenesis and measured abdominal (AFT) and mid-thigh (TFT) fat as surrogates for central and peripheral adiposity. Women were categorised according to BMI and DM status: pre-gestational diabetes (P-DM; n = 59), insulin-managed gestational diabetes (I-GDM; n = 132) and diet-managed gestational diabetes (D-GDM; n = 29). Analysis of variance and linear regressions were applied.
AFT and TFT did not differ significantly between BMI categories (normal, overweight and obese). In contrast, AFT was significantly higher in pregnancies affected by D-GDM compared to non-DM pregnancies from mid stage (0.44 mm difference, p = 0.002) and for all DM categories in late stage of adipogenesis (≥ 0.49 mm difference, p < 0.008). Late stage TFT accretion was higher than controls for P-DM and I-GDM but not for D-GDM (0.67 mm difference, p < 0.001; 0.49 mm difference, p = 0.001, 0.56 mm difference, p = 0.22 respectively). In comparison to the early non-DM group with an AFT to TFT ratio of 1.07, the I-GDM group ratio was 1.25 (p < 0.001), which normalised by 28 weeks becoming similar to control ratios.
DM, independent of BMI, was associated with higher abdominal and mid-thigh fat accretion in fetuses. Use of insulin improved central to peripheral fat ratios in fetuses of GDM mothers.
A fast Bayesian method that seamlessly fuses classification and hypothesis testing via discriminant analysis is developed. Building upon the original discriminant analysis classifier, modelling components are added to identify discriminative variables. A combination of cake priors and a novel form of variational Bayes we call reverse collapsed variational Bayes gives rise to variable selection that can be directly posed as a multiple hypothesis testing approach using likelihood ratio statistics. Some theoretical arguments are presented showing that Chernoff-consistency (asymptotically zero type I and type II error) is maintained across all hypotheses. We apply our method to some publicly available genomics datasets and show that it performs well in practice for its computational cost. An R package, VaDA, has also been made available on GitHub.
Tractable skew-normal approximations via matching. Zhou, Jackson; Grazian, Clara; Ormerod, John T.
Journal of Statistical Computation and Simulation, 03/2024, Volume 94, Issue 5. Journal article, peer reviewed, open access.
Many approximate Bayesian inference methods assume a particular parametric form for approximating the posterior distribution. A Gaussian distribution provides a convenient density for such approaches; examples include the Laplace, penalized quasi-likelihood, Gaussian variational, and expectation propagation methods. Unfortunately, these all ignore potential posterior skewness. The recent work of Durante et al. (Skewed Bernstein-von Mises theorem and skew-modal approximations; 2023; arXiv preprint arXiv:2301.03038) addresses this using skew-modal (SM) approximations, and is theoretically justified by a skewed Bernstein-von Mises theorem. However, the SM approximation can be impractical to work with in terms of tractability and storage costs, and uses only local posterior information. We introduce a variety of matching-based approximation schemes using the standard skew-normal distribution to resolve these issues. Experiments were conducted to compare the performance of this skew-normal matching method (both as a standalone approximation and as a post-hoc skewness adjustment) with the SM and existing Gaussian approximations. We show that for small and moderate dimensions, skew-normal matching can be much more accurate than these other approaches. For post-hoc skewness adjustments, this comes at very little cost in additional computational time.
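To give a concrete feel for what "matching" can mean, the sketch below fits a univariate skew-normal distribution SN(xi, omega, alpha) to a given mean, variance and skewness using the standard method-of-moments formulas. This is only a one-dimensional illustration under stated assumptions; it is not the multivariate matching schemes of the article, and the function names and the clamping threshold `max_skew` are hypothetical choices made here.

```python
import math


def skew_normal_moments(xi, omega, alpha):
    """Mean, variance and skewness of a univariate skew-normal SN(xi, omega, alpha)."""
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    m = delta * math.sqrt(2.0 / math.pi)
    mean = xi + omega * m
    var = omega ** 2 * (1.0 - m * m)
    skew = (4.0 - math.pi) / 2.0 * m ** 3 / (1.0 - m * m) ** 1.5
    return mean, var, skew


def skew_normal_from_moments(mean, var, skew, max_skew=0.99):
    """Method-of-moments fit: return (xi, omega, alpha) matching the three moments.

    The skew-normal family only attains |skewness| below roughly 0.9953, so the
    target skewness is clamped to +/- max_skew (an illustrative assumption).
    """
    k = (4.0 - math.pi) / 2.0
    s = max(-max_skew, min(max_skew, skew))
    g = abs(s) ** (2.0 / 3.0)
    # Invert the skewness formula for delta = alpha / sqrt(1 + alpha^2).
    delta = math.copysign(
        math.sqrt((math.pi / 2.0) * g / (g + k ** (2.0 / 3.0))), s
    )
    alpha = delta / math.sqrt(1.0 - delta * delta)
    omega = math.sqrt(var / (1.0 - 2.0 * delta * delta / math.pi))
    xi = mean - omega * delta * math.sqrt(2.0 / math.pi)
    return xi, omega, alpha


# Round trip: moments of SN(0.5, 2, 3) should map back to the same parameters.
target = skew_normal_moments(0.5, 2.0, 3.0)
xi, omega, alpha = skew_normal_from_moments(*target)
```

In a post-hoc setting, the mean, variance and skewness would come from an existing posterior approximation or from posterior samples; a skewness of zero recovers the Gaussian case (alpha = 0).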
Generalized linear latent variable models (GLLVMs) are a powerful class of models for understanding the relationships among multiple, correlated responses. Estimation, however, presents a major challenge, as the marginal likelihood does not possess a closed form for nonnormal responses. We propose a variational approximation (VA) method for estimating GLLVMs. For the common cases of binary, ordinal, and overdispersed count data, we derive fully closed-form approximations to the marginal log-likelihood function in each case. Compared to other methods such as the expectation-maximization algorithm, estimation using VA is fast and straightforward to implement. Predictions of the latent variables and associated uncertainty estimates are also obtained as part of the estimation process. Simulations show that VA estimation performs similar to or better than some currently available methods, both at predicting the latent variables and estimating their corresponding coefficients. They also show that VA estimation offers dramatic reductions in computation time, particularly if the number of correlated responses is large relative to the number of observational units. We apply the variational approach to two datasets, estimating GLLVMs to understand the patterns of variation in youth gratitude and to construct ordination plots in bird abundance data.
R code for performing VA estimation of GLLVMs is available online. Supplementary materials for this article are available online.
Detection-nondetection data are often used to investigate species range dynamics using Bayesian occupancy models which rely on the use of Markov chain Monte Carlo (MCMC) methods to sample from the posterior distribution of the parameters of the model. In this article we develop two Variational Bayes (VB) approximations to the posterior distribution of the parameters of a single-season site occupancy model which uses logistic link functions to model the probability of species occurrence at sites and of species detection probabilities. This task is accomplished through the development of iterative algorithms that do not use MCMC methods. Simulations and small practical examples demonstrate the effectiveness of the proposed technique. We specifically show that (under certain circumstances) the variational distributions can provide accurate approximations to the true posterior distributions of the parameters of the model when the number of visits per site (K) is as low as three and that the accuracy of the approximations improves as K increases. We also show that the methodology can be used to obtain the posterior distribution of the predictive distribution of the proportion of sites occupied (PAO).
In this work, we propose a novel approximated collapsed variational Bayes approach to model selection in linear regression. The approximated collapsed variational Bayes algorithm offers improvements over mean field variational Bayes by marginalizing over a subset of parameters and using mean field variational Bayes over the remaining parameters in an analogous fashion to collapsed Gibbs sampling. We have shown that the proposed algorithm, under typical regularity assumptions, (a) includes variables in the true underlying model at an exponential rate in the sample size, or (b) excludes the variables at least at the first order rate in the sample size if the variables are not in the true model. Simulation studies show that the performance of the proposed method is close to that of a particular Markov chain Monte Carlo sampler and a path search based variational Bayes algorithm, but requires an order of magnitude less time. The proposed method is also highly competitive with penalized methods, expectation propagation, stepwise AIC/BIC, BMS, and EMVS under various settings.
Supplementary materials for the article are available online.
Abstract
Motivation
Genes act as a system and not in isolation. Thus, it is important to consider coordinated changes of gene expression rather than single genes when investigating biological phenomena such as the aetiology of cancer. We have developed an approach for quantifying how changes in the association between pairs of genes may inform the outcome of interest called Differential Correlation across Ranked Samples (DCARS). Modelling gene correlation across a continuous sample ranking does not require the dichotomisation of samples into two distinct classes and can identify differences in gene correlation across early, mid or late stages of the outcome of interest.
Results
When we evaluated DCARS against the typical Fisher Z-transformation test for differential correlation, as well as a typical approach testing for interaction within a linear model, on real TCGA data, DCARS significantly ranked gene pairs containing known cancer genes more highly across several cancers. Similar results are found with our simulation study. DCARS was applied to 13 cancer datasets in TCGA, revealing several distinct relationships for which survival ranking was found to be associated with a change in correlation between genes. Furthermore, we demonstrated that DCARS can be used in conjunction with network analysis techniques to extract biological meaning from multi-layered and complex data.
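For context, the baseline that DCARS is compared against, the Fisher Z-transformation test for differential correlation between two sample classes, can be sketched in a few lines. This is the textbook two-sample test, not DCARS itself; the function name `fisher_z_test` and the example values are illustrative assumptions.

```python
import math
from statistics import NormalDist


def fisher_z_test(r1, n1, r2, n2):
    """Two-sided Fisher Z test for a difference between two Pearson correlations.

    r1, r2 are the correlations of a gene pair in two classes of samples with
    n1 and n2 samples respectively (both must exceed 3).
    """
    # atanh is the Fisher transformation; each transformed correlation has
    # approximate variance 1 / (n - 3).
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (math.atanh(r1) - math.atanh(r2)) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical gene pair: strongly correlated in one class, weakly in the other.
z, p_value = fisher_z_test(0.8, 50, 0.2, 50)
```

This test requires the samples to be split into two classes beforehand, which is the dichotomisation that a ranking-based approach avoids.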
Availability and implementation
The DCARS R package, sample data and supplementary files are available at https://github.com/shazanfar/DCARS. Publicly available data from The Cancer Genome Atlas (TCGA) were accessed using the TCGABiolinks R package.
Supplementary information
Supplementary data are available at Bioinformatics online.