Abstract
Summary
Accurate clustering of mixed data, encompassing binary, categorical, and continuous variables, is vital for effective patient stratification in clinical questionnaire analysis. To address this need, we present longmixr, a comprehensive R package providing a robust framework for clustering mixed longitudinal data using finite mixture modeling techniques. By incorporating consensus clustering, longmixr ensures reliable and stable clustering results. Moreover, the package includes a detailed vignette that facilitates cluster exploration and visualization.
Availability and implementation
The R package is freely available at https://cran.r-project.org/package=longmixr with detailed documentation, including a case vignette, at https://cellmapslab.github.io/longmixr/.
Independent component analysis (ICA) is a blind source separation method for linear disentanglement of independent latent sources from observed data. We investigate the special setting of noisy linear ICA where the observations are split among different views, each receiving a mixture of shared and individual sources. We prove that the corresponding linear structure is identifiable, and the source distribution can be recovered. To computationally estimate the sources, we optimize a constrained form of the joint log-likelihood of the observed data among all views. We also show empirically that our objective recovers the sources when the measurements are corrupted by noise. Furthermore, we propose a model selection procedure for recovering the number of shared sources, which we verify empirically. Finally, we apply the proposed model to a challenging real-life application: shared sources estimated from two large transcriptome datasets (the observed data), provided by two different labs (the two views), are used to find a plausible representation of the underlying graph structure.
Over the past years, Generative Adversarial Networks (GANs) have shown remarkable generation performance, especially in image synthesis. Unfortunately, they are also known for having an unstable training process and might lose parts of the data distribution for heterogeneous input data. In this paper, we propose a novel GAN extension for multi-modal distribution learning (MMGAN). In our approach, we model the latent space as a Gaussian mixture model whose number of clusters corresponds to the number of disconnected data manifolds in the observation space, and include a clustering network, which relates each data manifold to one Gaussian cluster. As a result, training becomes more stable. Moreover, MMGAN allows for clustering real data according to the learned data manifold in the latent space. Through a series of benchmark experiments, we illustrate that MMGAN outperforms competitive state-of-the-art models in terms of clustering performance.
We propose a general framework for constructing powerful, sequential hypothesis tests for a large class of nonparametric testing problems. The null hypothesis for these problems is defined in an abstract form using the action of two known operators on the data distribution. This abstraction allows for a unified treatment of several classical tasks, such as two-sample testing, independence testing, and conditional-independence testing, as well as modern problems, such as testing for adversarial robustness of machine learning (ML) models. Our proposed framework has the following advantages over classical batch tests: 1) it continuously monitors online data streams and efficiently aggregates evidence against the null, 2) it provides tight control over the type I error without the need for multiple testing correction, 3) it adapts the sample size requirement to the unknown hardness of the problem. We develop a principled approach of leveraging the representation capability of ML models within the testing-by-betting framework, a game-theoretic approach for designing sequential tests. Empirical results on synthetic and real-world datasets demonstrate that tests instantiated using our general framework are competitive against specialized baselines on several tasks.
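The testing-by-betting idea mentioned above can be illustrated with a minimal sketch for the two-sample case. This is not the authors' implementation: the function name `betting_two_sample_test`, the fixed `tanh` witness (standing in for a learned ML model), and the running-mean betting rule are all assumptions chosen for brevity. The sketch relies only on the standard fact that, under the null, the wealth process is a nonnegative martingale, so stopping when it reaches 1/alpha keeps the type I error at most alpha (Ville's inequality).

```python
import math

def betting_two_sample_test(pairs, alpha=0.05, witness=math.tanh):
    """Sequential two-sample test by betting (illustrative sketch).

    pairs   : iterable of (x, y) observations arriving one pair at a time
    witness : bounded function with values in [-1, 1]; here fixed (assumption),
              whereas a learned model could be used in its place
    Returns (rejected, stopping_time, final_wealth).
    """
    wealth = 1.0        # the bettor starts with unit wealth
    payoff_sum = 0.0    # sum of past payoffs, used to choose the next bet
    t = 0
    for t, (x, y) in enumerate(pairs, start=1):
        # The bet uses only past data (predictable) and is clipped to
        # [-0.5, 0.5] so that each wealth factor stays positive.
        lam = 0.0 if t == 1 else max(-0.5, min(0.5, payoff_sum / (t - 1)))
        # Payoff lies in [-1, 1] and has mean zero when X and Y share a law.
        payoff = 0.5 * (witness(x) - witness(y))
        wealth *= 1.0 + lam * payoff
        if wealth >= 1.0 / alpha:
            return True, t, wealth   # enough evidence against the null
        payoff_sum += payoff
    return False, t, wealth
```

Because the test can stop as soon as the wealth crosses the threshold, easy problems (large distribution shifts) terminate after few observations, which is the sample-size adaptivity the abstract refers to.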
We introduce a powerful deep classifier two-sample test for high-dimensional data based on E-values, called E-value Classifier Two-Sample Test (E-C2ST). Our test combines ideas from existing work on split likelihood ratio tests and predictive independence tests. The resulting E-values are suitable for anytime-valid sequential two-sample tests. This feature allows for more effective use of data in constructing test statistics. Through simulations and real data applications, we empirically demonstrate that E-C2ST achieves enhanced statistical power by partitioning datasets into multiple batches beyond the conventional two-split (training and testing) approach of standard classifier two-sample tests. This strategy increases the power of the test while keeping the type I error well below the desired significance level.
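As a rough illustration of how classifier outputs can be turned into E-values and accumulated over batches, consider the sketch below. It is a simplification under stated assumptions, not the E-C2ST procedure itself: the helper names `log_e_value` and `sequential_reject` are hypothetical, the label prior under the null is taken as known (0.5 by default), and each batch is assumed to be scored by a classifier trained only on earlier batches.

```python
import math

def log_e_value(probs_class1, labels, prior_class1=0.5, eps=1e-12):
    """Log likelihood ratio of classifier predictions against the null.

    probs_class1 : predicted P(label = 1 | x) for held-out points, from a
                   classifier trained on independent data (assumption)
    labels       : sample-membership labels (0 or 1) of those points
    Under the null the label is independent of x, so it follows the prior.
    """
    log_e = 0.0
    for p1, y in zip(probs_class1, labels):
        p_model = p1 if y == 1 else 1.0 - p1
        p_null = prior_class1 if y == 1 else 1.0 - prior_class1
        # Clamp to avoid log(0) for overconfident predictions.
        log_e += math.log(max(p_model, eps)) - math.log(p_null)
    return log_e

def sequential_reject(batch_log_e_values, alpha=0.05):
    """Multiply per-batch E-values (add logs); stop once the product >= 1/alpha."""
    threshold = math.log(1.0 / alpha)
    total = 0.0
    n_batches = 0
    for n_batches, log_e in enumerate(batch_log_e_values, start=1):
        total += log_e
        if total >= threshold:
            return True, n_batches
    return False, n_batches
```

An uninformative classifier (all predictions equal to the prior) yields a log E-value of zero, so it contributes no evidence, while confident correct predictions increase it.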
To examine the possible association of maternal serum a disintegrin and metalloprotease (ADAM12) in the first trimester of pregnancy and subsequent development of preeclampsia, delivery of small for gestational age (SGA) neonates, and spontaneous preterm delivery.
The maternal serum concentration of ADAM12 at 11 0/7 to 13 6/7 weeks was measured in 128 cases of preeclampsia, 88 cases of gestational hypertension, 296 cases with SGA neonates, 58 cases of spontaneous preterm delivery, and 570 controls. Regression analysis was used to determine which of the maternal factors and fetal crown rump length were significant predictors of ADAM12 in the control group, and from the regression model the value in each case and control was expressed as a multiple of median (MoM). The levels of ADAM12 MoM were compared in cases and controls.
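The multiple-of-median (MoM) step described above can be sketched as follows. This is a simplified illustration, not the study's model: it assumes a single predictor (for example, crown rump length) and a linear regression on the log10 concentration fitted in controls, and the function names `fit_log_linear` and `to_mom` are hypothetical. The study's regression included several maternal factors whose exact form is not given here.

```python
import math

def fit_log_linear(xs, concentrations):
    """Least-squares fit of log10(concentration) = a + b * x in controls.

    Returns the intercept a and slope b.
    """
    logs = [math.log10(c) for c in concentrations]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (l - mean_log) for x, l in zip(xs, logs))
    b = sxy / sxx
    a = mean_log - b * mean_x
    return a, b

def to_mom(x, concentration, a, b):
    """Express a measured concentration as a multiple of the expected value."""
    expected = 10 ** (a + b * x)   # value predicted for this covariate
    return concentration / expected
```

A MoM of 1.0 means the measurement equals the value expected for that covariate in unaffected pregnancies; values below 1.0 indicate a lower-than-expected concentration.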
In the control group the concentration of ADAM12 increased with fetal crown rump length, decreased with maternal weight and was higher in African-American than in white women. There was a significant association between ADAM12 and pregnancy-associated plasma protein A (r=0.417, P<.001) and between each metabolite and birth weight percentile (r=0.176, P<.001 and r=0.109, P=.009). In the SGA group, the median ADAM12 concentration (0.848 MoM) was lower (P<.001), but in pregnancies complicated by preeclampsia (0.954 MoM), gestational hypertension (1.013 MoM), and spontaneous preterm delivery (1.048 MoM) the levels were not significantly different from controls (1.011 MoM).
There is a good correlation between the maternal serum ADAM12 and pregnancy-associated plasma protein A concentration. Measurement of ADAM12 does not provide useful prediction of SGA, preeclampsia, or spontaneous preterm delivery.
II.