Since their introduction in the 1950s, variance component mixed models have been widely used in many application fields. In this context, ReML estimation is by far the most popular procedure to infer the variance components of the model. Although many implementations of the ReML procedure are readily available, there is still a need for computational improvements, owing to the ever-increasing size of the datasets to be handled and to the complexity of the models to be fitted. In this paper, we present a Min-Max (MM) algorithm for ReML inference and combine it with several speed-up procedures. The ReML MM algorithm is compared to 5 state-of-the-art publicly available algorithms used in statistical genetics. The computational performance of the different algorithms is evaluated on several datasets representing different plant breeding experimental designs. The MM algorithm ranks among the top 2 methods in almost all settings and is more versatile than many of its competitors. It is a promising alternative to the classical AI-ReML algorithm in the context of variance component mixed models, and it is available in the MM4LMM R package.
Model-based unsupervised learning, like any learning task, stalls as soon as missing data occur. This is even more true when the missing data are informative, that is, missing not at random (MNAR). In this paper, we propose model-based clustering algorithms designed to handle very general types of missing data, including MNAR data. To do so, we introduce a mixture model for different types of data (continuous, count, categorical and mixed) to jointly model the data distribution and the MNAR mechanism, remaining vigilant to the relative degrees of freedom of each. Several MNAR models are discussed, for which the cause of the missingness can depend both on the values of the missing variables themselves and on the class membership. However, we focus on a specific MNAR model, called MNARz, for which the missingness depends only on the class membership. We first underline its ease of estimation by showing that statistical inference can be carried out on the data matrix concatenated with the missing mask, which reduces the problem to a standard MAR mechanism. Consequently, we propose to perform clustering using the Expectation Maximization algorithm, specially developed for this simplified reinterpretation. Finally, we assess the numerical performance of the proposed methods on synthetic data as well as on the real medical registry TraumaBase.
The effective sample size (ESS) is a metric used to summarize in a single term the amount of correlation in a sample. It is of particular interest when predicting the statistical power of genome-wide association studies (GWAS) based on linear mixed models. Here, we introduce an analytical form of the ESS for mixed-model GWAS of quantitative traits and relate it to empirical estimators recently proposed. Using our framework, we derived approximations of the ESS for analyses of related and unrelated samples and for both marginal genetic and gene-environment interaction tests. We conducted simulations to validate our approximations and to provide a quantitative perspective on the statistical power of various scenarios, including power loss due to family relatedness and power gains due to conditioning on the polygenic signal. Our analyses also demonstrate that the power of gene-environment interaction GWAS in related individuals strongly depends on the family structure and exposure distribution. Finally, we performed a series of mixed-model GWAS on data from the UK Biobank and confirmed the simulation results. We notably found that the expected power drop due to family relatedness in the UK Biobank is negligible.
Introduction: Calcific aortic stenosis (CAS) is a common, progressive fibrocalcific pathology of the aortic valve for which no medical therapy exists. The genetics of CAS remain only partially understood. Methods: We performed a genome-wide association study (GWAS) of CAS among 2,799,598 individuals from the International Aortic Valve Genetics Consortium (IAVGC), comprising 28 cohorts. CAS was identified using a common ICD/CPT-based phenotype. GWAS were meta-analyzed using inverse variance weighting with adjustment by the linkage-disequilibrium score regression (LDSR) intercept. Unique genome-wide significant (GWS) loci and causal genes were annotated by nearest gene and eQTL colocalization. Genetic correlations with atherosclerotic, adiposity, and lipid traits were estimated using LDSR with publicly available GWAS (CARDIoGRAMplusC4D for coronary artery disease, CAD; Million Veteran Program for peripheral artery disease, PAD; MEGASTROKE for ischemic stroke, IS; GIANT for body mass index, BMI; and GLGC for lipids). Results: There were 85,329 individuals with CAS (79,397 White, 3,126 Black, 1,403 Hispanic, and 1,403 East Asian) among 2,799,598 individuals. Meta-analysis of GWAS resulted in 224 unique GWS genomic regions, of which 205 were novel. The majority of the GWS genomic regions (134) did not overlap with prior risk loci for CAD, PAD, IS, BMI, or lipids. Genetic correlation analysis demonstrated modest but significant correlations between CAS and CAD (r = 0.26, p = 3.3 × 10⁻¹⁸), PAD (r = 0.41, p = 1.4 × 10⁻³⁰), IS (r = 0.18, p = 2.8 × 10⁻⁷), BMI (r = 0.22, p = 8.6 × 10⁻³⁰), and lipids (LDL-C r = 0.17, p = 3.6 × 10⁻¹⁰; triglycerides r = 0.10, p = 1.9 × 10⁻⁵; HDL-C r = -0.07, p = 8.0 × 10⁻⁴). Conclusions: This multi-ancestry GWAS of CAS, the largest to date, identified 205 novel genomic regions. We demonstrate that CAS is genetically distinct from cardiometabolic traits, with only modest genetic correlations and with a majority of CAS risk loci having no overlap with cardiometabolic GWAS risk loci.
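To make the meta-analysis step concrete, the following minimal sketch shows fixed-effect inverse variance weighting for a single variant. It is not the consortium's pipeline: the function name `inverse_variance_meta` and the optional `intercepts` argument are hypothetical, and scaling each cohort's standard error by the square root of its LDSR intercept is only one common convention, assumed here for illustration.

```python
import math


def inverse_variance_meta(betas, ses, intercepts=None):
    """Fixed-effect inverse-variance-weighted meta-analysis for one variant.

    betas, ses : per-cohort effect estimates and standard errors.
    intercepts : optional per-cohort LDSR intercepts (hypothetical argument).
                 Assumption: a standard error is inflated by sqrt(intercept)
                 whenever the intercept exceeds 1.
    Returns (pooled_beta, pooled_se, z, two_sided_p).
    """
    if not betas or len(betas) != len(ses):
        raise ValueError("betas and ses must be non-empty and of equal length")
    if intercepts is None:
        intercepts = [1.0] * len(betas)
    if len(intercepts) != len(betas):
        raise ValueError("intercepts must match the number of cohorts")

    # Inflate standard errors only where the intercept signals inflation.
    adjusted = [se * math.sqrt(max(ic, 1.0)) for se, ic in zip(ses, intercepts)]

    # Each cohort is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in adjusted]
    total = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total
    pooled_se = math.sqrt(1.0 / total)

    # Two-sided p-value under a standard normal null.
    z = pooled_beta / pooled_se
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z, p


if __name__ == "__main__":
    beta, se, z, p = inverse_variance_meta([0.10, 0.20], [0.10, 0.10])
    print(f"beta={beta:.3f} se={se:.4f} z={z:.2f} p={p:.3g}")
```

With two cohorts of equal precision, the pooled estimate is the simple average of the two effects and the pooled standard error shrinks by a factor of √2.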
The problem of inferring the relatedness distribution between two individuals from biallelic marker data is considered. This problem can be cast as an estimation task in a mixture model: at each marker the latent variable is the relatedness state, and the observed variable is the genotype of the two individuals. In this model, only the prior proportions are unknown, and can be obtained via ML estimation using the EM algorithm. When the markers are biallelic and the data unphased, the identifiability of the model is known not to be guaranteed. In this article, model identifiability is investigated in the case of phased data generated from a crossing design, a classical situation in plant genetics. It is shown that identifiability can be guaranteed under some conditions on the crossing design. The adapted ML estimator is implemented in an R package called Relatedness. The performance of the ML estimator is evaluated and compared to that of the benchmark moment estimator, both on simulated and real data. Compared to its competitor, the ML estimator is shown to be more robust and to provide more realistic estimates.
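The estimation task described above, a mixture in which only the prior proportions are unknown, can be sketched generically as follows. This is an illustration of EM for mixing proportions, not the code of the Relatedness package: the function name `em_mixture_proportions` is hypothetical, and the per-marker probabilities of the observed genotypes given each relatedness state are assumed to have been computed beforehand.

```python
def em_mixture_proportions(likelihoods, n_iter=500, tol=1e-10):
    """EM estimate of mixing proportions when component likelihoods are known.

    likelihoods[m][k] is assumed to hold P(observed genotypes at marker m |
    relatedness state k); computing these values is outside this sketch.
    Returns a list of estimated proportions, one per state.
    """
    if not likelihoods:
        raise ValueError("at least one marker is required")
    n_states = len(likelihoods[0])
    props = [1.0 / n_states] * n_states  # uniform starting point

    for _ in range(n_iter):
        # E-step: accumulate posterior state probabilities over markers.
        counts = [0.0] * n_states
        for row in likelihoods:
            joint = [p * l for p, l in zip(props, row)]
            norm = sum(joint)
            if norm <= 0.0:
                raise ValueError("a marker has zero likelihood under all states")
            for k in range(n_states):
                counts[k] += joint[k] / norm

        # M-step: new proportions are the average posterior probabilities.
        new_props = [c / len(likelihoods) for c in counts]
        change = max(abs(a - b) for a, b in zip(new_props, props))
        props = new_props
        if change < tol:
            break
    return props


if __name__ == "__main__":
    # Toy input: three markers compatible only with state 0, one only with state 1.
    toy = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(em_mixture_proportions(toy))
```

If two states give identical likelihoods at every marker, the update cannot separate them and the split between them stays at its starting value, which is a simple way to see the identifiability issue the abstract refers to.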
Gain-of-function mutations in the EPAS1/HIF2A gene have been identified in patients with hereditary erythrocytosis that can be associated with the development of paraganglioma, pheochromocytoma and somatostatinoma. In the present study, we describe a unique European collection of 41 patients and 28 relatives diagnosed with an erythrocytosis associated with a germline genetic variant in EPAS1. In addition, we identified two infants with severe erythrocytosis associated with a mosaic mutation present in less than 2% of the blood, one of whom later developed a paraganglioma. The aim of this study was to determine the causal role of these genetic variants, to establish pathogenicity, and to identify potential candidates eligible for the new hypoxia-inducible factor-2α (HIF-2α) inhibitor treatment. Pathogenicity was predicted with in silico tools, and the impact of 13 HIF-2α variants was studied using canonical and real-time reporter luciferase assays. These functional assays relied on a novel edited vector containing an expanded region of the erythropoietin promoter combined with distal regulatory elements, which substantially enhanced the HIF-2α-dependent induction. Altogether, our studies allowed the classification of 11 mutations as pathogenic in 17 patients and 23 relatives. We describe four new mutations (D525G, L526F, G527K, A530S) close to the key proline P531, which broadens the spectrum of mutations involved in erythrocytosis. Notably, we identified patients with only erythrocytosis associated with the germline mutations A530S and Y532C, previously identified as somatic mutations in tumors, thereby raising the complexity of the genotype/phenotype correlations. Altogether, this study allows accurate clinical follow-up of patients and opens the possibility of benefiting from HIF-2α inhibitor treatment, so far the only targeted treatment in hypoxia-related erythrocytosis disease.