Principal component analysis (PCA) is a standard method to correct for population stratification in ancestry-specific genome-wide association studies (GWASs) and is used to cluster individuals by ancestry. Using the 1000 Genomes Project data, we examine how non-linear dimensionality reduction methods such as t-distributed stochastic neighbor embedding (t-SNE) or generative topographic mapping (GTM) can be used to provide improved ancestry maps by accounting for a higher percentage of explained variance in ancestry, and how they can help to estimate the number of principal components necessary to account for population stratification. GTM generates posterior probabilities of class membership, which can be used to assess the probability that an individual belongs to a given population. Unlike t-SNE, GTM can therefore be used for both clustering and classification.
PCA only partially identifies population clusters and does not separate most populations within a given continent, such as Japanese and Han Chinese in East Asia, or Mende and Yoruba in Africa. t-SNE and GTM, which take more of the data variance into account, can identify more fine-grained population clusters. GTM can be used to build probabilistic classification models, and is as efficient as a support vector machine (SVM) for classifying 1000 Genomes Project populations.
The main interest of probabilistic GTM maps is that they attain two objectives with a single map: they provide a better visualization that separates populations efficiently, and they infer genetic ancestry for individuals or populations. This paper is a first application of GTM to ancestry classification models. Our code (https://github.com/hagax8/ancestry_viz) and interactive visualizations (https://lovingscience.com/ancestries) are available online.
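The posterior class probabilities mentioned above can be illustrated with a minimal sketch. It assumes that each individual is already described by a vector of GTM responsibilities (one value per map node, summing to 1), and that node-level class probabilities are estimated from the responsibility mass each class deposits on a node. This is one common way to build a GTM classifier, not necessarily the exact procedure of the paper; the function names and toy values are hypothetical.

```python
def node_class_probabilities(responsibilities, labels):
    """Estimate P(class | node) from training responsibilities.

    responsibilities: one list per individual, each summing to 1 over nodes.
    labels: population label of each individual.
    """
    classes = sorted(set(labels))
    n_nodes = len(responsibilities[0])
    # Responsibility mass that each class deposits on each node.
    mass = {c: [0.0] * n_nodes for c in classes}
    for r, y in zip(responsibilities, labels):
        for k, r_k in enumerate(r):
            mass[y][k] += r_k
    node_probs = []
    for k in range(n_nodes):
        total = sum(mass[c][k] for c in classes)
        if total > 0:
            node_probs.append({c: mass[c][k] / total for c in classes})
        else:
            # Empty node: fall back to a uniform distribution (assumption).
            node_probs.append({c: 1.0 / len(classes) for c in classes})
    return node_probs


def predict_class_probabilities(r_new, node_probs):
    """P(class | individual) = sum over nodes of R_k * P(class | node k)."""
    classes = list(node_probs[0].keys())
    return {c: sum(r_k * p[c] for r_k, p in zip(r_new, node_probs))
            for c in classes}


# Toy data: three map nodes, four training individuals (hypothetical values).
train_r = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.0, 0.2, 0.8], [0.1, 0.1, 0.8]]
train_y = ["YRI", "YRI", "JPT", "JPT"]
node_probs = node_class_probabilities(train_r, train_y)
print(predict_class_probabilities([0.0, 0.5, 0.5], node_probs))
```

Because the same responsibilities also give each individual's position on the map, a single trained model serves both visualization and classification.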
Body composition is often altered in psychiatric disorders. Using genome-wide common genetic variation data, we calculate sex-specific genetic correlations amongst body fat %, fat mass, fat-free mass, physical activity, glycemic traits and 17 psychiatric traits (up to N = 217,568). Two patterns emerge: (1) anorexia nervosa, schizophrenia, obsessive-compulsive disorder, and education years are negatively genetically correlated with body fat % and fat-free mass, whereas (2) attention-deficit/hyperactivity disorder (ADHD), alcohol dependence, insomnia, and heavy smoking are positively correlated. Anorexia nervosa shows a stronger genetic correlation with body fat % in females, whereas education years is more strongly correlated with fat mass in males. Education years and ADHD show genetic overlap with childhood obesity. Mendelian randomization identifies schizophrenia, anorexia nervosa, and higher education as causal for decreased fat mass, with higher body fat % possibly being a causal risk factor for ADHD and heavy smoking. These results suggest new possibilities for targeted preventive strategies.
With few exceptions, the marked advances in knowledge about the genetic basis of schizophrenia have not converged on findings that can be confidently used for precise experimental modeling. By applying knowledge of the cellular taxonomy of the brain from single-cell RNA sequencing, we evaluated whether the genomic loci implicated in schizophrenia map onto specific brain cell types. We found that the common-variant genomic results consistently mapped to pyramidal cells, medium spiny neurons (MSNs) and certain interneurons, but far less consistently to embryonic, progenitor or glial cells. These enrichments were due to sets of genes that were specifically expressed in each of these cell types. We also found that many of the diverse gene sets previously associated with schizophrenia (genes involved in synaptic function, those encoding mRNAs that interact with FMRP, antipsychotic targets, etc.) generally implicated the same brain cell types. Our results suggest a parsimonious explanation: the common-variant genetic results for schizophrenia point at a limited set of neurons, and the gene sets point to the same cells. The genetic risk associated with MSNs did not overlap with that of glutamatergic pyramidal cells and interneurons, suggesting that different cell types have biologically distinct roles in schizophrenia.
Anorexia nervosa (AN) occurs nine times more often in females than in males. Although environmental factors likely play a role, the reasons for this imbalanced sex ratio remain unresolved. AN displays high genetic correlations with anthropometric and metabolic traits. Given sex differences in body composition, we investigated the possible metabolic underpinnings of female propensity for AN. We conducted sex‐specific GWAS in a healthy and medication‐free subsample of the UK Biobank (n = 155,961), identifying 77 genome‐wide significant loci associated with body fat percentage (BF%) and 174 with fat‐free mass (FFM). Partitioned heritability analysis showed an enrichment for central nervous tissue‐associated genes for BF%, which was more prominent in females than males. Genetic correlations of BF% and FFM with the largest GWAS of AN by the Psychiatric Genomics Consortium were estimated to explore shared genomics. The genetic correlations of BF%male and BF%female with AN differed significantly from each other (p < .0001, δ = −0.17), suggesting that the female preponderance in AN may, in part, be explained by sex‐specific anthropometric and metabolic genetic factors increasing liability to AN.
The predictive utility of polygenic scores is increasing, and many polygenic scoring methods are available, but it is unclear which method performs best. This study evaluates the predictive utility of polygenic scoring methods within a reference-standardized framework, which uses a common set of variants and reference-based estimates of linkage disequilibrium and allele frequencies to construct scores. Eight polygenic score methods were tested: p-value thresholding and clumping (pT+clump), SBLUP, lassosum, LDpred1, LDpred2, PRScs, DBSLMM and SBayesR, evaluating their performance to predict outcomes in UK Biobank and the Twins Early Development Study (TEDS). Strategies to identify optimal p-value thresholds and shrinkage parameters were compared, including 10-fold cross validation, pseudovalidation and infinitesimal models (with no validation sample), and multi-polygenic score elastic net models. LDpred2, lassosum and PRScs performed strongly using 10-fold cross-validation to identify the most predictive p-value threshold or shrinkage parameter, giving a relative improvement of 16-18% over pT+clump in the correlation between observed and predicted outcome values. Using pseudovalidation, the best methods were PRScs, DBSLMM and SBayesR. PRScs pseudovalidation was only 3% worse than the best polygenic score identified by 10-fold cross validation. Elastic net models containing polygenic scores based on a range of parameters consistently improved prediction over any single polygenic score. Within a reference-standardized framework, the best polygenic prediction was achieved using LDpred2, lassosum and PRScs, modeling multiple polygenic scores derived using multiple parameters. This study will help researchers performing polygenic score studies to select the most powerful and predictive analysis methods.
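As a rough illustration of the simplest method in this comparison, the sketch below computes a p-value-thresholded polygenic score for one individual as the sum of allele dosages weighted by GWAS effect sizes. It assumes the variants have already been clumped (the clumping step of pT+clump is not shown); the function name and the toy effect sizes, p-values and dosages are hypothetical.

```python
def polygenic_score(dosages, effects, p_values, p_threshold):
    """Sum of dosage * effect over variants with p-value <= p_threshold.

    dosages: effect-allele counts (0, 1 or 2) for one individual.
    effects: per-variant GWAS effect sizes (for example, betas or log odds ratios).
    p_values: per-variant GWAS p-values.
    Assumes the variant list has already been clumped for linkage disequilibrium.
    """
    return sum(d * b
               for d, b, p in zip(dosages, effects, p_values)
               if p <= p_threshold)


# Toy example with three already-clumped variants (hypothetical values).
effects = [0.10, -0.05, 0.20]
p_values = [1e-8, 0.03, 0.4]
dosages = [2, 1, 0]

# Scores at several thresholds; a validation step would pick among them.
for threshold in (5e-8, 0.05, 1.0):
    print(threshold, polygenic_score(dosages, effects, p_values, threshold))
```

Computing the score at several thresholds, as in the loop, yields the set of candidate scores from which cross-validation selects one, or which an elastic net can combine.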
This paper is devoted to the analysis and visualization in 2-dimensional space of large data sets of millions of compounds using the incremental version of generative topographic mapping (iGTM). The iGTM algorithm implemented in the in-house ISIDA-GTM program was applied to a database of more than 2 million compounds combining data sets of 36 chemical suppliers and the NCI collection, encoded either by MOE descriptors or by MACCS keys. Taking advantage of the probabilistic nature of GTM, several approaches to data analysis were proposed. The chemical space coverage was evaluated using the normalized Shannon entropy. Different views of the data (property landscapes) were obtained by mapping various physical and chemical properties (molecular weight, aqueous solubility, LogP, etc.) onto the iGTM map. The superposition of these views helped to identify the regions in the chemical space populated by compounds with desirable physicochemical profiles and the suppliers providing them. The similarity of data sets in the latent space was assessed by applying several metrics (Euclidean distance, Tanimoto and Bhattacharyya coefficients) to data probability distributions based on cumulated responsibility vectors. As a complementary approach, data sets were compared by considering them as individual objects on a meta-GTM map, built on cumulated responsibility vectors or property landscapes produced with iGTM. We believe that the iGTM methodology described in this article represents a fast and reliable way to analyze and visualize large chemical databases.
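The coverage and similarity measures named above can be sketched as follows. The sketch assumes that each data set has been summarized as a cumulated responsibility vector (the responsibilities of all its compounds summed per map node) and normalized to a probability distribution. The formulas are the standard textbook definitions; the exact variants used in the paper may differ, and all names and toy values are hypothetical.

```python
import math


def normalize(cumulated):
    """Turn a cumulated responsibility vector into a probability distribution."""
    total = sum(cumulated)
    return [x / total for x in cumulated]


def normalized_shannon_entropy(p):
    """Shannon entropy divided by log(number of nodes).

    1.0 means the map is covered uniformly, 0.0 means a single node
    holds all the data. Requires at least two nodes.
    """
    entropy = -sum(x * math.log(x) for x in p if x > 0)
    return entropy / math.log(len(p))


def tanimoto(p, q):
    """Continuous Tanimoto coefficient between two distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (sum(a * a for a in p) + sum(b * b for b in q) - dot)


def bhattacharyya(p, q):
    """Bhattacharyya coefficient: overlap between two distributions."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


# Two hypothetical supplier libraries on a four-node map.
library_a = normalize([10.0, 10.0, 10.0, 10.0])
library_b = normalize([30.0, 10.0, 0.0, 0.0])
print(normalized_shannon_entropy(library_a))
print(normalized_shannon_entropy(library_b))
print(tanimoto(library_a, library_b), bhattacharyya(library_a, library_b))
```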
Anxiety disorders are common, complex psychiatric disorders with twin heritabilities of 30-60%. We conducted a genome-wide association study of Lifetime Anxiety Disorder (n_case = 25 453, n_control = 58 113) and an additional analysis of Current Anxiety Symptoms (n_case = 19 012, n_control = 58 113). The liability scale common variant heritability estimate for Lifetime Anxiety Disorder was 26%, and for Current Anxiety Symptoms was 31%. Five novel genome-wide significant loci were identified including an intergenic region on chromosome 9 that has previously been associated with neuroticism, and a locus overlapping the BDNF receptor gene, NTRK2. Anxiety showed significant positive genetic correlations with depression and insomnia as well as coronary artery disease, mirroring findings from epidemiological studies. We conclude that common genetic variation accounts for a substantive proportion of the genetic architecture underlying anxiety.
Earlier (Kireeva et al. Mol. Inf. 2012, 31, 301–312), we demonstrated that generative topographic mapping (GTM) can be efficiently used both for data visualization and building of classification models in the initial D-dimensional space of molecular descriptors. Here, we describe the modeling in two-dimensional latent space for the four classes of the BioPharmaceutics Drug Disposition Classification System (BDDCS) involving VolSurf descriptors. Three new definitions of the applicability domain (AD) of models have been suggested: one class-independent AD which considers the GTM likelihood, and two class-dependent ADs which consider, respectively, the predominant class in a given node of the map and informational entropy. The class entropy AD was found to be the most efficient for the BDDCS modeling. The predominant class AD can be directly visualized on GTM maps, which helps the interpretation of the model.
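A class entropy applicability domain can be sketched as a simple filter: a prediction is kept only when the entropy of the predicted class probabilities is low, that is, when the model clearly favors one class. The sketch below assumes a normalized entropy and a cutoff of 0.5; both are illustrative assumptions rather than the definition or threshold used in the paper, and the function names are hypothetical.

```python
import math


def normalized_class_entropy(class_probs):
    """Entropy of predicted class probabilities, scaled to the range [0, 1]."""
    entropy = -sum(p * math.log(p) for p in class_probs if p > 0)
    return entropy / math.log(len(class_probs))


def in_applicability_domain(class_probs, max_entropy=0.5):
    """Accept a prediction only if its class entropy is low enough.

    max_entropy is a hypothetical cutoff chosen for illustration.
    """
    return normalized_class_entropy(class_probs) <= max_entropy


# Four classes, as in BDDCS (probability values are hypothetical).
print(in_applicability_domain([0.97, 0.01, 0.01, 0.01]))  # confident prediction
print(in_applicability_domain([0.25, 0.25, 0.25, 0.25]))  # uninformative prediction
```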
Predicting the activity profile of a molecule or discovering structures possessing a specific activity profile are two important goals in chemoinformatics, which could be achieved by bridging activity and molecular descriptor spaces. In this paper, we introduce the “Stargate” version of the Generative Topographic Mapping approach (S-GTM) in which two different multidimensional spaces (e.g., structural descriptor space and activity space) are linked through a common 2D latent space. In the S-GTM algorithm, the manifolds are trained simultaneously in two initial spaces using the probabilities in the 2D latent space calculated as a weighted geometric mean of probability distributions in both spaces. S-GTM has the following interesting features: (1) activities are involved during the training procedure; therefore, the method is supervised, unlike conventional GTM; (2) using molecular descriptors of a given compound as input, the model predicts a whole activity profile, and (3) using an activity profile as input, areas populated by relevant chemical structures can be detected. To assess the performance of S-GTM prediction models, a descriptor space (ISIDA descriptors) of a set of 1325 GPCR ligands was related to a B-dimensional (B = 1 or 8) activity space corresponding to pKi values for eight different targets. S-GTM outperforms conventional GTM for individual activities and performs similarly to the Lasso multitask learning algorithm, although it is still slightly less accurate than the Random Forest method.
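The weighted geometric mean that links the two spaces can be sketched for a single compound as follows. The sketch assumes two responsibility vectors over the same latent nodes, one computed in descriptor space and one in activity space, a weight between 0 and 1, and a final renormalization so that the result sums to 1. The renormalization step, the function name and the toy values are assumptions made for illustration, not details taken from the paper.

```python
def combine_responsibilities(r_descriptor, r_activity, weight):
    """Weighted geometric mean of two responsibility vectors, renormalized.

    weight = 1.0 uses only the descriptor space, weight = 0.0 only the
    activity space. Raises ValueError if the two vectors share no node
    with non-zero probability.
    """
    combined = [(a ** weight) * (b ** (1.0 - weight))
                for a, b in zip(r_descriptor, r_activity)]
    total = sum(combined)
    if total == 0:
        raise ValueError("the two distributions do not overlap")
    return [c / total for c in combined]


# One compound on a three-node latent grid (hypothetical values).
r_desc = [0.7, 0.2, 0.1]
r_act = [0.1, 0.2, 0.7]
print(combine_responsibilities(r_desc, r_act, weight=0.5))
print(combine_responsibilities(r_desc, r_act, weight=0.9))
```

With equal weights the two views contribute symmetrically; moving the weight toward 1 pulls the shared latent position toward the descriptor-space view.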
Genome-wide association studies (GWAS) in psychiatry, once they reach sufficient sample size and power, have been enormously successful. The Psychiatric Genomics Consortium (PGC) aims for mega-analyses with sample sizes that will grow to >1 million individuals in the next 5 years. This should lead to hundreds of new findings for common genetic variants across nine psychiatric disorders studied by the PGC. The new targets discovered by GWAS have the potential to restart largely stalled psychiatric drug development pipelines, and the translation of GWAS findings into the clinic is a key aim of the recently funded phase 3 of the PGC. This is not without considerable technical challenges. These approaches complement the other main aim of GWAS studies, risk prediction approaches for improving detection, differential diagnosis, and clinical trial design. This paper outlines the motivations, technical and analytical issues, and the plans for translating PGC phase 3 findings into new therapeutics.