Typical data in a microbiome study consist of operational taxonomic unit (OTU) counts characterized by excess zeros, a feature that investigators often ignore. In this paper, we compare the performance of competing methods for modeling zero-inflated data through extensive simulations and application to a microbiome study. These methods include standard parametric and non-parametric models, hurdle models, and zero-inflated models. We examine varying degrees of zero inflation, with or without dispersion in the count component, as well as different magnitudes and directions of the covariate effect on the structural zeros and the count component. We focus on the assessment of type I error, power to detect the overall covariate effect, measures of model fit, and bias and efficiency of parameter estimation. We also evaluate the ability of model selection strategies using the Akaike information criterion (AIC) or the Vuong test to identify the correct model. The simulation studies show that hurdle and zero-inflated models have well-controlled type I error, higher power, and better goodness of fit, and are more accurate and efficient in parameter estimation. In addition, hurdle models have goodness of fit and count-component parameter estimates similar to those of their corresponding zero-inflated models. However, the estimation and interpretation of the parameters for the zero component differ, and hurdle models are more stable when structural zeros are absent. We then discuss the model selection strategy for zero-inflated data and implement it in a gut microbiome study of more than 400 independent subjects.
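To make the distinction between the two model families concrete, the following is a minimal sketch of the probability mass functions of a zero-inflated Poisson and a hurdle Poisson model. It is an illustration only, not the paper's simulation code: the function names and the choice of a Poisson (rather than negative binomial) count component are assumptions made here, and no covariates are included.

```python
import math


def poisson_pmf(k, lam):
    """Ordinary Poisson probability of observing count k with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zip_pmf(k, lam, pi):
    """Zero-inflated Poisson (hypothetical helper).

    With probability pi the observation is a structural zero; otherwise it is
    drawn from Poisson(lam), which can itself produce a (sampling) zero.
    """
    base = (1.0 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


def hurdle_pmf(k, lam, p_zero):
    """Hurdle Poisson (hypothetical helper).

    All zeros are modeled by p_zero; positive counts follow a Poisson(lam)
    truncated at zero, so the count component never produces a zero.
    """
    if k == 0:
        return p_zero
    truncated = poisson_pmf(k, lam) / (1.0 - math.exp(-lam))
    return (1.0 - p_zero) * truncated
```

Without covariates, setting `p_zero` equal to the zero-inflated model's total zero probability, `pi + (1 - pi) * exp(-lam)`, makes the two distributions coincide. The models then differ only in what the zero-component parameter means, which is in line with the abstract's remark that interpretation of the zero component differs between them.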
An unexpectedly high proportion of SNPs on the X chromosome in the 1000 Genomes Project phase 3 data showed significant sex differences in minor allele frequencies (sdMAF). sdMAF persisted for many of these SNPs in the recently released high coverage whole genome sequence of the 1000 Genomes Project that was aligned to GRCh38, and it was consistent between the five super-populations. Among the 245,825 common (MAF>5%) biallelic X-chromosomal SNPs in the phase 3 data presumed to be of high quality, 2,039 have genome-wide significant sdMAF (p-value <5e-8). The proportion of SNPs with sdMAF varied by region: 0.83% in the non-pseudo-autosomal region (NPR), 0.29% in pseudo-autosomal region 1 (PAR1), 13.1% in PAR2, and 0.85% in the X-transposed region (XTR)/PAR3. These SNPs were clustered at the NPR-PAR boundaries, among other locations. sdMAF at the NPR-PAR boundaries is biologically expected due to sex-linkage, but has generally been ignored in association studies. For comparison, similar analyses found only 6, 1 and 0 SNPs with significant sdMAF on chromosomes 1, 7 and 22, respectively. Similar sdMAF results for the X chromosome were obtained from the high coverage whole genome sequence data from gnomAD V 3.1.2 for both the non-Finnish European and African/African American samples. Future X chromosome analyses need to take sdMAF into account.
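As a rough illustration of what testing for sdMAF at a single SNP involves, the sketch below compares allele frequencies between females and males with a pooled two-proportion z-test on allele counts. This is a simplified stand-in and not the test used in the study: it assumes Hardy-Weinberg equilibrium, treats males as contributing one allele in the NPR and two in the PARs, and the function and parameter names are hypothetical.

```python
import math


def sdmaf_z_test(alt_f, n_f, alt_m, n_m, males_hemizygous=True):
    """Two-proportion z-test for a sex difference in allele frequency.

    alt_f, alt_m: alternate-allele counts in females and males.
    n_f, n_m: numbers of females and males genotyped.
    males_hemizygous: True for NPR SNPs (one allele per male), False for
        PAR SNPs (two alleles per male). This flag is an assumption of the
        sketch, not a parameter taken from the paper.
    Returns (frequency difference, z statistic, two-sided p-value).
    """
    alleles_f = 2 * n_f
    alleles_m = n_m if males_hemizygous else 2 * n_m
    p_f = alt_f / alleles_f
    p_m = alt_m / alleles_m
    pooled = (alt_f + alt_m) / (alleles_f + alleles_m)
    if pooled <= 0.0 or pooled >= 1.0:
        raise ValueError("SNP is monomorphic; the test is undefined")
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / alleles_f + 1.0 / alleles_m))
    z = (p_f - p_m) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return p_f - p_m, z, p_value
```

A call such as `sdmaf_z_test(300, 500, 100, 500)` compares a female frequency of 0.30 with a male frequency of 0.20. For a genome-wide scan, the resulting p-value would be compared against the 5e-8 threshold quoted above.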
Abstract
Motivation
Research supports the potential use of the microbiome as a predictor of some diseases. Motivated by the findings that microbiome data are complex in nature and that there is an inherent correlation due to the hierarchical taxonomy of microbial Operational Taxonomic Units (OTUs), we propose a novel machine learning method incorporating a stratified approach to group OTUs into phylum clusters. Convolutional Neural Networks (CNNs) were trained within each cluster individually. Further, through an ensemble learning approach, features obtained from each cluster were concatenated to improve prediction accuracy. Our two-step approach, comprising stratification prior to combining multiple CNNs, captured the relationships between OTUs sharing a phylum more efficiently than a single CNN that ignores OTU correlations.
Results
We used simulated datasets containing 168 OTUs in 200 cases and 200 controls for model testing. Thirty-two OTUs potentially associated with risk of disease were randomly selected, and interactions between three OTUs were used to introduce non-linearity. We also applied the method to two human microbiome studies to demonstrate its effectiveness: (i) cirrhosis, with 118 cases and 114 controls; and (ii) type 2 diabetes (T2D), with 170 cases and 174 controls. Extensive experimentation and comparison against conventional machine learning techniques yielded encouraging results. We obtained mean AUC values of 0.88, 0.92, and 0.75 in the simulations, cirrhosis, and T2D data, respectively, a consistent improvement (5%, 3%, and 7%) over the next best performing method, Random Forest.
Availability and implementation
https://github.com/divya031090/TaxoNN_OTU.
Supplementary information
Supplementary data are available at Bioinformatics online.
Estimating the prevalence of autosomal dominant polycystic kidney disease (ADPKD) is challenging because of age-dependent penetrance and incomplete clinical ascertainment. Early studies estimated the lifetime risk of ADPKD to be about one per 1000 in the general population, whereas recent epidemiologic studies report a point prevalence of three to five cases per 10,000 in the general population.
To measure the frequency of high-confidence mutations presumed to be causative in ADPKD and autosomal dominant polycystic liver disease (ADPLD) and estimate lifetime ADPKD prevalence, we used two large population sequencing databases, gnomAD (15,496 whole-genome sequences; 123,136 exome sequences) and BRAVO (62,784 whole-genome sequences). We used stringent criteria for defining rare variants in genes involved in ADPKD, genes involved in ADPLD, and potential cystic disease modifiers; evaluated variants for quality and annotation; compared variants with data from an ADPKD mutation database; and used bioinformatic tools to predict pathogenicity.
Identification of high-confidence pathogenic mutations in whole-genome sequencing provided a lower boundary for lifetime ADPKD prevalence of 9.3 cases per 10,000 sequenced. Estimates from whole-genome and exome data were similar. Truncating mutations in ADPLD genes and genes of potential relevance as cyst modifiers were found in 20.2 cases and 103.9 cases per 10,000 sequenced, respectively.
Population whole-genome sequencing suggests a higher than expected prevalence of ADPKD-associated mutations. Loss-of-function mutations in ADPLD genes are also more common than expected, suggesting the possibility of unrecognized cases and incomplete penetrance. Substantial rare variation exists in genes with potential for phenotype modification in ADPKD.
People with type 2 diabetes frequently use low-calorie sweeteners to manage glycemia and reduce caloric intake. Use of erythritol, a low-calorie sweetener, has increased recently. Higher circulating concentration is associated with major cardiac events and metabolic disease in observational data, prompting some concern. As observational data may be prone to confounding and reverse causality, we undertook bidirectional Mendelian randomization (MR) to investigate potential causal associations between erythritol and coronary artery disease (CAD), BMI, waist-hip ratio (WHR), and glycemic and renal traits in cohorts of European ancestry. Analyses were undertaken using instruments comprising genome-wide significant variants from three cohorts with erythritol measurement. Across instruments, we did not find supportive evidence that increased erythritol increases CAD (b = -0.033 ± 0.02, P = 0.14; b = 0.46 ± 0.37, P = 0.23). MR indicates erythritol may decrease BMI (b = -0.04 ± 0.018, P = 0.03; b = -0.04 ± 0.0085, P = 1.23 × 10⁻⁵; b = -0.083 ± 0.092, P = 0.036), with potential evidence from one instrument of increased BMI adjusted for WHR (b = 0.046 ± 0.022, P = 0.035). No evidence of causal association was found with other traits. In conclusion, we did not find supportive evidence from MR that erythritol increases cardiometabolic disease. These findings await confirmation in well-designed prospective studies.
Andrew Paterson discusses findings from a new study showing that HbA1c screening for diabetes will leave 2% of African Americans undiagnosed, and the need for personalised medicine.
Relationship estimation and segment detection between individuals are important aspects of disease gene mapping. Existing methods are either tailored for computational efficiency or require phasing to improve accuracy. We developed TRUFFLE, a method that integrates computational techniques and statistical principles for the identification and visualization of identity-by-descent (IBD) segments using un-phased data. By skipping the haplotype phasing step and, instead, relying on a simpler region-based approach, our method is computationally efficient while maintaining inferential accuracy. In addition, an error model corrects for segment break-ups that occur as a consequence of genotyping errors. TRUFFLE can estimate relatedness for 3.1 million pairs from the 1000 Genomes Project data in a few minutes on a typical laptop computer. Consistent with expectation, we identified only three second cousin or closer pairs across different populations, while commonly used methods identified a large number of such pairs. Similarly, within populations, we identified many fewer related pairs. Compared to methods relying on phased data, TRUFFLE has comparable accuracy but is drastically faster and has fewer broken segments. We also identified specific local genomic regions that are commonly shared within populations, suggesting selection. When applied to pedigree data, we observed 99.6% accuracy in detecting 1st to 5th degree relationships. As genomic datasets become much larger, TRUFFLE can enable disease gene mapping through implicit shared haplotypes by accurate IBD segment detection.
Genetic effects can be sex-specific, particularly for traits such as testosterone, a sex hormone. While sex-stratified analysis provides easily interpretable sex-specific effect size estimates, the presence of sex differences in SNP effect implies a SNP×sex interaction. This suggests using the often overlooked joint test, which tests a SNP's main and SNP×sex interaction effects simultaneously. Notably, even without individual-level data, the joint test statistic can be derived from sex-stratified summary statistics through an omnibus meta-analysis. Utilizing the available sex-stratified summary statistics of the UK Biobank, we performed such omnibus meta-analyses for 290 quantitative traits. Results revealed that this approach is robust to genetic effect heterogeneity and can outperform the traditional sex-stratified or sex-combined main effect-only tests. Therefore, we advocate using the omnibus meta-analysis that captures both the main and interaction effects. Subsequent sex-stratified analysis should be conducted for sex-specific effect size estimation and interpretation.
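One common way to build a 2-degree-of-freedom joint statistic from sex-stratified summary statistics is to sum the squared female and male z-scores. The sketch below shows that construction under the assumptions that the two sex-specific estimates are independent and approximately normal. It is meant as an illustration of the idea rather than the authors' exact procedure, and the function name is hypothetical.

```python
import math


def omnibus_joint_test(beta_f, se_f, beta_m, se_m):
    """Joint 2-df test from sex-stratified effect estimates (a sketch).

    beta_f, se_f: effect estimate and standard error in females.
    beta_m, se_m: effect estimate and standard error in males.
    Under the null of no effect in either sex, and assuming the two
    estimates are independent, z_f^2 + z_m^2 follows a chi-square
    distribution with 2 degrees of freedom.
    Returns (test statistic, p-value).
    """
    z_f = beta_f / se_f
    z_m = beta_m / se_m
    stat = z_f ** 2 + z_m ** 2
    # The chi-square survival function with 2 df has the closed form exp(-x/2).
    p_value = math.exp(-stat / 2.0)
    return stat, p_value
```

For example, `omnibus_joint_test(0.1, 0.05, -0.1, 0.05)` combines two effects of equal size and opposite sign. A sex-combined main-effect test would tend to cancel these out, whereas the summed statistic retains the evidence from both sexes.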
Next-generation sequencing has led to an explosion of genetic findings for many rare diseases. However, most of the variants identified are very rare and were also identified in small pedigrees, which creates challenges in terms of penetrance estimation and translation into genetic counselling in the setting of cascade testing. We use simulations to show that for a rare (dominant) disorder where a variant is identified in a small number of small pedigrees, the penetrance estimate can both have large uncertainty and be drastically inflated, due to underlying ascertainment bias. We have developed PenEst, an app that allows users to investigate the phenomenon across ranges of parameter settings. We also illustrate robust ascertainment corrections via the LOD (logarithm of the odds) score, and recommend a LOD-based approach to assessing pathogenicity of rare variants in the presence of reduced penetrance.
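The inflation described above can be reproduced with a toy simulation. The sketch below is not PenEst and does not implement the LOD-based correction. It assumes a deliberately simple ascertainment rule, chosen here for illustration, in which a pedigree enters the study only if at least `min_affected` of its carriers are affected, and it then computes the naive penetrance estimate (affected carriers divided by all carriers) among the ascertained pedigrees. All names and parameter values are hypothetical.

```python
import random


def naive_penetrance(true_penetrance, carriers_per_pedigree, min_affected,
                     n_pedigrees, seed=0):
    """Naive penetrance estimate under a simple ascertainment rule (a sketch).

    Each carrier is affected independently with probability true_penetrance.
    Only pedigrees with at least min_affected affected carriers are kept.
    Returns affected / carriers among the kept pedigrees, or NaN if none.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    affected_total = 0
    carriers_total = 0
    for _ in range(n_pedigrees):
        affected = sum(rng.random() < true_penetrance
                       for _ in range(carriers_per_pedigree))
        if affected >= min_affected:  # the pedigree is ascertained
            affected_total += affected
            carriers_total += carriers_per_pedigree
    if carriers_total == 0:
        return float("nan")
    return affected_total / carriers_total
```

With a true penetrance of 0.3 and four carriers per pedigree, requiring at least two affected carriers pushes the naive estimate well above 0.3, while setting `min_affected=0` (no ascertainment) recovers a value close to the truth. Using fewer pedigrees in the call also shows how variable the estimate becomes.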