RNA sequencing has been proposed as a means of increasing diagnostic rates in studies of undiagnosed rare inherited disease. Recent studies have reported diagnostic improvements in the range of ...7.5-35% by profiling splicing, gene expression quantification and allele specific expression. To-date however, no study has systematically assessed the presence of gene-fusion transcripts in cases of germline disease. Fusion transcripts are routinely identified in cancer studies and are increasingly recognized as having diagnostic, prognostic or therapeutic relevance. Isolated reports exist of fusion transcripts being detected in cases of developmental and neurological phenotypes, and thus, systematic application of fusion detection to germline conditions may further increase diagnostic rates. However, current fusion detection methods are unsuited to the investigation of germline disease due to performance biases arising from their development using tumor, cell-line or in-silico data.
We describe a tailored approach to fusion candidate identification and prioritization in a cohort of 47 undiagnosed, suspected inherited disease patients. We modify an existing fusion transcript detection algorithm by eliminating its cell line-derived filtering steps, and instead, prioritize candidates using a custom workflow that integrates genomic and transcriptomic sequence alignment, biological and technical annotations, customized categorization logic, and phenotypic prioritization.
We demonstrate that our approach to fusion transcript identification and prioritization detects genuine fusion events excluded by standard analyses and efficiently removes phenotypically unimportant candidates and false positive events, resulting in a reduced candidate list enriched for events with potential phenotypic relevance. We describe the successful genetic resolution of two previously undiagnosed disease cases through the detection of pathogenic fusion transcripts. Furthermore, we report the experimental validation of five additional cases of fusion transcripts with potential phenotypic relevance.
The approach we describe can be implemented to enable the detection of phenotypically relevant fusion transcripts in studies of rare inherited disease. Fusion transcript detection has the potential to increase diagnostic rates in rare inherited disease and should be included in RNA-based analytical pipelines aimed at genetic diagnosis.
The processes by which disease spreads in a population of individuals are inherently stochastic. The master equation has proven to be a useful tool for modeling such processes. Unfortunately, solving ...the master equation analytically is possible only in limited cases (e.g., when the model is linear), and thus numerical procedures or approximation methods must be employed. Available approximation methods, such as the system size expansion method of van Kampen, may fail to provide reliable solutions, whereas current numerical approaches can induce appreciable computational cost. In this paper, we propose a new numerical technique for solving the master equation. Our method is based on a more informative stochastic process than the population process commonly used in the literature. By exploiting the structure of the master equation governing this process, we develop a novel technique for calculating the exact solution of the master equation--up to a desired precision--in certain models of stochastic epidemiology. We demonstrate the potential of our method by solving the master equation associated with the stochastic SIR epidemic model. MATLAB software that implements the methods discussed in this paper is freely available as Supporting Information S1.
Abstract
Motivation
Next-generation sequencing is rapidly improving diagnostic rates in rare Mendelian diseases, but even with whole genome or whole exome sequencing, the majority of cases remain ...unsolved. Increasingly, RNA sequencing is being used to solve many cases that evade diagnosis through sequencing alone. Specifically, the detection of aberrant splicing in many rare disease patients suggests that identifying RNA splicing outliers is particularly useful for determining causal Mendelian disease genes. However, there is as yet a paucity of statistical methodologies to detect splicing outliers.
Results
We developed LeafCutterMD, a new statistical framework that significantly improves the previously published LeafCutter in the context of detecting outlier splicing events. Through simulations and analysis of real patient data, we demonstrate that LeafCutterMD has better power than the state-of-the-art methodology while controlling false-positive rates. When applied to a cohort of disease-affected probands from the Mayo Clinic Center for Individualized Medicine, LeafCutterMD recovered all aberrantly spliced genes that had previously been identified by manual curation efforts.
Availability and implementation
The source code for this method is available under the opensource Apache 2.0 license in the latest release of the LeafCutter software package available online at http://davidaknowles.github.io/leafcutter.
Supplementary information
Supplementary data are available at Bioinformatics online.
DNA methylation is a stable form of epigenetic memory used by cells to control gene expression. Whole genome bisulfite sequencing (WGBS) has emerged as a gold-standard experimental technique for ...studying DNA methylation by producing high resolution genome-wide methylation profiles. Statistical modeling and analysis is employed to computationally extract and quantify information from these profiles in an effort to identify regions of the genome that demonstrate crucial or aberrant epigenetic behavior. However, the performance of most currently available methods for methylation analysis is hampered by their inability to directly account for statistical dependencies between neighboring methylation sites, thus ignoring significant information available in WGBS reads.
We present a powerful information-theoretic approach for genome-wide modeling and analysis of WGBS data based on the 1D Ising model of statistical physics. This approach takes into account correlations in methylation by utilizing a joint probability model that encapsulates all information available in WGBS methylation reads and produces accurate results even when applied on single WGBS samples with low coverage. Using the Shannon entropy, our approach provides a rigorous quantification of methylation stochasticity in individual WGBS samples genome-wide. Furthermore, it utilizes the Jensen-Shannon distance to evaluate differences in methylation distributions between a test and a reference sample. Differential performance assessment using simulated and real human lung normal/cancer data demonstrate a clear superiority of our approach over DSS, a recently proposed method for WGBS data analysis. Critically, these results demonstrate that marginal methods become statistically invalid when correlations are present in the data.
This contribution demonstrates clear benefits and the necessity of modeling joint probability distributions of methylation using the 1D Ising model of statistical physics and of quantifying methylation stochasticity using concepts from information theory. By employing this methodology, substantial improvement of DNA methylation analysis can be achieved by effectively taking into account the massive amount of statistical information available in WGBS data, which is largely ignored by existing methods.
DNA methylation is one of the most stable and well-characterized epigenetic alterations in humans. Accordingly, it has already found clinical utility as a molecular biomarker in a variety of disease ...contexts. Existing methods for clinical diagnosis of methylation-related disorders focus on outlier detection in a small number of CpG sites using standardized cutoffs which differentiate healthy from abnormal methylation levels. The standardized cutoff values used in these methods do not take into account methylation patterns which are known to differ between the sexes and with age.
Here we profile genome-wide DNA methylation from blood samples drawn from within a cohort composed of healthy controls of different age and sex alongside patients with Prader-Willi syndrome (PWS), Beckwith-Wiedemann syndrome, Fragile-X syndrome, Angelman syndrome, and Silver-Russell syndrome. We propose a Generalized Additive Model to perform age and sex adjusted outlier analysis of around 700,000 CpG sites throughout the human genome. Utilizing z-scores among the cohort for each site, we deployed an ensemble based machine learning pipeline and achieved a combined prediction accuracy of 0.96 (Binomial 95% Confidence Interval 0.868Formula: see text0.995).
We demonstrate a method for age and sex adjusted outlier detection of differentially methylated loci based on a large cohort of healthy individuals. We present a custom machine learning pipeline utilizing this outlier analysis to classify samples for potential methylation associated congenital disorders. These methods are able to achieve high accuracy when used with machine learning methods to classify abnormal methylation patterns.
Several recent studies have demonstrated the utility of RNA-Seq in the diagnosis of rare inherited disease. Diagnostic rates 35% higher than those previously achievable with DNA-Seq alone have been ...attained. These studies have primarily profiled gene expression and splicing defects, however, some have also shown that fusion transcripts are diagnostic or phenotypically relevant in patients with constitutional disorders. Fusion transcripts have traditionally been studied as oncogenic phenomena, with relevance only to cancer testing. Consequently, fusion detection algorithms were biased toward the detection of well-known oncogenic fusions, hindering their application to rare Mendelian genetic disease studies. A recent methodology published by the authors successfully tailored a traditional algorithm to the detection of pathogenic fusion events in inherited disease. A key mechanism of decreasing false positive or biologically benign events was comparison to a database of events detected in normal tissues. This approach is akin to population frequency-based filtering of genetic variants. It is predicated on the idea that pathogenic fusion transcripts are absent from normal tissue. We report on an analysis of RNA-Seq data from the genotype-tissue expression (GTEx) project in which known pathogenic fusions are computationally detected at low levels in normal tissues unassociated with the disease phenotype. Examples include archetypal cancer fusion transcripts, as well as fusions responsible for rare inherited disease. We consider potential explanations for the detectability of such transcripts and discuss the bearing such results have on the future profiling of genetic disease patients for pathogenic gene fusions.
X-chromosome inactivation (XCI) is an epigenetic process that occurs during early development in mammalian females by randomly silencing one of two copies of the X chromosome in each cell. The ...preferential inactivation of either the maternal or paternal copy of the X chromosome in a majority of cells results in a skewed or non-random pattern of X inactivation and is observed in over 25% of adult females. Identifying skewed X inactivation is of clinical significance in patients with suspected rare genetic diseases due to the possibility of biased expression of disease-causing genes present on the active X chromosome. The current clinical test for the detection of skewed XCI relies on the methylation status of the methylation-sensitive restriction enzyme (Hpall) binding site present in proximity of short tandem polymorphic repeats on the androgen receptor (AR) gene. This approach using one locus results in uninformative or inconclusive data for 10-20% of tests. Further, recent studies have shown inconsistency between methylation of the AR locus and the state of inactivation of the X chromosome. Herein, we develop a method for estimating X inactivation status, using exome and transcriptome sequencing data derived from blood in 227 female samples. We built a reference model for evaluation of XCI in 135 females from the GTEx consortium. We tested and validated the model on 11 female individuals with different types of undiagnosed rare genetic disorders who were clinically tested for X-skew using the AR gene assay and compared results to our outlier-based analysis technique.
In comparison to the AR clinical test for identification of X inactivation, our method was concordant with the AR method in 9 samples, discordant in 1, and provided a measure of X inactivation in 1 sample with uninformative clinical results. We applied this method on an additional 81 females presenting to the clinic with phenotypes consistent with different hereditary disorders without a known genetic diagnosis.
This study presents the use of transcriptome and exome sequencing data to provide an accurate and complete estimation of X-inactivation and skew status in a cohort of female patients with different types of suspected rare genetic disease.
Establishment and maintenance of DNA methylation throughout the genome is an important epigenetic mechanism that regulates gene expression whose disruption has been implicated in human diseases like ...cancer. It is therefore crucial to know which genes, or other genomic features of interest, exhibit significant discordance in DNA methylation between two phenotypes. We have previously proposed an approach for ranking genes based on methylation discordance within their promoter regions, determined by centering a window of fixed size at their transcription start sites. However, we cannot use this method to identify statistically significant genomic features and handle features of variable length and with missing data.
We present a new approach for computing the statistical significance of methylation discordance within genomic features of interest in single and multiple test/reference studies. We base the proposed method on a well-articulated hypothesis testing problem that produces p- and q-values for each genomic feature, which we then use to identify and rank features based on the statistical significance of their epigenetic dysregulation. We employ the information-theoretic concept of mutual information to derive a novel test statistic, which we can evaluate by computing Jensen-Shannon distances between the probability distributions of methylation in a test and a reference sample. We design the proposed methodology to simultaneously handle biological, statistical, and technical variability in the data, as well as variable feature lengths and missing data, thus enabling its wide-spread use on any list of genomic features. This is accomplished by estimating, from reference data, the null distribution of the test statistic as a function of feature length using generalized additive regression models. Differential assessment, using normal/cancer data from healthy fetal tissue and pediatric high-grade glioma patients, illustrates the potential of our approach to greatly facilitate the exploratory phases of clinically and biologically relevant methylation studies.
The proposed approach provides the first computational tool for statistically testing and ranking genomic features of interest based on observed DNA methylation discordance in comparative studies that accounts, in a rigorous manner, for biological, statistical, and technical variability in methylation data, as well as for variability in feature length and for missing data.
Background
COVID-19 is caused by the SARS-CoV-2 virus and has strikingly heterogeneous clinical manifestations, with most individuals contracting mild disease but a substantial minority experiencing ...fulminant cardiopulmonary symptoms or death. The clinical covariates and the laboratory tests performed on a patient provide robust statistics to guide clinical treatment. Deep learning approaches on a data set of this nature enable patient stratification and provide methods to guide clinical treatment.
Objective
Here, we report on the development and prospective validation of a state-of-the-art machine learning model to provide mortality prediction shortly after confirmation of SARS-CoV-2 infection in the Mayo Clinic patient population.
Methods
We retrospectively constructed one of the largest reported and most geographically diverse laboratory information system and electronic health record of COVID-19 data sets in the published literature, which included 11,807 patients residing in 41 states of the United States of America and treated at medical sites across 5 states in 3 time zones. Traditional machine learning models were evaluated independently as well as in a stacked learner approach by using AutoGluon, and various recurrent neural network architectures were considered. The traditional machine learning models were implemented using the AutoGluon-Tabular framework, whereas the recurrent neural networks utilized the TensorFlow Keras framework. We trained these models to operate solely using routine laboratory measurements and clinical covariates available within 72 hours of a patient’s first positive COVID-19 nucleic acid test result.
Results
The GRU-D recurrent neural network achieved peak cross-validation performance with 0.938 (SE 0.004) as the area under the receiver operating characteristic (AUROC) curve. This model retained strong performance by reducing the follow-up time to 12 hours (0.916 SE 0.005 AUROC), and the leave-one-out feature importance analysis indicated that the most independently valuable features were age, Charlson comorbidity index, minimum oxygen saturation, fibrinogen level, and serum iron level. In the prospective testing cohort, this model provided an AUROC of 0.901 and a statistically significant difference in survival (P<.001, hazard ratio for those predicted to survive, 95% CI 0.043-0.106).
Conclusions
Our deep learning approach using GRU-D provides an alert system to flag mortality for COVID-19–positive patients by using clinical covariates and laboratory values within a 72-hour window after the first positive nucleic acid test result.
The role intrinsic statistical fluctuations play in creating avalanches--patterns of complex bursting activity with scale-free properties--is examined in leaky Markovian networks. Using this broad ...class of models, we develop a probabilistic approach that employs a potential energy landscape perspective coupled with a macroscopic description based on statistical thermodynamics. We identify six important thermodynamic quantities essential for characterizing system behavior as a function of network size: the internal potential energy, entropy, free potential energy, internal pressure, pressure, and bulk modulus. In agreement with classical phase transitions, these quantities evolve smoothly as a function of the network size until a critical value is reached. At that value, a discontinuity in pressure is observed that leads to a spike in the bulk modulus demarcating loss of thermodynamic robustness. We attribute this novel result to a reallocation of the ground states (global minima) of the system's stationary potential energy landscape caused by a noise-induced deformation of its topographic surface. Further analysis demonstrates that appreciable levels of intrinsic noise can cause avalanching, a complex mode of operation that dominates system dynamics at near-critical or subcritical network sizes. Illustrative examples are provided using an epidemiological model of bacterial infection, where avalanching has not been characterized before, and a previously studied model of computational neuroscience, where avalanching was erroneously attributed to specific neural architectures. The general methods developed here can be used to study the emergence of avalanching (and other complex phenomena) in many biological, physical and man-made interaction networks.