Modeling the properties and functions of DNA sequences is an important but challenging task in the broad field of genomics. This task is particularly difficult for non-coding DNA, the vast majority of which is still poorly understood in terms of function. A powerful predictive model for the function of non-coding DNA can have enormous benefit for both basic science and translational research because over 98% of the human genome is non-coding and 93% of disease-associated variants lie in these regions. To address this need, we propose DanQ, a novel hybrid convolutional and bi-directional long short-term memory recurrent neural network framework for predicting non-coding function de novo from sequence. In the DanQ model, the convolution layer captures regulatory motifs, while the recurrent layer captures long-term dependencies between the motifs in order to learn a regulatory 'grammar' to improve predictions. DanQ improves considerably upon other models across several metrics. For some regulatory markers, DanQ can achieve over a 50% relative improvement in the area under the precision-recall curve metric compared to related models. We have made the source code available at the GitHub repository http://github.com/uci-cbcl/DanQ.
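To illustrate how a convolution layer can act as a motif scanner over DNA, the following minimal sketch one-hot encodes a sequence, slides a single kernel along it, and applies a ReLU and max pooling. It is not the DanQ implementation: the kernel weights, the example sequence, and the pooling size are hypothetical, and the bi-directional LSTM that DanQ places after this stage is omitted.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as rows of 4 values (A, C, G, T); unknown bases become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def conv1d_relu(x, kernel):
    """Slide one kernel over the one-hot rows (valid positions only), then apply ReLU."""
    width = len(kernel)
    out = []
    for start in range(len(x) - width + 1):
        score = sum(
            x[start + i][j] * kernel[i][j]
            for i in range(width)
            for j in range(4)
        )
        out.append(max(0.0, score))
    return out

def max_pool(values, size):
    """Non-overlapping max pooling over windows of `size` positions."""
    return [max(values[i:i + size]) for i in range(0, len(values), size)]

# Hypothetical kernel: +1 where the base matches the motif "TATA", -1 otherwise.
motif = "TATA"
kernel = [[1.0 if b == m else -1.0 for b in BASES] for m in motif]

sequence = "GGCTATAAGGCC"  # hypothetical input containing one exact motif match
activations = conv1d_relu(one_hot(sequence), kernel)
pooled = max_pool(activations, 3)
print(activations)
print(pooled)
```

In a trained network the kernel weights are learned rather than hand-written, and the pooled activations from many kernels would be passed to the recurrent layer.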
• Open source method, FactorNet, for predicting cell type-specific transcription factor binding.
• One of the top performing methods in the ENCODE-DREAM Challenge.
• Transcription factor binding prediction problem is far from solved.
Due to the large numbers of transcription factors (TFs) and cell types, querying binding profiles of all valid TF/cell type pairs is not experimentally feasible. To address this issue, we developed a convolutional-recurrent neural network model, called FactorNet, to computationally impute the missing binding data. FactorNet trains on binding data from reference cell types to make predictions on testing cell types by leveraging a variety of features, including genomic sequences, genome annotations, gene expression, and signal data, such as DNase I cleavage. FactorNet implements several convenient strategies to reduce runtime and memory consumption. By visualizing the neural network models, we can interpret how the model predicts binding. We also investigate the variables that affect cross-cell type accuracy, and offer suggestions to improve upon this field. Our method ranked among the top teams in the ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge, achieving first place on six of the 13 final round evaluation TF/cell type pairs, the most of any competing team. The FactorNet source code is publicly available, allowing users to reproduce our methodology from the ENCODE-DREAM Challenge.
Annotating genetic variants, especially non-coding variants, for the purpose of identifying pathogenic variants remains a challenge. Combined annotation-dependent depletion (CADD) is an algorithm designed to annotate both coding and non-coding variants, and has been shown to outperform other annotation algorithms. CADD trains a linear kernel support vector machine (SVM) to differentiate evolutionarily derived, likely benign, alleles from simulated, likely deleterious, variants. However, SVMs cannot capture non-linear relationships among the features, which can limit performance. To address this issue, we have developed DANN. DANN uses the same feature set and training data as CADD to train a deep neural network (DNN). DNNs can capture non-linear relationships among features and are better suited than SVMs for problems with a large number of samples and features. We exploit Compute Unified Device Architecture-compatible graphics processing units and deep learning techniques such as dropout and momentum training to accelerate the DNN training. DANN achieves about a 19% relative reduction in the error rate and about a 14% relative increase in the area under the curve (AUC) metric over CADD's SVM methodology.
All data and source code are available at https://cbcl.ics.uci.edu/public_data/DANN/.
Purpose
Radiation therapy (RT) is a common treatment option for head and neck (HaN) cancer. An important step involved in RT planning is the delineation of organs‐at‐risk (OARs) based on HaN computed tomography (CT). However, manually delineating OARs is time‐consuming as each slice of CT images needs to be individually examined and a typical CT consists of hundreds of slices. Automating OAR segmentation has the benefit of both reducing the time and improving the quality of RT planning. Existing anatomy autosegmentation algorithms use primarily atlas‐based methods, which require sophisticated atlas creation and cannot adequately account for anatomy variations among patients. In this work, we propose an end‐to‐end, atlas‐free three‐dimensional (3D) convolutional deep learning framework for fast and fully automated whole‐volume HaN anatomy segmentation.
Methods
Our deep learning model, called AnatomyNet, segments OARs from head and neck CT images in an end‐to‐end fashion, receiving whole‐volume HaN CT images as input and generating masks of all OARs of interest in one shot. AnatomyNet is built upon the popular 3D U‐net architecture, but extends it in three important ways: (a) a new encoding scheme to allow autosegmentation on whole‐volume CT images instead of local patches or subsets of slices, (b) incorporating 3D squeeze‐and‐excitation residual blocks in encoding layers for better feature representation, and (c) a new loss function combining Dice scores and focal loss to facilitate the training of the neural model. These features are designed to address two main challenges in deep learning‐based HaN segmentation: (a) segmenting small anatomies (i.e., optic chiasm and optic nerves) occupying only a few slices, and (b) training with inconsistent data annotations with missing ground truth for some anatomical structures.
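As a rough illustration of a loss that combines a Dice term with a focal term, the sketch below works on flat lists of per-voxel probabilities and binary labels for a single structure. The exact formulation, the `focal_weight` and `gamma` values, and the `annotated` flag used to skip structures without ground truth are assumptions made for this example, not details taken from the AnatomyNet paper.

```python
import math

def soft_dice_loss(probs, targets, eps=1e-6):
    """1 minus the soft Dice overlap between predicted probabilities and a binary mask."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)

def focal_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Mean binary focal loss: cross-entropy down-weighted by (1 - p_t) ** gamma."""
    total = 0.0
    for p, t in zip(probs, targets):
        p_t = p if t == 1 else 1.0 - p      # probability assigned to the true class
        p_t = min(max(p_t, eps), 1.0)       # clamp to avoid log(0)
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(probs)

def combined_loss(probs, targets, focal_weight=0.5, annotated=True):
    """Dice loss plus weighted focal loss; contributes nothing if the structure is unannotated.

    `focal_weight` and `annotated` are hypothetical parameters for this sketch.
    """
    if not annotated:
        return 0.0
    return soft_dice_loss(probs, targets) + focal_weight * focal_loss(probs, targets)

# Hypothetical four-voxel example: a confident prediction versus a hesitant one.
mask = [1, 0, 1, 0]
print(combined_loss([0.9, 0.1, 0.9, 0.1], mask))
print(combined_loss([0.6, 0.4, 0.6, 0.4], mask))
```

The Dice term is insensitive to the large number of background voxels, which helps with small structures, while the focal term concentrates the gradient on voxels that are still misclassified.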
Results
We collected 261 HaN CT images to train AnatomyNet and used the MICCAI Head and Neck Auto Segmentation Challenge 2015 as a benchmark dataset to evaluate the performance of AnatomyNet. The objective is to segment nine anatomies: brain stem, chiasm, mandible, optic nerve left, optic nerve right, parotid gland left, parotid gland right, submandibular gland left, and submandibular gland right. Compared to previous state‐of‐the‐art results from the MICCAI 2015 competition, AnatomyNet increases the Dice similarity coefficient by 3.3% on average. AnatomyNet takes about 0.12 s to fully segment a head and neck CT image of dimension 178 × 302 × 225, significantly faster than previous methods. In addition, the model is able to process whole‐volume CT images and delineate all OARs in one pass, requiring little pre‐ or postprocessing.
Conclusion
Deep learning models offer a feasible solution to the problem of delineating OARs from CT images. We demonstrate that our proposed model can improve segmentation accuracy and simplify the autosegmentation pipeline. With this method, it is possible to delineate OARs of a head and neck CT within a fraction of a second.
Abstract
It is well known that strong low-mode internal tides generated in Luzon Strait propagate westward to impinge on the continental slopes in the northeastern South China Sea (SCS). The reflection and scattering of these internal tides, including diurnal and semidiurnal components, on the slopes are quantitatively investigated using two sets of mooring data and a linear internal tide model with realistic topography and stratification. Flux reflections computed from mooring data collected on the continental slopes are consistent with the linear model. Based on the results of the observations and simulations, a map of low-mode internal tide reflection and scattering coefficients along the continental margin in the northeastern SCS is revealed. On average, diurnal internal tides lose 38% of their energy to high modes (≥mode 4) that are assumed to dissipate on the slopes, transmit 28% onto the continental shelf, and reflect 31% back to the deep ocean. In contrast, most of the semidiurnal energy (89%) transmits onto the continental shelf, and only 11% is scattered to high modes (7%) and reflected back to the deep ocean (4%). For diurnal internal tides, a large fraction of energy that is scattered to high modes and reflected back to the deep sea can be attributed to the critical–supercritical slopes, while the weak reflection for the semidiurnal energy is due to the subcritical slopes. These quantitative descriptions for evolutions of low-mode internal tides incident to the slopes provide an energy budget map on the continental slopes in the northeastern SCS.
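The distinction between subcritical and supercritical slopes can be made concrete with the standard ratio of the topographic slope to the slope of internal wave characteristics, sqrt((ω² − f²) / (N² − ω²)). The sketch below evaluates that ratio for a diurnal and a semidiurnal frequency. The latitude, buoyancy frequency, and topographic slope are hypothetical round numbers chosen for illustration, not values from the study, and the energy fractions quoted above come from the mooring data and linear model rather than from this ratio.

```python
import math

OMEGA_EARTH = 7.2921e-5  # Earth's rotation rate in rad/s

def coriolis(lat_deg):
    """Coriolis (inertial) frequency f = 2 * Omega * sin(latitude), in rad/s."""
    return 2.0 * OMEGA_EARTH * math.sin(math.radians(lat_deg))

def wave_slope(omega, f, n):
    """Slope of internal wave characteristics; requires f < omega < n."""
    return math.sqrt((omega ** 2 - f ** 2) / (n ** 2 - omega ** 2))

def criticality(topo_slope, omega, f, n):
    """Ratio > 1: supercritical slope; about 1: critical; < 1: subcritical."""
    return topo_slope / wave_slope(omega, f, n)

def tidal_frequency(period_hours):
    """Angular frequency in rad/s for a tidal period given in hours."""
    return 2.0 * math.pi / (period_hours * 3600.0)

# Hypothetical inputs (assumptions for this sketch only).
f = coriolis(20.0)        # latitude of 20 degrees north
n = 2.0e-3                # buoyancy frequency in rad/s
topo_slope = 0.03         # bottom slope (rise over run)

alpha_diurnal = criticality(topo_slope, tidal_frequency(23.93), f, n)      # K1 period
alpha_semidiurnal = criticality(topo_slope, tidal_frequency(12.42), f, n)  # M2 period
print(alpha_diurnal, alpha_semidiurnal)
```

Because the diurnal frequency is closer to f, its characteristics are flatter, so with these inputs the same bottom slope is supercritical for the diurnal tide and subcritical for the semidiurnal tide.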
Many biological processes are governed by protein-ligand interactions. One such example is the recognition of self and non-self cells by the immune system. This immune response process is regulated by the major histocompatibility complex (MHC) protein which is encoded by the human leukocyte antigen (HLA) complex. Understanding the binding potential between MHC and peptides can lead to the design of more potent, peptide-based vaccines and immunotherapies for infectious autoimmune diseases.
We apply machine learning techniques from the natural language processing (NLP) domain to address the task of MHC-peptide binding prediction. More specifically, we introduce a new distributed representation of amino acids, named HLA-Vec, that can be used for a variety of downstream proteomic machine learning tasks. We then propose a deep convolutional neural network architecture, named HLA-CNN, for the task of HLA class I-peptide binding prediction. Experimental results show that combining the new distributed representation with our HLA-CNN architecture achieves state-of-the-art results in the majority of the latest two Immune Epitope Database (IEDB) weekly automated benchmark datasets. We further apply our model to predict binding on the human genome and identify 15 genes with potential for self binding.
Code to generate HLA-Vec and HLA-CNN is publicly available at: https://github.com/uci-cbcl/HLA-bind.
Contact: xhx@ics.uci.edu
Supplementary data are available at Bioinformatics online.
Background: The recent development of metagenomic sequencing makes it possible to massively sequence microbial genomes including viral genomes without the need for laboratory culture. Existing reference-based and gene homology-based methods are not efficient in identifying unknown viruses or short viral sequences from metagenomic data.
Methods: Here we developed a reference-free and alignment-free machine learning method, DeepVirFinder, for identifying viral sequences in metagenomic data using deep learning.
Results: Trained on sequences from viral RefSeq discovered before May 2015, and evaluated on those discovered after that date, DeepVirFinder outperformed the state-of-the-art method VirFinder at all contig lengths, achieving AUROC 0.93, 0.95, 0.97, and 0.98 for 300, 500, 1000, and 3000 bp sequences, respectively. Enlarging the training data with additional millions of purified viral sequences from metavirome samples further improved the accuracy for identifying virus groups that are under-represented. Applying DeepVirFinder to real human gut metagenomic samples, we identified 51,138 viral sequences belonging to 175 bins in patients with colorectal carcinoma (CRC). Ten bins were found to be associated with the cancer status, suggesting viruses may play important roles in CRC.
Conclusions: Powered by deep learning and high throughput sequencing metagenomic data, DeepVirFinder significantly improved the accuracy of viral identification and will assist the study of viruses in the era of metagenomics.
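The AUROC values reported above measure how often a classifier scores a viral sequence higher than a non-viral one. A minimal sketch of that computation follows, using the pairwise (Mann-Whitney) definition; the scores and labels are made-up illustrative values and the function is not taken from the DeepVirFinder code.

```python
def auroc(scores, labels):
    """Area under the ROC curve as the fraction of (positive, negative) pairs
    in which the positive receives the higher score; ties count as half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical scores for six contigs; label 1 = viral, 0 = non-viral.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(auroc(scores, labels))
```

The pairwise form is quadratic in the number of sequences, so a rank-based implementation would be preferable for large benchmarks; the value it returns is the same.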
• A complete solution for OAR contouring at any site on CT images, with accuracy close to that of human experts.
• A novel architecture capable of recognizing the anatomical site and automatically applying the corresponding OAR segmentation model.
• A web-based online platform open to the public for free research use at: https://irvine.deep-voxel.com/.
• Considerable time savings in OAR delineation and dose accuracy for treatment planning.
Delineating organs at risk (OARs) on computed tomography (CT) images is an essential step in radiation therapy; however, it is notoriously time-consuming and prone to inter-observer variation. Herein, we report a deep learning-based automatic segmentation (AS) algorithm (WBNet) that can accurately and efficiently delineate all major OARs in the entire body directly on CT scans.
We collected 755 CT scans of the head and neck, thorax, abdomen, and pelvis and manually delineated 50 OARs on the CT images. The CT images with contours were split into training and test sets consisting of 505 and 250 cases, respectively, to develop and validate WBNet. The volumetric Dice similarity coefficient (DSC) and 95th-percentile Hausdorff distance (95% HD) were calculated to evaluate delineation quality for each OAR. We compared the performance of WBNet with three AS algorithms: one commercial multi-atlas-based automatic segmentation (ABAS) software, and two deep learning-based AS algorithms, namely, AnatomyNet and nnU-Net. We have also evaluated the time saving and dose accuracy of WBNet.
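The two evaluation metrics named above can be sketched on small sets of voxel coordinates. This is an illustrative version only: it treats every labeled voxel as a point with unit spacing, uses a nearest-rank percentile, and takes the larger of the two directed 95th-percentile distances, all of which are assumptions since published implementations of the 95% HD differ in these details.

```python
import math

def dice(a, b):
    """Volumetric Dice similarity coefficient between two sets of voxel coordinates."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))

def percentile(values, q):
    """Nearest-rank percentile of a non-empty list (q in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]

def directed_distances(src, dst):
    """Distance from each point in src to its nearest point in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]

def hd95(a, b):
    """Symmetric 95th-percentile Hausdorff distance between two non-empty point sets."""
    return max(
        percentile(directed_distances(a, b), 95),
        percentile(directed_distances(b, a), 95),
    )

# Hypothetical masks: the automatic contour is shifted by one voxel along the last axis.
manual = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
automatic = {(0, 0, 1), (0, 1, 1), (0, 0, 2), (0, 1, 2)}
print(dice(manual, automatic))
print(hd95(manual, automatic))
```

Dice rewards volume overlap, while the Hausdorff distance penalizes boundary errors, which is why the two are usually reported together.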
WBNet achieved average DSCs of 0.84 and 0.81 on in-house and public datasets, respectively, which outperformed ABAS, AnatomyNet, and nnU-Net. WBNet could reduce the delineation time significantly and perform well in treatment planning, with clinically acceptable dose differences compared with those in manual delineation.
This study shows the feasibility and benefits of using WBNet in clinical practice.
A central challenge of biology is to map and understand gene regulation on a genome-wide scale. For any given genome, only a small fraction of the regulatory elements embedded in the DNA sequence have been characterized, and there is great interest in developing computational methods to systematically map all these elements and understand their relationships. Such computational efforts, however, are significantly hindered by the overwhelming size of non-coding regions and the statistical variability and complex spatial organizations of regulatory elements and interactions. Genome-wide catalogs of regulatory elements for all model species simply do not yet exist.
The MotifMap system uses databases of transcription factor binding motifs, refined genome alignments, and a comparative genomic statistical approach to provide comprehensive maps of candidate regulatory elements encoded in the genomes of model species. The system is used to derive new genome-wide maps for yeast, fly, worm, mouse, and human. The human map contains 519,108 sites for 570 matrices with a False Discovery Rate of 0.1 or less. The new maps are assessed in several ways, for instance using high-throughput experimental ChIP-seq data and AUC statistics, providing strong evidence for their accuracy and coverage. The maps can be usefully integrated with many other kinds of omic data and are available at http://motifmap.igb.uci.edu/.
MotifMap and its integration with other data provide a foundation for analyzing gene regulation on a genome-wide scale, and for automatically generating regulatory pathways and hypotheses. The power of this approach is demonstrated and discussed using the P53 apoptotic pathway and the Gli hedgehog pathways as examples.
Abstract
Recent mooring observations at a cross-channel section in Chesapeake Bay showed that internal solitary waves regularly appeared during certain phases of a tidal cycle and propagated from the deep channel to the shallow shoal. It was hypothesized that these waves resulted from the nonlinear steepening of internal lee waves generated by lateral currents over channel-shoal topography. In this study, numerical modeling is conducted to investigate the interaction between lateral circulation and cross-channel topography and discern the generation mechanism of the internal lee waves. During ebb tides, lateral bottom Ekman forcing drives a counterclockwise (looking into the estuary) lateral circulation, with strong currents advecting stratified water over the western flank of the deep channel and producing large isopycnal displacements. When the lateral flow becomes supercritical with respect to mode-2 internal waves, a mode-2 internal lee wave is generated on the flank of the deep channel and subsequently propagates onto the western shoal. When the bottom lateral flow becomes near-critical or supercritical with respect to mode-1 internal waves, the lee wave evolves into an internal hydraulic jump. On the shallow shoal, the lee waves or jumps evolve into internal bores of elevation.
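The statement that a flow can be supercritical for mode-2 waves before it is supercritical for mode-1 waves follows from mode 2 being slower. The sketch below uses the textbook idealization of constant buoyancy frequency N in water of depth H without rotation, for which the long-wave speed of mode n is c_n = N H / (n π), and classifies a lateral current by its Froude number U / c_n. The constant-N assumption and all numerical values are hypothetical and are not taken from the study, which relies on a numerical model with realistic stratification.

```python
import math

def mode_speed(n_freq, depth, mode):
    """Long-wave phase speed of vertical mode `mode` for constant N: c = N * H / (mode * pi)."""
    return n_freq * depth / (mode * math.pi)

def froude(u, n_freq, depth, mode):
    """Internal Froude number of a current u relative to the given mode."""
    return abs(u) / mode_speed(n_freq, depth, mode)

def regime(u, n_freq, depth):
    """Classify the flow relative to the first two internal wave modes."""
    if froude(u, n_freq, depth, 1) >= 1.0:
        return "supercritical for mode 1 and mode 2"
    if froude(u, n_freq, depth, 2) >= 1.0:
        return "supercritical for mode 2 only"
    return "subcritical for both modes"

# Hypothetical estuary-like values (assumptions for this sketch only).
N_FREQ = 0.05   # buoyancy frequency in rad/s
DEPTH = 20.0    # water depth in m

for u in (0.10, 0.20, 0.35):   # lateral current speeds in m/s
    print(u, regime(u, N_FREQ, DEPTH))
```

Since c_2 is half of c_1 in this idealization, a strengthening lateral current crosses the mode-2 threshold first, which matches the sequence described above: a mode-2 lee wave forms first, and a hydraulic jump appears only once the flow approaches mode-1 criticality.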