Abstract
Motivation
RNA-binding proteins (RBPs) make up 5–10% of the eukaryotic proteome and play key roles in many biological processes, e.g. gene regulation. Experimental detection of RBP binding sites is still time-consuming and costly. Computational prediction of RBP binding sites, using patterns learned from existing annotations, offers a fast alternative. Biologically, specific RBPs recognize the local structural context derived from local sequences. However, to the best of our knowledge, computational models based on deep learning employ only global representations of entire RNA sequences. So far, local sequence information has been ignored when constructing deep models.
Results
In this study, we present iDeepE, a computational method that predicts RNA–protein binding sites from RNA sequences by combining global and local convolutional neural networks (CNNs). For the global CNN, we pad the RNA sequences to the same length. For the local CNN, we split an RNA sequence into multiple overlapping fixed-length subsequences, where each subsequence is a signal channel of the whole sequence. Next, we train deep CNNs on the subsequences and on the padded sequences, respectively, to learn high-level features. Finally, the outputs of the local and global CNNs are combined to improve the prediction. iDeepE outperforms state-of-the-art methods on two large-scale datasets derived from CLIP-seq. We also find that the local CNN runs 1.8 times faster than the global CNN with comparable performance when using GPUs. Our results show that iDeepE captures experimentally verified binding motifs.
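The two input preparations described above (padding for the global CNN, overlapping fixed-length windows for the local CNN) can be sketched as follows. This is a minimal illustration, not the iDeepE implementation: the function names, the pad character "N", and the window and overlap sizes are all assumptions chosen for the example.

```python
def pad_to_length(seq, length, pad_char="N"):
    """Truncate or right-pad `seq` so that it has exactly `length` characters."""
    return seq[:length].ljust(length, pad_char)


def split_overlapping(seq, window, overlap, pad_char="N"):
    """Cut `seq` into fixed-length windows; neighbouring windows share `overlap` bases."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    step = window - overlap
    windows = []
    start = 0
    while True:
        # The last window may run past the end of the sequence, so pad it.
        windows.append(pad_to_length(seq[start:start + window], window, pad_char))
        if start + window >= len(seq):
            break
        start += step
    return windows


# Hypothetical toy sequence and sizes, for illustration only.
rna = "ACGUACGUAC"
global_input = pad_to_length(rna, 12)        # one padded sequence for the global branch
local_inputs = split_overlapping(rna, 4, 2)  # one window per channel for the local branch
print(global_input)
print(local_inputs)
```

Each window would then be encoded and passed to the local network as a separate channel, while the padded sequence goes to the global network.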
Availability and implementation
https://github.com/xypan1232/iDeepE
Supplementary information
Supplementary data are available at Bioinformatics online.
RNA regulation depends heavily on its protein binding partners, known as RNA-binding proteins (RBPs). Unfortunately, the binding preferences of most RBPs are still not well characterized. Interdependencies between sequence and secondary structure specificities make it challenging both to predict RBP binding sites and to detect sequence and structure motifs accurately.
In this study, we propose a deep learning-based method, iDeepS, to simultaneously identify the binding sequence and structure motifs from RNA sequences using convolutional neural networks (CNNs) and a bidirectional long short-term memory network (BLSTM). We first perform one-hot encoding for both the sequence and the predicted secondary structure, to enable subsequent convolution operations. To reveal the hidden binding knowledge in the observed sequences, the CNNs are applied to learn abstract features. Considering the close relationship between sequence and predicted structure, we use the BLSTM to capture possible long-range dependencies between the binding sequence and structure motifs identified by the CNNs. Finally, the learned weighted representations are fed into a classification layer to predict the RBP binding sites. We evaluated iDeepS on verified RBP binding sites derived from large-scale representative CLIP-seq datasets. The results demonstrate that iDeepS can reliably predict RBP binding sites on RNAs and outperforms state-of-the-art methods. An important advantage over other methods is that iDeepS can automatically extract both binding sequence and structure motifs, which will improve our understanding of the mechanisms behind the binding specificities of RBPs.
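The one-hot encoding step can be illustrated with a short sketch. It is not taken from the iDeepS code: the function name, the dot-bracket structure alphabet, and the choice to encode unknown symbols as all-zero rows are assumptions made for this example.

```python
def one_hot(symbols, alphabet):
    """Encode a string as a list of rows, one row per symbol, one column per letter.

    Symbols outside `alphabet` become all-zero rows (an assumption of this sketch).
    """
    index = {ch: i for i, ch in enumerate(alphabet)}
    matrix = []
    for ch in symbols:
        row = [0.0] * len(alphabet)
        if ch in index:
            row[index[ch]] = 1.0
        matrix.append(row)
    return matrix


RNA_ALPHABET = "ACGU"
STRUCT_ALPHABET = "()."  # hypothetical: dot-bracket notation for predicted structure

# Toy hairpin, for illustration only.
seq_matrix = one_hot("GGAAACC", RNA_ALPHABET)
struct_matrix = one_hot("((...))", STRUCT_ALPHABET)
print(len(seq_matrix), len(seq_matrix[0]))        # positions x sequence letters
print(len(struct_matrix), len(struct_matrix[0]))  # positions x structure symbols
```

Both matrices have one row per nucleotide position, so a convolution can slide along the position axis of each.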
Our study shows that the iDeepS method identifies the sequence and structure motifs to accurately predict RBP binding sites. iDeepS is available at https://github.com/xypan1232/iDeepS .
RNA-binding proteins (RBPs) play crucial roles in various biological processes. Deep learning-based methods have proven powerful for predicting RBP binding sites on RNAs. However, training deep learning models is time-consuming and computationally intensive.
Here we present RBPsuite, an easy-to-use deep learning-based webserver for predicting RBP binding sites on linear and circular RNAs. For linear RNAs, RBPsuite predicts RBP binding scores using our updated iDeepS. For circular RNAs (circRNAs), RBPsuite predicts RBP binding scores using our previously developed CRIP. RBPsuite first breaks the input RNA sequence into segments of 101 nucleotides and scores the interaction between the segments and the RBPs. RBPsuite further detects verified motifs on the binding segments and gives the distribution of binding scores along the full-length sequence.
RBPsuite is an easy-to-use online webserver for predicting RBP binding sites and is freely available at http://www.csbio.sjtu.edu.cn/bioinf/RBPsuite/ .
Protein structure alignment algorithms are often time-consuming, which makes large-scale retrieval based on protein structure similarity challenging. There is an urgent need for more efficient structure comparison approaches as the number of protein structures increases rapidly. In this paper, we propose an effective graph-based protein structure representation learning method, GraSR, for fast and accurate structure comparison. In GraSR, a graph is constructed based on the intra-residue distance derived from the tertiary structure. Then, deep graph neural networks (GNNs) with a short-cut connection learn graph representations of the tertiary structures under a contrastive learning framework. To further improve GraSR, a novel dynamic training data partition strategy and a length-scaling cosine distance are introduced. We objectively evaluate GraSR on SCOPe v2.07 and a newly released independent test set from the PDB database with a purpose-designed comprehensive performance metric. Compared with other state-of-the-art methods, GraSR achieves about 7–10% improvement on two benchmark datasets. GraSR is also much faster than alignment-based methods. We dig into the model and observe that the superiority of GraSR mainly comes from the learned discriminative residue-level and global descriptors. The web server and source code of GraSR are freely available at www.csbio.sjtu.edu.cn/bioinf/GraSR/ for academic use.
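As a rough illustration of building a graph from residue distances, the sketch below connects two residues whenever their coordinates lie within a cutoff. It is not the GraSR construction: the use of one 3D point per residue, the 10 Å cutoff, and the function name are assumptions made for this example.

```python
import math


def contact_graph(coords, cutoff=10.0):
    """Return an adjacency list {residue: [(neighbour, distance), ...]}.

    `coords` holds one (x, y, z) point per residue; the cutoff is hypothetical.
    """
    n = len(coords)
    adjacency = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            distance = math.dist(coords[i], coords[j])
            if distance <= cutoff:
                # Undirected edge, stored on both endpoints.
                adjacency[i].append((j, distance))
                adjacency[j].append((i, distance))
    return adjacency


# Toy coordinates in angstroms, for illustration only.
toy_coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (30.0, 0.0, 0.0)]
toy_graph = contact_graph(toy_coords)
for residue, neighbours in toy_graph.items():
    print(residue, [j for j, _ in neighbours])
```

A graph neural network would then take such an adjacency structure, together with per-residue features, as its input.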
Antibody-mediated immune responses play a crucial role in the immune defense of the human body. Advances in bioengineering have driven progress in antibody-derived drugs, which show promising efficacy in cancer and autoimmune disease therapy. A critical step in this development process is obtaining the affinity between antibodies and their binding antigens. In this study, we introduce a novel sequence-based antigen-antibody affinity prediction method, named DG-Affinity. DG-Affinity uses deep neural networks to efficiently and accurately predict the affinity between antibodies and antigens from sequences, without the need for structural information. The sequences of both the antigen and the antibody are first transformed into embedding vectors by two pre-trained language models; these embeddings are then concatenated and fed into a ConvNeXt framework with a regression task. The results demonstrate the superiority of DG-Affinity over existing structure-based prediction methods and sequence-based tools, achieving a Pearson's correlation of over 0.65 on an independent test dataset. Compared to the baseline methods, DG-Affinity achieves the best performance and can advance the development of antibody design. It is freely available as an easy-to-use web server at https://www.digitalgeneai.tech/solution/affinity.
Therapeutic genome editing requires effective and targeted delivery methods. The delivery of Cas9 mRNA using adeno-associated viruses has led to potent in vivo therapeutic efficacy, but can cause sustained Cas9 expression, anti-Cas9 immune responses and off-target edits. Lentiviral vectors have been engineered to deliver nucleases that are expressed transiently, but in vivo evidence of their biomedical efficacy is lacking. Here, we show that the lentiviral codelivery of Streptococcus pyogenes Cas9 mRNA and expression cassettes that encode a guide RNA that targets vascular endothelial growth factor A (Vegfa) is efficacious in a mouse model of wet age-related macular degeneration induced by Vegfa. A single subretinal injection of engineered lentiviruses knocked out 44% of Vegfa in retinal pigment epithelium and reduced the area of choroidal neovascularization by 63% without inducing off-target edits or anti-Cas9 immune responses. Engineered lentiviruses for the transient expression of nucleases may form the basis of new treatments for retinal neovascular diseases.
RNA velocity is closely related to cell fate and is an important indicator for predicting cell states, with an elegant physical interpretation derived from single-cell RNA-seq data. Most existing RNA velocity models aim to extract dynamics from the phase delay between unspliced and spliced mRNA for each individual gene. However, unspliced/spliced mRNA abundance may not provide sufficient signal for dynamic modeling, leading to a poor fit in phase portraits. Motivated by the idea that RNA velocity could be driven by transcriptional regulation, we propose TFvelo, which extends the RNA velocity concept to various single-cell datasets without relying on splicing information, by introducing gene regulatory information. Our experiments on synthetic data and multiple scRNA-Seq datasets show that TFvelo can accurately fit gene dynamics on phase portraits and effectively infer cell pseudo-time and trajectory from RNA abundance data. TFvelo opens a robust and accurate avenue for modeling RNA velocity for single-cell data.
Abstract
Motivation
Genetically engineering food crops involves introducing proteins from other species into crop plant species or modifying already existing proteins with gene editing techniques. In addition, newly synthesized proteins can be used as therapeutic protein drugs against diseases. For both research and safety regulation purposes, being able to assess the potential toxicity of newly introduced/synthesized proteins is of high importance.
Results
In this study, we present ToxDL, a deep learning-based approach for in silico prediction of protein toxicity from sequence alone. ToxDL consists of (i) a module encompassing a convolutional neural network that has been designed to handle variable-length input sequences, (ii) a domain2vec module for generating protein domain embeddings and (iii) an output module that classifies proteins as toxic or non-toxic, using the outputs of the two aforementioned modules. Independent test results obtained for animal proteins and cross-species transferability results obtained for bacterial proteins indicate that ToxDL outperforms traditional homology-based approaches and state-of-the-art machine-learning techniques. Furthermore, through visualizations based on saliency maps, we are able to verify that the proposed network learns known toxic motifs. Moreover, the saliency maps allow for directed in silico modification of a sequence, thus making it possible to alter its predicted protein toxicity.
Availability and implementation
ToxDL is freely available at http://www.csbio.sjtu.edu.cn/bioinf/ToxDL/. The source code can be found at https://github.com/xypan1232/ToxDL.
Supplementary information
Supplementary data are available at Bioinformatics online.
Abstract
Existing drug-target interaction (DTI) prediction methods generally fail to generalize well to novel (unseen) proteins and drugs. In this study, we propose ZeroBind, a protein-specific meta-learning framework with subgraph matching for predicting protein-drug interactions from their structures. During meta-training, ZeroBind treats the training of each protein-specific model as a learning task, and each task uses graph neural networks (GNNs) to learn the protein graph embedding and the molecular graph embedding. Inspired by the fact that molecules bind to a binding pocket in a protein rather than to the whole protein, ZeroBind introduces a weakly supervised subgraph information bottleneck (SIB) module to recognize the maximally informative and compressive subgraphs in protein graphs as potential binding pockets. In addition, ZeroBind trains the models of individual proteins as multiple tasks, whose importance is automatically learned with a task-adaptive self-attention module to make final predictions. The results show that ZeroBind achieves superior performance on DTI prediction over existing methods, especially for unseen proteins and drugs, and performs well after fine-tuning for proteins or drugs with a few known binding partners.
The number of discovered natural miRNA sponges in plants, viruses, and mammals is increasing steadily. Some sponges, like ciRS-7 for miR-7, contain multiple nearby miRNA binding sites. We hypothesize that such clusters of miRNA binding sites on the genome can function together as a sponge. No systematic effort has been made to search for clusters of miRNA targets. Here, to our knowledge, we make the first genome-wide target site predictions for clusters of mature human miRNAs. For each miRNA, we predict the target sites on a genome-wide scale, build a graph with edge weights based on the pairwise distances between sites, and apply Markov clustering to identify genomic regions with high binding site density. Significant clusters are then extracted based on the cluster size difference between real and shuffled genomes that preserve local properties such as the GC content. We then use conservation and binding energy to filter a final set of miRNA target site clusters, or sponge candidates. Our pipeline predicts 3673 sponge candidates for 1250 miRNAs, including the experimentally verified miR-7 sponge ciRS-7. In addition, we point explicitly to 19 high-confidence candidates overlapping annotated genomic sequence. The full list of candidates is freely available at http://rth.dk/resources/mirnasponge, where detailed properties for individual candidates can be explored, such as alignment details, conservation, accessibility and target profiles, which facilitates the selection of sponge candidates for further context-specific analysis.
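The graph-building and Markov clustering steps can be sketched in a few lines. This is a toy illustration under stated assumptions, not the published pipeline: the linear edge weight `1 - gap / max_gap`, the `max_gap` value, the added self-loops, the inflation parameter, and all function names are choices made for this example.

```python
def site_graph(positions, max_gap):
    """Weight matrix in which closer sites get heavier edges (hypothetical weighting)."""
    n = len(positions)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            gap = abs(positions[i] - positions[j])
            if i != j and gap < max_gap:
                weights[i][j] = 1.0 - gap / max_gap
    return weights


def normalize_columns(matrix):
    """Scale each column so that it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def markov_cluster(weights, inflation=2.0, max_iter=100, tol=1e-9):
    """Basic Markov clustering: alternate expansion and inflation until stable."""
    n = len(weights)
    # Self-loops (weight 1.0, an assumption) keep every column non-empty.
    m = normalize_columns(
        [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)])
    for _ in range(max_iter):
        # Expansion: square the matrix (two-step random walks).
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: element-wise power, then renormalize columns.
        inflated = normalize_columns(
            [[value ** inflation for value in row] for row in expanded])
        change = max((abs(inflated[i][j] - m[i][j])
                      for i in range(n) for j in range(n)), default=0.0)
        m = inflated
        if change < tol:
            break
    # Rows with remaining mass are attractors; their non-zero columns form a cluster.
    clusters = []
    for row in m:
        members = [j for j, value in enumerate(row) if value > 1e-6]
        if members and members not in clusters:
            clusters.append(members)
    return clusters


# Hypothetical target-site coordinates on one chromosome, for illustration only.
positions = [100, 130, 160, 5000, 5040]
print(markov_cluster(site_graph(positions, max_gap=200)))
```

Sites that end up in the same cluster mark a candidate region of high binding site density. Comparing cluster sizes against shuffled genomes, as described above, would be a separate step.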