Abstract
While the number of human miRNA candidates continuously increases, only a few of them are completely characterized and experimentally validated. Toward determining the total number of true miRNAs, we employed a combined in silico high- and experimental low-throughput validation strategy. We collected 28 866 human small RNA sequencing data sets containing 363.7 billion sequencing reads and excluded falsely annotated and low-quality data. Our high-throughput analysis identified 65% of 24 127 mature miRNA candidates as likely false positives. Using northern blotting, we experimentally validated miRBase entries and novel miRNA candidates. By exogenous overexpression of 108 precursors that encode 205 mature miRNAs, we confirmed 68.5% of the miRBase entries, with the confirmation rate going up to 94.4% for the high-confidence entries, and 18.3% of the novel miRNA candidates. Analyzing endogenous miRNAs, we verified the expression of 8 miRNAs in 12 different human cell lines. In total, we extrapolated 2300 true human mature miRNAs, 1115 of which are currently annotated in miRBase V22. The experimentally validated miRNAs will contribute to revising targetomes hypothesized by utilizing falsely annotated miRNAs.
Abstract
We present GeneTrail 3, a major extension of our web service GeneTrail that offers rich functionality for the identification, analysis, and visualization of deregulated biological processes. Our web service provides a comprehensive collection of biological processes and signaling pathways for 12 model organisms that can be analyzed with a powerful framework for enrichment and network analysis of transcriptomic, miRNomic, proteomic, and genomic data sets. Moreover, GeneTrail offers novel workflows for the analysis of epigenetic marks, time series experiments, and single cell data. We demonstrate the capabilities of our web service in two case studies, which highlight that GeneTrail is well equipped for uncovering complex molecular mechanisms. GeneTrail is freely accessible at: http://genetrail.bioinf.uni-sb.de.
Abstract
Machine learning methods trained on cancer cell line panels are intensively studied for the prediction of optimal anti-cancer therapies. While classification approaches distinguish effective from ineffective drugs, regression approaches aim to quantify the degree of drug effectiveness. However, the high specificity of most anti-cancer drugs induces a skewed distribution of drug response values in favor of the more drug-resistant cell lines, negatively affecting the classification performance (class imbalance) and regression performance (regression imbalance) for the sensitive cell lines. Here, we present a novel approach called SimultAneoUs Regression and classificatiON Random Forests (SAURON-RF) based on the idea of performing a joint regression and classification analysis. We demonstrate that SAURON-RF improves the classification and regression performance for the sensitive cell lines at the expense of a moderate loss for the resistant ones. Furthermore, our results show that simultaneous classification and regression can be superior to regression or classification alone.
We present a Lamarckian genetic algorithm (LGA) variant for flexible ligand-receptor docking that can handle a large number of degrees of freedom. Our hybrid method combines a multi-deme LGA with a recently published gradient-based method for local optimization of molecular complexes. We compared the performance of our new hybrid method to two non-gradient-based search heuristics on the Astex diverse set for flexible ligand-receptor docking. Our results show that the novel approach is clearly superior to other LGAs employing a stochastic optimization method. The new algorithm features a shorter run time and gives substantially better results, especially with increasing complexity of the ligands. Thus, it may be used to dock ligands with many rotatable bonds with high efficiency.
Abstract
Which genes, gene sets or pathways are regulated by certain miRNAs? Which miRNAs regulate a particular target gene or target pathway in a certain physiological context? Answering such common research questions can be time consuming and labor intensive. Especially for researchers without computational experience, the integration of different data sources, selection of the right parameters and concise visualization can be demanding. A comprehensive analysis is central to providing adequate answers to complex biological questions. With miRTargetLink 2.0, we develop an all-in-one solution for human, mouse and rat miRNA networks. In the unidirectional search mode, users enter either a single gene, a gene set or a gene pathway or, alternatively, a single miRNA, a set of miRNAs or an miRNA pathway. Moreover, genes and miRNAs can jointly be provided to the tool in the bidirectional search mode. For the selected entities, interaction graphs are generated from different data sources and dynamically presented. Connected application programming interfaces (APIs) to the tailored enrichment tools miEAA and GeneTrail facilitate downstream analysis of pathways and context-annotated categories of network nodes. MiRTargetLink 2.0 is freely accessible at https://www.ccb.uni-saarland.de/mirtargetlink2.
Graphical Abstract
MiRTargetLink 2.0 offers interactive, web-based functionality to dissect networks of miRNAs and their target genes and pathways in three commonly investigated species.
Abstract
Since the initial release of miRPathDB, tremendous progress has been made in the field of microRNA (miRNA) research. New miRNA reference databases have emerged, a vast number of new miRNA candidates have been discovered and the number of experimentally validated target genes has increased considerably. Hence, the demand for a major upgrade of miRPathDB, including extended analysis functionality and intuitive visualizations of query results, has emerged. Here, we present the novel release 2.0 of the miRNA Pathway Dictionary Database (miRPathDB) that is freely accessible at https://mpd.bioinf.uni-sb.de/. miRPathDB 2.0 comes with a ten-fold increase of pre-processed data. In total, the updated database provides putative associations between 27 452 (candidate) miRNAs, 28 352 targets and 16 833 pathways for Homo sapiens, as well as interactions of 1978 miRNAs, 24 898 targets and 6511 functional categories for Mus musculus. Additionally, we analyzed publications citing miRPathDB to identify common use-cases and further extensions. Based on this evaluation, we added new functionality for interactive visualizations and down-stream analyses of bulk queries. In summary, the updated version of miRPathDB, with its new custom-tailored features, is one of the most comprehensive and advanced resources for miRNAs and their target pathways.
Phylogenomics with paralogs Hellmuth, Marc; Wieseke, Nicolas; Lechner, Marcus ...
Proceedings of the National Academy of Sciences - PNAS, 02/2015, Volume: 112, Issue: 7
Journal Article, Peer reviewed, Open access
Phylogenomics heavily relies on well-curated sequence data sets that comprise, for each gene, exclusively 1:1 orthologs. Paralogs are treated as a dangerous nuisance that has to be detected and removed. We show here that this severe restriction of the data sets is not necessary. Building upon recent advances in mathematical phylogenetics, we demonstrate that gene duplications convey meaningful phylogenetic information and allow the inference of plausible phylogenetic trees, provided orthologs and paralogs can be distinguished with a degree of certainty. Starting from tree-free estimates of orthology, cograph editing can sufficiently reduce the noise to find correct event-annotated gene trees. The information of gene trees can then directly be translated into constraints on the species trees. Although the resolution is very poor for individual gene families, we show that genome-wide data sets are sufficient to generate fully resolved phylogenetic trees, even in the presence of horizontal gene transfer.
Significance
We demonstrate that the distribution of paralogs in large gene families contains in itself sufficient phylogenetic signal to infer fully resolved species phylogenies. This source of phylogenetic information is independent of information contained in orthologous sequences and is resilient against horizontal gene transfer. An important consequence is that phylogenomics data sets need not be restricted to 1:1 orthologs.
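The abstract above refers to cograph editing of an estimated orthology relation. Cographs are exactly the graphs that contain no induced path on four vertices (P4), so a noisy orthology estimate needs editing precisely when such a path exists. The following is a minimal brute-force sketch of that check, not the authors' method; the function name `find_induced_p4` and the toy gene labels are hypothetical.

```python
from itertools import permutations


def find_induced_p4(vertices, edges):
    """Return one induced path a-b-c-d if the graph has one, else None.

    A graph is a cograph exactly when this function returns None.
    Brute force over ordered 4-tuples, so only suitable for small graphs.
    """
    adjacency = {v: set() for v in vertices}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    for a, b, c, d in permutations(vertices, 4):
        is_path = b in adjacency[a] and c in adjacency[b] and d in adjacency[c]
        # "Induced" means no extra edges among the four vertices.
        no_chords = (c not in adjacency[a] and d not in adjacency[a]
                     and d not in adjacency[b])
        if is_path and no_chords:
            return (a, b, c, d)
    return None


# Toy orthology estimate (hypothetical genes): a path g1-g2-g3-g4 is not a cograph.
genes = ["g1", "g2", "g3", "g4"]
noisy = [("g1", "g2"), ("g2", "g3"), ("g3", "g4")]
print(find_induced_p4(genes, noisy))
```

In this toy case, adding the single edge (g1, g4) removes every induced P4, which is the kind of edit a cograph-editing procedure would search for.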
In many research disciplines, hypothesis tests are applied to evaluate whether findings are statistically significant or could be explained by chance. The Wilcoxon-Mann-Whitney (WMW) test is among the most popular hypothesis tests in medicine and life science to analyze if two groups of samples are equally distributed. This nonparametric statistical homogeneity test is commonly applied in molecular diagnosis. In general, solving the WMW test exactly requires a high combinatorial effort for large sample cohorts containing a significant number of ties. Hence, the P value is frequently approximated by a normal distribution. We developed EDISON-WMW, a new approach to calculate the exact permutation of the two-tailed unpaired WMW test without any corrections required and allowing for ties. The method relies on dynamic programming to solve the combinatorial problem of the WMW test efficiently. Beyond a straightforward implementation of the algorithm, we presented different optimization strategies and developed a parallel solution. Using our program, the exact P value for large cohorts containing more than 1000 samples with ties can be calculated within minutes. We demonstrate the performance of this novel approach on randomly generated data, benchmark it against 13 other commonly applied approaches and moreover evaluate molecular biomarkers for lung carcinoma and chronic obstructive pulmonary disease (COPD). We found that approximated P values were generally higher than the exact solution provided by EDISON-WMW. Importantly, the algorithm can also be applied to high-throughput omics datasets, where hundreds or thousands of features are included. To provide easy access to the multi-threaded version of EDISON-WMW, a web-based solution of our algorithm is freely available at http://www.ccb.uni-saarland.de/software/wtest/.
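The idea of computing an exact two-sided WMW P value with ties by dynamic programming can be illustrated with a small sketch. This is not the EDISON-WMW implementation and includes none of its optimizations or parallelization; it is a plain subset-counting recursion over doubled midranks, and the function names are assumptions made for this example.

```python
from fractions import Fraction


def doubled_midranks(values):
    """Return 2 * midrank for each value, so that tied ranks stay integers."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    doubled = [0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # 1-based ranks start+1 .. end+1 are tied; twice their mean is their sum.
        for pos in range(start, end + 1):
            doubled[order[pos]] = (start + 1) + (end + 1)
        start = end + 1
    return doubled


def exact_wmw_p_value(x, y):
    """Exact two-sided permutation P value of the rank-sum statistic, ties allowed."""
    pooled = list(x) + list(y)
    ranks = doubled_midranks(pooled)
    n1, n = len(x), len(pooled)
    observed = sum(ranks[:n1])
    # counts[k][s]: number of ways to pick k pooled samples with doubled rank sum s.
    counts = [dict() for _ in range(n1 + 1)]
    counts[0][0] = 1
    for r in ranks:
        # Descending k guarantees that each sample is used at most once.
        for k in range(n1 - 1, -1, -1):
            for s, c in counts[k].items():
                counts[k + 1][s + r] = counts[k + 1].get(s + r, 0) + c
    center = n1 * (n + 1)  # expected doubled rank sum under the null hypothesis
    limit = abs(observed - center)
    total = sum(counts[n1].values())
    extreme = sum(c for s, c in counts[n1].items() if abs(s - center) >= limit)
    return Fraction(extreme, total)


print(exact_wmw_p_value([1, 2, 3], [4, 5, 6]))  # 1/10
print(exact_wmw_p_value([1, 2], [2, 3]))        # 2/3
```

Returning a `Fraction` keeps the result exact for these small inputs; a practical implementation for cohorts of more than 1000 samples would need the kind of optimizations the abstract mentions.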
Abstract
Motivation
A major goal of personalized medicine in oncology is the optimization of treatment strategies given measurements of the genetic and molecular profiles of cancer cells. To further our knowledge on drug sensitivity, machine learning techniques are commonly applied to cancer cell line panels.
Results
We present a novel integer linear programming formulation, called MEthod for Rule Identification with multi-omics DAta (MERIDA), for predicting the drug sensitivity of cancer cells. The method represents a modified version of the LOBICO method and yields easily interpretable models amenable to a Boolean logic-based interpretation. Since the proposed altered logical rules lead to an enormous acceleration of the running times of MERIDA compared to LOBICO, we can not only consider larger input feature sets integrated from genetic and molecular omics data but also build more comprehensive models that mirror the complexity of cancer initiation and progression. Moreover, we enable the inclusion of a priori knowledge that can either stem from biomarker databases or be newly acquired knowledge gathered iteratively by previous runs of MERIDA. Our results show that this approach not only leads to an improved predictive performance but also identifies a variety of putative sensitivity and resistance biomarkers. We also compare our approach to state-of-the-art machine learning methods and demonstrate the superior performance of our method. Hence, MERIDA has great potential to deepen our understanding of the molecular mechanisms causing drug sensitivity or resistance.
Availability and implementation
The corresponding code is available on GitHub (https://github.com/unisb-bioinf/MERIDA.git).
Supplementary information
Supplementary data are available at Bioinformatics online.
The application of machine learning (ML) to solve real-world problems does not only bear great potential but also high risk. One fundamental challenge in risk mitigation is to ensure the reliability of the ML predictions, i.e., the model error should be minimized, and the prediction uncertainty should be estimated. Especially for medical applications, the importance of reliable predictions cannot be overstated. Here, we address this challenge for anti-cancer drug sensitivity prediction and prioritization. To this end, we present a novel drug sensitivity prediction and prioritization approach guaranteeing user-specified certainty levels. The developed conformal prediction approach is applicable to classification, regression, and simultaneous regression and classification. Additionally, we propose a novel drug sensitivity measure that is based on clinically relevant drug concentrations and enables a straightforward prioritization of drugs for a given cancer sample.
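How conformal prediction can attach a user-specified certainty level to a regression output may be sketched with the generic split conformal procedure. This is a textbook variant and not necessarily the approach developed in the work above; the function names and the toy numbers are assumptions made for illustration.

```python
import math


def conformal_quantile(residuals, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration residual.

    If the calibration set is too small for the requested level, the
    quantile is infinite and the interval becomes uninformative.
    """
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(residuals)[rank - 1]


def prediction_interval(point_prediction, calibration_truth, calibration_pred, alpha):
    """Symmetric interval around a point prediction with target coverage 1 - alpha."""
    residuals = [abs(t - p) for t, p in zip(calibration_truth, calibration_pred)]
    q = conformal_quantile(residuals, alpha)
    return point_prediction - q, point_prediction + q


# Toy calibration data (hypothetical drug response values and model predictions).
truth = [1.0, 2.0, 3.0]
predicted = [0.0, 0.0, 0.0]
print(prediction_interval(10.0, truth, predicted, alpha=0.5))  # (8.0, 12.0)
```

Under the usual exchangeability assumption between calibration and test samples, such an interval contains the true value with probability at least 1 - alpha, which is the sense in which a certainty level can be specified by the user.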