•PREvaIL is a new method for inferring catalytic residues based on a comprehensive set of features.•Benchmarking experiments showed that PREvaIL achieved competitive performance.•It was able to ...capture useful signals to improve catalytic residue prediction.•PREvaIL can facilitate characterization and functional annotation of proteins.
Determining the catalytic residues in an enzyme is critical to our understanding the relationship between protein sequence, structure, function, and enhancing our ability to design novel enzymes and their inhibitors. Although many enzymes have been sequenced, and their primary and tertiary structures determined, experimental methods for enzyme functional characterization lag behind. Because experimental methods used for identifying catalytic residues are resource- and labor-intensive, computational approaches have considerable value and are highly desirable for their ability to complement experimental studies in identifying catalytic residues and helping to bridge the sequence–structure–function gap. In this study, we describe a new computational method called PREvaIL for predicting enzyme catalytic residues. This method was developed by leveraging a comprehensive set of informative features extracted from multiple levels, including sequence, structure, and residue-contact network, in a random forest machine-learning framework. Extensive benchmarking experiments on eight different datasets based on 10-fold cross-validation and independent tests, as well as side-by-side performance comparisons with seven modern sequence- and structure-based methods, showed that PREvaIL achieved competitive predictive performance, with an area under the receiver operating characteristic curve and area under the precision-recall curve ranging from 0.896 to 0.973 and from 0.294 to 0.523, respectively. We demonstrated that this method was able to capture useful signals arising from different levels, leveraging such differential but useful types of features and allowing us to significantly improve the performance of catalytic residue prediction. We believe that this new method can be utilized as a valuable tool for both understanding the complex sequence–structure–function relationships of proteins and facilitating the characterization of novel enzymes lacking functional annotations.
The flowchart of PREvaIL for inferring catalytic residues based on the integration of sequence, structural, and residue-contact-network features using the random forest machine-learning framework. Display omitted
Full text
Available for:
GEOZS, IJS, IMTLJ, KILJ, KISLJ, NLZOH, NUK, OILJ, PNG, SAZU, SBCE, SBJE, UL, UM, UPCLJ, UPUK, ZRSKP
Conventional general-purpose distance-based similarity measures, such as Minkowski distance (also known as
ℓ
p
-norm with
p
>
0
), are data-independent and sensitive to units or scales of ...measurement. There are existing general-purpose data-dependent measures, such as rank difference, Lin’s probabilistic measure and
m
p
-dissimilarity (
p
>
0
), which are not sensitive to units or scales of measurement. Although they have been shown to be more effective than the traditional distance measures, their characteristics and relative performances have not been investigated. In this paper, we study the characteristics and relationships of different general-purpose data-dependent measures. We generalise
m
p
-dissimilarity where
p
≥
0
by introducing
m
0
-dissimilarity and show that it is a generic data-dependent measure with data-dependent self-similarity, of which rank difference and Lin’s measure are special cases with data-independent self-similarity. We evaluate the effectiveness of a wide range of general-purpose data-dependent and data-independent measures in the content-based information retrieval and
k
NN classification tasks. Our findings show that the fully data-dependent measure of
m
p
-dissimilarity is a more effective alternative to other data-dependent and commonly-used distance-based similarity measures as its task-specific performance is more consistent across a wide range of datasets.
Purpose
Post-injury health service utilization (HSU) contributes to injury outcomes, but limited studies investigated their relationship. This study aims to group injured patients in transport ...accidents based on minimal historical information of their HSU so that the groups are meaningfully associated with the outcome of interest.
Methods
The data include 20,692 injured patients who had compensation claims over 3 years. We propose a hybrid approach, combining unsupervised and supervised machine learning methods. Based on the first week post-injury data, we identify a proper clustering of patients best associated with total cost to recovery, as well as the discovery of HSU patterns. This allows developing models to accurately predict the outcome of interest using the discovered patterns. Furthermore, we propose to use decision tree classifiers to accurately classify future patients into the discovered clusters using their first week post-injury information.
Results
Our hybrid approach has identified eight patient groups. The compactness of the resulted clusters, assessed by Average Silhouette Width metric, is 0.71 indicating well-defined clusters. The resulted patient groups are highly predictive of injury outcomes. They improve the cost predictability more than twice in comparison with predictors such as gender, age and injury type. These groups also have substantial association with patients’ recovery. The transparency and interpretability of decision trees allow integrating the resulting classification rules conveniently in operational processes.
Conclusions
This study provides a framework to discover knowledge and useful insights for health service providers and policy makers to control injury outcomes, and consequently to reduce the severity of transport accidents.
In this paper, we conduct the first study on spurious correlations for open-domain response generation models based on a corpus
curated by ourselves. The current models indeed suffer from spurious ...correlations and have a tendency to generate irrelevant and generic responses. Inspired by causal discovery algorithms, we propose a novel model-agnostic method for training and inference using a conditional independence classifier. The classifier is trained by a constrained self-training method, coined
, to overcome data sparsity. The experimental results based on both human and automatic evaluation show that our method significantly outperforms the competitive baselines in terms of relevance, informativeness, and fluency.
Abstract
The roles of proteolytic cleavage have been intensively investigated and discussed during the past two decades. This irreversible chemical process has been frequently reported to influence a ...number of crucial biological processes (BPs), such as cell cycle, protein regulation and inflammation. A number of advanced studies have been published aiming at deciphering the mechanisms of proteolytic cleavage. Given its significance and the large number of functionally enriched substrates targeted by specific proteases, many computational approaches have been established for accurate prediction of protease-specific substrates and their cleavage sites. Consequently, there is an urgent need to systematically assess the state-of-the-art computational approaches for protease-specific cleavage site prediction to further advance the existing methodologies and to improve the prediction performance. With this goal in mind, in this article, we carefully evaluated a total of 19 computational methods (including 8 scoring function-based methods and 11 machine learning-based methods) in terms of their underlying algorithm, calculated features, performance evaluation and software usability. Then, extensive independent tests were performed to assess the robustness and scalability of the reviewed methods using our carefully prepared independent test data sets with 3641 cleavage sites (specific to 10 proteases). The comparative experimental results demonstrate that PROSPERous is the most accurate generic method for predicting eight protease-specific cleavage sites, while GPS-CCD and LabCaS outperformed other predictors for calpain-specific cleavage sites. Based on our review, we then outlined some potential ways to improve the prediction performance and ease the computational burden by applying ensemble learning, deep learning, positive unlabeled learning and parallel and distributed computing techniques. We anticipate that our study will serve as a practical and useful guide for interested readers to further advance next-generation bioinformatics tools for protease-specific cleavage site prediction.
Full text
Available for:
DOBA, IZUM, KILJ, NUK, PILJ, PNG, SAZU, SIK, UILJ, UKNU, UL, UM, UPUK
Data depth is a statistical method which models data distribution in terms of center-outward ranking rather than density or linear ranking. While there are a lot of academic interests, its ...applications are hampered by the lack of a method which is both robust and efficient. This paper introduces
Half-Space Mass
which is a significantly improved version of half-space data depth.
Half-Space Mass
is the only data depth method which is both robust and efficient, as far as we know. We also reveal four theoretical properties of
Half-Space Mass
: (i) its resultant mass distribution is concave regardless of the underlying density distribution, (ii) its maximum point is unique which can be considered as median, (iii) the median is maximally robust, and (iv) its estimation extends to a higher dimensional space in which the convex hull of the dataset occupies zero volume. We demonstrate the power of
Half-Space Mass
through its applications in two tasks. In anomaly detection, being a maximally robust location estimator leads directly to a robust anomaly detector that yields a better detection accuracy than half-space depth; and it runs orders of magnitude faster than
L
2
depth, an existing maximally robust location estimator. In clustering, the
Half-Space Mass
version of K-means overcomes three weaknesses of K-means.
Full text
Available for:
EMUNI, FIS, FZAB, GEOZS, GIS, IJS, IMTLJ, KILJ, KISLJ, MFDPS, NLZOH, NUK, OILJ, PNG, SAZU, SBCE, SBJE, SBMB, SBNM, UKNU, UL, UM, UPUK, VKSCE, ZAGLJ
Simultaneous interrogation of tumor genomes and transcriptomes is underway in unprecedented global efforts. Yet, despite the essential need to separate driver mutations modulating gene expression ...networks from transcriptionally inert passenger mutations, robust computational methods to ascertain the impact of individual mutations on transcriptional networks are underdeveloped. We introduce a novel computational framework, DriverNet, to identify likely driver mutations by virtue of their effect on mRNA expression networks. Application to four cancer datasets reveals the prevalence of rare candidate driver mutations associated with disrupted transcriptional networks and a simultaneous modulation of oncogenic and metabolic networks, induced by copy number co-modification of adjacent oncogenic and metabolic drivers. DriverNet is available on Bioconductor or at http://compbio.bccrc.ca/software/drivernet/ .
One challenge in applying bioinformatic tools to clinical or biological data is high number of features that might be provided to the learning algorithm without any prior knowledge on which ones ...should be used. In such applications, the number of features can drastically exceed the number of training instances which is often limited by the number of available samples for the study. The Lasso is one of many regularization methods that have been developed to prevent overfitting and improve prediction performance in high-dimensional settings. In this paper, we propose a novel algorithm for feature selection based on the Lasso and our hypothesis is that defining a scoring scheme that measures the "quality" of each feature can provide a more robust feature selection method. Our approach is to generate several samples from the training data by bootstrapping, determine the best relevance-ordering of the features for each sample, and finally combine these relevance-orderings to select highly relevant features. In addition to the theoretical analysis of our feature scoring scheme, we provided empirical evaluations on six real datasets from different fields to confirm the superiority of our method in exploratory data analysis and prediction performance. For example, we applied FeaLect, our feature scoring algorithm, to a lymphoma dataset, and according to a human expert, our method led to selecting more meaningful features than those commonly used in the clinics. This case study built a basis for discovering interesting new criteria for lymphoma diagnosis. Furthermore, to facilitate the use of our algorithm in other applications, the source code that implements our algorithm was released as FeaLect, a documented R package in CRAN.
Full text
Available for:
DOBA, IZUM, KILJ, NUK, PILJ, PNG, SAZU, SIK, UILJ, UKNU, UL, UM, UPUK
Machine translation (MT) is an important task in natural language processing (NLP), as it automates the translation process and reduces the reliance on human translators. With the resurgence of ...neural networks, the translation quality surpasses that of the translations obtained using statistical techniques for most language-pairs. Up until a few years ago, almost all of the neural translation models translated sentences
independently
, without incorporating the wider
document-context
and inter-dependencies among the sentences.
The aim of this survey article is to highlight the major works that have been undertaken in the space of document-level machine translation after the neural revolution, so researchers can recognize the current state and future directions of this field.
We provide an organization of the literature based on novelties in modelling and architectures as well as training and decoding strategies. In addition, we cover evaluation strategies that have been introduced to account for the improvements in document MT, including automatic metrics and discourse-targeted test sets. We conclude by presenting possible avenues for future exploration in this research field.
Full text
Available for:
IZUM, KILJ, NUK, PILJ, SAZU, UL, UM, UPUK
Clinical audit of invasive mold disease (IMD) in hematology patients is inefficient due to the difficulties of case finding. This results in antifungal stewardship (AFS) programs preferentially ...reporting drug cost and consumption rather than measures that actually reflect quality of care. We used machine learning-based natural language processing (NLP) to non-selectively screen chest tomography (CT) reports for pulmonary IMD, verified by clinical review against international definitions and benchmarked against key AFS measures. NLP screened 3014 reports from 1 September 2008 to 31 December 2017, generating 784 positives that after review, identified 205 IMD episodes (44% probable-proven) in 185 patients from 50,303 admissions. Breakthrough-probable/proven-IMD on antifungal prophylaxis accounted for 60% of episodes with serum monitoring of voriconazole or posaconazole in the 2 weeks prior performed in only 53% and 69% of episodes, respectively. Fiberoptic bronchoscopy within 2 days of CT scan occurred in only 54% of episodes. The average turnaround of send-away bronchoalveolar galactomannan of 12 days (range 7-22) was associated with high empiric liposomal amphotericin consumption. A random audit of 10% negative reports revealed two clinically significant misses (0.9%, 2/223). This is the first successful use of applied machine learning for institutional IMD surveillance across an entire hematology population describing process and outcome measures relevant to AFS. Compared to current methods of clinical audit, semi-automated surveillance using NLP is more efficient and inclusive by avoiding restrictions based on any underlying hematologic condition, and has the added advantage of being potentially scalable.