We consider stochastic programs where the distribution of the uncertain parameters is only observable through a finite training dataset. Using the Wasserstein metric, we construct a ball in the space of (multivariate and non-discrete) probability distributions centered at the uniform distribution on the training samples, and we seek decisions that perform best in view of the worst-case distribution within this Wasserstein ball. The state-of-the-art methods for solving the resulting distributionally robust optimization problems rely on global optimization techniques, which quickly become computationally excruciating. In this paper we demonstrate that, under mild assumptions, the distributionally robust optimization problems over Wasserstein balls can in fact be reformulated as finite convex programs—in many interesting cases even as tractable linear programs. Leveraging recent measure concentration results, we also show that their solutions enjoy powerful finite-sample performance guarantees. Our theoretical results are exemplified in mean-risk portfolio optimization as well as uncertainty quantification.
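For concreteness, the problem class can be sketched as follows (notation chosen here for illustration rather than taken verbatim from the paper). Given training samples $\hat{\xi}_1,\dots,\hat{\xi}_N$, let $\hat{\mathbb{P}}_N = \frac{1}{N}\sum_{i=1}^{N}\delta_{\hat{\xi}_i}$ denote their uniform empirical distribution. The distributionally robust program then reads

$$
\inf_{x\in\mathbb{X}}\ \sup_{\mathbb{Q}\in\mathbb{B}_{\varepsilon}(\hat{\mathbb{P}}_N)} \mathbb{E}^{\mathbb{Q}}\!\left[h(x,\xi)\right],
\qquad
\mathbb{B}_{\varepsilon}(\hat{\mathbb{P}}_N)=\left\{\mathbb{Q}\,:\,W\big(\mathbb{Q},\hat{\mathbb{P}}_N\big)\le\varepsilon\right\},
$$

where the Wasserstein distance is $W(\mathbb{Q}_1,\mathbb{Q}_2)=\inf_{\Pi}\int \lVert \xi_1-\xi_2\rVert\,\Pi(\mathrm{d}\xi_1,\mathrm{d}\xi_2)$, the infimum being taken over all couplings $\Pi$ with marginals $\mathbb{Q}_1$ and $\mathbb{Q}_2$. The reformulation results replace the inner supremum with a finite-dimensional convex program.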
Distributionally robust optimization is a paradigm for decision making under uncertainty where the uncertain problem data are governed by a probability distribution that is itself subject to uncertainty. The distribution is then assumed to belong to an ambiguity set comprising all distributions that are compatible with the decision maker’s prior information. In this paper, we propose a unifying framework for modeling and solving distributionally robust optimization problems. We introduce standardized ambiguity sets that contain all distributions with prescribed conic representable confidence sets and with mean values residing on an affine manifold. These ambiguity sets are highly expressive and encompass many ambiguity sets from the recent literature as special cases. They also allow us to characterize distributional families in terms of several classical and/or robust statistical indicators that have not yet been studied in the context of robust optimization. We determine conditions under which distributionally robust optimization problems based on our standardized ambiguity sets are computationally tractable. We also provide tractable conservative approximations for problems that violate these conditions.
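A schematic form of such a standardized ambiguity set, with symbols chosen here for illustration, is

$$
\mathcal{P}=\left\{\mathbb{Q}\,:\,\mathbb{E}^{\mathbb{Q}}[\tilde{\xi}]\in\{\xi : A\xi=b\},\quad \mathbb{Q}\big[\tilde{\xi}\in\mathcal{C}_i\big]\in[\underline{p}_i,\overline{p}_i]\ \ \forall i\right\},
$$

where the mean constraint places the expectation on an affine manifold and each confidence set $\mathcal{C}_i$ is conic representable, i.e., expressible through conic inequalities, possibly with auxiliary variables.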
Linear stochastic programming provides a flexible toolbox for analyzing real-life decision situations, but it can become computationally cumbersome when recourse decisions are involved. The latter are usually modeled as decision rules, i.e., functions of the uncertain problem data. It has recently been argued that stochastic programs can quite generally be made tractable by restricting the space of decision rules to those that exhibit a linear data dependence. In this paper, we propose an efficient method to estimate the approximation error introduced by this rather drastic means of complexity reduction: we apply the linear decision rule restriction not only to the primal but also to a dual version of the stochastic program. By employing techniques that are commonly used in modern robust optimization, we show that both arising approximate problems are equivalent to tractable linear or semidefinite programs of moderate sizes. The gap between their optimal values estimates the loss of optimality incurred by the linear decision rule approximation. Our method remains applicable if the stochastic program has random recourse and multiple decision stages. It also extends to cases involving ambiguous probability distributions.
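The bounding scheme can be sketched as follows (notation illustrative). For a two-stage program $\min_{x}\, c^{\top}x + \mathbb{E}[Q(x,\xi)]$ with recourse cost $Q(x,\xi)=\min_{y}\{q^{\top}y : T(\xi)x + Wy \ge h(\xi)\}$, restricting the recourse to linear decision rules $y(\xi)=Y\xi$ yields an upper bound $\overline{z}$ on the true optimal value, while imposing the same restriction on the dual multipliers yields a lower bound $\underline{z}$. Both bounds are obtained from tractable linear or semidefinite programs, and the gap $\overline{z}-\underline{z}$ certifies the loss of optimality incurred by the linear decision rule approximation.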
Many drug discovery projects fail because the underlying target is ultimately found to be undruggable. Progress in structure elucidation of proteins now opens up a route to automatic structure-based target assessment. DoGSiteScorer, a newly developed automatic tool combining pocket prediction, characterization and druggability estimation, is now available through a web server.
The DoGSiteScorer web server is freely available for academic use at http://dogsite.zbh.uni-hamburg.de. Contact: rarey@zbh.uni-hamburg.de.
Predicting druggability and prioritizing disease-modifying targets for the drug development process is of high practical relevance in pharmaceutical research. DoGSiteScorer is a fully automatic algorithm for pocket and druggability prediction. In addition to global properties of the pocket, it also reflects local similarities shared between pockets. Druggability scores are predicted by means of a support vector machine (SVM) trained and tested on the druggability data set (DD) and its nonredundant version (NRDD). The DD consists of 1069 targets assigned to druggable, difficult, and undruggable classes. In 90% of the NRDD, the SVM model based on global descriptors correctly classifies a target as either druggable or undruggable. Nevertheless, global properties suffer from binding site changes due to ligand binding and from the pocket boundary definition. Therefore, local pocket properties are additionally investigated by means of a nearest neighbor search. Local similarities are described by distance-dependent histograms between atom pairs. In 88% of the DD pocket set, the nearest neighbor and the structure itself agree in their druggability type. A discriminating feature between druggable and undruggable pockets is having fewer short-range hydrophilic–hydrophilic pairs and more short-range lipophilic–lipophilic pairs. Our findings for global pocket descriptors coincide with previously published methods, affirming that size, shape, and hydrophobicity are important global pocket descriptors for automatic druggability prediction. Nevertheless, the variety of pocket shapes and their flexibility upon ligand binding limit the automatic projection of druggable features onto descriptors. Incorporating local pocket properties is another step toward a reliable descriptor-based druggability prediction.
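A minimal sketch of the global-descriptor SVM step, assuming illustrative descriptors (volume, depth, hydrophobic ratio) and random stand-in data; the actual DoGSiteScorer descriptors, kernel choice, and training set are not reproduced here.

```python
# Sketch: SVM druggability classification on global pocket descriptors.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical descriptor matrix: one row per pocket,
# columns = [volume, depth, hydrophobic ratio] (stand-in values).
X = rng.normal(size=(1069, 3))
# Stand-in labels: 1 = druggable, 0 = undruggable.
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=1069) > 0).astype(int)

# Scale descriptors, then fit an RBF-kernel SVM; report cross-validated accuracy.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
print(cross_val_score(model, X, y, cv=5).mean())
```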
Portfolio optimization problems involving value at risk (VaR) are often computationally intractable and require complete information about the return distribution of the portfolio constituents, which is rarely available in practice. These difficulties are compounded when the portfolio contains derivatives. We develop two tractable conservative approximations for the VaR of a derivative portfolio by evaluating the worst-case VaR over all return distributions of the derivative underliers with given first- and second-order moments. The derivative returns are modelled as convex piecewise linear functions or, via a delta-gamma approximation, as (possibly nonconvex) quadratic functions of the returns of the derivative underliers. These models lead to new worst-case polyhedral VaR (WPVaR) and worst-case quadratic VaR (WQVaR) approximations, respectively. WPVaR serves as a VaR approximation for portfolios containing long positions in European options expiring at the end of the investment horizon, whereas WQVaR is suitable for portfolios containing long and/or short positions in European and/or exotic options expiring beyond the investment horizon. We prove that, unlike VaR, which may discourage diversification, WPVaR and WQVaR are in fact coherent risk measures. We also reveal connections to robust portfolio optimization.
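The two loss models can be sketched as follows (signs and notation chosen here for illustration). With underlier returns $\xi$ of given mean $\mu$ and covariance $\Sigma$, the piecewise linear model expresses the portfolio loss as $L(\xi)=\max_{k}\{a_k^{\top}\xi+b_k\}$, whereas the delta-gamma model uses the quadratic $L(\xi)\approx -\big(\delta^{\top}\xi+\tfrac{1}{2}\xi^{\top}\Gamma\,\xi\big)$. The worst-case VaR at level $\epsilon$ is then

$$
\sup_{\mathbb{Q}\in\mathcal{P}(\mu,\Sigma)} \mathrm{VaR}_{\epsilon}^{\mathbb{Q}}\big(L(\xi)\big),
$$

where $\mathcal{P}(\mu,\Sigma)$ collects all distributions sharing the given first- and second-order moments; evaluating it for the polyhedral and quadratic loss models yields WPVaR and WQVaR, respectively.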
A new method has been developed to detect functional relationships among proteins independent of a given sequence or fold homology. It is based on the idea that protein function is intimately related to the recognition and subsequent response to the binding of a substrate or an endogenous ligand in a well-characterized binding pocket. Thus, recognition of similar ligands, supposedly linked to similar function, requires conserved recognition features exposed in terms of common physicochemical interaction properties via the functional groups of the residues flanking a particular binding cavity. Following a technique commonly used in the comparison of small-molecule ligands, generic pseudocenters coding for possible interaction properties were assigned for a large sample set of cavities extracted from the entire PDB and stored in the database Cavbase. Given a particular query cavity, a series of related cavities of decreasing similarity is detected by means of a clique detection algorithm. The detected similarity is ranked according to property-based surface patches shared by the different clique solutions. The approach either retrieves protein cavities accommodating the same (e.g., co-factors) or closely related ligands, or it extracts proteins exhibiting similar function in terms of a related catalytic mechanism. Finally, the new method has strong potential to suggest alternative molecular skeletons in de novo design: the retrieval of molecular building blocks accommodated in a particular sub-pocket that shares similarity with the pocket in a protein studied by drug design can inspire the discovery of novel ligands.
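A minimal sketch of clique-based cavity comparison in the spirit described above: pseudocenters of two cavities are paired in a correspondence graph whose cliques are geometrically consistent matches. The data layout, distance tolerance, and scoring here are assumptions for illustration; Cavbase's actual pseudocenter assignment and surface-patch ranking are not shown.

```python
# Sketch: maximal-clique matching of typed pseudocenters between two cavities.
from itertools import combinations, product
import numpy as np
import networkx as nx

def correspondence_graph(cav_a, cav_b, tol=1.0):
    """Nodes are pairs (i, j) of pseudocenters with matching interaction types;
    edges connect pairs whose intra-cavity distances agree within tol (angstrom)."""
    g = nx.Graph()
    nodes = [(i, j)
             for (i, a), (j, b) in product(enumerate(cav_a), enumerate(cav_b))
             if a["type"] == b["type"]]
    g.add_nodes_from(nodes)
    for (i, j), (k, l) in combinations(nodes, 2):
        if i == k or j == l:          # a pseudocenter may be matched only once
            continue
        d_a = np.linalg.norm(cav_a[i]["xyz"] - cav_a[k]["xyz"])
        d_b = np.linalg.norm(cav_b[j]["xyz"] - cav_b[l]["xyz"])
        if abs(d_a - d_b) <= tol:
            g.add_edge((i, j), (k, l))
    return g

# Two tiny mock cavities; the largest clique is the largest consistent match.
cav_a = [{"type": "donor", "xyz": np.array([0.0, 0.0, 0.0])},
         {"type": "acceptor", "xyz": np.array([3.0, 0.0, 0.0])}]
cav_b = [{"type": "donor", "xyz": np.array([1.0, 1.0, 0.0])},
         {"type": "acceptor", "xyz": np.array([4.0, 1.0, 0.0])}]
best = max(nx.find_cliques(correspondence_graph(cav_a, cav_b)), key=len)
print(best)
```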
Decision tree ensembles are among the most robust, high-performing and computationally efficient machine learning approaches for quantitative structure–activity relationship (QSAR) modeling. Among them, gradient boosting has recently garnered particular attention for its performance in data science competitions, virtual screening campaigns, and bioactivity prediction. However, different variants of gradient boosting exist, the most popular being XGBoost, LightGBM and CatBoost. Our study provides the first comprehensive comparison of these approaches for QSAR. To this end, we trained 157,590 gradient boosting models, which were evaluated on 16 datasets and 94 endpoints comprising 1.4 million compounds in total. Our results show that XGBoost generally achieves the best predictive performance, while LightGBM requires the least training time, especially for larger datasets. In terms of feature importance, the models surprisingly rank molecular features differently, reflecting differences in their regularization techniques and decision tree structures. Thus, expert knowledge must always be employed when evaluating data-driven explanations of bioactivity. Furthermore, our results show that the relevance of each hyperparameter varies greatly across datasets and that it is crucial to optimize as many hyperparameters as possible to maximize predictive performance. In conclusion, our study provides the first set of guidelines for cheminformatics practitioners to effectively train, optimize and evaluate gradient boosting models for virtual screening and QSAR applications.
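As a minimal sketch of such a head-to-head comparison, the following trains XGBoost and LightGBM regressors on stand-in descriptor data (CatBoost follows the same pattern); the datasets, endpoints, and hyperparameter grids of the study are not reproduced.

```python
# Sketch: comparing gradient boosting variants on a QSAR-style regression task.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 128))   # stand-in for molecular descriptors/fingerprints
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=2000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Illustrative hyperparameters only; in practice these should be tuned per dataset.
for model in (XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=6),
              LGBMRegressor(n_estimators=500, learning_rate=0.05, num_leaves=63)):
    model.fit(X_tr, y_tr)
    print(type(model).__name__, round(r2_score(y_te, model.predict(X_te)), 3))
```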
While the number of available bioassay datasets has increased dramatically in recent years, many of them suffer from an extremely imbalanced distribution between active and inactive compounds. Thus, there is an urgent need for novel approaches to tackle class imbalance in drug discovery. Inspired by recent advances in computer vision, we investigated a panel of alternative loss functions for imbalanced classification in the context of Gradient Boosting and benchmarked them on six datasets from public and proprietary sources, for a total of 42 tasks and 2 million compounds. Our findings show that with these modifications we achieve statistically significant improvements over the conventional cross-entropy loss function on five out of six datasets. Furthermore, by employing these bespoke loss functions we are able to push Gradient Boosting to match or outperform a wide variety of previously reported classifiers and neural networks. We also investigate the impact of changing the loss function on training time and find that convergence can be up to eight times faster. As such, these results show that tuning the loss function for Gradient Boosting is a straightforward and computationally efficient way to achieve state-of-the-art performance on imbalanced bioassay datasets without compromising interpretability or scalability.
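A minimal sketch of plugging a computer-vision-style loss into Gradient Boosting as a custom objective. The focal loss of Lin et al. is used as an illustrative example; whether it matches the exact panel of losses in the study is an assumption. Gradients and Hessians are obtained by central finite differences to keep the example short (an analytic form would be faster), and the callable-objective signature follows the recent LightGBM scikit-learn API.

```python
# Sketch: focal loss as a custom LightGBM objective for imbalanced bioassays.
import numpy as np
from lightgbm import LGBMClassifier

def focal_loss(z, y, gamma=2.0):
    """Per-sample focal loss as a function of the raw score z and label y in {0,1}."""
    p = 1.0 / (1.0 + np.exp(-z))
    pt = y * p + (1 - y) * (1 - p)                 # probability of the true class
    return -((1 - pt) ** gamma) * np.log(np.clip(pt, 1e-12, 1.0))

def focal_objective(y_true, z):
    # Central finite differences for gradient and Hessian (illustrative shortcut).
    eps = 1e-4
    grad = (focal_loss(z + eps, y_true) - focal_loss(z - eps, y_true)) / (2 * eps)
    hess = (focal_loss(z + eps, y_true) - 2 * focal_loss(z, y_true)
            + focal_loss(z - eps, y_true)) / eps ** 2
    return grad, np.maximum(hess, 1e-6)            # keep the Hessian positive

# Imbalanced toy data: roughly 5-10% actives.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 32))
y = (X[:, 0] + rng.normal(scale=2.0, size=5000) > 3.0).astype(int)

clf = LGBMClassifier(objective=focal_objective, n_estimators=200)
clf.fit(X, y)  # note: with a custom objective, predictions are raw scores,
               # so a sigmoid must be applied to obtain probabilities
```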