Recent years have seen a steep rise in the number of skin cancer detection applications. While modern advances in deep learning made possible reaching new heights in terms of classification accuracy, ...no publicly available skin cancer detection software provide confidence estimates for these predictions. We present DUNEScan (Deep Uncertainty Estimation for Skin Cancer), a web server that performs an intuitive in-depth analysis of uncertainty in commonly used skin cancer classification models based on convolutional neural networks (CNNs). DUNEScan allows users to upload a skin lesion image, and quickly compares the mean and the variance estimates provided by a number of new and traditional CNN models. Moreover, our web server uses the Grad-CAM and UMAP algorithms to visualize the classification manifold for the user's input, hence providing crucial information about its closeness to skin lesion images from the popular ISIC database. DUNEScan is freely available at: https://www.dunescan.org .
Uncertainty quantification (UQ) methods play a pivotal role in reducing the impact of uncertainties during both optimization and decision making processes. They have been applied to solve a variety ...of real-world problems in science and engineering. Bayesian approximation and ensemble learning techniques are two widely-used types of uncertainty quantification (UQ) methods. In this regard, researchers have proposed different UQ methods and examined their performance in a variety of applications such as computer vision (e.g., self-driving cars and object detection), image processing (e.g., image restoration), medical image analysis (e.g., medical image classification and segmentation), natural language processing (e.g., text classification, social media texts and recidivism risk-scoring), bioinformatics, etc. This study reviews recent advances in UQ methods used in deep learning, investigates the application of these methods in reinforcement learning, and highlights fundamental research challenges and directions associated with UQ.
•We provided an extensive review of uncertainty quantification methods in deep learning.•We covered popular and efficient Bayesian approaches for uncertainty quantification.•We listed notable ensemble techniques for quantifying uncertainty.•We discussed various applications of uncertainty quantification methods.•We summarized major open challenges and research gaps in uncertainty quantification.
T-REX (Tree and reticulogram REConstruction) is a web server dedicated to the reconstruction of phylogenetic trees, reticulation networks and to the inference of horizontal gene transfer (HGT) ...events. T-REX includes several popular bioinformatics applications such as MUSCLE, MAFFT, Neighbor Joining, NINJA, BioNJ, PhyML, RAxML, random phylogenetic tree generator and some well-known sequence-to-distance transformation models. It also comprises fast and effective methods for inferring phylogenetic trees from complete and incomplete distance matrices as well as for reconstructing reticulograms and HGT networks, including the detection and validation of complete and partial gene transfers, inference of consensus HGT scenarios and interactive HGT identification, developed by the authors. The included methods allows for validating and visualizing phylogenetic trees and networks which can be built from distance or sequence data. The web server is available at: www.trex.uqam.ca.
•Novel DGHNL credit scoring prediction model is proposed using German credit data.•Devloped 29-layer model compries of various learners (2 types of SVM, kNN, PNN, Fuzzy system).•New genetic layered ...training is used to optimize the DGHNL system.•Obtained classification accuracy of 94.60% (54 errors / 1000).
Credit scoring (CS) is an effective and crucial approach used for risk management in banks and other financial institutions. It provides appropriate guidance on granting loans and reduces risks in the financial area. Hence, companies and banks are trying to use novel automated solutions to deal with CS challenge to protect their own finances and customers. Nowadays, different machine learning (ML) and data mining (DM) algorithms have been used to improve various aspects of CS prediction. In this paper, we introduce a novel methodology, named Deep Genetic Hierarchical Network of Learners (DGHNL). The proposed methodology comprises different types of learners, including Support Vector Machines (SVM), k-Nearest Neighbors (kNN), Probabilistic Neural Networks (PNN), and fuzzy systems. The Statlog German (1000 instances) credit approval dataset available in the UCI machine learning repository is used to test the effectiveness of our model in the CS domain. Our DGHNL model encompasses five kinds of learners, two kinds of data normalization procedures, two extraction of features methods, three kinds of kernel functions, and three kinds of parameter optimizations. Furthermore, the model applies deep learning, ensemble learning, supervised training, layered learning, genetic selection of features (attributes), genetic optimization of learners parameters, and novel genetic layered training (selection of learners) approaches used along with the cross-validation (CV) training-testing method (stratified 10-fold). The novelty of our approach relies on a proper flow and fusion of information (DGHNL structure and its optimization). We show that the proposed DGHNL model with a 29-layer structure is capable to achieve the prediction accuracy of 94.60% (54 errors per 1000 classifications) for the Statlog German credit approval data. It is the best prediction performance for this well-known credit scoring dataset, compared to the existing work in the field.
Understanding the evolution of Severe Acute Respiratory Syndrome Coronavirus (SARS-CoV-2) and its relationship to other coronaviruses in the wild is crucial for preventing future virus outbreaks. ...While the origin of the SARS-CoV-2 pandemic remains uncertain, mounting evidence suggests the direct involvement of the bat and pangolin coronaviruses in the evolution of the SARS-CoV-2 genome. To unravel the early days of a probable zoonotic spillover event, we analyzed genomic data from various coronavirus strains from both human and wild hosts. Bayesian phylogenetic analysis was performed using multiple datasets, using strict and relaxed clock evolutionary models to estimate the occurrence times of key speciation, gene transfer, and recombination events affecting the evolution of SARS-CoV-2 and its closest relatives. We found strong evidence supporting the presence of temporal structure in datasets containing SARS-CoV-2 variants, enabling us to estimate the time of SARS-CoV-2 zoonotic spillover between August and early October 2019. In contrast, datasets without SARS-CoV-2 variants provided mixed results in terms of temporal structure. However, they allowed us to establish that the presence of a statistically robust clade in the phylogenies of gene S and its receptor-binding (RBD) domain, including two bat (BANAL) and two Guangdong pangolin coronaviruses (CoVs), is due to the horizontal gene transfer of this gene from the bat CoV to the pangolin CoV that occurred in the middle of 2018. Importantly, this clade is closely located to SARS-CoV-2 in both phylogenies. This phylogenetic proximity had been explained by an RBD gene transfer from the Guangdong pangolin CoV to a very recent ancestor of SARS-CoV-2 in some earlier works in the field before the BANAL coronaviruses were discovered. Overall, our study provides valuable insights into the timeline and evolutionary dynamics of the SARS-CoV-2 pandemic.
The SARS-CoV-2 pandemic is one of the greatest global medical and social challenges that have emerged in recent history. Human coronavirus strains discovered during previous SARS outbreaks have ...been hypothesized to pass from bats to humans using intermediate hosts, e.g. civets for SARS-CoV and camels for MERS-CoV. The discovery of an intermediate host of SARS-CoV-2 and the identification of specific mechanism of its emergence in humans are topics of primary evolutionary importance. In this study we investigate the evolutionary patterns of 11 main genes of SARS-CoV-2. Previous studies suggested that the genome of SARS-CoV-2 is highly similar to the horseshoe bat coronavirus RaTG13 for most of the genes and to some Malayan pangolin coronavirus (CoV) strains for the receptor binding (RB) domain of the spike protein.
We provide a detailed list of statistically significant horizontal gene transfer and recombination events (both intergenic and intragenic) inferred for each of 11 main genes of the SARS-CoV-2 genome. Our analysis reveals that two continuous regions of genes S and N of SARS-CoV-2 may result from intragenic recombination between RaTG13 and Guangdong (GD) Pangolin CoVs. Statistically significant gene transfer-recombination events between RaTG13 and GD Pangolin CoV have been identified in region 1215-1425 of gene S and region 534-727 of gene N. Moreover, some statistically significant recombination events between the ancestors of SARS-CoV-2, RaTG13, GD Pangolin CoV and bat CoV ZC45-ZXC21 coronaviruses have been identified in genes ORF1ab, S, ORF3a, ORF7a, ORF8 and N. Furthermore, topology-based clustering of gene trees inferred for 25 CoV organisms revealed a three-way evolution of coronavirus genes, with gene phylogenies of ORF1ab, S and N forming the first cluster, gene phylogenies of ORF3a, E, M, ORF6, ORF7a, ORF7b and ORF8 forming the second cluster, and phylogeny of gene ORF10 forming the third cluster.
The results of our horizontal gene transfer and recombination analysis suggest that SARS-CoV-2 could not only be a chimera virus resulting from recombination of the bat RaTG13 and Guangdong pangolin coronaviruses but also a close relative of the bat CoV ZC45 and ZXC21 strains. They also indicate that a GD pangolin may be an intermediate host of this dangerous virus.
Gene trees carry important information about specific evolutionary patterns which characterize the evolution of the corresponding gene families. However, a reliable species consensus tree cannot be ...inferred from a multiple sequence alignment of a single gene family or from the concatenation of alignments corresponding to gene families having different evolutionary histories. These evolutionary histories can be quite different due to horizontal transfer events or to ancient gene duplications which cause the emergence of paralogs within a genome. Many methods have been proposed to infer a single consensus tree from a collection of gene trees. Still, the application of these tree merging methods can lead to the loss of specific evolutionary patterns which characterize some gene families or some groups of gene families. Thus, the problem of inferring multiple consensus trees from a given set of gene trees becomes relevant.
We describe a new fast method for inferring multiple consensus trees from a given set of phylogenetic trees (i.e. additive trees or X-trees) defined on the same set of species (i.e. objects or taxa). The traditional consensus approach yields a single consensus tree. We use the popular k-medoids partitioning algorithm to divide a given set of trees into several clusters of trees. We propose novel versions of the well-known Silhouette and Caliński-Harabasz cluster validity indices that are adapted for tree clustering with k-medoids. The efficiency of the new method was assessed using both synthetic and real data, such as a well-known phylogenetic dataset consisting of 47 gene trees inferred for 14 archaeal organisms.
The method described here allows inference of multiple consensus trees from a given set of gene trees. It can be used to identify groups of gene trees having similar intragroup and different intergroup evolutionary histories. The main advantage of our method is that it is much faster than the existing tree clustering approaches, while providing similar or better clustering results in most cases. This makes it particularly well suited for the analysis of large genomic and phylogenetic datasets.
The emergence of functional specialization is a core problem in biology. In this work we focus on the emergence of reproductive (germ) and vegetative viability-enhancing (soma) cell functions (or ...germ-soma specialization). We consider a group of cells and assume that they contribute to two different evolutionary tasks, fecundity and viability. The potential of cells to contribute to fitness components is traded off. As embodied in current models, the curvature of the trade-off between fecundity and viability is concave in small-sized organisms and convex in large-sized multicellular organisms. We present a general mathematical model that explores how the division of labor in a cell colony depends on the trade-off curvatures, a resource constraint and different fecundity and viability rates. Moreover, we consider the case of different trade-off functions for different cells. We describe the set of all possible solutions of the formulated mathematical programming problem and show some interesting examples of optimal specialization strategies found for our objective fitness function. Our results suggest that the transition to specialized organisms can be achieved in several ways. The evolution of Volvocalean green algae is considered to illustrate the application of our model. The proposed model can be generalized to address a number of important biological issues, including the evolution of specialized enzymes and the emergence of complex organs.
Coronary artery disease (CAD) is one of the main causes of cardiac death around the world. Due to its significant impact on the society, early and accurate detection of CAD is essential. This study ...proposes a novel nested ensemble nu-Support Vector Classification (NE-nu-SVC) model which combines several traditional machine learning methods and ensemble learning techniques for effective diagnosis of CAD. We validated our model using two well-known CAD datasets (Z-Alizadeh Sani and Cleveland). To improve the performance of the model, we selected clinically significant features from the datasets using a genetic search algorithm. To further improve our results, we applied a multi-level filtering technique to balance the data using the ClassBlancer and Resample methods. Our base algorithm, nu-SVC, is performed using four well-known kernel functions (linear, polynomial, radial basis (RBF) and sigmoid). The proposed NE-nu-SVC model provided the highest accuracy of 94.66% and 98.60% to predict CAD entities in the Z-Alizadeh Sani and Cleveland CAD datasets, respectively. Our system can aid the clinicians to diagnose CAD accurately and may probably replace other invasive diagnostic techniques.
This paper gives an experimentally supported review and comparison of several indices based on the conventional K-means inertia criterion for determining the number of clusters, Formula Omitted, in ...datasets, using the popular Silhouette width index as a benchmark. Our experiments involve a novel version of the Elbow index, defined using values of Formula Omitted two or three steps apart. We also discuss alternative ways of computing the inertia and summarizing its values. Even though there are no overall winners in our experiments, some of our results are very conclusive and can be used as a guide for indices determining the number of clusters in K-means.