Abstract
Although a wide variety of machine learning (ML) algorithms have been used to learn quantitative structure–activity relationships (QSARs), there is no agreed single best algorithm for QSAR learning. A comprehensive understanding of the performance characteristics of popular ML algorithms used in QSAR learning is therefore highly desirable. In this study, five linear algorithms: linear function Gaussian process regression (linear-GPR), linear function support vector machine (linear-SVM), partial least squares regression (PLSR), multiple linear regression (MLR), and principal component regression (PCR); three analogizers: radial basis function support vector machine (rbf-SVM), K-nearest neighbor (KNN), and radial basis function Gaussian process regression (rbf-GPR); six symbolists: extreme gradient boosting (XGBoost), Cubist, random forest (RF), multivariate adaptive regression splines (MARS), gradient boosting machine (GBM), and classification and regression tree (CART); and two connectionists: principal component analysis artificial neural network (pca-ANN) and deep neural network (DNN) were employed to learn regression-based QSAR models for 14 public data sets comprising nine physicochemical properties and five toxicity endpoints. The results show that rbf-SVM, rbf-GPR, XGBoost, and DNN generally perform better than the other algorithms. The overall performance of the algorithms, ranked from best to worst, is: rbf-SVM > XGBoost > rbf-GPR > Cubist > GBM > DNN > RF > pca-ANN > MARS > linear-GPR ≈ KNN > linear-SVM ≈ PLSR > CART ≈ PCR ≈ MLR. In terms of prediction accuracy and computational efficiency, SVM and XGBoost are recommended for regression learning on small data sets, and XGBoost is an excellent choice for large data sets. We then investigated the performance of ensemble models that integrate the predictions of multiple ML algorithms. The results show that ensembles of two or three algorithms from different categories can indeed improve on the predictions of the best individual ML algorithms.
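As a concrete illustration of the comparison-plus-ensemble workflow, the following is a minimal sketch (not the authors' code) using scikit-learn on a synthetic placeholder data set; the model selection, hyperparameters, and the two-member consensus are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): fit a few of the compared
# regressors and average two of them, from different algorithm
# categories, into a simple consensus ensemble.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# synthetic placeholder for a QSAR data set (descriptors -> endpoint)
X, y = make_regression(n_samples=500, n_features=50, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "rbf-SVM": SVR(kernel="rbf"),                                   # analogizer
    "RF": RandomForestRegressor(n_estimators=200, random_state=0),  # symbolist
    "GBM": GradientBoostingRegressor(random_state=0),               # symbolist
}
preds = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    preds[name] = model.predict(X_te)
    print(f"{name}: R^2 = {r2_score(y_te, preds[name]):.3f}")

# consensus of an analogizer and a symbolist
ensemble = np.mean([preds["rbf-SVM"], preds["RF"]], axis=0)
print(f"ensemble: R^2 = {r2_score(y_te, ensemble):.3f}")
```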
Efficient layout of large-scale graphs remains a challenging problem: force-directed and dimensionality-reduction-based methods suffer from the high overhead of graph-distance and gradient computation. In this paper, we present a new graph layout algorithm, called DRGraph, that enhances the nonlinear dimensionality reduction process with three schemes: approximating graph distances by means of a sparse distance matrix, estimating the gradient with the negative sampling technique, and accelerating the optimization process through a multi-level layout scheme. DRGraph achieves linear complexity in computation and memory consumption and scales to large graphs with millions of nodes. Experimental results and comparisons with state-of-the-art graph layout methods demonstrate that DRGraph generates visually comparable layouts with a faster running time and a lower memory requirement.
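To make the negative-sampling idea concrete, here is a hedged toy sketch (not the DRGraph implementation): attraction acts only along the sparse edge list, while repulsion is estimated from a handful of randomly sampled nodes per edge endpoint, keeping each step's cost linear in the number of edges. The gradient form, learning rate, and sample count are placeholder assumptions.

```python
# Toy sketch of one layout step: attraction along the sparse edge list,
# repulsion estimated from a few negative samples per edge endpoint.
import numpy as np

def layout_step(pos, edges, rng, n_neg=5, lr=0.05):
    grad = np.zeros_like(pos)
    n = pos.shape[0]
    for i, j in edges:
        d = pos[i] - pos[j]
        grad[i] -= d                          # pull edge endpoints together
        grad[j] += d
        for k in rng.integers(0, n, n_neg):   # randomly sampled nodes
            r = pos[i] - pos[k]
            grad[i] += r / ((r * r).sum() + 1e-9)  # push sampled nodes away
    return pos + lr * grad

rng = np.random.default_rng(0)
pos = rng.normal(size=(4, 2))                 # toy 4-cycle graph
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
for _ in range(200):
    pos = layout_step(pos, edges, rng)
```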
Industry benchmarking involves comparing and analyzing a company’s performance against other top-performing enterprises. PDF documents contain valuable corporate information, but their non-editable nature makes data extraction complex. This study focuses on converting unstructured data from PDF documents, including tables, images, and text, into a structured format suitable for analysis and decision-making. The methods currently used for PDF document conversion primarily involve manual extraction, PDF converters, and artificial intelligence algorithms. However, they are often restricted to processing a single modality, have limitations in dealing with complex structured tables, or cannot achieve the required accuracy in practice. Specifically, we convert the periodic reports of listed companies from PDF format into structured data. We propose a unified framework for extracting tables, images, and text by parsing PDF documents into constituent objects. We introduce three bespoke algorithms to process complex structured tables, and we develop a prototype visual analysis system that combines AI-based automated data extraction with the domain knowledge of human experts for auditing. Quantitative and qualitative experiments validate the methodology’s superiority in efficiency, quality, and user-friendliness.
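As a sketch of the parsing step, the open-source pdfplumber library can decompose each page into text, table, and image objects; the choice of library and the file path are assumptions, not the paper's own system.

```python
# Sketch: decompose each page of a PDF into text, tables, and images
# with pdfplumber; "report.pdf" is a placeholder path.
import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text() or ""   # flowing text on the page
        tables = page.extract_tables()     # lists of rows of cell strings
        images = page.images               # image bounding boxes and metadata
        print(f"page {page.page_number}: {len(tables)} table(s), "
              f"{len(images)} image(s), {len(text)} chars of text")
```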
Large language models (LLMs) have gained popularity in various fields for their exceptional capability of generating human-like text. Their potential misuse has raised social concerns about plagiarism in academic contexts. However, effective artificial scientific text detection is a non-trivial task due to several challenges, including (1) the lack of a clear understanding of the differences between machine-generated and human-written scientific text, (2) the poor generalization performance of existing methods caused by out-of-distribution issues, and (3) the limited support for human-machine collaboration with sufficient interpretability during the detection process. In this paper, we first identify the critical distinctions between machine-generated and human-written scientific text through a quantitative experiment. Then, we propose a mixed-initiative workflow that combines human experts’ prior knowledge with machine intelligence, along with a visual analytics system to facilitate efficient and trustworthy scientific text detection. Finally, we demonstrate the effectiveness of our approach through two case studies and a controlled user study. We also provide design implications for interactive artificial text detection tools in high-stakes decision-making scenarios.
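For intuition, a generic baseline detector can be built from surface features and a linear classifier, as in the hedged sketch below; this is a stand-in illustration, not the paper's mixed-initiative workflow, and the training texts are placeholders.

```python
# Generic baseline sketch: character n-gram TF-IDF features plus a
# logistic-regression detector; the training texts are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

human_texts = ["example human-written abstract ..."]    # placeholder corpus
machine_texts = ["example LLM-generated abstract ..."]  # placeholder corpus

detector = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
detector.fit(human_texts + machine_texts,
             [0] * len(human_texts) + [1] * len(machine_texts))

# probability that a new paragraph is machine-generated
print(detector.predict_proba(["a paragraph to score ..."])[:, 1])
```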
In this paper, we consider iterative learning control for trajectory tracking of a robotic manipulator with uncertainty. An improved quadratic-criterion-based iterative learning control (Q-ILC) approach is proposed to obtain better trajectory tracking performance for the robotic manipulator. In addition to the position error information used in existing Q-ILC methods for robotic control, the velocity error information is also taken into consideration, so that a new norm-optimal objective function is constructed. The convergence and error sensitivity properties of the proposed method are also analyzed. To deal with uncertainty, the extended Kalman filter (EKF) and the unscented Kalman filter (UKF) are incorporated to estimate uncertain parameters by constructing extended system states, and the performance of the two filters is compared. Simulations on a 2-DOF robot manipulator demonstrate that the improved Q-ILC with parameter estimators achieves faster convergence and better transient performance than the original Q-ILC in the presence of measurement noise and model uncertainty.
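The flavor of a norm-optimal update can be sketched for a toy discretized double-integrator joint, with a velocity-error term added to the quadratic objective alongside the position error; the plant, weights, and horizon below are placeholder assumptions, not the paper's formulation.

```python
# Toy norm-optimal ILC on a discretized double-integrator joint, with a
# velocity-error term added to the quadratic objective:
#   J = qp*|e_pos|^2 + qv*|e_vel|^2 + r*|du|^2
# Plant, weights, and horizon are placeholder assumptions.
import numpy as np

dt, T = 0.02, 100
t = np.arange(T) * dt
ref = np.sin(np.pi * t)            # desired joint position
ref_d = np.pi * np.cos(np.pi * t)  # desired joint velocity

# lifted response matrices of the double integrator: qd = Gv u, q = Gp u
N = np.tril(np.ones((T, T)), k=-1)
Gv = dt * N
Gp = (dt * N) @ Gv

qp, qv, r = 1.0, 0.1, 1e-4
H = qp * Gp.T @ Gp + qv * Gv.T @ Gv + r * np.eye(T)

u = np.zeros(T)
for trial in range(10):            # ILC trials
    e_p = ref - Gp @ u             # position tracking error
    e_v = ref_d - Gv @ u           # velocity tracking error
    u += np.linalg.solve(H, qp * Gp.T @ e_p + qv * Gv.T @ e_v)
print("final RMS position error:", np.sqrt(np.mean((ref - Gp @ u) ** 2)))
```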
Renal biopsy is the gold standard for determining the pathologic type of primary nephrotic syndrome, which is critical for diagnosis, choice of treatment, and evaluation of prognosis. However, in some cases, renal biopsy cannot be performed.
To explore the possibility of predicting the histology type of primary nephrotic syndrome without the need for biopsy, we trained and validated a machine learning algorithm using data from 222 patients with biopsy-confirmed primary nephrotic syndrome treated at our hospital between May 2008 and January 2016. The model was then tested prospectively on another sample of 63 patients with biopsy-confirmed primary nephrotic syndrome.
The overall prediction accuracy on the retrospective set of 222 patients was 62.2% across all types of nephrotic syndrome. The accuracy of model prediction on the prospectively collected dataset of 63 patients was 61.9%. The algorithm identified 17 of 33 variables as contributing strongly to the type of renal pathology.
To our knowledge, this is the first such application of machine learning to predict the pathologic type of primary nephrotic syndrome, which may be clinically useful by itself as well as helpful for guiding future efforts at machine learning-based prediction in other disease contexts.
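A minimal sketch of the described train-retrospectively, test-prospectively workflow is shown below; the random-forest model, the synthetic placeholder data, and the variable indices are assumptions, since the abstract does not specify the algorithm.

```python
# Sketch of the described workflow on placeholder data: fit on the
# retrospective cohort, test on the prospective cohort, rank variables.
# The random-forest choice is an assumption; the abstract names no model.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X_retro, y_retro = rng.normal(size=(222, 33)), rng.integers(0, 4, 222)
X_pro, y_pro = rng.normal(size=(63, 33)), rng.integers(0, 4, 63)

clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(X_retro, y_retro)
print("prospective accuracy:", accuracy_score(y_pro, clf.predict(X_pro)))

# the 17 variables the model leans on most heavily
top17 = np.argsort(clf.feature_importances_)[::-1][:17]
print("strongest variables (indices):", top17)
```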
AvrPphB is an avirulence (Avr) protein from the plant pathogen Pseudomonas syringae that can trigger a disease-resistance response in a number of host plants including Arabidopsis. AvrPphB belongs to a novel family of cysteine proteases whose charter member is the Yersinia effector protein YopT. AvrPphB has a very stringent substrate specificity, catalyzing a single proteolytic cleavage in the Arabidopsis serine/threonine kinase PBS1. We have determined the crystal structure of AvrPphB by X-ray crystallography at 1.35-Å resolution. The structure is composed of a central antiparallel β-sheet, with α-helices packing on both sides of the sheet to form a two-lobe structure. The core of this structure resembles the papain-like cysteine proteases. The similarity includes the AvrPphB active-site catalytic triad of Cys-98, His-212, and Asp-227 and the oxyanion-hole residue Asn-93. Based on analogy with inhibitor complexes of the papain-like proteases, we propose a model for the substrate-binding mechanism of AvrPphB. A deep and positively charged pocket (S2) and a neighboring shallow surface (S3) likely bind to aspartic acid and glycine residues in the substrate located two (P2) and three (P3) residues N-terminal to the cleavage site, respectively. Further implications for the specificity of plant pathogen recognition are also discussed.
Urban data is massive, heterogeneous, and spatio-temporal, posing a substantial challenge for visualization and analysis. In this paper, we design and implement a novel visual analytics approach, Visual Analyzer for Urban Data (VAUD), that supports the visualization, querying, and exploration of urban data. Our approach allows for cross-domain correlation across multiple data sources by leveraging spatio-temporal and social inter-connectedness features. Through our approach, the analyst is able to select, filter, and aggregate across multiple data sources and extract information that would remain hidden in any single data subset. To illustrate the effectiveness of our approach, we provide case studies on a real urban dataset that contains the cyber, physical, and social information of 14 million citizens over 22 days.
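The select-filter-aggregate pattern across sources can be sketched in a few lines of pandas; the tables, columns, and join key below are invented placeholders, not the VAUD schema.

```python
# Sketch of a cross-domain select-filter-aggregate query in pandas;
# tables, columns, and the join key are invented placeholders.
import pandas as pd

trips = pd.DataFrame({                          # physical movement records
    "cid": [1, 1, 2],
    "ts": pd.to_datetime(["2024-05-01 08:00", "2024-05-01 18:00",
                          "2024-05-01 09:00"]),
    "district": ["A", "B", "A"],
})
posts = pd.DataFrame({"cid": [1, 2],            # social media records
                      "topic": ["traffic", "food"]})

morning = trips[trips["ts"].dt.hour < 12]       # temporal filter
joined = morning.merge(posts, on="cid")         # link physical and social data
print(joined.groupby(["district", "topic"]).size())  # cross-domain aggregate
```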