Regulatory documents play a significant role in securing engineering project quality, standardised process management and long-term sustainable development. With the digitisation of knowledge in the AEC industry, the demand for automated knowledge mining has emerged in the face of the substantial volume of regulations. However, current interpretation approaches for regulatory documents remain largely labour-intensive and perform poorly on complex knowledge. Based on transfer learning (BERT) and natural language processing techniques (e.g., syntactic parsing), this paper proposes a fully automated knowledge mining framework to convert complex knowledge in textual regulations into graph-based knowledge representations. The framework uses a BERT-based engine, fine-tuned on a self-developed domain dataset, to extract clauses from regulation documents. A constituent extractor is developed to process provisions containing complex knowledge and extract their constituents. A knowledge modelling engine integrates the extracted constituents into a graph-based regulation knowledge model, which can be queried, visualised, and directly applied in downstream applications. The outcome demonstrates promising performance in complex knowledge mining and knowledge graph modelling in an ISO 19650 case study. This research can effectively convert textual regulation documents into a counterpart regulatory knowledge base, contributing to automated knowledge acquisition and multi-domain knowledge fusion toward the digitalisation of regulations.
•Proposes an autonomous transformation framework to convert regulatory knowledge into a graph-based knowledge model.•Proposes an efficient approach to establishing domain datasets via data augmentation and the Delphi method.•A fine-tuned BERT model provides the domain knowledge needed to identify clauses in regulation documents.•Regulations with complex logic and multiple relations can be parsed by a linguistics- and NLP-enhanced extraction engine.•The proposed framework achieves 74% accuracy against an ontology developed by domain experts in practical scenarios.
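As an illustration of the clause-identification step, the following minimal sketch fine-tunes a generic BERT checkpoint to label regulation sentences as clauses or non-clauses. The checkpoint name, the two example sentences, the label scheme and the hyperparameters are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch: fine-tuning BERT to label regulation sentences as
# clauses vs. non-clauses, assuming a small labelled domain dataset.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # 0 = non-clause, 1 = clause

# Hypothetical labelled examples from a regulation document.
texts = ["The appointed party shall establish an information plan.",
         "Figure 3 - overview of the delivery phases."]
labels = torch.tensor([1, 0])

enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few fine-tuning passes over the toy batch
    out = model(**enc, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```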
In this work, we address the task of feature ranking for multi-target regression (MTR). The task of MTR concerns problems with multiple continuous dependent/target variables, where the goal is to learn a model for predicting all of them simultaneously. This task is receiving increasing attention from the research community, but feature ranking in the context of MTR has not been studied thus far. Here, we study two groups of feature ranking scores for MTR: scores (Symbolic, Genie3 and Random Forest score) based on ensembles (bagging, random forests, extra trees) of predictive clustering trees, and a score derived as an extension of the RReliefF method. We also propose a generic data-transformation approach to MTR feature ranking and thus obtain two versions of each score. For both groups of feature ranking scores, we analyze their theoretical computational complexity. For the extension of the RReliefF method, we additionally derive some theoretical properties of the scores. Next, we extensively evaluate the scores on 24 benchmark MTR datasets, in terms of the quality of the rankings and the computational cost of producing them. The results identify the parameters that influence the quality of the rankings, reveal that both groups of methods produce relevant feature rankings, and show that the Symbolic and Genie3 scores, coupled with random forest ensembles, yield the best rankings.
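To make the ensemble-based scores concrete, here is a hedged sketch of MTR feature ranking using scikit-learn's multi-output random forest, whose impurity-based importances play the role of a Genie3/Random-Forest-style score. The paper's actual scores are computed from ensembles of predictive clustering trees, for which this is only a rough stand-in, and the dataset is synthetic.

```python
# Illustrative sketch of ensemble-based feature ranking for multi-target
# regression: one forest is fit on all targets at once and the
# impurity-based importances give the ranking.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
# Two targets, both driven mainly by features 0 and 3.
Y = np.stack([X[:, 0] + 0.5 * X[:, 3], X[:, 3] - X[:, 0]], axis=1)
Y += 0.1 * rng.normal(size=Y.shape)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, Y)
ranking = np.argsort(forest.feature_importances_)[::-1]
print("features ranked by importance:", ranking)
```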
Abstract Background Synthetic Cannabinoid Receptor Agonists (SCRA), also known as “K2” or “Spice,” have drawn considerable attention due to their potential for abuse and harmful consequences. More research is needed to understand user experiences of SCRA-related effects. We use semi-automated information processing techniques through the eDrugTrends platform to examine SCRA-related effects and their variations through a longitudinal content analysis of web-forum data. Method English-language posts from three drug-focused web forums were extracted and analyzed between January 1st, 2008 and September 30th, 2015. Search terms were based on the Drug Abuse Ontology (DAO) created for this study (189 SCRA-related and 501 effect-related terms). eDrugTrends NLP-based text processing tools were used to extract posts mentioning SCRA and their effects. Generalized linear regression was used to fit restricted cubic spline functions of time to test whether the proportion of drug-related posts that mention SCRA (and no other drug) and the proportion of these “SCRA-only” posts that mention SCRA effects have changed over time, with an adjustment for multiple testing. Results 19,052 SCRA-related posts (Bluelight (n = 2782), Forum A (n = 3882), and Forum B (n = 12,388)) posted by 2543 international users were extracted. The most frequently mentioned effects were “getting high” (44.0%), “hallucinations” (10.8%), and “anxiety” (10.2%). The frequency of SCRA-only posts declined steadily over the study period. The proportions of SCRA-only posts mentioning positive effects (e.g., “high” and “euphoria”) steadily decreased, while the proportions of SCRA-only posts mentioning negative effects (e.g., “anxiety,” “nausea,” “overdose”) increased over the same period. Conclusion This study’s findings indicate that the proportion of negative effects mentioned in web-forum posts and linked to SCRA has increased over time, suggesting that recent generations of SCRA generate more harms. This is also one of the first studies to conduct automated content analysis of web-forum data related to illicit drug use.
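The trend test described above can be sketched as follows: a binomial GLM is fit to monthly counts of SCRA-only posts, with time entering through a restricted (natural) cubic spline basis. The data frame, its column names, the simulated counts and the four-degree-of-freedom spline are assumptions made purely for illustration.

```python
# Sketch: binomial GLM with a restricted cubic spline of time, fit to
# simulated monthly counts of SCRA-only posts.
import numpy as np
import pandas as pd
import patsy
import statsmodels.api as sm

rng = np.random.default_rng(1)
months = np.arange(93)  # Jan 2008 .. Sep 2015, one row per month
df = pd.DataFrame({"month": months,
                   "scra_only": rng.binomial(200, 0.30 - 0.002 * months),
                   "total": 200})
df["other"] = df["total"] - df["scra_only"]

# patsy's cr() builds a natural (restricted) cubic regression spline basis.
design = patsy.dmatrix("cr(month, df=4)", df)
model = sm.GLM(df[["scra_only", "other"]], design,
               family=sm.families.Binomial()).fit()
print(model.summary())
```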
Online reviews are integral to consumer decision-making when purchasing products on an e-commerce platform. Extant literature has conclusively established the effects of various review- and reviewer-related predictors on perceived helpfulness. However, background research is limited in addressing the following problem: how can readers interpret the topical summary of many helpful reviews that cover multiple themes and vary in depth of focus? To fill this gap, we drew upon Shannon's Entropy Theory and Dual Process Theory to propose a set of predictors, derived using NLP and text mining, to examine helpfulness. We created four predictors (review depth, review divergence, semantic entropy and keyword relevance) to build our primary empirical models. We also report interesting findings from the interaction effects of the reviewer's credibility, the age of the review, and review divergence. We further validated the robustness of our results across different product categories and higher thresholds of helpfulness votes. Our study contributes to the electronic commerce literature with relevant managerial and theoretical implications through these findings.
•Apply correlated topic modelling and Shannon's Entropy Theory to build predictors of online review helpfulness.•Propose four predictors: review depth, review divergence, semantic entropy and keyword relevance.•Significant interaction effects of reviewer's credibility, age of review, and review divergence on the main effects.•Check the robustness of the main results across different product categories and higher thresholds of helpfulness votes.
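Of the four predictors, semantic entropy is the most self-contained to illustrate: the sketch below fits a topic model to a handful of invented reviews and takes the Shannon entropy of each review's topic mixture. Plain LDA stands in for the correlated topic model used in the study.

```python
# Sketch of a "semantic entropy" predictor: Shannon entropy of a
# review's topic distribution (higher = more topical spread).
import numpy as np
from scipy.stats import entropy
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "battery life is great but the screen scratches easily",
    "fast shipping, sturdy packaging, seller was responsive",
    "camera quality is superb and the battery lasts all day",
]
counts = CountVectorizer().fit_transform(reviews)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
theta = lda.fit_transform(counts)          # per-review topic mixtures
semantic_entropy = entropy(theta, axis=1)  # one score per review
print(np.round(semantic_entropy, 3))
```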
The state of the art in automated processing of unstructured business documents has evolved from manual labor to advanced AI systems in the span of mere decades. Such systems rely on learning techniques, rule or clause sets, or neural models, used alone or in combination, for the extraction to work. As an example, rule-based processes operate on a perceived layout or positioning of the information, whereas model-based frameworks adopt a semantic, and often uninspectable, approach. Verb-Based Semantic Role Labeling (VBSRL) is a novel system, presented in a previous paper, that uses a hybrid foundation to inform the extraction phase via a set of rules modeling natural language. We propose a new VBSRL-based document processing method, aided by valuable and innovative architectural choices, which has been implemented for the Italian language and evaluated with promising results. Even at this early stage, the first implementation of the system outperforms comparable IE solutions, obtaining an aggregate average F-measure of nearly 79%.
•Business document analysis is crucial yet time-consuming for enterprises.•Classification and information extraction from unstructured documents are hard tasks.•A document processing method built on pre-processing, normalization and post-processing.•Information extraction via Conceptual Dependency Theory plus Semantic Role Labeling.•Performance in a real-world scenario is better than that of comparable IE solutions.
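Purely as a rough stand-in for the verb-centred extraction idea (not the actual VBSRL rule set), the sketch below uses spaCy's dependency parse to collect, for each verb in an Italian sentence, its subject and object arguments; the it_core_news_sm model is assumed to be installed, and the invoice sentence is invented.

```python
# Rough stand-in for verb-based role extraction: gather each verb's
# subject and object from a Universal Dependencies parse.
import spacy

nlp = spacy.load("it_core_news_sm")  # assumed installed
doc = nlp("La società ha inviato la fattura al cliente il 3 marzo.")

for token in doc:
    if token.pos_ == "VERB":
        frame = {"verb": token.lemma_}
        for child in token.children:
            if child.dep_ in ("nsubj", "nsubj:pass"):
                frame["agent"] = child.text
            elif child.dep_ == "obj":
                frame["theme"] = child.text
        print(frame)
```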
Evaluations are administered to measure students’ learning outcomes, which has become challenging for instructors as student numbers grow rapidly. Several models proposed in the literature are based on selected artificial intelligence algorithms that are trained once and then deployed. The problem with such systems is that the trained models are locked and cannot adjust to dynamically changing circumstances, leading to a drop in performance. Moreover, these systems consider only basic parameters for computing semantic similarity, resulting in lower accuracy. This paper develops an intelligent student evaluation (ISE) model based on a predictive optimization approach, which considers question type, structure, necessary keywords, language, and conceptual aspects to evaluate a student’s answer. To enhance the performance of the proposed evaluation system, we propose a predictive optimization approach in which a deep neural network serves as the learning module that learns from training data, while particle swarm optimization and gradient descent optimize the weighting parameters of the deep neural network. The proposed work uses and analyzes a real dataset of NTNU students’ exams to validate the platform’s practicability. We employ deep-learning-based natural language processing, in which a semantic similarity score and other features are used to compute the degree of relevance between reference answers and students’ provided answers. The proposed semantic similarity score algorithm is based on the WordNet library and the Growbag dataset to assess the solution’s semantics, conceptual aspects, and creativity. The resulting score is used as a feature in a supervised machine-learning classification system. The performance of the classification model is assessed using standard evaluation measures, including precision, recall, and F-measure. The end goal of the platform is to produce a grade for a student’s answer given as input to the developed platform.
•Develop an ISE model for open-ended questions based on optimal learning.•Develop a semantic similarity module using WordNet and the Growbag dataset.•Minimize the feature space based on a correlation index to reduce computational costs.•A proof of concept of the proposed ISE is presented based on comprehensive analysis.•Compare the similarity rate of the ISE with baseline models.
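A minimal sketch of the WordNet side of the semantic similarity score is shown below: keywords from a reference answer and a student answer are compared with Wu-Palmer similarity, and the best match per reference keyword is averaged. Keyword extraction, weighting, the Growbag features and the downstream classifier are omitted, and the keyword lists are invented.

```python
# Sketch of WordNet-based keyword similarity between a reference answer
# and a student answer. Requires nltk.download("wordnet") beforehand.
from nltk.corpus import wordnet as wn

def word_sim(w1, w2):
    """Best Wu-Palmer similarity over all synset pairs (0 if none)."""
    scores = [s1.wup_similarity(s2) or 0.0
              for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)]
    return max(scores, default=0.0)

def answer_similarity(reference, student):
    """Average, over reference keywords, of the best match in the answer."""
    return sum(max(word_sim(r, s) for s in student)
               for r in reference) / len(reference)

reference = ["inheritance", "class", "method"]
student = ["subclass", "object", "function"]
print(round(answer_similarity(reference, student), 3))
```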
This paper proposes a method for solving optimization problems in which the decision-maker cannot evaluate the objective function, but rather can only express a preference such as “this is better than that” between two candidate decision vectors. The algorithm described in this paper aims to reach the global optimizer by iteratively proposing to the decision maker a new comparison to make, based on actively learning a surrogate of the latent (unknown and perhaps unquantifiable) objective function from past sampled decision vectors and pairwise preferences. A radial-basis-function surrogate is fit via linear or quadratic programming, satisfying, if possible, the preferences expressed by the decision maker on existing samples. The surrogate is used to propose a new sample of the decision vector for comparison with the current best candidate, based on two possible criteria: minimize a combination of the surrogate and an inverse-distance weighting function to balance exploitation of the surrogate and exploration of the decision space, or maximize a function related to the probability that the new candidate will be preferred. Compared to active preference learning based on Bayesian optimization, we show that our approach is competitive in that, within the same number of comparisons, it usually approaches the global optimum more closely and is computationally lighter. Applications of the proposed algorithm to solving a set of benchmark global optimization problems, to multi-objective optimization, and to the optimal tuning of a cost-sensitive neural network classifier for object recognition from images are described in the paper. MATLAB and Python implementations of the algorithms described in the paper are available at http://cse.lab.imtlucca.it/~bemporad/glis.
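The core loop can be sketched in a few lines: an RBF surrogate is fit to pairwise preferences via linear programming, and the next query point trades the surrogate off against a distance-based exploration bonus. This is a hedged one-dimensional toy, with an invented latent function, an assumed RBF width and preference margin, and a simple distance bonus standing in for the paper's inverse-distance-weighting term; grid search replaces the paper's solver.

```python
# Toy preference-based optimization step: LP-fit RBF surrogate + explore.
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

latent = lambda x: (x - 0.3) ** 2          # unknown "true" preference
phi = lambda D: np.exp(-(D / 0.2) ** 2)    # Gaussian RBF, width assumed

X = np.array([[0.0], [0.5], [1.0]])        # sampled decision vectors
# prefs[k] = (i, j) means x_i was preferred to x_j by the decision maker.
prefs = [(i, j) if latent(X[i, 0]) < latent(X[j, 0]) else (j, i)
         for i, j in [(0, 1), (1, 2)]]

Phi = phi(cdist(X, X))
n, m, sigma, lam = len(X), len(prefs), 0.1, 1e-3
# LP variables [beta+, beta-, eps]: minimise lam*|beta| + preference slacks,
# subject to f(x_i) <= f(x_j) - sigma + eps for each preference (i, j).
c = np.r_[lam * np.ones(2 * n), np.ones(m)]
A = np.zeros((m, 2 * n + m))
for k, (i, j) in enumerate(prefs):
    row = Phi[i] - Phi[j]
    A[k, :n], A[k, n:2 * n], A[k, 2 * n + k] = row, -row, -1.0
b = -sigma * np.ones(m)
res = linprog(c, A_ub=A, b_ub=b, bounds=[(0, None)] * (2 * n + m))
beta = res.x[:n] - res.x[n:2 * n]

grid = np.linspace(0.0, 1.0, 201)[:, None]
surrogate = phi(cdist(grid, X)) @ beta
dmin = cdist(grid, X).min(axis=1)          # distance to nearest sample
acq = surrogate - 0.5 * dmin               # simplified exploration bonus
print("next point to compare with the incumbent:", grid[np.argmin(acq), 0])
```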
Preventing relapse in schizophrenia improves long-term health outcomes. Repeated episodes of psychotic symptoms shape the trajectory of this illness and can be a detriment to functional recovery. Despite early intervention programs, high relapse rates persist, calling for alternative approaches to relapse prevention. Predicting imminent relapse at the individual level is critical for effective intervention. While clinical profiles are often used to foresee relapse, they lack the specificity and sensitivity needed for timely prediction. Here, we review the use of speech analysed through Natural Language Processing (NLP) to predict a recurrent psychotic episode. Recent advancements in NLP of speech have shown the ability to detect linguistic markers related to thought disorder and other language disruptions within the 2–4 weeks preceding a relapse. This approach has been shown to capture individual speech patterns, showing promise as a prediction tool. We outline current developments in remote monitoring for psychotic relapses, discuss the challenges and limitations, and present the speech-NLP-based approach as an alternative able to detect relapses with sufficient accuracy, construct validity and lead time to generate clinical actions towards prevention.
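One widely studied linguistic marker of this kind, semantic coherence, can be illustrated with a toy computation: cosine similarity between consecutive sentences of a transcript. TF-IDF vectors stand in here for the contextual embeddings typically used in the literature, and the transcript is invented.

```python
# Toy coherence marker: similarity between consecutive sentences.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "I went to the store this morning to buy bread.",
    "The bakery near my house was already open.",
    "Clocks keep spinning when the radio listens to me.",
]
vecs = TfidfVectorizer().fit_transform(sentences)
coherence = [cosine_similarity(vecs[i], vecs[i + 1])[0, 0]
             for i in range(len(sentences) - 1)]
# A sustained drop in coherence is one candidate marker of disorganised speech.
print("adjacent-sentence coherence:", np.round(coherence, 3))
```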
•Definition and discussion of nine comparison metrics for already published data.•Survey of existing datasets for algorithmically generated domain names (AGDs).•Public release of a dataset with 100+ features and 30+ million domain names.•Exploratory analysis using the six most common classification algorithms.•Discussion and identification of guidelines for comparable future research.
Advanced botnet threats natively deploy concealment techniques to prevent detection and sinkholing. To tackle them, machine learning solutions have become a standard approach, especially when dealing with Algorithmically Generated Domain (AGD) names. Nevertheless, the machine learning state of the art is non-specialist at best, with multiple issues in terms of rigour, reproducibility and, ultimately, credibility. This research focuses on the first critical step of the training phase: the collection of data suitable for analysis by algorithms. We have detected a common lack of scientific rigour in the literature regarding AGD analysis and therefore advocate two major contributions in this article: i) a thorough analysis of the cyber panorama in terms of botnets that use Domain Generation Algorithms (DGAs) as evasive techniques, which flows into ii) a full-fledged, machine-learning-ready labelled dataset featuring over 30 million AGDs sorted into 50 malware variant classes. This mature dataset aims to fill the gap in comparability between the different studies published in the literature. Lastly, two minor contributions are also included in this article: iii) an exploratory analysis of the proposed dataset that provides both data characteristics and potential future research lines, which eventually emerges as iv) a collection of suggested guidelines that researchers should adhere to when proposing a machine learning solution, in order to achieve scientific rigour.
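As a hint of what such a dataset enables, the sketch below trains a character n-gram classifier to separate AGDs from benign domains; the six example domains, their labels and the model choice are illustrative only, not part of the released dataset.

```python
# Sketch: character n-gram features feeding a linear AGD/benign classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

domains = ["google", "wikipedia", "github",
           "xjwqpzkd", "qmzvbtrw", "kdpqzxlm"]
labels = [0, 0, 0, 1, 1, 1]  # 0 = benign, 1 = algorithmically generated

clf = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
).fit(domains, labels)

print(clf.predict(["stackoverflow", "zqxwvkpj"]))
```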