To illustrate in-depth validation of prediction models developed on multicenter data.
For each hospital in a multicenter registry, we evaluated the predictive performance of a 30-day mortality prediction model for transcatheter aortic valve implantation (TAVI) using the Netherlands Heart Registration (NHR) dataset. We measured discrimination and calibration per hospital in a leave-center-out analysis (LCOA). Meta-analysis was used to calculate I² values per performance metric from the LCOA and to compute mean and confidence interval (CI) estimates. Case-mix differences between studies were inspected using the framework of Debray et al. for understanding external validation. We also aimed to discover subgroups (SGs) with high model prediction error (PE) and their distribution over the centers.
We studied 16 hospitals with 11,599 TAVI patients and an early mortality of 3.7%. The models’ areas under the curve (AUCs) ranged widely between hospitals, from 0.59 to 0.79, and miscalibration occurred in seven hospitals. The mean AUC from meta-analysis was 0.68 (95% CI 0.65-0.70). I² values were 0%, 74%, and 0% for AUC, calibration intercept, and calibration slope, respectively. Between-hospital case-mix differences were substantial, and model transportability was low. One SG was discovered with marked global PE and was associated with poor performance on validation centers.
The illustrated combination of approaches provides useful insights to inspect multicenter-based prediction models, and it exposes their limitations in transportability and performance variability when applied to different populations.
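The I² statistic reported above quantifies how much of the between-hospital variation in a performance metric exceeds what chance alone would produce. A minimal sketch of the standard calculation (Cochran's Q with inverse-variance weights, then Higgins' I²) follows. The function name and the input values are hypothetical; the abstract does not state which meta-analysis software or pooling model the authors used.

```python
def i_squared(estimates, std_errors):
    """Return (pooled estimate, Cochran's Q, I^2 in percent).

    estimates  -- one performance estimate per center (e.g. logit AUC)
    std_errors -- the matching standard errors
    Illustrative only: uses fixed-effect inverse-variance weights.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the pooled estimate
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    df = len(estimates) - 1
    # I^2 is the share of Q in excess of its expectation (df), floored at 0
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return pooled, q, i2


# Hypothetical per-center values, not taken from the study
pooled, q, i2 = i_squared([0.0, 1.0], [0.1, 0.1])
```

An I² near 0% (as reported for AUC and calibration slope) means the observed spread is compatible with sampling error, whereas a value such as 74% (calibration intercept) points to genuine between-center heterogeneity.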
Data mining (DM) techniques have been used to solve marketing and manufacturing problems in the fashion industry. These approaches are expected to be particularly important for highly customized industries because the diversity of products sold makes it harder to find clear patterns of customer preferences. The goal of this project was to investigate two different data mining approaches for customer segmentation: clustering and subgroup discovery. The models obtained produced six market segments and 49 rules that allowed a better understanding of customer preferences in a highly customized fashion manufacturer/e-tailor. The scope and limitations of these clustering DM techniques will lead to further methodological refinements.
•We investigate customer segmentation in highly customized fashion industries.
•Two data mining methods are used, clustering and subgroup discovery.
•The segments obtained enabled a better understanding of customer preferences.
•Different approaches provide complementary perspectives.
•Lines for further methodological refinements were identified.
Data mining methods in software engineering are becoming increasingly important as they can support several aspects of the software development life-cycle such as quality. In this work, we present a data mining approach to induce rules extracted from static software metrics characterising fault-prone modules. Due to the special characteristics of defect prediction data (imbalance, inconsistency, redundancy), not all classification algorithms can handle this task well. To deal with these problems, Subgroup Discovery (SD) algorithms can be used to find groups of statistically different data given a property of interest. We propose EDER-SD (Evolutionary Decision Rules for Subgroup Discovery), an SD algorithm based on evolutionary computation that induces rules describing only fault-prone modules. Rules are a well-known model representation that can be easily understood and applied by project managers and quality engineers. Thus, rules can help them to develop software systems that can be justifiably trusted. Contrary to other approaches in SD, our algorithm has the advantage of working with continuous variables, as the conditions of the rules are defined using intervals. We describe the rules obtained by applying our algorithm to seven publicly available datasets from the PROMISE repository, showing that they are capable of characterising subgroups of fault-prone modules. We also compare our results with three other well-known SD algorithms; EDER-SD performs well in most cases.
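To make the idea of interval-based rules concrete, the sketch below scores one such rule with weighted relative accuracy (WRAcc), a standard subgroup discovery quality measure. This is an illustration under assumptions: the abstract does not state which quality measures EDER-SD optimises, and the metric name `loc` and the data are hypothetical.

```python
def wracc(rows, labels, conditions):
    """Weighted relative accuracy of an interval rule.

    rows       -- list of dicts mapping metric name -> numeric value
    labels     -- list of bools, True meaning the module is fault-prone
    conditions -- dict mapping metric name -> (low, high) closed interval
    """
    n = len(rows)
    # A module is covered when every metric falls inside its interval
    covered = [
        all(low <= row[metric] <= high
            for metric, (low, high) in conditions.items())
        for row in rows
    ]
    n_covered = sum(covered)
    if n == 0 or n_covered == 0:
        return 0.0
    p_all = sum(labels) / n  # overall share of fault-prone modules
    p_sub = sum(1 for c, y in zip(covered, labels) if c and y) / n_covered
    # Coverage times the gain in target share over the whole dataset
    return (n_covered / n) * (p_sub - p_all)


# Hypothetical modules described by a single metric, lines of code
modules = [{"loc": 10}, {"loc": 200}, {"loc": 300}, {"loc": 20}]
faulty = [False, True, True, False]
score = wracc(modules, faulty, {"loc": (100, 1000)})
```

A positive score means the interval rule covers a larger share of fault-prone modules than the dataset as a whole, weighted by how many modules it covers.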
Extraction of biologically-meaningful knowledge is one of the important and challenging tasks in bioinformatics, in particular computational analysis of DNA and protein sequences, in order to identify biological function(s) and behaviour(s) of newly-extracted sequences. Computational intelligence techniques in combination with sequence-driven features have been applied to tackle the problem and help classify different functional classes of the sequences. In order to study this problem, subgroup discovery algorithms together with a signal processing-based feature extraction method are applied, where the sequences are represented as a signal. The applicability of this method has been studied through four different Neuraminidase genes of Influenza A subtypes, H1N1, H2N2, H3N2 and H5N1. The results yielded not only higher predictive accuracy over these four classes of the proteins but also an interpretable rule-based representation of the descriptive model, with a significantly reduced feature set derived by means of the signal processing method. The subgroup discovery technique based on evolutionary fuzzy systems is expected to open new areas of research in bioinformatics and further help identify and understand more focused therapeutic protein targets.
This paper proposes a novel algorithm for the subgroup discovery task based on genetic programming and fuzzy logic, called Fuzzy Genetic Programming-based for Subgroup Discovery (FuGePSD). Genetic programming allows learning compact expressions, with the main objective of obtaining rules that describe simple, interesting and interpretable subgroups. This algorithm incorporates specific operators in the search process to promote diversity between the individuals. The evolutionary scheme of FuGePSD is codified through the genetic cooperative-competitive approach, promoting competition and cooperation between the individuals of the population in order to find the optimal solutions for the SD task.
FuGePSD displays its potential with high-quality results in a wide experimental study performed with respect to other evolutionary algorithms for subgroup discovery. Moreover, the proposal is applied to a case study related to acute sore throat problems.
Robust subgroup discovery Proença, Hugo M.; Grünwald, Peter; Bäck, Thomas ...
Data Mining and Knowledge Discovery, 09/2022, Volume 36, Issue 5
Journal article, peer reviewed, open access
We introduce the problem of robust subgroup discovery, i.e., finding a set of interpretable descriptions of subsets that 1) stand out with respect to one or more target attributes, 2) are statistically robust, and 3) non-redundant. Many attempts have been made to mine either locally robust subgroups or to tackle the pattern explosion, but we are the first to address both challenges at the same time from a global modelling perspective. First, we formulate the broad model class of subgroup lists, i.e., ordered sets of subgroups, for univariate and multivariate targets that can consist of nominal or numeric variables, including traditional top-1 subgroup discovery in its definition. This novel model class allows us to formalise the problem of optimal robust subgroup discovery using the Minimum Description Length (MDL) principle, where we resort to optimal Normalised Maximum Likelihood and Bayesian encodings for nominal and numeric targets, respectively. Second, finding optimal subgroup lists is NP-hard. Therefore, we propose SSD++, a greedy heuristic that finds good subgroup lists and guarantees that the most significant subgroup found according to the MDL criterion is added in each iteration. In fact, the greedy gain is shown to be equivalent to a Bayesian one-sample proportion, multinomial, or t-test between the subgroup and dataset marginal target distributions plus a multiple hypothesis testing penalty. Furthermore, we empirically show on 54 datasets that SSD++ outperforms previous subgroup discovery methods in terms of quality, generalisation on unseen data, and subgroup list size.
Objectives: Peer review is a powerful tool that steers the education and practice of medical researchers but may allow biased critique by anonymous reviewers. We explored factors unrelated to research quality that may influence peer review reports, and assessed the possibility that sub-types of reviewers exist. Our findings could potentially improve the peer review process.
Methods: We evaluated the harshness, constructiveness and positiveness in 596 reviews from journals with open peer review, plus 46 reviews from colleagues' anonymously reviewed manuscripts. We considered possible influencing factors, such as number of authors and seasonal trends, on the content of the review. Finally, using machine learning, we identified latent types of reviewer with differing characteristics.
Results: Reviews provided during a northern-hemisphere winter were significantly harsher, suggesting a seasonal effect on language. Reviews for articles in journals with an open peer review policy were significantly less harsh than those with an anonymous review process. Further, we identified three types of reviewers: nurturing, begrudged, and blasé.
Conclusion: Nurturing reviews were in a minority and our findings suggest that more widespread open peer reviewing could improve the educational value of peer review, increase the constructive criticism that encourages researchers, and reduce pride and prejudice in editorial processes.
Nowadays, there is an incredible increase of data volumes around the world, with the Internet as one of the main actors in this scenario and a growth rate above 30 GB/s. The treatment of this huge amount of information cannot be carried out efficiently with traditional data mining algorithms, and it is necessary to adapt and design new algorithms for distributed paradigms such as MapReduce. This situation is a challenge for the community, investigated under the widely known term of big data.
This paper presents a new algorithm for the subgroup discovery task called MEFASD-BD. The algorithm is developed in Apache Spark based on the MapReduce paradigm, and it is able to tackle high-dimensional datasets efficiently. In fact, this algorithm is the first approach to big data within evolutionary fuzzy systems for subgroup discovery. MEFASD-BD implements novel MapReduce functions that analyse the quality of the subgroups obtained in each map with respect to the original dataset in order to improve the quality of these subgroups. In addition, the final reduce function of the algorithm employs the token competition operator in order to select the best rules extracted in the different maps. An experimental study with high-dimensional datasets is performed in order to show the advantages of this algorithm for this type of problem. Specifically, the results of the study show an important reduction of the runtime while keeping the values in the standard quality measures for subgroup discovery.
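The token competition operator mentioned above can be sketched generically as follows: each example is a token, rules are visited from best to worst, a rule seizes the tokens of the examples it covers that no better rule has taken, and rules that seize nothing are discarded. This is a plain single-machine illustration, not the MEFASD-BD reduce function, whose details the abstract does not give; the rule identifiers, quality values and coverage sets are hypothetical.

```python
def token_competition(quality, covers):
    """Select non-redundant rules by token competition.

    quality -- dict mapping rule id -> quality score (higher is better)
    covers  -- dict mapping rule id -> set of covered example indices
    Returns the surviving rule ids, best first.
    """
    seized = set()  # tokens already taken by better rules
    kept = []
    for rule in sorted(quality, key=quality.get, reverse=True):
        new_tokens = covers[rule] - seized
        if new_tokens:  # the rule explains at least one unclaimed example
            kept.append(rule)
            seized |= new_tokens
    return kept


# Hypothetical rules gathered from different map partitions
survivors = token_competition(
    quality={"a": 0.9, "b": 0.8, "c": 0.5},
    covers={"a": {0, 1}, "b": {1}, "c": {2}},
)
```

Here rule `b` is dropped even though its quality is high, because every example it covers is already claimed by the better rule `a`; rule `c` survives because it covers an otherwise unexplained example.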
•New double beam algorithms for subgroup discovery (SD) and classification rules (RL).
•Algorithms can use different heuristics for rule refinement and rule selection.
•Variants of new SD algorithm give more interesting rules than state-of-the-art.
•RL algorithm gives rules with comparable accuracy with state-of-the-art algorithms.
•Inverted heuristics in rule refinement produce rules with better coverage.
Classification rules and rules describing interesting subgroups are important components of descriptive machine learning. Rule learning algorithms typically proceed in two phases: rule refinement selects conditions for specializing the rule, and rule selection selects the final rule among several rule candidates. While most conventional algorithms use the same heuristic for guiding both phases, recent research indicates that the use of two separate heuristics is conceptually better justified, improves the coverage of positive examples, and may result in better classification accuracy. The paper presents and evaluates two new beam search rule learning algorithms: DoubleBeam-SD for subgroup discovery and DoubleBeam-RL for classification rule learning. The algorithms use two separate beams and can combine various heuristics for rule refinement and rule selection, which widens the search space and allows for finding rules with improved quality. In the classification rule learning setting, the experimental results confirm previously shown benefits of using two separate heuristics for rule refinement and rule selection. In subgroup discovery, DoubleBeam-SD algorithm variants outperform several state-of-the-art related algorithms.
An increasing area of study for economists and sociologists is the varying organizational structures between business networks. The use of network science makes it possible to identify the determinants of the performance of these business networks. In this work we look for the determinants of inter-firm performance. On one hand, a new method of supervised clustering with attributed networks is proposed, SUWAN, with the aim of obtaining class-uniform clusters of the turnover while minimizing the number of clusters. This method deals with representative-based supervised clustering, where a set of initial representatives is randomly chosen. One of the innovative aspects of SUWAN is that it applies a supervised clustering algorithm to attributed networks, which is accomplished through a combination of weights between the matrix of distances of nodes and their attributes when defining the clusters. As a benchmark, we use Subgroup Discovery on attributed network data. Subgroup Discovery focuses on detecting subgroups described by specific patterns that are interesting with respect to some target concept and a set of explaining features. On the other hand, in order to analyze the impact of the network’s topology on the group’s performance, some network topology measures and the group total turnover were exploited. The proposed methodologies are applied to an inter-organizational network, the EuroGroups Register, a central register that contains statistical information on business networks from European countries.