Nowadays, data volumes are increasing at an incredible rate around the world, with the Internet as one of the main actors in this scenario and a growth rate above 30 GB/s. This huge amount of information cannot be processed efficiently by traditional data mining algorithms, and it is necessary to adapt and design new algorithms for distributed paradigms such as MapReduce. This situation poses a challenge for the community, investigated under the widely known term of big data.
This paper presents a new algorithm for the subgroup discovery task called MEFASD-BD. The algorithm is developed in Apache Spark based on the MapReduce paradigm, and it is able to tackle high-dimensional datasets in an efficient way. In fact, this algorithm is the first approach to big data within evolutionary fuzzy systems for subgroup discovery. MEFASD-BD implements novel MapReduce functions that analyse the quality of the subgroups obtained in each map with respect to the original dataset in order to improve their quality. In addition, the final reduce function of the algorithm employs the token competition operator in order to select the best rules extracted in the different maps. An experimental study with high-dimensional datasets is performed in order to show the advantages of this algorithm on this type of problem. Specifically, the results of the study show an important reduction in runtime while keeping the values of the standard quality measures for subgroup discovery.
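As a rough illustration of the token-competition idea used in the final reduce step, the sketch below lets stronger rules claim covered examples ("tokens") first and discards rules that claim nothing new. This is a minimal plain-Python sketch, not the MEFASD-BD implementation; the `covers` and `quality` callables are hypothetical stand-ins for the algorithm's fuzzy-rule evaluation.

```python
def token_competition(rules, examples, covers, quality):
    """Keep only rules that claim at least one new example (token).

    rules    -- candidate rules, e.g. gathered from all maps
    examples -- iterable of example identifiers
    covers   -- covers(rule, example) -> bool (hypothetical)
    quality  -- quality(rule) -> float, higher is better (hypothetical)
    """
    claimed = set()
    survivors = []
    # Stronger rules compete first and seize their tokens.
    for rule in sorted(rules, key=quality, reverse=True):
        tokens = {ex for ex in examples if covers(rule, ex)} - claimed
        if tokens:                 # the rule wins at least one new token
            survivors.append(rule)
            claimed |= tokens      # those examples are now taken
    return survivors
```

A rule survives only if it covers at least one example not already claimed by a stronger rule, which is how redundant rules collected from the different maps get filtered out.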
•New double-beam algorithms for subgroup discovery (SD) and classification rule learning (RL).
•The algorithms can use different heuristics for rule refinement and rule selection.
•Variants of the new SD algorithm give more interesting rules than the state of the art.
•The RL algorithm gives rules with accuracy comparable to state-of-the-art algorithms.
•Inverted heuristics in rule refinement produce rules with better coverage.
Classification rules and rules describing interesting subgroups are important components of descriptive machine learning. Rule learning algorithms typically proceed in two phases: rule refinement selects conditions for specializing the rule, and rule selection chooses the final rule among several candidates. While most conventional algorithms use the same heuristic for guiding both phases, recent research indicates that using two separate heuristics is conceptually better justified, improves the coverage of positive examples, and may result in better classification accuracy. The paper presents and evaluates two new beam search rule learning algorithms: DoubleBeam-SD for subgroup discovery and DoubleBeam-RL for classification rule learning. The algorithms use two separate beams and can combine various heuristics for rule refinement and rule selection, which widens the search space and allows for finding rules of improved quality. In the classification rule learning setting, the experimental results confirm the previously shown benefits of using two separate heuristics for rule refinement and rule selection. In subgroup discovery, the DoubleBeam-SD algorithm variants outperform several state-of-the-art algorithms.
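A generic reading of the double-beam scheme can be sketched as follows; this is an illustrative skeleton under stated assumptions (a `refine` operator producing specializations, and separate `h_refine` and `h_select` heuristics), not the actual DoubleBeam-SD/RL implementation.

```python
def double_beam_search(seed, refine, h_refine, h_select, beam_width=5, depth=3):
    """Generic double-beam rule search (illustrative sketch).

    seed      -- initial (e.g. empty) rule
    refine    -- refine(rule) -> iterable of specializations (hypothetical)
    h_refine  -- heuristic guiding which rules to specialize further
    h_select  -- heuristic ranking the final rule candidates
    """
    refinement_beam = [seed]
    selection_beam = [seed]
    for _ in range(depth):
        candidates = [r2 for r in refinement_beam for r2 in refine(r)]
        if not candidates:
            break
        # Refinement beam: rules worth specializing, ranked by h_refine.
        refinement_beam = sorted(candidates, key=h_refine, reverse=True)[:beam_width]
        # Selection beam: best rules seen so far, ranked by the other heuristic.
        selection_beam = sorted(selection_beam + candidates,
                                key=h_select, reverse=True)[:beam_width]
    return selection_beam
```

Keeping two beams means a rule that scores poorly for further refinement can still survive as a final candidate, which is the conceptual point of separating the two heuristics.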
The varying organizational structures of business networks are an increasingly studied topic for economists and sociologists. The use of network science makes it possible to identify the determinants of the performance of these business networks. In this work we look for the determinants of inter-firm performance. On one hand, a new method of supervised clustering with attributed networks, SUWAN, is proposed, with the aim of obtaining class-uniform clusters with respect to turnover while minimizing the number of clusters. This method deals with representative-based supervised clustering, where a set of initial representatives is randomly chosen. One of the innovative aspects of SUWAN is that it applies supervised clustering to attributed networks, defining clusters through a weighted combination of the node distance matrix and the attribute distances. As a benchmark, we use subgroup discovery on attributed network data. Subgroup discovery focuses on detecting subgroups described by specific patterns that are interesting with respect to some target concept and a set of explaining features. On the other hand, in order to analyze the impact of the network's topology on the group's performance, some network topology measures and the group total turnover were exploited. The proposed methodologies are applied to an inter-organizational network, the EuroGroups Register, a central register that contains statistical information on business networks from European countries.
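The weighted combination of structural and attribute distances can be illustrated with a short sketch; the function name and the single `alpha` weight are assumptions for illustration, since the abstract does not specify SUWAN's exact weighting scheme.

```python
import numpy as np

def combined_distance(d_network, d_attributes, alpha=0.5):
    """Blend structural and attribute distances into one matrix.

    d_network    -- (n, n) matrix of graph distances between nodes
    d_attributes -- (n, n) matrix of attribute-space distances
    alpha        -- weight on the network component, in [0, 1] (assumed)
    """
    return alpha * d_network + (1.0 - alpha) * d_attributes
```

With `alpha=1.0` the clustering is purely structural; with `alpha=0.0` it ignores the network and clusters on attributes alone.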
The critical progression of bacterial resistance to antibiotics has led to the use of machine learning techniques in order to provide clinicians with new knowledge for decision making. One of the key aspects is precision medicine, which focuses on finding phenotypes of patients for whom treatments may be more effective, or on detecting high-risk patients whose progress must be closely monitored. The identification of these phenotypes requires the application of a methodology whose results are consistent and interpretable, along with control of the process by a clinical expert. Studies concerning machine learning phenotyping use conventional clustering or subgroup discovery algorithms that require information to be obtained a priori.
We propose a new unsupervised machine learning technique, denominated as Trace-based clustering, and a 5-step methodology in order to support clinicians when identifying patient phenotypes. The steps proposed are: (1) Extraction and transformation of data and analysis of clustering tendency, (2) Selection of clustering algorithm and parameters, (3) Automatic generation of candidate clusters, (4) Visual support for selection of candidate clusters, and (5) Evaluation by clinical experts.
We undertake an antimicrobial resistance use case by employing the MIMIC-III open-access database for patients infected with methicillin-resistant Staphylococcus aureus and Enterococcus faecium and treated with vancomycin. The experiments were carried out using the Hopkins statistic to evaluate the clustering tendency of the data, the K-Means algorithm for clustering, and the Dice coefficient to measure the similarity of the clusters. Our experiments computed 370 potential patient sets (clusters) so as to obtain 19 candidate clusters for final evaluation. We evaluated the final result with a classification model in order to ensure the consistency of the phenotypes obtained, and we compared the result with a traditional clustering approach. We found a reduced set of consistent candidate clusters with a common phenotype (resistance and death), which were different from the other candidate clusters. An expert in the domain could add labels with clinical meaning to this reduced number of clusters.
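Of the measures mentioned, the Dice coefficient used to compare clusters across partitions is simple to state; a minimal sketch:

```python
def dice_coefficient(a, b):
    """Dice similarity between two clusters (sets of patient identifiers):
    2|A ∩ B| / (|A| + |B|). Returns 1.0 for identical non-empty sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))
```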
We show that the proposed methodology allows physicians to identify consistent patient phenotypes. Our experiments confirm that quality measures and visual analysis can help expert clinicians control the knowledge discovery process and obtain interpretable results. Our approach provides a new perspective: that of finding patient sets using clustering techniques evaluated through overlaps with the clusters of previous partitions. The proposed method is general and can easily be adapted to other problems and other clinical settings.
The rich longitudinal individual-level data available from electronic health records (EHRs) can be used to examine treatment effect heterogeneity. However, estimating treatment effects using EHR data poses several challenges, including time-varying confounding; repeated and temporally non-aligned measurements of covariates, treatment assignments and outcomes; and loss to follow-up due to dropout. Here, we develop the subgroup discovery for longitudinal data algorithm, a tree-based algorithm for discovering subgroups with heterogeneous treatment effects using longitudinal data, by combining the generalized interaction tree algorithm, a general data-driven method for subgroup discovery, with longitudinal targeted maximum likelihood estimation. We apply the algorithm to EHR data to discover subgroups of people living with human immunodeficiency virus who are at higher risk of weight gain when receiving dolutegravir (DTG)-containing antiretroviral therapies (ARTs) than when receiving non-DTG-containing ARTs.
Discriminative pattern mining is used to discover a set of significant patterns that occur with disproportionate frequencies in different class-labeled data sets. Although many algorithms have been proposed, the redundancy issue, whereby the discriminative power of many patterns derives mainly from their sub-patterns, has not yet been resolved. In this paper, we consider a novel notion dubbed the conditional discriminative pattern to address this issue. To mine conditional discriminative patterns, we propose an effective algorithm called CDPM (Conditional Discriminative Patterns Mining) that generates a set of non-redundant discriminative patterns. Experimental results on real data sets demonstrate that CDPM performs very well at removing redundant patterns derived from significant sub-patterns, generating a concise set of meaningful discriminative patterns.
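The redundancy notion can be illustrated with a toy criterion: condition on a sub-pattern and check whether the pattern remains discriminative in the conditional data. This is only an illustrative sketch using support difference as the discriminative measure; it is not CDPM's exact definition, and the `margin` threshold is an assumption.

```python
def support(pattern, dataset):
    """Relative support of a pattern (set of items) in a list of transactions."""
    return sum(1 for t in dataset if pattern <= t) / len(dataset)

def is_redundant(pattern, subpattern, pos, neg, margin=0.05):
    """A pattern adds little beyond its sub-pattern if, restricted to the
    transactions containing the sub-pattern, it is no longer discriminative
    (illustrative criterion only, not CDPM's definition)."""
    pos_c = [t for t in pos if subpattern <= t]
    neg_c = [t for t in neg if subpattern <= t]
    if not pos_c or not neg_c:
        # Sub-pattern already separates the classes; nothing left to add.
        return True
    # Discriminative power as support difference in the conditional data.
    return support(pattern, pos_c) - support(pattern, neg_c) < margin
```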
Existing algorithms for subgroup discovery with numerical targets do not optimize the error or target variable dispersion of the groups they find. This often leads to unreliable or inconsistent statements about the data, rendering practical applications, especially in scientific domains, futile. Therefore, we here extend the optimistic estimator framework for optimal subgroup discovery to a new class of objective functions: we show how tight estimators can be computed efficiently for all functions that are determined by subgroup size (non-decreasing dependence), the subgroup median value, and a dispersion measure around the median (non-increasing dependence). In the important special case in which dispersion is measured using the mean absolute deviation from the median, this novel approach yields a linear-time algorithm. Empirical evaluation on a wide range of datasets shows that, when used within branch-and-bound search, this approach is highly efficient and indeed discovers subgroups with much smaller errors.
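The class of objective functions described above can be illustrated with a small sketch; `mad_from_median` is the dispersion measure named in the abstract, while `subgroup_objective` is a hypothetical instance of the function class (its exact form is an assumption for illustration, not the paper's objective).

```python
import statistics

def mad_from_median(values):
    """Mean absolute deviation from the median: the dispersion measure for
    which the tight-estimator approach yields a linear-time algorithm."""
    med = statistics.median(values)
    return sum(abs(v - med) for v in values) / len(values)

def subgroup_objective(subgroup_values, population_median, size_weight=0.5):
    """Hypothetical member of the function class: non-decreasing in subgroup
    size, increasing in the median shift, non-increasing in dispersion."""
    n = len(subgroup_values)
    med = statistics.median(subgroup_values)
    return (n ** size_weight) * (med - population_median) \
        - mad_from_median(subgroup_values)
```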
In order to estimate the reactivity of a large number of potentially complex heterogeneous catalysts while searching for novel and more efficient materials, physical as well as data-centric models have been developed for a faster evaluation of adsorption energies compared to first-principles calculations. However, global models designed to describe as many materials as possible might overlook the very few compounds that have the appropriate adsorption properties to be suitable for a given catalytic process. Here, the subgroup-discovery (SGD) local artificial-intelligence approach is used to identify the key descriptive parameters and constraints on their values, the so-called SG rules, which particularly describe transition-metal surfaces with outstanding adsorption properties for the oxygen-reduction and -evolution reactions. We start from a data set of 95 oxygen adsorption-energy values evaluated by density-functional-theory calculations for several monometallic surfaces along with 16 atomic, bulk and surface properties as candidate descriptive parameters. From this data set, SGD identifies constraints on the most relevant parameters describing materials and adsorption sites that (i) result in O adsorption energies within the Sabatier-optimal range required for the oxygen-reduction reaction and (ii) present the largest deviations from the linear-scaling relations between O and OH adsorption energies, which limit the catalyst performance in the oxygen-evolution reaction. The SG rules not only reflect the local underlying physicochemical phenomena that result in the desired adsorption properties, but also guide the challenging design of alloy catalysts.
The discovery of patterns that accurately discriminate one class label from another remains a challenging data mining task. Subgroup discovery (SD) is one of the frameworks that makes it possible to elicit such interesting patterns from labeled data. A question remains fairly open: how to select an accurate heuristic search technique when exhaustive enumeration of the pattern space is infeasible? Existing approaches make use of beam search, sampling, and genetic algorithms for discovering a pattern set that is non-redundant and of high quality w.r.t. a pattern quality measure. We argue that such approaches produce pattern sets that lack diversity: only a few patterns of high quality, and different enough from each other, are discovered. Our main contribution is to formally define pattern mining as a game and to solve it with Monte Carlo tree search (MCTS). It can be seen as an exhaustive search guided by random simulations which can be stopped early (limited budget) by virtue of its best-first search property. We show through a comprehensive set of experiments how MCTS enables the anytime discovery of a diverse pattern set of high quality. It outperforms other approaches when dealing with a large pattern search space and for different quality measures. Thanks to its genericity, our MCTS approach can be used for SD but also for many other pattern mining tasks.
•A new mining task, unexpected pattern retrieval, is proposed.
•Frequent pattern mining algorithms are extended to multi-dimensional datasets.
•Partial results are shared among the subgroups.
•A new index is built to retrieve unexpected patterns interactively.
•Experiments show the efficiency and effectiveness of the proposed method.
A typical mining task is to retrieve all frequent patterns from a multi-dimensional dataset. Those patterns give us a basic idea of what the data look like and of the hidden inherent regularities. However, this is only useful for an unfamiliar dataset, while for datasets that are analyzed periodically, “unexpected” patterns are more interesting (e.g., some customers decided to subscribe to long-term deposits despite the burden of a housing loan). In this paper, we propose a new mining job, unexpected mining, which aims at retrieving frequent patterns that are not valid in a reference dataset but are significant enough in a specific subgroup. Given a reference dataset, we generate all unexpected patterns for all subgroups step by step. We extend existing mining approaches to support the new mining job efficiently. In particular, our scheme consists of an offline process and an online process. The offline process generates candidate patterns and builds an index table. The online process retrieves unexpected patterns for user-defined subgroups and a given support. Experiments on real datasets show that our approach can find interesting patterns and is very efficient compared to existing approaches.
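The core of the task, patterns frequent in a subgroup but infrequent in the reference dataset, can be sketched with a brute-force baseline; this does not reproduce the paper's offline index or online retrieval scheme and is suitable only for tiny examples.

```python
from itertools import combinations

def frequent_patterns(transactions, min_support):
    """All itemsets whose relative support reaches min_support
    (brute-force enumeration; exponential in the number of items)."""
    items = sorted({i for t in transactions for i in t})
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            support = sum(1 for t in transactions
                          if set(candidate) <= t) / len(transactions)
            if support >= min_support:
                result[candidate] = support
    return result

def unexpected_patterns(subgroup, reference, min_support):
    """Patterns frequent in the subgroup but not in the reference dataset."""
    sub = frequent_patterns(subgroup, min_support)
    ref = frequent_patterns(reference, min_support)
    return {p: s for p, s in sub.items() if p not in ref}
```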