This paper presents an evolutionary algorithm for Discriminative Pattern (DP) mining that focuses on high-dimensional data sets. DP mining aims to identify the sets of characteristics that better differentiate a target group from the others (e.g. successful vs. unsuccessful medical treatments). Extracting information from high-dimensional data sets becomes ever more relevant as the volume of data stored in the world grows (30 GB/s on the Internet alone). There are several evolutionary approaches for DP mining, but none focusing on high-dimensional data. We propose an evolutionary approach with features that reduce memory and processing costs in the context of high-dimensional data. The new algorithm seeks the best (top-k) patterns and hides from the user many parameters that are common in other evolutionary heuristics, such as population size, mutation and crossover rates, and the number of evaluations. We carried out experiments with real-world high-dimensional and traditional low-dimensional data. The results showed that the proposed algorithm was superior to other approaches from the literature on high-dimensional data sets and competitive on the traditional data sets.
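As a rough illustration of the top-k idea described above, the sketch below keeps a bounded heap of the k best patterns found by a simple evolutionary loop. The data encoding, the single-item "grow" mutation, and the use of weighted relative accuracy as the quality measure are assumptions made for the example; this is not the authors' algorithm.

```python
import heapq
import random

def wracc(pattern, data):
    """Weighted relative accuracy: p(cover) * (p(target | cover) - p(target))."""
    n = len(data)
    covered = [row for row in data if pattern <= row["items"]]
    if not covered:
        return 0.0
    p_cover = len(covered) / n
    p_target = sum(row["label"] for row in data) / n
    p_target_cov = sum(row["label"] for row in covered) / len(covered)
    return p_cover * (p_target_cov - p_target)

def top_k_search(data, all_items, k=10, generations=500, seed=0):
    rng = random.Random(seed)
    top = []                                              # min-heap of (score, pattern)
    population = [frozenset({rng.choice(all_items)}) for _ in range(20)]
    for _ in range(generations):
        parent = rng.choice(population)
        child = parent | {rng.choice(all_items)}          # naive "grow" mutation
        entry = (wracc(child, data), tuple(sorted(child)))
        if entry not in top:
            heapq.heappush(top, entry)
            if len(top) > k:
                heapq.heappop(top)                        # keep only the k best so far
        population.append(child)
    return sorted(top, reverse=True)

# data rows are assumed to look like {"items": {"age=old", "smoker=yes"}, "label": 1}
```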
Besides the need for more advanced predictive methods, there is an increasing demand for easily interpretable results. Couples of enhanced association rules (a generalization of association rules/apriori/frequent itemsets) are excellent candidates for this task. They can be interpreted in various ways, subgroup discovery being one example. A typical problem in rule mining is that the resulting ruleset contains too few or too many rules. Analysts must usually iterate 5–15 times to obtain a reasonable number of rules. Inspired by research in the related area of frequent itemsets on simplifying input and on parameter-free frequent itemset mining, we propose a novel algorithm that finds the best rules for a given range of required rule counts in the output, rather than relying on parameters such as support and confidence. We propose this algorithm for couples of rules (the SD4ft-Miner procedure), and it benefits from Cleverminer, a brand new Python implementation of methods of mechanizing hypothesis formation that allows easy implementation of the algorithm. We have verified the algorithm through several applications on eight public data sets. Our original use case was a case study, which was also the reason we developed the algorithm. The implementation is in Python, but the algorithm itself can be applied to a broader class of methods in any language. The algorithm iterates quickly; in all experiments, at most 10 iterations were needed. Possible enhancements to the algorithm are also outlined.
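The core idea of steering a rule count into a requested range can be sketched as a bisection over a quality threshold. The `mine_rules` callback and the choice of minimum support as the tuned parameter are illustrative assumptions, not the exact SD4ft-Miner/Cleverminer procedure.

```python
def tune_support(mine_rules, lo_count, hi_count, max_iters=10):
    """Bisect the minimum-support threshold until the rule count lands in [lo_count, hi_count]."""
    low, high = 0.0, 1.0                 # search interval for the support threshold
    for _ in range(max_iters):
        support = (low + high) / 2.0
        n_rules = len(mine_rules(min_support=support))
        if lo_count <= n_rules <= hi_count:
            return support, n_rules      # rule count is inside the requested range
        if n_rules > hi_count:
            low = support                # too many rules -> raise the threshold
        else:
            high = support               # too few rules -> lower the threshold
    return support, n_rules              # best attempt after max_iters iterations
```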
The aim of this paper is to categorize and describe different types of learners in massive open online courses (MOOCs) by means of a subgroup discovery (SD) approach based on MapReduce. The proposed SD approach, which is an extension of the well-known FP-Growth algorithm, adopts emerging parallel methodologies like MapReduce to cope with extremely large datasets. As an additional feature, the proposal includes a threshold value denoting the number of courses that each discovered rule should satisfy. A post-processing step is also included so that redundant subgroups can be removed. The experimental stage is carried out on de-identified data from the first year of 16 MITx and HarvardX courses on the edX platform. Experimental results demonstrate that the proposed MapReduce approach outperforms traditional sequential SD approaches, achieving a runtime that is almost constant across courses. Additionally, thanks to the final post-processing step, only interesting and non-redundant rules are discovered, reducing the number of subgroups by one or two orders of magnitude. Finally, the discovered subgroups are easily used by course instructors not only for descriptive purposes but also for additional tasks such as recommendation or personalization.
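A rough map/reduce-style sketch of the per-course threshold (not the paper's FP-Growth extension): each map task mines candidate rules from one course, and the reduce step keeps only rules supported by at least `min_courses` courses. The local miner below is a deliberately trivial placeholder.

```python
from collections import Counter
from multiprocessing import Pool

def mine_local(course_data):
    # Placeholder local miner: every (attribute, value) pair appearing at least
    # 50 times in the course is treated as a candidate rule (illustrative only).
    counts = Counter((a, v) for row in course_data for a, v in row.items())
    return {pair for pair, c in counts.items() if c >= 50}

def map_course(course_data):
    return mine_local(course_data)               # candidate rules for one course

def reduce_counts(per_course_rules, min_courses):
    counts = Counter()
    for rules in per_course_rules:
        counts.update(rules)
    return {rule for rule, c in counts.items() if c >= min_courses}

def mapreduce_sd(courses, min_courses=3, workers=4):
    # courses: list of course datasets, each a list of {attribute: value} rows
    with Pool(workers) as pool:
        per_course_rules = pool.map(map_course, courses)
    return reduce_counts(per_course_rules, min_courses)
```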
Subgroup discovery (SD) is an exploratory pattern mining paradigm that comes into its own when dealing with large real-world data, which typically involves many attributes of a mixture of data types. Essential is the ability to deal with numeric attributes, whether they concern the target (a regression setting) or the description attributes (by which subgroups are identified). Various specific algorithms have been proposed in the literature for both cases, but a systematic review of the available options is missing. This paper presents a generic framework that can be instantiated in various ways in order to create different strategies for dealing with numeric data. The bulk of the work in this paper describes an experimental comparison of a considerable range of numeric strategies in SD, where these strategies are organised according to four central dimensions. These experiments are furthermore repeated for both the classification task (target is nominal) and the regression task (target is numeric), and the strategies are compared based on the quality of the top subgroup, and the quality and redundancy of the top-k result set. Results of three search strategies are compared: traditional beam search, complete search, and a variant of diverse subgroup set discovery called cover-based subgroup selection. Although there are various subtleties in the outcome of the experiments, the following general conclusions can be drawn: it is often best to determine numeric thresholds dynamically (locally), in a fine-grained manner, with binary splits, while considering multiple candidate thresholds per attribute.
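The recommended strategy (dynamic, fine-grained binary splits) can be illustrated with a small sketch that scans all candidate thresholds of one numeric attribute on the locally covered data and keeps the binary split with the best weighted relative accuracy. This is purely illustrative; the paper's framework covers many more strategy combinations.

```python
def best_binary_split(values, labels):
    """values: numeric attribute; labels: 0/1 target; returns (threshold, wracc)."""
    n = len(values)
    p_target = sum(labels) / n
    best = (None, float("-inf"))
    for threshold in sorted(set(values)):            # fine-grained: all observed cut points
        covered = [y for x, y in zip(values, labels) if x <= threshold]
        if not covered or len(covered) == n:
            continue                                  # skip degenerate splits
        p_cover = len(covered) / n
        wracc = p_cover * (sum(covered) / len(covered) - p_target)
        if wracc > best[1]:
            best = (threshold, wracc)
    return best
```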
Our lives are made of social interactions, which can be recorded through personal gadgets as well as sensors capturing ubiquitous and social data. This type of data, such as spatio-temporal data from the real-time location of people, can then be used for inferring interactions, which can be translated into behavioural patterns. In this paper, we consider the automatic discovery of exceptional social behaviour from spatio-temporal interaction data, focusing on two areas: exceptional subgroups and spatio-temporal outliers – both in the form of descriptive patterns. For that, we propose a method for exceptional social behaviour discovery, combining subgroup discovery and network science methods to identify behaviour that deviates from the norm. We also propose the use of two outlier detection metrics for identifying outliers, namely the Local Outlier Factor (LOF) and the Voronoi area. We applied the proposed method to synthetic data as well as two real datasets containing location data from children playing in the school playground. Our results indicate that this is a valid approach, able to obtain meaningful knowledge from the data.
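The two outlier metrics can be computed on 2-D positions roughly as below: LOF via scikit-learn and the area of each point's Voronoi cell via SciPy. The library calls are standard, but the neighbourhood size, the toy data, and the handling of unbounded cells are illustrative choices, not necessarily those of the paper.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from scipy.spatial import Voronoi, ConvexHull

positions = np.random.RandomState(0).rand(50, 2)     # toy (x, y) locations

# Local Outlier Factor: -1 flags points whose local density deviates strongly.
lof = LocalOutlierFactor(n_neighbors=10)
lof_flags = lof.fit_predict(positions)
lof_scores = -lof.negative_outlier_factor_

# Voronoi cell area: isolated points end up with large (or unbounded) cells.
vor = Voronoi(positions)
areas = []
for region_idx in vor.point_region:
    region = vor.regions[region_idx]
    if -1 in region or len(region) == 0:
        areas.append(np.inf)                          # unbounded cell on the border
    else:
        areas.append(ConvexHull(vor.vertices[region]).volume)   # "volume" is area in 2-D
```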
Supervised descriptive rule discovery represents a set of data mining techniques whose objective is to describe data with respect to a property of interest. This concept encompasses different techniques such as subgroup discovery, emerging patterns and contrast sets. Supervised learning is used to obtain rules for descriptive purposes, but with different quality measures. Although these techniques originate from different data mining tasks, our hypothesis is that subgroup discovery, emerging patterns and contrast sets are compatible thanks to the common use of the weighted relative accuracy quality measure. A complete analysis shows this relationship between the different tasks. The analysis is supported by an empirical study with the most representative algorithms for each technique.
The paper shows how the use of the weighted relative accuracy allows experts to distinguish between interesting subgroups, emerging and/or contrasting rules, thanks to the relation between the quality measures employed in the search process for the different models. In addition, this relationship enables us to analyse the main differences and/or similarities between the different techniques within supervised descriptive rule discovery. This scenario opens up new challenges for supervised descriptive rule learning in analysing and developing descriptive models from a new perspective.
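For reference, the weighted relative accuracy of a rule Cond → Target is commonly defined as

```latex
\mathrm{WRAcc}(\mathrm{Cond} \rightarrow \mathrm{Target})
  = p(\mathrm{Cond})\,\bigl(p(\mathrm{Target} \mid \mathrm{Cond}) - p(\mathrm{Target})\bigr),
```

i.e. the coverage of the rule times the gain in target precision over the default rate; this is the measure shared across the three tasks discussed above.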
Many relational data result from the aggregation of several individual behaviors described by some characteristics. For instance, a bike-sharing system may be modeled as a graph where vertices stand for bike-share stations and connections represent bike trips made by users from one station to another. Stations and trips are described by additional information such as the description of the geographical environment of the stations (business vs. residential area, closeness to POI, elevation, urbanization density, etc.), or properties of the bike trips (timestamp, user profile, weather, events and other special conditions about the trip). Identifying highly connected components (such as communities or quasi-cliques) in this graph provides interesting insights into global usages but does not capture mobility profiles that characterize a subpopulation. To tackle this problem we propose an approach rooted in exceptional model mining to find exceptional contextual subgraphs, i.e., subgraphs generated from a context or a description of the individual behaviors that is exceptional (behaves in a different way) compared to the whole augmented graph. The dependency between a context and an edge is assessed by a χ² test, and the weighted relative accuracy measure is used to only retain contexts that strongly characterize connected subgraphs. We present an original algorithm that uses sophisticated pruning techniques to restrict the search space of vertices, context refinements, and edges to be considered. An experimental evaluation on synthetic data and two real-life datasets demonstrates the effectiveness of the proposed pruning mechanisms, as well as the relevance of the discovered patterns.
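The context/edge dependency test can be sketched as a chi-squared test on the 2×2 contingency table of "trip made under the context" vs. "trip made between this pair of stations". The counts below are illustrative placeholders, not data from the paper.

```python
from scipy.stats import chi2_contingency

# rows: context holds / does not hold; columns: edge present / absent
observed = [[40, 160],
            [60, 740]]
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2={chi2:.2f}, p={p_value:.4f}")   # small p suggests the context and the edge are dependent
```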
Subgroup Discovery is a descriptive data mining technique for obtaining subgroups with unusual statistical characteristics with respect to a given target variable. In this paper, unlike existing approaches, we capture the data in the form of implicative-type fuzzy rules and propose an algorithm to determine sharp transitions in the consequent when there is a minimal change in the antecedent. Our study highlights the role and employability of fuzzy implication functions in such settings through illustrative examples with several real datasets.
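Two textbook fuzzy implication functions illustrate the kind of sharp transition the paper looks for: the Gödel implication drops discontinuously from 1 to y as the antecedent membership x crosses y, while the Łukasiewicz implication changes smoothly. These specific functions are standard examples, not necessarily the ones used in the paper.

```python
def godel(x, y):
    # Goedel implication: 1 if x <= y, otherwise y (discontinuous at x = y)
    return 1.0 if x <= y else y

def lukasiewicz(x, y):
    # Lukasiewicz implication: min(1, 1 - x + y) (continuous in x and y)
    return min(1.0, 1.0 - x + y)

y = 0.3
for x in (0.29, 0.30, 0.31):
    print(x, godel(x, y), lukasiewicz(x, y))
# godel: 1.0, 1.0, 0.3  -> sharp jump;  lukasiewicz: 1.0, 1.0, 0.99 -> smooth change
```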
• A causal modelling approach for individual treatment effects identifies subgroups with robust and additive predictive value of the outcome.
• The subgroups provide insight into why individuals may respond differently to the same treatment.
• The approach is illustrated in a large real-world dataset in primary care.
Individuals may respond differently to the same treatment, and there is a need to understand such heterogeneity of causal individual treatment effects. We propose and evaluate a modelling approach to better understand this heterogeneity from observational studies by identifying patient subgroups with a markedly deviating response to treatment. We illustrate this approach in a primary care case-study of antibiotic (AB) prescription on recovery from acute rhino-sinusitis (ARS).
Our approach consists of four stages and is applied to a large primary care dataset of 24,392 patients suspected of suffering from ARS. We first identify pre-treatment variables that either confound the relationship between treatment and outcome or are risk factors of the outcome. Second, based on the pre-treatment variables, we create Synthetic Random Forest (SRF) models to compute the potential outcomes and subsequently the causal individual treatment effect (ITE) estimates. Third, we perform subgroup discovery using the ITE estimates as outcomes to identify positive and negative responders. Fourth, we evaluate the predictive performance of the identified subgroups for predicting the outcome in two ways: with the likelihood ratio test, and by checking whether the subgroups are selected via the Akaike Information Criterion (AIC) using backward stepwise variable selection. We validate the whole modelling strategy by means of 10-fold cross-validation.
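A simplified sketch of the second stage (ITE estimation): the paper uses Synthetic Random Forest models, but a plain two-model ("T-learner") setup with scikit-learn random forests stands in here to show the idea of predicting both potential outcomes for every patient and taking their difference as the ITE. The resulting ITE estimates then serve as the target for subgroup discovery in the third stage.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def estimate_ite(X, treatment, outcome, seed=0):
    """X: array of pre-treatment variables; treatment/outcome: 0/1 arrays.
    Assumes both outcome classes occur in each treatment arm."""
    treated, control = treatment == 1, treatment == 0
    m1 = RandomForestClassifier(n_estimators=200, random_state=seed).fit(X[treated], outcome[treated])
    m0 = RandomForestClassifier(n_estimators=200, random_state=seed).fit(X[control], outcome[control])
    y1_hat = m1.predict_proba(X)[:, 1]      # estimated P(recovery | treated)
    y0_hat = m0.predict_proba(X)[:, 1]      # estimated P(recovery | not treated)
    return y1_hat - y0_hat                  # estimated individual treatment effect
```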
Based on 20 pre-treatment variables, four subgroups (three for positive responders and one for negative responders) were identified. The log likelihood ratio tests showed that the subgroups were significant. Variable selection using the AIC kept two of the four subgroups, one for positive responders and one for negative responders. As for the validation of the whole modelling strategy, all reported measures (the number of pre-treatment variables associated with the outcome, number of subgroups, number of subgroups surviving variable selection and coverage) showed little variation.
With the proposed approach, we identified subgroups of positive and negative responders to treatment that markedly deviate from the mean response. The subgroups showed additive predictive value for the outcome. The modelling strategy was shown to be robust on this dataset. Our approach was thus able to discover understandable subgroups from observational data that have predictive value and that clinical users may consider to gain insight into who responds positively or negatively to a proposed treatment.
Exceptional preferences mining (EPM) is a crossover between two subfields of data mining: local pattern mining and preference learning. EPM can be seen as a local pattern mining task that finds subsets of observations where some preference relations between labels significantly deviate from the norm. It is a variant of subgroup discovery, with rankings of labels as the target concept. We employ several quality measures that highlight subgroups featuring exceptional preferences, where the focus of what constitutes ‘exceptional’ varies with the quality measure: two measures look for exceptional overall ranking behavior, one measure indicates whether a particular label stands out from the rest, and a fourth measure highlights subgroups with unusual pairwise label ranking behavior. We explore a few datasets and compare with existing techniques. The results confirm that the new task EPM can deliver interesting knowledge.
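As an illustration of the pairwise flavour of EPM (not the paper's exact quality measures), the sketch below builds the matrix of pairwise label preferences inside a subgroup and compares it with the same matrix on the full dataset; a large deviation marks the subgroup as exceptional.

```python
import numpy as np

def preference_matrix(rankings):
    """rankings: array (n_instances, n_labels) of ranks, lower rank = preferred."""
    r = np.asarray(rankings)
    n_labels = r.shape[1]
    m = np.zeros((n_labels, n_labels))
    for i in range(n_labels):
        for j in range(n_labels):
            if i != j:
                m[i, j] = np.mean(r[:, i] < r[:, j])   # fraction preferring label i over j
    return m

def unusualness(rankings, subgroup_mask):
    overall = preference_matrix(rankings)
    local = preference_matrix(np.asarray(rankings)[subgroup_mask])
    return np.abs(local - overall).mean()              # mean pairwise deviation from the norm
```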