A growing number of companies use mobility information in their day-to-day business. One requirement is that it must be possible to draw inferences about population-wide mobility patterns. It is therefore important not only to find mobility patterns in a given data sample but also to establish their validity for the total population. This aspect of analysis has been largely neglected in mobility data mining research, which limits the applicability of the whole algorithmic field. In this paper we analyze one aspect of sample bias due to incomplete mobility data. We provide a systematic approach to detect dependencies between mobility behavior, socio-demography and missing data. Further, we apply the approach to a large GPS mobility survey in Switzerland and show that our concerns are justified and require attention in future research. We hope that our paper will raise awareness that, because of missing data, the representativity of mobility behavior cannot be taken for granted in mobility surveys, and that this is a research direction of utmost importance.
The characterisation of defective modules in software engineering remains a challenge. In this work, we use data mining techniques to search for rules that indicate modules with a high probability of being defective. Using data sets from the PROMISE repository, we first apply feature selection (attribute selection) so as to work only with those attributes of the data sets that are capable of predicting defective modules. On the reduced data set, a genetic algorithm is then used to search for rules characterising modules with a high probability of being defective. This algorithm overcomes the problem of unbalanced data sets, in which the non-defective samples greatly outnumber the defective ones.
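To make the idea of scoring a candidate rule on an unbalanced data set concrete, the following is a minimal sketch. It uses weighted relative accuracy (WRAcc), a common subgroup discovery measure; this is an assumption for illustration, since the abstract does not state which fitness function the genetic algorithm uses. The metric names (`loc`, `cc`), the toy data, and the rule are all hypothetical.

```python
from typing import Callable, Dict, List

Module = Dict[str, float]  # hypothetical: one software module described by its metrics


def wracc(rule: Callable[[Module], bool],
          modules: List[Module],
          defective: List[bool]) -> float:
    """Weighted relative accuracy of `rule` for the defective class.

    WRAcc = coverage * (precision of the rule - base rate of defects).
    A rule covering everything scores 0, so the majority class gives no credit.
    """
    n = len(modules)
    if n == 0:
        return 0.0
    covered = [rule(m) for m in modules]
    n_covered = sum(covered)
    if n_covered == 0:
        return 0.0
    n_covered_defective = sum(1 for c, y in zip(covered, defective) if c and y)
    base_rate = sum(defective) / n
    return (n_covered / n) * (n_covered_defective / n_covered - base_rate)


# Toy, made-up data: 10 modules, only 2 of them defective (unbalanced).
modules = (
    [{"loc": 500, "cc": 25}, {"loc": 300, "cc": 15}, {"loc": 400, "cc": 12}]
    + [{"loc": 100, "cc": 3} for _ in range(7)]
)
defective = [True, True] + [False] * 8


def high_complexity(m: Module) -> bool:
    """Hypothetical candidate rule: cyclomatic complexity above 10."""
    return m["cc"] > 10


def always_true(m: Module) -> bool:
    """Trivial rule that covers every module."""
    return True


score = wracc(high_complexity, modules, defective)
trivial_score = wracc(always_true, modules, defective)
```

In this toy example the rule covers 3 of 10 modules, 2 of which are defective, so its score is 0.3 × (2/3 − 0.2) = 0.14, while the trivial rule scores 0. A measure of this kind rewards rules that concentrate the rare class rather than rules that merely cover many samples.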
One of the goals of medical research in the area of dementia is to correlate images of the brain with other variables, for instance, demographic information or outcomes of clinical tests. The usual approach is to select a subset of patients based on such variables and analyze the images associated with those patients. In this paper, we apply data mining techniques to take the opposite approach: we start with the images and explain their differences and commonalities in terms of the other variables. In the first step, we cluster PET scans of patients to form groups sharing similar features in brain metabolism. To the best of our knowledge, this is the first time that clustering has been applied to whole PET scans. In the second step, we explain the clusters by relating them to non-image variables. To do so, we employ RSD, an algorithm for relational subgroup discovery, with the cluster membership of patients as the target variable. Our results enable interesting interpretations of differences in brain metabolism in terms of demographic and clinical variables. The approach was implemented and tested on an exceptionally large pre-existing data collection of patients with different types of dementia. It comprises 10 GB of image data from 454 PET scans, and 42 variables from psychological and demographic data organized in 11 relations of a relational database. We believe that explaining medical images in terms of other variables (patient records, demographic information, etc.) is a challenging, new and rewarding area for data mining research.
The discovery of patterns that strongly characterize one class with respect to another remains a difficult problem in data mining. Subgroup Discovery (SD) is a formal pattern mining approach that enables the construction of intelligible classifiers and, above all, the formulation of hypotheses about the data. However, this approach still faces two major problems: (i) how to define appropriate quality measures to characterize the interest of a pattern and (ii) how to select a suitable heuristic method when exhaustive enumeration of the search space is not feasible. The first problem has been addressed by Exceptional Model Mining (EMM), which extracts patterns covering objects of the database for which the model induced on the class attributes differs significantly from the model induced by all objects of the dataset. The second problem has been studied in SD and EMM mainly through heuristic methods such as beam search or genetic algorithms, which allow the discovery of non-redundant, diverse, high-quality patterns. In this thesis, we argue that the greedy nature of these enumeration methods nevertheless produces pattern sets that lack diversity. We formally define data mining as a game that we solve using Monte Carlo Tree Search (MCTS), a recent technique mainly used for solving games and planning problems in artificial intelligence. Unlike traditional sampling methods, MCTS makes it possible to obtain a solution at any time without any assumption being made about either the quality measure or the data.
This enumeration method converges to an exhaustive approach if the available time and memory budgets are sufficient. The trade-off between exploration and exploitation offered by this approach allows a significant increase in the diversity of the computed pattern set. We show that Monte Carlo Tree Search applied to pattern mining quickly finds a set of diverse, high-quality patterns, using experiments on benchmark datasets and on a real dataset dealing with olfaction. We also propose and validate a new quality measure specifically designed for multi-label datasets with high variance in label frequencies.
The discovery of patterns that strongly distinguish one class label from another is still a challenging data-mining task. Subgroup Discovery (SD) is a formal pattern mining framework that enables the construction of intelligible classifiers and, most importantly, the elicitation of interesting hypotheses from the data. However, SD still faces two major issues: (i) how to define appropriate quality measures to characterize the interestingness of a pattern; (ii) how to select an accurate heuristic search technique when exhaustive enumeration of the pattern space is infeasible. The first issue has been tackled by Exceptional Model Mining (EMM), which discovers patterns covering tuples that locally induce a model substantially different from the model of the whole dataset. The second issue has been studied in SD and EMM mainly with the use of beam-search strategies and genetic algorithms for discovering a pattern set that is non-redundant, diverse and of high quality. In this thesis, we argue that the greedy nature of most such previous approaches produces pattern sets that lack diversity. Consequently, we formally define pattern mining as a game and solve it with Monte Carlo Tree Search (MCTS), a recent technique mainly used for games and planning problems in artificial intelligence. Contrary to traditional sampling methods, MCTS leads to an any-time pattern mining approach without assumptions on either the quality measure or the data. It converges to an exhaustive search if given enough time and memory. The exploration/exploitation trade-off allows the diversity of the result set to be improved considerably compared to existing heuristics. We show that MCTS quickly finds a diverse pattern set of high quality in our application in neurosciences. We also propose and validate a new quality measure especially tuned for imbalanced multi-label data.
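The exploration/exploitation trade-off mentioned above can be illustrated with the UCB1 selection rule that is commonly used inside MCTS. This is a minimal sketch under assumptions: the abstract does not say which selection policy the thesis uses, and the `Node` structure, the pattern strings, and the reward values below are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical search-tree node: one pattern plus its MCTS statistics."""
    pattern: str
    visits: int = 0
    total_reward: float = 0.0  # sum of quality-measure values seen in simulations
    children: List["Node"] = field(default_factory=list)


def ucb1(child: Node, parent_visits: int, c: float = math.sqrt(2)) -> float:
    """UCB1 score: mean reward (exploitation) plus a bonus for rarely visited nodes."""
    if child.visits == 0:
        return math.inf  # unvisited children are always tried first
    mean_reward = child.total_reward / child.visits
    bonus = c * math.sqrt(math.log(parent_visits) / child.visits)
    return mean_reward + bonus


def select_child(parent: Node) -> Node:
    """Pick the child with the highest UCB1 score."""
    return max(parent.children, key=lambda ch: ucb1(ch, parent.visits))


# Made-up statistics: "a" has the better mean but has been visited far more often.
root = Node("root", visits=10, children=[
    Node("a", visits=8, total_reward=4.0),  # mean 0.5
    Node("b", visits=2, total_reward=0.8),  # mean 0.4
])
chosen = select_child(root)
```

Here the less-visited child `b` is selected even though its mean reward is lower, because its exploration bonus is larger. This is the mechanism by which a tree search of this kind can keep visiting under-explored regions of the pattern space instead of greedily refining the current best pattern.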
Historically, airlines around the globe have used static pricing structures, which are constrained to discrete price points and offer limited segmentation between guests. Because of these limitations and constraints, there is a strong need for novel methods that estimate willingness to pay and identify potential guests whose propensity to book a flight increases if they receive a discount, in order to improve sales. This paper proposes a novel methodology to identify interesting subgroups whose chance of booking a flight increases if they receive a discount offer. The proposal includes a grammatically evolutionary feature selection algorithm that extracts the best subgroups by analyzing the booking behaviour of historical passengers. A real case scenario was considered in the experimental analysis, using private data from a commercial airline.
Extracting knowledge from high dimensional databases is a great challenge for companies, governments and researchers. Discriminative Patterns (DPs) is an area of data mining that aims to extract relevant and readable information from databases with a target attribute. Among the algorithms developed to search for DPs, those based on evolutionary computing stand out. However, evolutionary approaches typically (1) are not adapted to high dimensional problems and (2) have many nontrivial parameters. This paper presents SSDP (Simple Search Discriminative Patterns), an evolutionary approach for finding the top-k DPs that is adapted to high dimensional databases and uses only two easily adjustable external parameters.