• Application of conditional Generative Adversarial Networks as an oversampling method.
• Generates minority class samples by recovering the training data distribution.
• Outperforms various standard oversampling algorithms.
• Performance advantage of the proposed method remains stable with higher imbalance ratios.
Learning from imbalanced datasets is a frequent but challenging task for standard classification algorithms. Although there are different strategies to address this problem, methods that generate artificial data for the minority class constitute a more general approach than algorithmic modifications. Standard oversampling methods are variations of the SMOTE algorithm, which generates synthetic samples along the line segment that joins minority class samples. Therefore, these approaches are based on local information, rather than on the overall minority class distribution. Contrary to these algorithms, in this paper the conditional version of Generative Adversarial Networks (cGAN) is used to approximate the true data distribution and generate data for the minority class of various imbalanced datasets. The performance of cGAN is compared against multiple standard oversampling algorithms. We present empirical results that show a significant improvement in the quality of the generated data when cGAN is used as an oversampling algorithm.
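The SMOTE interpolation step described above (generating synthetic samples along the line segment joining a minority sample to one of its nearest minority-class neighbours) can be sketched in a few lines of NumPy. The function name, toy data, and parameter values below are illustrative, not taken from the paper.

```python
import numpy as np

def smote_sample(X_min, k=2, n_new=4, seed=0):
    """Generate synthetic minority samples by interpolating along the
    line segment joining a minority sample to one of its k nearest
    minority-class neighbours (the core SMOTE step)."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        x = X_min[i]
        # distances to every minority sample; index 0 after sorting is x itself
        d = np.linalg.norm(X_min - x, axis=1)
        neighbours = np.argsort(d)[1:k + 1]
        nb = X_min[rng.choice(neighbours)]
        gap = rng.random()          # random position on the segment
        synthetic.append(x + gap * (nb - x))
    return np.array(synthetic)

# toy minority class: four points at the corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_sample(X_min, k=2, n_new=3)
```

Because every synthetic point is a convex combination of two existing minority samples, the generated data stays inside the region spanned by the minority class, which is exactly the "local information" limitation the abstract contrasts with the cGAN approach.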
Learning from class-imbalanced data continues to be a common and challenging problem in supervised learning as standard classification algorithms are designed to handle balanced class distributions. While different strategies exist to tackle this problem, methods which generate artificial data to achieve a balanced class distribution are more versatile than modifications to the classification algorithm. Such techniques, called oversamplers, modify the training data, allowing any classifier to be used with class-imbalanced datasets. Many algorithms have been proposed for this task, but most are complex and tend to generate unnecessary noise. This work presents a simple and effective oversampling method based on k-means clustering and SMOTE (synthetic minority oversampling technique), which avoids the generation of noise and effectively overcomes imbalances between and within classes. Empirical results of extensive experiments with 90 datasets show that training data oversampled with the proposed method improves classification results. Moreover, k-means SMOTE consistently outperforms other popular oversampling methods. An implementation of k-means SMOTE is made available in the Python programming language at https://github.com/felix-last/kmeans_smote.
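The two-step idea behind k-means SMOTE (cluster first, then interpolate only inside minority-dominated clusters) can be sketched as follows. This is a simplified illustration under our own assumptions, not the released implementation: the two-cluster k-means, the deterministic initialisation, the 0.5 dominance threshold, and the toy data are all invented for the example.

```python
import numpy as np

def kmeans_smote(X, y, n_new=4, seed=0):
    """Two-step sketch: cluster the input space (2-cluster k-means with a
    deterministic farthest-point init), then interpolate between minority
    samples (label 1) only inside clusters dominated by the minority
    class, keeping synthetic points away from majority regions."""
    rng = np.random.default_rng(seed)
    centers = np.stack([X[0], X[np.linalg.norm(X - X[0], axis=1).argmax()]])
    for _ in range(20):                      # Lloyd's algorithm
        labels = ((X[:, None] - centers) ** 2).sum(-1).argmin(1)
        centers = np.stack([X[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(2)])
    synthetic = []
    for j in range(2):
        members = X[(labels == j) & (y == 1)]
        # skip clusters that are too small or not mostly minority class
        if len(members) < 2 or (y[labels == j] == 1).mean() <= 0.5:
            continue
        for _ in range(n_new):
            a, b = members[rng.choice(len(members), 2, replace=False)]
            synthetic.append(a + rng.random() * (b - a))
    return np.array(synthetic)

# toy data: minority cluster near the origin, majority cluster near (5, 5)
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y = np.array([1, 1, 1, 0, 0, 0])
X_new = kmeans_smote(X, y, n_new=3)
```

Restricting interpolation to minority-dominated clusters is what "avoids the generation of noise": no synthetic point is placed between minority samples that sit in different, majority-dominated regions of the space.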
Organized retail crime (ORC) is a significant issue for retailers, marketplace platforms, and consumers. Its prevalence and influence have grown rapidly in lockstep with the expansion of online commerce, digital devices, and communication platforms. Today, it is a costly affair, wreaking havoc on enterprises' overall revenues and continually jeopardizing community security. These negative consequences are set to reach unprecedented levels as more people and devices connect to the Internet. Detecting and responding to these criminal acts as early as possible is critical for protecting consumers and businesses, while also keeping an eye on rising patterns of fraud. Fraud detection in general has been studied widely, especially in financial services, but studies focusing on organized retail crime are extremely rare in the literature. To contribute to the knowledge base in this area, we present a scalable machine learning strategy for detecting and isolating ORC listings posted on a prominent marketplace platform by merchants committing organized retail crime or fraud. We employ a supervised learning approach to classify listings as fraudulent or legitimate based on past data from buyer and seller behaviors and transactions on the platform. The proposed framework combines bespoke data preprocessing procedures, feature selection methods, and state-of-the-art class asymmetry resolution techniques to search for classification algorithms capable of discriminating between fraudulent and legitimate listings in this context. Our best detection model obtains a recall score of 0.97 on the holdout set and 0.94 on the out-of-sample testing data set. We achieve these results based on a select set of 45 features out of 58.
E-learning systems have witnessed an increase in usage and research over the past decade. This article presents the e-learning concepts ecosystem and summarizes the various scopes of e-learning studies. We propose a theoretical framework for e-learning based upon three principal dimensions: users, technology, and services related to e-learning. The article presents an in-depth literature review on those dimensions. It first traces the related concepts of computer use in learning across time, revealing the emergence of new trends in e-learning. The theoretical framework is a contribution for guiding e-learning studies. The article classifies the stakeholder groups and their relationship with e-learning systems, and the framework presents a typology of e-learning systems' services. This theoretical approach integrates learning strategies, technologies, and stakeholders.
The generation of synthetic data can be used for anonymization, regularization, oversampling, semi-supervised learning, self-supervised learning, and several other tasks. Such broad potential motivated the development of new algorithms, specialized in data generation for specific data formats and Machine Learning (ML) tasks. However, one of the most common data formats used in industrial applications, tabular data, is generally overlooked: literature analyses are scarce, state-of-the-art methods are spread across domains and ML tasks, and there is little to no distinction among the main types of mechanism underlying synthetic data generation algorithms. In this paper, we analyze tabular and latent space synthetic data generation algorithms. Specifically, we propose a unified taxonomy as an extension and generalization of previous taxonomies, review 70 generation algorithms across six ML problems, distinguish the main generation mechanisms identified into six categories, describe each type of generation mechanism, discuss metrics to evaluate the quality of synthetic data, and provide recommendations for future research. We expect this study to assist researchers and practitioners in identifying relevant gaps in the literature and designing better and more informed practices with synthetic data.
Portugal has the highest density of wildfire ignitions among southern European countries. The ability to predict the spatial patterns of ignitions constitutes an important tool for managers, helping to improve the effectiveness of fire prevention, detection, and the allocation of firefighting resources. In this study, we analyzed 127,490 ignitions that occurred in Portugal during a 5-year period. We used logistic regression models to predict the likelihood of ignition occurrence from a set of potentially explanatory variables, and produced an ignition risk map for the Portuguese mainland. Results show that population density, human accessibility, land cover, and elevation are important determinants of the spatial distribution of fire ignitions. We demonstrate that it is possible to predict the spatial patterns of ignitions at the national level with good accuracy using a small number of easily obtainable variables, which can be useful in decision-making for wildfire management.
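As an illustration of the modelling approach, the following NumPy sketch fits a logistic regression by gradient descent on hypothetical standardized predictors standing in for variables such as population density and elevation. The data-generating rule, coefficients, and sample size are invented for the example, not the study's data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# hypothetical predictors scaled to [0, 1]; the "true" coefficients
# below are invented purely to simulate labelled data
pop_density = rng.random(n)
elevation = rng.random(n)
p_true = 1 / (1 + np.exp(-(4 * pop_density - 3 * elevation - 0.5)))
y = (rng.random(n) < p_true).astype(float)   # 1 = ignition occurred

X = np.column_stack([np.ones(n), pop_density, elevation])  # with intercept
w = np.zeros(3)
for _ in range(2000):               # plain gradient descent on the log-loss
    p = 1 / (1 + np.exp(-X @ w))
    w -= 0.1 * X.T @ (p - y) / n

risk = 1 / (1 + np.exp(-X @ w))     # predicted ignition likelihood per location
```

Evaluating the fitted model over a spatial grid of such predictors is what produces a risk map like the one described in the abstract: each cell receives a probability of ignition occurrence.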
In the age of the data deluge, there are still many domains and applications restricted to the use of small datasets. The ability to harness these small datasets to solve problems through the use of supervised learning methods can have a significant impact in many important areas. The insufficient size of training data usually results in unsatisfactory performance of machine learning algorithms. The current research work aims to help mitigate the small data problem through the creation of artificial instances, which are added to the training process. The proposed algorithm, the Geometric Small Data Oversampling Technique, uses geometric regions around existing samples to generate new high-quality instances. Experimental results show a significant improvement in accuracy when compared with the use of the initial small dataset, as well as with other popular artificial data generation techniques.
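The general idea of generating instances inside geometric regions around existing samples can be illustrated as follows. This is a generic hypersphere-sampling sketch under our own assumptions, not the paper's exact Geometric Small Data Oversampling Technique; the radius, counts, and toy data are invented.

```python
import numpy as np

def geometric_oversample(X, radius=0.1, n_new=5, seed=0):
    """Illustrative sketch (not the paper's exact algorithm): draw new
    instances uniformly inside a small hypersphere centred on randomly
    chosen existing samples."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(len(X), size=n_new)
    # uniform random directions, scaled so points fill the sphere uniformly
    d = rng.normal(size=(n_new, X.shape[1]))
    d /= np.linalg.norm(d, axis=1, keepdims=True)
    r = radius * rng.random(n_new) ** (1 / X.shape[1])
    return X[idx] + d * r[:, None]

X = np.array([[0.0, 0.0], [1.0, 1.0]])     # tiny toy dataset
X_new = geometric_oversample(X, radius=0.1, n_new=4)
```

Keeping the region small relative to the data spread is what makes the new instances plausible: each one stays close to an observed sample while still enlarging the training set.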
Competitive Intelligence allows an organization to keep up with market trends and foresee business opportunities. This practice is mainly performed by analysts scanning for any piece of valuable information in a myriad of dispersed and unstructured sources. Here we present MapIntel, a system for acquiring intelligence from vast collections of text data by representing each document as a multidimensional vector that captures its own semantics. The system is designed to handle complex Natural Language queries and visual exploration of the corpus, potentially aiding overburdened analysts in finding meaningful insights to help decision‐making. The system's searching module uses a retriever and re‐ranker engine that first finds the closest neighbours to the query embedding and then sifts the results through a cross‐encoder model that identifies the most relevant documents. The browsing or visualization module also leverages the embeddings by projecting them onto two dimensions while preserving the multidimensional landscape, resulting in a map where semantically related documents form topical clusters, which we capture using topic modelling. This map aims at promoting a fast overview of the corpus while allowing a more detailed exploration and an interactive information-encountering process. We evaluate the system and its components on the 20 newsgroups data set, using the semantic document labels provided, and demonstrate the superiority of Transformer‐based components. Finally, we present a prototype of the system in Python and show how some of its features can be used to acquire intelligence from a news article corpus we collected during a period of 8 months.
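The first-stage retrieval over document embeddings can be sketched with plain cosine similarity. The embeddings below are toy vectors, and the cross-encoder re-ranking stage is only indicated in a comment, since it requires a trained model.

```python
import numpy as np

def retrieve(query_vec, doc_vecs, top_k=3):
    """First-stage retriever: rank documents by cosine similarity between
    the query embedding and each document embedding. (In MapIntel the
    returned candidates are then re-scored by a cross-encoder model,
    which needs a trained network and is omitted here.)"""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
    return np.argsort(sims)[::-1][:top_k]

# toy document embeddings: two near-duplicates and one unrelated vector
docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
hits = retrieve(np.array([1.0, 0.05]), docs, top_k=2)
```

The cheap similarity pass narrows the corpus to a handful of candidates so that the expensive cross-encoder only has to score a small set, which is the standard retriever/re-ranker trade-off the abstract describes.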
• Investigating factors affecting users' continuance intention of using food delivery apps during the COVID-19 pandemic period.
• Proposing a comprehensive model integrating UTAUT, ECM and the Task-Technology Fit model.
• Explaining that users' continuance usage intention is determined by their technological and mental perceptions.
• Perceived task-technology fit formulates users' perceptions when the technology's functions meet their requirements.
Food delivery apps (FDAs), an emerging online-to-offline mobile technology, have been widely adopted by catering businesses and customers, especially as they have provided mutually beneficial catering delivery services, rescuing catering enterprises and satisfying customers' technological and mental expectations under the COVID-19 global pandemic. This study proposes a comprehensive model integrating UTAUT, ECM and TTF with the trust factor, and examines 532 valid FDA users' continuance intention of using FDAs during the COVID-19 pandemic period in China. The statistical results and discussion show that satisfaction is the most significant factor, and that perceived task-technology fit, trust, performance expectancy, social influence and confirmation have direct or indirect positive impacts on users' continuance usage intention of FDAs during the COVID-19 pandemic period. In addition, researchers and stakeholders should consider the specific characteristics of the technology associated with users' technological and mental perceptions to better understand and explain users' continuance intention.
This paper analyzes the digital development of 110 countries and its relationship with economic development. Using factor analysis, we combined seven ICT-related variables into a single measure of digital development. This measure was then used as the dependent variable in an OLS model that allows non-linear effects, with the GDP per capita of countries as the explanatory variable. Our findings are substantive in that the correlation between economic and digital development was found not to be linear, being much stronger in poorer countries, a finding not commonly seen in the literature. As a result, future studies that focus on the relationship between economic and digital development may benefit from our findings by postulating this type of relationship. Our model explains 83% of the variation in the digital development of countries, compared to just 72% when considering only a linear relationship.
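The gain from allowing a non-linear effect can be illustrated with simulated data: fitting a digital-development measure against GDP per capita once linearly and once through a logarithmic transformation, then comparing R-squared. All numbers below are simulated for illustration, not the study's data, and the logarithmic form is our assumption of one simple concave specification.

```python
import numpy as np

rng = np.random.default_rng(0)
# simulated data: digital development rises steeply with income among
# poorer countries and flattens for richer ones (a concave relationship)
gdp = rng.uniform(1, 80, 110)                  # GDP per capita, thousands
digital = np.log(gdp) + rng.normal(0, 0.15, 110)

def r_squared(design, y):
    """R-squared of an OLS fit for a given design matrix (with intercept)."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return 1 - (y - design @ beta).var() / y.var()

linear = r_squared(np.column_stack([np.ones(110), gdp]), digital)
nonlin = r_squared(np.column_stack([np.ones(110), np.log(gdp)]), digital)
```

When the underlying relationship is concave, the transformed model captures markedly more variance than the straight line, mirroring the 83% versus 72% contrast reported in the abstract.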