•A new algorithm for performing one-class classification.
•First large-scale evaluation of one-class classification without tuning.
•Identification of recommended hyperparameter values for a range of algorithms.
One-class classification is a challenging subfield of machine learning in which so-called data descriptors are used to predict membership of a class based solely on positive examples of that class, and no counter-examples. A number of data descriptors that have been shown to perform well in previous studies of one-class classification, like the Support Vector Machine (SVM), require setting one or more hyperparameters. There has been no systematic attempt to date to determine optimal default values for these hyperparameters, which limits their ease of use, especially in comparison with hyperparameter-free proposals like the Isolation Forest (IF). We address this issue by determining optimal default hyperparameter values across a collection of 246 one-class classification problems derived from 50 different real-world datasets. In addition, we propose a new data descriptor, Average Localised Proximity (ALP), to address certain issues with existing approaches based on nearest neighbour distances. Finally, we evaluate classification performance using a leave-one-dataset-out procedure, and find strong evidence that ALP outperforms IF and a number of other data descriptors, as well as weak evidence that it outperforms SVM, making ALP a good default choice.
We provide a thorough treatment of one-class classification with hyperparameter optimisation for five data descriptors: Support Vector Machine (SVM), Nearest Neighbour Distance (NND), Localised Nearest Neighbour Distance (LNND), Local Outlier Factor (LOF) and Average Localised Proximity (ALP). The hyperparameters of SVM and LOF have to be optimised through cross-validation, while NND, LNND and ALP allow an efficient form of leave-one-out validation and the reuse of a single nearest-neighbour query. We experimentally evaluate the effect of hyperparameter optimisation with 246 classification problems drawn from 50 datasets. From a selection of optimisation algorithms, the recent Malherbe–Powell proposal optimises the hyperparameters of all data descriptors most efficiently. We calculate the increase in test AUROC and the amount of overfitting as a function of the number of hyperparameter evaluations. After 50 evaluations, ALP and SVM significantly outperform LOF, NND and LNND, while LOF and NND outperform LNND. The performance of ALP and SVM is comparable, but ALP can be optimised more efficiently and so constitutes a good default choice. Alternatively, using validation AUROC as a selection criterion between ALP and SVM gives the best overall result, and NND is the least computationally demanding option. We thus end up with a clear trade-off between three choices, allowing practitioners to make an informed decision.
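The nearest-neighbour-distance family of data descriptors discussed above can be illustrated with a minimal sketch. The code below is not the authors' implementation: it scores each test point by its Euclidean distance to the closest positive training example (the basic NND idea) and evaluates AUROC via the rank-sum statistic, all on illustrative synthetic data.

```python
import numpy as np

def nnd_scores(train, test):
    """Nearest Neighbour Distance (NND) descriptor: score each test point
    by its Euclidean distance to the closest positive training example;
    a larger score means more anomalous."""
    d = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=-1)
    return d.min(axis=1)

def auroc(scores_pos, scores_neg):
    """AUROC via the Mann-Whitney rank-sum statistic: the probability that
    a random negative (outlier) scores higher than a random positive."""
    s = np.concatenate([scores_pos, scores_neg])
    ranks = s.argsort().argsort() + 1          # 1-based ranks
    n_p, n_n = len(scores_pos), len(scores_neg)
    u = ranks[n_p:].sum() - n_n * (n_n + 1) / 2
    return u / (n_p * n_n)

rng = np.random.default_rng(0)
train = rng.normal(0, 1, size=(200, 2))        # positive examples only
test_pos = rng.normal(0, 1, size=(50, 2))      # held-out target class
test_neg = rng.normal(4, 1, size=(50, 2))      # outliers
print(auroc(nnd_scores(train, test_pos), nnd_scores(train, test_neg)))
```

With well-separated clusters like these, the AUROC is close to 1; LNND and ALP refine this idea by localising the distances, as described above.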
As computer and space technologies have developed, geoscience information systems (GIS) and remote sensing (RS) technologies, which deal with geospatial information, have been rapidly maturing. Moreover, over the last few decades, machine learning techniques including artificial neural networks (ANN), deep learning, decision trees, and support vector machines (SVM) have been successfully applied to geospatial science and engineering research fields. These machine learning techniques have been widely applied to GIS and RS research and have recently produced valuable results in the areas of geoscience, environment, natural hazards, and natural resources. This book is a collection of novel contributions detailing machine learning techniques as applied to geoscience information systems and remote sensing.
Machine learning-based remote-sensing techniques have been widely used for the production of specific land cover maps at a fine scale. P-learning is a collection of machine learning techniques for training class descriptors on positive samples only. Panax notoginseng is a rare medicinal plant, which has also been a highly regarded traditional Chinese medicine resource in China for hundreds of years. Until now, Panax notoginseng has scarcely been observed and monitored from space. Remote sensing of natural resources provides new insights into the resource inventory of Chinese materia medica resources, particularly of Panax notoginseng. Generally, land-cover mapping involves focusing on a number of landscape classes; however, sometimes only a subset or a single class is of interest. In this study, the Panax notoginseng field is that single class of interest. Such a situation makes single-class data descriptors (SCDDs) especially significant for specific land-cover interpretation. In this paper, we describe an application in which a stack of SCDDs was trained for remote-sensing mapping of Panax notoginseng fields through P-learning. We employed and compared SCDDs, i.e., the simple Gaussian target distribution, the robust Gaussian target distribution, the minimum covariance determinant Gaussian, the mixture of Gaussians, the auto-encoder neural network, k-means clustering, the self-organizing map, the minimum spanning tree, the k-nearest neighbor, the incremental support vector data description, the Parzen density estimator, and principal component analysis, as well as three ensemble classifiers, i.e., the mean, median, and voting combiners. Experiments demonstrate that most SCDDs achieve promising classification performance. Furthermore, this work utilized a set of elaborate samples manually collected at pixel level by experts, intended to serve as a benchmark dataset for future work.
Measuring the performance of SCDDs offers insights for defining selection criteria and scoring evidence for choosing a suitable SCDD for mapping a specific landscape class. As more remotely sensed satellite data of the study area accumulate, the spatial distribution of Panax notoginseng can be continuously derived for the local area on the basis of SCDDs.
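As an illustration of the simplest SCDD listed above, the simple Gaussian target distribution, the following sketch fits a mean and covariance on positive pixel samples only and accepts points whose Mahalanobis distance stays below a training quantile. The synthetic band values, the quantile threshold, and the function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fit_gaussian_scdd(X, quantile=0.95):
    """Simple Gaussian target distribution: fit mean and covariance on the
    positive samples only; the decision threshold is the chosen quantile of
    the squared Mahalanobis distances seen in training."""
    mu = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum('ij,jk,ik->i', X - mu, cov_inv, X - mu)
    return mu, cov_inv, np.quantile(d2, quantile)

def predict(mu, cov_inv, thr, X):
    """True = accepted as target class (e.g. a Panax notoginseng pixel)."""
    d2 = np.einsum('ij,jk,ik->i', X - mu, cov_inv, X - mu)
    return d2 <= thr

rng = np.random.default_rng(1)
pixels_target = rng.normal([0.4, 0.6], 0.05, size=(500, 2))  # synthetic band values
model = fit_gaussian_scdd(pixels_target)
print(predict(*model, np.array([[0.4, 0.6], [0.9, 0.1]])))   # in-class vs. far-off pixel
```

The robust and minimum-covariance-determinant variants mentioned in the abstract replace the plain mean/covariance estimates with outlier-resistant ones; the ensemble combiners then aggregate such per-descriptor decisions.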
Nowadays, the majority of media Internet traffic consists of H.264-encoded streaming videos due to the codec’s high compatibility. One of the most popular streaming technologies used to deliver videos over the Internet is Dynamic Adaptive Streaming over HTTP (DASH). It transmits the video as a sequence of independent short video segments tailored to the receiver’s limitations (related to several factors such as available bandwidth or reception resolution), aiming to enhance the Quality of Experience. In this paper, we present two types of datasets created from 4,065 two-second video segments. The first type consists of features related to color, space, and time extracted from the segments across different vertical resolutions (240, 360, 480, 720, 1080, 1440 and 4K). The second one includes several quality metrics obtained from the same segments when they are encoded at different compression levels in the different resolutions.
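The abstract does not specify the exact spatial and temporal features; one standard choice for such descriptors of video segments is the spatial and temporal information (SI/TI) measures of ITU-T P.910. The sketch below computes SI/TI with plain NumPy on synthetic luma frames, as an assumed approximation of features like those described, not the dataset's actual extraction pipeline.

```python
import numpy as np

def spatial_information(frame):
    """SI in the spirit of ITU-T P.910: standard deviation of the
    gradient magnitude of a single luma frame."""
    gy, gx = np.gradient(frame.astype(float))
    return np.hypot(gx, gy).std()

def temporal_information(frames):
    """TI: maximum over time of the per-frame standard deviation of
    successive luma-frame differences."""
    diffs = np.diff(frames.astype(float), axis=0)
    return diffs.std(axis=(1, 2)).max()

rng = np.random.default_rng(3)
frames = rng.integers(0, 256, size=(10, 240, 426))  # synthetic 240p luma segment
print(spatial_information(frames[0]), temporal_information(frames))
```

High-SI/high-TI segments are harder to compress, which is why such features pair naturally with the per-resolution quality metrics in the second dataset type.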
In studies of the relief evolution of smaller landforms, up to several dozen meters in width/diameter, digital elevation models (DEMs) freely accessible in different repositories may be insufficient in terms of resolution. Existing geophysical or photogrammetric equipment is not always available due to costs, conditions and regulations, especially for students or young researchers. An alternative may be the hand-held, ground-based Structure from Motion technique. It allows free high-resolution DEMs (~0.05 m) to be obtained using open-source software. The method was tested on kettle holes of glacial flood origin on Skeiðarársandur (S Iceland). The material was collected in 2022 at two outwash levels of different ages and vegetation cover. The dataset is available in the Zenodo repository; the first part is data processed into point clouds and DEMs, and the second includes original videos in MOV format. The data can be used as a reference to assess changes in the kettle hole relief in subsequent research seasons, as a methodological study for other projects, or for didactic purposes.
Patent data represent a significant source of information on innovation, knowledge production, and the evolution of technology through networks of citations, co-invention and co-assignment. A major obstacle to extracting useful information from this data is the problem of name disambiguation: linking alternate spellings of individuals or institutions to a single identifier to uniquely determine the parties involved in knowledge production and diffusion. In this paper, we describe a new algorithm that uses high-resolution geolocation to disambiguate both inventors and assignees on about 8.5 million patents found in the European Patent Office (EPO), under the Patent Cooperation Treaty (PCT), and in the US Patent and Trademark Office (USPTO). We show this disambiguation is consistent with a number of ground-truth benchmarks of both assignees and inventors, significantly outperforming the use of undisambiguated names to identify unique entities. A significant benefit of this work is high-quality assignee disambiguation with worldwide coverage, coupled with inventor disambiguation across multiple patent offices that is competitive with other state-of-the-art approaches.
Paleoclimatic data are used in eco-evolutionary models to improve knowledge of the biogeographical processes that drive patterns of biodiversity through time, opening windows into past climate-biodiversity dynamics. Applying these models to harmonised simulations of past and future climatic change can strengthen forecasts of biodiversity change. StableClim provides continuous estimates of climate stability from 21,000 years ago to 2100 C.E. for ocean and terrestrial realms at spatial scales that include biogeographic regions and climate zones. Climate stability is quantified using annual trends and variabilities in air temperature and precipitation, and the associated signal-to-noise ratios. Thresholds of natural variability in regional- and global-mean temperature trends allow periods in Earth's history when climatic conditions were warming or cooling rapidly (or slowly) to be identified, and climate stability to be estimated locally (per grid cell) during these periods of accelerated change. Model simulations are validated against independent paleoclimate and observational data. Projections of climatic stability, accessed through StableClim, will improve understanding of the roles of climate in shaping past, present-day and future patterns of biodiversity.
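The trend-and-variability signal-to-noise idea can be sketched as follows. Assuming, for illustration (StableClim's exact definition may differ), that the signal is the magnitude of the fitted linear trend over a window and the noise is the standard deviation of the detrended residuals:

```python
import numpy as np

def trend_snr(series, years):
    """Signal-to-noise ratio of a climate series (illustrative definition):
    total fitted linear change over the window divided by the standard
    deviation of the detrended residuals. Low SNR ~ a stable climate."""
    slope, intercept = np.polyfit(years, series, 1)
    resid = series - (slope * years + intercept)
    noise = resid.std(ddof=1)
    return abs(slope) * (years[-1] - years[0]) / noise

rng = np.random.default_rng(2)
years = np.arange(1950, 2021)
stable = 14.0 + rng.normal(0, 0.3, years.size)                        # no trend
warming = 14.0 + 0.02 * (years - 1950) + rng.normal(0, 0.3, years.size)
print(trend_snr(stable, years), trend_snr(warming, years))
```

The warming series yields a much larger SNR than the stable one, mirroring how grid cells with trends exceeding natural-variability thresholds would be flagged as periods of accelerated change.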
This paper describes a dataset capturing insider trading activity at publicly traded companies. Investors and investment analysts demand this information because executives, directors and large shareholders are expected to have more intimate knowledge of their company’s prospects than outsiders. Insider stock sales and purchases may reveal information about the firm’s business not disclosed in financial statements. They may also convey new information predictive of stock price movements if insiders can better interpret public information about the firm. Since mid-2003, the Securities and Exchange Commission has made these insider trading reports available to the public in a structured format; however, most academic papers use proprietary commercial databases instead of the regulatory filings directly. This makes replication challenging, as the data manipulation and aggregation processes are opaque and historical records could be altered by the database provider over time. To overcome these limitations, the presented dataset is created from the original regulatory filings; it is updated daily and includes all information reported by insiders without alteration.
Vehicular Ad-Hoc Networks (VANETs) were introduced to avoid vehicular-related accidents and to improve the safety of both vehicular passengers and other road users. In VANETs, the vehicles are expected to communicate with neighbouring vehicles to increase awareness of their surroundings by using V2V (vehicle-to-vehicle) communication links. Since the introduction of VANETs, much research has focused on developing state-of-the-art algorithms to increase safety. However, real-world testing of these algorithms has become challenging due to the high costs involved and multiple practical constraints. Therefore, simulation-based testing is commonly used for VANET-related applications. Using real datasets inside a simulation can significantly increase the accuracy of the results and help to achieve realistic outcomes. In this study, we present a dataset called ’CN+’, which consists of more than 25,000 vehicles collected over 32 hours at a signalized intersection in Bremen, Germany.