Dataset search: a survey Chapman, Adriane; Simperl, Elena; Koesten, Laura ...
The VLDB Journal, 2020/1, Volume 29, Issue 1
Journal Article
Open access
Generating value from data requires the ability to find, access and make sense of datasets. There are many efforts underway to encourage data sharing and reuse, from scientific publishers asking authors to submit data alongside manuscripts to data marketplaces, open data portals and data communities. Google recently beta-released a search service for datasets, which allows users to discover data stored in various online repositories via keyword queries. These developments foreshadow an emerging research field around dataset search or retrieval that broadly encompasses frameworks, methods and tools that help match a user data need against a collection of datasets. Here, we survey the state of the art of research and commercial systems and discuss what makes dataset search a field in its own right, with unique challenges and open questions. We look at approaches and implementations from related areas that dataset search draws upon, including information retrieval, databases, and entity-centric and tabular search, in order to identify possible paths to tackle these questions as well as immediate next steps that will take the field forward.
Cooperative machine learning has many applications, such as data annotation, where an initial model trained with partially labeled data is used to predict labels for unseen data continuously. Predicted labels with a low confidence value are manually revised to allow the model to be retrained with the predicted and revised data. In this paper, we propose an alternative to this approach: an initial training process called Deep Unsupervised Active Learning. Using the proposed training scheme, a classification model can incrementally acquire new knowledge during the testing phase without manual guidance or correction of decision making. The training process consists of two stages: a first stage of supervised training using a classification model, and an unsupervised active learning stage during the test phase. The labels predicted with high confidence during the test phase are continuously used to extend the knowledge base of the model. To optimize the proposed method, the model must have a high initial recognition rate. To this end, we exploited the Visual Geometry Group (VGG16) pre-trained model applied to three datasets: Mathematical Image Analysis (AMI), University of Science and Technology Beijing (USTB2), and Annotated Web Ears (AWE). The approach performed well, and coloring the USTB2 images with a Generative Adversarial Network (GAN) significantly improved the recognition rate on that dataset. The obtained results compare well with current methods: the recognition rates are 100.00%, 98.33%, and 51.25% for the USTB2, AMI, and AWE datasets, respectively.
The massive increase in classroom video data enables the possibility of utilizing artificial intelligence technology to automatically recognize, detect and caption students’ behaviors. This is beneficial for related research, e.g., pedagogy and educational psychology. However, the lack of a dataset specifically designed for students’ classroom behaviors may hinder these potential studies. This paper presents a comprehensive dataset that can be employed for recognizing, detecting, and captioning students’ behaviors in a classroom. We collected videos of 128 classes in different disciplines and in 11 classrooms. Specifically, the constructed dataset consists of a detection part, a recognition part, and a captioning part. The detection part includes a temporal detection data module with 4542 samples and an action detection data module with 3343 samples, whereas the recognition part contains 4276 samples and the captioning part contains 4296 samples. Moreover, the students’ behaviors are spontaneous in real classes, rendering the dataset representative and realistic. We analyze the special characteristics of the classroom scene and the technical difficulties for each module (task), which are verified by experiments. Due to the particularity of classrooms, our dataset places greater demands on existing methods. Moreover, we provide a baseline for each task module in the dataset and make a comparison with the current mainstream datasets. The results show that our dataset is viable and reliable. Additionally, we present a thorough performance analysis of each baseline model to provide a comprehensive comparison for models using our presented dataset. The dataset and code are available to download online: https://github.com/BNU-Wu/Student-Class-Behavior-Dataset/tree/master
The success of monocular depth estimation relies on large and diverse training sets. Due to the challenges associated with acquiring dense ground-truth depth across different environments at scale, a number of datasets with distinct characteristics and biases have emerged. We develop tools that enable mixing multiple datasets during training, even if their annotations are incompatible. In particular, we propose a robust training objective that is invariant to changes in depth range and scale, advocate the use of principled multi-objective learning to combine data from different sources, and highlight the importance of pretraining encoders on auxiliary tasks. Armed with these tools, we experiment with five diverse training datasets, including a new, massive data source: 3D films. To demonstrate the generalization power of our approach we use zero-shot cross-dataset transfer, i.e., we evaluate on datasets that were not seen during training. The experiments confirm that mixing data from complementary sources greatly improves monocular depth estimation. Our approach clearly outperforms competing methods across diverse datasets, setting a new state of the art for monocular depth estimation.
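As a rough illustration of what an objective "invariant to changes in depth range and scale" can look like, the sketch below aligns a prediction to the target with an ordinary least-squares scale and shift before measuring the error. This is an assumption made for illustration only: the abstract does not give the authors' formulation, and the function name scale_shift_invariant_mse is hypothetical.

```python
def scale_shift_invariant_mse(pred, target):
    """Mean squared error after the best least-squares scale and shift
    have been applied to pred (illustrative stand-in, not the paper's loss)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(target) / n
    # Sum of squared deviations of the prediction; zero means a constant
    # prediction, for which no scale can be estimated.
    ss_p = sum((p - mean_p) ** 2 for p in pred)
    if ss_p == 0:
        raise ValueError("prediction is constant; scale is undefined")
    # Closed-form solution of min over (scale, shift) of
    # sum((scale * p + shift - g) ** 2).
    cross = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, target))
    scale = cross / ss_p
    shift = mean_g - scale * mean_p
    return sum((scale * p + shift - g) ** 2 for p, g in zip(pred, target)) / n
```

Because the alignment is refitted for every input, rescaling or offsetting the prediction leaves the value unchanged, which is the property that lets annotations with different ranges and scales be mixed.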
High-precision precipitation products are the basis of precipitation-related research. Based on 27 global climate models (GCMs) in the Coupled Model Intercomparison Project phase 6 (CMIP6), we designed eight schemes for comprehensively using the empirical quantile mapping (EQM) method and data ensemble method to conduct precipitation bias correction; then, we selected the scheme with the highest accuracy as the final bias correction scheme. Using the selected bias correction scheme, we created a monthly precipitation dataset with a 1° spatial resolution, which spans the historical period of 1961–2014 and the future period of 2015–2099 under three shared socioeconomic pathway (SSP) scenarios: SSP126, SSP245, and SSP585. The corrected precipitation data were validated using the CN05.1 grid precipitation dataset from the China Meteorological Data Sharing Network and were compared with the ERA5 precipitation data from the European Centre for Medium-Range Weather Forecasts. The dataset was also utilized for future prediction of alternating drought and flood events in China. The results show that the best bias correction scheme first integrates precipitation simulation data from 27 GCMs using the random forest (RF) model and then applies the EQM method to further correct the integrated precipitation data. The corrected precipitation data are better than the original GCM precipitation data in terms of both the monthly precipitation and extreme precipitation. From the perspective of the monthly precipitation, the difference between the ERA5 and RF-EQM is small, but the extreme precipitation of the RF-EQM clearly outperforms the ERA5 extreme precipitation.
For the annual maximum (minimum) monthly precipitation, the correlation coefficient, the RMSD (standardized), and the STD (standardized) between the ERA5 and CN05.1 are 0.925 (0.743), 0.474 (1.223), and 1.207 (1.765), respectively; the correlation coefficient, the RMSD (standardized), and the STD (standardized) between the RF-EQM and CN05.1 are 0.947 (0.735), 0.337 (0.837), and 0.849 (1.226), respectively. The occurrence frequency of DF (an abrupt change from drought to flood) events is continuously increasing in all scenarios, with the highest frequency observed under the SSP585 scenario. The increase in FD (an abrupt change from flood to drought) event frequency is not pronounced. This study expands the method for bias correction of meteorological data and provides a reference for other climate parameters and precipitation bias correction in other regions.
• This study proposes RF-EQM, a bias correction method that utilizes random forest and empirical quantile mapping.
• Using 27 global climate models, this study created a high-precision monthly precipitation dataset spanning 1961–2099.
• The new precipitation dataset encompasses three shared socioeconomic pathway scenarios (SSP126, SSP245, and SSP585).
• A continuous increase in abrupt transitions from drought to flood events in China is predicted, especially under SSP585.
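The empirical quantile mapping step named above can be sketched in a few lines. This is a minimal, generic version, assuming linear interpolation between sorted samples and clamping of values outside the historical range; it is not the study's implementation and omits the random forest ensemble stage entirely. The names empirical_cdf, empirical_quantile and eqm_correct are hypothetical.

```python
from bisect import bisect_right


def empirical_cdf(sorted_sample, value):
    """Interpolated non-exceedance probability of value in an ascending sample."""
    n = len(sorted_sample)
    if value <= sorted_sample[0]:
        return 0.0
    if value >= sorted_sample[-1]:
        return 1.0
    hi = bisect_right(sorted_sample, value)  # first element greater than value
    lo = hi - 1
    frac = (value - sorted_sample[lo]) / (sorted_sample[hi] - sorted_sample[lo])
    return (lo + frac) / (n - 1)


def empirical_quantile(sorted_sample, q):
    """Interpolated quantile of an ascending sample for q in [0, 1]."""
    pos = q * (len(sorted_sample) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_sample) - 1)
    frac = pos - lo
    return sorted_sample[lo] * (1 - frac) + sorted_sample[hi] * frac


def eqm_correct(value, model_hist, obs_hist):
    """Map a model value to the observed value at the same empirical quantile."""
    q = empirical_cdf(sorted(model_hist), value)
    return empirical_quantile(sorted(obs_hist), q)
```

For example, with hypothetical model values [10, 20, 30, 40] and observations [5, 10, 15, 20], a model value of 25 sits at the median of the model sample and is mapped to the observed median, 12.5. Values beyond the historical range are clamped here, which is a simplification; operational schemes usually extrapolate the tails.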
► A human action recognition approach based on human silhouettes is presented.
► The method relies on the contour points and learns sequences of key poses.
► Single- and multi-view scenarios are supported and successfully recognized.
► Promising success rates are achieved, showing suitability for real-time scenarios.
► The method shows high resistance to inter-actor variance.
In this paper, a human action recognition method is presented in which pose representation is based on the contour points of the human silhouette and actions are learned by making use of sequences of multi-view key poses. Our contribution is twofold. Firstly, our approach achieves state-of-the-art success rates without compromising the speed of the recognition process and therefore showing suitability for online recognition and real-time scenarios. Secondly, dissimilarities among different actors performing the same action are handled by taking into account variations in shape (shifting the test data to the known domain of key poses) and speed (considering inconsistent time scales in the classification). Experimental results on the publicly available Weizmann, MuHAVi and IXMAS datasets return high and stable success rates, achieving, to the best of our knowledge, the best rate so far on the MuHAVi Novel Actor test.
Recognizing cross-subject emotions based on brain imaging data, e.g., EEG, has always been difficult due to the poor generalizability of features across subjects. Thus, systematically exploring the ability of different EEG features to identify emotional information across subjects is crucial. Prior related work has explored this question based only on one or two kinds of features, and different findings and conclusions have been presented. In this work, we aim at a more comprehensive investigation of this question with a wider range of feature types, including 18 kinds of linear and non-linear EEG features. The effectiveness of these features was examined on two publicly accessible datasets, namely, the dataset for emotion analysis using physiological signals (DEAP) and the SJTU emotion EEG dataset (SEED). We adopted the support vector machine (SVM) approach and the "leave-one-subject-out" verification strategy to evaluate recognition performance. Using automatic feature selection methods, we reached the highest mean recognition accuracies of 59.06% (AUC = 0.605) on the DEAP dataset and 83.33% (AUC = 0.904) on the SEED dataset. Furthermore, using manually operated feature selection on the SEED dataset, we explored the importance of different EEG features in cross-subject emotion recognition from multiple perspectives, including different channels, brain regions, rhythms, and feature types. For example, we found that the Hjorth parameter of mobility in the beta rhythm achieved the best mean recognition accuracy compared to the other features. Through a pilot correlation analysis, we further examined the highly correlated features, for a better understanding of the implications hidden in those features that allow for differentiating cross-subject emotions. Various remarkable observations have been made. The results of this paper validate the possibility of exploring robust EEG features in cross-subject emotion recognition.
This paper is a comprehensive survey of datasets for surgical tool detection and related surgical data science and machine learning techniques and algorithms. The survey offers a high-level perspective of current research in this area, analyses the taxonomy of approaches adopted by researchers using surgical tool datasets, and addresses key areas of research, such as the datasets used, evaluation metrics applied and deep learning techniques utilised. Our presentation and taxonomy provide a framework that facilitates greater understanding of current work, and highlight the challenges and opportunities for further innovative and useful research.
Understanding the spatial and temporal occurrences of extreme rainfall events and their changes is crucial for reducing the risk associated with extreme events, especially under anthropogenic warming. This study presents a comparative analysis of twelve gridded rainfall datasets in representing the spatial and temporal variation of extreme rainfall events and trends in the recent past (1983–2015) in India. The selected datasets fall into four categories: gauge-based (APHRODITE, CPC, GPCC, REGEN), satellite-derived (CHIRPS, PERSIANN-CDR), reanalysis (ERA5, MERRA-2, PGF, and JRA-55), and products merged from these three types (WFDEI and MSWEP). For comparison, a gauge-based, high-resolution (0.25° × 0.25°) daily gridded rainfall dataset prepared by the India Meteorological Department (IMD) is used as a reference dataset. Eleven extreme climate indices, defined by the World Meteorological Organization's Expert Team on Climate Change Detection and Indices (ETCCDI) and the IMD, are used to represent the magnitude, frequency, and duration of extreme rainfall events in India. The relative performance of these datasets in capturing extreme rainfall events is evaluated by computing model evaluation metrics, i.e., the correlation coefficient (CC) and percentage bias (PBIAS). Trends in extreme rainfall indices and their magnitudes are calculated using the Mann-Kendall and Sen's slope methods, respectively, to compare with the IMD dataset. Our study finds large uncertainties in representing extreme rainfall events, where in most cases the datasets underestimate higher extreme events compared to the IMD data. GPCC and MSWEP better capture the magnitude, duration, and frequency of extreme rainfall events in India and are comparable to the pattern of the IMD dataset. CHIRPS and ERA5 perform better than other satellite and reanalysis rainfall datasets. All these selected datasets consistently underestimate extreme rainfall events over the northern Himalayan region.
The gridded datasets are unable to capture the spatial pattern of observed trends in extreme rainfall indices in India when compared to the IMD dataset.
• Extreme rainfall characteristics over India studied using gauge, satellite, reanalysis, and merged gridded datasets.
• Eleven ETCCDI and IMD indices were used to study extreme rainfall events over India.
• GPCC, CHIRPS, ERA5, and MSWEP performed better in the gauge, satellite, reanalysis, and merged dataset categories, respectively.
• Datasets are incongruous in representing the observed trends in extreme rainfall indices over India.
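The evaluation and trend measures named in this abstract have compact textbook definitions, sketched below. The sign convention for PBIAS (positive when the evaluated dataset overestimates the reference) is an assumption, since conventions differ between fields; the Mann-Kendall function returns only the raw S statistic without the significance test; and the series are assumed to be equally spaced in time. All function names are hypothetical.

```python
from statistics import median


def pbias(sim, obs):
    """Percentage bias of sim relative to obs (positive = overestimation)."""
    return 100.0 * sum(s - o for s, o in zip(sim, obs)) / sum(obs)


def mann_kendall_s(series):
    """Mann-Kendall S statistic: concordant minus discordant pairs."""
    s = 0
    n = len(series)
    for i in range(n - 1):
        for j in range(i + 1, n):
            d = series[j] - series[i]
            s += (d > 0) - (d < 0)  # sign of the pairwise difference
    return s


def sens_slope(series):
    """Sen's slope: median of all pairwise slopes of an equally spaced series."""
    n = len(series)
    slopes = [
        (series[j] - series[i]) / (j - i)
        for i in range(n - 1)
        for j in range(i + 1, n)
    ]
    return median(slopes)
```

Because Sen's slope is a median, a single extreme value barely moves it: for the hypothetical series [2, 4, 6, 80, 10] the estimate is still 2 per time step, which is one reason the method is popular for rainfall extremes.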