Self-report data collections, particularly through online measures, are ubiquitous in both experimental and non-experimental psychology. Invalid data can be present in such data collections for a number of reasons. One reason is careless or insufficient effort (C/IE) responding. The past decade has seen a rise in research on techniques to detect and remove these data before normal analysis (Huang, Curran, Keeney, Poposki, & DeShon, 2012; Johnson, 2005; Meade & Craig, 2012). The rigorous use of these techniques is a valuable tool for the removal of error that can impact survey results (Huang, Liu, & Bowling, 2015). This research has encompassed a number of sub-fields of psychology, and this paper aims to integrate different perspectives into a review and assessment of current techniques, an introduction of new techniques, and a generation of recommendations for practical use. Concerns about C/IE responding are a factor any time self-report data are collected, and all researchers who collect such data should be well-versed in methods to detect this pattern of response.
Introduction/Purpose
Ultrasound picture archiving and communication system (PACS) databases are useful for quality improvement and clinical research but frequently contain free text that is not easily readable. Here, we present a method to extract and clean a semi‐structured echocardiography (cardiac ultrasound) PACS database.
Methods
Echocardiography studies between 1 January 2010 and 31 December 2018 were extracted using a data mining tool. Numeric variables were recoded with extreme values excluded. Analysis of free text, including descriptions of the heart valves and right and left ventricular size and function, was performed using a rule‐based system. Different levels of free text variables were initially identified using commonly used phrases and then iteratively developed. Randomly selected sets of 100 studies were compared to the electronic health record to validate the data cleaning process.
Results
The data validation step was performed three times in total, with Cohen's kappa ranging between 0.88 and 1.00 for the final set of data validation across all measures.
Conclusion
Free text cleaning of semi‐structured PACS databases is possible using freely available open‐source software. The accuracy of this method is high, and the resulting dataset can be linked to administrative data to answer research questions. We present a method that could be used to answer clinical questions or to develop quality improvement initiatives.
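The rule-based approach described in the Methods can be sketched as a small phrase-matching classifier. This is an illustrative reconstruction, not the study's actual code: the phrase lists and severity levels here (for mitral regurgitation) are invented placeholders, whereas the study developed its rules iteratively against the source database.

```python
import re

# Hypothetical phrase rules mapping free-text echo descriptions to ordinal
# severity levels; the real study refined such lists iteratively.
MITRAL_REGURG_RULES = [
    (r"\bsevere mitral regurgitation\b", "severe"),
    (r"\bmoderate mitral regurgitation\b", "moderate"),
    (r"\b(mild|trivial|trace) mitral regurgitation\b", "mild"),
    (r"\bno (significant )?mitral regurgitation\b", "none"),
]

def classify_mr(report_text: str) -> str:
    """Return the first matching severity level, or 'unclassified'."""
    text = report_text.lower()
    for pattern, level in MITRAL_REGURG_RULES:
        if re.search(pattern, text):
            return level
    return "unclassified"

level = classify_mr("There is moderate mitral regurgitation.")  # -> "moderate"
```

Unmatched reports fall through to "unclassified", which is what makes the random-sample validation against the electronic health record (the Cohen's kappa step) worthwhile: it catches phrasings the rules miss.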
Tidy Data / Wickham, Hadley
Journal of Statistical Software, 09/2014, Volume 59, Issue 10
Journal Article · Peer reviewed · Open access
A huge amount of effort is spent cleaning data to get it ready for analysis, but there has been little research on how to make data cleaning as easy and effective as possible. This paper tackles a small, but important, component of data cleaning: data tidying. Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table. This framework makes it easy to tidy messy datasets because only a small set of tools are needed to deal with a wide range of un-tidy datasets. This structure also makes it easier to develop tidy tools for data analysis, tools that both input and output tidy datasets. The advantages of a consistent data structure and matching tools are demonstrated with a case study free from mundane data manipulation chores.
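The core tidying operation is turning a "wide" table whose column names encode a variable (here, years) into a long table where each row is one observation. A minimal plain-Python sketch of this melt operation, with invented example values:

```python
# Wide layout: the year variable is smeared across column names.
wide = [
    {"country": "A", "1999": 745, "2000": 2666},
    {"country": "B", "1999": 37737, "2000": 80488},
]

def melt(rows, id_col, var_name, value_name):
    """Turn value-bearing column names into (variable, value) rows,
    so each variable is a column and each observation is a row."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key != id_col:
                tidy.append({id_col: row[id_col], var_name: key, value_name: value})
    return tidy

tidy = melt(wide, "country", "year", "cases")
# Each of the 4 observations is now its own row: country, year, cases.
```

In practice this is `pandas.melt` or `DataFrame.melt`; the point of the tidy-data framework is that one such tool covers a whole family of messy layouts.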
Practitioners use various indicators to screen for meaningless, careless, or fraudulent responses in Internet surveys. This study employs an experimental-like design to empirically test the ability of non-reactive indicators to identify records with low data quality. Findings suggest that careless responses are most reliably identified by questionnaire completion time, but the tested indicators do not allow for detecting intended faking. The article introduces various indicators, their benefits and drawbacks, proposes a completion speed index for common application in data cleaning, and discusses whether to remove meaningless records at all.
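A completion-time screen of the kind this abstract describes can be sketched as a relative speed index. This is an illustrative version (completion time relative to the sample median, with an assumed flagging cutoff of 0.5); the article's exact index and threshold may differ.

```python
from statistics import median

def completion_speed_index(times):
    """Each respondent's completion time relative to the sample median.
    Values well below 1.0 suggest careless speeding.
    (Illustrative formula; the article's index may be defined differently.)"""
    med = median(times)
    return [t / med for t in times]

times = [310, 290, 305, 95, 330]  # seconds to complete the survey
index = completion_speed_index(times)
flagged = [i for i, s in enumerate(index) if s < 0.5]  # assumed cutoff
```

Respondent 3 (95 s against a 305 s median) is the only one flagged here; whether such records should be removed at all is exactly the question the article discusses.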
•Outliers embedded in multivariate data are detected and removed by local density.
•Multidimensional wind features are reconfigured to a structured information map.
•Tailored input of the convolutional model is beneficial to wind power prediction.
Wind power prediction decreases the uncertainty of the entire energy system, which is essential for balancing energy supply and demand. In order to improve the prediction accuracy, a short-term wind power prediction method based on data cleaning and feature reconfiguration is proposed. A large number of historical samples consisting of wind direction, wind speed, and wind power are mapped into a multidimensional sample space, and the distributions of wind data in different dimensions are analyzed in depth. By calculating the local density of each sample, outliers are effectively detected. The features of wind are reconfigured into a global information map combined with the time series information, which reflects the variation of the wind process in the short term. The features of the original data are greatly enriched, providing a high-quality training set for the prediction model. A redesigned convolutional neural network is used to predict short-term wind power, and the proposed methods are trained and tested on a dataset from a real wind farm in China. Data cleaning and feature reconfiguration reduce the average single-point error by 1.38% and 2.56%, respectively, while the combined method reduces it by 6.24%. Extensive experimental results show that the proposed methods achieve good performance and effectively improve the accuracy of short-term wind power prediction.
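The local-density outlier step can be illustrated with a toy k-nearest-neighbour density estimate: samples whose neighbours are far away sit in sparse regions and are flagged. This is a simplified stand-in for the paper's density computation, with invented normalised (direction, speed, power) samples.

```python
import math

def local_density(points, k=2):
    """Inverse of the mean distance to the k nearest neighbours.
    Low density = sparse region = outlier candidate.
    (Simplified sketch of a local-density criterion.)"""
    densities = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        densities.append(1.0 / (sum(dists[:k]) / k + 1e-12))
    return densities

# Toy normalised (wind direction, wind speed, power) samples;
# the last one lies far from the cluster and should score lowest.
samples = [(0.10, 0.30, 0.25), (0.11, 0.31, 0.26), (0.12, 0.29, 0.24),
           (0.10, 0.32, 0.27), (0.90, 0.95, 0.05)]
dens = local_density(samples)
outlier = dens.index(min(dens))  # index of the sparsest sample
```

A threshold on the density (or removing the lowest-density fraction) then yields the cleaned training set fed to the prediction model.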
Abstract
Traditional Chinese medicine (TCM) prescriptions have been developed over thousands of years. Their data take diverse forms, their content is discrete and often missing, and cultural and regional differences introduce many uncertainties; all of this complicates the mining of TCM prescriptions. Taking 3108 prescriptions for the treatment of typhoid fever as an example, this work focuses on the data cleaning and data transformation stages of preprocessing: removing unqualified prescription records, normalizing drug names, unifying doses, and structuring the data so that the processed records can be mined effectively. This provides strong support for exploring the compatibility laws of prescriptions and for the development of new drugs.
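Two of the preprocessing steps named here, drug-name normalization and dose unification, amount to lookup-table mapping. A minimal sketch, in which the synonym table and unit conversion factors are invented placeholders, not values from the paper:

```python
# Assumed synonym table: variant spellings map to one canonical drug name.
NAME_MAP = {"gui zhi": "Cinnamomi Ramulus", "guizhi": "Cinnamomi Ramulus"}
# Assumed conversion factors to grams for historical dose units.
UNIT_TO_GRAMS = {"g": 1.0, "qian": 3.0, "liang": 30.0}

def normalise(entry):
    """Canonicalise the drug name and convert the dose to grams."""
    name = NAME_MAP.get(entry["name"].strip().lower(), entry["name"])
    grams = entry["dose"] * UNIT_TO_GRAMS[entry["unit"]]
    return {"name": name, "dose_g": grams}

record = normalise({"name": "GuiZhi", "dose": 3, "unit": "qian"})
```

The real difficulty, as the abstract notes, is building these tables across regional and historical variants; the mapping itself is the easy part.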
On testing machine learning programs / Braiek, Houssem Ben; Khomh, Foutse
The Journal of Systems and Software, June 2020, Volume 164
Journal Article · Peer reviewed · Open access
•We identify and explain ML testing challenges that should be addressed.
•We report existing solutions found in the literature for testing ML programs.
•We identify gaps in the literature related to the testing of ML programs.
•We make recommendations of future research directions for the community.
Nowadays, we are witnessing a wide adoption of Machine learning (ML) models in many software systems. They are even being tested in safety-critical systems, thanks to recent breakthroughs in deep learning and reinforcement learning. Many people are now interacting with systems based on ML every day, e.g., voice recognition systems used by virtual personal assistants like Amazon Alexa or Google Home. As the field of ML continues to grow, we are likely to witness transformative advances in a wide range of areas, from finance, energy, to health and transportation. Given this growing importance of ML-based systems in our daily life, it is becoming utterly important to ensure their reliability. Recently, software researchers have started adapting concepts from the software testing domain (e.g., code coverage, mutation testing, or property-based testing) to help ML engineers detect and correct faults in ML programs. This paper reviews current existing testing practices for ML programs. First, we identify and explain challenges that should be addressed when testing ML programs. Next, we report existing solutions found in the literature for testing ML programs. Finally, we identify gaps in the literature related to the testing of ML programs and make recommendations of future research directions for the scientific community. We hope that this comprehensive review of software testing practices will help ML engineers identify the right approach to improve the reliability of their ML-based systems. We also hope that the research community will act on our proposed research directions to advance the state of the art of testing for ML programs.
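One of the ML-testing ideas this review covers, property-based testing, can be illustrated with a metamorphic test: instead of checking an exact output (hard when there is no oracle), check a relation that must hold between outputs on related inputs. A toy sketch with a 1-nearest-neighbour model, chosen because uniformly scaling every feature scales all pairwise distances equally and so must not change its prediction:

```python
import math

def nn_classify(train, labels, x):
    """1-nearest-neighbour classifier: the toy model under test."""
    i = min(range(len(train)), key=lambda j: math.dist(train[j], x))
    return labels[i]

train = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
labels = ["a", "a", "b"]
x = (4.0, 4.5)

# Metamorphic relation: a uniform positive rescaling of all features
# multiplies every distance by the same factor, so the nearest
# neighbour (and hence the prediction) must be unchanged.
scaled_train = [tuple(10 * v for v in p) for p in train]
scaled_x = tuple(10 * v for v in x)
assert nn_classify(train, labels, x) == nn_classify(scaled_train, labels, scaled_x)
```

The same pattern scales to real models: generate inputs, apply a meaning-preserving transformation, and assert the relation, rather than hand-labelling expected outputs.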
Survey respondents differ in their levels of attention and effort when responding to items. There are a number of methods researchers may use to identify respondents who fail to exert sufficient effort in order to increase the rigor of analysis and enhance the trustworthiness of study results. Screening techniques are organized into three general categories, which differ in impact on survey design and potential respondent awareness. Assumptions and considerations regarding appropriate use of screening techniques are discussed along with descriptions of each technique. The utility of each screening technique is a function of survey design and administration. Each technique has the potential to identify different types of insufficient effort. An example dataset is provided to illustrate these differences and familiarize readers with the computation and implementation of the screening techniques. Researchers are encouraged to consider data screening when designing a survey, select screening techniques on the basis of theoretical considerations (or empirical considerations when pilot testing is an option), and report the results of an analysis both before and after employing data screening techniques.
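One widely used post-hoc screening index of the kind such reviews describe is the longstring index: the length of the longest run of identical consecutive answers, which flags straight-lining. A minimal sketch with invented Likert-scale response vectors:

```python
def longstring(responses):
    """Length of the longest run of identical consecutive answers.
    Large values suggest straight-lining (careless responding)."""
    best = run = 1
    for prev, cur in zip(responses, responses[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

careful = [3, 4, 2, 5, 3, 4, 1, 2]    # varied responding
careless = [3, 3, 3, 3, 3, 3, 2, 3]   # mostly straight-lined
```

What counts as "too long" a run depends on the scale and item ordering, which is why such reviews recommend choosing cutoffs from survey design (or pilot data) rather than a universal constant.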
In this article, we study the problem of computing Random Forest-distances in the presence of missing data. We present a general framework which avoids pre-imputation and uses, in an agnostic way, the information contained in the input points. We centre our investigation on RatioRF, an RF-based distance recently introduced in the context of clustering and shown to outperform most known RF-based distance measures. We also show that the same framework can be applied to several other state-of-the-art RF-based measures and provide their extensions to the missing data case. We provide significant empirical evidence of the effectiveness of the proposed framework, showing extensive experiments with RatioRF on 15 datasets. Finally, we compare our method favourably with many alternative distances from the literature that can be computed in the presence of missing values.
Wind turbine power curve cleaning, by way of removing curtailment, stoppage, and other anomalies, is an essential step in making raw data usable for further analysis, such as determining turbine performance, characterizing a site, or improving forecasting models. Typically, data come as SCADA (Supervisory Control and Data Acquisition) data, and so contain not only environmental and turbine performance data but also the control actions imposed on the turbine by the operator. Many different anomaly detection (AD) methods have been proposed to clean power curves; however, few papers have explored filtering explicit and obvious anomalies from the SCADA data prior to running AD. This paper actively explores this filtering impact by comparing the performances of four different AD methods with and without filtering: iForest, Local Outlier Factor, Gaussian Mixture Models, and k-Nearest Neighbours. Each approach is evaluated in terms of prediction error, data removal rates, and ability to maintain the underlying wind statistical characteristics. The results show the effectiveness of filtering, with every technique showing improvement compared to its unfiltered counterpart. Furthermore, Gaussian Mixture Models are shown to provide favourable accuracy whilst maintaining wind variability; however, given the wide range of performances across methods, a user's choice may differ depending on their needs.
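The "filter obvious anomalies first" step can be sketched as simple rules applied before any AD method: drop stoppages (near-zero power above cut-in wind speed) and curtailment (power pinned well below rating at high wind). The thresholds and turbine specification below are illustrative assumptions, not values from the paper.

```python
CUT_IN, RATED_POWER = 3.0, 2000.0  # m/s, kW: assumed turbine spec

def prefilter(scada):
    """Remove explicit anomalies before running an AD method
    (e.g. iForest or LOF) on the remaining power curve."""
    kept = []
    for wind_speed, power in scada:
        stoppage = wind_speed > CUT_IN and power <= 0.0
        curtailed = wind_speed > 12.0 and power < 0.8 * RATED_POWER
        if not (stoppage or curtailed):
            kept.append((wind_speed, power))
    return kept

data = [(5.0, 400.0),    # normal operation: kept
        (14.0, 1990.0),  # near rated power at high wind: kept
        (8.0, 0.0),      # stoppage: dropped
        (13.5, 900.0)]   # curtailment: dropped
clean = prefilter(data)
```

The paper's finding is that running AD on `clean` rather than `data` improves every method tested, since the AD model no longer has to learn around operator-imposed behaviour.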