•Data cleaning and restoring methods for vehicle battery big data platform.•An integrated data cleaning framework is built to improve vehicle battery dataset quality.•A quality assessment model is built for detecting outliers by analyzing temporal features.•A data restoring model is further developed for improving dataset integrity.•The developed method is validated by real battery operation data in a cloud platform.
The battery is one of the most important and costly components in electric vehicles (EVs). Developing efficient battery management methods is of great significance for enhancing vehicle safety and economy. Recently developed big data and cloud computing technologies offer a promising prospect for the efficient utilization and protection of vehicle batteries. However, a reliable data transmission network and a high-quality cloud battery dataset are indispensable to realize this benefit.
This paper makes the first effort to systematically solve data quality problems in cloud-based vehicle battery monitoring and management by developing a novel integrated battery data cleaning framework. In the first stage, outlier samples are detected by analyzing temporal features in the battery data time series, so that outliers can be accurately identified and their impact on battery monitoring and management avoided. Then, the abnormal samples, including noise-polluted data and missing values, are restored by a novel feature fusion data restoring model. Real electric bus operation data collected by a cloud-based battery monitoring and management platform are used to verify the performance of the developed data cleaning method. More than 93.3% of outlier samples can be detected, and the data restoring error can be limited to 2.11%, which validates the effectiveness of the developed methods. The proposed data cleaning method provides an effective data quality assessment tool for cloud-based vehicle battery management, which can further boost the practical application of vehicle big data platforms and the Internet of Vehicles.
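As a rough illustration of the temporal-feature idea behind the outlier detection stage, the sketch below flags samples in a battery voltage series whose deviation from a rolling median is unusually large. The window size, threshold, and MAD-based scale are illustrative assumptions, not the paper's quality assessment model:

```python
import numpy as np
import pandas as pd

def flag_temporal_outliers(voltage: pd.Series, window: int = 30, z_thresh: float = 4.0) -> pd.Series:
    """Flag samples whose deviation from a rolling median is unusually large.

    This is a generic rolling robust z-score check, not the paper's quality
    assessment model; the window and threshold are illustrative assumptions.
    """
    rolling_median = voltage.rolling(window, center=True, min_periods=1).median()
    residual = voltage - rolling_median
    # Robust scale estimate from the rolling median absolute deviation (MAD).
    mad = residual.abs().rolling(window, center=True, min_periods=1).median()
    z = residual.abs() / (1.4826 * mad + 1e-9)
    return z > z_thresh

# Example: a smooth voltage trace with one injected spike.
t = np.linspace(0, 1, 200)
v = pd.Series(3.7 + 0.05 * np.sin(2 * np.pi * t) + 0.005 * np.random.randn(200))
v.iloc[100] = 4.5  # injected outlier
print(flag_temporal_outliers(v).sum(), "outlier sample(s) flagged")
```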
Accurate and non-invasive measurement of material thickness plays an important role across several industry sectors, such as aerospace, oil and gas, and rail. This paper aims to use neural networks as a predictive tool to enhance the thickness measurement accuracy of immersed steel samples. In this study, a set of training data is obtained by conducting experiments on an immersed wedge sample with varying thickness using the A-scan method. This dataset is used to train a single-layer neural network. To evaluate the performance of the trained neural network, a set of test data is collected on different samples with various thicknesses. Through this study, a promising methodology is demonstrated for accurate and effective thickness prediction using neural networks. The outcomes exhibited good agreement when employing a neural network with the same architecture to predict the void locations in another sample of similar material. Furthermore, the results revealed that this method achieved an error of less than 3% for thickness prediction and less than 7% for void detection.
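A minimal sketch of the modeling setup described above: a neural network with a single hidden layer regressing thickness from an A-scan-derived feature. The synthetic time-of-flight data, sound speed, and network size are illustrative assumptions standing in for the experimental wedge-sample data:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_percentage_error

# Synthetic stand-in for A-scan-derived features (echo time-of-flight);
# the real study uses measured wedge-sample data, not this toy relation.
rng = np.random.default_rng(0)
time_of_flight = rng.uniform(2e-6, 10e-6, size=(500, 1))       # seconds (assumed feature)
sound_speed = 5900.0                                            # m/s, typical for steel
thickness = sound_speed * time_of_flight[:, 0] / 2 * 1e3        # mm
thickness += rng.normal(0, 0.05, size=thickness.shape)          # measurement noise

X_train, X_test, y_train, y_test = train_test_split(time_of_flight, thickness, random_state=0)

# One hidden layer, loosely mirroring the "single-layer" architecture.
model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0))
model.fit(X_train, y_train)

pct_error = mean_absolute_percentage_error(y_test, model.predict(X_test)) * 100
print(f"mean absolute percentage error: {pct_error:.2f}%")
```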
Using constrained factor mixture models (FMM) for careless response identification is still in its infancy. Existing models have overly restrictive statistical assumptions that do not identify all types of careless respondents. The current paper presents a novel constrained FMM with more reasonable assumptions that captures both longstring and random careless respondents. We provide a comprehensive comparison of the statistical assumptions between the proposed model and two previous constrained models. The proposed model was evaluated using both real data (N = 1,455) and statistical simulation. The results showed that the model had a superior fit, stronger convergent validity with other indicators of careless responding, more accurate parameter recovery, and more accurate identification of careless respondents when compared to its predecessors. The proposed model does not require additional data collection effort, and thus researchers can routinely use it to control for careless responses. We provide user-friendly syntax with detailed explanations online to facilitate its use.
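For context, the longstring pattern targeted by the model can be quantified with a simple screening index; the sketch below computes that index, but it is only a complementary indicator of careless responding, not the constrained FMM itself:

```python
import numpy as np

def longstring_index(responses: np.ndarray) -> np.ndarray:
    """Longest run of identical consecutive answers per respondent.

    A classic screening indicator for 'longstring' careless responding;
    it complements, but does not reproduce, the constrained FMM above.
    """
    n_resp, n_items = responses.shape
    longest = np.ones(n_resp, dtype=int)
    for i in range(n_resp):
        run = 1
        for j in range(1, n_items):
            run = run + 1 if responses[i, j] == responses[i, j - 1] else 1
            longest[i] = max(longest[i], run)
    return longest

# Toy example: respondent 2 answers '3' to every item (a longstring pattern).
data = np.array([[1, 2, 3, 2, 4, 1],
                 [3, 3, 3, 3, 3, 3],
                 [5, 1, 4, 2, 2, 3]])
print(longstring_index(data))   # [1 6 2]
```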
•Advocates network models as a complement to latent variable models of individual differences.•Reviews best practices for data cleaning and diagnostics in psychometric modeling.•Reanalysis of Freed et al. (2017) reveals a problematic measurement model.•Measures of reading comprehension and language experience load on one latent factor.•Network models corroborate the findings and alternative conclusions are proposed.
Individual differences in reading comprehension have often been explored using latent variable modeling (LVM), to assess the relative contribution of domain-general and domain-specific cognitive abilities. However, LVM is based on the assumption that the observed covariance among indicators of a construct is due to a common cause (i.e., a latent variable; Pearl, 2000). This is a questionable assumption when the indicator variables are measures of performance on complex cognitive tasks. According to Process Overlap Theory (POT; Kovacs & Conway, 2016), multiple processes are involved in cognitive task performance and the covariance among tasks is due to the overlap of processes across tasks. Instead of a single latent common cause, there are thought to be multiple dynamic manifest causes, consistent with an emerging view in psychometrics called network theory (Barabási, 2012; Borsboom & Cramer, 2013). In the current study, we reanalyzed data from Freed et al. (2017) and compared two modeling approaches: LVM (Study 1) and psychometric network modeling (Study 2). In Study 1, two exploratory LVMs demonstrated problems with the original measurement model proposed by Freed et al. Specifically, the model failed to achieve discriminant and convergent validity with respect to reading comprehension, language experience, and reasoning. In Study 2, two network models confirmed the problems found in Study 1, and also served as an example of how network modeling techniques can be used to study individual differences. In conclusion, more research, and a more informed approach to psychometric modeling, is needed to better understand individual differences in reading comprehension.
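As a rough sketch of what a psychometric network is, the snippet below derives partial-correlation edge weights from a correlation matrix via the precision matrix. Published network analyses typically add regularization (e.g., EBICglasso), which is omitted here, and the toy correlations are illustrative rather than Freed et al.'s data:

```python
import numpy as np

def partial_correlation_network(corr: np.ndarray) -> np.ndarray:
    """Edge weights of a Gaussian graphical model from a correlation matrix.

    Partial correlations come from the inverse correlation (precision) matrix;
    regularized estimation (graphical lasso / EBICglasso) is omitted here.
    """
    precision = np.linalg.inv(corr)
    d = np.sqrt(np.diag(precision))
    partial = -precision / np.outer(d, d)
    np.fill_diagonal(partial, 0.0)
    return partial

# Toy 3-variable correlation matrix (values are illustrative only).
corr = np.array([[1.0, 0.6, 0.5],
                 [0.6, 1.0, 0.4],
                 [0.5, 0.4, 1.0]])
print(np.round(partial_correlation_network(corr), 2))
```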
In this paper, we interrogate whether data quality issues track demographic group membership (based on sex, race and age) and whether automated data cleaning - of the kind commonly used in production ML systems - impacts the fairness of predictions made by these systems. To the best of our knowledge, the impact of data cleaning on fairness in downstream tasks has not been investigated in the literature. We first analyse the tuples flagged by common error detection strategies in five research datasets. We find that, while specific data quality issues, such as higher rates of missing values, are associated with membership in historically disadvantaged groups, poor data quality does not generally track demographic group membership. As a follow-up, we conduct a large-scale empirical study on the impact of automated data cleaning on fairness, involving more than 26,000 model evaluations. We observe that, while automated data cleaning is unlikely to worsen accuracy, it is more likely to worsen fairness than to improve it, especially when the cleaning techniques are not carefully chosen. Furthermore, we find that the positive or negative impact of a particular cleaning technique often depends on the choice of fairness metric and group definition (single-attribute or intersectional). We make our code and experimental results publicly available. The analysis we conducted in this paper is difficult, primarily because it requires that we think holistically about disparities in data quality, disparities in the effectiveness of data cleaning methods, and impacts of such disparities on ML model performance for different demographic groups. Such holistic analysis can and should be supported by data engineering tools, and requires substantial data engineering research. Towards this goal, we discuss open research questions, envision the development of fairness-aware data cleaning methods, and their integration into complex pipelines for ML-based decision making.
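A minimal sketch of the kind of before/after comparison implied above: train a simple classifier on data cleaned with two imputation strategies and report accuracy alongside a demographic parity gap. The synthetic data, model, cleaning techniques, metric, and group definition are illustrative assumptions, not the study's experimental setup:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def demographic_parity_gap(y_pred, group):
    """Absolute difference in positive prediction rates between two groups."""
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

# Synthetic data where missing values are concentrated in one group (illustrative).
rng = np.random.default_rng(1)
n = 2000
group = rng.integers(0, 2, n)
X = rng.normal(size=(n, 3)) + group[:, None] * 0.3
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)
mask = (rng.random((n, 3)) < 0.3) & (group[:, None] == 1)   # group 1 has more missing values
X_missing = X.copy()
X_missing[mask] = np.nan

X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(X_missing, y, group, random_state=0)

for name, imputer in [("mean imputation", SimpleImputer(strategy="mean")),
                      ("median imputation", SimpleImputer(strategy="median"))]:
    clf = LogisticRegression().fit(imputer.fit_transform(X_tr), y_tr)
    X_te_clean = imputer.transform(X_te)
    pred = clf.predict(X_te_clean)
    print(f"{name}: accuracy={clf.score(X_te_clean, y_te):.3f}, "
          f"parity gap={demographic_parity_gap(pred, g_te):.3f}")
```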
Researchers on travel behavior and regional economic trends increasingly rely on multiple data sources to locate employers and site-specific employment. In a previous study, we proposed a method to assess and integrate multiple sources of employment data using three components: the Google Places application programming interface (API), a business existence verification model, and manual reviews of sampled data. This paper updates our previous methodology with a dual conditional classification of incoming and previously verified employment data made possible by checks using Google Places API and two rounds of string comparisons for both business names and establishment locations. The resulting match classes distinguish well-matched or confirmed business listings from those that require additional review to evaluate potential business closure or relocation. This screening process, augmented with fuzzy logic string matching techniques, reduces the effort needed to update employer information and assists with automated data standardization and deduplication, integrating incoming employment information with a database of verified employers.
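A minimal sketch of a dual name-and-address comparison of this kind, using only the standard library's difflib; the similarity thresholds and match classes are illustrative assumptions rather than the study's actual classification rules:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def classify_match(incoming, verified, name_thresh=0.85, addr_thresh=0.80):
    """Dual check on business name and location, loosely mirroring the
    two-round string comparison described above (thresholds are assumed)."""
    name_sim = similarity(incoming["name"], verified["name"])
    addr_sim = similarity(incoming["address"], verified["address"])
    if name_sim >= name_thresh and addr_sim >= addr_thresh:
        return "confirmed"
    if name_sim >= name_thresh:
        return "possible relocation (review)"
    if addr_sim >= addr_thresh:
        return "possible rename or closure (review)"
    return "no match"

incoming = {"name": "Acme Manufacturing Co.", "address": "120 Main St, Springfield"}
verified = {"name": "ACME Manufacturing Company", "address": "120 Main Street, Springfield"}
print(classify_match(incoming, verified))
```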
Due to long-standing federal restrictions on cannabis-related research, the implications of cannabis legalization for traffic and occupational safety are understudied. Accordingly, there is a need for objective and validated measures of acute cannabis impairment that may be applied in public safety and occupational settings. Pupillary response to light may offer an avenue for detection that outperforms typical sobriety tests and tetrahydrocannabinol concentrations. We developed a video processing and analysis pipeline that extracts pupil sizes during a light stimulus test administered with goggles utilizing infrared videography. The analysis compared pupil size trajectories in response to light for those with occasional, daily, and no cannabis use before and after smoking. Pupils were segmented using a combination of image pre-processing techniques and segmentation algorithms, which were validated against manually segmented data and found to achieve 99% precision and a 94% F-score. Features extracted from the pupil size trajectories captured pupil constriction and rebound dilation and were analyzed using generalized estimating equations. We find that acute cannabis use results in less pupil constriction and slower pupil rebound dilation in the light stimulus test.
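To make the validation metrics concrete, the sketch below computes pixel-level precision and F-score for a predicted pupil mask against a manually segmented reference mask; the pipeline's actual pre-processing and segmentation algorithms are not reproduced here:

```python
import numpy as np

def precision_fscore(pred_mask: np.ndarray, true_mask: np.ndarray):
    """Pixel-level precision and F1 score for a binary pupil mask
    compared against a manually segmented reference mask."""
    tp = np.logical_and(pred_mask, true_mask).sum()
    fp = np.logical_and(pred_mask, ~true_mask).sum()
    fn = np.logical_and(~pred_mask, true_mask).sum()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, f1

# Toy masks: the predicted pupil is slightly smaller than the manual reference.
true_mask = np.zeros((10, 10), dtype=bool); true_mask[3:8, 3:8] = True
pred_mask = np.zeros((10, 10), dtype=bool); pred_mask[3:7, 3:7] = True
print(precision_fscore(pred_mask, true_mask))   # (1.0, ~0.78)
```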
Depression is an increasingly common problem that often goes undiagnosed. The aim of this paper was to determine whether an analysis of tweets can serve as a proxy for assessing depression levels in society. The work considered keyword-based sentiment analysis, which was enhanced to exclude informational tweets about depression or about recovery. The results identified the words used most often in the posts and the emotional polarity of the tweets. A schedule of user activity was mapped out, and trends related to the daily activity of users were analyzed. It was observed that the identified X (Twitter) activity related to depression corresponded well with reports on persons with depression and statistics on suicidal deaths. Therefore, it could be construed that people with undiagnosed depression express their feelings in social media more often, looking, in this way, for help with their emotional problems.
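A minimal sketch of the exclusion step described above, i.e., keeping personally expressed depression posts while filtering out informational or recovery-related ones; the keyword lists here are illustrative assumptions, not the study's lexicon:

```python
# Illustrative keyword filter: keep posts mentioning depression in a personal way
# while dropping informational/recovery content. Keyword lists are assumptions.
DEPRESSION_KEYWORDS = {"depressed", "depression", "hopeless", "can't get out of bed"}
EXCLUDE_KEYWORDS = {"awareness", "article", "study", "recovered", "recovery", "helpline"}

def is_personal_depression_post(text: str) -> bool:
    t = text.lower()
    mentions = any(k in t for k in DEPRESSION_KEYWORDS)
    informational = any(k in t for k in EXCLUDE_KEYWORDS)
    return mentions and not informational

posts = [
    "I feel so depressed, nothing helps anymore",
    "New study links exercise and depression recovery",
    "Great weather today!",
]
print([is_personal_depression_post(p) for p in posts])   # [True, False, False]
```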
•A method for bearing life prediction under variable operating conditions is proposed.•A shape similarity-based method for selecting degradation features is proposed.•An unsupervised segmented data cleaning algorithm is proposed.•The cross-transformer is improved to capture dependencies across time and variables.•The proposed method improves prediction accuracy and generalization.
Bearing remaining useful life (RUL) prediction research based on deep learning mostly emphasizes model performance and effective feature vectors, overlooking the different densities of outlier distributions in vibration signals at varying degradation stages. Moreover, forecasting models focus on capturing cross-time dependencies, ignoring the dependencies between different variables. To solve these problems, this paper proposes an unsupervised segmented data cleaning algorithm and a RUL prediction framework adaptable to variable operating conditions. The method consists of four steps: (1) Multi-domain feature extraction and selection establish a feature vector space reflecting degradation trends. (2) Segmented data cleaning divides the degradation stages, using different penalty factors for outlier cleaning. (3) Cleaned vibration signals undergo a second round of multi-domain feature engineering and degradation-stage division. (4) A two-stage Cross-Transformer model is used for RUL prediction. The proposed method has been validated on the prognostics and health management (PHM) bearing degradation dataset. In the constant-condition prediction task, the root mean square error (RMSE) and mean absolute error (MAE) were improved to 1.88 and 5.78, respectively. In the variable-condition prediction task, the proposed method outperformed existing methods, with an improvement of 59.10% in RMSE, demonstrating strong generalization performance and practical application value.
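A rough sketch of the stage-wise cleaning idea in step (2), where each degradation stage is cleaned with its own penalty factor, together with RMSE/MAE helpers mirroring the evaluation metrics quoted above (here applied only to a toy signal). The stage boundary, penalty factors, and clipping rule are illustrative assumptions, not the paper's unsupervised segmentation, and the Cross-Transformer itself is not shown:

```python
import numpy as np

def segmented_clean(feature: np.ndarray, boundaries, penalty_factors):
    """Clip outliers per degradation stage with stage-specific thresholds.

    `boundaries` split the signal into stages; each stage keeps samples within
    mean +/- penalty_factor * std. Boundaries and factors are illustrative,
    not the paper's unsupervised stage division.
    """
    cleaned = feature.copy()
    edges = [0, *boundaries, len(feature)]
    for (start, end), k in zip(zip(edges[:-1], edges[1:]), penalty_factors):
        seg = cleaned[start:end]
        mu, sigma = seg.mean(), seg.std()
        cleaned[start:end] = np.clip(seg, mu - k * sigma, mu + k * sigma)
    return cleaned

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def mae(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# Toy degradation feature: a healthy stage followed by accelerated degradation
# with heavier noise; the second stage gets a stricter penalty factor.
rng = np.random.default_rng(0)
feat = np.concatenate([rng.normal(1.0, 0.02, 600),
                       np.linspace(1.0, 3.0, 400) + rng.normal(0, 0.2, 400)])
cleaned = segmented_clean(feat, boundaries=[600], penalty_factors=[3.0, 2.0])
print("change introduced by cleaning:", rmse(feat, cleaned), mae(feat, cleaned))
```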