ABSTRACT Urban water demand prediction is not only the foundation of water resource planning and management but also an important component of water supply system optimization and scheduling. Predicting future water demand is therefore of great significance. For univariate time series data, outliers are first handled through data preprocessing. The data input dimension is then increased through feature engineering, and finally the LightGBM (Light Gradient Boosting Machine) model is used to predict future water demand. The results demonstrate that cubic polynomial interpolation outperforms the Prophet model and the linear method for missing value interpolation. For water demand prediction, the LightGBM model shows excellent forecasting performance and can effectively predict future water demand trends. On the test dataset, the MAPE (mean absolute percentage error) is 4.28% and the NSE (Nash–Sutcliffe efficiency coefficient) is 0.94. These results can provide a scientific basis for short-term forecasting by water supply enterprises.
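The two evaluation metrics reported above can be computed directly from test-set predictions. A minimal Python sketch (function names and toy inputs are illustrative, not the paper's code):

```python
def mape(actual, predicted):
    """Mean absolute percentage error, reported in percent."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def nse(actual, predicted):
    """Nash-Sutcliffe efficiency coefficient; 1.0 indicates a perfect fit."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot
```

Lower MAPE and NSE closer to 1 both indicate better forecasts, which is the sense in which 4.28% and 0.94 are read above.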
Due to long-standing federal restrictions on cannabis-related research, the implications of cannabis legalization for traffic and occupational safety are understudied. Accordingly, there is a need for objective and validated measures of acute cannabis impairment that can be applied in public safety and occupational settings. Pupillary response to light may offer an avenue for detection that outperforms typical sobriety tests and tetrahydrocannabinol concentrations. We developed a video processing and analysis pipeline that extracts pupil sizes during a light stimulus test administered with goggles using infrared videography. The analysis compared pupil size trajectories in response to light for participants with occasional, daily, and no cannabis use, before and after smoking. Pupils were segmented using a combination of image pre-processing techniques and segmentation algorithms, which were validated against manually segmented data and achieved 99% precision and a 94% F-score. Features extracted from the pupil size trajectories captured pupil constriction and rebound dilation and were analyzed using generalized estimating equations. We find that acute cannabis use results in less pupil constriction and slower pupil rebound dilation in the light stimulus test.
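Constriction and rebound features of the kind described can be extracted from a pupil-size trajectory with simple summary statistics. A hedged sketch (the feature definitions and inputs here are illustrative assumptions, not the paper's exact feature set):

```python
def pupil_features(sizes, times, light_onset_idx):
    """Extract simple constriction/rebound features from a pupil-size trace.

    sizes: pupil diameters over time; times: matching timestamps in seconds;
    light_onset_idx: index at which the light stimulus begins.
    """
    # Baseline diameter: mean size before the light turns on.
    baseline = sum(sizes[:light_onset_idx]) / light_onset_idx
    post = sizes[light_onset_idx:]
    min_size = min(post)
    min_idx = light_onset_idx + post.index(min_size)
    # Constriction amplitude: relative drop from baseline to the minimum.
    constriction = (baseline - min_size) / baseline
    # Rebound dilation velocity: slope from the minimum to the end of the trace.
    rebound = (sizes[-1] - min_size) / (times[-1] - times[min_idx])
    return {"constriction": constriction, "rebound_velocity": rebound}
```

Under the abstract's finding, acutely impaired participants would show a smaller `constriction` and a smaller `rebound_velocity` than sober ones.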
Depression is an increasingly common problem that often goes undiagnosed. The aim of this paper was to determine whether an analysis of tweets can serve as a proxy for assessing depression levels in society. The work employed keyword-based sentiment analysis, enhanced to exclude informational tweets about depression or about recovery. The analysis revealed the words used most often in the posts and the emotional polarity of the tweets. A schedule of user activity was mapped out, and trends in the daily activity of users were analyzed. The identified X (Twitter) activity related to depression corresponded well with reports on persons with depression and with statistics on suicide deaths. It could therefore be construed that people with undiagnosed depression express their feelings on social media more often, looking, in this way, for help with their emotional problems.
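The exclusion of informational tweets can be illustrated with a minimal keyword filter. This is a hedged sketch in the spirit of the approach; the keyword lists are illustrative assumptions, not the paper's actual lexicons:

```python
# Hypothetical keyword lists: keep first-person expressions of depression,
# drop informational/awareness posts and posts about recovery.
DEPRESSION_KEYWORDS = {"depressed", "depression", "hopeless"}
EXCLUSION_KEYWORDS = {"awareness", "study", "article", "recovered", "recovery"}

def is_personal_depression_tweet(text):
    """True if the tweet matches depression keywords and no exclusion terms."""
    words = set(text.lower().split())
    return bool(words & DEPRESSION_KEYWORDS) and not (words & EXCLUSION_KEYWORDS)
```

A real pipeline would add tokenization, negation handling, and sentiment scoring on top of this filter.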
Many have envisioned the use of AI methods to find hidden patterns of public interest in large volumes of data, greatly reducing the cost of investigative journalism. But so far only a few investigative stories have utilized AI methods, in relatively narrow ways. This paper surveys what has been accomplished in investigative reporting using AI techniques, why it has been difficult to apply more advanced methods, and what sorts of investigative journalism problems might be solved by AI in the near term. Journalism problems are often unique to a particular story, which means that training data is not readily available and the cost of complex models cannot be amortized over multiple projects. Much of the data relevant to a story is not publicly accessible but in the hands of governments and private entities, often requiring collection, negotiation, or purchase. Journalistic inference requires very high accuracy, or extensive manual checking, to avoid the risk of libel. The factors that make some set of facts "newsworthy" are deeply sociopolitical and therefore difficult to encode computationally. The biggest near-term potential for AI in investigative journalism lies in data preparation tasks, such as data extraction from diverse documents and probabilistic cross-database record linkage.
Data cleaning is one of the most important tasks in data analysis processes, and detecting and handling non-valid data is a perennial challenge in data analytics. Failing to do so can create imbalanced observations that bias estimates and, in extreme cases, lead to inaccurate analytics and unreliable decisions. The cleaning process is usually time-consuming because of the growing volume, velocity, and variety of data, and its complexity and difficulty increase with the amount of data to be analyzed. Real-world data are rarely clean and error-free, so pre-processing data before analysis has become standard practice. This paper provides an easy-to-use and reliable system that automates the cleaning process for univariate time series data, thereby reducing the time required for cleaning. Another issue that the proposed system aims to solve is making the visualization of large amounts of data more effective. To tackle these issues, an R package, cleanTS, is proposed. The proposed system provides a way to analyze data at different scales and resolutions, and it supplies users with tools and a benchmark system for comparing various techniques used in data cleaning.
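As an illustration of the kind of automated step such a system performs, the following hedged Python sketch (not the cleanTS R API) flags outliers in a univariate series with the interquartile-range rule and fills them by linear interpolation between the nearest valid neighbours:

```python
from statistics import quantiles

def clean_series(values, k=1.5):
    """Minimal univariate cleaning sketch: IQR outlier detection followed by
    linear interpolation. The factor k=1.5 is the conventional IQR multiplier."""
    q1, _, q3 = quantiles(values, n=4)
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    valid = [i for i, v in enumerate(values) if lo <= v <= hi]
    cleaned = list(values)
    for i, v in enumerate(values):
        if lo <= v <= hi:
            continue  # value is within the accepted range
        left = max((j for j in valid if j < i), default=None)
        right = min((j for j in valid if j > i), default=None)
        if left is None:
            cleaned[i] = values[right]   # no left neighbour: copy right
        elif right is None:
            cleaned[i] = values[left]    # no right neighbour: copy left
        else:
            # Linear interpolation between the nearest valid neighbours.
            t = (i - left) / (right - left)
            cleaned[i] = values[left] + t * (values[right] - values[left])
    return cleaned
```

A production system would additionally handle missing timestamps, seasonality, and level shifts, which this sketch omits.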
•A GCN–GRU fusion method for extracting spatiotemporal features of measurement data is proposed, which can maximize the recovery of the true features of abnormal data. Compared with existing methods, it has higher data reconstruction accuracy: the maximum relative RMSE of the 4-dimensional state quantities does not exceed 0.006.•A pre-identification process based on interval slope is proposed to locate the fuzzy intervals of the predicted data, which greatly reduces the overall running time of the identification process. Test results show that introducing the pre-identification process reduced the running time from 3.55 s to 1.89 s.•The effectiveness and superiority of the proposed method were verified through simulation experiments and tests on actual data. The experimental results show that the proposed method achieves better performance and higher accuracy than existing methods.
Power system measurement data serve as the basis for various power system analyses, so their accuracy and reliability are particularly important. In practice, however, measurement data contain abnormal values arising from a variety of problems. An identification method based on the spatial-temporal characteristics of measurement data is proposed to address poor identification accuracy and low efficiency. A Pre-identification (PI) process is introduced to improve efficiency: a slope-based sliding window rapidly targets the intervals containing abnormal data and generates suspicious data sets. A spatiotemporal fusion model based on a Graph Convolutional Network (GCN) and a Gated Recurrent Unit (GRU) is then proposed; by aggregating the spatial-temporal characteristics of the measurement data, it achieves precise reconstruction of the measurements. Finally, a threshold separates truly abnormal data from normal data so that the suspicious data set can be cleaned. Simulation experiments with different ratios and types of abnormal data show that the proposed method delivers better identification performance and higher efficiency. Tests on actual measurement data show that the method accurately identifies abnormal data even under the interference of fluctuating data, indicating good robustness.
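The slope-based pre-identification step can be sketched as follows. The window size and slope threshold here are illustrative assumptions, not the paper's tuned values:

```python
def pre_identify(series, window=4, slope_threshold=0.5):
    """Sketch of slope-based pre-identification (PI): slide a window over the
    series and flag intervals whose end-to-end slope is abnormally steep.
    Returns (start, end) index pairs of suspicious intervals."""
    suspicious = []
    for start in range(len(series) - window + 1):
        end = start + window - 1
        # End-to-end slope of the window, assuming unit time steps.
        slope = (series[end] - series[start]) / (window - 1)
        if abs(slope) > slope_threshold:
            suspicious.append((start, end))
    return suspicious
```

Only the flagged intervals then need the expensive GCN–GRU reconstruction and thresholding, which is how the PI step reduces overall running time.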
The Graph Neural Network (GNN) has emerged as a predominant tool for graph data analysis. Despite their proliferation, the low-quality labels of many real-world graphs can dramatically undermine their performance. Existing studies on learning neural networks with noisy labels mainly focus on independent data and thus cannot fully exploit the structural information of graph data. There are currently few studies of robustness to noisy labels for graph-structured data, even though this problem is common in real-world settings. To remedy this deficiency, we propose GNN Cleaner, which utilizes the structural information of graph data to combat noisy labels. More specifically, a pseudo label is computed from the neighboring labels for each node in the training set via a modified version of label propagation. Additionally, a novel method is developed to learn to correct the labels adaptively and dynamically. Extensive experiments show that GNN Cleaner can train GNNs robustly and correct both synthetic and real-world noisy labels even when the noise is severe. Moreover, GNN Cleaner is model-agnostic and can be combined with various GNNs to improve their robustness against label noise.
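The pseudo-labeling step can be illustrated with plain label propagation. This is a simplified stand-in for the paper's modified version; the mixing weight `alpha`, the iteration count, and the adjacency-list input format are assumptions:

```python
def propagate_labels(adj, labels, iters=10, alpha=0.9):
    """Soft label propagation: repeatedly mix each node's label vector with
    the mean of its neighbours' label vectors.

    adj: adjacency lists (adj[i] = neighbours of node i);
    labels: one row per node, one column per class (one-hot or soft)."""
    cur = [row[:] for row in labels]
    for _ in range(iters):
        nxt = []
        for i, nbrs in enumerate(adj):
            if not nbrs:
                nxt.append(cur[i][:])  # isolated node keeps its label
                continue
            mean = [sum(cur[j][c] for j in nbrs) / len(nbrs)
                    for c in range(len(cur[i]))]
            # Weighted mix of neighbourhood consensus and the node's own label.
            nxt.append([alpha * m + (1 - alpha) * l
                        for m, l in zip(mean, cur[i])])
        cur = nxt
    return cur
```

In a triangle where two nodes are labeled class 0 and one is mislabeled class 1, propagation pulls the mislabeled node's soft label toward class 0, which is the intuition behind structure-based label cleaning.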
Video captioning automatically generates short descriptions of video content, usually in the form of a single sentence. Many methods have been proposed for this task, and the large MSR Video to Text (MSR-VTT) dataset is often used as the benchmark for testing their performance. However, we found that the human annotations, i.e., the descriptions of video contents in the dataset, are quite noisy: there are many duplicate captions, and many captions contain grammatical problems. These problems may make it harder for video captioning models to learn the underlying patterns. We cleaned the MSR-VTT annotations by removing these problems, then tested several typical video captioning models on the cleaned dataset. Experimental results showed that data cleaning boosted the performance of the models as measured by popular quantitative metrics. We also recruited subjects to evaluate the outputs of a model trained on the original and the cleaned datasets. This human behavior experiment demonstrated that, when trained on the cleaned dataset, the model generated captions that were more coherent and more relevant to the contents of the video clips.
•Identify duplicate captions and grammatical problems in the MSR-VTT dataset.•Clean the dataset and compare model performance before and after data cleaning.•Inspect the impact of each step in data cleaning.•Human evaluation shows the positive impact of data cleaning.
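The duplicate-removal step listed above can be sketched as a simple per-video deduplication after light normalization. This is a hedged illustration of the idea; the full pipeline also addresses grammatical problems, which this sketch omits:

```python
def dedup_captions(captions):
    """Drop duplicate captions, treating captions as equal after lowercasing
    and collapsing whitespace; the first occurrence is kept verbatim."""
    seen, kept = set(), []
    for cap in captions:
        key = " ".join(cap.lower().split())  # normalized comparison key
        if key not in seen:
            seen.add(key)
            kept.append(cap)
    return kept
```

Applied per video clip, this removes exact and near-exact repeats while leaving distinct human descriptions intact.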
Summary
1. Compilation of vegetation databases has contributed significantly to the advancement of vegetation science all over the world. Yet, methodological problems result from the use of plant names, particularly in data that originate from numerous and heterogeneous sources. One of the main problems is the inordinate number of synonyms that can be found in vegetation lists.
2. We present Taxonstand, an R package to automatically standardise plant names using The Plant List (http://www.theplantlist.org). The functions included in this package connect to the online search engine of The Plant List and retrieve information about the current taxonomic status of each species. Where a species name is a synonym, it is replaced by the currently accepted name. In addition, the package can help correct orthographic errors in specific epithets.
3. This tool greatly facilitates the preparation of large vegetation databases prior to analysis, particularly when they cover broad geographical areas (supranational or even continental scale) or contain data from regions with rich floras where taxonomic problems remain unresolved for many taxa. Automated workflows such as the one provided by the Taxonstand package can considerably ease this task by using a widely accessible nomenclatural authority list for plant species names such as The Plant List.
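The core synonym-resolution step can be illustrated with a lookup table. This is a hedged Python sketch, not Taxonstand's actual R interface; the dictionary entries are hypothetical and stand in for live lookups against The Plant List:

```python
# Hypothetical synonym -> accepted-name entries (illustrative only; a real
# workflow would query The Plant List's online search engine).
SYNONYM_TO_ACCEPTED = {
    "Genus synonymum": "Genus acceptum",
}

def standardise(name):
    """Replace a synonym with its accepted name; unknown names are returned
    unchanged so they can be flagged for manual review."""
    return SYNONYM_TO_ACCEPTED.get(name, name)
```

Applied over an entire vegetation list, this collapses synonymous records onto a single accepted name, which is what makes databases from heterogeneous sources comparable.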