Imbalanced response variable distribution is a common occurrence in data science. In fields such as fraud detection, medical diagnostics, system intrusion detection, and many others where abnormal behavior is rarely observed, the data under study often feature a disproportionate target class distribution. One common way to combat class imbalance is to resample the minority class to achieve a more balanced distribution. In this paper, we investigate the performance of a sampling method based on kernel density estimation (KDE). We believe that KDE offers a more natural way to generate new instances of the minority class and is less prone to overfitting than other standard sampling techniques. It is based on the well-established theory of nonparametric statistical estimation. Numerical experiments show that KDE can outperform other sampling techniques on a range of real-life datasets as measured by F1-score and G-mean. The results remain consistent across a number of classification algorithms used in the experiments. Furthermore, the proposed method outperforms the benchmark methods regardless of the class distribution ratio. We conclude, based on the solid theoretical foundation and strong experimental results, that the proposed method would be a valuable tool in problems involving imbalanced class distribution.
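As a rough illustration of KDE-based oversampling, the sketch below draws synthetic minority-class points from a Gaussian KDE. It is not the authors' implementation: the function name `kde_oversample`, the Gaussian kernel, and the per-feature rule-of-thumb bandwidth are all assumptions made for this example. It relies on the fact that sampling from a Gaussian KDE amounts to picking an observed point at random and perturbing it with Gaussian noise.

```python
import random
import statistics


def kde_oversample(minority, n_new, bandwidth=None, seed=0):
    """Draw n_new synthetic points from a Gaussian KDE fitted to `minority`.

    `minority` is a list of feature vectors (lists of floats). The name and
    signature are hypothetical, chosen for this sketch only.
    """
    rng = random.Random(seed)
    n = len(minority)
    d = len(minority[0])
    if bandwidth is None:
        # Assumption: Scott-style rule of thumb applied per feature,
        # h_j = sigma_j * n^(-1/(d+4)), i.e. a diagonal bandwidth.
        factor = n ** (-1.0 / (d + 4))
        bandwidth = [
            statistics.stdev([row[j] for row in minority]) * factor
            for j in range(d)
        ]
    synthetic = []
    for _ in range(n_new):
        # Pick one observed minority point uniformly at random ...
        center = rng.choice(minority)
        # ... and add zero-mean Gaussian noise scaled by the bandwidth.
        synthetic.append([rng.gauss(center[j], bandwidth[j]) for j in range(d)])
    return synthetic


# Usage: add six synthetic points to a four-point minority class.
minority_class = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.2], [0.2, 0.8]]
augmented = minority_class + kde_oversample(minority_class, n_new=6)
```

Unlike plain duplication of minority points, the new points are spread around the observed ones, with the bandwidth controlling how far they may stray.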
Overcoming concerns about bicycling safety is critical to increasing the health benefits of bicycling for transportation. While exposure measures are critical for monitoring and understanding bike safety, the lack of spatially and temporally detailed bike counts makes it challenging to conduct robust bicycling safety studies. Crowdsourced data from smartphone apps like Strava provide counts for nearly all individual road and trail sections with 1-min temporal resolution. Researchers have found that patterns of Strava bicyclists are similar to those of all bicyclists in our study area. In this paper, we develop and test a method to normalize bike safety incident hotspots using exposure estimated from Strava data for Ottawa, Canada. We mapped incident hotspots normalized by exposure at increasingly detailed temporal scales. In a dataset with more than 8 million Strava activities and 395 incidents (approximately 20,000 Strava activities per incident), adjusting for exposure moved incident hotspots away from protected bike lanes and multi-use paths and onto commercial streets with no bike infrastructure. Strava data are available to correct for exposure where other measures are not available. We encourage researchers, planners, and public health practitioners to consider crowdsourced data to fill exposure data gaps and provide context for interpreting incident data.
•We present a method of correcting incident kernel density estimates for bicycling exposure.
•Strava Metro provides spatially and temporally detailed exposure for bicycling incidents.
•In Ottawa, multi-use trails had low safety burden and risk.
•Correcting for exposure highlighted major streets as key locations for safety interventions.
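One way to picture exposure correction is as a ratio of two kernel density surfaces: the density of incidents divided by the density of ridership. The sketch below does this in one dimension (position along a corridor). It is an assumption-laden illustration, not the paper's procedure: the function names, the Gaussian kernel, the fixed bandwidth, and the toy numbers are all made up for the example.

```python
import math


def weighted_kde(points, x, bandwidth, weights=None):
    """Gaussian kernel density of `points` (optionally weighted) at location x."""
    if weights is None:
        weights = [1.0] * len(points)
    norm = sum(weights) * bandwidth * math.sqrt(2.0 * math.pi)
    total = sum(
        w * math.exp(-0.5 * ((x - p) / bandwidth) ** 2)
        for p, w in zip(points, weights)
    )
    return total / norm


def exposure_normalized_density(incidents, counter_pos, counter_counts,
                                grid, bandwidth, eps=1e-12):
    """Incident density divided by exposure density at each grid location.

    Values above 1 mean a location has a larger share of incidents than of
    ridership; `eps` guards against division by zero where nobody rides.
    """
    ratios = []
    for x in grid:
        incident_density = weighted_kde(incidents, x, bandwidth)
        exposure_density = weighted_kde(counter_pos, x, bandwidth, counter_counts)
        ratios.append(incident_density / (exposure_density + eps))
    return ratios


# Toy corridor: a busy path at km 1 and a quiet street at km 5 (invented data).
incident_km = [1.0, 1.0, 1.0, 5.0, 5.0]
ratios = exposure_normalized_density(
    incident_km, counter_pos=[1.0, 5.0], counter_counts=[9000.0, 1000.0],
    grid=[1.0, 5.0], bandwidth=0.5,
)
```

In this toy case the busy path has more incidents in absolute terms, yet the quiet street has the higher ratio once ridership is accounted for, which mirrors the kind of hotspot shift described in the abstract.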
•Maritime accident data analysis provides a decision-making basis for stakeholders.
•Geospatial techniques and methods for maritime accident evaluation are developed.
•Spatial patterns of global maritime accidents are explored.
•Characteristics of global maritime accidents are found to be diverse in different regions.
Maritime safety has become one of the top concerns of the global maritime sector in recent years. This paper explores the spatial patterns and characteristics of maritime accidents on a global scale. Maritime accident data from 2003 to 2018 were collected from the Marine Casualties and Incidents (MCI) module of the Global Integrated Shipping Information System (GISIS) and processed, and descriptive analyses were then conducted to obtain the overall profile of global maritime accidents. The geospatial techniques of Kernel Density Estimation (KDE) and K-means clustering were introduced, and the parameters specifically used in this study were identified. These techniques were used to a) create a number of KDE maps of global maritime accidents, and b) subdivide these accidents into six classes and profile the characteristics of maritime accidents within each class. Maritime accidents are more likely to occur around the United Kingdom, Denmark, Singapore, and Shanghai, China. This may be due to the large cargo volume, the high density of routes, the poor geographical conditions of the sea area, and the poor climate conditions in these areas. Distributions of maritime accidents by time, initial event, and ship type are found to be diverse in different accident classes.
To diagnose and evaluate ecosystem health in the Yuxi three-lake watershed, this paper presents the changing trend of its health state and predicts its future development. This also provides ideas for maintaining regional ecosystem health and gradually improving the quality of the ecological environment. Taking Fuxian Lake, Qilu Lake and Xingyun Lake (the three-lake watershed) in Yuxi City, Yunnan Province, Southwest China as the research object, a model combining pressure-state-response and kernel density estimation (PSR-KDE) is adopted to diagnose and evaluate the ecosystem health of the three-lake watershed from 2010 to 2020. A distribution map of the ecosystem health index is obtained by integrating the evaluation indexes through GIS spatial analysis, so that the evaluation results are visualized on the map. The results show that the ecosystem health index in the study area ranged from 0.1530 to 0.7045 in 2010, from 0.2056 to 0.7512 in 2015, and from 0.2248 to 0.7662 in 2020. In 2010, 0.12% of the area was in the pathological state. After 2015, the pathological condition of ecosystem health was completely resolved, and the proportion of unhealthy ecosystems was 11.95% in 2010, 7.38% in 2015, and 5.97% in 2020. The ecosystem health index of the study region was 0.5523 in 2010, 0.5807 in 2015, and 0.5815 in 2020, indicating that the ecosystem was in a sub-healthy state. From 2010 to 2020, the ecosystem health around Qilu Lake was the most worrying, followed by the northwest of Fuxian Lake and the northern and southern regions of Xingyun Lake. The ecosystem health of the three-lake watershed showed significant improvement from 2010 to 2020. The study of ecosystem health assessment and early warning in the three-lake watershed is significant for the ecological environment protection and management of the plateau lake basin, the restoration of the territorial space ecology, and the economic development of the surrounding area.
A number of traffic crash databases at present contain the precise positions and dates of these events. This feature allows for detailed spatiotemporal analysis of traffic crash patterns.
We applied a clustering method for identification of traffic crash hotspots to the rural parts of primary roads in the Czech road network (3,933 km), where 55,296 traffic crashes occurred over 2010–2018. The data were analyzed using a 3-year time window which moved forward with a one-day step as the elementary temporal resolution. The spatiotemporal behavior of hotspots could therefore be analyzed in great detail.
Over the monitored nine-year period, the identified hotspots covered between 6.8% and 8.2% of the length of the road network in question. The percentage of traffic crashes within the hotspots remained stable over time at approximately 50%. Three elementary types of hotspot behavior were identified when analyzing spatiotemporal crash patterns: hotspot emergence, stability and disappearance. Only 100 hotspots were stable (remained in approximately the same position) over the entire nine-year period. This approach can be applied to any traffic-crash time series when the precise positions and dates of crashes are available.
•We studied the spatiotemporal behavior of hotspots identified using the KDE+ method.
•Crash data were analyzed using a 3-year time window with a one-day step.
•All the hotspots covered between 6.8% and 8.2% of the road network.
•Hotspots evolved over time, emerged or disappeared.
•Only 100 hotspots were stable over the nine-year period (2010–2018).
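The moving-window scheme described above (a fixed-length window advanced one day at a time) can be sketched as follows. This only illustrates the windowing, not the KDE+ hotspot detection itself. The function name, the use of a day count for the window length, and the per-window crash count are assumptions made for the example. In a real analysis each window's crashes would be passed to the hotspot method.

```python
from bisect import bisect_left
from datetime import date, timedelta


def sliding_window_counts(crash_dates, first_day, last_day,
                          window_days=3 * 365, step_days=1):
    """Count crashes in every window [start, start + window_days).

    Windows begin at `first_day` and advance by `step_days` until the window
    would extend past `last_day`. Returns a list of (window_start, count).
    """
    dates = sorted(crash_dates)
    window = timedelta(days=window_days)
    step = timedelta(days=step_days)
    limit = last_day + timedelta(days=1)  # exclusive upper bound of the data
    results = []
    start = first_day
    while start + window <= limit:
        end = start + window  # exclusive end of this window
        # Binary search gives the number of dates falling in [start, end).
        count = bisect_left(dates, end) - bisect_left(dates, start)
        results.append((start, count))
        start += step
    return results


# Usage with invented crash dates and a short 3-day window for readability.
example = sliding_window_counts(
    [date(2010, 1, 1), date(2010, 1, 3), date(2010, 1, 10)],
    first_day=date(2010, 1, 1), last_day=date(2010, 1, 5), window_days=3,
)
```

Comparing hotspot sets between consecutive windows is then what would reveal the emergence, stability, or disappearance of individual hotspots.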
Robust process monitoring and reliable fault isolation in industrial processes usually encounter different challenges, including process nonlinearity and noise interference. In this brief, a novel method, denoising autoencoder and elastic net (DAE-EN), is proposed to solve the aforementioned issues by effectively integrating DAE and EN. The DAE is first trained to robustly capture the nonlinear structure of the industrial data. Then, the encoder network is updated into a sparse model using EN, so that the key variables associated with each neuron can be selected. After that, two statistics are developed based on the extracted systematic structure and the retained residual information. In addition, another statistic is constructed by combining the aforementioned two statistics to provide an overall measurement for the process sample. In this way, a robust monitoring model can be constructed to monitor the abnormal status in industrial processes. After the fault is detected, the faulty neurons are identified by the sparse exponential discriminant analysis, so that the associated faulty variables along each faulty neuron can be isolated. Two real industrial processes are used to validate the performance of the proposed method. Experimental results show that the proposed method can effectively detect the abnormal samples in industrial processes and accurately isolate the faulty variables from the normal ones.
Nations across the world share a common responsibility towards achieving the Sustainable Development Goals (SDGs). To monitor the progress of individual goals and compare them at the global level, a set of targets and indicators has been developed by experts. However, systematic methods for assessing spatio-temporal progress towards achieving the SDGs are lacking. This study demonstrates the use of geographically referenced information (GIS) analysis in mapping the SDGs as achieved under the Mahatma Gandhi National Rural Employment Guarantee Act (MGNREGA) programme in India, taking Uttarakhand state as a case study. Geotagged data of assets representing various work categories permissible under MGNREGA are linked to the targets and indicators of various SDGs. The Kernel Density Estimation (KDE) function is used to derive spatially explicit maps. A sub-national-level composite analysis of the overall contribution of MGNREGA to the SDGs is carried out district-wise for better understanding. The results show significant spatial variation in the distribution of works across the districts, reflecting their varying priorities, as MGNREGA is a demand-driven scheme. The future implication of the study is a vastly improved ability to derive latent information based on geographical indicators for targeting interventions and developing informed strategies towards the SDGs.
Short-term wind power forecast (WPF) depends highly on the wind speed forecast (WSF), which is the prime contributor to the forecasting error. To achieve more accurate WPF results, this article proposes a wind speed correction method to improve the WSF result obtained by using the weather research and forecasting (WRF) model. First, the WRF model is constructed to forecast the wind speed, and its performance is analyzed. Second, a novel hidden Markov model (HMM) is developed to explore both the temporal autocorrelation of the WSF error and the nonlinear correlation between the WSF result and the error. In the model, fuzzy C-means clustering is introduced to properly divide the hidden state space of the HMM, and the emission probability of the HMM is made continuous by kernel density estimation (KDE) to make full use of the observation information. Through these modifications, the proposed HMM is better suited to wind speed correction. Third, the HMM is solved by the Viterbi algorithm and the minimum mean-square error regulation to correct the predicted wind speed. Finally, the deterministic and probabilistic WPF results are obtained by using another KDE model. The proposed method is demonstrated to be superior to the benchmarks in case studies.
Early detection of incipient faults in industrial processes is becoming increasingly important, as these faults can slowly develop into serious abnormal events, an emergency situation, or even failure of critical equipment. Multivariate statistical process monitoring methods are currently established for abrupt fault detection. Among these, canonical variate analysis (CVA) has proven effective for dynamic process monitoring. However, the traditional CVA indices may not be sensitive enough for incipient faults. In this work, an extension of CVA, called canonical variate dissimilarity analysis (CVDA), is proposed for incipient fault detection in nonlinear dynamic processes under varying operating conditions. To handle non-Gaussian distributed data, kernel density estimation was used for computing detection limits. A CVA dissimilarity-based index has been demonstrated to outperform traditional CVA indices and other dissimilarity-based indices, namely the dissimilarity analysis, recursive dynamic transformed component statistical analysis, and generalized canonical correlation analysis, in terms of sensitivity when tested on slowly developing multiplicative and additive faults in a continuous stirred-tank reactor under closed-loop control and varying operating conditions.
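To illustrate how a KDE can supply a detection limit for a non-Gaussian monitoring index, the sketch below fits a Gaussian KDE to index values from normal operation and finds the threshold whose estimated cumulative probability equals 1 - alpha. It is a generic sketch, not the CVDA procedure: the function names, the Gaussian kernel, the normal-reference bandwidth rule, and the bisection search are assumptions made for this example.

```python
import math
import statistics


def kde_cdf(x, samples, h):
    """CDF of a Gaussian KDE: the average of normal CDFs centered on each sample."""
    root2 = math.sqrt(2.0)
    return sum(0.5 * (1.0 + math.erf((x - s) / (h * root2)))
               for s in samples) / len(samples)


def kde_detection_limit(samples, alpha=0.01, h=None, tol=1e-9):
    """Smallest threshold t with estimated P(index <= t) >= 1 - alpha.

    `samples` are monitoring-index values collected under normal operation.
    """
    if h is None:
        # Assumption: normal-reference rule of thumb, h = 1.06 * sigma * n^(-1/5).
        h = 1.06 * statistics.stdev(samples) * len(samples) ** (-0.2)
    target = 1.0 - alpha
    # The KDE CDF is monotone, so bisection on a bracket that safely
    # contains nearly all of the probability mass converges to the limit.
    lo = min(samples) - 10.0 * h
    hi = max(samples) + 10.0 * h
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kde_cdf(mid, samples, h) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Usage with invented, right-skewed index values from normal operation.
normal_index = [0.2, 0.3, 0.35, 0.4, 0.5, 0.55, 0.7, 0.9, 1.4, 2.1]
limit_99 = kde_detection_limit(normal_index, alpha=0.01, h=0.3)
```

A new sample whose index exceeds the limit would be flagged as a potential fault. Because the limit comes from the estimated density rather than a Gaussian assumption, it adapts to skewed or multimodal index distributions.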