In China, ozone pollution has shown an increasing trend and has become the primary air pollutant in warm seasons. Leveraging the air quality monitoring network, a random forest model is developed to predict the daily maximum 8-h average ozone concentrations (O3MDA8) across China in 2015 for human exposure assessment. This model captures the observed spatiotemporal variations of O3MDA8 by using the data of meteorology, elevation, and recent-year emission inventories (cross-validation R2 = 0.69 and RMSE = 26 μg/m3). Compared with chemical transport models that require numerous variables and expensive computation, the random forest model shows comparable or higher predictive performance based on only a handful of readily available variables at much lower computational cost. The nationwide population-weighted O3MDA8 is predicted to be 84 ± 23 μg/m3 annually, with the highest seasonal mean in the summer (103 ± 8 μg/m3). The summer O3MDA8 is predicted to be the highest in North China (125 ± 17 μg/m3). Approximately 58% of the population lives in areas with more than 100 nonattainment days (O3MDA8 > 100 μg/m3), and 12% of the population is exposed to O3MDA8 > 160 μg/m3 (WHO Interim Target 1) for more than 30 days. As the most populous zones in China, the Beijing-Tianjin Metro, Yangtze River Delta, Pearl River Delta, and Sichuan Basin are predicted to have 154, 141, 124, and 98 nonattainment days, respectively. Effective controls of O3 pollution are urgently needed for the highly populated zones, especially the Beijing-Tianjin Metro with seasonal O3MDA8 of 140 ± 29 μg/m3 in summer. To the best of the authors’ knowledge, this study is the first statistical modeling work of ambient O3 for China at the national level. This timely and extensively validated O3MDA8 dataset is valuable for refining epidemiological analyses on O3 pollution in China.
•Spatiotemporal distributions of ambient O3 levels are estimated for China in 2015.
•The random forest model shows good performance with cross-validation R2 of 0.69.
•Evaporation is the most important variable for predicting ambient O3 levels.
•Annual average of population-weighted O3MDA8 is predicted to be 84 ± 23 μg/m3.
•58% of the population lives in areas with more than 100 nonattainment days.
In China 58% of the population lives in areas with more than 100 nonattainment days, and 12% of the population are exposed to O3MDA8>160 μg/m3 for more than 30 days.
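The two exposure metrics quoted above, a population-weighted concentration and the share of people living with more than a given number of exceedance days, can be illustrated with a minimal sketch. The three-cell grid, its values, and the function names are hypothetical and are not taken from the study; the actual analysis works on a national prediction grid.

```python
def population_weighted_mean(concentrations, populations):
    """Average grid-cell concentrations, weighting each cell by its population."""
    total_pop = sum(populations)
    if total_pop <= 0:
        raise ValueError("total population must be positive")
    return sum(c * p for c, p in zip(concentrations, populations)) / total_pop


def share_exposed(exceedance_days, populations, min_days):
    """Fraction of people living in cells with more than min_days exceedance days."""
    total_pop = sum(populations)
    exposed = sum(p for d, p in zip(exceedance_days, populations) if d > min_days)
    return exposed / total_pop


# Hypothetical three-cell grid (illustrative values only, not from the study).
o3 = [60.0, 90.0, 120.0]        # annual mean O3MDA8 per cell, ug/m3
pop = [1_000, 3_000, 6_000]     # residents per cell
days_over_100 = [20, 110, 150]  # days with O3MDA8 > 100 ug/m3 per cell

pw_mean = population_weighted_mean(o3, pop)
exposed_share = share_exposed(days_over_100, pop, min_days=100)
print(pw_mean, exposed_share)
```

Because the weights are populations rather than areas, densely populated cells dominate the average, which is why a population-weighted value can differ markedly from a simple spatial mean.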
•Long-term daily NO2 levels are derived for post-policy evaluation and exposure assessment.
•A common modeling approach (Base-RF) gives biased estimation in back-extrapolation.
•We propose a novel approach named RBE-RF for the bias correction.
•Average NO2 levels for China in 2011 can be underestimated by 22.4% by Base-RF.
•National population exposed to NO2 > 40 µg/m3 is 18.5% by Base-RF and 33.0% by RBE-RF.
Long-term surface NO2 data are essential for retrospective policy evaluation and chronic human exposure assessment. In the absence of NO2 observations for Mainland China before 2013, training a model with 2013–2018 data to make predictions for 2005–2012 (back-extrapolation) could cause substantial estimation bias due to concept drift.
This study aims to correct the estimation bias in order to reconstruct the spatiotemporal distribution of daily surface NO2 levels across China during 2005–2018.
On the basis of ground- and satellite-based data, we proposed the robust back-extrapolation with a random forest (RBE-RF) to simulate the surface NO2 through intermediate modeling of the scaling factors. For comparison purposes, we also employed a random forest (Base-RF), as a representative of the commonly used approach, to directly model the surface NO2 levels.
The validation against Taiwan’s NO2 observations during 2005–2012 showed that RBE-RF adequately corrected the substantial underestimation by Base-RF. The RMSE decreased from 10.1 to 8.2 µg/m3, 7.1 to 4.3 µg/m3, and 6.1 to 2.9 µg/m3 in predicting daily, monthly, and annual levels, respectively. For North China, which has the most severe pollution, the population-weighted NO2 (NO2pw) during 2005–2012 was estimated as 40.2 and 50.9 µg/m3 by Base-RF and RBE-RF, respectively, i.e., a 21.0% difference. While both models predicted that the national annual NO2pw increased during 2005–2011 and then decreased, the interannual trends were underestimated by >50.2% by Base-RF relative to RBE-RF. During 2005–2018, the nationwide population that lived in areas with NO2 > 40 µg/m3 was estimated as 259 and 460 million by Base-RF and RBE-RF, respectively.
With RBE-RF, we corrected the estimation bias in back-extrapolation and obtained a full-coverage dataset of daily surface NO2 across China during 2005–2018, which is valuable for environmental management and epidemiological research.
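As a worked check of the reported 21.0% difference for North China, the short sketch below expresses the Base-RF estimate as an underestimate relative to the RBE-RF estimate. Taking RBE-RF as the denominator is an assumption made here because it reproduces the quoted figure; the function name is hypothetical.

```python
def relative_underestimate(base, corrected):
    """Percent by which `base` falls below `corrected`, relative to `corrected`."""
    return (corrected - base) / corrected * 100.0


# Population-weighted NO2 for North China, 2005-2012 (values from the abstract, ug/m3).
north_china_pct = relative_underestimate(base=40.2, corrected=50.9)
print(round(north_china_pct, 1))
```

The same function can be applied to any pair of Base-RF and RBE-RF estimates, provided the same reference (the corrected value) is intended.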
Visible/near-infrared spectroscopy (vis/NIRS) is a fast, non-destructive, precise, and pollution-free method for quantitative and qualitative analysis. A new approach for discriminating tea varieties by means of vis/NIR spectroscopy (325–1075 nm) was developed in this work. The relationship between the reflectance spectra and tea varieties was established. The spectral data were compressed by the wavelet transform (WT). The features from WT can be visualized in principal component (PC) space, which can reveal structures correlated with the different classes of spectral samples. This visualization appeared to provide a reasonable clustering of the tea varieties. The scores of the first eight principal components computed by PCA were used as inputs to a back-propagation neural network (BP-ANN) with one hidden layer. A total of 200 samples of eight varieties were selected randomly to build the BP-ANN model. This model was then used to predict the varieties of 40 unknown samples, and a recognition rate of 100% was achieved. The model proved to be reliable and practicable.
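The abstract does not state which wavelet or decomposition level was used for spectral compression. As an illustration only, the sketch below uses the Haar wavelet (an assumption) and keeps the approximation coefficients, which halves the spectrum length at each level; the function name and the sample spectrum are hypothetical.

```python
import math


def haar_approximation(signal, levels=1):
    """Keep only the Haar approximation coefficients, halving the length per level."""
    coeffs = list(signal)
    for _ in range(levels):
        if len(coeffs) < 2:
            break
        if len(coeffs) % 2:  # pad odd-length input by repeating the last value
            coeffs.append(coeffs[-1])
        coeffs = [(coeffs[i] + coeffs[i + 1]) / math.sqrt(2)
                  for i in range(0, len(coeffs), 2)]
    return coeffs


# Hypothetical eight-point reflectance spectrum (illustrative values only).
spectrum = [0.20, 0.22, 0.30, 0.34, 0.50, 0.48, 0.40, 0.38]
compressed = haar_approximation(spectrum, levels=2)
print(len(spectrum), "->", len(compressed))
```

The compressed coefficients would then be passed to PCA, and the leading PC scores to the classifier, as described in the abstract.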
Comparative evaluation of SOC baselines between global soil database (HWSD) and the recent soil survey for farmlands using a random forest model.
•A total of 29,927 farmland sites in Zhejiang, East China were surveyed for SOC stock.
•Random forest model showed high predictive performance with R2 of 0.76.
•Maps of fine-resolution SOC stock baseline and its uncertainty were estimated.
•Considerable spatial discrepancies between this study and HWSD were revealed.
•Carbon accounting based on SOC content of HWSD should be reinvestigated.
Soil organic carbon (SOC) is important to soil fertility and the global carbon cycle. Accurate estimates of SOC stock and its dynamics are critical for managing agricultural ecosystems and carbon accounting under climate change, especially for highly cultivated regions. We extensively surveyed the SOC levels at 29,927 sites in Zhejiang province, an intensively cultivated region of East China, from 2007 to 2008. We then estimated the spatial distribution of topsoil (0–30 cm) organic carbon stock using a random forest (RF) model, which is a powerful machine learning algorithm with superior predictive performance over parametric statistical models. The final RF model contained 23 predictor variables, covering soil properties, vegetation, climate, topography, land cover, farming practices, and locations. The RF model showed high performance in predicting the SOC stock, with a coefficient of determination (R2) of 0.76 and a root mean square error (RMSE) of 10.63 t C ha−1. This performance was superior to the General Linear Model (GLM) (R2 = 0.35, RMSE = 19.93 t C ha−1) and the ordinary kriging (OK) method (R2 = 0.57, RMSE = 14.44 t C ha−1), and was equivalent to Boosted Regression Trees (BRT) (R2 = 0.73, RMSE = 11.26 t C ha−1). According to the variable importance evaluation, soil properties were the most important predictor variables, followed by climate and location, with relative importance values of 61%, 17%, and 14%, respectively. The predicted SOC stock ranged from 14.8 to 125.5 t C ha−1, with a mean ± standard deviation of 50.1 ± 12.3 t C ha−1. The mean SOC level obtained from this survey was considerably lower than the value of 60.5 t C ha−1 reported for the same region in the Harmonized World Soil Database (HWSD), which is the most commonly used soil database worldwide. A large spatial discrepancy of SOC stock was observed between this survey and HWSD at regional and sub-regional levels.
This study provided an updated regional baseline map of SOC levels for improving farmland management and refining carbon accounting under climate change.
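The model comparison above rests on two scores, R2 and RMSE. The sketch below shows one common way to compute them; it assumes R2 is defined as one minus the ratio of residual to total sum of squares (the abstract does not say whether this or the squared correlation was used), and the sample values are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between paired observations and predictions."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical SOC stocks in t C ha-1 (illustrative values only).
obs = [40.0, 50.0, 60.0, 70.0]
pred = [42.0, 48.0, 63.0, 67.0]

print(round(rmse(obs, pred), 3), round(r_squared(obs, pred), 3))
```

In a cross-validation setting, both scores would be computed on held-out sites only, so that they reflect predictive rather than fitted performance.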
Ginkgo biloba L. is a rare dioecious species that is valued for its diverse applications and is cultivated globally. This study aimed to develop a rapid and effective method for determining the sex of Ginkgo biloba trees. Green and yellow leaves representing annual growth stages were scanned with a hyperspectral imager, and classification models for RGB images, spectral features, and a fusion of spectral and image features were established. Initially, a ResNet101 model classified the RGB dataset using the proportional scaling–background expansion preprocessing method, achieving an accuracy of 90.27%. Further, machine learning algorithms such as support vector machine (SVM), linear discriminant analysis (LDA), and subspace discriminant analysis (SDA) were applied. Optimal results were achieved with SVM and SDA in the green leaf stage and LDA in the yellow leaf stage, with prediction accuracies of 87.35% and 98.85%, respectively. To fully utilize the optimal model, a two-stage Period-Predetermined (PP) method was proposed, and a fusion dataset was built using the spectral and image features. The overall accuracy for the prediction set was as high as 96.30%. This is the first study to establish a standard technique framework for Ginkgo sex classification using hyperspectral imaging, offering an efficient tool for industrial and ecological applications and the potential for classifying other dioecious plants.
Global climate change is a serious threat to food and energy security. Crop growth modelling is an important tool for simulating crop food production and assisting in decision making. Planting date is one of the important model parameters. Accurate large-scale spatial distributions of planting dates are essential for the widespread application of crop growth models. In this study, a planting date prediction method based on environmental similarity was developed in accordance with the third law of geography. Spring maize planting date observations from 124 agricultural meteorological experiment stations in China over the years 1992–2010 were used as the data source. Samples spanning from 1992 to 2009 were allocated as training data, while samples from 2010 constituted the independent validation set. The results indicated that the root mean square error (RMSE) for spring maize planting date based on environmental similarity was 10 days in 2010, which is better than that of multiple regression analysis (RMSE = 13 days). Additionally, when applied at varying scales, the accuracy of national-scale prediction was better than that of regional-scale prediction in areas with large differences in planting dates. Consequently, the method based on environmental similarity can effectively and accurately estimate planting date parameters at multiple scales and provide reasonable parameter support for large-scale crop growth modelling.
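The abstract does not give the similarity function, so the following is only a sketch of the general idea of prediction by environmental similarity: locations with more similar environments contribute more to the estimate. The Gaussian per-covariate similarity, the minimum operator for combining covariates, the covariates themselves, and all names and values are assumptions for illustration.

```python
import math


def similarity(env_a, env_b, scales):
    """Overall similarity: the minimum Gaussian similarity across covariates."""
    per_variable = [math.exp(-0.5 * ((a - b) / s) ** 2)
                    for a, b, s in zip(env_a, env_b, scales)]
    return min(per_variable)


def predict_planting_day(target_env, stations, scales):
    """Similarity-weighted mean of station planting dates (day of year)."""
    weights = [similarity(target_env, env, scales) for env, _ in stations]
    total = sum(weights)
    if total == 0:
        raise ValueError("no station is similar enough to the target")
    return sum(w * day for w, (_, day) in zip(weights, stations)) / total


# Hypothetical stations: ((mean spring temperature in C, precipitation in mm), planting day).
stations = [((8.0, 400.0), 120), ((12.0, 600.0), 105)]
scales = (2.0, 100.0)  # assumed tolerance per covariate

midway = predict_planting_day((10.0, 500.0), stations, scales)
near_first = predict_planting_day((8.0, 400.0), stations, scales)
print(midway, round(near_first, 1))
```

A target that sits midway between the two stations receives equal weights, while a target matching the first station's environment is pulled toward that station's planting date.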
A new approach for the rapid and non-destructive discrimination of yogurt varieties by visible/NIR spectroscopy was put forward. The spectral curves of five typical kinds of yogurt were analyzed by principal component analysis (PCA) to cluster the yogurt varieties. The results showed that the cumulative reliability of the first two principal components (PC1, PC2) was more than 98.9%, and that of the first seven principal components (PC1 to PC7) was 99.97%. In addition, a back-propagation artificial neural network (BP-ANN) model was set up. The first seven principal components of the samples were used as BP-ANN inputs and the values of the yogurt type were used as outputs to build a three-layer BP-ANN. With this model, the discrimination of yogurt varieties became possible, and the discrimination rate for the five yogurt varieties was satisfactory. The results indicated that this model was reliable and practicable.
In September 2018, China’s air quality monitoring protocol was amended from the standard conditions to actual conditions for particulate matter and to reference conditions for gaseous pollutants. Due to the amendment, the reported concentrations of the gaseous pollutants decreased by a constant rate of 8.4%, and the averages of PM2.5 (particulate matter that has an aerodynamic diameter of 2.5 microns or smaller) reported between September 2017 and August 2018 decreased by 7.9 ± 6.1% at 99% of the monitoring stations. Comparing the periods before and after the amendment, the 12-month PM2.5 concentrations at 17.2% of the stations actually increased despite appearing to decrease if the amendments were not considered. We reviewed 370 papers published in 2020 that utilized this air quality dataset, and 21% of these papers used the data before and after the amendment without explicitly stating whether or how conversions were conducted. It is urgent to widely broadcast the protocol amendment to ensure proper use of this extensively cited dataset.
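The constant 8.4% reduction for gaseous pollutants follows from the ideal gas law if the two reporting states differ only in temperature. The sketch below assumes a standard state of 273.15 K and a reference state of 298.15 K at the same pressure; these temperatures and the function name are assumptions stated here, not values given in the abstract.

```python
STANDARD_T_K = 273.15   # assumed temperature of the former standard state
REFERENCE_T_K = 298.15  # assumed temperature of the new reference state


def standard_to_reference(conc_standard):
    """Convert a gas mass concentration from standard to reference conditions.

    At equal pressure, the same mass of gas occupies a larger volume at the
    higher temperature, so the mass concentration scales by T_std / T_ref.
    """
    return conc_standard * STANDARD_T_K / REFERENCE_T_K


reduction_pct = (1 - standard_to_reference(1.0)) * 100
print(round(reduction_pct, 1))
```

Multiplying pre-amendment gaseous concentrations by this factor (or dividing post-amendment values by it) puts both periods on a common basis before they are compared. The PM2.5 change to actual conditions depends on local temperature and pressure, so it cannot be captured by a single constant.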
A high degree of uncertainty associated with the emission inventory for China tends to degrade the performance of chemical transport models in predicting PM2.5 concentrations, especially on a daily basis. In this study, a novel machine learning algorithm, Geographically-Weighted Gradient Boosting Machine (GW-GBM), was developed by improving GBM through building spatial smoothing kernels to weigh the loss function. This modification addressed the spatial nonstationarity of the relationships between PM2.5 concentrations and predictor variables such as aerosol optical depth (AOD) and meteorological conditions. GW-GBM also overcame the estimation bias of PM2.5 concentrations due to missing AOD retrievals, and thus potentially improved subsequent exposure analyses. GW-GBM showed good performance in predicting daily PM2.5 concentrations (R2 = 0.76, RMSE = 23.0 μg/m3) even with partially missing AOD data, which was better than the original GBM model (R2 = 0.71, RMSE = 25.3 μg/m3). On the basis of the continuous spatiotemporal prediction of PM2.5 concentrations, it was predicted that 95% of the population lived in areas where the estimated annual mean PM2.5 concentration was higher than 35 μg/m3, and 45% of the population was exposed to PM2.5 > 75 μg/m3 for over 100 days in 2014. GW-GBM accurately predicted continuous daily PM2.5 concentrations in China for assessing acute human health effects.
•A novel machine learning model for predicting daily PM2.5 concentrations in China.
•This model shows superior predictive performance and is able to handle missing data.
•>90% of the population lived in areas with annual mean PM2.5 > 35 μg/m3.
•>40% of the population was exposed to PM2.5 > 75 μg/m3 for over 100 days in a year.
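The idea of weighting a loss function with a spatial smoothing kernel can be sketched as follows. The Gaussian kernel form, the bandwidth, and the squared-error loss are assumptions chosen for illustration; the abstract does not specify the kernel or loss used in GW-GBM, and this sketch omits the boosting procedure itself.

```python
import math


def gaussian_kernel_weights(distances_km, bandwidth_km):
    """Weight each training sample by its distance to the location of interest."""
    return [math.exp(-0.5 * (d / bandwidth_km) ** 2) for d in distances_km]


def weighted_squared_loss(observed, predicted, weights):
    """Squared-error loss in which nearby samples count more than distant ones."""
    total_w = sum(weights)
    return sum(w * (o - p) ** 2
               for o, p, w in zip(observed, predicted, weights)) / total_w


# Hypothetical samples at 0, 100, and 500 km from the prediction location.
weights = gaussian_kernel_weights([0.0, 100.0, 500.0], bandwidth_km=200.0)
loss = weighted_squared_loss([80.0, 60.0, 30.0], [75.0, 66.0, 60.0], weights)
print([round(w, 3) for w in weights], round(loss, 2))
```

With such weights, a large error at a distant station contributes little to the loss, so the fitted relationship is allowed to vary from place to place.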
Satellite-retrieved aerosol optical depth (AOD) is commonly used to estimate ambient levels of fine particulate matter (PM2.5), though it is important to mitigate the estimation bias of PM2.5 due to gaps in satellite-retrieved AOD. A nonparametric approach with two random-forest submodels is proposed to estimate PM2.5 levels by filling gaps in satellite-retrieved AOD. This novel approach was employed to estimate the spatiotemporal distribution of daily PM2.5 levels during 2013–2015 in the Sichuan Basin of Southwest China, where the coverage rate of composite AOD retrieved by the Terra and Aqua satellites was only 11.7%. Based on the retrieved AOD and various covariates (including meteorological conditions and land use types), the first random-forest submodel (named AOD-submodel) was trained to fill the gaps in the AOD dataset, giving a cross-validation R2 of 0.95. Subsequently, the second random-forest submodel (named PM2.5-submodel) was trained to estimate the PM2.5 levels for unmonitored areas/days based on the gap-filled AOD, ground-monitored PM2.5 levels, and the covariates, and achieved a cross-validation R2 of 0.86. By comparing the complete and incomplete (i.e., without the days when AOD data were missing) estimates, we found that the monthly PM2.5 levels could be overestimated by 34.6% if the PM2.5 values coincident with AOD gaps were not considered. The newly developed approach is valuable for deriving the complete spatiotemporal distribution of daily PM2.5 from incomplete remote-sensing data, which is essential for air quality management and human exposure assessment.
•Low AOD coverage in regions like Sichuan Basin (11.7%) caused PM2.5 estimation bias.
•Novel approach employed two random forests to fill AOD gaps and predict PM2.5.
•High cross-validation R2 achieved for AOD- (0.95) and PM2.5-submodels (0.86).
•Monthly PM2.5 in Sichuan Basin could be overestimated by 34.6% due to missing AOD.
•Annual average PM2.5 in Sichuan decreased from 67.7 to 50.1 μg/m3 during 2013–2015.
A two-stage and random-forest-based approach is proposed to fill gaps in satellite-retrieved AOD for estimating full-coverage ambient PM2.5 concentrations.
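The two-stage structure can be sketched with a deliberately simple stand-in: each random-forest submodel is replaced here by a one-variable least-squares line so that the example stays self-contained. The single humidity covariate, the daily records, and the function name are hypothetical; only the order of the two stages follows the abstract.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (stand-in for a random forest)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical daily records: (humidity, AOD or None if missing, PM2.5 or None if unmonitored).
days = [
    (40.0, 0.4, 50.0),
    (60.0, 0.6, 70.0),
    (80.0, None, 90.0),
    (50.0, 0.5, None),
    (70.0, None, None),
]

# Stage 1 (AOD-submodel): learn AOD from the covariate where AOD exists, then fill the gaps.
with_aod = [(h, a) for h, a, _ in days if a is not None]
s1, b1 = fit_line([h for h, _ in with_aod], [a for _, a in with_aod])
filled_aod = [a if a is not None else s1 * h + b1 for h, a, _ in days]

# Stage 2 (PM2.5-submodel): learn PM2.5 from gap-filled AOD on monitored days, predict all days.
monitored = [(a, d[2]) for a, d in zip(filled_aod, days) if d[2] is not None]
s2, b2 = fit_line([a for a, _ in monitored], [pm for _, pm in monitored])
pm_estimates = [s2 * a + b2 for a in filled_aod]
print([round(v, 1) for v in pm_estimates])
```

Because stage 2 is trained on gap-filled AOD, days without a satellite retrieval still enter the PM2.5 model, which is the mechanism the abstract credits for avoiding the sampling bias of AOD-only days.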