Recent academic and industry reports confirm that web robots dominate the traffic seen by web servers across the Internet. Because web robots crawl in an unregulated fashion, they may threaten the privacy, function, performance, and security of web servers. There is therefore a growing need to identify robot visitors automatically, both offline and in real time, to assess their impact and to protect web servers from abusive bots. Yet contemporary detection approaches, which rely on syntactic log analysis, statistical differences between robot and human traffic, analytical learning techniques, or complex software modifications, may be unrealistic to implement or may lose effectiveness as robot behavior evolves over time. Instead, this paper presents a novel detection approach that relies on the differences in the resource request patterns of web robots and humans. It rationalizes why these differences are expected to remain intrinsic to robots and humans despite the continuous evolution of their traffic. The performance of the approach, adoptable in both offline and real-time settings with a simple implementation, is demonstrated by playing back streams of actual web traffic with varying session lengths and proportions of robot requests.
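The core idea above — that robots and humans differ in which resource types they request — could be sketched as follows. The resource taxonomy, extensions, and threshold here are hypothetical illustrations; the paper's actual features and classifier are not specified in this abstract.

```python
from collections import Counter

# Hypothetical resource taxonomy; the paper's actual feature set is not given here.
RESOURCE_TYPES = {
    "html": (".html", ".htm", "/"),
    "img": (".png", ".jpg", ".gif"),
    "style": (".css", ".js"),
}

def request_pattern(session):
    """Map a session (list of requested paths) to proportions per resource type."""
    counts = Counter()
    for path in session:
        for rtype, exts in RESOURCE_TYPES.items():
            if path.endswith(exts):
                counts[rtype] += 1
                break
        else:
            counts["other"] += 1
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def looks_like_robot(session, img_threshold=0.1):
    """Illustrative rule: human browsers also fetch embedded images and styles,
    so a session dominated by pages with almost no images resembles a crawler."""
    p = request_pattern(session)
    return p.get("html", 0) > 0.5 and p.get("img", 0) < img_threshold
```

A real detector would learn such pattern differences from labeled traffic rather than hard-code a threshold, but the request-pattern representation is the same.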
Nowadays, internet public participation has become a powerful tool for supervising the polluting activities of enterprises. However, few studies have focused on the effect of internet public participation on environmental protection. Thus, this paper uses network platform data (Sina Weibo and Baidu) from 2013 to 2018 to test whether internet public participation can help control environmental pollution emissions. First, the paper uses the number of micro-blog posts containing the Chinese keywords "environmental protection", "haze", "water pollution", and "air pollution", together with their Baidu index, to measure the level of internet public participation. Second, the effect of internet public participation on four environmental pollutants, namely industrial sulfur dioxide, industrial soot, industrial wastewater, and industrial solid waste, is explored using a mediating effect model. In addition, we examine the government's mediating effect on pollutant emissions. The study finds that internet public participation significantly (p≤0.05) reduces the discharge of industrial wastewater, and that the mediating effect of government on pollutant emissions is significant (p≤0.05). Regionally, internet public participation has significantly reduced the discharge of industrial sulfur dioxide (p≤0.1), industrial wastewater (p≤0.05), and industrial solid waste (p≤0.05) in the east.
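A mediating effect model of the kind described decomposes the total effect of participation (X) on emissions (Y) into a direct part and an indirect part running through a mediator (M), here government action. A minimal sketch of the classic three-regression decomposition (the study's actual specification, controls, and data are not reproduced here):

```python
import numpy as np

def ols(y, X):
    """OLS coefficients with an intercept column prepended."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta  # [intercept, slope_1, slope_2, ...]

def mediation(x, m, y):
    """Baron-Kenny style mediation:
    total effect c (y ~ x), path a (m ~ x),
    path b (y ~ x + m, coefficient on m), indirect effect a*b."""
    c = ols(y, x.reshape(-1, 1))[1]
    a = ols(m, x.reshape(-1, 1))[1]
    b = ols(y, np.column_stack([x, m]))[2]
    return {"total": c, "a": a, "b": b, "indirect": a * b}
```

When the effect is fully mediated, the indirect effect a·b accounts for the whole total effect c; significance in practice would be assessed with Sobel or bootstrap tests.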
•The effect of internet public participation on environmental protection is explored.
•Network platform data (Sina Weibo and Baidu) are explored.
•The government’s intermediary effect on environmental pollutants is explored.
A general election is the most prominent political activity for electing people's representatives in democratic countries. In the information era, the veracity and volume of available information present opportunities to inform decision-making about election candidates. This research outlines the development of a business-analytics electoral recommender system. The system aims to recommend political candidates based on factors such as demographics, political preferences, and socio-economic status. We propose an election-candidate recommender system using a hybrid of collaborative filtering (CF) and a knowledge-based (KB) model. The KB approach uses a web crawler to scrape and process relevant information from various internet sources and social media to nominate representative candidates based on the voter's preferences. We applied the Cross Industry Standard Process for Data Mining (CRISP-DM) to the web scraping of unstructured and semi-structured, static and dynamic information about representative candidates. The recommendation uses the CF approach based on criteria for informed electoral decision-making. The user acceptance test yielded scores of 86% for ease of use and 78.5% for the perceived usefulness of the proposed system's key features. The average percentage for behavioral intentions and attitudes toward actual use of the system was 88.5%, indicating that respondents were generally satisfied with the proposed system.
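The CF half of such a hybrid can be illustrated with a minimal user-based scheme: score unrated candidates by the ratings of similar voters, weighted by cosine similarity. The voters, candidates, and ratings below are invented for illustration; the paper's actual CF criteria are not detailed in this abstract.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse rating dicts {item: rating}."""
    num = sum(u[k] * v[k] for k in u if k in v)
    den = math.sqrt(sum(x * x for x in u.values())) * \
          math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def recommend(target, ratings, top_n=1):
    """Rank candidates the target voter has not rated by the
    similarity-weighted ratings of the other voters."""
    scores = {}
    for other in ratings.values():
        s = cosine(target, other)
        for cand, val in other.items():
            if cand not in target:
                scores[cand] = scores.get(cand, 0.0) + s * val
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

In the hybrid described, the KB component would supply the candidate profiles (crawled from the web) that seed or constrain this CF ranking.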
With the arrival of the era of big data, people have gradually realized the importance of data: data is not just a resource but an asset. This paper studies the realization of Web data mining technology based on Python. It analyzes the overall architecture design of a distributed web crawler system and then analyzes in detail the principles of the crawler's URL function module, web-crawl function module, web-page parsing function module, and data-storage function module. Each function module of the crawler system was tested on an experimental computer, and the resulting data were summarized for comparative analysis. The main contribution of this paper lies in the design and implementation of a distributed web crawler system that, to a certain extent, solves the slow speed, low efficiency, and poor scalability of a traditional single-machine web crawler and improves the speed and efficiency with which web-page data are fetched.
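The four-module decomposition named above (URL, crawl, parse, store) can be sketched in standard-library Python. This single-process sketch abstracts the network behind a `fetch` callable and uses an in-memory dict as the storage module; the paper's distributed system would replace these with real HTTP clients and a shared queue/store.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Web-page parsing module: pull href links out of fetched HTML."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, fetch, max_pages=10):
    """URL module (frontier queue + dedup set) driving fetch, parse, store.
    `fetch(url) -> html` abstracts the web-crawl module; real code would
    use urllib.request or an async HTTP client."""
    frontier, seen, store = deque([seed]), {seed}, {}
    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        html = fetch(url)            # web-crawl module
        store[url] = html            # data-storage module (in-memory stand-in)
        parser = LinkExtractor()     # web-page parsing module
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return store
```

Distributing this design mainly means sharding the frontier and dedup set across workers, which is where the scalability gains the paper reports come from.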
Recycled water holds great promise for relieving pressure on water resources, but its utilization rate in China is low because public willingness to accept it is low. According to technology acceptance theory, the public's attention and sentiment orientation strongly influence acceptance behavior toward something new. Therefore, in this study, text-mining analysis is applied to social media text data to explore the current status, temporal and spatial trends of public attention, sentiment orientation, and focus regarding recycled water in China. The behavior of Weibo users in 34 provincial-level administrative regions over the past six years was analyzed, with the following results. Although public attention has increased significantly as a result of government actions, it remains at a low level. Residents in water-scarce and economically developed areas pay more attention to recycled water than others. In addition, the majority of the Chinese public holds positive feelings about recycled water. However, disgust at the thought of recycled wastewater is widespread, and the public's concerns about its safety cannot be ignored.
•A nationally representative survey of attitudes towards recycled water in China using microblog data.
•The results show the status of concern for recycled water and the temporal and spatial changes.
•Government actions can impact public attention to recycled water.
•The public perception of recycled water in China is relatively positive.
The discernible alterations in regional precipitation patterns, driven by the intersecting forces of urbanization and climate change, exert a substantial impact on urban flood disasters. Based on multi-source precipitation data, a data-driven model fusion framework was constructed to analyze the spatial and temporal distribution characteristics of precipitation in Beijing. Wavelet analysis was used to reveal the periodic variation characteristics and multi-scale effects of precipitation, and machine learning was used to characterize its spatiotemporal dynamic change patterns. Finally, the geographical detector (GeoDetector) method was used to explore the causes of waterlogging in Beijing. The results reveal an uneven distribution of precipitation across the year, with 78% of the total falling during the flood season. The principal periodic cycles in annual cumulative precipitation (ACP) were identified at 21-, 13-, and 9-year intervals. Spatially, while a decreasing trend in precipitation was observed over most of Beijing, 63.4% of the region exhibited an increasing concentration trend, heightening the risk of urban waterlogging. Machine learning clustering elucidated three predominant spatial dynamic distribution patterns of precipitation in Beijing. Web crawler technology was used to acquire water accumulation data, addressing the difficulty of obtaining urban waterlogging data, and validation against Landsat 8 images enhanced data reliability and authenticity. Factor detection shows that road network density, topography, and precipitation were the main factors affecting urban waterlogging. These findings hold significant implications for flood control strategies and emergency management in urban areas across China.
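Extracting dominant cycles such as the reported 21-, 13-, and 9-year periods can be illustrated with a simple FFT periodogram on a synthetic annual series. Note this is a stand-in: the study uses wavelet analysis, which additionally localizes the cycles in time rather than only in frequency.

```python
import numpy as np

def dominant_period(series):
    """Return the period (in samples, e.g. years) of the strongest
    spectral peak of a 1-D series, via a plain FFT periodogram."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()                      # remove the zero-frequency offset
    power = np.abs(np.fft.rfft(x)) ** 2   # periodogram
    freqs = np.fft.rfftfreq(len(x), d=1.0)
    k = np.argmax(power[1:]) + 1          # skip the DC bin
    return 1.0 / freqs[k]
```

On a synthetic ACP-like series mixing a strong 21-year cycle with a weak 9-year one, the function recovers the 21-year period; a wavelet transform would further show in which decades each cycle is active.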
•Machine learning models are used to identify patterns of precipitation spatiotemporal dynamics.
•Web crawler technology combined with social media is used to obtain urban waterlogging datasets.
•MSWEP overestimates Beijing's historical precipitation and long-term change trends.
•GeoDetector is used to reveal the causes of waterlogging in Beijing.
Off-site construction (OSC) has become the direction of China's construction transformation owing to the disappearance of the demographic dividend and the demand for environmental sustainability. Although the Chinese government has made efforts to promote OSC, its development remains slow. According to technology acceptance theory, a positive attitude and sufficient understanding among the public are of great significance to the advancement of new technologies. No previous study has investigated people's awareness of OSC, and social media platforms provide big data for social science research. Accordingly, this study adopts web crawler technology combined with two text-mining methods, topic modeling and sentiment analysis, to explore public attitudes toward OSC based on data collected from Sina Weibo. The findings are as follows. Public attention to OSC fluctuated greatly but showed a general upward tendency, and the release of relevant government policies significantly influenced it. What the public was most concerned about was building attributes and performance, such as connection issues, earthquake resistance, and waterproofing. Moreover, the Chinese public held relatively positive sentiments toward OSC, with most expressing interest and curiosity. The leading causes of negative sentiment were safety issues, technical level, high prices, simplistic design, and unemployment, indicating that the dissemination of updated information on building performance, supportive policies, and the long-term significance of OSC was far from sufficient.
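The sentiment-analysis half of such a pipeline can be sketched as lexicon-based polarity scoring. The toy English lexicon below is purely illustrative; the study works on Chinese Weibo text, where tokenization and the sentiment lexicon or model are substantially more involved.

```python
# Toy lexicon loosely echoing the abstract's themes; an illustrative assumption,
# not the study's actual Chinese sentiment resources.
POSITIVE = {"interest", "curiosity", "good", "safe"}
NEGATIVE = {"unsafe", "expensive", "simplistic", "unemployment"}

def sentiment(post):
    """Lexicon-based polarity: (positives - negatives) / token count,
    so scores are comparable across posts of different lengths."""
    tokens = post.lower().split()
    if not tokens:
        return 0.0
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return score / len(tokens)
```

Aggregating such per-post scores over time is what yields the attention and sentiment trends the abstract reports; topic modeling (e.g. LDA) would separately surface themes such as connection issues or earthquake resistance.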
•Web crawler technology combined with two text mining methods were employed.
•Government actions affect the public attentions on off-site construction (OSC).
•The Chinese public has relatively positive sentiments on OSC.
•Lack of popularization of the updated OSC information is a serious drawback.
•Adapting the crawled updated CXR COVID-19 images datasets using web crawler-based cloud environment.
•Crawling the updated CXR COVID-19 images datasets from different websites simultaneously.
•Designing a novel Gray-Scale Spatial Exploitation Net (GSEN) to detect infected COVID-19 cases easily.
•Optimizing the hyperparameters of GSEN by using Stochastic Gradient Descent (SGD) Optimizer.
Today, the world continues to suffer from the COVID-19 pandemic, which motivates scientists and researchers to detect and diagnose infected people. The chest X-ray (CXR) image is a common tool for detection. Although CXR images carry limited informative detail about COVID-19 patches, computer vision can compensate through grayscale spatial exploitation analysis. In turn, it is highly desirable to acquire more CXR images to increase the capacity to learn and mine grayscale spatial features. In this paper, an efficient Gray-scale Spatial Exploitation Net (GSEN) is designed, with training data gathered by web-page crawling across cloud computing environments. The motivations of this work are: i) a framework methodology for constructing a consistent dataset by web crawling, updating the dataset continuously with each crawling iteration; ii) a lightweight, fast-learning, fine-tuned grayscale spatial exploitation deep neural net with comparable accuracy; iii) a comprehensive evaluation of the designed net on the collected web-crawled COVID-19 dataset(s) versus transfer learning with pre-trained nets. Different experiments benchmark both the proposed web crawling framework methodology and the designed net. In terms of accuracy, the proposed net achieves 95.60% for two-class labels and 92.67% for three-class labels, compared with the recent transfer-learning approaches GoogLeNet, VGG-19, ResNet-50, and AlexNet. Furthermore, the accuracy rates improve in a positive relationship with the cardinality of the crawled CXR dataset.
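The abstract names SGD as the optimizer for GSEN but does not give the network's architecture. As a minimal stand-in for that training loop, the sketch below applies per-sample SGD to a logistic classifier over flattened grayscale feature vectors; the actual GSEN is a deep convolutional net, and the toy data here are invented for illustration.

```python
import numpy as np

def sgd_logistic(X, y, lr=0.1, epochs=200, seed=0):
    """Per-sample SGD on logistic regression over flattened grayscale
    features; a simplified stand-in for SGD-optimized GSEN training."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):   # shuffle each epoch, as in SGD
            p = 1.0 / (1.0 + np.exp(-(X[i] @ w + b)))  # sigmoid output
            grad = p - y[i]                  # gradient of log-loss w.r.t. logit
            w -= lr * grad * X[i]
            b -= lr * grad
    return w, b

def predict(X, w, b):
    """Threshold the logit at zero for a binary (e.g. COVID/normal) label."""
    return (X @ w + b > 0).astype(int)
```

In GSEN the feature extraction would be learned convolutional grayscale filters rather than raw pixels, but the SGD update per mini-sample is the same idea.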