In Data Mining (DM) projects, and more specifically in the Data Understanding and Data Preparation phases, several techniques from the literature are used to detect and handle data quality problems such as missing data, outliers, inconsistent data or time-variant data. However, the main limitation in applying these techniques is the complexity caused by a lack of anticipation in the detection and resolution of data quality problems. To address this, a DM process model designed for the prior management of data quality was recently proposed. Its distinctive feature is that it links the DM process and the Software Engineering (SE) process by running them in parallel. However, the authors of that work [1] specified only what should be done, not how it should be done. The present research improves on that DM process model. It adds a methodology that gives concrete guidelines on how to combine the SE process and the DM process to anticipate and manage the data quality problems that can arise during the mining process. This work specifically addresses the case of temporal data. The main contribution of this methodology is a concrete definition of how to anticipate and automate all the activities necessary to remove temporal data quality problems in a mining process.
Innovation in the public sector refers to the development of important improvements in public administration and its corresponding services. One such public service is social security, where a central concern has been the information security of the services offered. The aim of the present study was to analyse trends and discover behavioural patterns in the attacks on the data network of a public-sector institution. To fulfil this objective, a model was built with data mining algorithms and techniques, following the Cross Industry Standard Process for Data Mining methodology. The model uses a free and open source network Intrusion Detection and Prevention System (IDS/IPS) to capture the logs of the attacks on the organization's data network. This was followed by a quantitative assessment of various intrusion detection algorithms, which led to the selection of J48 and REPTree as the Data Mining algorithms, based on the proportion of correctly classified instances and the lowest absolute error. The data were processed and served as input for the construction of rules. The resulting decision tree rules are based on the principle of calculating the information gain via entropy and minimizing the error that arises from the variance. These rules were the product of applying machine learning to the analysed logs, and they were subsequently translated and reprogrammed into the IDS/IPS in order to assess the efficiency of the model. The results demonstrate a significant improvement of some 67% in the detection of attacks relative to the traditional IDS. Consequently, we infer a wide difference in behaviour and trends between a traditional system and one whose rules are generated by Data Mining.
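The entropy-based information gain mentioned in this abstract can be illustrated with a minimal sketch. It is not the J48 or REPTree implementation used in the study; the log records, the feature names (`protocol`, `port_class`) and the labels are hypothetical and only serve to show the calculation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting on one categorical feature."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature], []).append(label)
    # Weighted entropy of the partitions left after the split
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder


# Hypothetical log records (assumption, not data from the study)
rows = [
    {"protocol": "tcp", "port_class": "low"},
    {"protocol": "tcp", "port_class": "high"},
    {"protocol": "udp", "port_class": "low"},
    {"protocol": "udp", "port_class": "high"},
]
labels = ["attack", "attack", "normal", "normal"]

gain_protocol = information_gain(rows, labels, "protocol")
gain_port = information_gain(rows, labels, "port_class")
```

In this toy data, `protocol` separates the two classes perfectly while `port_class` tells us nothing, so a tree learner that maximizes information gain would split on `protocol` first.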
This paper presents an analysis of data from the National Fund of Education Development (FNDE) on the teachers in its database, relating them to the demographic density of each region, the number of teachers from the Brazilian capitals who hold undergraduate, specialization, master's and doctoral degrees, and the Development Index of Basic Education (IDEB) percentages of the students of the respective regions. The goal is to analyze whether the academic qualification of teachers has an impact on the IDEB of school students in the regions with the most representative indexes. For this verification we used the Apriori association algorithm to identify and analyze patterns in the data set. The results demonstrate that the highest academic qualification of male teachers is mostly a specialization, whereas female teachers in most regions hold a specialization and a master's degree. Furthermore, it was possible to verify that the region with the highest IDEB is the South, with 6.27%, which is also the region with the highest proportion of teachers holding a master's degree (4.2%) and a doctorate (0.5%). The Central West region has an IDEB of 6.20%, and most of its teachers hold an undergraduate degree (56.7%) or a specialization (31.9%). The results obtained make it possible to affirm that the regions with the highest IDEB are the regions whose teachers have the highest level of training.
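As a rough illustration of the Apriori idea named in this abstract, the sketch below finds frequent itemsets level by level and computes the confidence of one rule. The four records and the item names (`degree=...`, `ideb=...`) are hypothetical, not taken from the FNDE data, and the thresholds are arbitrary.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori search: return {itemset: support} for frequent itemsets."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Keep the size-k candidates that meet the support threshold
        level = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t) / n
            if support >= min_support:
                level[c] = support
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every k-item subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]


# Hypothetical teacher records (assumption, for illustration only)
records = [
    {"degree=master", "ideb=high"},
    {"degree=master", "ideb=high"},
    {"degree=specialization", "ideb=high"},
    {"degree=undergraduate", "ideb=low"},
]
frequent = frequent_itemsets(records, min_support=0.5)
rule_conf = confidence(frequent, {"degree=master"}, {"ideb=high"})
```

With a minimum support of 0.5, only `degree=master`, `ideb=high` and their pair survive, and the rule `degree=master -> ideb=high` holds in every record that contains the antecedent.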
The competitive index is a diagnostic tool for assessing the competitiveness of a city or municipality and determining problematic factors for improvement. The Naïve Bayes algorithm is a useful method for predicting the development competitive index as a basis for recommendations in the development plan. The predicted status can inform development planning, business investment, policy making, and resiliency to calamities. Specifically, the study addressed the following objectives: (1) identified all the indicators in the competitiveness index for the different cities and municipalities; (2) computed the relative efficiency among cities and municipalities using Data Envelopment Analysis (DEA); (3) ranked the relative efficiency to determine the competitiveness of each city or municipality in Region 1; and (4) used Bayes' theorem to determine the probability of competitiveness of every city and municipality in Region 1 as the basis for a recommendation system. The study classifies the competitiveness index of Local Government Units (LGUs) based on the four pillars of the cities and municipalities competitive index survey tool. The four pillars focus on government efficiency, economic dynamism, infrastructure and resiliency. The development process involved applied and developmental research designs, the Cross-Industry Standard Process for Data Mining (CRISP-DM), Data Envelopment Analysis and the Naïve Bayes algorithm. CRISP-DM was the method used for preparing and processing the data, while Data Envelopment Analysis was used to determine the relative efficiency, target, slack and slack percentage. Using the Waikato Environment for Knowledge Analysis (WEKA) as the prediction tool with the Naïve Bayes algorithm, the resulting accuracy is 90.64 percent. The Naïve Bayes algorithm determined the probability of competitiveness of all LGUs in Region 1 for recommendations in the development plan.
There is currently a significant amount of technology in hospitals, in particular in Intensive Care Units (ICU). The clinical data generated daily are integrated into Decision Support Systems (DSS) in real time for a better quality of patient care. The hospital environment has many potential sources of infection, that is, objects or environments in which microorganisms can survive or multiply, such as the facilities, invasive devices or equipment used, or even patients, health professionals and visitors. The existence of nosocomial infection prediction systems in healthcare environments can contribute to improving the quality of the healthcare institution. It can also reduce the cost of treating the patients who acquire these infections. Analysing the available information can help to identify the future occurrence of these infections and so prevent them. This paper presents the results of applying models to real clinical data. Good models were obtained, induced by the Data Mining (DM) clustering techniques K-Means and K-Medoids (Davies-Bouldin Index 0.14). These models, classification models, should operate within a DSS capable of helping to reduce this type of infection as well as the costs associated with it.
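The Davies-Bouldin index quoted in this abstract can be computed as in the following minimal sketch, which assumes Euclidean distance and clusters that have already been formed. The two-dimensional points are hypothetical and unrelated to the clinical data; lower values indicate more compact, better separated clusters.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points)."""
    centroids = [centroid(c) for c in clusters]
    # Scatter: mean distance from each point to its own cluster centroid
    scatter = [
        sum(math.dist(p, m) for p in c) / len(c)
        for c, m in zip(clusters, centroids)
    ]
    k = len(clusters)
    worst = []
    for i in range(k):
        # For cluster i, take its worst (largest) similarity to any other cluster
        worst.append(max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k) if j != i
        ))
    return sum(worst) / k


# Hypothetical, already-clustered points (assumption, for illustration only)
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(10.0, 0.0), (10.0, 2.0)],
]
db_index = davies_bouldin(clusters)
```

Here each cluster has a scatter of 1 and the centroids are 10 apart, so the index is (1 + 1) / 10 = 0.2.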
With the growing interest in poker research among scientists in the area of Artificial Intelligence, the need has arisen to overcome the imperfect information inherent in poker, a stochastic game, because of the challenges this presents. In this work we aim to create data models that capture the plays of a real player, supporting the decision of whether to play in the pre-flop phase of the Texas Hold'em variant. To accomplish this, several Data Mining techniques were applied for classification, using the RapidMiner software. This work was based on data from online games of professional poker players, which contained important information about the factors inherent to a real game.
The data produced from educational activities could be exploited to extract useful knowledge, assist educational decision makers in making better decisions and help students achieve better results. In this study, we report our findings on the application of a data mining technique following the CRISP-DM model at the Department of Computer Science at the University of Jijel, Algeria. Our proposed system is able to classify undergraduate and post-graduate students according to their results and to predict their performance for the coming years based on their current results and on historical data. The system can also be used as an early-warning tool for students at risk and to help graduates choose the appropriate Master's disciplines in which to pursue their studies.
A survey on various challenges and aspects in handling big data
Pradeep, S.; Kallimani, Jagadish S.
2017 International Conference on Electrical, Electronics, Communication, Computer, and Optimization Techniques (ICEECCOT)
2017-Dec.
Conference Proceeding
Data that are very large in size and cannot be managed in terms of megabytes (MB) and gigabytes (GB) are termed Big Data. Big data is usually measured in petabytes (10^15 bytes). Research suggests that what is referred to as big data today is data collected over the past three years. The sources of big data include social networking sites such as Facebook, Twitter and LinkedIn, which collect vast amounts of data as billions of users post on a daily basis. Share markets also contribute to big data through stock exchange activity and the data collected from share transactions. E-commerce sites collect data that service providers can use to introduce goods and items that satisfy user requirements. Other sources of big data include weather stations and telecom companies. The three important features of big data are Velocity, Volume and Veracity. The size of big data is increasing so rapidly that the data doubles every two years. Big data is sometimes structured and sometimes unstructured. Data that can be maintained in a table structure, with rows and columns, are called structured data. Data from CCTV footage are an example of unstructured data. In this paper, we attempt to discuss the growth, life cycle stages, handling of huge data on the Hadoop framework, publication frequencies, advantages and challenges in handling big data.
The massive amount of current data has led to many different forms of data analysis processes that aim to explore this data to uncover valuable insights. Methodologies to guide the development of big data science projects, including CRISP-DM and SEMMA, have been widely used in industry and academia. The data analysis modeling phase, which involves decisions on the most appropriate models to adopt, is at the core of these projects. However, from a software engineering perspective, the design and automation of activities performed in this phase are challenging. In this paper, we propose an approach to the data analysis modeling process which involves (i) the assessment of the variability inherent in the CRISP-DM data analysis modeling phase and the provision of feature models that represent this variability; (ii) the definition of a framework structural design that captures the identified variability; and (iii) evaluation of the developed framework design in terms of the possibilities for process automation. The proposed approach advances the state of the art by offering a variability-aware design solution that can enhance system flexibility, potentially leading to novel software frameworks which can significantly improve the level of automation in the data analysis modeling process.