Data mining offers strong techniques for different sectors, including education. Research in the education field is growing rapidly because of the huge volume of student information, which can be used to discover valuable patterns in students' learning behavior. Educational institutions can use educational data mining to examine student performance, which helps them recognize how their students are doing. Classification is a familiar data mining technique that has been widely applied to predict student performance. In this study, a new prediction algorithm for evaluating students' academic performance was developed based on both classification and clustering techniques and was tested on a real-time basis with a student dataset drawn from various academic disciplines of higher educational institutions in Kerala, India. The results show that the hybrid algorithm combining clustering and classification yields far superior accuracy in predicting students' academic performance.
This paper describes an attempt to use data obtained from sodar (sound detection and ranging) for short-term forecasting of PM10 concentration levels in Krakow. Krakow is one of the most polluted cities in central Europe (CE) in terms of PM10 concentration. This is due to the high municipal emissions. Thanks to intensive corrective actions taken by the city authorities, these are being effectively eliminated, but the unfavourable topographic location of the city limits natural ventilation. The article describes all these conditions, focusing on presenting the method of short-term correction of air quality for the time needed to take quick corrective actions by the city authorities in the event of anticipated exceedances of the permissible values. Based on several years of measurements of the physical properties of the atmosphere with sodar, the authors of the paper suggest that sodar data could be considered for operational use to generate short-term predictions.
Data measuring airborne pollutants, public health and environmental factors are increasingly being stored and merged. These big datasets offer great potential, but also challenge traditional epidemiological methods. This has motivated the exploration of alternative methods to make predictions, find patterns and extract information. To this end, data mining and machine learning algorithms are increasingly being applied to air pollution epidemiology.
We conducted a systematic literature review on the application of data mining and machine learning methods in air pollution epidemiology. We carried out our search process in PubMed, the MEDLINE database and Google Scholar. Research articles applying data mining and machine learning methods to air pollution epidemiology were queried and reviewed.
Our search queries resulted in 400 research articles. Applying our inclusion/exclusion criteria reduced the results to 47 articles, which we separate into three primary areas of interest: 1) source apportionment; 2) forecasting/prediction of air pollution/quality or exposure; and 3) generating hypotheses. Early applications had a preference for artificial neural networks. In more recent work, decision trees, support vector machines, k-means clustering and the APRIORI algorithm have been widely applied. Our survey shows that the majority of the research has been conducted in Europe, China and the USA, and that data mining is becoming an increasingly common tool in environmental health. As potential new directions, we have identified deep learning and geospatial pattern mining as two burgeoning areas of data mining with good potential for future applications in air pollution epidemiology.
We carried out a systematic review identifying the current trends, challenges and new directions to explore in the application of data mining methods to air pollution epidemiology. This work shows that data mining is increasingly being applied in air pollution epidemiology. The potential to support air pollution epidemiology continues to grow with advancements in data mining related to temporal and geospatial mining, and deep learning. This is further supported by new sensors and storage media that enable larger, better-quality data. This suggests that many more fruitful applications can be expected in the future.
Society is increasingly relying on data-driven predictive models for automated decision making. Not by design, but due to the nature and noisiness of observational data, such models may systematically disadvantage people belonging to certain categories or groups, instead of relying solely on individual merits. This may happen even if the computing process is fair and well-intentioned. Discrimination-aware data mining studies how to make predictive models free from discrimination when the historical data on which they are built may be biased, incomplete, or even contain past discriminatory decisions. Discrimination-aware data mining is an emerging research discipline, and there is no firm consensus yet on how to measure the performance of algorithms. The goal of this survey is to review various discrimination measures that have been used, analytically and computationally analyze their performance, and highlight implications of using one or another measure. We also describe measures from other disciplines, which have not been used for measuring discrimination, but potentially could be suitable for this purpose. This survey is primarily intended for researchers in data mining and machine learning as a step towards producing a unifying view of performance criteria when developing new algorithms for non-discriminatory predictive modeling. In addition, practitioners and policy makers could use this study when diagnosing potential discrimination by predictive models.
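As an illustration of the kind of measure such a survey reviews, the sketch below computes the difference in positive-decision rates between a protected group and everyone else (often called mean difference or statistical parity difference). The function name, the group labels and the toy data are assumptions made for this example; they are not taken from the survey.

```python
from typing import Sequence


def mean_difference(decisions: Sequence[int], groups: Sequence[str], protected: str) -> float:
    """Positive-decision rate of the non-protected group minus that of the protected group.

    decisions: 1 for a favourable decision, 0 otherwise.
    groups: group label of each individual, aligned with decisions.
    A value of 0 means both groups receive favourable decisions at the same rate.
    """
    if len(decisions) != len(groups):
        raise ValueError("decisions and groups must have the same length")
    prot = [d for d, g in zip(decisions, groups) if g == protected]
    rest = [d for d, g in zip(decisions, groups) if g != protected]
    if not prot or not rest:
        raise ValueError("both groups must be non-empty")
    return sum(rest) / len(rest) - sum(prot) / len(prot)


# Hypothetical toy data: group "a" is treated as the protected group.
decisions = [1, 0, 0, 0, 1, 1, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap = mean_difference(decisions, groups, protected="a")
```

A positive value indicates that the protected group receives favourable decisions less often. The measure ignores any legitimate explanatory attributes, which is one reason a single measure is rarely sufficient on its own.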
Purpose: One of the major hurdles in enabling personalized medicine is obtaining sufficient patient data to feed into predictive models. Combining data originating from multiple hospitals is difficult because of ethical, legal, political, and administrative barriers associated with data sharing. In order to avoid these issues, a distributed learning approach can be used. Distributed learning is defined as learning from data without the data leaving the hospital.
Patients and methods: Clinical data from 287 lung cancer patients, treated with curative intent with chemoradiation (CRT) or radiotherapy (RT) alone, were collected from and stored in 5 different medical institutes (123 patients at MAASTRO (Netherlands, Dutch), 24 at Jessa (Belgium, Dutch), 34 at Liege (Belgium, Dutch and French), 48 at Aachen (Germany, German) and 58 at Eindhoven (Netherlands, Dutch)). A Bayesian network model is adapted for distributed learning (watch the animation: http://youtu.be/nQpqMIuHyOk ). The model predicts dyspnea, which is a common side effect after radiotherapy treatment of lung cancer.
Results: We show that it is possible to use the distributed learning approach to train a Bayesian network model on patient data originating from multiple hospitals without these data leaving the individual hospital. The AUC of the model is 0.61 (95%CI, 0.51–0.70) on a 5-fold cross-validation and ranges from 0.59 to 0.71 on external validation sets.
Conclusion: Distributed learning can allow the learning of predictive models on data originating from multiple hospitals while avoiding many of the data sharing barriers. Furthermore, the distributed learning approach can be used to extract and employ knowledge from routine patient data from multiple hospitals while being compliant with the various national and European privacy laws.
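The abstract does not describe how the Bayesian network was adapted, so the following is only a minimal sketch of one way learning can proceed without patient records leaving a site: each hospital computes local counts for a conditional probability table, and only those aggregate counts are merged centrally. The variable names, the two toy sites and the Laplace smoothing are assumptions for illustration, not the authors' method.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

# Hypothetical record: (treatment, outcome). Only counts of these pairs leave a site.
Record = Tuple[str, str]


def local_counts(records: Iterable[Record]) -> Counter:
    """Runs inside one hospital and returns aggregate counts only."""
    return Counter(records)


def merge_counts(site_counts: Iterable[Counter]) -> Counter:
    """Runs at the coordinating site and sums the per-hospital counts."""
    total: Counter = Counter()
    for counts in site_counts:
        total.update(counts)
    return total


def conditional_table(
    counts: Counter, child_states: Sequence[str], alpha: float = 1.0
) -> Dict[str, Dict[str, float]]:
    """Estimate P(child | parent) from merged counts with Laplace smoothing alpha."""
    parents = sorted({parent for parent, _ in counts})
    table: Dict[str, Dict[str, float]] = {}
    for parent in parents:
        denom = sum(counts[(parent, c)] for c in child_states) + alpha * len(child_states)
        table[parent] = {c: (counts[(parent, c)] + alpha) / denom for c in child_states}
    return table


# Toy data for two hypothetical sites; the values are invented for the example.
site_a = [("crt", "dyspnea"), ("crt", "none"), ("rt", "none")]
site_b = [("crt", "dyspnea"), ("rt", "none"), ("rt", "dyspnea")]

merged = merge_counts([local_counts(site_a), local_counts(site_b)])
cpt = conditional_table(merged, child_states=("dyspnea", "none"))
```

Because the counts are sufficient statistics for the table, the merged estimate equals what pooled training on the combined records would give. Very small counts can still be identifying, so a real deployment would need additional safeguards that this sketch omits.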
The Adverse Outcome Pathway (AOP) framework describes the progression of a toxicity pathway from molecular perturbation to population-level outcome in a series of measurable, mechanistic responses. The controlled, computer-readable vocabulary that defines an AOP makes it possible to integrate AOP knowledge, automatically and on a large scale, with publicly available sources of biological high-throughput data and its derived associations. To support the discovery and development of putative (existing) and potential AOPs, we introduce the AOP-DB, an exploratory database resource that aggregates association relationships between genes and their related chemicals, diseases, pathways, species orthology information, ontologies, and gene interactions. These associations are mined from publicly available annotation databases and are integrated with the AOP information centralized in the AOP-Wiki, allowing for the automatic characterization of both putative and potential AOPs in the context of multiple areas of biological information, referred to here as “biological entities”. The AOP-DB acts as a hypothesis-generation tool for the expansion of putative AOPs, as well as the characterization of potential AOPs, through the creation of association networks across these biological entities. Finally, the AOP-DB provides a useful interface between the AOP framework and existing chemical screening and prioritization efforts by the US Environmental Protection Agency.
• The AOP-DB stores AOP-gene targets, chemical, disease, pathway, species information.
• Associations are sourced from public annotation to provide a context of biology.
• AOP-DB enables researchers to predict chemical stressors and toxicological outcomes.
• AOP-DB is an effective data integration tool for characterization of AOPs.
Due to its continuously increasing occurrence, more and more families are affected by diabetes mellitus. Most diabetics know little about their health quality or the risk factors they face prior to diagnosis. In this study, we propose a novel model based on data mining techniques for predicting type 2 diabetes mellitus (T2DM). The main problems we address are improving the accuracy of the prediction model and making the model adaptive to more than one dataset. Following a series of preprocessing procedures, the model comprises two parts: the improved K-means algorithm and the logistic regression algorithm. The Pima Indians Diabetes Dataset and the Waikato Environment for Knowledge Analysis toolkit were used to compare our results with those of other researchers. The model attained a prediction accuracy 3.04% higher than that reported by other researchers. Moreover, our model ensures that the dataset quality is sufficient. To further evaluate the performance of our model, we applied it to two other diabetes datasets. Both experiments show good performance. As a result, the model is shown to be useful for the realistic health management of diabetes.
Sentiment Analysis of Short Informal Texts. Kiritchenko, S.; Zhu, X.; Mohammad, S. M. The Journal of Artificial Intelligence Research, 08/2014, Volume 50. Journal article. Peer reviewed. Open access.
We describe a state-of-the-art sentiment analysis system that detects (a) the sentiment of short informal textual messages such as tweets and SMS (message-level task) and (b) the sentiment of a word or a phrase within a message (term-level task). The system is based on a supervised statistical text classification approach leveraging a variety of surface-form, semantic, and sentiment features. The sentiment features are primarily derived from novel high-coverage tweet-specific sentiment lexicons. These lexicons are automatically generated from tweets with sentiment-word hashtags and from tweets with emoticons. To adequately capture the sentiment of words in negated contexts, a separate sentiment lexicon is generated for negated words.
The system ranked first in the SemEval-2013 shared task 'Sentiment Analysis in Twitter' (Task 2), obtaining an F-score of 69.02 in the message-level task and 88.93 in the term-level task. Post-competition improvements boost the performance to an F-score of 70.45 (message-level task) and 89.50 (term-level task). The system also obtains state-of-the-art performance on two additional datasets: the SemEval-2013 SMS test set and a corpus of movie review excerpts. The ablation experiments demonstrate that the use of the automatically generated lexicons results in performance gains of up to 6.5 absolute percentage points.
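The abstract does not define the F-score it reports. The sketch below assumes the score is the average of the F1 values for the positive and negative classes, with neutral messages affecting the result only through errors, and that reported figures are this value multiplied by 100. The function names and the toy labels are hypothetical.

```python
from typing import Sequence


def f1_for_label(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """F1 of a single class, computed from true positives, false positives and false negatives."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0  # avoids division by zero when the class is never predicted correctly
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def avg_pos_neg_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Average of the positive-class and negative-class F1 (assumed evaluation metric)."""
    return (f1_for_label(gold, pred, "positive") + f1_for_label(gold, pred, "negative")) / 2


# Invented toy labels for six messages.
gold = ["positive", "positive", "negative", "neutral", "negative", "positive"]
pred = ["positive", "negative", "negative", "neutral", "positive", "positive"]
score = avg_pos_neg_f1(gold, pred)
```

On this toy data the positive class has precision and recall of 2/3 and the negative class 1/2, so the averaged score is 7/12, or about 58.3 on the 0 to 100 scale used in the abstract.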