The aim of this study is to compare sampling methods using machine and deep learning algorithms on a small, imbalanced data set for the prevention of violence against physicians. In this data set, whether a physician has experienced violence is determined from various demographic information about the physicians. In addition, this study seeks effective solutions to improve physicians' working conditions in order to reduce violence against them. To address the imbalanced data problem, the Synthetic Minority Oversampling Technique (SMOTE), Random Oversampling (ROS), and Random Undersampling (RUS) methods were used to balance the data. Then, the Random Forest Classifier (RFC), Extra Tree Classifier (ETC), and Multi-Layer Perceptron (MLP) algorithms were applied. Among all combinations of sampling techniques and classification algorithms, the ETC algorithm applied with the ROS method shows the best performance, with 82% accuracy and an F1-score of 0.81.
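The two random resampling strategies named above (ROS and RUS) can be illustrated with a minimal sketch. This is not the study's implementation: the helper names `random_oversample` and `random_undersample`, the fixed seed, and the toy data are assumptions made for illustration, and only the Python standard library is used.

```python
import random
from collections import Counter


def _group_by_class(X, y):
    """Collect feature rows under their class label."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return groups


def random_oversample(X, y, seed=0):
    """ROS sketch: duplicate randomly chosen rows until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    target = max(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """RUS sketch: keep a random subset of each class so that every class
    has as many rows as the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    target = min(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        for row in rng.sample(rows, target):
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy data: 8 majority rows (label 0) and 2 minority rows (label 1).
X_toy = [[i] for i in range(10)]
y_toy = [0] * 8 + [1] * 2

X_ros, y_ros = random_oversample(X_toy, y_toy)
X_rus, y_rus = random_undersample(X_toy, y_toy)
print(Counter(y_ros), Counter(y_rus))
```

In practice, resampling of this kind is usually applied only to the training split, so that duplicated or discarded rows do not distort the evaluation on held-out data.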
Automating the Big Data Analysis (BDA) procedure offers great benefits in the era of Big Data and Artificial Intelligence. The BDA procedure can be automated efficiently using the concept of automatic service composition. Our previous work on Auto-BDA shows promising prospects for reducing the turnaround time of data analysis. Moreover, such automation requires a well-geared combination of data preparation and optimal model (deep learning) generation. This paper presents the construction of automated BDA and model generation (here, deep learning) together with data preparation and parameter optimization.
The transition of the financial industry in Indonesia from traditional to digital is supported by peer-to-peer lending platforms. Technological disruption on peer-to-peer lending platforms goes hand in hand with the high risks that users must face. In this study, text classification is conducted to identify the risks perceived by users. The text classification process follows the CRISP-DM data mining process and uses SVM, which yields better accuracy than the other algorithms. Twitter data about peer-to-peer lending platforms is used in this study and classified by perceived risk into six classes: performance, financial, time, psychological, social, and privacy. Text classification using the SVM algorithm achieves 81.51% accuracy. The classification results indicate that performance risk is the risk most felt by users and must be considered by peer-to-peer lending platforms.
Data mining is used to explore banks' data to uncover hidden scams and detect potential fraud. The aim of this paper is to compare Naïve Bayes, Decision Tree, and Logistic Regression for detecting fraudulent credit card transactions. The Cross-Industry Standard Process for Data Mining (CRISP-DM) is followed to achieve this aim. In terms of accuracy, the best classification model was Logistic Regression with 94.6% accuracy, compared with the Decision Tree and Naïve Bayes, which showed accuracies of 89.1% and 90.9%, respectively. Other measures were also calculated, such as the time needed to build the model.
In this era, most people use social media to communicate and keep up with all kinds of news. Politics is one of the sectors that uses social media to spread news. Utilizing information from public opinion and other politicians provides additional important information that can increase the chance of winning elections. This paper uses social network analysis to measure public conversational activity on social media networks. We select "Pemilu Presiden 2019" as the case study since it is the biggest political event in Indonesia and has a massive impact not only on Indonesia but also on other countries in the world. In addition, this paper uses conversational activity on the social media platform Twitter, from which we collected 4,564 tweets between March 29 and April 8, 2018. We analyze two political parties, "Gerindra" and "PDI Perjuangan", as the parties of the presidential candidates. We also found a number of influencers in the network, which indicate that "PDI Perjuangan" has a more positive sentiment than "Gerindra". We collect and summarize all the influencers' conversations related to each party and find a high correlation between the influencers, parties, and presidential candidates. This shows that bigger influencers can pose greater risks to the image of the presidential candidates and of the parties.
Insurance companies need to define actions to improve customer persistence indicators in subscription services. We want to fully understand cancellation behavior in mass sales channels in relation to the retention tasks performed inside the company. As a proposal to increase the retention of customers who call to cancel, a logistic regression model and a decision tree model were applied to real data following the CRISP-DM methodology. After comparing the models, the logistic regression model gives better results, with a prediction accuracy of 87.21%; this allows us to propose strategies to increase the customer retention rate.
This paper introduces a reusable framework for analytics governance that accelerates the use of data and analytics across organizations. The Analytics Governance Framework (AGF) is a set of guiding principles for streamlining the work of managers, analytics practitioners, and data management practitioners. The guiding principles of analytics work are designed to have checkpoints where both managers and analytics practitioners work toward the same outcome. These principles facilitate communication and coordination between data management and analytics practitioners on each project's unique data requirements. This framework enables client organizations to optimize the number of analytics projects they undertake to achieve business transformation and to shorten the time an analytics project needs to achieve business impact.
Process models are an important tool for software engineers to produce reliable software within schedule and budget. Technically challenging domains such as machine learning especially need a supportive process model to guide developers and stakeholders during the development process. One major problem type in machine learning is anomaly detection. Its goal is to identify anomalous data points (outliers) among the normal data instances. Anomaly detection has a wide scope of applications in industrial and scientific areas. Examples include detecting intruders in computer networks, distinguishing between cancerous and healthy tissue in medical images, cleaning data of disturbing outliers for further evaluation, and many more. The cross-industry standard process for data mining (CRISP-DM) has been developed to support developers with all kinds of data mining applications. It describes a generic model of six phases that covers the whole development cycle. The generality of the CRISP-DM model is as much a strength as it is a weakness, since the particularities of different problem types such as anomaly detection cannot be addressed without making the model overly complex. There is a need for a more practical, specialised process model for anomaly detection applications. We demonstrate this issue and outline an approach towards a practical process model tailored to the development of anomaly detection systems.
The travelling salesman problem (TSP) is an NP-hard optimization problem. Therefore, intelligent and heuristic methods are needed to solve such a hard problem in less computational time. In this paper, a novel data mining-based approach is presented. The purpose of the proposed approach is to extract a number of rules from optimum tours of small TSPs. The obtained rules can be used for solving larger TSPs. Our proposed approach is described within a standard data mining framework called CRISP-DM. For rule extraction, generalized rule induction (GRI), a powerful association rule mining algorithm, is used. The results of this approach are stated as if-then rules. The approach is applied to two standard examples of TSPs. The rules obtained from these examples are compared, and it is shown that the rules from the two examples are very similar. This finding suggests that the extracted rules can be used to solve larger TSPs.
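The optimum tours of small TSPs that serve as the input to rule extraction can, for very small instances, be found by exhaustive search. The sketch below illustrates only that step and not the rule induction itself; the function names `tour_length` and `optimal_tour` and the four-city coordinates are assumptions for illustration, and brute force is a swapped-in technique that is feasible only for a handful of cities.

```python
import itertools
import math


def tour_length(tour, coords):
    """Length of the closed tour that visits cities in the given order
    and returns to the starting city."""
    n = len(tour)
    return sum(
        math.dist(coords[tour[i]], coords[tour[(i + 1) % n]])
        for i in range(n)
    )


def optimal_tour(coords):
    """Brute-force optimum: fix city 0 as the start and try every
    ordering of the remaining cities. Runs in O((n-1)!) time."""
    best_tour, best_len = None, math.inf
    for perm in itertools.permutations(range(1, len(coords))):
        tour = (0,) + perm
        length = tour_length(tour, coords)
        if length < best_len:
            best_tour, best_len = tour, length
    return best_tour, best_len


# Hypothetical instance: four cities on the corners of a unit square.
square = [(0, 0), (0, 1), (1, 1), (1, 0)]
best_tour, best_len = optimal_tour(square)
print(best_tour, best_len)
```

Fixing the start city avoids re-evaluating rotations of the same tour; each tour is still evaluated in both directions, which is acceptable at this scale.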