This book constitutes the refereed proceedings of the International ECML/PKDD Workshop on Privacy and Security Issues in Data Mining and Machine Learning, PSDML 2010, held in Barcelona, Spain, in September 2010.
The 11 revised full papers presented were carefully reviewed and selected from 21 submissions. The papers range from data privacy to security applications, focusing on detecting malicious behavior in computer systems.
Traditional Machine Learning (ML) pipeline development requires the ML practitioner to directly access the data in order to analyze, clean, and preprocess it, and then to develop, train, and evaluate an ML model. When the data owner has no infrastructure for in-house development, such pipelines are outsourced. Data commonly carries privacy constraints that impose laborious and potentially expensive overhead, including contract drafting and infrastructure upgrades. Traditional approaches rely either on anonymization, which does not entirely protect against identity disclosure, or on synthetic data generation, which requires expertise not necessarily available to the organization. In this paper, we present Data-Blind ML, an automated framework, fueled by synthetic generative learning and distributed computing paradigms, which enables an organization to outsource the development and training of ML models without sharing any sample from the real dataset. In addition, the framework allows the ML practitioner to receive feedback on the model's performance against the actual real data without accessing it directly.
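For concreteness, the following is a minimal sketch of the workflow this abstract describes, not the framework's actual implementation: a naive per-class Gaussian sampler stands in for the generative model, and all function names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def owner_generate_synthetic(real_X, real_y, n_samples, seed=0):
    """Runs inside the data owner's infrastructure. A naive per-class
    Gaussian sampler stands in for the framework's generative model."""
    rng = np.random.default_rng(seed)
    labels = np.unique(real_y)
    X_syn, y_syn = [], []
    for label in labels:
        cls = real_X[real_y == label]
        mu, sigma = cls.mean(axis=0), cls.std(axis=0) + 1e-9
        k = n_samples // len(labels)
        X_syn.append(rng.normal(mu, sigma, size=(k, real_X.shape[1])))
        y_syn += [label] * k
    return np.vstack(X_syn), np.array(y_syn)

def practitioner_train(X_syn, y_syn):
    """Runs at the outsourced ML practitioner: no real sample is ever seen."""
    return RandomForestClassifier(random_state=0).fit(X_syn, y_syn)

def owner_feedback(model, real_X, real_y):
    """Only an aggregate metric leaves the owner; the raw data never does."""
    return {"accuracy": accuracy_score(real_y, model.predict(real_X))}
```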
The widespread availability of big data has enabled the practical realization of AI and machine learning, and with it a growing drive to improve the accuracy and efficacy of AI applications. In traffic applications, machine learning solutions improve safety in hazardous traffic situations. Existing architectures face various challenges, of which data privacy is the foremost for vulnerable road users (VRUs). A key reason for failures in pedestrian traffic control is flawed handling of user privacy: user data are at risk and prone to several privacy and security gaps, and if an attacker succeeds in infiltrating the system, exposed data can be maliciously manipulated, fabricated, and misrepresented for illegitimate purposes. In this study, we propose a machine-learning-based architecture to analyze and process big data efficiently in a secure environment, taking the privacy of users into account during processing. The proposed architecture is a layered framework with a parallel and distributed module that applies machine learning to big data to achieve secure analytics. It includes a distinct unit for privacy management built on a machine learning classifier, together with an integrated stream processing unit that processes the information. The system is implemented using real-time datasets from various sources and experimentally tested on reliable datasets, demonstrating the effectiveness of the proposed architecture. Data ingestion results are reported along with training and validation results.
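A minimal sketch of the idea of a privacy-management unit fronting a stream processing unit, assuming a classifier that flags sensitive records for masking before analytics see them; the field names and the sensitivity rule are illustrative, not the paper's design.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

@dataclass
class Record:
    user_id: str
    payload: dict

def is_sensitive(record: Record) -> bool:
    # Stand-in for the ML classifier in the privacy-management unit.
    return "location" in record.payload or "health" in record.payload

def mask(record: Record) -> Record:
    # Pseudonymize and drop sensitive fields before the record
    # enters the analytics stream.
    return Record(user_id=f"anon-{hash(record.user_id) % 10**6}",
                  payload={k: v for k, v in record.payload.items()
                           if k not in ("location", "health")})

def privacy_gate(stream: Iterable[Record]) -> Iterator[Record]:
    # The stream-processing unit consumes the gated stream.
    for rec in stream:
        yield mask(rec) if is_sensitive(rec) else rec
```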
Due to technological development, personal data has become easier to collect, store, and analyze. Companies can collect detailed browsing behavior data, health-related data from smartphones and smartwatches, and voice and movement recordings from smart home devices. Analysis of such data can bring numerous advantages to society and further the development of science and technology. However, given the often sensitive nature of the collected data, people have become increasingly concerned about the data they share and how they interact with new technology.
These concerns have motivated companies and public institutions to provide services and products with privacy guarantees. Many institutions and research communities have therefore adopted the notion of differential privacy, which has emerged as a powerful technique for enabling data analysis while preventing information leakage about individuals. In simple words, differential privacy allows us to use and analyze sensitive data while maintaining privacy guarantees for every individual data point. As a result, numerous private algorithmic tools have been developed for various applications. However, multiple open questions and research areas around differential privacy in machine learning, statistics, and data analysis remain to be explored, which the existing literature has not covered.
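For reference, the guarantee being invoked is the standard one: a randomized mechanism $M$ satisfies $\varepsilon$-differential privacy if, for every pair of neighboring datasets $D$ and $D'$ differing in a single record and every set $S$ of possible outputs,
$$\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S].$$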
In Chapter 1, we provide a brief discussion of the problems and the main contributions that are presented in this thesis. Additionally, we briefly recap the notion of differential privacy with some useful results and algorithms.
In Chapter 2, we study the problem of differentially private change-point detection for unknown distributions. The change-point detection problem seeks to identify distributional changes in streams of data. Non-private tools for change-point detection have been widely applied in several settings. However, in certain applications, such as identifying disease outbreaks from hospital records or IoT devices detecting home activity, the collected data is highly sensitive, which motivates the study of privacy-preserving tools. Much of the prior work on change-point detection (including the only private algorithms for this problem) requires complete knowledge of the pre-change and post-change distributions. However, this assumption is not realistic for many practical applications of interest. In this chapter, we present differentially private algorithms for solving the change-point problem when the data distributions are unknown to the analyst. Additionally, we study the case when data may be sampled from distributions that change smoothly over time rather than from fixed pre-change and post-change distributions. Furthermore, our algorithms can be applied to detect changes in linear trends of such data streams. Finally, we provide a computational study to empirically validate the performance of our algorithms.
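As a rough illustration of the flavor of such algorithms (not the chapter's method), the sketch below privately estimates a change point for bounded data using a difference-of-means statistic and report-noisy-max; the sensitivity bound assumes entries in [0, 1] and candidate splits restricted to the middle half of the stream.

```python
import numpy as np

def dp_change_point(x, epsilon, rng=None):
    """x: 1-D array with entries in [0, 1]; returns a private change index."""
    rng = rng or np.random.default_rng()
    n = len(x)
    lo, hi = n // 4, 3 * n // 4          # keep both sides of every split large
    candidates = np.arange(lo, hi)
    stats = np.array([abs(x[:k].mean() - x[k:].mean()) for k in candidates])
    # Changing one record moves a mean over >= n/4 points by at most 4/n,
    # so each statistic has sensitivity at most 4/n.
    sensitivity = 4.0 / n
    noisy = stats + rng.laplace(scale=2 * sensitivity / epsilon,
                                size=len(stats))
    return candidates[int(np.argmax(noisy))]  # report-noisy-max

# Example: a stream whose mean shifts from 0.2 to 0.8 at index 500.
rng = np.random.default_rng(1)
x = np.clip(np.concatenate([rng.normal(0.2, 0.1, 500),
                            rng.normal(0.8, 0.1, 500)]), 0, 1)
print(dp_change_point(x, epsilon=1.0, rng=rng))
```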
In Chapter 3, we study the problem of learning from imbalanced datasets, in which the classes are not equally represented, through the lens of differential privacy. A widely used method to address imbalanced data is resampling from the minority class instances. However, when confidential or sensitive attributes are present, data replication can lead to privacy leakage, disproportionally affecting the minority class. This challenge motivates the study of privacy-preserving pre-processing techniques for imbalanced learning. In this work, we present a differentially private synthetic minority oversampling technique (DP-SMOTE) which is based on a widely used non-private oversampling method known as SMOTE. Our algorithm generates differentially private synthetic data from the minority class. We demonstrate the impact of our pre-processing technique on the performance and privacy leakage of various classification methods in a detailed computational study.
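For context, the non-private SMOTE step that DP-SMOTE builds on interpolates between a minority sample and one of its k nearest minority neighbors; the sketch below shows this baseline (the private mechanism itself is the chapter's contribution and is not reproduced here).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(minority_X, n_synthetic, k=5, rng=None):
    """Standard SMOTE: requires at least k + 1 minority samples."""
    rng = rng or np.random.default_rng()
    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority_X)
    _, idx = nn.kneighbors(minority_X)       # idx[:, 0] is the point itself
    out = []
    for _ in range(n_synthetic):
        i = rng.integers(len(minority_X))    # random minority sample
        j = idx[i, rng.integers(1, k + 1)]   # one of its k true neighbors
        u = rng.random()
        out.append(minority_X[i] + u * (minority_X[j] - minority_X[i]))
    return np.array(out)
```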
In Chapter 4, we focus on the analysis of sensitive data generated from online internet activity. Accurately analyzing and modeling online browsing behavior plays a key role in understanding users and their interactions with technology. Toward this goal, we present an up-to-date measurement study of online browsing behavior. We study both self-reported and observational browsing data and analyze what underlying features can be learned from statistical analysis of this potentially sensitive data. For this, we empirically address the following questions: (1) Do structural patterns of browsing differ across demographic groups and types of web use? (2) Do people have correct perceptions of their behavior online? (3) Do people change their browsing behavior if they are aware of being observed?
In response to these questions, we find little difference across most demographic groups and website categories, suggesting that these features cannot be inferred solely from clickstream data. We find that users significantly overestimate the time they spend online but have relatively accurate perceptions of how they spend their time online. We find no significant changes in behavior over the course of the study, which may indicate either that observation had no effect on behavior or that users remained consciously aware of being observed throughout.
With the rapid development of mobile medical care, medical institutions face the hidden danger of privacy leakage when sharing personal medical data. Building on the k-anonymity and l-diversity models, this paper proposes a classified, personalized entropy l-diversity privacy protection model to protect user privacy in a fine-grained manner. By distinguishing strong from weak sensitive attribute values, the constraints on sensitive attributes are tightened and the leakage probability of vital sensitive information is reduced, enabling safe sharing of medical data. This research offers a customized information entropy l-diversity model, addressing the fact that the standard information entropy l-diversity model does not discriminate between strong and weak sensitive features, and validates it experimentally. Data analysis and experimental results show that the method improves data accuracy and service quality while minimizing execution time, making it more effective than existing solutions.
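For reference, the underlying entropy l-diversity condition (Machanavajjhala et al.) requires every equivalence class $q^{*}$ of quasi-identifiers to satisfy
$$-\sum_{s \in S} p(q^{*}, s)\,\log p(q^{*}, s) \;\ge\; \log l,$$
where $p(q^{*}, s)$ is the fraction of records in the class whose sensitive attribute takes value $s$. The customized model described above strengthens these constraints for strong sensitive values.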
Data mining (DM) and machine learning (ML) applications in medical diagnostic systems are burgeoning. Data privacy is essential in these systems, as healthcare data are highly sensitive. This work first discusses various privacy and security challenges in such systems. To address these, we then discuss different privacy-preserving (PP) computation techniques in the context of DM and ML for secure data evaluation and processing. State-of-the-art applications of these systems in healthcare are analyzed at various stages: the data collection, data publication, data distribution, and output phases for PPDM, and the input, model, training, and output phases for PPML. Furthermore, PP federated learning is also discussed. Finally, we present open challenges in these systems and future research directions.
This article is categorized under:
Application Areas > Health Care
Technologies > Machine Learning
Commercial, Legal, and Ethical Issues > Security and Privacy
A key challenge for clinical recommendation systems is the problem of aberrant patient profiles in social networks. Through such abnormal profiles, numerous fake accounts may be used to post false remarks about a person, or to carry out cyberbullying and cyber-attacks. Many clinical researchers have studied this topic extensively; this paper summarizes the most recent studies and provides an overarching framework. Alongside the methods and datasets that make up the data collection layer, the feature presentation and algorithm selection layers give an overview of the various algorithm choices available. The categorization and evaluation of diseases and disorders has been one of the major benefits of machine learning in medicine: conditions that were once hard to predict, from early-stage cancers that are difficult to detect to other illnesses spread through the bloodstream, have become more manageable. In healthcare, machine learning methods can be selected based on the reliability of their outcomes, which requires evaluating the findings produced by each method. The major difficulty arises during training and validation: because the datasets are so large, eliminating errors can be hard. The providers, other characteristics, various algorithms, data labelling techniques, and assessment criteria are all presented and contrasted in depth. Detecting anomalous users in medical social networks, however, remains a work in progress. The result evaluation layer explains how to evaluate and mark up the outputs of the various algorithm selection layers. Finally, the paper looks forward to further study in this area.
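Purely as an illustration of the algorithm-selection layer surveyed here (not any specific method from the reviewed papers), the sketch below flags anomalous user profiles with an off-the-shelf detector; the per-user features are hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical per-user features: posts/day, report count, account age (days).
profiles = np.array([[3, 0, 400], [5, 1, 900], [4, 0, 650],
                     [120, 30, 2],            # suspicious posting burst
                     [2, 0, 500]])
detector = IsolationForest(random_state=0).fit(profiles)
print(detector.predict(profiles))             # -1 marks flagged profiles
```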