We explore the risk of re-identification in anonymized data sets that preserve genealogical information (i.e. parent-child links). We consider attacks based on the number of children of an individual. We use part of a well-known data set in our experiments, which show that a substantial part of the population in the set can be identified, even if additional anonymity protection measures based on graph trimming are taken. We have also found that the risk increases quickly with generation depth.
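The attack idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' method: the `signature` function, the `EXAMPLE` graph, and the notion of "depth" used here (how many generations of descendants contribute child counts) are assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical toy genealogy: individual id -> list of child ids.
EXAMPLE = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": ["f"], "e": []}


def signature(children, person, depth):
    """Child count of `person`, plus the signatures of their children
    down to `depth` further generations (assumed attack knowledge)."""
    kids = children.get(person, [])
    if depth == 0:
        return (len(kids), ())
    sub = sorted(signature(children, k, depth - 1) for k in kids)
    return (len(kids), tuple(sub))


def unique_fraction(children, depth):
    """Fraction of individuals whose signature is shared with nobody else."""
    people = set(children)
    for kids in children.values():
        people.update(kids)
    counts = Counter(signature(children, p, depth) for p in people)
    unique = sum(1 for p in people
                 if counts[signature(children, p, depth)] == 1)
    return unique / len(people)


if __name__ == "__main__":
    for d in (0, 1):
        print(d, unique_fraction(EXAMPLE, d))
```

On this toy graph only one of six individuals has a unique child count, but two of six become unique once their children's child counts are also known, which mirrors the abstract's observation that risk grows with generation depth.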
Surveying the MOOC Data Set Universe Lohse, James J.; McManus, Christine A.; Joyner, David A.
2019 IEEE Learning With MOOCS (LWMOOCS),
2019-Oct.
Conference Proceeding
This paper is a survey of the availability of open data sets generated from Massive Open Online Courses (MOOCs). This log data allows researchers to analyze and predict student performance. Often, the goal of the analysis is to focus on at-risk students who are not likely to finish a course. There is a growing gap between the average researcher (who does not have access to proprietary data) and the ready availability of data sets for analysis. Most research papers studying and predicting student performance in MOOCs rely on proprietary data sets that are not anonymized (de-identified) or released for general study. There are no standardized tools that provide a gateway to usable data sets; instead, the researcher must navigate a maze of sites with different data structures and varying data access policies. To our knowledge, no open data sets have been produced since 2016. The authors survey the history of MOOC data sharing, identify the few available open data sets, and discuss a path forward to increase the reproducibility of MOOC research.
Recent studies show that ubiquitous smartphone data, e.g., the universal cell tower IDs, WiFi access points, etc., can be used to effectively recover individuals' mobility. However, recording and releasing data containing such information without anonymization can hurt individuals' location privacy. Therefore, many anonymization methods have been used to sanitize these datasets before they are shared with the research community. In this paper, we demonstrate the idea of statistical mobile user profiling and identification based on anonymized datasets. Our insight is that the mobility patterns inferred from different individuals' data are identifiable by using the statistical profiles constructed from the patterns. Experimental results show that the proposed method achieves a promising identification accuracy of 96% on average based on two randomly chosen users' data, which makes our framework feasible for detecting fraudulent smartphone usage. Also, extensive experiments are conducted on more challenging cases, showing a 59.5% identification accuracy for a total of 50 users based on 636 weekly data segments and a 56.1% accuracy for a total of 63 users based on 786 weekly data segments for two separate datasets. As the first work of its kind, our result suggests a good possibility of developing location-based services or applications on ubiquitous anonymized location datasets.
In this paper, I will discuss Secondary Use of official statistical data in Japan. The new Statistical Act, which came into force in 2009, enables users to access and utilize order-made tabulation and anonymized data. This system is rather new and not yet widely used in Japan, compared with Public Use Files in the US and European countries. I hope that this paper will be helpful to professionals and young researchers in the SocioInformatics field.
Sharing information is one of the most important parts of social activities. However, sharing information can leak users' information. Removing all direct identifiers is not enough. Sweeney proposed an approach that applies k-anonymity to protect users' identities from linking attacks. Sweeney's algorithm finds the optimal anonymized dataset using a minimal distortion metric. Other authors proposed other optimal algorithms, but their proposals are still impractical due to their high computational cost. Another approach is to release a minimal anonymized dataset by applying some heuristics. Wang and Fung proposed Bottom-up Generalization and Top-down Specialization (TDS), which publish a minimal anonymized dataset under an information loss metric and run more efficiently. However, these algorithms still have some limitations. In this paper, we propose an algorithm to publish anonymized datasets through a bottom-up generalization approach and an information loss data metric. Our algorithm can save time by storing statistical information for later use. The experiments are performed on the Adult dataset, which is used by all former algorithms. Experimental results show that our algorithm can process a dataset of 949,662 records in 42.219s. The classification error on data anonymized by our algorithm is 3.8% lower than with Wang's algorithm.
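As a rough illustration of the k-anonymity and bottom-up generalization ideas mentioned in the abstract above, the following minimal sketch raises the generalization level of one quasi-identifier until every group holds at least k records. It is not the proposed algorithm and uses no information loss metric; the two-attribute records, the three-level age hierarchy, and the function names are all assumptions made for illustration.

```python
from collections import Counter


def generalize_age(age, level):
    """Hypothetical hierarchy: 0 = exact age, 1 = decade range, 2 = suppressed."""
    if level == 0:
        return str(age)
    if level == 1:
        low = (age // 10) * 10
        return f"{low}-{low + 9}"
    return "*"


def is_k_anonymous(rows, k):
    """True if every combination of quasi-identifier values occurs >= k times."""
    return all(count >= k for count in Counter(rows).values())


def bottom_up(records, k, max_level=2):
    """Start from raw values and generalize age one level at a time.

    Returns (level, generalized_rows) for the first level that is
    k-anonymous, or None if even the coarsest level is not.
    """
    for level in range(max_level + 1):
        rows = [(generalize_age(age, level), zipcode) for age, zipcode in records]
        if is_k_anonymous(rows, k):
            return level, rows
    return None


if __name__ == "__main__":
    records = [(23, "130"), (27, "130"), (31, "148"), (38, "148")]
    print(bottom_up(records, 2))
```

With these four toy records and k = 2, exact ages leave every record unique, while decade ranges yield two groups of two, so the search stops at level 1.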
Modeling location obfuscation for continuous query Saxena, Anuj S.; Bera, Debajyoti; Goyal, Vikram
Journal of information security and applications,
February 2019, Volume 44
Journal Article
Peer reviewed
• We formalize local obfuscation mechanisms in the context of the continuous query scenario and exemplify how some of the existing location obfuscations can be expressed using our formal model.
• We define a probability-based measure to quantify the privacy of obfuscation mechanisms.
• We give a theoretical framework to analyze privacy enforcement techniques based on local obfuscation approaches.
• We demonstrate the effectiveness of the proposed framework by analyzing a square-grid-based location obfuscation mechanism for the continuous query scenario.
One of the major problems of location-based services is ensuring the location privacy of a mobile user. This becomes more challenging in the case of a continuously moving user. Though many techniques for preserving location privacy in the continuous query scenario have been studied in the last decade, a formal approach to quantifying privacy and justifying the correctness of the privacy guarantee remains largely in its infancy. In this paper, we present a theoretical model for location obfuscation that is flexible enough to express several obfuscation mechanisms and allows reasoning about their privacy guarantees. We illustrate the effectiveness of our theoretical model by analyzing a popular square grid-based obfuscation mechanism.
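A square grid-based obfuscation of the kind referred to above can be sketched as follows. This is only an illustrative assumption about how such a mechanism might look (reporting the grid cell instead of the exact position); it is not the formal model or the privacy measure from the paper, and the function names and `cell_size` parameter are hypothetical.

```python
import math


def obfuscate(x, y, cell_size):
    """Replace an exact position with the index of its square grid cell."""
    col = math.floor(x / cell_size)
    row = math.floor(y / cell_size)
    return (col, row)


def obfuscate_trajectory(points, cell_size):
    """Apply the same grid to every position of a continuously moving user."""
    return [obfuscate(x, y, cell_size) for x, y in points]


if __name__ == "__main__":
    path = [(1.0, 1.0), (4.0, 4.0), (6.0, 1.0)]
    print(obfuscate_trajectory(path, 5.0))
```

In this sketch the first two positions fall into the same cell and are indistinguishable to the service, while the third reveals a cell crossing; sequences of reported cells like this are the sort of information a privacy analysis for continuous queries would have to account for.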