We consider the problem of a Parameter Server (PS) that wishes to learn a model that fits data distributed on the nodes of a graph. We focus on Federated Learning (FL) as a canonical application. One ...of the main challenges of FL is the communication bottleneck between the nodes and the parameter server. A popular solution in the literature is to allow each node to do several local updates on the model in each iteration before sending it back to the PS. While this mitigates the communication bottleneck, the statistical heterogeneity of the data owned by the different nodes has proven to delay convergence and bias the model. In this work, we study random walk (RW) learning algorithms for tackling the communication and data heterogeneity problems. The main idea is to leverage available direct connections among the nodes themselves, which are typically "cheaper" than the communication to the PS. In a random walk, the model is thought of as a "baton" that is passed from a node to one of its neighbors after being updated in each iteration. The challenge in designing the RW is the data hetErogeneity and the uncertainty about the data distributions. Ideally, we would want to visit more often nodes that hold more informative data. We cast this problem as a sleeping multi-armed bandit (MAB) to design near-optimal node sampling strategy that achieves a variance reduced gradient estimates and approaches sub-linearly the optimal sampling strategy. Based on this framework, we present an adaptive random walk learning algorithm. We provide theoretical guarantees on its convergence. Our numerical results validate our theoretical findings and show that our algorithm outperforms existing random walk algorithms.
Motivated by the growing availability of personal genomics services, we study an information-theoretic privacy problem that arises when sharing genomic data: a user wants to share his or her genome ...sequence while keeping the genotypes at certain positions hidden, which could otherwise reveal critical health-related information. A straightforward solution of erasing (masking) the chosen genotypes does not ensure privacy, because the correlation between nearby positions can leak the masked genotypes. We introduce an erasure-based privacy mechanism with perfect information-theoretic privacy, whereby the released sequence is statistically independent of the sensitive genotypes. Our mechanism can be interpreted as a locally-optimal greedy algorithm for a given processing order of sequence positions, where utility is measured by the number of positions released without erasure. We show that finding an optimal order is NP-hard in general and provide an upper bound on the optimal utility. For sequences from hidden Markov models, a standard modeling approach in genetics, we propose an efficient algorithmic implementation of our mechanism with complexity polynomial in sequence length. Moreover, we illustrate the robustness of the mechanism by bounding the privacy leakage from erroneous prior distributions. Our work is a step towards more rigorous control of privacy in genomic data sharing.
We study the communication efficient secret sharing (CESS) problem. A classical threshold secret sharing scheme encodes a secret into <inline-formula> <tex-math notation="LaTeX">n ...</tex-math></inline-formula> shares given to <inline-formula> <tex-math notation="LaTeX">n </tex-math></inline-formula> parties, such that any set of at least <inline-formula> <tex-math notation="LaTeX">t </tex-math></inline-formula>, <inline-formula> <tex-math notation="LaTeX">t<n </tex-math></inline-formula>, parties can reconstruct the secret, and any set of at most <inline-formula> <tex-math notation="LaTeX">z </tex-math></inline-formula>, <inline-formula> <tex-math notation="LaTeX">z<t<n </tex-math></inline-formula>, colluding parties cannot obtain any information about the secret. A CESS scheme satisfies the previous properties of threshold secret sharing. Moreover, it allows to reconstruct the secret from any set of <inline-formula> <tex-math notation="LaTeX">d\geq t </tex-math></inline-formula>, parties by reading and communicating the minimum amount of information. In this paper, we introduce three explicit constructions of CESS codes called Staircase codes . The first construction achieves optimal communication and read costs for a given <inline-formula> <tex-math notation="LaTeX">d </tex-math></inline-formula>. The second construction achieves optimal costs universally for all possible values of <inline-formula> <tex-math notation="LaTeX">d </tex-math></inline-formula> between <inline-formula> <tex-math notation="LaTeX">t </tex-math></inline-formula> and <inline-formula> <tex-math notation="LaTeX">n </tex-math></inline-formula>. The third construction, which is the most general, achieves optimal costs universally for all values of <inline-formula> <tex-math notation="LaTeX">d </tex-math></inline-formula> in any given set <inline-formula> <tex-math notation="LaTeX">\Delta \subseteq \{t, {\dots },n\} </tex-math></inline-formula>. The introduced Staircase codes can store a secret of maximal size, i.e., equal to <inline-formula> <tex-math notation="LaTeX">t-z </tex-math></inline-formula> shares, and they are all designed over a small finite field <inline-formula> <tex-math notation="LaTeX">GF(q) </tex-math></inline-formula>, for any prime power <inline-formula> <tex-math notation="LaTeX">q> n </tex-math></inline-formula>. However, Staircase codes may require dividing the secret and the shares into many symbols. We also describe how Staircase codes can be used to construct threshold changeable secret sharing with minimum storage cost, i.e., minimum share size.
ON-OFF Privacy in the Presence of Correlation Ye, Fangwei; Naim, Carolina; Rouayheb, Salim El
IEEE transactions on information theory,
11/2021, Volume:
67, Issue:
11
Journal Article
Peer reviewed
Open access
We formulate and study the problem of ON-OFF privacy. ON-OFF privacy algorithms enable a user to continuously switch his privacy between ON and OFF. An obvious example is the incognito mode in ...internet browsers. But beyond internet browsing, ON-OFF privacy can be a desired feature in most online applications. The challenge is that the statistical correlation over time of a user's online behavior can lead to leakage of information. We consider the setting in which a user is interested in retrieving the latest message generated by one of <inline-formula> <tex-math notation="LaTeX">N </tex-math></inline-formula> sources. The user's privacy status can change between ON and OFF over time. When privacy is ON the user wants to hide his request. Moreover, since the user's requests depend on personal attributes such as age, gender, and political views, they are typically correlated over time. As a consequence, the user cannot simply ignore privacy when privacy is OFF. We model the correlation between user's requests by an <inline-formula> <tex-math notation="LaTeX">N </tex-math></inline-formula> state Markov chain. The goal is to design query schemes with optimal download rate, that preserve privacy in an ON-OFF privacy setting. In this paper, we present inner and outer bounds on the achievable download rate for <inline-formula> <tex-math notation="LaTeX">N </tex-math></inline-formula> sources. We also devise an efficient algorithm to construct an ON-OFF privacy scheme achieving the inner bound and prove its optimality in the case <inline-formula> <tex-math notation="LaTeX">N=2 </tex-math></inline-formula> sources. For <inline-formula> <tex-math notation="LaTeX">N > 2 </tex-math></inline-formula>, finding tighter outer bounds and efficient constructions of ON-OFF privacy schemes that would achieve them remains an open question.
The problem of providing privacy, in the private information retrieval (PIR) sense, to users requesting data from a distributed storage system (DSS), is considered. The DSS is coded by an ...<inline-formula> <tex-math notation="LaTeX">(n,k,d) </tex-math></inline-formula> maximum distance separable code to store the data reliably on unreliable storage nodes. Some of these nodes can be spies which report to a third party, such as an oppressive regime, which data is being requested by the user. An information theoretic PIR scheme ensures that a user can satisfy its request while revealing no information on which data is being requested to the nodes. A user can trivially achieve PIR by downloading all the data in the DSS. However, this is not a feasible solution due to its high communication cost. We construct PIR schemes with low download communication cost. When there is <inline-formula> <tex-math notation="LaTeX">b=1 </tex-math></inline-formula> spy node in the DSS, in other words, no collusion between the nodes, we construct PIR schemes with download cost <inline-formula> <tex-math notation="LaTeX">\frac {1}{1-R} </tex-math></inline-formula> per unit of requested data (<inline-formula> <tex-math notation="LaTeX">R=k/n </tex-math></inline-formula> is the code rate), achieving the information theoretic limit for linear schemes. The proposed schemes are universal since they depend on the code rate, but not on the generator matrix of the code. Also, if <inline-formula> <tex-math notation="LaTeX">b\leq n-\delta k </tex-math></inline-formula> nodes collude, with <inline-formula> <tex-math notation="LaTeX">\delta =\lfloor {\frac {n-b}{k}}\rfloor </tex-math></inline-formula>, we construct linear PIR schemes with download cost <inline-formula> <tex-math notation="LaTeX">\frac {b+\delta k}{\delta } </tex-math></inline-formula>.
We consider the setting of a Master server, M , who possesses confidential data and wants to run intensive computations on it, as part of a machine learning algorithm for example. The Master wants to ...distribute these computations to untrusted workers who volunteered to help with this task. However, the data must be kept private in an information theoretic sense. Some of the workers may be stragglers, e.g., slow or busy. We are interested in reducing the delays experienced by the Master. We focus on linear computations as an essential operation in many iterative algorithms. We propose a solution based on new codes, called Staircase codes, introduced previously by two of the authors. Staircase codes allow flexibility in the number of stragglers up to a given maximum, and universally achieve the information theoretic limit on the download cost by the Master, leading to latency reduction. We find upper and lower bounds on the Master's mean waiting time. We derive the distribution of the Master's waiting time, and its mean, for systems with up to two stragglers. We show that Staircase codes always outperform existing solutions based on classical secret sharing codes. We validate our results with extensive implementation on Amazon EC2.
We address the problem of securing distributed storage systems against eavesdropping and adversarial attacks. An important aspect of these systems is node failures over time, necessitating, thus, a ...repair mechanism in order to maintain a desired high system reliability. In such dynamic settings, an important security problem is to safeguard the system from an intruder who may come at different time instances during the lifetime of the storage system to observe and possibly alter the data stored on some nodes. In this scenario, we give upper bounds on the maximum amount of information that can be stored safely on the system. For an important operating regime of the distributed storage system, which we call the bandwidth-limited regime, we show that our upper bounds are tight and provide explicit code constructions. Moreover, we provide a way to short list the malicious nodes and expurgate the system.
Private Multi-Group Aggregation Naim, Carolina; D'Oliveira, Rafael G. L.; Rouayheb, Salim El
IEEE journal on selected areas in communications,
03/2022, Volume:
40, Issue:
3
Journal Article
Peer reviewed
Open access
We study the differentially private multi-group aggregation (PMGA) problem. This setting involves a single server and <inline-formula> <tex-math notation="LaTeX">n </tex-math></inline-formula> users. ...Each user belongs to one of <inline-formula> <tex-math notation="LaTeX">k </tex-math></inline-formula> distinct groups and holds a discrete value. The goal is to design schemes that allow the server to find the aggregate (sum) of the values in each group (with high accuracy) under communication and local differential privacy constraints. The privacy constraint guarantees that the user's group remains private. This is motivated by applications where a user's group can reveal sensitive information, such as his religious and political beliefs, health condition, or race. We propose a novel scheme, dubbed Query and Aggregate (Q&A) for PMGA. The novelty of Q&A is that it is an interactive aggregation scheme. In Q&A, each user is assigned a random query matrix, to which he sends the server an answer based on his group and value. We characterize the Q&A scheme's performance in terms of accuracy (MSE), privacy, and communication. We compare Q&A to the Randomized Group (RG) scheme, which is non-interactive and adapts existing randomized response schemes to the PMGA setting. We observe that typically Q&A outperforms RG, in terms of privacy vs. utility, in the high privacy regime.
We consider distributed gradient descent in the presence of stragglers. Recent work on gradient coding and approximate gradient coding have shown how to add redundancy in distributed gradient descent ...to guarantee convergence even if some workers are stragglers -that is, slow or non-responsive. In this work we propose an approximate gradient coding scheme called Stochastic Gradient Coding (SGC), which works when the stragglers are random. SGC distributes data points redundantly to workers according to a pair-wise balanced design, and then simply ignores the stragglers. We prove that the convergence rate of SGC mirrors that of batched Stochastic Gradient Descent (SGD) for the <inline-formula> <tex-math notation="LaTeX">\ell _{2} </tex-math></inline-formula> loss function, and show how the convergence rate can improve with the redundancy. We also provide bounds for more general convex loss functions. We show empirically that SGC requires a small amount of redundancy to handle a large number of stragglers and that it can outperform existing approximate gradient codes when the number of stragglers is large.
We consider straggler-resilient learning. In many previous works, e.g., in the coded computing literature, straggling is modeled as random delays that are independent and identically distributed ...between workers. However, in many practical scenarios, a given worker may straggle over an extended period of time. We propose a latency model that captures this behavior and is substantiated by traces collected on Microsoft Azure, Amazon Web Services (AWS), and a small local cluster. Building on this model, we propose DSAG, a mixed synchronous-asynchronous iterative optimization method, based on the stochastic average gradient (SAG) method, that combines timely and stale results. We also propose a dynamic load-balancing strategy to further reduce the impact of straggling workers. We evaluate DSAG for principal component analysis, cast as a finite-sum optimization problem, of a large genomics dataset, and for logistic regression on a cluster composed of 100 workers on AWS, and find that DSAG is up to about 50% faster than SAG, and more than twice as fast as coded computing methods, for the particular scenario that we consider.