Robust and secure distributed machine learning

In distributed learning systems, even a single adversarial node can disrupt the system's operation. We developed two approaches to this problem: the first is suitable when the data stored at the distributed nodes can be encoded, and the second when such encoding is not possible and the models are high-dimensional. We describe each below.

When one uses partially trusted nodes for distributed optimization, the nodes could be malicious and return adversarial responses. We studied distributed optimization in the presence of Byzantine adversaries, where both data and computation are distributed among m worker machines, t of which can be corrupt and collaboratively deviate arbitrarily from their pre-specified programs, while a designated master node iteratively computes the model/parameter vector for generalized linear models. We proposed a method based on data encoding and error correction over the real numbers to combat adversarial attacks, without any statistical assumptions on the data, and demonstrated that up to t ~ (m-1)/2 corrupt worker nodes can be tolerated, which is information-theoretically optimal. The method is based on a sparse encoding scheme that makes data encoding and decoding computationally efficient. Our results demonstrate a trade-off between the corruption threshold and the resource requirements (storage and computational/communication complexity).
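To give a flavor of error correction over the real numbers, the sketch below recovers a vector from redundantly encoded worker outputs in which up to t entries have been arbitrarily corrupted, using classical L1 (linear programming) decoding. This is a minimal illustration under simplified assumptions, not our encoding scheme: the dense Gaussian matrix B, the helper l1_decode, and all parameter values are hypothetical choices for the example, whereas our construction uses a sparse code precisely to keep encoding and decoding efficient.

```python
import numpy as np
from scipy.optimize import linprog

def l1_decode(B, y):
    """Recover x from y = B @ x + e, where e is an unknown sparse
    (adversarial) corruption, by solving min_x ||y - B x||_1 as an LP.
    Classical real-number error correction; illustrative, not our scheme."""
    m, k = B.shape
    # Variables z = [x (k entries), u (m entries)]; minimize sum(u)
    # subject to |y - B x| <= u, written as two sets of linear constraints.
    c = np.concatenate([np.zeros(k), np.ones(m)])
    A_ub = np.block([[-B, -np.eye(m)],    # y - B x <= u
                     [ B, -np.eye(m)]])   # B x - y <= u
    b_ub = np.concatenate([-y, y])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (k + m))
    return res.x[:k]

rng = np.random.default_rng(0)
m, k, t = 24, 8, 3                        # illustrative sizes: m outputs, t corrupt
B = rng.standard_normal((m, k))           # dense here; our scheme uses a sparse code
g = rng.standard_normal(k)                # e.g., a gradient the master must recover
y = B @ g
bad = rng.choice(m, size=t, replace=False)
y[bad] += 100.0 * rng.standard_normal(t)  # Byzantine workers return garbage
print(np.allclose(l1_decode(B, y), g, atol=1e-5))  # exact recovery w.h.p.
```

Because the corruption is confined to at most t coordinates, the encoded redundancy lets the master decode the correct result exactly, no matter how large the adversarial perturbations are.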

When building learning models in the presence of malicious input from potentially compromised (Byzantine) edge nodes, a challenge is to design robust methods with provable learning guarantees despite such attacks. This becomes especially difficult when the models are high-dimensional, since statistically optimal adversary-filtering techniques, such as the geometric median, become computationally infeasible, while computationally feasible methods perform poorly in high dimensions. Together, these challenges imply that traditional methods designed for learning models in data centers do not port directly to federated/distributed learning.
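For concreteness, geometric-median aggregation replaces the average of the workers' vectors with their geometric median, which tolerates a minority of arbitrary outliers; a standard way to approximate it is Weiszfeld's iteratively re-weighted averaging, sketched below. This is a generic textbook sketch included only to make the baseline concrete; the function name and tolerances are illustrative.

```python
import numpy as np

def geometric_median(points, n_iter=100, tol=1e-8):
    """Weiszfeld's algorithm: iteratively re-weighted averaging that
    approaches the point minimizing the sum of Euclidean distances to
    the inputs (the geometric median). Each iteration costs O(m * d)
    for m vectors in d dimensions."""
    z = points.mean(axis=0)                   # start from the ordinary mean
    for _ in range(n_iter):
        dist = np.linalg.norm(points - z, axis=1)
        w = 1.0 / np.maximum(dist, 1e-12)     # guard against division by zero
        z_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(z_new - z) < tol:
            break
        z = z_new
    return z

rng = np.random.default_rng(1)
honest = rng.normal(0.0, 0.1, size=(17, 1000))  # 17 honest gradients, d = 1000
byz = rng.normal(50.0, 1.0, size=(3, 1000))     # 3 Byzantine gradients, far away
agg = geometric_median(np.vstack([honest, byz]))
print(np.linalg.norm(agg - honest.mean(axis=0)))  # small despite the outliers
```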

Previously developed adversary-filtering techniques are either unsuitable for high-dimensional models or make homogeneous statistical assumptions on the data, which makes them unsuitable for federated learning, where the local data is heterogeneous and one cannot make such statistical assumptions. We developed a robust learning framework for federated learning with heterogeneous local data, which uses local computation to also achieve communication efficiency. To combat the adversaries, we employed an efficient (polynomial-time) high-dimensional robust mean estimation algorithm at the server to filter out corrupt vectors; and to analyze the outlier-filtering procedure, we developed a novel matrix concentration result for heterogeneous data that may be of independent interest. We provided convergence analyses for both strongly convex and non-convex smooth objectives in the heterogeneous data setting, where different clients may have different local datasets and we make no probabilistic assumptions on data generation.
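The sketch below shows the shape of such a server-side filter: a simplified iterative spectral outlier-removal routine standing in for the robust mean estimator. It is a hypothetical stand-in for illustration, not the exact algorithm from this work; robust_mean, eps, and var_bound are made-up names and parameters.

```python
import numpy as np

def robust_mean(updates, eps, var_bound):
    """Simplified spectral filtering for high-dimensional robust mean
    estimation: while some direction of the empirical covariance shows
    abnormally large variance, drop the points contributing most to it.
    Illustrative stand-in, not the estimator used in this work."""
    X = np.asarray(updates, dtype=float)
    while X.shape[0] > 1:
        mu = X.mean(axis=0)
        centered = X - mu
        cov = centered.T @ centered / X.shape[0]
        lam, V = np.linalg.eigh(cov)          # eigenvalues in ascending order
        if lam[-1] <= var_bound:              # no suspicious direction left
            return mu
        scores = (centered @ V[:, -1]) ** 2   # contribution to that variance
        n_drop = max(1, int(np.ceil(eps * X.shape[0])))
        X = X[np.argsort(scores)[:-n_drop]]   # discard the worst offenders
    return X.mean(axis=0)

rng = np.random.default_rng(2)
honest = rng.normal(0.0, 1.0, size=(38, 20))  # benign (heterogeneous) updates
byz = np.full((2, 20), 30.0)                  # colluding Byzantine updates
agg = robust_mean(np.vstack([honest, byz]), eps=0.05, var_bound=4.0)
print(np.linalg.norm(agg - honest.mean(axis=0)))  # close to the honest mean
```

In the full framework, clients perform several local update steps between such server-side aggregation rounds, which is the source of the communication savings.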