# Communication efficient distributed machine learning

As machine learning moves away from data-center architectures toward learning on edge devices (smartphones, IoT sensors, etc.), new paradigms such as *federated learning* are emerging. Here data resides on the edge devices (it is not sent to a central server), and a collaborative model is built across the distributed devices through interactive communication. Realizing this federated learning vision poses important challenges; a key one is the efficient use of limited communication bandwidth.

Many current implementations of high-dimensional learning models, based on distributed/parallel optimization, exchange full-precision gradients and therefore require significant communication resources. For example, consider training the ResNet-152 architecture, which has about 60 million parameters, on the ImageNet dataset of 14 million images. Each full-precision (32-bit) exchange between workers is around 240 MB. While the ResNet example gives a quantitative sense of the scale, such scenarios arise whenever large-scale neural network models are built from the data of millions of users (via cellphones and IoT devices).
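The 240 MB figure follows directly from the parameter count; a quick back-of-envelope check (assuming 32-bit floats and roughly 60 million parameters, the order of ResNet-152's size):

```python
# Size of one full-precision gradient exchange for a ResNet-152-scale model.
num_params = 60_000_000      # approximate parameter count
bytes_per_param = 4          # 32-bit (full-precision) floats
exchange_mb = num_params * bytes_per_param / 1e6
print(f"{exchange_mb:.0f} MB per exchange")
```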

We **developed Qsparse-local-SGD**, which combines quantization, sparsification, and local computation to achieve learning performance comparable to the state of the art with an orders-of-magnitude reduction in the communication bits needed, demonstrated in implementations using the ResNet-50 neural network architecture and the ImageNet dataset.
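The compression step of such a scheme can be sketched as top-k sparsification composed with a scaled-sign quantizer, together with an error-feedback memory that carries the dropped mass forward. This is a minimal NumPy sketch under illustrative choices (the exact sparsifier/quantizer combinations in the work vary, and the function names are ours):

```python
import numpy as np

def topk_sparsify(g, k):
    """Keep the k largest-magnitude entries of g, zero out the rest."""
    out = np.zeros_like(g)
    idx = np.argpartition(np.abs(g), -k)[-k:]
    out[idx] = g[idx]
    return out

def scaled_sign_quantize(g):
    """On the nonzero support, send only signs plus one shared scale."""
    nz = g != 0
    if not nz.any():
        return g
    scale = np.abs(g[nz]).mean()
    q = np.zeros_like(g)
    q[nz] = scale * np.sign(g[nz])
    return q

def compress_with_memory(g, memory, k):
    """Error feedback: compress (gradient + residual), carry the rest forward."""
    corrected = g + memory
    q = scaled_sign_quantize(topk_sparsify(corrected, k))
    new_memory = corrected - q      # what the compressor dropped this round
    return q, new_memory
```

With error feedback, the components the compressor drops are not lost but re-injected into later rounds; this is the standard mechanism for preserving the vanilla-SGD convergence rate under aggressive compression.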

We **proposed** both synchronous and asynchronous implementations of *Qsparse-local-SGD* in a distributed setting, where the nodes perform computations on their local datasets. We analyzed the convergence of Qsparse-local-SGD in the distributed case, for smooth non-convex and smooth strongly-convex objective functions, and demonstrated that **Qsparse-local-SGD** converges at the same rate as vanilla distributed SGD for many important classes of sparsifiers and quantizers.

We proposed **SPARQ-SGD**, which can be used when nodes do not have a central coordinator and are not directly connected to all other nodes, but instead communicate over a network. In addition to compressing the gradients, we developed the novel idea of *event-triggered* communication, where a node initiates a (communication) action only if the change in its model parameters exceeds a prescribed threshold. Theoretically, we showed that *SPARQ-SGD* converges at the same rate as vanilla SGD for both smooth non-convex and convex objective functions, even when nodes have heterogeneous data. We also extended this to learning with Nesterov momentum acceleration, designing
**SQuARM-SGD**, and demonstrated both theoretically and on datasets that one can obtain communication savings without sacrificing performance. These methods yielded savings of 20 times over the state of the art and 1000 times over vanilla full-precision methods on the ImageNet dataset.

While the works discussed above learn a single common model for all the nodes, in many scenarios this may not be desirable because of heterogeneity (especially in the data) among the devices. Thus, there is interest in learning individual (potentially related) models for the participating nodes in a collaborative manner. We have considered this
**formulation** for optimization of (additive) convex objectives at the nodes in a communication-efficient manner (using compression) under pairwise constraints arising at the node level. Further, we consider two different frameworks of sample access at the nodes: (i) sample feedback, where nodes can directly sample the gradient and objective value at any point of interest at each instance; and (ii) bandit feedback, where nodes only have access to two randomly perturbed function values at the parameter of interest. For both frameworks, we develop communication-efficient saddle-point algorithms and provide theoretical guarantees on their convergence as well as on the expected constraint violation of the obtained solutions. Our results show that these bounds match (in an order sense) those of saddle-point algorithms in the literature that use no compression of exchanges.
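In the bandit-feedback setting, gradients must be inferred from the two perturbed function values alone. A standard two-point (zeroth-order) gradient estimator, shown here as an illustrative sketch rather than the exact construction used in the work:

```python
import numpy as np

def two_point_grad(f, x, delta, rng):
    """Zeroth-order gradient estimate from two perturbed function values,
    the only access allowed under bandit feedback."""
    u = rng.standard_normal(x.shape)                  # random probe direction
    return (f(x + delta * u) - f(x - delta * u)) / (2 * delta) * u
```

Averaged over the random directions `u`, the estimate is (approximately) unbiased for the true gradient, which is what lets saddle-point methods run with only function-value access.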

We **considered** the problem of distributed feature quantization, where the goal is to enable a pre-trained classifier at a central node to carry out its classification on features gathered from distributed nodes over communication-constrained channels. We proposed quantization algorithms that leverage discrete neural representations and training data, and that can be designed in polynomial time for any number of features, any number of classes, and an arbitrary division of features across the distributed nodes. We found that tailoring the quantizers to the classification task offers significant savings compared to baseline quantizers designed for reconstructing the data: we can achieve more than a factor-of-two reduction in the number of bits communicated, at the same classification accuracy.
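A toy illustration of why tailoring the quantizer to the task helps (purely illustrative, using a 1-bit scalar quantizer and a threshold classifier, not the discrete-neural-representation construction from the work):

```python
import numpy as np

def quantize_1bit(x, threshold, levels):
    """Send 1 bit per feature: reproduce values below/above the threshold
    by levels[0]/levels[1]."""
    return np.where(x < threshold, levels[0], levels[1])

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 1000)               # scalar feature at a remote node
classify = lambda v: (v > 1.2).astype(int)   # stand-in "pre-trained" classifier
labels = classify(x)

# Reconstruction-oriented quantizer: split at the mean, reproduce each cell
# by its conditional mean (good for MSE, oblivious to the classifier).
t_rec = x.mean()
rec_levels = (x[x < t_rec].mean(), x[x >= t_rec].mean())
acc_rec = (classify(quantize_1bit(x, t_rec, rec_levels)) == labels).mean()

# Task-aware quantizer: split at the classifier's own decision boundary.
task_levels = (0.0, 2.0)                     # any values on the correct sides
acc_task = (classify(quantize_1bit(x, 1.2, task_levels)) == labels).mean()
```

The reconstruction-oriented quantizer places its cells where the data is dense, while the task-aware one places them where the classifier's decision changes, so the same one-bit budget preserves every decision.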