Load imbalance is pervasive in distributed deep learning
training systems, caused either by inherent imbalance in the
learned tasks or by the system itself. Traditional
synchronous Stochastic Gradient Descent (SGD) achieves
good accuracy for a wide variety of tasks, but relies on global
synchronization to accumulate the gradients at every training
step. In this paper, we propose eager-SGD, which relaxes
global synchronization in favor of decentralized gradient accumulation.
To implement eager-SGD, we propose two partial collective operations:
solo and majority allreduce. With solo allreduce, the faster
processes contribute their gradients eagerly without waiting
for the slower processes, whereas with majority allreduce,
at least half of the participants must contribute gradients
before continuing, all without using a central parameter
server. We theoretically prove the convergence of the algorithms
and describe the partial collectives in detail. Experiments
are conducted on a variety of neural networks and
datasets. The results on load-imbalanced environments show
that eager-SGD achieves 2.64× speedup (ResNet-50 on ImageNet)
over the asynchronous centralized SGD, and achieves
1.29× speedup (ResNet-50 on ImageNet) and 1.27× speedup
(LSTM on UCF101) over the state-of-the-art synchronous
decentralized SGDs, without losing accuracy.
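To make the quorum semantics of the two partial collectives concrete, below is a minimal single-machine NumPy sketch. It only simulates the solo/majority quorum rule; it is not the paper's MPI-based implementation, and the function name, the zero substitution for straggler contributions, and the example data are illustrative assumptions.

```python
import numpy as np

def partial_allreduce(gradients, arrived, mode="majority"):
    """Simulate the semantics of a partial (solo/majority) allreduce.

    gradients -- list of per-rank gradient arrays (illustrative data only)
    arrived   -- booleans marking which ranks reached the collective in time
    mode      -- "solo": proceed once at least one rank has arrived;
                 "majority": proceed once at least half of the ranks arrived.
    """
    p = len(gradients)
    quorum = 1 if mode == "solo" else (p + 1) // 2  # ceil(p/2)
    if sum(arrived) < quorum:
        # In a real collective the callers would block until the quorum forms;
        # this simulation simply reports that the condition is not yet met.
        raise RuntimeError("quorum not reached")
    # Sum only the contributions of ranks that arrived in time; the missing
    # contributions of stragglers are treated as zero in this sketch.
    total = sum(g for g, ok in zip(gradients, arrived) if ok)
    return total / p  # average over all p ranks, as in synchronous SGD

# Tiny usage example: 4 simulated ranks, rank 3 is a straggler.
grads = [np.full(3, float(r + 1)) for r in range(4)]
print(partial_allreduce(grads, arrived=[True, True, True, False]))
```

In eager-SGD itself, the slower processes' gradients are not discarded but are incorporated at a later training step; the sketch above does not attempt to model that staleness.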
@inproceedings{,
  author    = {Shigang Li and Tal Ben-Nun and Salvatore Di Girolamo and Dan Alistarh and Torsten Hoefler},
  title     = {{Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations}},
  year      = {2020},
  month     = {Feb.},
  booktitle = {Proceedings of the 25th Symposium on Principles and Practice of Parallel Programming (PPoPP'20)},
  source    = {http://www.unixer.de/~htor/publications/},
}