horovod
- Version: 0.22.1
- Category: ai
- Cluster: Loki
Description
Horovod is a distributed deep learning training framework that makes it easy to scale training across multiple GPUs and nodes using TensorFlow, Keras, PyTorch, and MXNet.
Built on top of MPI and NCCL, Horovod simplifies multi-GPU and multi-node workflows by abstracting the complexity of communication backends and synchronization strategies.
Version 0.22.1 includes:
- AllReduce-based data parallelism
- Support for PyTorch, TensorFlow 2, Keras, and MXNet
- GPU-aware training via NCCL
- Tensor Fusion for communication efficiency
- Horovod Timeline for performance profiling
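To make the first item concrete: in AllReduce-based data parallelism, each worker computes gradients on its own portion of the batch, and the gradients are then averaged across workers so that every replica applies the same update. The following is a minimal pure-Python simulation of that averaging step. It does not use Horovod, MPI, or NCCL, and the function name `allreduce_average` is illustrative only.

```python
from typing import List


def allreduce_average(per_worker_grads: List[List[float]]) -> List[List[float]]:
    """Simulate an averaging allreduce: every worker ends up holding the
    element-wise mean of all workers' gradient vectors."""
    if not per_worker_grads:
        raise ValueError("need at least one worker")
    n_workers = len(per_worker_grads)
    length = len(per_worker_grads[0])
    if any(len(g) != length for g in per_worker_grads):
        raise ValueError("all workers must hold gradients of the same shape")
    # Reduce: element-wise sum across workers, divided by the worker count.
    mean = [sum(g[i] for g in per_worker_grads) / n_workers for i in range(length)]
    # "All": each worker receives its own copy of the reduced result.
    return [list(mean) for _ in range(n_workers)]


# Four simulated workers, each with a two-element gradient.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
result = allreduce_average(grads)
print(result[0])  # [4.0, 5.0]
```

In a real job this exchange is carried out by the communication backend rather than by Python loops; the sketch only shows what result each worker ends up with.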
Available Module Variants
| Module Name | Backend | CUDA | Python | GCC |
|---|---|---|---|---|
| horovod-pytorch-py37-cuda10.2-gcc8/0.22.1 | PyTorch | 10.2 | 3.7 | 8 |
| horovod-pytorch-py37-cuda11.2-gcc8/0.22.1 | PyTorch | 11.2 | 3.7 | 8 |
| horovod-pytorch-py39-cuda11.2-gcc9/0.22.1 | PyTorch | 11.2 | 3.9 | 9 |
| horovod-tensorflow2-py37-cuda10.2-gcc8/0.22.1 | TensorFlow 2 | 10.2 | 3.7 | 8 |
| horovod-tensorflow2-py37-cuda11.2-gcc8/0.22.1 | TensorFlow 2 | 11.2 | 3.7 | 8 |
| horovod-tensorflow2-py39-cuda11.2-gcc9/0.22.1 | TensorFlow 2 | 11.2 | 3.9 | 9 |
| horovod-mxnet-py37-cuda10.2-gcc8/0.22.1 | MXNet | 10.2 | 3.7 | 8 |
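The module names in the table appear to follow the pattern `horovod-<backend>-py<XY>-cuda<version>-gcc<major>/<horovod version>`. Assuming that pattern holds (it is inferred from the table, not a documented rule), a small helper can split a name into its parts, for example to select a variant in a job-setup script. `parse_module_name` is a hypothetical helper, not part of Horovod or the module system.

```python
import re
from typing import Dict

# Pattern inferred from the module table; an assumption, not an official spec.
_MODULE_RE = re.compile(
    r"^horovod-(?P<backend>[a-z0-9]+)"
    r"-py(?P<py_major>\d)(?P<py_minor>\d+)"
    r"-cuda(?P<cuda>\d+\.\d+)"
    r"-gcc(?P<gcc>\d+)"
    r"/(?P<version>[\d.]+)$"
)


def parse_module_name(name: str) -> Dict[str, str]:
    """Split a module name into backend, Python, CUDA, GCC, and Horovod version."""
    m = _MODULE_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized module name: {name!r}")
    return {
        "backend": m["backend"],
        "python": f"{m['py_major']}.{m['py_minor']}",
        "cuda": m["cuda"],
        "gcc": m["gcc"],
        "version": m["version"],
    }


info = parse_module_name("horovod-pytorch-py39-cuda11.2-gcc9/0.22.1")
print(info["backend"], info["python"], info["cuda"])  # pytorch 3.9 11.2
```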
Documentation
CLI:
horovodrun -np <num_procs> -H <hostlist> <script.py>
Common environment variables:
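- HOROVOD_FUSION_THRESHOLD: size of the Tensor Fusion buffer, in bytes (0 disables fusion)
- HOROVOD_TIMELINE: path of the file the Horovod Timeline is written to
- NCCL_DEBUG=INFO: makes NCCL print informational messages, useful when diagnosing communication problems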
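As a sketch of how these variables might be set from a Python launcher script, the helper below builds an environment dictionary. The helper name `horovod_env`, the 32 MB value, and the timeline path are illustrative assumptions, not recommended settings.

```python
import os
from typing import Dict, Optional


def horovod_env(fusion_threshold_mb: int = 32,
                timeline_path: Optional[str] = None,
                nccl_debug: bool = False,
                base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with Horovod/NCCL variables added."""
    env = dict(os.environ if base is None else base)
    # HOROVOD_FUSION_THRESHOLD is expressed in bytes.
    env["HOROVOD_FUSION_THRESHOLD"] = str(fusion_threshold_mb * 1024 * 1024)
    if timeline_path is not None:
        # File that the Horovod Timeline trace is written to.
        env["HOROVOD_TIMELINE"] = timeline_path
    if nccl_debug:
        env["NCCL_DEBUG"] = "INFO"
    return env


# Start from an empty base so the example output is deterministic.
env = horovod_env(32, "/tmp/timeline.json", nccl_debug=True, base={})
print(env["HOROVOD_FUSION_THRESHOLD"])  # 33554432
```

The resulting dictionary could then be passed as the `env` argument of `subprocess.run` when invoking `horovodrun`.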
Python entrypoints:
- horovod.tensorflow.keras
- horovod.torch
- horovod.mxnet
Help:
$ horovodrun --help
$ horovodrun -np 4 -H localhost:4 python train.py
Examples/Usage
Load the desired module:
$ module load horovod-pytorch-py39-cuda11.2-gcc9/0.22.1
Basic training script pattern:
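import torch
import horovod.torch as hvd

hvd.init()  # initialize Horovod before any other hvd call
torch.cuda.set_device(hvd.local_rank())  # pin this process to one GPU on its node
model = MyModel().cuda()  # MyModel stands for your own torch.nn.Module subclass
# Wrap the optimizer so gradients are averaged across workers with allreduce.
optimizer = hvd.DistributedOptimizer(
    torch.optim.SGD(model.parameters(), lr=0.01),
    named_parameters=model.named_parameters()
)
# Start every worker from rank 0's initial weights.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)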
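The pattern above wraps the optimizer, but it leaves two things to the script: giving each process a different slice of the data, and deciding whether to adjust the learning rate for the larger effective batch. A minimal sketch of both in plain Python follows. The helper names are hypothetical; in a real script `rank` and `size` would come from `hvd.rank()` and `hvd.size()`, and sharding is usually delegated to `torch.utils.data.distributed.DistributedSampler`. Linear learning-rate scaling is a common heuristic, not a requirement.

```python
from typing import List


def shard_indices(num_samples: int, rank: int, size: int) -> List[int]:
    """Indices of the samples that worker `rank` (out of `size`) processes.
    A strided split keeps the shards disjoint while covering the dataset."""
    if not 0 <= rank < size:
        raise ValueError("rank must satisfy 0 <= rank < size")
    return list(range(rank, num_samples, size))


def scaled_lr(base_lr: float, size: int) -> float:
    """Linear scaling heuristic: multiply the single-worker rate by the
    number of workers, since the effective batch grows by that factor."""
    return base_lr * size


print(shard_indices(10, rank=1, size=4))  # [1, 5, 9]
print(scaled_lr(0.01, 4))
```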
Run across 4 GPUs:
$ horovodrun -np 4 -H localhost:4 python train.py
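For more than one node, `-H` takes a comma-separated list of `host:slots` pairs and `-np` is the total number of processes. The sketch below builds such a command line from a mapping of hosts to GPU counts. The host names `node01` and `node02` are placeholders, and `horovodrun_command` is a hypothetical helper; how hosts are actually assigned on this cluster is not covered here.

```python
from typing import Dict, List


def horovodrun_command(hosts: Dict[str, int], script: str) -> List[str]:
    """Build a horovodrun argument list for the given hosts and slot counts."""
    if not hosts:
        raise ValueError("at least one host is required")
    # -H expects "host1:slots,host2:slots,..."
    host_arg = ",".join(f"{host}:{slots}" for host, slots in hosts.items())
    # -np is the total process count across all hosts.
    total = sum(hosts.values())
    return ["horovodrun", "-np", str(total), "-H", host_arg, "python", script]


cmd = horovodrun_command({"node01": 4, "node02": 4}, "train.py")
print(" ".join(cmd))
# horovodrun -np 8 -H node01:4,node02:4 python train.py
```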
Unload the module:
$ module unload horovod-pytorch-py39-cuda11.2-gcc9/0.22.1
Installation
Source code is obtained from Horovod