horovod

Version: 0.22.1

Category: ai

Cluster: Loki

Author / Distributor: https://github.com/horovod/horovod

Description

Horovod is a distributed deep learning training framework that makes it easy to scale training across multiple GPUs and nodes using TensorFlow, Keras, PyTorch, and MXNet.

Built on top of MPI and NCCL, Horovod simplifies multi-GPU and multi-node workflows by abstracting the complexity of communication backends and synchronization strategies.

Version 0.22.1 includes:

  • AllReduce-based data parallelism

  • Support for PyTorch, TensorFlow 2/Keras, and MXNet backends

  • GPU-aware training via NCCL

  • Tensor Fusion for communication efficiency

  • Horovod timeline for performance profiling
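The AllReduce pattern behind this data parallelism can be sketched in plain Python. The following is a toy ring-allreduce simulation for illustration only (function and variable names are ours, not Horovod's API); real Horovod delegates the exchange to NCCL or MPI:

```python
def ring_allreduce(worker_data):
    """Toy simulation of a ring allreduce over plain Python lists.

    Each worker's vector is split into n chunks. Chunks travel around
    the ring twice: n-1 reduce-scatter steps sum them, then n-1
    allgather steps distribute the sums. Horovod averages gradients by
    default, so the sums are divided by the worker count at the end.
    """
    n = len(worker_data)
    chunk = len(worker_data[0]) // n          # assume length divisible by n
    data = [list(v) for v in worker_data]     # each worker's local buffer

    # reduce-scatter: after n-1 steps, worker i holds the fully
    # summed chunk (i + 1) % n
    for s in range(n - 1):
        for i in range(n):
            c = (i - s) % n                   # chunk worker i sends this step
            lo, hi = c * chunk, (c + 1) * chunk
            dst = data[(i + 1) % n]           # ring neighbour receives and adds
            for k in range(lo, hi):
                dst[k] += data[i][k]

    # allgather: circulate the reduced chunks until every worker has all
    for s in range(n - 1):
        for i in range(n):
            c = (i + 1 - s) % n
            lo, hi = c * chunk, (c + 1) * chunk
            data[(i + 1) % n][lo:hi] = data[i][lo:hi]

    return [[x / n for x in v] for v in data]  # average, as Horovod does


# four "workers", each holding an 8-element gradient vector
result = ring_allreduce([[float(i)] * 8 for i in range(4)])
# every worker ends up with the same averaged vector
```

The two-phase ring layout is why allreduce bandwidth cost stays roughly constant as workers are added: each worker only ever exchanges 1/n-sized chunks with its immediate neighbour.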

Available Module Variants

Module Name                                    Backend       CUDA   Python   GCC
horovod-pytorch-py37-cuda10.2-gcc8/0.22.1      PyTorch       10.2   3.7      8
horovod-pytorch-py37-cuda11.2-gcc8/0.22.1      PyTorch       11.2   3.7      8
horovod-pytorch-py39-cuda11.2-gcc9/0.22.1      PyTorch       11.2   3.9      9
horovod-tensorflow2-py37-cuda10.2-gcc8/0.22.1  TensorFlow 2  10.2   3.7      8
horovod-tensorflow2-py37-cuda11.2-gcc8/0.22.1  TensorFlow 2  11.2   3.7      8
horovod-tensorflow2-py39-cuda11.2-gcc9/0.22.1  TensorFlow 2  11.2   3.9      9
horovod-mxnet-py37-cuda10.2-gcc8/0.22.1        MXNet         10.2   3.7      8

Documentation

CLI:
  horovodrun -np <num_procs> -H <hostlist> <script.py>

Common environment variables:
  HOROVOD_FUSION_THRESHOLD   (tensor-fusion buffer size, in bytes)
  HOROVOD_TIMELINE           (output path for the timeline trace file)
  NCCL_DEBUG=INFO            (enable verbose NCCL logging)
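A typical way to combine these for a profiled run (the trace path and the 64 MB fusion threshold below are illustrative choices, not requirements):

```shell
$ export HOROVOD_TIMELINE=/tmp/horovod_timeline.json   # write the timeline trace here
$ export HOROVOD_FUSION_THRESHOLD=67108864             # fuse small tensors into 64 MB buffers
$ export NCCL_DEBUG=INFO                               # verbose NCCL logging
$ horovodrun -np 4 -H localhost:4 python train.py
```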

Python entrypoints:
  horovod.tensorflow.keras
  horovod.torch
  horovod.mxnet

Help:
  $ horovodrun --help
  $ horovodrun -np 4 -H localhost:4 python train.py

Examples/Usage

  • Load the desired module:

$ module load horovod-pytorch-py39-cuda11.2-gcc9/0.22.1
  • Basic training script pattern:

import torch
import horovod.torch as hvd

hvd.init()                                # initialize Horovod
torch.cuda.set_device(hvd.local_rank())   # pin one GPU per process
model = MyModel().cuda()
optimizer = hvd.DistributedOptimizer(
    torch.optim.SGD(model.parameters(), lr=0.01),
    named_parameters=model.named_parameters()
)
hvd.broadcast_parameters(model.state_dict(), root_rank=0)  # sync initial weights from rank 0
  • Run across 4 GPUs:

$ horovodrun -np 4 -H localhost:4 python train.py
  • Unload the module:

$ module unload horovod-pytorch-py39-cuda11.2-gcc9/0.22.1
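The same launch pattern scales to multiple nodes via the `-H` hostlist; the hostnames below are placeholders for illustration:

```shell
$ horovodrun -np 8 -H node01:4,node02:4 python train.py
```

Each `host:slots` entry gives the number of processes (one per GPU) to start on that node, and `-np` is the total across all nodes.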

Installation

Source code is obtained from the Horovod GitHub repository (https://github.com/horovod/horovod).