nccl2
- Version: 2.7.8, 2.14.3
- Category: ai
- Cluster: Loki
Description
NCCL (NVIDIA Collective Communications Library) is a high-performance, multi-GPU communication library optimized for NVIDIA GPUs. It provides primitives for broadcast, all-reduce, reduce, all-gather, reduce-scatter, and more — tailored for deep learning frameworks and HPC workloads.
The installed builds support:
- CUDA 10.2 and CUDA 11.2
- Volta, Turing, and Ampere GPU architectures
- Fast collective operations using NVLink, PCIe, and NVIDIA networking fabrics
NCCL is often used with frameworks like PyTorch, TensorFlow, and MXNet to enable efficient multi-GPU training.
Documentation
The NCCL package does not include a standalone CLI; performance can be checked with NVIDIA's separately distributed nccl-tests benchmark suite, built against this library. Most interactions are through frameworks or custom programs that link against the libnccl.so shared library.
Common API usage includes:
- ncclCommInitAll
- ncclAllReduce
- ncclBroadcast
- ncclReduce
- ncclAllGather
- ncclReduceScatter
Developer documentation: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/
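As a minimal sketch of how these calls fit together (the buffer size, error-checking macros, and output message are illustrative, not part of the library), the following single-process program follows the single-process, multiple-devices pattern described in the NCCL user guide: it creates one communicator per visible GPU with ncclCommInitAll and sums a float buffer across all of them with ncclAllReduce.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nccl.h>

/* Illustrative error-checking helpers (not part of the NCCL API). */
#define CHECK_CUDA(cmd) do { cudaError_t e = (cmd); if (e != cudaSuccess) { \
    fprintf(stderr, "CUDA error %s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e)); exit(1); } } while (0)
#define CHECK_NCCL(cmd) do { ncclResult_t r = (cmd); if (r != ncclSuccess) { \
    fprintf(stderr, "NCCL error %s:%d: %s\n", __FILE__, __LINE__, ncclGetErrorString(r)); exit(1); } } while (0)

int main(void) {
    int ndev = 0;
    CHECK_CUDA(cudaGetDeviceCount(&ndev));

    const size_t count = 1 << 20;  /* elements per GPU; arbitrary for this sketch */
    ncclComm_t   *comms   = (ncclComm_t *)  malloc(ndev * sizeof(ncclComm_t));
    float       **sendbuf = (float **)      malloc(ndev * sizeof(float *));
    float       **recvbuf = (float **)      malloc(ndev * sizeof(float *));
    cudaStream_t *streams = (cudaStream_t *)malloc(ndev * sizeof(cudaStream_t));

    /* Allocate a send/recv buffer pair and a stream on each GPU. */
    for (int i = 0; i < ndev; ++i) {
        CHECK_CUDA(cudaSetDevice(i));
        CHECK_CUDA(cudaMalloc((void **)&sendbuf[i], count * sizeof(float)));
        CHECK_CUDA(cudaMalloc((void **)&recvbuf[i], count * sizeof(float)));
        CHECK_CUDA(cudaMemset(sendbuf[i], 0, count * sizeof(float)));  /* real code would fill this with data */
        CHECK_CUDA(cudaStreamCreate(&streams[i]));
    }

    /* One communicator per visible GPU, all owned by this single process. */
    CHECK_NCCL(ncclCommInitAll(comms, ndev, NULL));

    /* Sum the send buffers across all GPUs; every recv buffer gets the result. */
    CHECK_NCCL(ncclGroupStart());
    for (int i = 0; i < ndev; ++i)
        CHECK_NCCL(ncclAllReduce(sendbuf[i], recvbuf[i], count,
                                 ncclFloat, ncclSum, comms[i], streams[i]));
    CHECK_NCCL(ncclGroupEnd());

    /* Wait for the collectives to finish, then release resources. */
    for (int i = 0; i < ndev; ++i) {
        CHECK_CUDA(cudaSetDevice(i));
        CHECK_CUDA(cudaStreamSynchronize(streams[i]));
        CHECK_CUDA(cudaFree(sendbuf[i]));
        CHECK_CUDA(cudaFree(recvbuf[i]));
        CHECK_NCCL(ncclCommDestroy(comms[i]));
    }
    printf("All-reduce across %d GPU(s) completed.\n", ndev);
    return 0;
}
ncclGroupStart/ncclGroupEnd are needed here because one thread issues collectives on several communicators; in a one-GPU-per-process setup they can be omitted.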
Examples/Usage
Load the appropriate module for your CUDA version:
# For CUDA 10.2 (GCC 8 toolchain)
$ module load nccl2-cuda10.2-gcc8/2.11.4
# For CUDA 11.2 (GCC 8 toolchain)
$ module load nccl2-cuda11.2-gcc8/2.11.4
# For CUDA 11.2 (GCC 9 toolchain)
$ module load nccl2-cuda11.2-gcc9/2.14.3
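After loading, module list confirms which NCCL build is active in the current environment:
$ module list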
Verify shared library availability:
$ ls $EBROOTNCCL2/lib/libnccl.so*
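The exact NCCL version provided by a build can also be read from the header, which defines the NCCL_MAJOR, NCCL_MINOR, and NCCL_PATCH macros:
$ grep -E 'define NCCL_(MAJOR|MINOR|PATCH)' $EBROOTNCCL2/include/nccl.h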
Build with NCCL (example Makefile snippet):
CXX = nvcc
CXXFLAGS += -I$(EBROOTNCCL2)/include
LDFLAGS += -L$(EBROOTNCCL2)/lib -lnccl
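If the all-reduce sketch from the Documentation section is saved as nccl_allreduce.cu (a hypothetical file name), the same include and library paths can be used to compile it directly with nvcc:
# nccl_allreduce.cu is the hypothetical file holding the sketch shown above
$ nvcc -I$EBROOTNCCL2/include nccl_allreduce.cu -o nccl_allreduce -L$EBROOTNCCL2/lib -lnccl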
Use with PyTorch (example):
torch.distributed.init_process_group(backend='nccl')
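With recent PyTorch releases, multi-GPU jobs are typically launched with torchrun, which sets the RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT variables read by the default env:// initialization; NCCL_DEBUG=INFO asks NCCL to print initialization and topology details. The script name and GPU count below are placeholders:
# train.py is a hypothetical script that calls init_process_group as above; 4 = GPUs per node
$ NCCL_DEBUG=INFO torchrun --nproc_per_node=4 train.py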
Unload the module when finished:
$ module unload nccl2-cuda10.2-gcc8/2.11.4
$ module unload nccl2-cuda11.2-gcc8/2.11.4
$ module unload nccl2-cuda11.2-gcc9/2.14.3
Installation
Source obtained from https://developer.nvidia.com/nccl/nccl-download