nccl2
- Version: 2.7.8, 2.14.3
- Category: ai
- Cluster: Loki
Description
NCCL (NVIDIA Collective Communications Library) is a high-performance, multi-GPU communication library optimized for NVIDIA GPUs. It provides primitives for broadcast, all-reduce, reduce, all-gather, reduce-scatter, and more — tailored for deep learning frameworks and HPC workloads.
The installed builds support:
- CUDA 10.2 and CUDA 11.2
- Volta, Turing, and Ampere GPU architectures
- Fast collective operations using NVLink, PCIe, and NVIDIA networking fabrics
NCCL is often used with frameworks like PyTorch, TensorFlow, and MXNet to enable efficient multi-GPU training.
Documentation
The NCCL package does not include a standalone CLI; performance can be checked with NVIDIA's separately distributed nccl-tests benchmark suite, built against this library. Most interactions are through frameworks or custom programs that link against the libnccl.so shared library.
Common API usage includes:
- ncclCommInitAll
- ncclAllReduce
- ncclBroadcast
- ncclReduce
- ncclAllGather
- ncclReduceScatter
Developer documentation: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/
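As a minimal sketch of how these calls fit together (the buffer size, error-checking macros, and output message are illustrative, not part of the library), the following single-process program follows the single-process, multiple-devices pattern described in the NCCL user guide: it creates one communicator per visible GPU with ncclCommInitAll and sums a float buffer across all of them with ncclAllReduce.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nccl.h>

/* Illustrative error-checking helpers (not part of the NCCL API). */
#define CHECK_CUDA(cmd) do { cudaError_t e = (cmd); if (e != cudaSuccess) { \
    fprintf(stderr, "CUDA error %s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e)); exit(1); } } while (0)
#define CHECK_NCCL(cmd) do { ncclResult_t r = (cmd); if (r != ncclSuccess) { \
    fprintf(stderr, "NCCL error %s:%d: %s\n", __FILE__, __LINE__, ncclGetErrorString(r)); exit(1); } } while (0)

int main(void) {
    int ndev = 0;
    CHECK_CUDA(cudaGetDeviceCount(&ndev));

    const size_t count = 1 << 20;  /* elements per GPU; arbitrary for this sketch */
    ncclComm_t   *comms   = (ncclComm_t *)  malloc(ndev * sizeof(ncclComm_t));
    float       **sendbuf = (float **)      malloc(ndev * sizeof(float *));
    float       **recvbuf = (float **)      malloc(ndev * sizeof(float *));
    cudaStream_t *streams = (cudaStream_t *)malloc(ndev * sizeof(cudaStream_t));

    /* Allocate a send/recv buffer pair and a stream on each GPU. */
    for (int i = 0; i < ndev; ++i) {
        CHECK_CUDA(cudaSetDevice(i));
        CHECK_CUDA(cudaMalloc((void **)&sendbuf[i], count * sizeof(float)));
        CHECK_CUDA(cudaMalloc((void **)&recvbuf[i], count * sizeof(float)));
        CHECK_CUDA(cudaMemset(sendbuf[i], 0, count * sizeof(float)));  /* real code would fill this with data */
        CHECK_CUDA(cudaStreamCreate(&streams[i]));
    }

    /* One communicator per visible GPU, all owned by this single process. */
    CHECK_NCCL(ncclCommInitAll(comms, ndev, NULL));

    /* Sum the send buffers across all GPUs; every recv buffer gets the result. */
    CHECK_NCCL(ncclGroupStart());
    for (int i = 0; i < ndev; ++i)
        CHECK_NCCL(ncclAllReduce(sendbuf[i], recvbuf[i], count,
                                 ncclFloat, ncclSum, comms[i], streams[i]));
    CHECK_NCCL(ncclGroupEnd());

    /* Wait for the collectives to finish, then release resources. */
    for (int i = 0; i < ndev; ++i) {
        CHECK_CUDA(cudaSetDevice(i));
        CHECK_CUDA(cudaStreamSynchronize(streams[i]));
        CHECK_CUDA(cudaFree(sendbuf[i]));
        CHECK_CUDA(cudaFree(recvbuf[i]));
        CHECK_NCCL(ncclCommDestroy(comms[i]));
    }
    printf("All-reduce across %d GPU(s) completed.\n", ndev);
    return 0;
}
ncclGroupStart/ncclGroupEnd are needed here because one thread issues collectives on several communicators; in a one-GPU-per-process setup they can be omitted.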
Examples/Usage
Load the appropriate module for your CUDA version:
# For CUDA 10.2 (GCC 8 toolchain)
$ module load nccl2-cuda10.2-gcc8/2.11.4
# For CUDA 11.2 (GCC 8 toolchain)
$ module load nccl2-cuda11.2-gcc8/2.11.4
# For CUDA 11.2 (GCC 9 toolchain)
$ module load nccl2-cuda11.2-gcc9/2.14.3
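After loading, module list confirms which NCCL build is active in the current environment:
$ module list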
Verify shared library availability:
$ ls $EBROOTNCCL2/lib/libnccl.so*
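The exact NCCL version provided by a build can also be read from the header, which defines the NCCL_MAJOR, NCCL_MINOR, and NCCL_PATCH macros:
$ grep -E 'define NCCL_(MAJOR|MINOR|PATCH)' $EBROOTNCCL2/include/nccl.h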
Build with NCCL (example Makefile snippet):
CXX = nvcc
CXXFLAGS += -I$(EBROOTNCCL2)/include
LDFLAGS += -L$(EBROOTNCCL2)/lib -lnccl
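If the all-reduce sketch from the Documentation section is saved as nccl_allreduce.cu (a hypothetical file name), the same include and library paths can be used to compile it directly with nvcc:
# nccl_allreduce.cu is the hypothetical file holding the sketch shown above
$ nvcc -I$EBROOTNCCL2/include nccl_allreduce.cu -o nccl_allreduce -L$EBROOTNCCL2/lib -lnccl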
Use with PyTorch (example):
torch.distributed.init_process_group(backend='nccl')
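With recent PyTorch releases, multi-GPU jobs are typically launched with torchrun, which sets the RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT variables read by the default env:// initialization; NCCL_DEBUG=INFO asks NCCL to print initialization and topology details. The script name and GPU count below are placeholders:
# train.py is a hypothetical script that calls init_process_group as above; 4 = GPUs per node
$ NCCL_DEBUG=INFO torchrun --nproc_per_node=4 train.py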
Unload the module when finished:
$ module unload nccl2-cuda10.2-gcc8/2.11.4
$ module unload nccl2-cuda11.2-gcc8/2.11.4
$ module unload nccl2-cuda11.2-gcc9/2.14.3
Installation
Source obtained from https://developer.nvidia.com/nccl/nccl-download