nccl2

Version:

2.7.8, 2.11.4, 2.14.3

Category:

ai

Cluster:

Loki

Author / Distributor:

NVIDIA (https://developer.nvidia.com/nccl)

Description

NCCL (NVIDIA Collective Communications Library) is a high-performance, multi-GPU communication library optimized for NVIDIA GPUs. It provides primitives for broadcast, all-reduce, reduce, all-gather, reduce-scatter, and more — tailored for deep learning frameworks and HPC workloads.

This version supports:

  • CUDA 10.2 and CUDA 11.2

  • Volta, Turing, and Ampere GPU architectures

  • Fast collective operations using NVLink, PCIe, and NVIDIA networking fabrics

NCCL is often used with frameworks like PyTorch, TensorFlow, and MXNet to enable efficient multi-GPU training.

Documentation

The NCCL package does not include a command-line interface; most interactions go through
frameworks or custom programs that link against the libnccl.so shared library. For standalone
benchmarking of the collectives, NVIDIA's separate nccl-tests suite
(https://github.com/NVIDIA/nccl-tests) can be built against this installation.

Common API usage includes:

- ncclCommInitAll
- ncclAllReduce
- ncclBroadcast
- ncclReduce
- ncclAllGather
- ncclReduceScatter
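
A minimal single-process sketch tying these calls together is shown below (not taken from the NVIDIA documentation; the file name and buffer sizes are illustrative, and error checking is omitted for brevity):

/* allreduce_demo.c -- sum-all-reduce a buffer across every GPU visible to one process */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nccl.h>

int main(void) {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    const size_t count = 1024;

    ncclComm_t   *comms   = (ncclComm_t *)  malloc(ndev * sizeof(ncclComm_t));
    cudaStream_t *streams = (cudaStream_t *)malloc(ndev * sizeof(cudaStream_t));
    float **sendbuf = (float **)malloc(ndev * sizeof(float *));
    float **recvbuf = (float **)malloc(ndev * sizeof(float *));

    /* One communicator per local GPU, created in a single call */
    ncclCommInitAll(comms, ndev, NULL);

    for (int i = 0; i < ndev; i++) {
        cudaSetDevice(i);
        cudaStreamCreate(&streams[i]);
        cudaMalloc((void **)&sendbuf[i], count * sizeof(float));
        cudaMalloc((void **)&recvbuf[i], count * sizeof(float));
    }

    /* Group the per-GPU calls so NCCL launches them as one collective */
    ncclGroupStart();
    for (int i = 0; i < ndev; i++)
        ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < ndev; i++) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaFree(sendbuf[i]);
        cudaFree(recvbuf[i]);
        cudaStreamDestroy(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    free(comms); free(streams); free(sendbuf); free(recvbuf);
    printf("all-reduce completed on %d GPU(s)\n", ndev);
    return 0;
}

The same communicator and stream arguments are shared by ncclBroadcast, ncclReduce, ncclAllGather, and ncclReduceScatter.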

Developer documentation: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/

Examples/Usage

  • Load the appropriate module for your CUDA version:

# For CUDA 10.2 (GCC 8 toolchain)
$ module load nccl2-cuda10.2-gcc8/2.11.4

# For CUDA 11.2 (GCC 8 toolchain)
$ module load nccl2-cuda11.2-gcc8/2.11.4

# For CUDA 11.2 (GCC 9 toolchain)
$ module load nccl2-cuda11.2-gcc9/2.14.3
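  • Optionally confirm the module set the EasyBuild-style environment variable used below (a quick sanity check; the reported path is site-specific):

$ module list
$ echo $EBROOTNCCL2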
  • Verify shared library availability:

$ ls $EBROOTNCCL2/lib/libnccl.so*
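  • Optionally query the library version at runtime (a minimal sketch using ncclGetVersion; the file name nccl_version.c is illustrative, and it can be compiled with the include/link flags shown below):

/* nccl_version.c -- print the version code reported by the loaded libnccl */
#include <stdio.h>
#include <nccl.h>

int main(void) {
    int version = 0;
    if (ncclGetVersion(&version) == ncclSuccess)
        printf("NCCL version code: %d\n", version);  /* e.g. 2.14.3 reports 21403 */
    return 0;
}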
  • Build with NCCL (example Makefile snippet):

# Point the compiler and linker at the NCCL install provided by the module
CXX = nvcc
CXXFLAGS += -I$(EBROOTNCCL2)/include
LDFLAGS += -L$(EBROOTNCCL2)/lib -lnccl
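  • Or compile a single source file directly (a hedged example; allreduce_demo.c refers to the illustrative sketch in the Documentation section above):

$ nvcc -I$EBROOTNCCL2/include -L$EBROOTNCCL2/lib allreduce_demo.c -lnccl -o allreduce_demo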
  • Use with PyTorch (example):

import torch.distributed
torch.distributed.init_process_group(backend='nccl')  # rank/world size come from the launcher's env vars
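  • Launch the training script with one process per GPU so the NCCL backend can rendezvous (a hedged example; train.py and the GPU count are placeholders, and torchrun requires a recent PyTorch):

$ torchrun --nproc_per_node=4 train.py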
  • Unload module:

$ module unload nccl2-cuda10.2-gcc8/2.11.4
$ module unload nccl2-cuda11.2-gcc8/2.11.4

Installation

Source obtained from https://developer.nvidia.com/nccl/nccl-download
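
The download page provides prebuilt packages; when NCCL is instead built from the public source tree (https://github.com/NVIDIA/nccl), the upstream Makefile is typically driven as follows (a sketch; the CUDA path is site-specific):

$ cd nccl
$ make -j src.build CUDA_HOME=/usr/local/cuda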