Distributed Training

Multi-GPU Training Setup: From Single Node to Cluster

Step-by-step guide to setting up multi-GPU training infrastructure. Covers hardware selection, networking, NCCL configuration, and troubleshooting for distributed PyTorch training.

Flash Attention Team · January 8, 2026 · 6 min read
multi-GPU · NCCL · distributed training · training cluster · InfiniBand · PyTorch

Setting up multi-GPU training infrastructure requires careful attention to hardware, networking, and software configuration. This guide walks through the complete setup process from single node to multi-node clusters.

Hardware Requirements

GPU Selection

GPU        | VRAM  | Interconnect | Best For
RTX 4090   | 24 GB | PCIe 4.0     | Budget training
A100 40GB  | 40 GB | NVLink 3.0   | Production training
A100 80GB  | 80 GB | NVLink 3.0   | Large models
H100 80GB  | 80 GB | NVLink 4.0   | Maximum performance

Interconnect Comparison

Interconnect   | Bandwidth | Latency | Use Case
PCIe 4.0       | 64 GB/s   | ~1-2 μs | Consumer GPUs
NVLink 3.0     | 600 GB/s  | ~0.7 μs | A100
NVLink 4.0     | 900 GB/s  | ~0.5 μs | H100
InfiniBand HDR | 200 Gb/s  | ~1 μs   | Multi-node
InfiniBand NDR | 400 Gb/s  | ~0.5 μs | High-performance

Single Node Configuration

For a single training node:

# Check GPU topology
nvidia-smi topo -m

# Example output (8x A100 with NVSwitch):
#         GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7
# GPU0     X    NV12  NV12  NV12  NV12  NV12  NV12  NV12
# GPU1    NV12   X    NV12  NV12  NV12  NV12  NV12  NV12
# ...
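
From Python, you can also confirm that peer-to-peer access is available between GPU pairs. A minimal sketch (the file name check_p2p.py is our own choice):

# check_p2p.py: verify P2P access between every GPU pair
import torch

num_gpus = torch.cuda.device_count()
for i in range(num_gpus):
    for j in range(num_gpus):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: P2P {'yes' if ok else 'no'}")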

Software Setup

CUDA and Drivers

# Check CUDA version
nvcc --version

# Check driver version
nvidia-smi

# Recommended versions (as of 2024)
# CUDA: 12.1+
# Driver: 535+

PyTorch Installation

# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Verify CUDA availability
python -c "import torch; print(torch.cuda.is_available())"
python -c "import torch; print(torch.cuda.device_count())"

NCCL Configuration

# Check NCCL version
python -c "import torch; print(torch.cuda.nccl.version())"

# Key environment variables
export NCCL_DEBUG=INFO              # Debug output
export NCCL_IB_DISABLE=0            # Enable InfiniBand
export NCCL_NET_GDR_LEVEL=5         # GPU Direct RDMA
export NCCL_SOCKET_IFNAME=eth0      # Network interface
export NCCL_P2P_LEVEL=NVL           # NVLink for P2P

Single Node Training

Basic Setup with torchrun

# train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Initialize process group
    dist.init_process_group(backend="nccl")

    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ["LOCAL_RANK"])

    # Set device
    torch.cuda.set_device(local_rank)

    # Create model
    model = MyModel().cuda()
    model = DDP(model, device_ids=[local_rank])

    # Training loop
    ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launch with:

# 4 GPUs on single node
torchrun --nproc_per_node=4 train.py

# All available GPUs
torchrun --nproc_per_node=gpu train.py
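
The training loop above is elided; with DDP, each rank also needs to see a distinct shard of the data. A minimal sketch using DistributedSampler (train_dataset, the batch size, and num_epochs are placeholders):

# Data sharding for DDP; train_dataset and num_epochs are placeholders
from torch.utils.data import DataLoader, DistributedSampler

sampler = DistributedSampler(train_dataset, shuffle=True)
dataloader = DataLoader(train_dataset, batch_size=32, sampler=sampler,
                        num_workers=4, pin_memory=True)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reshuffles differently each epoch
    for batch in dataloader:
        ...  # forward/backward/step as usual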

Using Accelerate

# Configure accelerate
accelerate config

# Launch training
accelerate launch train.py
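
For reference, the same DDP setup reduces to a few lines with the Accelerate API. A minimal sketch (model, optimizer, and dataloader are placeholders):

# Accelerate sketch; model, optimizer, and dataloader are placeholders
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()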

Multi-Node Setup

Network Configuration

# On each node, check network
ip addr show

# Test connectivity between nodes
ping node2
ib_write_bw --all  # InfiniBand bandwidth test

SSH Setup

# Generate SSH key
ssh-keygen -t rsa

# Copy to all nodes
ssh-copy-id user@node1
ssh-copy-id user@node2

# Verify passwordless SSH
ssh node1 hostname
ssh node2 hostname

Hostfile Configuration

# hostfile
node1 slots=8
node2 slots=8
node3 slots=8
node4 slots=8

Launch Multi-Node Training

# Using torchrun (recommended)
# On node 0 (master):
torchrun \
    --nproc_per_node=8 \
    --nnodes=4 \
    --node_rank=0 \
    --master_addr=node1 \
    --master_port=29500 \
    train.py

# On node 1:
torchrun \
    --nproc_per_node=8 \
    --nnodes=4 \
    --node_rank=1 \
    --master_addr=node1 \
    --master_port=29500 \
    train.py
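
Typing the same command on every node by hand does not scale; with passwordless SSH already configured, a small launcher script can start all ranks. A sketch (the node list, project path, and script name are assumptions about your environment):

# launch_all.sh: start torchrun on every node over SSH (sketch)
NODES=(node1 node2 node3 node4)
for i in "${!NODES[@]}"; do
    ssh "${NODES[$i]}" "cd /path/to/project && \
        torchrun --nproc_per_node=8 --nnodes=4 --node_rank=$i \
                 --master_addr=node1 --master_port=29500 train.py" &
done
wait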

Using DeepSpeed Launcher

# With hostfile
deepspeed --hostfile=hostfile train.py --deepspeed ds_config.json

# Or with explicit hosts
deepspeed --num_gpus=8 --num_nodes=4 \
    --hostfile=hostfile \
    train.py --deepspeed ds_config.json
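
The ds_config.json referenced above is not shown in this guide; a minimal sketch might look like the following (the ZeRO stage, precision, and batch-size values are illustrative choices, not recommendations):

{
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,
    "bf16": { "enabled": true },
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": true
    }
}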

Monitoring and Profiling

GPU Monitoring

# Real-time GPU stats
watch -n 1 nvidia-smi

# Detailed memory usage
nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv -l 1

# DCGM for cluster monitoring
dcgmi dmon -e 1001,1002,1003,1004,1005
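
For logging GPU stats from inside a training job rather than a separate terminal, the NVML Python bindings can be polled. A sketch (assumes the nvidia-ml-py package is installed):

# Poll per-GPU memory and utilization via NVML (requires nvidia-ml-py)
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    print(f"GPU {i}: {mem.used / 1e9:.1f}/{mem.total / 1e9:.1f} GB, {util.gpu}% util")
pynvml.nvmlShutdown()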

PyTorch Profiler

import torch
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./logs'),
    record_shapes=True,
    profile_memory=True,
) as prof:
    for step, batch in enumerate(dataloader):
        train_step(batch)
        prof.step()
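
After the run, the trace can be opened in TensorBoard, or a quick summary printed directly:

# Print the ops with the highest total CUDA time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))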

NCCL Debugging

# Enable NCCL debug output
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL

# Common issues:
# - "NCCL WARN Connect to ... failed" → Network/firewall issue
# - "NCCL timeout" → Increase NCCL_TIMEOUT
# - "NCCL WARN Call to ibv_..." → InfiniBand configuration

Performance Optimization

Communication Overlap

# FSDP with prefetch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, BackwardPrefetch

model = FSDP(
    model,
    forward_prefetch=True,
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
)

# DeepSpeed overlap
{
    "zero_optimization": {
        "overlap_comm": true
    }
}

Gradient Accumulation

# Accumulate gradients over several micro-batches before each optimizer step
gradient_accumulation_steps = 8

for i, batch in enumerate(dataloader):
    loss = model(batch) / gradient_accumulation_steps
    loss.backward()

    if (i + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
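
Note that plain gradient accumulation under DDP still all-reduces gradients on every backward() call; to actually skip communication on the intermediate steps, wrap them in DDP's no_sync() context. A sketch:

import contextlib

for i, batch in enumerate(dataloader):
    sync_step = (i + 1) % gradient_accumulation_steps == 0
    # Defer the gradient all-reduce until the final micro-batch of the window
    ctx = contextlib.nullcontext() if sync_step else model.no_sync()
    with ctx:
        loss = model(batch) / gradient_accumulation_steps
        loss.backward()
    if sync_step:
        optimizer.step()
        optimizer.zero_grad()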

Optimal Batch Size

# Find maximum batch size
def find_max_batch_size(model, start=1, max_batch=128):
    batch_size = start
    while batch_size <= max_batch:
        try:
            dummy_input = torch.randn(batch_size, ...).cuda()
            loss = model(dummy_input).sum()
            loss.backward()
            torch.cuda.synchronize()
            torch.cuda.empty_cache()
            batch_size *= 2
        except RuntimeError:  # typically an out-of-memory error
            torch.cuda.empty_cache()
            return batch_size // 2
    return batch_size // 2  # largest size that fit within max_batch

Troubleshooting

Common Issues

1. Processes hang at initialization

# Check if all nodes can reach master
nc -zv master_node 29500

# Ensure same PyTorch/NCCL versions
python -c "import torch; print(torch.__version__)"

2. OOM on some GPUs

# Check memory distribution
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.memory_allocated(i) / 1e9:.2f} GB")

3. Slow multi-node training

# Check network bandwidth
iperf3 -c other_node

# Ensure InfiniBand is used
ibstat

4. NCCL errors

# Increase the process-group timeout in the training script:
# dist.init_process_group("nccl", timeout=datetime.timedelta(seconds=1800))

# Try different algorithms
export NCCL_ALGO=Ring  # or Tree

Cloud Setup

AWS

# Use p4d.24xlarge (8x A100)
# Enable EFA (Elastic Fabric Adapter)
aws ec2 run-instances \
    --instance-type p4d.24xlarge \
    --placement GroupName=my-cluster

GCP

# Use a2-ultragpu-8g (8x A100)
gcloud compute instances create trainer \
    --machine-type=a2-ultragpu-8g \
    --accelerator=count=8,type=nvidia-a100-80gb

