Setting up multi-GPU training infrastructure requires careful attention to hardware, networking, and software configuration. This guide walks through the complete setup process from single node to multi-node clusters.
Hardware Requirements
GPU Selection
| GPU | VRAM | Interconnect | Best For |
|---|---|---|---|
| RTX 4090 | 24 GB | PCIe 4.0 | Budget training |
| A100 40GB | 40 GB | NVLink 3.0 | Production training |
| A100 80GB | 80 GB | NVLink 3.0 | Large models |
| H100 80GB | 80 GB | NVLink 4.0 | Maximum performance |
Interconnect Comparison
| Interconnect | Bandwidth | Latency | Use Case |
|---|---|---|---|
| PCIe 4.0 | 64 GB/s | ~1-2 μs | Consumer GPUs |
| NVLink 3.0 | 600 GB/s | ~0.7 μs | A100 |
| NVLink 4.0 | 900 GB/s | ~0.5 μs | H100 |
| InfiniBand HDR | 200 Gb/s | ~1 μs | Multi-node |
| InfiniBand NDR | 400 Gb/s | ~0.5 μs | High-performance |
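These bandwidth figures put a floor on gradient-synchronization time. As a rough, latency-free estimate for a ring all-reduce (each GPU moves about 2*(N-1)/N times the gradient size; the model size and bandwidths below are illustrative assumptions, not measurements):

def allreduce_seconds(grad_bytes, n_gpus, bus_gbytes_per_s):
    # Ring all-reduce: each GPU sends/receives ~2*(N-1)/N of the gradient buffer
    traffic = 2 * (n_gpus - 1) / n_gpus * grad_bytes
    return traffic / (bus_gbytes_per_s * 1e9)

# Example: 7B-parameter model, fp16 gradients (~14 GB), 8 GPUs
print(allreduce_seconds(14e9, 8, 600))  # NVLink 3.0 (600 GB/s): ~0.04 s per step
print(allreduce_seconds(14e9, 8, 25))   # HDR InfiniBand (200 Gb/s ~ 25 GB/s): ~1 s per step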
Single Node Configuration
For a single training node:
# Check GPU topology
nvidia-smi topo -m
# Example output (8x A100 with NVSwitch):
# GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
# GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12
# GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12
# ...
Software Setup
CUDA and Drivers
# Check CUDA version
nvcc --version
# Check driver version
nvidia-smi
# Recommended versions (as of 2024)
# CUDA: 12.1+
# Driver: 535+
PyTorch Installation
# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Verify CUDA availability
python -c "import torch; print(torch.cuda.is_available())"
python -c "import torch; print(torch.cuda.device_count())"
NCCL Configuration
# Check NCCL version
python -c "import torch; print(torch.cuda.nccl.version())"
# Key environment variables
export NCCL_DEBUG=INFO # Debug output
export NCCL_IB_DISABLE=0 # Enable InfiniBand
export NCCL_NET_GDR_LEVEL=5 # GPU Direct RDMA
export NCCL_SOCKET_IFNAME=eth0 # Network interface
export NCCL_P2P_LEVEL=NVL # NVLink for P2P
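NCCL reads these variables when it initializes, so export them before launching, or set them early in the training script. A minimal sketch of the in-script variant; the interface name eth0 is an assumption to adjust for your machine:

import os
import torch.distributed as dist

# Must be set before the process group creates its first NCCL communicator
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # replace with your NIC

dist.init_process_group(backend="nccl")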
Single Node Training
Basic Setup with torchrun
# train.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Initialize the process group (NCCL backend for GPU collectives)
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ["LOCAL_RANK"])

    # Bind this process to its GPU
    torch.cuda.set_device(local_rank)

    # Create the model and wrap it with DDP
    model = MyModel().cuda()  # MyModel: your model class
    model = DDP(model, device_ids=[local_rank])

    # Training loop
    ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
Launch with:
# 4 GPUs on single node
torchrun --nproc_per_node=4 train.py
# All available GPUs
torchrun --nproc_per_node=gpu train.py
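The script above elides data loading. With DDP each rank should see its own shard of the dataset, which is what torch.utils.data.DistributedSampler provides; a minimal sketch (the dataset, batch size, and epoch count are placeholders):

from torch.utils.data import DataLoader, DistributedSampler

sampler = DistributedSampler(train_dataset, shuffle=True)  # one shard per rank
dataloader = DataLoader(train_dataset, batch_size=32, sampler=sampler)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reshuffle consistently across ranks each epoch
    for batch in dataloader:
        ...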
Using Accelerate
# Configure accelerate
accelerate config
# Launch training
accelerate launch train.py
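A script launched this way uses the Accelerate API in place of the manual process-group calls shown earlier; a minimal sketch, assuming model, optimizer, and dataloader already exist:

from accelerate import Accelerator

accelerator = Accelerator()

# Wraps the model for the distributed setup chosen in `accelerate config`
# and moves everything to the right device
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()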
Multi-Node Setup
Network Configuration
# On each node, check network
ip addr show
# Test connectivity between nodes
ping node2
ib_write_bw --all # InfiniBand bandwidth test
SSH Setup
# Generate SSH key
ssh-keygen -t rsa
# Copy to all nodes
ssh-copy-id user@node1
ssh-copy-id user@node2
# Verify passwordless SSH
ssh node1 hostname
ssh node2 hostname
Hostfile Configuration
# hostfile
node1 slots=8
node2 slots=8
node3 slots=8
node4 slots=8
Launch Multi-Node Training
# Using torchrun (recommended)
# On node 0 (master):
torchrun \
--nproc_per_node=8 \
--nnodes=4 \
--node_rank=0 \
--master_addr=node1 \
--master_port=29500 \
train.py
# On node 1:
torchrun \
--nproc_per_node=8 \
--nnodes=4 \
--node_rank=1 \
--master_addr=node1 \
--master_port=29500 \
train.py
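Before committing to a long run, confirm that every rank can rendezvous and complete a collective. A small smoke test (check_dist.py is a hypothetical file name; launch it with the same torchrun arguments as above):

# check_dist.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Each rank contributes 1; the sum should equal the world size on every rank
x = torch.ones(1, device="cuda")
dist.all_reduce(x)
print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce -> {x.item()}")

dist.destroy_process_group()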
Using DeepSpeed Launcher
# With hostfile
deepspeed --hostfile=hostfile train.py --deepspeed ds_config.json
# Or pin the GPU and node counts explicitly (a hostfile is still required for multi-node)
deepspeed --num_gpus=8 --num_nodes=4 \
--hostfile=hostfile \
train.py --deepspeed ds_config.json
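Inside the script, the model is handed to DeepSpeed rather than to DDP. A minimal sketch of the engine setup, assuming model and dataloader already exist (here the config path is passed directly to deepspeed.initialize; scripts launched with --deepspeed often read it from the parsed command-line arguments instead):

import deepspeed

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)

for batch in dataloader:
    loss = model_engine(batch)
    model_engine.backward(loss)  # handles loss scaling and ZeRO partitioning
    model_engine.step()          # optimizer step + zero_grad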
Monitoring and Profiling
GPU Monitoring
# Real-time GPU stats
watch -n 1 nvidia-smi
# Detailed memory usage
nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv -l 1
# DCGM for cluster monitoring
dcgmi dmon -e 1001,1002,1003,1004,1005
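For monitoring from inside Python, the NVML bindings (the nvidia-ml-py / pynvml package, an extra dependency) expose the same counters nvidia-smi reports; a small sketch:

import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    print(f"GPU {i}: {mem.used / 1e9:.1f}/{mem.total / 1e9:.1f} GB, {util.gpu}% util")
pynvml.nvmlShutdown()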
PyTorch Profiler
import torch
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./logs'),
    record_shapes=True,
    profile_memory=True,
) as prof:
    for step, batch in enumerate(dataloader):
        train_step(batch)
        prof.step()  # advance the wait/warmup/active schedule every iteration
NCCL Debugging
# Enable NCCL debug output
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
# Common issues:
# - "NCCL WARN Connect to ... failed" → Network/firewall issue
# - "NCCL timeout" → Increase NCCL_TIMEOUT
# - "NCCL WARN Call to ibv_..." → InfiniBand configuration
Performance Optimization
Communication Overlap
# FSDP with prefetch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import BackwardPrefetch

model = FSDP(
    model,
    forward_prefetch=True,                            # prefetch the next all-gather during forward
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,  # overlap all-gathers with backward compute
)
# DeepSpeed overlap
{
  "zero_optimization": {
    "overlap_comm": true
  }
}
Gradient Accumulation
# Reduce all-reduce frequency: skip DDP's gradient sync on intermediate
# micro-batches with no_sync(), and synchronize only on the boundary step
from contextlib import nullcontext

gradient_accumulation_steps = 8

for i, batch in enumerate(dataloader):
    sync_step = (i + 1) % gradient_accumulation_steps == 0
    with nullcontext() if sync_step else model.no_sync():
        loss = model(batch) / gradient_accumulation_steps
        loss.backward()
    if sync_step:
        optimizer.step()
        optimizer.zero_grad()
Optimal Batch Size
# Find the maximum per-GPU batch size by doubling until OOM
def find_max_batch_size(model, start=1, max_batch=128):
    batch_size = start
    while batch_size <= max_batch:
        try:
            dummy_input = torch.randn(batch_size, ...).cuda()  # ...: your input shape
            loss = model(dummy_input).sum()
            loss.backward()
            torch.cuda.synchronize()
            batch_size *= 2
        except torch.cuda.OutOfMemoryError:
            return batch_size // 2  # last size that fit
        finally:
            model.zero_grad(set_to_none=True)
            torch.cuda.empty_cache()
    return batch_size // 2  # largest size actually tested
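The per-GPU batch size found this way combines with data parallelism and gradient accumulation into the effective global batch size, which is what learning-rate choices usually depend on; a quick arithmetic check with illustrative numbers:

per_gpu_batch_size = 16               # e.g. output of find_max_batch_size
world_size = 32                       # 4 nodes x 8 GPUs
gradient_accumulation_steps = 8

global_batch_size = per_gpu_batch_size * world_size * gradient_accumulation_steps
print(global_batch_size)              # 16 * 32 * 8 = 4096 samples per optimizer step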
Troubleshooting
Common Issues
1. Processes hang at initialization
# Check if all nodes can reach master
nc -zv master_node 29500
# Ensure same PyTorch/NCCL versions
python -c "import torch; print(torch.__version__)"
2. OOM on some GPUs
# Check memory distribution
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.memory_allocated(i) / 1e9:.2f} GB")
3. Slow multi-node training
# Check network bandwidth
iperf3 -c other_node
# Ensure InfiniBand is used
ibstat
4. NCCL errors
# Increase the collective timeout (in PyTorch, pass timeout= when creating the process group):
#   dist.init_process_group("nccl", timeout=datetime.timedelta(minutes=30))
# Try different algorithms
export NCCL_ALGO=Ring # or Tree
Cloud Setup
AWS
# Use p4d.24xlarge (8x A100)
# Enable EFA (Elastic Fabric Adapter)
aws ec2 run-instances \
--instance-type p4d.24xlarge \
--placement GroupName=my-cluster
GCP
# Use a2-ultragpu-8g (8x A100)
gcloud compute instances create trainer \
--machine-type=a2-ultragpu-8g \
--accelerator=count=8,type=nvidia-a100-80gb