Distributed Training

Multi-GPU Training Setup: From Single Node to Cluster

Step-by-step guide to setting up multi-GPU training infrastructure. Covers hardware selection, networking, NCCL configuration, and troubleshooting for distributed PyTorch training.

Flash Attention Team · January 8, 2026 · 6 min read
multi-GPU · NCCL · distributed training · training cluster · InfiniBand · PyTorch

Setting up multi-GPU training infrastructure requires careful attention to hardware, networking, and software configuration. This guide walks through the complete setup process from single node to multi-node clusters.

Hardware Requirements

GPU Selection

GPU        | VRAM  | Interconnect | Best For
RTX 4090   | 24 GB | PCIe 4.0     | Budget training
A100 40GB  | 40 GB | NVLink 3.0   | Production training
A100 80GB  | 80 GB | NVLink 3.0   | Large models
H100 80GB  | 80 GB | NVLink 4.0   | Maximum performance

Interconnect Comparison

Interconnect   | Bandwidth | Latency | Use Case
PCIe 4.0       | 64 GB/s   | ~1-2 μs | Consumer GPUs
NVLink 3.0     | 600 GB/s  | ~0.7 μs | A100
NVLink 4.0     | 900 GB/s  | ~0.5 μs | H100
InfiniBand HDR | 200 Gb/s  | ~1 μs   | Multi-node
InfiniBand NDR | 400 Gb/s  | ~0.5 μs | High-performance

Single Node Configuration

For a single training node:

# Check GPU topology
nvidia-smi topo -m

# Example output (8x A100 with NVSwitch):
#         GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7
# GPU0     X    NV12  NV12  NV12  NV12  NV12  NV12  NV12
# GPU1    NV12   X    NV12  NV12  NV12  NV12  NV12  NV12
# ...
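
From Python, you can also confirm that peer-to-peer access is available between GPU pairs. A minimal sketch (the file name check_p2p.py is our own choice):

# check_p2p.py: verify P2P access between every GPU pair
import torch

num_gpus = torch.cuda.device_count()
for i in range(num_gpus):
    for j in range(num_gpus):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: P2P {'yes' if ok else 'no'}")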

Software Setup

CUDA and Drivers

# Check CUDA version
nvcc --version

# Check driver version
nvidia-smi

# Recommended versions (as of 2024)
# CUDA: 12.1+
# Driver: 535+

PyTorch Installation

# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Verify CUDA availability
python -c "import torch; print(torch.cuda.is_available())"
python -c "import torch; print(torch.cuda.device_count())"

NCCL Configuration

# Check NCCL version
python -c "import torch; print(torch.cuda.nccl.version())"

# Key environment variables
export NCCL_DEBUG=INFO              # Debug output
export NCCL_IB_DISABLE=0            # Enable InfiniBand
export NCCL_NET_GDR_LEVEL=5         # GPU Direct RDMA
export NCCL_SOCKET_IFNAME=eth0      # Network interface
export NCCL_P2P_LEVEL=NVL           # NVLink for P2P

Single Node Training

Basic Setup with torchrun

# train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Initialize process group
    dist.init_process_group(backend="nccl")

    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ["LOCAL_RANK"])

    # Set device
    torch.cuda.set_device(local_rank)

    # Create model
    model = MyModel().cuda()
    model = DDP(model, device_ids=[local_rank])

    # Training loop
    ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launch with:

# 4 GPUs on single node
torchrun --nproc_per_node=4 train.py

# All available GPUs
torchrun --nproc_per_node=gpu train.py
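
The training loop above is elided; with DDP, each rank also needs to see a distinct shard of the data. A minimal sketch using DistributedSampler (train_dataset, the batch size, and num_epochs are placeholders):

# Data sharding for DDP; train_dataset and num_epochs are placeholders
from torch.utils.data import DataLoader, DistributedSampler

sampler = DistributedSampler(train_dataset, shuffle=True)
dataloader = DataLoader(train_dataset, batch_size=32, sampler=sampler,
                        num_workers=4, pin_memory=True)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reshuffles differently each epoch
    for batch in dataloader:
        ...  # forward/backward/step as usual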

Using Accelerate

# Configure accelerate
accelerate config

# Launch training
accelerate launch train.py
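
For reference, the same DDP setup reduces to a few lines with the Accelerate API. A minimal sketch (model, optimizer, and dataloader are placeholders):

# Accelerate sketch; model, optimizer, and dataloader are placeholders
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()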

Multi-Node Setup

Network Configuration

# On each node, check network
ip addr show

# Test connectivity between nodes
ping node2
ib_write_bw --all  # InfiniBand bandwidth test

SSH Setup

# Generate SSH key
ssh-keygen -t rsa

# Copy to all nodes
ssh-copy-id user@node1
ssh-copy-id user@node2

# Verify passwordless SSH
ssh node1 hostname
ssh node2 hostname

Hostfile Configuration

# hostfile
node1 slots=8
node2 slots=8
node3 slots=8
node4 slots=8

Launch Multi-Node Training

# Using torchrun (recommended)
# On node 0 (master):
torchrun \
    --nproc_per_node=8 \
    --nnodes=4 \
    --node_rank=0 \
    --master_addr=node1 \
    --master_port=29500 \
    train.py

# On node 1:
torchrun \
    --nproc_per_node=8 \
    --nnodes=4 \
    --node_rank=1 \
    --master_addr=node1 \
    --master_port=29500 \
    train.py
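
Typing the same command on every node by hand does not scale; with passwordless SSH already configured, a small launcher script can start all ranks. A sketch (the node list, project path, and script name are assumptions about your environment):

# launch_all.sh: start torchrun on every node over SSH (sketch)
NODES=(node1 node2 node3 node4)
for i in "${!NODES[@]}"; do
    ssh "${NODES[$i]}" "cd /path/to/project && \
        torchrun --nproc_per_node=8 --nnodes=4 --node_rank=$i \
                 --master_addr=node1 --master_port=29500 train.py" &
done
wait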

Using DeepSpeed Launcher

# With hostfile
deepspeed --hostfile=hostfile train.py --deepspeed ds_config.json

# Or with explicit hosts
deepspeed --num_gpus=8 --num_nodes=4 \
    --hostfile=hostfile \
    train.py --deepspeed ds_config.json
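
The ds_config.json referenced above is not shown in this guide; a minimal sketch might look like the following (the ZeRO stage, precision, and batch-size values are illustrative choices, not recommendations):

{
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,
    "bf16": { "enabled": true },
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": true
    }
}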

Monitoring and Profiling

GPU Monitoring

# Real-time GPU stats
watch -n 1 nvidia-smi

# Detailed memory usage
nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv -l 1

# DCGM for cluster monitoring
dcgmi dmon -e 1001,1002,1003,1004,1005
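
For logging GPU stats from inside a training job rather than a separate terminal, the NVML Python bindings can be polled. A sketch (assumes the nvidia-ml-py package is installed):

# Poll per-GPU memory and utilization via NVML (requires nvidia-ml-py)
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    print(f"GPU {i}: {mem.used / 1e9:.1f}/{mem.total / 1e9:.1f} GB, {util.gpu}% util")
pynvml.nvmlShutdown()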

PyTorch Profiler

import torch
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./logs'),
    record_shapes=True,
    profile_memory=True,
) as prof:
    for step, batch in enumerate(dataloader):
        train_step(batch)
        prof.step()
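
After the run, the trace can be opened in TensorBoard, or a quick summary printed directly:

# Print the ops with the highest total CUDA time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))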

NCCL Debugging

# Enable NCCL debug output
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL

# Common issues:
# - "NCCL WARN Connect to ... failed" → Network/firewall issue
# - "NCCL timeout" → Increase NCCL_TIMEOUT
# - "NCCL WARN Call to ibv_..." → InfiniBand configuration

Performance Optimization

Communication Overlap

# FSDP with prefetch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, BackwardPrefetch

model = FSDP(
    model,
    forward_prefetch=True,
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
)

# DeepSpeed overlap
{
    "zero_optimization": {
        "overlap_comm": true
    }
}

Gradient Accumulation

# Accumulate gradients over several micro-batches before each optimizer step
gradient_accumulation_steps = 8

for i, batch in enumerate(dataloader):
    loss = model(batch) / gradient_accumulation_steps
    loss.backward()

    if (i + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
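
Note that plain gradient accumulation under DDP still all-reduces gradients on every backward() call; to actually skip communication on the intermediate steps, wrap them in DDP's no_sync() context. A sketch:

import contextlib

for i, batch in enumerate(dataloader):
    sync_step = (i + 1) % gradient_accumulation_steps == 0
    # Defer the gradient all-reduce until the final micro-batch of the window
    ctx = contextlib.nullcontext() if sync_step else model.no_sync()
    with ctx:
        loss = model(batch) / gradient_accumulation_steps
        loss.backward()
    if sync_step:
        optimizer.step()
        optimizer.zero_grad()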

Optimal Batch Size

# Find maximum batch size
def find_max_batch_size(model, start=1, max_batch=128):
    batch_size = start
    while batch_size <= max_batch:
        try:
            dummy_input = torch.randn(batch_size, ...).cuda()
            loss = model(dummy_input).sum()
            loss.backward()
            torch.cuda.synchronize()
            torch.cuda.empty_cache()
            batch_size *= 2
        except RuntimeError:  # typically an out-of-memory error
            torch.cuda.empty_cache()
            return batch_size // 2
    return batch_size // 2  # largest size that fit within max_batch

Troubleshooting

Common Issues

1. Processes hang at initialization

# Check if all nodes can reach master
nc -zv master_node 29500

# Ensure same PyTorch/NCCL versions
python -c "import torch; print(torch.__version__)"

2. OOM on some GPUs

# Check memory distribution
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.memory_allocated(i) / 1e9:.2f} GB")

3. Slow multi-node training

# Check network bandwidth
iperf3 -c other_node

# Ensure InfiniBand is used
ibstat

4. NCCL errors

# Increase the process-group timeout in the training script:
# dist.init_process_group("nccl", timeout=datetime.timedelta(seconds=1800))

# Try different algorithms
export NCCL_ALGO=Ring  # or Tree

Cloud Setup

AWS

# Use p4d.24xlarge (8x A100)
# Enable EFA (Elastic Fabric Adapter)
aws ec2 run-instances \
    --instance-type p4d.24xlarge \
    --placement GroupName=my-cluster

GCP

# Use a2-ultragpu-8g (8x A100)
gcloud compute instances create trainer \
    --machine-type=a2-ultragpu-8g \
    --accelerator=count=8,type=nvidia-a100-80gb

