Understanding CUDA fundamentals helps you write faster ML code and debug performance issues. This guide covers what every ML engineer should know about GPU architecture and optimization.
GPU Architecture Overview
Processing Hierarchy
GPU
├── Streaming Multiprocessors (SMs): 108 on A100
│   ├── CUDA Cores: 64 per SM
│   ├── Tensor Cores: 4 per SM
│   ├── Shared Memory: up to 164KB per SM
│   └── Registers: 65,536 32-bit registers per SM (256KB)
└── Global Memory (HBM): 80GB
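The figures above are specific to the A100. To see the corresponding numbers for whatever GPU you are actually running on, you can query the device properties from PyTorch (a minimal sketch):
import torch

# Print SM count and global memory for the local GPU
props = torch.cuda.get_device_properties(0)
print(props.name)
print(f"SMs: {props.multi_processor_count}")
print(f"Global memory: {props.total_memory / 1e9:.0f} GB")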
Execution Model
Kernel Launch
├── Grid (all blocks)
│   ├── Block 0 → SM 0
│   ├── Block 1 → SM 1
│   └── ... (blocks are scheduled onto available SMs)
└── Each Block
    └── Warps (32 threads each)
        └── Threads in a warp execute in lockstep (SIMT)
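As a concrete illustration of this mapping, the sketch below launches a tiny kernel through Numba's CUDA backend (assuming Numba with CUDA support is installed); each thread derives its global index from its block and thread IDs:
import numpy as np
from numba import cuda  # assumes Numba with CUDA support is installed

@cuda.jit
def add_one(x):
    # Global thread index = block index × block size + thread index within the block
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    if i < x.shape[0]:
        x[i] += 1.0

n = 1 << 20
x = cuda.to_device(np.zeros(n, dtype=np.float32))
threads_per_block = 256                                    # 8 warps per block
blocks = (n + threads_per_block - 1) // threads_per_block  # grid size
add_one[blocks, threads_per_block](x)                      # blocks are scheduled onto SMs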
Memory Hierarchy
Memory Types and Characteristics
| Memory | Size | Bandwidth | Latency | Scope |
|---|---|---|---|---|
| Registers | ~256KB/SM | ~20 TB/s | 0 cycles | Per thread |
| Shared/L1 | 164KB/SM | ~19 TB/s | ~20 cycles | Per block |
| L2 Cache | 40MB | ~5 TB/s | ~200 cycles | Global |
| HBM | 80GB | 2 TB/s | ~400 cycles | Global |
| System RAM | TBs | 25 GB/s | ~10,000 cycles | Host |
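To see where your own code sits in this hierarchy, a rough way to measure achieved HBM bandwidth is to time a large device-to-device copy (a sketch; the buffer size is arbitrary):
import torch

x = torch.empty(256 * 1024 * 1024, dtype=torch.float32, device='cuda')  # ~1 GB
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

torch.cuda.synchronize()
start.record()
y = x.clone()                       # reads ~1 GB and writes ~1 GB of HBM
end.record()
torch.cuda.synchronize()

ms = start.elapsed_time(end)
gbytes = 2 * x.numel() * x.element_size() / 1e9
print(f"Effective bandwidth: {gbytes / (ms / 1e3):.0f} GB/s")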
Memory Access Patterns
# BAD: Non-coalesced access (threads access scattered memory)
# Each thread touches a different cache line
for i in range(n):
    result[i] = data[indices[i]]  # random/gather access

# GOOD: Coalesced access (adjacent threads access adjacent memory)
# Threads 0-31 read consecutive addresses → single memory transaction
for i in range(n):
    result[i] = data[i]  # sequential access
Impact of Memory Access
| Access Pattern | Effective Bandwidth |
|---|---|
| Coalesced | 2000 GB/s |
| Strided (stride=2) | 1000 GB/s |
| Random | 50-100 GB/s |
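These numbers are easy to reproduce from PyTorch by timing a gather with sequential versus shuffled indices (a rough sketch; exact figures depend on the GPU and sizes chosen):
import torch

n = 50_000_000
data = torch.randn(n, device='cuda')
seq_idx = torch.arange(n, device='cuda')
rand_idx = torch.randperm(n, device='cuda')

def time_gather(idx):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    _ = data[idx]                   # gather: coalesced when idx is sequential
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)

print(f"sequential gather: {time_gather(seq_idx):.2f} ms")
print(f"random gather:     {time_gather(rand_idx):.2f} ms")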
Tensor Cores
What They Do
Tensor Cores perform a matrix multiply-accumulate in hardware:
D = A × B + C
where A, B, C, and D are small matrix tiles (e.g., 16×16). A single instruction computes an entire tile of the result.
Precision Support (A100)
| Input | Accumulate | Peak Throughput (dense) |
|---|---|---|
| FP16 | FP32 | 312 TFLOPS |
| BF16 | FP32 | 312 TFLOPS |
| TF32 | FP32 | 156 TFLOPS |
| INT8 | INT32 | 624 TOPS |
| FP64 | FP64 | 19.5 TFLOPS |
Using Tensor Cores in PyTorch
import torch

# Tensor Cores are used automatically when:
# 1. Dimensions are multiples of 8 (FP16) or 16 (INT8)
# 2. Inputs use an appropriate dtype (FP16/BF16/TF32/INT8)
# 3. The op maps to a matmul or convolution

# Ensure Tensor Core usage
x = torch.randn(1024, 1024, dtype=torch.float16, device='cuda')
y = torch.randn(1024, 1024, dtype=torch.float16, device='cuda')
z = torch.matmul(x, y)  # runs on Tensor Cores

# Opt in to TF32 for FP32 matmuls and convolutions on Ampere+
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
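One way to confirm Tensor Cores are actually being hit is to measure achieved throughput on a large FP16 matmul and compare it against the table above (a rough sketch; the matrix size is arbitrary):
import torch

n = 8192
a = torch.randn(n, n, dtype=torch.float16, device='cuda')
b = torch.randn(n, n, dtype=torch.float16, device='cuda')

for _ in range(3):                                  # warmup
    torch.matmul(a, b)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
torch.cuda.synchronize()
start.record()
iters = 20
for _ in range(iters):
    torch.matmul(a, b)
end.record()
torch.cuda.synchronize()

flops = 2 * n ** 3 * iters
print(f"Achieved: {flops / (start.elapsed_time(end) / 1e3) / 1e12:.0f} TFLOPS")
A result near the 312 TFLOPS figure indicates Tensor Core execution; something closer to the non-Tensor-Core FP32 rate usually means they are not being used.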
Profiling PyTorch Code
Using torch.profiler
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    model(input)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
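The same profile can also be exported as a Chrome trace for timeline inspection:
prof.export_chrome_trace("trace.json")  # open in chrome://tracing or ui.perfetto.dev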
Using NVIDIA Nsight
# Profile the entire script and write a report file
nsys profile -o report python train.py
# Also print summary statistics at the end of the run
nsys profile --stats=true python train.py
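To make the Nsight timeline easier to read, you can annotate regions of your script with NVTX ranges, which nsys displays as named spans (a small sketch; the range names are arbitrary and model/input are assumed to be defined elsewhere):
import torch

torch.cuda.nvtx.range_push("forward")
output = model(input)                 # model/input assumed defined elsewhere
torch.cuda.nvtx.range_pop()

torch.cuda.nvtx.range_push("backward")
output.sum().backward()
torch.cuda.nvtx.range_pop()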
Identifying Bottlenecks
# Check whether a workload is compute-bound or memory-bound:
# Compute-bound: arithmetic throughput near peak, memory bandwidth well below peak
# Memory-bound: memory bandwidth near peak, arithmetic throughput well below peak
import torch

# Measure kernel time with CUDA events
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

torch.cuda.synchronize()   # make sure prior work has finished before timing
start.record()
result = operation(input)  # `operation` stands in for the code under test
end.record()
torch.cuda.synchronize()
print(f"Time: {start.elapsed_time(end):.2f} ms")
Common Optimizations
1. Kernel Fusion
import torch
import torch.nn.functional as F

# BAD: three separate kernel launches, each reading and writing global memory
x = input + bias
x = F.relu(x)
x = F.dropout(x, p=0.1)

# GOOD: fused operations (fewer kernels)
# PyTorch fuses some pointwise chains automatically;
# torch.compile fuses far more aggressively
model = torch.compile(model)
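For example, the bias + ReLU + dropout chain above can be wrapped in a function and compiled, letting the compiler fuse the pointwise ops into far fewer kernels (a minimal sketch with illustrative shapes):
import torch
import torch.nn.functional as F

@torch.compile
def bias_relu_dropout(x, bias, p=0.1):
    # Pointwise chain: a good candidate for fusion into one or two kernels
    return F.dropout(F.relu(x + bias), p=p)

x = torch.randn(4096, 4096, device='cuda')
bias = torch.randn(4096, device='cuda')
out = bias_relu_dropout(x, bias)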
2. Avoiding CPU-GPU Synchronization
# BAD: forces a GPU→CPU sync every iteration
loss = model(input)
print(f"Loss: {loss.item()}")  # .item() synchronizes!

# GOOD: keep values on the GPU and batch synchronization points
losses = []
for batch in dataloader:
    losses.append(model(batch).detach())  # no sync; detach avoids holding the graph
# Single sync at the end
total_loss = torch.stack(losses).mean().item()
3. Memory Pre-allocation
# BAD: allocate a fresh buffer every iteration
for i in range(1000):
    output = torch.empty(size, device='cuda')
    operation(input, out=output)

# GOOD: pre-allocate once and reuse
output = torch.empty(size, device='cuda')
for i in range(1000):
    operation(input, out=output)  # reuse the buffer
4. Optimal Data Types
# Use BF16 for training (Ampere+)
model = model.to(torch.bfloat16)
# Use FP16 for inference
model = model.half()
# Use channels_last for CNNs
input = input.to(memory_format=torch.channels_last)
model = model.to(memory_format=torch.channels_last)
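Casting the whole model as above is the simplest option; for training, the more common pattern is autocast, which keeps master weights in FP32 and runs selected ops in BF16/FP16. A sketch, assuming model, optimizer, loss_fn, and dataloader are defined elsewhere:
import torch

for inputs, targets in dataloader:                     # assumed defined elsewhere
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
        loss = loss_fn(model(inputs), targets)
    loss.backward()                                    # BF16 generally needs no GradScaler
    optimizer.step()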
CUDA Streams
Concurrent Execution
# Create streams
stream1 = torch.cuda.Stream()
stream2 = torch.cuda.Stream()

# Kernels issued on different streams may overlap on the GPU
with torch.cuda.stream(stream1):
    result1 = operation1(data1)
with torch.cuda.stream(stream2):
    result2 = operation2(data2)

# Synchronize before consuming the results on the default stream
torch.cuda.synchronize()
Overlap Compute and Transfer
stream = torch.cuda.Stream()
batch_iter = iter(dataloader)       # DataLoader with pin_memory=True; batches assumed to be tensors
next_batch = next(batch_iter).to('cuda', non_blocking=True)

while next_batch is not None:
    batch = next_batch
    # Start the next host→device copy on a side stream while we compute
    next_batch = None
    cpu_batch = next(batch_iter, None)
    if cpu_batch is not None:
        with torch.cuda.stream(stream):
            next_batch = cpu_batch.to('cuda', non_blocking=True)

    output = model(batch)                                # compute overlaps the copy
    torch.cuda.current_stream().wait_stream(stream)      # prefetched batch ready for next iteration
Understanding GPU Utilization
What 100% Utilization Means
GPU Utilization = time with at least one kernel running / total time
100% utilization does not mean optimal performance!
- The running kernels may themselves be slow
- The workload may be memory-bound rather than compute-bound
- Also check SM efficiency and memory throughput
Key Metrics to Monitor
| Metric | Healthy Value | Likely Issue Otherwise |
|---|---|---|
| SM Efficiency | >80% | Poor parallelism or many small kernels |
| Memory Throughput | >80% of peak | Inefficient memory access patterns |
| Tensor Core Usage | >50% | Ops not mapping to Tensor Cores |
| PCIe Throughput | Low in steady state | Host-to-device transfer bottleneck |
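GPU and memory-controller utilization can be polled programmatically through NVML (a sketch assuming the nvidia-ml-py / pynvml package is installed); SM efficiency and Tensor Core usage need Nsight Compute or the profiler:
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

util = pynvml.nvmlDeviceGetUtilizationRates(handle)     # coarse, sampled counters
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"GPU busy: {util.gpu}%  memory controller busy: {util.memory}%")
print(f"Memory used: {mem.used / 1e9:.1f} / {mem.total / 1e9:.1f} GB")

pynvml.nvmlShutdown()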
PyTorch 2.0 torch.compile
Automatic Optimization
model = torch.compile(model)
# Benefits:
# - Kernel fusion
# - Reduced Python and framework overhead
# - Memory planning
# - Hardware-specific code generation (e.g., via Triton)
# - Composes with autocast / mixed precision
Compile Modes
# Default: balance of compile time and runtime
model = torch.compile(model, mode="default")

# Maximum optimization (longer compile time, autotunes kernel choices)
model = torch.compile(model, mode="max-autotune")

# Reduce per-call overhead using CUDA graphs (good for small batches)
model = torch.compile(model, mode="reduce-overhead")