GPU Optimization

CUDA for ML Engineers: Memory Hierarchy and Optimization Basics

Essential CUDA knowledge for ML engineers. Learn GPU architecture, memory hierarchy, kernel optimization, and how to profile and optimize PyTorch code for maximum performance.

Flash Attention Team · January 8, 2026 · 7 min read
CUDA · GPU programming · memory hierarchy · kernel optimization · PyTorch

Understanding CUDA fundamentals helps you write faster ML code and debug performance issues. This guide covers what every ML engineer should know about GPU architecture and optimization.

GPU Architecture Overview

Processing Hierarchy

GPU
├── Streaming Multiprocessors (SMs): 108 on A100
│   ├── CUDA Cores: 64 per SM
│   ├── Tensor Cores: 4 per SM
│   ├── Shared Memory: 164KB per SM
│   └── Registers: 65536 per SM
└── Global Memory (HBM): 80GB
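
You can query these numbers for your own GPU directly from PyTorch; a minimal sketch (the printed values depend on the device):

import torch

# Inspect the processing hierarchy of the current GPU
props = torch.cuda.get_device_properties(0)
print(props.name)                                # e.g. "NVIDIA A100-SXM4-80GB"
print(props.multi_processor_count, "SMs")        # 108 on A100
print(f"{props.total_memory / 1e9:.0f} GB HBM")  # global memory capacity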

Execution Model

Kernel Launch
    ├── Grid (all blocks)
    │   ├── Block 0 → SM 0
    │   ├── Block 1 → SM 1
    │   └── ...
    └── Each Block
        └── Warps (32 threads each)
            └── Threads execute in lockstep
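
To see how a kernel launch maps onto this hierarchy from Python, here is a minimal sketch using Numba's CUDA bindings (assumes the numba package is installed; the kernel name add_one is illustrative):

from numba import cuda
import numpy as np

@cuda.jit
def add_one(x):
    i = cuda.grid(1)        # global thread index: blockIdx.x * blockDim.x + threadIdx.x
    if i < x.size:          # guard against out-of-range threads in the last block
        x[i] += 1.0

data = cuda.to_device(np.zeros(1_000_000, dtype=np.float32))
threads_per_block = 256                                             # 8 warps of 32 threads
blocks_per_grid = (data.size + threads_per_block - 1) // threads_per_block
add_one[blocks_per_grid, threads_per_block](data)                   # launch the grid
result = data.copy_to_host()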

Memory Hierarchy

Memory Types and Characteristics

Memory      | Size      | Bandwidth | Latency        | Scope
Registers   | ~256KB/SM | ~20 TB/s  | 0 cycles       | Per thread
Shared/L1   | 164KB/SM  | ~19 TB/s  | ~20 cycles     | Per block
L2 Cache    | 40MB      | ~5 TB/s   | ~200 cycles    | Global
HBM         | 80GB      | 2 TB/s    | ~400 cycles    | Global
System RAM  | TBs       | 25 GB/s   | ~10,000 cycles | Host
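
As a rough sanity check, you can estimate achievable HBM bandwidth by timing a large device-to-device copy in PyTorch (a sketch; the number varies with GPU, clocks, and tensor size):

import torch

n_bytes = 1 << 30                                # 1 GiB source tensor
src = torch.empty(n_bytes, dtype=torch.uint8, device='cuda')
dst = torch.empty_like(src)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

torch.cuda.synchronize()
start.record()
dst.copy_(src)
end.record()
torch.cuda.synchronize()

ms = start.elapsed_time(end)
print(f"~{2 * n_bytes / (ms * 1e-3) / 1e9:.0f} GB/s")  # the copy reads and writes each byte once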

Memory Access Patterns

# BAD: Non-coalesced access (threads access scattered memory)
# Each thread accesses memory at different cache lines
for i in range(n):
    result[i] = data[indices[i]]  # Random access

# GOOD: Coalesced access (adjacent threads access adjacent memory)
# Threads 0-31 access consecutive memory → single memory transaction
for i in range(n):
    result[i] = data[i]  # Sequential access

Impact of Memory Access

Access Pattern     | Effective Bandwidth
Coalesced          | 2000 GB/s
Strided (stride=2) | 1000 GB/s
Random             | 50-100 GB/s
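
A quick way to observe this on your own GPU is to time a contiguous copy against a gather with random indices (an illustrative sketch; the exact ratio depends on the device and tensor size):

import torch

n = 1 << 26
data = torch.randn(n, device='cuda')
idx = torch.randperm(n, device='cuda')

def timed(fn):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)

print("sequential:", timed(lambda: data.clone()), "ms")   # coalesced reads
print("random:    ", timed(lambda: data[idx]), "ms")      # scattered reads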

Tensor Cores

What They Do

Tensor Cores perform matrix multiply-accumulate in hardware:

D = A × B + C

where A, B, C, and D are small matrix tiles (e.g., 16×16), and a single instruction computes the entire tile result.

Precision Support (A100)

Input | Accumulate | Peak Throughput (dense)
FP16  | FP32       | 312 TFLOPS
BF16  | FP32       | 312 TFLOPS
TF32  | FP32       | 156 TFLOPS
INT8  | INT32      | 624 TOPS
FP64  | FP64       | 19.5 TFLOPS

Using Tensor Cores in PyTorch

import torch

# Tensor Cores are used automatically when:
# 1. Dimensions are multiples of 8 (FP16) or 16 (INT8)
# 2. Using appropriate dtypes
# 3. Using matmul or conv operations

# Ensure tensor core usage
x = torch.randn(1024, 1024, dtype=torch.float16, device='cuda')
y = torch.randn(1024, 1024, dtype=torch.float16, device='cuda')
z = torch.matmul(x, y)  # Uses Tensor Cores

# Enable TF32 for FP32 operations (default on Ampere+)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
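
In training code these dtypes are usually applied via autocast rather than casting tensors by hand; a minimal sketch:

# Mixed precision via autocast: eligible ops (matmul, conv) run in BF16 on
# Tensor Cores while the tensors themselves can stay in FP32
a = torch.randn(1024, 1024, device='cuda')
b = torch.randn(1024, 1024, device='cuda')
with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    c = a @ b                 # computed in BF16 on Tensor Cores
    loss = c.float().sum()    # keep the reduction in FP32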

Profiling PyTorch Code

Using torch.profiler

from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    model(input)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
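
The same profiler object can also export a timeline that opens in chrome://tracing or Perfetto:

prof.export_chrome_trace("trace.json")   # inspect kernel timing and gaps visually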

Using NVIDIA Nsight

# Profile entire script
nsys profile -o report python train.py

# Also print summary statistics to the console after the run
nsys profile --stats=true python train.py
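
To make specific regions easy to find on the Nsight Systems timeline, wrap them in NVTX ranges (model and input are placeholders, as above):

import torch

torch.cuda.nvtx.range_push("forward")    # named range appears on the nsys timeline
output = model(input)
torch.cuda.nvtx.range_pop()

torch.cuda.nvtx.range_push("backward")
output.sum().backward()
torch.cuda.nvtx.range_pop()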

Identifying Bottlenecks

# Check whether an op is compute-bound or memory-bound:
# Compute-bound: math throughput near peak, memory bandwidth well below peak
# Memory-bound: memory bandwidth near peak, math throughput well below peak

import torch

# Measure kernel time
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
result = operation(input)
end.record()
torch.cuda.synchronize()

print(f"Time: {start.elapsed_time(end):.2f} ms")

Common Optimizations

1. Kernel Fusion

import torch.nn.functional as F

# BAD: Multiple kernel launches, one per elementwise op
x = input + bias
x = F.relu(x)
x = F.dropout(x, p=0.1)

# GOOD: Fused operation (single kernel)
# PyTorch can fuse some operations automatically
# Or use torch.compile for automatic fusion
model = torch.compile(model)
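
As a sketch, the same three ops can be wrapped in a function and compiled, letting the compiler emit a single fused kernel (the function name is illustrative):

import torch
import torch.nn.functional as F

@torch.compile
def bias_relu_dropout(x, bias):
    x = x + bias            # elementwise ops fuse into one kernel
    x = F.relu(x)
    return F.dropout(x, p=0.1)

x = torch.randn(4096, 4096, device='cuda')
bias = torch.randn(4096, device='cuda')
out = bias_relu_dropout(x, bias)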

2. Avoiding CPU-GPU Synchronization

# BAD: Forces synchronization
loss = model(input)
print(f"Loss: {loss.item()}")  # .item() syncs!

# GOOD: Batch synchronization points
losses = []
for batch in dataloader:
    losses.append(model(batch).detach())  # stays on GPU; no sync, no retained graph

# Sync once at the end
total_loss = torch.stack(losses).mean().item()

3. Memory Pre-allocation

# BAD: Allocate every iteration
for i in range(1000):
    output = torch.empty(size, device='cuda')
    operation(input, out=output)

# GOOD: Pre-allocate and reuse
output = torch.empty(size, device='cuda')
for i in range(1000):
    operation(input, out=output)  # Reuse buffer

4. Optimal Data Types

# Use BF16 for training (Ampere+)
model = model.to(torch.bfloat16)

# Use FP16 for inference
model = model.half()

# Use channels_last for CNNs
input = input.to(memory_format=torch.channels_last)
model = model.to(memory_format=torch.channels_last)

CUDA Streams

Concurrent Execution

# Create streams
stream1 = torch.cuda.Stream()
stream2 = torch.cuda.Stream()

# Operations on different streams can overlap
with torch.cuda.stream(stream1):
    result1 = operation1(data1)

with torch.cuda.stream(stream2):
    result2 = operation2(data2)

# Synchronize when needed
torch.cuda.synchronize()

Overlap Compute and Transfer

stream = torch.cuda.Stream()

# Prefetch: copy the next batch on a side stream while computing on the
# current one (requires pin_memory=True in the DataLoader for async copies)
batches = iter(dataloader)
next_batch = next(batches).to('cuda', non_blocking=True)

while next_batch is not None:
    batch = next_batch
    cpu_batch = next(batches, None)      # fetch the following batch on the CPU
    if cpu_batch is not None:
        with torch.cuda.stream(stream):  # start the H2D copy on the side stream
            next_batch = cpu_batch.to('cuda', non_blocking=True)
    else:
        next_batch = None

    output = model(batch)                # compute on the current batch (default stream)

    torch.cuda.current_stream().wait_stream(stream)  # copy must finish before the next iteration

Understanding GPU Utilization

What 100% Utilization Means

GPU Utilization = Time with active kernels / Total time

100% utilization doesn't mean optimal performance!
- Could be running slow kernels
- Could be memory-bound
- Check SM efficiency, memory throughput

Key Metrics to Monitor

Metric            | Target       | Possible Issue
SM Efficiency     | >80%         | Low: poor parallelism
Memory Throughput | >80% of peak | Low: inefficient memory access pattern
Tensor Core Usage | >50%         | Low: not using Tensor-Core-enabled ops
PCIe Throughput   | Low          | High: data transfer bottleneck
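
For a coarse live view of these signals from Python, the NVML bindings expose kernel-active time and memory-controller activity (a sketch assuming the pynvml package is installed; per-kernel metrics such as SM efficiency require Nsight Compute or DCGM):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print("kernel-active %: ", util.gpu)     # time with at least one kernel running
print("memory-active %: ", util.memory)  # time the memory controller was busy
pynvml.nvmlShutdown()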

PyTorch 2.0 torch.compile

Automatic Optimization

model = torch.compile(model)

# Benefits:
# - Kernel fusion
# - Memory planning
# - Automatic mixed precision handling
# - Hardware-specific optimizations

Compile Modes

# Default: Balance of compile time and runtime
model = torch.compile(model, mode="default")

# Maximum optimization (longer compile)
model = torch.compile(model, mode="max-autotune")

# Reduce kernel launch and Python overhead via CUDA graphs (helps small batches)
model = torch.compile(model, mode="reduce-overhead")

