GPU Optimization

CUDA for ML Engineers: Memory Hierarchy and Optimization Basics

Essential CUDA knowledge for ML engineers. Learn GPU architecture, memory hierarchy, kernel optimization, and how to profile and optimize PyTorch code for maximum performance.

Flash Attention Team · January 8, 2026 · 7 min read
CUDA · GPU programming · memory hierarchy · kernel optimization · PyTorch

Understanding CUDA fundamentals helps you write faster ML code and debug performance issues. This guide covers what every ML engineer should know about GPU architecture and optimization.

GPU Architecture Overview

Processing Hierarchy

GPU
├── Streaming Multiprocessors (SMs): 108 on A100
│   ├── CUDA Cores: 64 per SM
│   ├── Tensor Cores: 4 per SM
│   ├── Shared Memory: 164KB per SM
│   └── Registers: 65536 per SM
└── Global Memory (HBM): 80GB
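
You can query these numbers for your own GPU directly from PyTorch; a minimal sketch (the printed values depend on the device):

import torch

# Inspect the processing hierarchy of the current GPU
props = torch.cuda.get_device_properties(0)
print(props.name)                                # e.g. "NVIDIA A100-SXM4-80GB"
print(props.multi_processor_count, "SMs")        # 108 on A100
print(f"{props.total_memory / 1e9:.0f} GB HBM")  # global memory capacity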

Execution Model

Kernel Launch
    ├── Grid (all blocks)
    │   ├── Block 0 → SM 0
    │   ├── Block 1 → SM 1
    │   └── ...
    └── Each Block
        └── Warps (32 threads each)
            └── Threads execute in lockstep
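
To see how a kernel launch maps onto this hierarchy from Python, here is a minimal sketch using Numba's CUDA bindings (assumes the numba package is installed; the kernel name add_one is illustrative):

from numba import cuda
import numpy as np

@cuda.jit
def add_one(x):
    i = cuda.grid(1)        # global thread index: blockIdx.x * blockDim.x + threadIdx.x
    if i < x.size:          # guard against out-of-range threads in the last block
        x[i] += 1.0

data = cuda.to_device(np.zeros(1_000_000, dtype=np.float32))
threads_per_block = 256                                             # 8 warps of 32 threads
blocks_per_grid = (data.size + threads_per_block - 1) // threads_per_block
add_one[blocks_per_grid, threads_per_block](data)                   # launch the grid
result = data.copy_to_host()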

Memory Hierarchy

Memory Types and Characteristics

Memory      | Size      | Bandwidth | Latency        | Scope
Registers   | ~256KB/SM | ~20 TB/s  | 0 cycles       | Per thread
Shared/L1   | 164KB/SM  | ~19 TB/s  | ~20 cycles     | Per block
L2 Cache    | 40MB      | ~5 TB/s   | ~200 cycles    | Global
HBM         | 80GB      | 2 TB/s    | ~400 cycles    | Global
System RAM  | TBs       | 25 GB/s   | ~10,000 cycles | Host
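
As a rough sanity check, you can estimate achievable HBM bandwidth by timing a large device-to-device copy in PyTorch (a sketch; the number varies with GPU, clocks, and tensor size):

import torch

n_bytes = 1 << 30                                # 1 GiB source tensor
src = torch.empty(n_bytes, dtype=torch.uint8, device='cuda')
dst = torch.empty_like(src)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

torch.cuda.synchronize()
start.record()
dst.copy_(src)
end.record()
torch.cuda.synchronize()

ms = start.elapsed_time(end)
print(f"~{2 * n_bytes / (ms * 1e-3) / 1e9:.0f} GB/s")  # the copy reads and writes each byte once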

Memory Access Patterns

# BAD: Non-coalesced access (threads access scattered memory)
# Each thread accesses memory at different cache lines
for i in range(n):
    result[i] = data[indices[i]]  # Random access

# GOOD: Coalesced access (adjacent threads access adjacent memory)
# Threads 0-31 access consecutive memory → single memory transaction
for i in range(n):
    result[i] = data[i]  # Sequential access

Impact of Memory Access

Access Pattern     | Effective Bandwidth
Coalesced          | 2000 GB/s
Strided (stride=2) | 1000 GB/s
Random             | 50-100 GB/s
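
A quick way to observe this on your own GPU is to time a contiguous copy against a gather with random indices (an illustrative sketch; the exact ratio depends on the device and tensor size):

import torch

n = 1 << 26
data = torch.randn(n, device='cuda')
idx = torch.randperm(n, device='cuda')

def timed(fn):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)

print("sequential:", timed(lambda: data.clone()), "ms")   # coalesced reads
print("random:    ", timed(lambda: data[idx]), "ms")      # scattered reads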

Tensor Cores

What They Do

Tensor Cores perform matrix multiply-accumulate in hardware:

D = A × B + C

where A, B, C, and D are small matrix tiles (e.g., 16×16), and a single instruction computes the entire tile result.

Precision Support (A100)

Input | Accumulate | Peak Throughput (dense)
FP16  | FP32       | 312 TFLOPS
BF16  | FP32       | 312 TFLOPS
TF32  | FP32       | 156 TFLOPS
INT8  | INT32      | 624 TOPS
FP64  | FP64       | 19.5 TFLOPS

Using Tensor Cores in PyTorch

import torch

# Tensor Cores are used automatically when:
# 1. Dimensions are multiples of 8 (FP16) or 16 (INT8)
# 2. Using appropriate dtypes
# 3. Using matmul or conv operations

# Ensure tensor core usage
x = torch.randn(1024, 1024, dtype=torch.float16, device='cuda')
y = torch.randn(1024, 1024, dtype=torch.float16, device='cuda')
z = torch.matmul(x, y)  # Uses Tensor Cores

# Enable TF32 for FP32 operations (default on Ampere+)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
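
In training code these dtypes are usually applied via autocast rather than casting tensors by hand; a minimal sketch:

# Mixed precision via autocast: eligible ops (matmul, conv) run in BF16 on
# Tensor Cores while the tensors themselves can stay in FP32
a = torch.randn(1024, 1024, device='cuda')
b = torch.randn(1024, 1024, device='cuda')
with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    c = a @ b                 # computed in BF16 on Tensor Cores
    loss = c.float().sum()    # keep the reduction in FP32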

Profiling PyTorch Code

Using torch.profiler

from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    model(input)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
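
The same profiler object can also export a timeline that opens in chrome://tracing or Perfetto:

prof.export_chrome_trace("trace.json")   # inspect kernel timing and gaps visually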

Using NVIDIA Nsight

# Profile entire script
nsys profile -o report python train.py

# Also print summary statistics to the console after the run
nsys profile --stats=true python train.py
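
To make specific regions easy to find on the Nsight Systems timeline, wrap them in NVTX ranges (model and input are placeholders, as above):

import torch

torch.cuda.nvtx.range_push("forward")    # named range appears on the nsys timeline
output = model(input)
torch.cuda.nvtx.range_pop()

torch.cuda.nvtx.range_push("backward")
output.sum().backward()
torch.cuda.nvtx.range_pop()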

Identifying Bottlenecks

# Check whether an op is compute-bound or memory-bound:
# Compute-bound: math throughput near peak, memory bandwidth well below peak
# Memory-bound: memory bandwidth near peak, math throughput well below peak

import torch

# Measure kernel time
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
result = operation(input)
end.record()
torch.cuda.synchronize()

print(f"Time: {start.elapsed_time(end):.2f} ms")

Common Optimizations

1. Kernel Fusion

import torch.nn.functional as F

# BAD: Multiple kernel launches, one per elementwise op
x = input + bias
x = F.relu(x)
x = F.dropout(x, p=0.1)

# GOOD: Fused operation (single kernel)
# PyTorch can fuse some operations automatically
# Or use torch.compile for automatic fusion
model = torch.compile(model)
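
As a sketch, the same three ops can be wrapped in a function and compiled, letting the compiler emit a single fused kernel (the function name is illustrative):

import torch
import torch.nn.functional as F

@torch.compile
def bias_relu_dropout(x, bias):
    x = x + bias            # elementwise ops fuse into one kernel
    x = F.relu(x)
    return F.dropout(x, p=0.1)

x = torch.randn(4096, 4096, device='cuda')
bias = torch.randn(4096, device='cuda')
out = bias_relu_dropout(x, bias)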

2. Avoiding CPU-GPU Synchronization

# BAD: Forces synchronization
loss = model(input)
print(f"Loss: {loss.item()}")  # .item() syncs!

# GOOD: Batch synchronization points
losses = []
for batch in dataloader:
    losses.append(model(batch).detach())  # stays on GPU; no sync, no retained graph

# Sync once at the end
total_loss = torch.stack(losses).mean().item()

3. Memory Pre-allocation

# BAD: Allocate every iteration
for i in range(1000):
    output = torch.empty(size, device='cuda')
    operation(input, out=output)

# GOOD: Pre-allocate and reuse
output = torch.empty(size, device='cuda')
for i in range(1000):
    operation(input, out=output)  # Reuse buffer

4. Optimal Data Types

# Use BF16 for training (Ampere+)
model = model.to(torch.bfloat16)

# Use FP16 for inference
model = model.half()

# Use channels_last for CNNs
input = input.to(memory_format=torch.channels_last)
model = model.to(memory_format=torch.channels_last)

CUDA Streams

Concurrent Execution

# Create streams
stream1 = torch.cuda.Stream()
stream2 = torch.cuda.Stream()

# Operations on different streams can overlap
with torch.cuda.stream(stream1):
    result1 = operation1(data1)

with torch.cuda.stream(stream2):
    result2 = operation2(data2)

# Synchronize when needed
torch.cuda.synchronize()

Overlap Compute and Transfer

stream = torch.cuda.Stream()

# Prefetch: copy the next batch on a side stream while computing on the
# current one (requires pin_memory=True in the DataLoader for async copies)
batches = iter(dataloader)
next_batch = next(batches).to('cuda', non_blocking=True)

while next_batch is not None:
    batch = next_batch
    cpu_batch = next(batches, None)      # fetch the following batch on the CPU
    if cpu_batch is not None:
        with torch.cuda.stream(stream):  # start the H2D copy on the side stream
            next_batch = cpu_batch.to('cuda', non_blocking=True)
    else:
        next_batch = None

    output = model(batch)                # compute on the current batch (default stream)

    torch.cuda.current_stream().wait_stream(stream)  # copy must finish before the next iteration

Understanding GPU Utilization

What 100% Utilization Means

GPU Utilization = Time with active kernels / Total time

100% utilization doesn't mean optimal performance!
- Could be running slow kernels
- Could be memory-bound
- Check SM efficiency, memory throughput

Key Metrics to Monitor

Metric            | Target       | Possible Issue
SM Efficiency     | >80%         | Low: poor parallelism
Memory Throughput | >80% of peak | Low: inefficient memory access pattern
Tensor Core Usage | >50%         | Low: not using Tensor-Core-enabled ops
PCIe Throughput   | Low          | High: data transfer bottleneck
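
For a coarse live view of these signals from Python, the NVML bindings expose kernel-active time and memory-controller activity (a sketch assuming the pynvml package is installed; per-kernel metrics such as SM efficiency require Nsight Compute or DCGM):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print("kernel-active %: ", util.gpu)     # time with at least one kernel running
print("memory-active %: ", util.memory)  # time the memory controller was busy
pynvml.nvmlShutdown()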

PyTorch 2.0 torch.compile

Automatic Optimization

model = torch.compile(model)

# Benefits:
# - Kernel fusion
# - Memory planning
# - Automatic mixed precision handling
# - Hardware-specific optimizations

Compile Modes

# Default: Balance of compile time and runtime
model = torch.compile(model, mode="default")

# Maximum optimization (longer compile)
model = torch.compile(model, mode="max-autotune")

# Reduce kernel launch and Python overhead via CUDA graphs (helps small batches)
model = torch.compile(model, mode="reduce-overhead")

