domain · domain · coding
CUDA Shared Memory & Memory Coalescing
GPU performance is bound by memory bandwidth. Shared memory + coalesced global access are the two highest-leverage CUDA optimizations. Required depth at NVIDIA and Tesla Autopilot.
Theory
Explanation
Intuition first, formal definition second. Skim the bullets if you already know this; read the prose if you don't.
GPU compute is cheap; GPU memory traffic is not. Every NVIDIA optimization comes back to "feed the cores faster." Shared memory is a programmer-managed L1, fast (≈100x faster than global) but tiny (≈48–164 KB/SM depending on architecture). Coalescing means making sure 32 threads in a warp issue a single 128-byte transaction instead of 32 separate ones.
The memory hierarchy from fastest to slowest: registers (per-thread), shared memory (per-block, programmer-managed, banked), L1/L2 cache, global memory (HBM). Coalescing rule: when a warp of 32 threads accesses contiguous, aligned addresses, the hardware issues ONE memory transaction. When threads access strided or scattered addresses, the hardware serializes, performance collapses. Shared memory is divided into 32 banks; if two threads in a warp hit the same bank, they serialize (bank conflict). Avoid via padding or transposed access patterns.
When to use
Reused data within a thread block (tiling matmul, stencil, reduction). Any kernel where you can identify "I am loading this value, and 31 of my warp-mates need it too", load once to shared.
When not to
Single-pass kernels with no reuse, shared memory adds copy cost without benefit. Memory-bound kernels where global access is already coalesced and you have nothing to amortize.
Bad (strided, no coalescing): Thread 0 reads addr 0 Thread 1 reads addr 1024 ← scattered → 32 separate transactions ... Good (coalesced): Thread 0 reads addr 0 Thread 1 reads addr 4 ← contiguous 4-byte floats → 1 transaction (128 B) ... Thread 31 reads addr 124
Key insights
- Coalescing trumps occupancy, a kernel at 25% occupancy with coalesced access often beats 100% occupancy with strided access.
- Shared memory bank conflicts are the silent killer. Pad inner dimension by 1 (e.g., __shared__ float tile[32][33]) to eliminate stride-32 conflicts.
- Always profile with Nsight Compute. "Roofline" model tells you whether you are compute-bound or memory-bound, optimizations differ.
- Warp-level primitives (__shfl_sync, __shfl_xor_sync) skip shared memory entirely for intra-warp data exchange, even faster.
- On H100 (Hopper) Distributed Shared Memory lets blocks within a cluster read each other's shared memory, new optimization axis.
- On Blackwell (B200) Tensor Memory Accelerator (TMA) replaces hand-rolled cp.async copies, know the API by name.