Permutation answers: "Each thread needs N elements. How should those elements be grouped in memory?"
2. Why It's Needed
Given (step 1 example):
Atom shape: 1×1 (one element per atom)
atoms_layout: 16×16 (256 atoms = 256 threads)
CTA tile: 128×128 (16384 elements)
Native coverage = 16×16 = 256 elements
Need to cover = 128×128 = 16384 elements
GAP: 256 vs 16384 — each thread must handle 64 elements, not 1!
Permutation fills this gap by telling each thread:
"You're responsible for 8 M-elements and 8 N-elements. Here's the pattern."
You have 16 painters (atoms) for a 128-meter wall. Each painter MUST paint 8 meters. But do they paint 8 consecutive meters? Or 2 sections of 4 meters with a gap? Or 4 sections of 2 meters? Permutation is this choice — same total work, different arrangement.
3. The Derivation Chain (Only ONE Free Choice)
F = atoms_in_M = 16 fixed by atoms_layout
×
R = ? (your choice) consecutive group size
=
period = F × R one cycle of pattern
→
reps = BM / period how many cycles
→
shape = (R, reps) output shape in M
The only constraint on R:
F × R must evenly divide BM. Since F=16 and BM=128, R must divide 128/16 = 8.
shape per thread in M = (4, 2) = (R, reps) = 8 elements
permutation_M = (16, 4):(/4, 1) = (F, R):(R, 1)
Thread 0's M-elements (out of 128 rows) — 2 groups of 4 consecutive
Thread 0's elements
Other threads' elements
Period boundary
5. Why R=4? (Performance Reasoning)
Memory transaction sizes on NVIDIA GPUs:
32 bytes = minimum transaction (8 × f32... wait, 32/4 = 8 f32 per cache sector)
128 bytes = one full cache line = 32 × f32
When storing C (row-major, stride=128 for the tile):
Each row: threads write to consecutive columns.
Thread 0 writes cols {0,1,2,3, 64,65,66,67} — two bursts of 4.
With 16 threads writing to the same row:
T0: cols 0-3, T1: cols 4-7, ..., T15: cols 60-63
= 64 consecutive elements = 256 bytes = 2 full cache lines, perfectly coalesced!Compare R=1 (scattered):
Thread 0 writes cols {0, 8, 16, 24, 32, 40, 48, 56}
Thread 1 writes cols {1, 9, 17, 25, 33, 41, 49, 57}
Each individual write is just 4 bytes, spread across many cache lines.
16 threads hit 8 different cache lines per store = terrible coalescing!Compare R=8 (one big block):
Thread 0 writes cols {0,1,2,3,4,5,6,7}
Thread 1 writes cols {8,9,10,11,12,13,14,15}
Also coalesced! 16 threads × 8 = 128 elements = 512 bytes.
R=8 is also good. But R=4 uses fewer registers for address computation.
Rule of thumb:
R = 4 for fp32 (4 × 4 bytes = 16 bytes = 128 bits = one vector register)
R = 8 for fp16 (8 × 2 bytes = 16 bytes = 128 bits = one vector register)
Match R to the vector load/store width of your data type.
6. The Period Visualized
7. All Valid Choices Compared
R
Period (F×R)
Repetitions
Thread 0's pattern
Shape
Coalescing
1
16
8
[0], [16], [32], [48], [64], [80], [96], [112]
(1, 8)
Bad — scattered writes
2
32
4
[0,1], [32,33], [64,65], [96,97]
(2, 4)
OK — 64-bit stores
4
64
2
[0,1,2,3], [64,65,66,67]
(4, 2)
Great — 128-bit vector stores
8
128
1
[0,1,2,3,4,5,6,7]
(8, 1)
Great — 256-bit, one contiguous block
8. Permutation for Tensor Cores (mma.sync.m16n8k16)
Different atom = different permutation style
With tensor cores, the atom is already big (16×8×16, 32 threads). Permutation is no longer a layout — it's just integer sizes specifying total coverage.
Step 4 Configuration:
Atom: mma.sync.m16n8k16 (16 rows × 8 cols, 32 threads per atom)
atoms_layout: (2, 2, 1) (2 in M, 2 in N, 1 in K = 4 atoms = 4 warps)
CTA tile: 128 × 128 × 32
NUM_THREADS: 4 atoms × 32 threads/atom = 128 threads
Native coverage (atoms_layout × atom_shape):
M: 2 atoms × 16 rows/atom = 32 rows
N: 2 atoms × 8 cols/atom = 16 cols
K: 1 atom × 16 depth = 16
Native coverage = 32 × 16 = 512 elements per "step"
Need to cover: 128 × 128 = 16384 elements
GAP: 512 vs 16384
M-dimension:
Native M coverage = atoms_M(2) × atom_M(16) = 32
permutation_M = 32
permutation_M = native M coverage → no expansion in M!
Each thread keeps exactly what 2 atoms give it in M.
Iterations in M: BM / permutation_M = 128 / 32 = 4N-dimension:
Native N coverage = atoms_N(2) × atom_N(8) = 16
permutation_N = 32
permutation_N > native N coverage → expansion by factor 2!
permutation_N / native_N = 32 / 16 = 2
Each thread visits 2 N-positions per atom (doubles N responsibility).
Iterations in N: BN / permutation_N = 128 / 32 = 4K-dimension:
permutation_K = 16 = atom_K → one MMA per K-slot
K-iterations within BK: BK / permutation_K = 32 / 16 = 2 MMA calls per k-tile
Total iterations to cover 128×128:
M-iterations × N-iterations = 4 × 4 = 16 MMA calls per k-step
× K-iterations per tile = × 2
= 32 MMA calls per k-tile
× num k-tiles (K/BK) = depends on problem K
Elements Per Thread
How many C-elements does one thread accumulate?
Each atom invocation: 1 thread produces V=4 output elements
N-expansion factor: 2 (from permutation)
M-iterations: 4
N-iterations: 4
But wait — iterations are across the TILE, not per-thread storage.
Each thread stores: V(4) × M_rest × N_rest from partition_C
From the atom's perspective (1 warp = 32 threads):
One atom: 16×8 = 128 elements / 32 threads = 4 per thread (V=4)
4 atoms per CTA: each thread participates in 1 atom (warp-level)
From the tiling perspective:
permutation_M = 32, BM = 128 → each thread has 128/32 = 4 "M-slots"
permutation_N = 32, BN = 128 → each thread has 128/32 = 4 "N-slots"
V per atom = 4
Total per thread: 4 × 4 × 4 = 64 accumulator registers (f32)
= 256 bytes of registers per thread for C alone!
Comparison: SIMT vs Tensor Core Permutation
Step 1 (Scalar FMA)
Step 4 (mma.sync.m16n8k16)
Atom size
1×1×1 (trivial)
16×8×16 (big)
Permutation format
Layout: (16,4):(4,1)
Integer sizes: (32, 32, 16)
What it controls
Element grouping pattern (R=4 for coalescing)
Total coverage size per thread
Why a Layout for step 1?
Because with trivial atoms, there's no hardware-defined element pattern — YOU choose it via the stride arrangement
Why integers for step 4?
Because the atom already defines the element pattern (hardware TV layout). You just say "how much total" and CuTe tiles it.
Coalescing control
Permutation's R value
Handled by ldmatrix + swizzled SMEM (not by permutation)
Native coverage
16×16 = 256 (tiny)
32×16 = 512 (already decent)
Expansion needed
×64 (massive)
×32 (moderate)
Key Insight for Tensor Cores
With tensor cores, the atom's TV layout already defines the element access pattern (it's in the silicon). The permutation's job simplifies to just: "how many times do I repeat the atom across my tile?" You no longer get to choose the grouping — NVIDIA chose it for you at the hardware level.
Memory coalescing for tensor cores is instead handled by swizzled shared memory (bank-conflict-free) and ldmatrix (warp-level SMEM→register load that matches the atom's layout).
━━━ Case 1: SIMT (scalar FMA, step 1) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Atom is trivial (1×1×1) → permutation defines the PATTERN.
FIXED: atoms_in_M = 16, BM = 128, elements/thread = 8
FREE: R = consecutive group size (choose 4 for 128-bit stores)
DERIVED: period = 16×R, reps = BM/period, shape = (R, reps)
Permutation format: Layout — (F, R):(R, 1)
You control: grouping pattern for coalesced memory access.
━━━ Case 2: Tensor Core (mma.sync.m16n8k16, step 4) ━━━━━━━━━━━━━━━
Atom is complex (16×8×16) → permutation defines the SIZE only.
FIXED: atom shape = 16×8, atoms_layout = (2,2,1), BM=BN=128
FREE: permutation_N multiplier (×2 in step4)
DERIVED: iterations = BM/perm_M × BN/perm_N = 4×4 = 16 steps
Permutation format: Integer tuple — (32, 32, 16)
You control: how much each thread accumulates (more = more registers).
You DON'T control: element access pattern (that's in the atom's TV layout).
━━━ The Common Principle ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
In BOTH cases, permutation bridges the gap:
native coverage (atoms × atom_shape) → PERMUTATION → full CTA tile
The difference is what "bridging" means:
SIMT: expand + arrange pattern (layout)
TC: expand + let hardware arrangement stand (integer sizes)