Permutation — The Bridge Between Atom and Tile

1. The One-Sentence Summary

Permutation answers: "Each thread needs N elements. How should those elements be grouped in memory?"

2. Why It's Needed

Given (step 1 example): Atom shape: 1×1 (one element per atom) atoms_layout: 16×16 (256 atoms = 256 threads) CTA tile: 128×128 (16384 elements) Native coverage = 16×16 = 256 elements Need to cover = 128×128 = 16384 elements GAP: 256 vs 16384 — each thread must handle 64 elements, not 1! Permutation fills this gap by telling each thread: "You're responsible for 8 M-elements and 8 N-elements. Here's the pattern."
You have 16 painters (atoms) for a 128-meter wall. Each painter MUST paint 8 meters. But do they paint 8 consecutive meters? Or 2 sections of 4 meters with a gap? Or 4 sections of 2 meters? Permutation is this choice — same total work, different arrangement.

3. The Derivation Chain (Only ONE Free Choice)

F = atoms_in_M = 16
fixed by atoms_layout
×
R = ? (your choice)
consecutive group size
=
period = F × R
one cycle of pattern
reps = BM / period
how many cycles
shape = (R, reps)
output shape in M

The only constraint on R:

F × R must evenly divide BM. Since F=16 and BM=128, R must divide 128/16 = 8.

Valid choices: R ∈ {1, 2, 4, 8}

4. Interactive: See How R Changes Everything

F (atoms in M) = 16 ← fixed
R (your choice) = 4 the only free parameter
period = F × R = 64 ← derived
repetitions = BM / period = 128 / 64 = 2 ← derived
shape per thread in M = (4, 2) = (R, reps) = 8 elements
permutation_M = (16, 4):(/4, 1) = (F, R):(R, 1)

Thread 0's M-elements (out of 128 rows) — 2 groups of 4 consecutive

Thread 0's elements
Other threads' elements
Period boundary

5. Why R=4? (Performance Reasoning)

Memory transaction sizes on NVIDIA GPUs: 32 bytes = minimum transaction (8 × f32... wait, 32/4 = 8 f32 per cache sector) 128 bytes = one full cache line = 32 × f32 When storing C (row-major, stride=128 for the tile): Each row: threads write to consecutive columns. Thread 0 writes cols {0,1,2,3, 64,65,66,67} — two bursts of 4. With 16 threads writing to the same row: T0: cols 0-3, T1: cols 4-7, ..., T15: cols 60-63 = 64 consecutive elements = 256 bytes = 2 full cache lines, perfectly coalesced! Compare R=1 (scattered): Thread 0 writes cols {0, 8, 16, 24, 32, 40, 48, 56} Thread 1 writes cols {1, 9, 17, 25, 33, 41, 49, 57} Each individual write is just 4 bytes, spread across many cache lines. 16 threads hit 8 different cache lines per store = terrible coalescing! Compare R=8 (one big block): Thread 0 writes cols {0,1,2,3,4,5,6,7} Thread 1 writes cols {8,9,10,11,12,13,14,15} Also coalesced! 16 threads × 8 = 128 elements = 512 bytes. R=8 is also good. But R=4 uses fewer registers for address computation. Rule of thumb: R = 4 for fp32 (4 × 4 bytes = 16 bytes = 128 bits = one vector register) R = 8 for fp16 (8 × 2 bytes = 16 bytes = 128 bits = one vector register) Match R to the vector load/store width of your data type.

6. The Period Visualized

7. All Valid Choices Compared

R Period (F×R) Repetitions Thread 0's pattern Shape Coalescing
1 16 8 [0], [16], [32], [48], [64], [80], [96], [112] (1, 8) Bad — scattered writes
2 32 4 [0,1], [32,33], [64,65], [96,97] (2, 4) OK — 64-bit stores
4 64 2 [0,1,2,3], [64,65,66,67] (4, 2) Great — 128-bit vector stores
8 128 1 [0,1,2,3,4,5,6,7] (8, 1) Great — 256-bit, one contiguous block

8. Permutation for Tensor Cores (mma.sync.m16n8k16)

Different atom = different permutation style

With tensor cores, the atom is already big (16×8×16, 32 threads). Permutation is no longer a layout — it's just integer sizes specifying total coverage.

Step 4 Configuration: Atom: mma.sync.m16n8k16 (16 rows × 8 cols, 32 threads per atom) atoms_layout: (2, 2, 1) (2 in M, 2 in N, 1 in K = 4 atoms = 4 warps) CTA tile: 128 × 128 × 32 NUM_THREADS: 4 atoms × 32 threads/atom = 128 threads Native coverage (atoms_layout × atom_shape): M: 2 atoms × 16 rows/atom = 32 rows N: 2 atoms × 8 cols/atom = 16 cols K: 1 atom × 16 depth = 16 Native coverage = 32 × 16 = 512 elements per "step" Need to cover: 128 × 128 = 16384 elements GAP: 512 vs 16384

Permutation as Integer Sizes

Step 4's code: permutation_mnk = ( ATOM_M * MMA_M, # 2 * 16 = 32 ATOM_N * MMA_N * 2, # 2 * 8 * 2 = 32 ATOM_K * MMA_K, # 1 * 16 = 16 ) This says: "one thread's total responsibility per k-step covers 32M × 32N elements."

Why 32 × 32? Tracing the Math

M-dimension: Native M coverage = atoms_M(2) × atom_M(16) = 32 permutation_M = 32 permutation_M = native M coverage → no expansion in M! Each thread keeps exactly what 2 atoms give it in M. Iterations in M: BM / permutation_M = 128 / 32 = 4 N-dimension: Native N coverage = atoms_N(2) × atom_N(8) = 16 permutation_N = 32 permutation_N > native N coverage → expansion by factor 2! permutation_N / native_N = 32 / 16 = 2 Each thread visits 2 N-positions per atom (doubles N responsibility). Iterations in N: BN / permutation_N = 128 / 32 = 4 K-dimension: permutation_K = 16 = atom_K → one MMA per K-slot K-iterations within BK: BK / permutation_K = 32 / 16 = 2 MMA calls per k-tile Total iterations to cover 128×128: M-iterations × N-iterations = 4 × 4 = 16 MMA calls per k-step × K-iterations per tile = × 2 = 32 MMA calls per k-tile × num k-tiles (K/BK) = depends on problem K

Elements Per Thread

How many C-elements does one thread accumulate? Each atom invocation: 1 thread produces V=4 output elements N-expansion factor: 2 (from permutation) M-iterations: 4 N-iterations: 4 But wait — iterations are across the TILE, not per-thread storage. Each thread stores: V(4) × M_rest × N_rest from partition_C From the atom's perspective (1 warp = 32 threads): One atom: 16×8 = 128 elements / 32 threads = 4 per thread (V=4) 4 atoms per CTA: each thread participates in 1 atom (warp-level) From the tiling perspective: permutation_M = 32, BM = 128 → each thread has 128/32 = 4 "M-slots" permutation_N = 32, BN = 128 → each thread has 128/32 = 4 "N-slots" V per atom = 4 Total per thread: 4 × 4 × 4 = 64 accumulator registers (f32) = 256 bytes of registers per thread for C alone!

Comparison: SIMT vs Tensor Core Permutation

Step 1 (Scalar FMA) Step 4 (mma.sync.m16n8k16)
Atom size 1×1×1 (trivial) 16×8×16 (big)
Permutation format Layout: (16,4):(4,1) Integer sizes: (32, 32, 16)
What it controls Element grouping pattern (R=4 for coalescing) Total coverage size per thread
Why a Layout for step 1? Because with trivial atoms, there's no hardware-defined element pattern — YOU choose it via the stride arrangement
Why integers for step 4? Because the atom already defines the element pattern (hardware TV layout). You just say "how much total" and CuTe tiles it.
Coalescing control Permutation's R value Handled by ldmatrix + swizzled SMEM (not by permutation)
Native coverage 16×16 = 256 (tiny) 32×16 = 512 (already decent)
Expansion needed ×64 (massive) ×32 (moderate)

Key Insight for Tensor Cores

With tensor cores, the atom's TV layout already defines the element access pattern (it's in the silicon). The permutation's job simplifies to just: "how many times do I repeat the atom across my tile?" You no longer get to choose the grouping — NVIDIA chose it for you at the hardware level.

Memory coalescing for tensor cores is instead handled by swizzled shared memory (bank-conflict-free) and ldmatrix (warp-level SMEM→register load that matches the atom's layout).

Visual: How 4 Warps Cover 128×128 in 16 Steps

The 128×128 CTA tile, covered in 4×4 = 16 iterations: Each colored block = one step's coverage (32×32): N: 0-31 32-63 64-95 96-127 ┌─────────┬─────────┬─────────┬─────────┐ M: │ step 0 │ step 1 │ step 2 │ step 3 │ rows 0-31 0-31 │ 32×32 │ 32×32 │ 32×32 │ 32×32 │ ├─────────┼─────────┼─────────┼─────────┤ 32-63 │ step 4 │ step 5 │ step 6 │ step 7 │ rows 32-63 ├─────────┼─────────┼─────────┼─────────┤ 64-95 │ step 8 │ step 9 │ step 10 │ step 11 │ rows 64-95 ├─────────┼─────────┼─────────┼─────────┤ 96-127│ step 12 │ step 13 │ step 14 │ step 15 │ rows 96-127 └─────────┴─────────┴─────────┴─────────┘ Within each 32×32 step: 4 warps execute their atoms (2 in M × 2 in N): ┌───────────┬───────────┐ │ Warp0 │ Warp1 │ atoms_N = 2 → 2×8 = 16 cols │ 16×8 │ 16×8 │ ├───────────┼───────────┤ + N-expansion ×2 │ Warp2 │ Warp3 │ → covers 16×2 = 32 cols total │ 16×8 │ 16×8 │ └───────────┴───────────┘ atoms_M = 2 → 2×16 = 32 rows N-expansion ×2 (from permutation_N=32 vs native 16): Each warp's atom visits 2 adjacent 8-col blocks → 16 cols per warp 2 warps in N → 32 cols total

9. Final Mental Model — Both Cases

━━━ Case 1: SIMT (scalar FMA, step 1) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Atom is trivial (1×1×1) → permutation defines the PATTERN. FIXED: atoms_in_M = 16, BM = 128, elements/thread = 8 FREE: R = consecutive group size (choose 4 for 128-bit stores) DERIVED: period = 16×R, reps = BM/period, shape = (R, reps) Permutation format: Layout — (F, R):(R, 1) You control: grouping pattern for coalesced memory access. ━━━ Case 2: Tensor Core (mma.sync.m16n8k16, step 4) ━━━━━━━━━━━━━━━ Atom is complex (16×8×16) → permutation defines the SIZE only. FIXED: atom shape = 16×8, atoms_layout = (2,2,1), BM=BN=128 FREE: permutation_N multiplier (×2 in step4) DERIVED: iterations = BM/perm_M × BN/perm_N = 4×4 = 16 steps Permutation format: Integer tuple — (32, 32, 16) You control: how much each thread accumulates (more = more registers). You DON'T control: element access pattern (that's in the atom's TV layout). ━━━ The Common Principle ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ In BOTH cases, permutation bridges the gap: native coverage (atoms × atom_shape) → PERMUTATION → full CTA tile The difference is what "bridging" means: SIMT: expand + arrange pattern (layout) TC: expand + let hardware arrangement stand (integer sizes)