TV Layout & MMA Atom Explained

Each thread holds some elements in its registers (private storage). The TV Layout is the map that says:

2. "Fixed by NVIDIA's ISA" — What Does That Mean?

Property	Step 1: Scalar FMA	Step 4: Tensor Core HMMA
PTX instruction	`fma.rn.f32`	`mma.sync.aligned.m16n8k16.f32.f16`
Output size	1×1 = 1 element	16×8 = 128 elements
Threads involved	1	32 (full warp)
Values per thread	1	4
TV Layout shape	`(1, 1)` — trivial	`(32, 4)` — complex, ISA-defined
Thread→element mapping	Obvious (1:1)	Irregular (NVIDIA decides)

3. The Full Mapping: mma.sync.m16n8k16 Output

Here's the exact assignment NVIDIA hardwired for the 16×8 output matrix C.
Each cell shows Tn:vi = thread n, value index i.

	col 0	col 1	col 2	col 3	col 4	col 5	col 6	col 7
row 0	T0:v0	T1:v0	T2:v0	T3:v0	T4:v0	T5:v0	T6:v0	T7:v0
row 1	T0:v1	T1:v1	T2:v1	T3:v1	T4:v1	T5:v1	T6:v1	T7:v1
row 2	T8:v0	T9:v0	T10:v0	T11:v0	T12:v0	T13:v0	T14:v0	T15:v0
row 3	T8:v1	T9:v1	T10:v1	T11:v1	T12:v1	T13:v1	T14:v1	T15:v1
row 4	T16:v0	T17:v0	T18:v0	T19:v0	T20:v0	T21:v0	T22:v0	T23:v0
row 5	T16:v1	T17:v1	T18:v1	T19:v1	T20:v1	T21:v1	T22:v1	T23:v1
row 6	T24:v0	T25:v0	T26:v0	T27:v0	T28:v0	T29:v0	T30:v0	T31:v0
row 7	T24:v1	T25:v1	T26:v1	T27:v1	T28:v1	T29:v1	T30:v1	T31:v1

row 8	T0:v2	T1:v2	T2:v2	T3:v2	T4:v2	T5:v2	T6:v2	T7:v2
row 9	T0:v3	T1:v3	T2:v3	T3:v3	T4:v3	T5:v3	T6:v3	T7:v3
row 10	T8:v2	T9:v2	T10:v2	T11:v2	T12:v2	T13:v2	T14:v2	T15:v2
row 11	T8:v3	T9:v3	T10:v3	T11:v3	T12:v3	T13:v3	T14:v3	T15:v3
row 12	T16:v2	T17:v2	T18:v2	T19:v2	T20:v2	T21:v2	T22:v2	T23:v2
row 13	T16:v3	T17:v3	T18:v3	T19:v3	T20:v3	T21:v3	T22:v3	T23:v3
row 14	T24:v2	T25:v2	T26:v2	T27:v2	T28:v2	T29:v2	T30:v2	T31:v2
row 15	T24:v3	T25:v3	T26:v3	T27:v3	T28:v3	T29:v3	T30:v3	T31:v3

Color key: T0-T7 T8-T15 T16-T23 T24-T31 Bright = v0,v1 (rows 0-7) Dim = v2,v3 (rows 8-15)

4. Zoom In: Thread 0's Four Elements

Thread 0 owns 4 values in its registers after the MMA executes: Register Matrix Position Why these specific cells? ──────── ───────────────── ───────────────────────── v0 → C[ 0, 0] NVIDIA's hardware wiring v1 → C[ 1, 0] decided this pattern. v2 → C[ 8, 0] You can't change it. v3 → C[ 9, 0] It's in the silicon. Visually in the 16×8 matrix: col 0 row 0: [v0] ← pair 1 row 1: [v1] ← row 2: . ... . (other threads' elements) row 7: . row 8: [v2] ← pair 2 (8 rows below pair 1) row 9: [v3] ← ... . row 15: .

5. The TV Layout IS This Table (as Math)

CuTe doesn't store the table — it encodes it as a layout function with fixed strides:

The exact TV_layout_C for mma.sync.m16n8k16 (accumulator D/C): Shape: ( (4, 8), (2, 2) ) Stride: ( (16, 1), (8, 64) ) ─┬──── ─┬──── T V (threads) (values) Both T and V are hierarchical (nested tuples): T = (4, 8) means: thread_id is decomposed as (T_outer, T_inner) T_inner = tid % 8 (which column, 0-7) T_outer = tid / 8 (which row-group, 0-3) V = (2, 2) means: value_index is decomposed as (V_inner, V_outer) V_inner = vid % 2 (which row within a pair) V_outer = vid / 2 (which half: top or bottom)

How the strides map to (row, col) in the 16×8 matrix

The 16×8 matrix is stored in row-major as a flat array: offset = row * 8 + col Now trace each stride: T_inner stride = 1 tid % 8 contributes offset += (tid%8) * 1 → this IS the column! (col = tid % 8) T_outer stride = 16 tid / 8 contributes offset += (tid/8) * 16 16 in row-major(8 cols) = 2 rows → row += (tid/8) * 2 V_inner stride = 8 vid % 2 contributes offset += (vid%2) * 8 8 in row-major(8 cols) = 1 row → row += (vid%2) * 1 V_outer stride = 64 vid / 2 contributes offset += (vid/2) * 64 64 in row-major(8 cols) = 8 rows → row += (vid/2) * 8

Putting it together

Formula (derived from shape + stride): offset(tid, vid) = (tid%8)*1 + (tid/8)*16 + (vid%2)*8 + (vid/2)*64 Converting to (row, col) in the 16×8 matrix: col = tid % 8 row = (tid / 8) * 2 + (vid % 2) + (vid / 2) * 8 Verification: (tid=0, vid=0): col=0, row= 0*2 + 0 + 0*8 = 0 → C[ 0, 0] ✓ (tid=0, vid=1): col=0, row= 0*2 + 1 + 0*8 = 1 → C[ 1, 0] ✓ (tid=0, vid=2): col=0, row= 0*2 + 0 + 1*8 = 8 → C[ 8, 0] ✓ (tid=0, vid=3): col=0, row= 0*2 + 1 + 1*8 = 9 → C[ 9, 0] ✓ (tid=5, vid=1): col=5, row= 0*2 + 1 + 0*8 = 1 → C[ 1, 5] ✓ (tid=8, vid=0): col=0, row= 1*2 + 0 + 0*8 = 2 → C[ 2, 0] ✓ (tid=24, vid=2): col=0, row= 3*2 + 0 + 1*8 = 14 → C[14, 0] ✓ (tid=31, vid=3): col=7, row= 3*2 + 1 + 1*8 = 15 → C[15, 7] ✓

Why hierarchical? Why not just (32, 4)?

If the mapping were simply "stride s_t per thread, stride s_v per value": offset = tid * s_t + vid * s_v Then threads and values would need to be evenly spaced — but they're NOT: Thread 0 → col 0 (offset 0) Thread 1 → col 1 (offset 1) ... Thread 7 → col 7 (offset 7) Thread 8 → row 2 (offset 16) ← NOT offset 8! A single flat stride can't express this jump. The hierarchical shape (4, 8) lets CuTe encode the two different stride patterns: - within a group of 8: stride 1 (columns) - between groups: stride 16 (skip 2 rows)

6. Why CuTe Encodes This (The Payoff)

7. Summary: The Three Layers (Annotated)

┌────────────────────────────────────────────────────────────────────────────────┐ │ │ │ Layer 1: Hardware (NVIDIA ISA) ────────────────────── 🔒 FIXED BY HARDWARE │ │ │ │ "mma.sync.m16n8k16 gives thread 0 elements at (0,0),(1,0),(8,0),(9,0)" │ │ Fixed. Can't change. Different per instruction. │ │ │ │ What's fixed: │ │ • Atom shape (16, 8, 16) in MNK │ │ • Threads per atom (32 = 1 warp) │ │ • Values per thread (4) │ │ • TV_layout strides (16, 1, 8, 64) — the wiring │ │ • Input data format (fp16) and accumulator format (fp32) │ │ │ ├────────────────────────────────────────────────────────────────────────────────┤ │ │ │ Layer 2: Atom (CuTe encoding) ────────────────────── 🔒 FIXED (wrapper) │ │ │ │ TV_layout_C = ((4,8), (2,2)) : ((16,1), (8,64)) │ │ Just a CuTe-readable encoding of the hardware truth. │ │ No user choice here — one-to-one mapping from Layer 1. │ │ │ │ What's fixed: │ │ • Everything from Layer 1, written as CuTe Layout objects │ │ • TV_layout_A, TV_layout_B, TV_layout_C — all defined by ISA │ │ │ ├────────────────────────────────────────────────────────────────────────────────┤ │ │ │ Layer 3: TiledMMA (user-level) ──────────────── 🎨 USER DESIGN CHOICES │ │ │ │ This is where YOUR workload meets the hardware. │ │ You choose how to tile the atom across your CTA tile. │ │ │ │ ┌──────────────────────────────────────────────────────────────────────────┐ │ │ │ Inputs from your problem (not free choices): │ │ │ │ │ │ │ │ • CTA tile size (BM=128, BN=128, BK=32) ← tuning knob, │ │ │ │ but constrained by SMEM size, register budget, occupancy │ │ │ │ • Total threads per block (128) ← must match atoms_layout × 32 │ │ │ │ • Problem size (M, N, K) ← given by the user's matrix dims │ │ │ │ • Data layout of A, B, C in memory ← given by caller │ │ │ └──────────────────────────────────────────────────────────────────────────┘ │ │ │ │ ┌──────────────────────────────────────────────────────────────────────────┐ │ │ │ Your free design parameters: │ │ │ │ │ │ │ │ 1. atoms_layout = (2, 2, 1) │ │ │ │ "How many atoms, arranged how?" │ │ │ │ → 2 in M, 2 in N, 1 in K = 4 atoms = 4 warps = 128 threads │ │ │ │ │ │ │ │ Coverage per step (atoms_layout × atom_shape): │ │ │ │ M: 2 atoms × 16 rows/atom = 32 rows │ │ │ │ N: 2 atoms × 8 cols/atom = 16 cols │ │ │ │ K: 1 atom × 16 depth/atom = 16 reduction │ │ │ │ ─────────────────────────────────────── │ │ │ │ One step covers: 32 × 16 = 512 output elements │ │ │ │ (but need 128×128 = 16384 total, so we iterate) │ │ │ │ │ │ │ │ Tradeoff: more atoms = more parallelism but more threads │ │ │ │ fewer atoms = fewer threads but more work per thread │ │ │ │ │ │ │ │ 2. permutation = (32, 32, 16) │ │ │ │ "How much total coverage per thread across all iterations?" │ │ │ │ → permutation_M = 32, permutation_N = 32, permutation_K = 16 │ │ │ │ │ │ │ │ These must be ≥ atoms_layout × atom_shape: │ │ │ │ M: permutation_M(32) = atoms(2) × atom_M(16) = 32 ✓ exact │ │ │ │ N: permutation_N(32) = atoms(2) × atom_N(8) × 2 = 32 │ │ │ │ ↑ the ×2 means each atom "visits" 2 N-positions │ │ │ │ │ │ │ │ Iterations to cover the full CTA tile: │ │ │ │ M: BM / permutation_M = 128 / 32 = 4 iterations in M │ │ │ │ N: BN / permutation_N = 128 / 32 = 4 iterations in N │ │ │ │ Total: 4 × 4 = 16 iterations to cover full 128×128 │ │ │ │ │ │ │ │ Tradeoff: larger permutation = more elements per thread │ │ │ │ = more registers = potentially lower occupancy │ │ │ │ │ │ │ │ 3. BK (tile depth) = 32 │ │ │ │ "How much K-reduction per SMEM load?" │ │ │ │ → BK / atom_K = 32 / 16 = 2 MMA calls per K-tile │ │ │ │ → Total K iterations: K / BK (outer loop over K) │ │ │ │ │ │ │ │ Tradeoff: larger BK = fewer SMEM loads but more SMEM used │ │ │ └──────────────────────────────────────────────────────────────────────────┘ │ │ │ │ ┌──────────────────────────────────────────────────────────────────────────┐ │ │ │ Derived automatically (you don't compute these): │ │ │ │ │ │ │ │ • Thread→element mapping for all threads (from atom TV + tiling) │ │ │ │ • partition_C/A/B output shapes and strides │ │ │ │ • Grid dimensions = ceil(M/BM) × ceil(N/BN) │ │ │ │ • Number of K-iterations = K / BK │ │ │ │ • Registers per thread = V × M_rest × N_rest (from partition_C) │ │ │ └──────────────────────────────────────────────────────────────────────────┘ │ │ │ └────────────────────────────────────────────────────────────────────────────────┘

Parameter	Source	Who decides?	Can you change it?
`mma.sync.m16n8k16`	NVIDIA hardware	NVIDIA (silicon)	No. Pick a different instruction if you want different.
TV_layout strides `(16,1,8,64)`	ISA specification	NVIDIA (ISA doc)	No. It's how the tensor core routes data.
Atom shape `(16, 8, 16)`	Instruction definition	NVIDIA (fixed)	No. Choose a different MMA for different shape.
Threads per atom `32`	Instruction definition	NVIDIA (fixed)	No. It's always 1 warp for mma.sync.
`atoms_layout = (2,2,1)`	Kernel design	YOU (kernel author)	Yes! Could be (1,1,1), (2,4,1), (4,2,1)...
`permutation = (32,32,16)`	Kernel design	YOU (kernel author)	Yes! Controls coverage and element layout.
`BM, BN, BK = 128,128,32`	Kernel design	YOU (kernel author)	Yes! Must be multiples of atom dims. Tuning knobs.
`NUM_THREADS = 128`	Kernel design	YOU (constrained)	= atoms_layout product × 32. Follows from atoms_layout.
Problem size `(M, N, K)`	Workload	User/application	Given to you. Must be divisible by tile (or add predication).
Data layouts `(M,K):(1,M)`	Input format	User/application	Given to you. Affects copy atom choice.
`partition_C` output shape	Derived	Auto (CuTe)	You don't choose this — it falls out of the above.
Grid dimensions	Derived	Auto (CuTe)	= ceil(M/BM) × ceil(N/BN). Determined by tile + problem.
Registers per thread	Derived	Auto (CuTe)	Consequence of your choices above. Check occupancy!

TV Layout & MMA Atom — Demystified

1. Three Things, One Mapping

Threads

Matrix Elements

2. "Fixed by NVIDIA's ISA" — What Does That Mean?

ISA = Instruction Set Architecture = the hardware's rule book

3. The Full Mapping: `mma.sync.m16n8k16` Output

4. Zoom In: Thread 0's Four Elements

Pattern for ALL threads (read from the table above):

5. The TV Layout IS This Table (as Math)

How the strides map to (row, col) in the 16×8 matrix

Putting it together

Key Insight: The Strides ARE the Hardware Wiring

Why hierarchical? Why not just (32, 4)?

6. Why CuTe Encodes This (The Payoff)

Without CuTe (Raw CUDA)

With CuTe

7. Summary: The Three Layers (Annotated)

8. Design Decision Map

The Mental Model at Layer 3

One-sentence summary:

TV Layout & MMA Atom — Demystified

1. Three Things, One Mapping

Threads

Matrix Elements

2. "Fixed by NVIDIA's ISA" — What Does That Mean?

ISA = Instruction Set Architecture = the hardware's rule book

3. The Full Mapping: mma.sync.m16n8k16 Output

4. Zoom In: Thread 0's Four Elements

Pattern for ALL threads (read from the table above):

5. The TV Layout IS This Table (as Math)

How the strides map to (row, col) in the 16×8 matrix

Putting it together

Key Insight: The Strides ARE the Hardware Wiring

Why hierarchical? Why not just (32, 4)?

6. Why CuTe Encodes This (The Payoff)

Without CuTe (Raw CUDA)

With CuTe

7. Summary: The Three Layers (Annotated)

8. Design Decision Map

The Mental Model at Layer 3

One-sentence summary:

3. The Full Mapping: `mma.sync.m16n8k16` Output