TV Layout & MMA Atom — Demystified

1. Three Things, One Mapping

An MMA instruction involves three things:

Threads

32 threads in a warp
(the workers)

Matrix Elements

16×8 = 128 output values
(the work)

Each thread holds some elements in its registers (private storage). The TV Layout is the map that says:

TV Layout = given (thread_id, value_index), which (row, col) in the matrix?

Think of a warehouse with 128 packages (matrix elements) and 32 workers (threads). Each worker carries exactly 4 packages in their backpack (registers). The TV layout is the assignment sheet that says "Worker #5, your 4 packages are at shelf positions (0,5), (1,5), (8,5), (9,5)."

2. "Fixed by NVIDIA's ISA" — What Does That Mean?

ISA = Instruction Set Architecture = the hardware's rule book

When NVIDIA designed the mma.sync.m16n8k16 instruction, they hardwired which thread gets which output element. It's baked into the chip's silicon. You cannot change it, just like you can't change that 1 + 1 = 2 on your CPU.

The warehouse manager (NVIDIA) has already assigned the packages to workers. You (the programmer) just got a sheet telling you the assignments. You can't reassign — the conveyor belt is already built that way.

Compare the two atom types in our tutorial:

Property Step 1: Scalar FMA Step 4: Tensor Core HMMA
PTX instruction fma.rn.f32 mma.sync.aligned.m16n8k16.f32.f16
Output size 1×1 = 1 element 16×8 = 128 elements
Threads involved 1 32 (full warp)
Values per thread 1 4
TV Layout shape (1, 1) — trivial (32, 4) — complex, ISA-defined
Thread→element mapping Obvious (1:1) Irregular (NVIDIA decides)

3. The Full Mapping: mma.sync.m16n8k16 Output

Here's the exact assignment NVIDIA hardwired for the 16×8 output matrix C.
Each cell shows Tn:vi = thread n, value index i.

col 0col 1col 2col 3 col 4col 5col 6col 7
row 0 T0:v0T1:v0T2:v0T3:v0 T4:v0T5:v0T6:v0T7:v0
row 1 T0:v1T1:v1T2:v1T3:v1 T4:v1T5:v1T6:v1T7:v1
row 2 T8:v0T9:v0T10:v0T11:v0 T12:v0T13:v0T14:v0T15:v0
row 3 T8:v1T9:v1T10:v1T11:v1 T12:v1T13:v1T14:v1T15:v1
row 4 T16:v0T17:v0T18:v0T19:v0 T20:v0T21:v0T22:v0T23:v0
row 5 T16:v1T17:v1T18:v1T19:v1 T20:v1T21:v1T22:v1T23:v1
row 6 T24:v0T25:v0T26:v0T27:v0 T28:v0T29:v0T30:v0T31:v0
row 7 T24:v1T25:v1T26:v1T27:v1 T28:v1T29:v1T30:v1T31:v1
row 8 T0:v2T1:v2T2:v2T3:v2 T4:v2T5:v2T6:v2T7:v2
row 9 T0:v3T1:v3T2:v3T3:v3 T4:v3T5:v3T6:v3T7:v3
row 10 T8:v2T9:v2T10:v2T11:v2 T12:v2T13:v2T14:v2T15:v2
row 11 T8:v3T9:v3T10:v3T11:v3 T12:v3T13:v3T14:v3T15:v3
row 12 T16:v2T17:v2T18:v2T19:v2 T20:v2T21:v2T22:v2T23:v2
row 13 T16:v3T17:v3T18:v3T19:v3 T20:v3T21:v3T22:v3T23:v3
row 14 T24:v2T25:v2T26:v2T27:v2 T28:v2T29:v2T30:v2T31:v2
row 15 T24:v3T25:v3T26:v3T27:v3 T28:v3T29:v3T30:v3T31:v3

Color key: T0-T7 T8-T15 T16-T23 T24-T31   Bright = v0,v1 (rows 0-7)   Dim = v2,v3 (rows 8-15)

4. Zoom In: Thread 0's Four Elements

Thread 0 owns 4 values in its registers after the MMA executes: Register Matrix Position Why these specific cells? ──────── ───────────────── ───────────────────────── v0 → C[ 0, 0] NVIDIA's hardware wiring v1 → C[ 1, 0] decided this pattern. v2 → C[ 8, 0] You can't change it. v3 → C[ 9, 0] It's in the silicon. Visually in the 16×8 matrix: col 0 row 0: [v0] ← pair 1 row 1: [v1] ← row 2: . ... . (other threads' elements) row 7: . row 8: [v2] ← pair 2 (8 rows below pair 1) row 9: [v3] ← ... . row 15: .

Pattern for ALL threads (read from the table above):

5. The TV Layout IS This Table (as Math)

CuTe doesn't store the table — it encodes it as a layout function with fixed strides:

The exact TV_layout_C for mma.sync.m16n8k16 (accumulator D/C): Shape: ( (4, 8), (2, 2) ) Stride: ( (16, 1), (8, 64) ) ─┬──── ─┬──── T V (threads) (values) Both T and V are hierarchical (nested tuples): T = (4, 8) means: thread_id is decomposed as (T_outer, T_inner) T_inner = tid % 8 (which column, 0-7) T_outer = tid / 8 (which row-group, 0-3) V = (2, 2) means: value_index is decomposed as (V_inner, V_outer) V_inner = vid % 2 (which row within a pair) V_outer = vid / 2 (which half: top or bottom)

How the strides map to (row, col) in the 16×8 matrix

The 16×8 matrix is stored in row-major as a flat array: offset = row * 8 + col Now trace each stride: T_inner stride = 1 tid % 8 contributes offset += (tid%8) * 1 → this IS the column! (col = tid % 8) T_outer stride = 16 tid / 8 contributes offset += (tid/8) * 16 16 in row-major(8 cols) = 2 rows → row += (tid/8) * 2 V_inner stride = 8 vid % 2 contributes offset += (vid%2) * 8 8 in row-major(8 cols) = 1 row → row += (vid%2) * 1 V_outer stride = 64 vid / 2 contributes offset += (vid/2) * 64 64 in row-major(8 cols) = 8 rows → row += (vid/2) * 8

Putting it together

Formula (derived from shape + stride): offset(tid, vid) = (tid%8)*1 + (tid/8)*16 + (vid%2)*8 + (vid/2)*64 Converting to (row, col) in the 16×8 matrix: col = tid % 8 row = (tid / 8) * 2 + (vid % 2) + (vid / 2) * 8 Verification: (tid=0, vid=0): col=0, row= 0*2 + 0 + 0*8 = 0 → C[ 0, 0] (tid=0, vid=1): col=0, row= 0*2 + 1 + 0*8 = 1 → C[ 1, 0] (tid=0, vid=2): col=0, row= 0*2 + 0 + 1*8 = 8 → C[ 8, 0] (tid=0, vid=3): col=0, row= 0*2 + 1 + 1*8 = 9 → C[ 9, 0] (tid=5, vid=1): col=5, row= 0*2 + 1 + 0*8 = 1 → C[ 1, 5] (tid=8, vid=0): col=0, row= 1*2 + 0 + 0*8 = 2 → C[ 2, 0] (tid=24, vid=2): col=0, row= 3*2 + 0 + 1*8 = 14 → C[14, 0] (tid=31, vid=3): col=7, row= 3*2 + 1 + 1*8 = 15 → C[15, 7]

Key Insight: The Strides ARE the Hardware Wiring

The four numbers (16, 1, 8, 64) are permanently fixed for mma.sync.m16n8k16. They encode exactly how NVIDIA's tensor core routes data to thread registers. A different instruction (e.g., mma.sync.m16n8k8 or wgmma.m64n256k16) would have different fixed strides.

Why hierarchical? Why not just (32, 4)?

If the mapping were simply "stride s_t per thread, stride s_v per value": offset = tid * s_t + vid * s_v Then threads and values would need to be evenly spaced — but they're NOT: Thread 0 → col 0 (offset 0) Thread 1 → col 1 (offset 1) ... Thread 7 → col 7 (offset 7) Thread 8 → row 2 (offset 16) ← NOT offset 8! A single flat stride can't express this jump. The hierarchical shape (4, 8) lets CuTe encode the two different stride patterns: - within a group of 8: stride 1 (columns) - between groups: stride 16 (skip 2 rows)
The TV layout is like a formula version of the lookup table. Instead of storing all 128 entries, it uses 4 numbers (the strides 16, 1, 8, 64) that can compute any entry on the fly. CuTe's hierarchical layout makes this possible because the pattern, while irregular-looking, is built from regular sub-patterns.

6. Why CuTe Encodes This (The Payoff)

Without CuTe (Raw CUDA)

// After mma.sync, results are in
// registers d0, d1, d2, d3.
// You must KNOW the mapping to store:
int lane = threadIdx.x % 32;
int col = lane % 8;
int row_base = (lane / 8) * 2;

// Store pair 1 (top half)
C[(row_base + 0) * N + col] = d0;
C[(row_base + 1) * N + col] = d1;
// Store pair 2 (bottom half, +8)
C[(row_base + 8) * N + col] = d2;
C[(row_base + 9) * N + col] = d3;

// Change to mma.m16n8k8?
// ALL index math must be rewritten!
// Change to wmma.m16n16k16?
// Completely different layout!

With CuTe

# The atom KNOWS the mapping
atom = MmaF16BF16Op(f16, f32, (16,8,16))

# partition_C uses the atom's TV layout
thr_mma = tiled_mma.get_slice(tidx)
tCgC = thr_mma.partition_C(gC)

# Copy results to global memory
cute.copy(copy_atom, tCrC, tCgC)
# ^^^ correct for ANY atom!

# Change to a different MMA?
# Just swap the atom.
# partition_C and copy auto-adapt.
# Zero index math to update.

7. Summary: The Three Layers (Annotated)

┌────────────────────────────────────────────────────────────────────────────────┐ │ │ │ Layer 1: Hardware (NVIDIA ISA) ────────────────────── 🔒 FIXED BY HARDWARE │ │ │ │ "mma.sync.m16n8k16 gives thread 0 elements at (0,0),(1,0),(8,0),(9,0)" │ │ Fixed. Can't change. Different per instruction. │ │ │ │ What's fixed: │ │ • Atom shape (16, 8, 16) in MNK │ │ • Threads per atom (32 = 1 warp) │ │ • Values per thread (4) │ │ • TV_layout strides (16, 1, 8, 64) — the wiring │ │ • Input data format (fp16) and accumulator format (fp32) │ │ │ ├────────────────────────────────────────────────────────────────────────────────┤ │ │ │ Layer 2: Atom (CuTe encoding) ────────────────────── 🔒 FIXED (wrapper) │ │ │ │ TV_layout_C = ((4,8), (2,2)) : ((16,1), (8,64)) │ │ Just a CuTe-readable encoding of the hardware truth. │ │ No user choice here — one-to-one mapping from Layer 1. │ │ │ │ What's fixed: │ │ • Everything from Layer 1, written as CuTe Layout objects │ │ • TV_layout_A, TV_layout_B, TV_layout_C — all defined by ISA │ │ │ ├────────────────────────────────────────────────────────────────────────────────┤ │ │ │ Layer 3: TiledMMA (user-level) ──────────────── 🎨 USER DESIGN CHOICES │ │ │ │ This is where YOUR workload meets the hardware. │ │ You choose how to tile the atom across your CTA tile. │ │ │ │ ┌──────────────────────────────────────────────────────────────────────────┐ │ │ │ Inputs from your problem (not free choices): │ │ │ │ │ │ │ │ • CTA tile size (BM=128, BN=128, BK=32) ← tuning knob, │ │ │ │ but constrained by SMEM size, register budget, occupancy │ │ │ │ • Total threads per block (128) ← must match atoms_layout × 32 │ │ │ │ • Problem size (M, N, K) ← given by the user's matrix dims │ │ │ │ • Data layout of A, B, C in memory ← given by caller │ │ │ └──────────────────────────────────────────────────────────────────────────┘ │ │ │ │ ┌──────────────────────────────────────────────────────────────────────────┐ │ │ │ Your free design parameters: │ │ │ │ │ │ │ │ 1. atoms_layout = (2, 2, 1) │ │ │ │ "How many atoms, arranged how?" │ │ │ │ → 2 in M, 2 in N, 1 in K = 4 atoms = 4 warps = 128 threads │ │ │ │ │ │ │ │ Coverage per step (atoms_layout × atom_shape): │ │ │ │ M: 2 atoms × 16 rows/atom = 32 rows │ │ │ │ N: 2 atoms × 8 cols/atom = 16 cols │ │ │ │ K: 1 atom × 16 depth/atom = 16 reduction │ │ │ │ ─────────────────────────────────────── │ │ │ │ One step covers: 32 × 16 = 512 output elements │ │ │ │ (but need 128×128 = 16384 total, so we iterate) │ │ │ │ │ │ │ │ Tradeoff: more atoms = more parallelism but more threads │ │ │ │ fewer atoms = fewer threads but more work per thread │ │ │ │ │ │ │ │ 2. permutation = (32, 32, 16) │ │ │ │ "How much total coverage per thread across all iterations?" │ │ │ │ → permutation_M = 32, permutation_N = 32, permutation_K = 16 │ │ │ │ │ │ │ │ These must be ≥ atoms_layout × atom_shape: │ │ │ │ M: permutation_M(32) = atoms(2) × atom_M(16) = 32 ✓ exact │ │ │ │ N: permutation_N(32) = atoms(2) × atom_N(8) × 2 = 32 │ │ │ │ ↑ the ×2 means each atom "visits" 2 N-positions │ │ │ │ │ │ │ │ Iterations to cover the full CTA tile: │ │ │ │ M: BM / permutation_M = 128 / 32 = 4 iterations in M │ │ │ │ N: BN / permutation_N = 128 / 32 = 4 iterations in N │ │ │ │ Total: 4 × 4 = 16 iterations to cover full 128×128 │ │ │ │ │ │ │ │ Tradeoff: larger permutation = more elements per thread │ │ │ │ = more registers = potentially lower occupancy │ │ │ │ │ │ │ │ 3. BK (tile depth) = 32 │ │ │ │ "How much K-reduction per SMEM load?" │ │ │ │ → BK / atom_K = 32 / 16 = 2 MMA calls per K-tile │ │ │ │ → Total K iterations: K / BK (outer loop over K) │ │ │ │ │ │ │ │ Tradeoff: larger BK = fewer SMEM loads but more SMEM used │ │ │ └──────────────────────────────────────────────────────────────────────────┘ │ │ │ │ ┌──────────────────────────────────────────────────────────────────────────┐ │ │ │ Derived automatically (you don't compute these): │ │ │ │ │ │ │ │ • Thread→element mapping for all threads (from atom TV + tiling) │ │ │ │ • partition_C/A/B output shapes and strides │ │ │ │ • Grid dimensions = ceil(M/BM) × ceil(N/BN) │ │ │ │ • Number of K-iterations = K / BK │ │ │ │ • Registers per thread = V × M_rest × N_rest (from partition_C) │ │ │ └──────────────────────────────────────────────────────────────────────────┘ │ │ │ └────────────────────────────────────────────────────────────────────────────────┘

8. Design Decision Map

Parameter Source Who decides? Can you change it?
mma.sync.m16n8k16 NVIDIA hardware NVIDIA (silicon) No. Pick a different instruction if you want different.
TV_layout strides (16,1,8,64) ISA specification NVIDIA (ISA doc) No. It's how the tensor core routes data.
Atom shape (16, 8, 16) Instruction definition NVIDIA (fixed) No. Choose a different MMA for different shape.
Threads per atom 32 Instruction definition NVIDIA (fixed) No. It's always 1 warp for mma.sync.
atoms_layout = (2,2,1) Kernel design YOU (kernel author) Yes! Could be (1,1,1), (2,4,1), (4,2,1)...
permutation = (32,32,16) Kernel design YOU (kernel author) Yes! Controls coverage and element layout.
BM, BN, BK = 128,128,32 Kernel design YOU (kernel author) Yes! Must be multiples of atom dims. Tuning knobs.
NUM_THREADS = 128 Kernel design YOU (constrained) = atoms_layout product × 32. Follows from atoms_layout.
Problem size (M, N, K) Workload User/application Given to you. Must be divisible by tile (or add predication).
Data layouts (M,K):(1,M) Input format User/application Given to you. Affects copy atom choice.
partition_C output shape Derived Auto (CuTe) You don't choose this — it falls out of the above.
Grid dimensions Derived Auto (CuTe) = ceil(M/BM) × ceil(N/BN). Determined by tile + problem.
Registers per thread Derived Auto (CuTe) Consequence of your choices above. Check occupancy!

The Mental Model at Layer 3

You're solving a tiling puzzle: given a fixed-size stamp (the atom) and a big rectangle (your CTA tile), decide:

  1. How many stamps to use simultaneously → atoms_layout
  2. How to iterate the stamps across the rectangle → permutation
  3. How deep to go in K before switching tiles → BK

The constraints are: shared memory capacity (limits BM×BK + BN×BK), register file (limits elements per thread), and occupancy (limits threads per SM). Your job is finding the sweet spot.

One-sentence summary:

The TV layout is CuTe's encoding of NVIDIA's hardware-mandated thread→element assignment, turning an opaque ISA rule into composable layout algebra that makes partition_C work automatically for any MMA instruction.