TV Layout & MMA Atom — Demystified
1. Three Things, One Mapping
An MMA instruction involves three things:
Threads
32 threads in a warp
(the workers)
↔
Matrix Elements
16×8 = 128 output values
(the work)
Each thread holds some elements in its registers (private storage). The TV Layout is the map that says:
TV Layout = given (thread_id, value_index), which (row, col) in the matrix?
Think of a warehouse with 128 packages (matrix elements) and 32 workers (threads). Each worker carries exactly 4 packages in their backpack (registers). The TV layout is the assignment sheet that says "Worker #5, your 4 packages are at shelf positions (0,5), (1,5), (8,5), (9,5)."
2. "Fixed by NVIDIA's ISA" — What Does That Mean?
ISA = Instruction Set Architecture = the hardware's rule book
When NVIDIA designed the mma.sync.m16n8k16 instruction, they hardwired which thread gets which output element. It's baked into the chip's silicon. You cannot change it, just like you can't change that 1 + 1 = 2 on your CPU.
The warehouse manager (NVIDIA) has already assigned the packages to workers. You (the programmer) just got a sheet telling you the assignments. You can't reassign — the conveyor belt is already built that way.
Compare the two atom types in our tutorial:
| Property |
Step 1: Scalar FMA |
Step 4: Tensor Core HMMA |
| PTX instruction |
fma.rn.f32 |
mma.sync.aligned.m16n8k16.f32.f16 |
| Output size |
1×1 = 1 element |
16×8 = 128 elements |
| Threads involved |
1 |
32 (full warp) |
| Values per thread |
1 |
4 |
| TV Layout shape |
(1, 1) — trivial |
(32, 4) — complex, ISA-defined |
| Thread→element mapping |
Obvious (1:1) |
Irregular (NVIDIA decides) |
3. The Full Mapping: mma.sync.m16n8k16 Output
Here's the exact assignment NVIDIA hardwired for the 16×8 output matrix C.
Each cell shows Tn:vi = thread n, value index i.
|
col 0 | col 1 | col 2 | col 3 |
col 4 | col 5 | col 6 | col 7 |
| row 0 |
T0:v0 | T1:v0 | T2:v0 | T3:v0 |
T4:v0 | T5:v0 | T6:v0 | T7:v0 |
| row 1 |
T0:v1 | T1:v1 | T2:v1 | T3:v1 |
T4:v1 | T5:v1 | T6:v1 | T7:v1 |
| row 2 |
T8:v0 | T9:v0 | T10:v0 | T11:v0 |
T12:v0 | T13:v0 | T14:v0 | T15:v0 |
| row 3 |
T8:v1 | T9:v1 | T10:v1 | T11:v1 |
T12:v1 | T13:v1 | T14:v1 | T15:v1 |
| row 4 |
T16:v0 | T17:v0 | T18:v0 | T19:v0 |
T20:v0 | T21:v0 | T22:v0 | T23:v0 |
| row 5 |
T16:v1 | T17:v1 | T18:v1 | T19:v1 |
T20:v1 | T21:v1 | T22:v1 | T23:v1 |
| row 6 |
T24:v0 | T25:v0 | T26:v0 | T27:v0 |
T28:v0 | T29:v0 | T30:v0 | T31:v0 |
| row 7 |
T24:v1 | T25:v1 | T26:v1 | T27:v1 |
T28:v1 | T29:v1 | T30:v1 | T31:v1 |
|
| row 8 |
T0:v2 | T1:v2 | T2:v2 | T3:v2 |
T4:v2 | T5:v2 | T6:v2 | T7:v2 |
| row 9 |
T0:v3 | T1:v3 | T2:v3 | T3:v3 |
T4:v3 | T5:v3 | T6:v3 | T7:v3 |
| row 10 |
T8:v2 | T9:v2 | T10:v2 | T11:v2 |
T12:v2 | T13:v2 | T14:v2 | T15:v2 |
| row 11 |
T8:v3 | T9:v3 | T10:v3 | T11:v3 |
T12:v3 | T13:v3 | T14:v3 | T15:v3 |
| row 12 |
T16:v2 | T17:v2 | T18:v2 | T19:v2 |
T20:v2 | T21:v2 | T22:v2 | T23:v2 |
| row 13 |
T16:v3 | T17:v3 | T18:v3 | T19:v3 |
T20:v3 | T21:v3 | T22:v3 | T23:v3 |
| row 14 |
T24:v2 | T25:v2 | T26:v2 | T27:v2 |
T28:v2 | T29:v2 | T30:v2 | T31:v2 |
| row 15 |
T24:v3 | T25:v3 | T26:v3 | T27:v3 |
T28:v3 | T29:v3 | T30:v3 | T31:v3 |
Color key: T0-T7
T8-T15
T16-T23
T24-T31
Bright = v0,v1 (rows 0-7) Dim = v2,v3 (rows 8-15)
4. Zoom In: Thread 0's Four Elements
Thread 0 owns 4 values in its registers after the MMA executes:
Register Matrix Position Why these specific cells?
──────── ───────────────── ─────────────────────────
v0 → C[ 0, 0] NVIDIA's hardware wiring
v1 → C[ 1, 0] decided this pattern.
v2 → C[ 8, 0] You can't change it.
v3 → C[ 9, 0] It's in the silicon.
Visually in the 16×8 matrix:
col 0
row 0: [v0] ← pair 1
row 1: [v1] ←
row 2: .
... . (other threads' elements)
row 7: .
row 8: [v2] ← pair 2 (8 rows below pair 1)
row 9: [v3] ←
... .
row 15: .
Pattern for ALL threads (read from the table above):
- Column =
thread_id % 8 (threads 0-7 own columns 0-7)
- Row group =
(thread_id / 8) * 2 (threads 0-7 on rows 0-1, threads 8-15 on rows 2-3, etc.)
- v0,v1 in top half (rows 0-7), v2,v3 in bottom half (rows 8-15), same column
- Each thread always has 2 pairs of 2 adjacent rows, 8 rows apart
5. The TV Layout IS This Table (as Math)
CuTe doesn't store the table — it encodes it as a layout function with fixed strides:
The exact TV_layout_C for mma.sync.m16n8k16 (accumulator D/C):
Shape: ( (4, 8), (2, 2) )
Stride: ( (16, 1), (8, 64) )
─┬──── ─┬────
T V
(threads) (values)
Both T and V are hierarchical (nested tuples):
T = (4, 8) means: thread_id is decomposed as (T_outer, T_inner)
T_inner = tid % 8 (which column, 0-7)
T_outer = tid / 8 (which row-group, 0-3)
V = (2, 2) means: value_index is decomposed as (V_inner, V_outer)
V_inner = vid % 2 (which row within a pair)
V_outer = vid / 2 (which half: top or bottom)
How the strides map to (row, col) in the 16×8 matrix
The 16×8 matrix is stored in row-major as a flat array: offset = row * 8 + col
Now trace each stride:
T_inner stride = 1
tid % 8 contributes offset += (tid%8) * 1
→ this IS the column! (col = tid % 8)
T_outer stride = 16
tid / 8 contributes offset += (tid/8) * 16
16 in row-major(8 cols) = 2 rows → row += (tid/8) * 2
V_inner stride = 8
vid % 2 contributes offset += (vid%2) * 8
8 in row-major(8 cols) = 1 row → row += (vid%2) * 1
V_outer stride = 64
vid / 2 contributes offset += (vid/2) * 64
64 in row-major(8 cols) = 8 rows → row += (vid/2) * 8
Putting it together
Formula (derived from shape + stride):
offset(tid, vid) = (tid%8)*1 + (tid/8)*16 + (vid%2)*8 + (vid/2)*64
Converting to (row, col) in the 16×8 matrix:
col = tid % 8
row = (tid / 8) * 2 + (vid % 2) + (vid / 2) * 8
Verification:
(tid=0, vid=0): col=0, row= 0*2 + 0 + 0*8 = 0 → C[ 0, 0] ✓
(tid=0, vid=1): col=0, row= 0*2 + 1 + 0*8 = 1 → C[ 1, 0] ✓
(tid=0, vid=2): col=0, row= 0*2 + 0 + 1*8 = 8 → C[ 8, 0] ✓
(tid=0, vid=3): col=0, row= 0*2 + 1 + 1*8 = 9 → C[ 9, 0] ✓
(tid=5, vid=1): col=5, row= 0*2 + 1 + 0*8 = 1 → C[ 1, 5] ✓
(tid=8, vid=0): col=0, row= 1*2 + 0 + 0*8 = 2 → C[ 2, 0] ✓
(tid=24, vid=2): col=0, row= 3*2 + 0 + 1*8 = 14 → C[14, 0] ✓
(tid=31, vid=3): col=7, row= 3*2 + 1 + 1*8 = 15 → C[15, 7] ✓
Key Insight: The Strides ARE the Hardware Wiring
The four numbers (16, 1, 8, 64) are permanently fixed for mma.sync.m16n8k16. They encode exactly how NVIDIA's tensor core routes data to thread registers. A different instruction (e.g., mma.sync.m16n8k8 or wgmma.m64n256k16) would have different fixed strides.
Why hierarchical? Why not just (32, 4)?
If the mapping were simply "stride s_t per thread, stride s_v per value":
offset = tid * s_t + vid * s_v
Then threads and values would need to be evenly spaced — but they're NOT:
Thread 0 → col 0 (offset 0)
Thread 1 → col 1 (offset 1)
...
Thread 7 → col 7 (offset 7)
Thread 8 → row 2 (offset 16) ← NOT offset 8!
A single flat stride can't express this jump. The hierarchical shape (4, 8)
lets CuTe encode the two different stride patterns:
- within a group of 8: stride 1 (columns)
- between groups: stride 16 (skip 2 rows)
The TV layout is like a formula version of the lookup table. Instead of storing all 128 entries, it uses 4 numbers (the strides 16, 1, 8, 64) that can compute any entry on the fly. CuTe's hierarchical layout makes this possible because the pattern, while irregular-looking, is built from regular sub-patterns.
6. Why CuTe Encodes This (The Payoff)
Without CuTe (Raw CUDA)
// After mma.sync, results are in
// registers d0, d1, d2, d3.
// You must KNOW the mapping to store:
int lane = threadIdx.x % 32;
int col = lane % 8;
int row_base = (lane / 8) * 2;
// Store pair 1 (top half)
C[(row_base + 0) * N + col] = d0;
C[(row_base + 1) * N + col] = d1;
// Store pair 2 (bottom half, +8)
C[(row_base + 8) * N + col] = d2;
C[(row_base + 9) * N + col] = d3;
// Change to mma.m16n8k8?
// ALL index math must be rewritten!
// Change to wmma.m16n16k16?
// Completely different layout!
With CuTe
# The atom KNOWS the mapping
atom = MmaF16BF16Op(f16, f32, (16,8,16))
# partition_C uses the atom's TV layout
thr_mma = tiled_mma.get_slice(tidx)
tCgC = thr_mma.partition_C(gC)
# Copy results to global memory
cute.copy(copy_atom, tCrC, tCgC)
# ^^^ correct for ANY atom!
# Change to a different MMA?
# Just swap the atom.
# partition_C and copy auto-adapt.
# Zero index math to update.
7. Summary: The Three Layers (Annotated)
┌────────────────────────────────────────────────────────────────────────────────┐
│ │
│ Layer 1: Hardware (NVIDIA ISA) ────────────────────── 🔒 FIXED BY HARDWARE │
│ │
│ "mma.sync.m16n8k16 gives thread 0 elements at (0,0),(1,0),(8,0),(9,0)" │
│ Fixed. Can't change. Different per instruction. │
│ │
│ What's fixed: │
│ • Atom shape (16, 8, 16) in MNK │
│ • Threads per atom (32 = 1 warp) │
│ • Values per thread (4) │
│ • TV_layout strides (16, 1, 8, 64) — the wiring │
│ • Input data format (fp16) and accumulator format (fp32) │
│ │
├────────────────────────────────────────────────────────────────────────────────┤
│ │
│ Layer 2: Atom (CuTe encoding) ────────────────────── 🔒 FIXED (wrapper) │
│ │
│ TV_layout_C = ((4,8), (2,2)) : ((16,1), (8,64)) │
│ Just a CuTe-readable encoding of the hardware truth. │
│ No user choice here — one-to-one mapping from Layer 1. │
│ │
│ What's fixed: │
│ • Everything from Layer 1, written as CuTe Layout objects │
│ • TV_layout_A, TV_layout_B, TV_layout_C — all defined by ISA │
│ │
├────────────────────────────────────────────────────────────────────────────────┤
│ │
│ Layer 3: TiledMMA (user-level) ──────────────── 🎨 USER DESIGN CHOICES │
│ │
│ This is where YOUR workload meets the hardware. │
│ You choose how to tile the atom across your CTA tile. │
│ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ Inputs from your problem (not free choices): │ │
│ │ │ │
│ │ • CTA tile size (BM=128, BN=128, BK=32) ← tuning knob, │ │
│ │ but constrained by SMEM size, register budget, occupancy │ │
│ │ • Total threads per block (128) ← must match atoms_layout × 32 │ │
│ │ • Problem size (M, N, K) ← given by the user's matrix dims │ │
│ │ • Data layout of A, B, C in memory ← given by caller │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ Your free design parameters: │ │
│ │ │ │
│ │ 1. atoms_layout = (2, 2, 1) │ │
│ │ "How many atoms, arranged how?" │ │
│ │ → 2 in M, 2 in N, 1 in K = 4 atoms = 4 warps = 128 threads │ │
│ │ │ │
│ │ Coverage per step (atoms_layout × atom_shape): │ │
│ │ M: 2 atoms × 16 rows/atom = 32 rows │ │
│ │ N: 2 atoms × 8 cols/atom = 16 cols │ │
│ │ K: 1 atom × 16 depth/atom = 16 reduction │ │
│ │ ─────────────────────────────────────── │ │
│ │ One step covers: 32 × 16 = 512 output elements │ │
│ │ (but need 128×128 = 16384 total, so we iterate) │ │
│ │ │ │
│ │ Tradeoff: more atoms = more parallelism but more threads │ │
│ │ fewer atoms = fewer threads but more work per thread │ │
│ │ │ │
│ │ 2. permutation = (32, 32, 16) │ │
│ │ "How much total coverage per thread across all iterations?" │ │
│ │ → permutation_M = 32, permutation_N = 32, permutation_K = 16 │ │
│ │ │ │
│ │ These must be ≥ atoms_layout × atom_shape: │ │
│ │ M: permutation_M(32) = atoms(2) × atom_M(16) = 32 ✓ exact │ │
│ │ N: permutation_N(32) = atoms(2) × atom_N(8) × 2 = 32 │ │
│ │ ↑ the ×2 means each atom "visits" 2 N-positions │ │
│ │ │ │
│ │ Iterations to cover the full CTA tile: │ │
│ │ M: BM / permutation_M = 128 / 32 = 4 iterations in M │ │
│ │ N: BN / permutation_N = 128 / 32 = 4 iterations in N │ │
│ │ Total: 4 × 4 = 16 iterations to cover full 128×128 │ │
│ │ │ │
│ │ Tradeoff: larger permutation = more elements per thread │ │
│ │ = more registers = potentially lower occupancy │ │
│ │ │ │
│ │ 3. BK (tile depth) = 32 │ │
│ │ "How much K-reduction per SMEM load?" │ │
│ │ → BK / atom_K = 32 / 16 = 2 MMA calls per K-tile │ │
│ │ → Total K iterations: K / BK (outer loop over K) │ │
│ │ │ │
│ │ Tradeoff: larger BK = fewer SMEM loads but more SMEM used │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ Derived automatically (you don't compute these): │ │
│ │ │ │
│ │ • Thread→element mapping for all threads (from atom TV + tiling) │ │
│ │ • partition_C/A/B output shapes and strides │ │
│ │ • Grid dimensions = ceil(M/BM) × ceil(N/BN) │ │
│ │ • Number of K-iterations = K / BK │ │
│ │ • Registers per thread = V × M_rest × N_rest (from partition_C) │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────────────────────┘
8. Design Decision Map
| Parameter |
Source |
Who decides? |
Can you change it? |
mma.sync.m16n8k16 |
NVIDIA hardware |
NVIDIA (silicon) |
No. Pick a different instruction if you want different. |
TV_layout strides (16,1,8,64) |
ISA specification |
NVIDIA (ISA doc) |
No. It's how the tensor core routes data. |
Atom shape (16, 8, 16) |
Instruction definition |
NVIDIA (fixed) |
No. Choose a different MMA for different shape. |
Threads per atom 32 |
Instruction definition |
NVIDIA (fixed) |
No. It's always 1 warp for mma.sync. |
atoms_layout = (2,2,1) |
Kernel design |
YOU (kernel author) |
Yes! Could be (1,1,1), (2,4,1), (4,2,1)... |
permutation = (32,32,16) |
Kernel design |
YOU (kernel author) |
Yes! Controls coverage and element layout. |
BM, BN, BK = 128,128,32 |
Kernel design |
YOU (kernel author) |
Yes! Must be multiples of atom dims. Tuning knobs. |
NUM_THREADS = 128 |
Kernel design |
YOU (constrained) |
= atoms_layout product × 32. Follows from atoms_layout. |
Problem size (M, N, K) |
Workload |
User/application |
Given to you. Must be divisible by tile (or add predication). |
Data layouts (M,K):(1,M) |
Input format |
User/application |
Given to you. Affects copy atom choice. |
partition_C output shape |
Derived |
Auto (CuTe) |
You don't choose this — it falls out of the above. |
| Grid dimensions |
Derived |
Auto (CuTe) |
= ceil(M/BM) × ceil(N/BN). Determined by tile + problem. |
| Registers per thread |
Derived |
Auto (CuTe) |
Consequence of your choices above. Check occupancy! |
The Mental Model at Layer 3
You're solving a tiling puzzle: given a fixed-size stamp (the atom) and a big rectangle (your CTA tile), decide:
- How many stamps to use simultaneously →
atoms_layout
- How to iterate the stamps across the rectangle →
permutation
- How deep to go in K before switching tiles →
BK
The constraints are: shared memory capacity (limits BM×BK + BN×BK), register file (limits elements per thread), and occupancy (limits threads per SM). Your job is finding the sweet spot.
One-sentence summary:
The TV layout is CuTe's encoding of NVIDIA's hardware-mandated thread→element assignment, turning an opaque ISA rule into composable layout algebra that makes partition_C work automatically for any MMA instruction.