Step 1: Understanding CuTe DSL Through Its CUDA Equivalent
You know CUDA. This maps each CuTe abstraction to the CUDA code it replaces β so you see what CuTe "means" operationally.
1. The Big Picture: What CuTe Hides
In CUDA you manually compute: "which M,N positions does this thread own?" and "what's the global address for position (m,k)?"
In CuTe, both questions are answered by Layout objects. The kernel never writes index arithmetic β it calls partition_C/A/B and gets back tensors pointing to exactly its data.
The generated PTX is the same. CuTe's value: you can swap the MMA atom (scalarβtensor core) and the partitioning auto-adjusts. In CUDA you'd rewrite all index math.
2. Host Setup: What You'd Put Before the <<<>>> Launch
op = cute.nvgpu.MmaUniversalOp(Float32)
atoms_layout = make_layout((16,16,1), (16,1,0))
perm_M = make_layout((16,4), (4,1))
perm_N = make_layout((16,4), (4,1))
tiled_mma = make_tiled_mma(op, atoms_layout,
permutation_mnk=(perm_M, perm_N, None))
grid = ceil_div(mC.shape, (BM, BN))
block = (256, 1, 1)
gemm_kernel(..., tiled_mma).launch(grid, block)
// All encoded as compile-time constants:
#define GROUP_SIZE 4 // β from R in permutation
#define GROUP_GAP 64 // β from F*R / threads_per_dim
// thr_m = tidx/16, thr_n = tidx%16 β from atoms_layout
dim3 grid(M/128, N/128, 1);
dim3 block(256, 1, 1);
gemm_naive<<<grid, block>>>(A, B, C, M, N, K);
KEY
tiled_mma is a compile-time descriptor β it encodes the threadβelement mapping rules. In CUDA, these rules become your hardcoded index formulas (tidx/16, *4, +64). CuTe makes them composable and swappable.
3. local_tile β "Give This CTA Its Slice"
gA = cute.local_tile(mA,
tiler=cta_tiler, # (128, 128, 8)
coord=(bidx, bidy, None),
proj=(1, None, 1)) # keep M, drop N, keep K
# Result: Tensor<gmem, (128, 8, 4):(1, M, M*BK)>
# Still global-memory pointer, just new layout
const float* A_cta = A + bidx * BM;
// Then index as: A_cta[m + k*M]
// k_tile selection: kt*BK offset in k
// That's it β pointer + manual offset math
// No "tensor object" in CUDA
VIEW
Zero hardware work. CuTe returns a new tensor with adjusted pointer + layout that encodes the tiling. In CUDA, you just do A + bidx*BM and remember to multiply k*M later. Same addresses, different bookkeeping.
CuTe's local_tile also splits the K dimension into tiles and returns a 3D tensor (BM, BK, num_k_tiles). In CUDA you'd handle this with the k_tile loop variable. CuTe makes the "K is tiled" fact part of the type.
4. get_slice + partition_C β "Which Output Elements Are Mine?"
thr_mma = tiled_mma.get_slice(tidx)
tCgC = thr_mma.partition_C(gC)
# tCgC: Tensor<gmem, (1,(4,2),(4,2)):(0,(128,8192),(1,64))>
# V M_rest N_rest
# This IS the ownership pattern β encoded as a Layout
int thr_m = tidx / 16;
int thr_n = tidx % 16;
int m_pos[8], n_pos[8];
for (g = 0; g < 2; g++)
for (i = 0; i < 4; i++) {
m_pos[g*4+i] = thr_m*4 + g*64 + i;
n_pos[g*4+i] = thr_n*4 + g*64 + i;
}
// You manually decode the pattern
// CuTe encodes it as the stride tuple
VIEW
No data moved. partition_C returns a tensor that, when indexed, gives you the global address of your element. The nested shape (4,2) = "4 consecutive, then repeat with gap". In CUDA, you wrote that as the loop structure.
Why (4,2) instead of flat 8?
The nested shape preserves the structure of the permutation. The inner 4 = R (consecutive group), outer 2 = repetitions. This matters because:
β’ Inner-4 elements have stride 1 (or 128 for M) β vectorizable as LDG.128
β’ The gap between groups (64 columns, 8192 bytes for M) is a separate access
CuTe keeps this structure so autovec_copy can emit vector loads for the consecutive parts.
5. make_fragment_C + fill β "Allocate My Accumulators"
tCrC = tiled_mma.make_fragment_C(tCgC)
tCrC.fill(0.0)
# tCrC: Tensor<rmem, (1,(4,2),(4,2))>
# Same shape as tCgC but in register file
# 64 floats = 256 bytes of registers
float acc[8][8];
for (m = 0; m < 8; m++)
for (n = 0; n < 8; n++)
acc[m][n] = 0.0f;
// 64 floats, same thing
// Compiler puts in registers
ALLOC
Real work: register allocation + zero-init. make_fragment_C uses tCgC only for its shape β it allocates a register tensor with identical logical structure but different (packed) strides.
6. The K-Loop: partition_A/B + copy + gemm
for k_tile in range(num_k_tiles): # 4
tCgA = thr_mma.partition_A(gA[:,:,k_tile])
tCrA = make_fragment_like(tCgA)
cute.autovec_copy(tCgA, tCrA) # β LOAD
tCgB = thr_mma.partition_B(gB[:,:,k_tile])
tCrB = make_fragment_like(tCgB)
cute.autovec_copy(tCgB, tCrB) # β LOAD
for k_block in range(8): # BK
cute.gemm(tiled_mma, tCrC,
tCrA[:,:,k_block],
tCrB[:,:,k_block], tCrC) # β COMPUTE
for (kt = 0; kt < K/BK; kt++) { // 4
float rA[8][8], rB[8][8];
for (m = 0; m < 8; m++)
for (k = 0; k < 8; k++)
rA[m][k] = A_cta[m_pos[m] + (kt*8+k)*M];
for (n = 0; n < 8; n++)
for (k = 0; k < 8; k++)
rB[n][k] = B_cta[n_pos[n] + (kt*8+k)*N];
for (k = 0; k < 8; k++)
for (m = 0; m < 8; m++)
for (n = 0; n < 8; n++)
acc[m][n] += rA[m][k] * rB[n][k];
}
LOAD autovec_copy = the LDG instructions. This is where bandwidth is consumed.
COMPUTE cute.gemm = the FFMA instructions. For scalar atom, it's just the triple-nested multiply-add.
partition_A gives shape (1, (4,2), 8) = (V, M_rest, K). The K=8 dimension is the full BK β this thread needs ALL K values for its M-positions. In CUDA that's your inner k loop.
Why does CuTe separate partition from copy?
In CUDA you fuse "compute address" and "load" into one line: rA[m][k] = A[addr].
CuTe splits them:
1. partition_A β creates a gmem tensor with your addresses (VIEW, free)
2. autovec_copy β issues the loads (REAL WORK)
This separation lets you later insert shared memory between them (step2) or use async copy (step3) without changing the partition logic. The "what's mine" question stays the same; only "where do I read from" changes.
7. Epilogue: copy back
atom = make_copy_atom(CopyUniversalOp(), f32)
cute.copy(atom, tCrC, tCgC)
# tCrC (registers) β tCgC (global memory)
# Uses the SAME layout/strides from partition_C
# so each value lands at its correct global position
for (m = 0; m < 8; m++) {
int gm = bidx*128 + m_pos[m];
for (n = 0; n < 8; n++) {
int gn = bidy*128 + n_pos[n];
C[gm*N + gn] = acc[m][n];
}
}
// STG instructions, 64 stores
STORE cute.copy walks both tensors element-by-element: reads tCrC[i] (register), writes to tCgC[i] (global address computed from partition_C's strides). Same as your nested store loops.
8. Summary: What CuTe Abstracts Away
| Your CUDA responsibility | CuTe equivalent | Why it matters |
| Compute thread β (thr_m, thr_n) | atoms_layout + get_slice | Change atom count? Layout auto-adjusts |
| Compute m_pos[], n_pos[] arrays | permutation + partition_C | Change grouping? Just change R |
| A_cta pointer + k*M offset | local_tile + partition_A | Change tile size or memory layout? Same API |
| Manage rA[8][8] declaration | make_fragment_like | Shape auto-derived from partition |
| LDG loop | autovec_copy | Vectorizes when strides allow it |
| FMA triple loop | cute.gemm | Swap atom to tensor core β same call |
| STG loop | cute.copy | Same copy API for any epilogue pattern |
The promise of CuTe: Your CUDA kernel has index math baked for one specific configuration. To switch from scalar FMA to mma.sync.m16n8k16, you'd rewrite every index formula.
In CuTe: change MmaUniversalOp β MmaF16BF16Op, adjust atoms_layout to (2,2,1), and the kernel body stays identical. The partitions, copies, and gemm calls all auto-adapt because they're driven by the TiledMMA's layout algebra, not hardcoded constants.