Step 1: Understanding CuTe DSL Through Its CUDA Equivalent

You know CUDA. This maps each CuTe abstraction to the CUDA code it replaces β€” so you see what CuTe "means" operationally.

1. The Big Picture: What CuTe Hides

In CUDA you manually compute: "which M,N positions does this thread own?" and "what's the global address for position (m,k)?"

In CuTe, both questions are answered by Layout objects. The kernel never writes index arithmetic β€” it calls partition_C/A/B and gets back tensors pointing to exactly its data.

The generated PTX is the same. CuTe's value: you can swap the MMA atom (scalar→tensor core) and the partitioning auto-adjusts. In CUDA you'd rewrite all index math.

2. Host Setup: What You'd Put Before the <<<>>> Launch

CuTe DSL (host_gemm)
CUDA equivalent
op = cute.nvgpu.MmaUniversalOp(Float32) atoms_layout = make_layout((16,16,1), (16,1,0)) perm_M = make_layout((16,4), (4,1)) perm_N = make_layout((16,4), (4,1)) tiled_mma = make_tiled_mma(op, atoms_layout, permutation_mnk=(perm_M, perm_N, None)) grid = ceil_div(mC.shape, (BM, BN)) block = (256, 1, 1) gemm_kernel(..., tiled_mma).launch(grid, block)
// All encoded as compile-time constants: #define GROUP_SIZE 4 // ← from R in permutation #define GROUP_GAP 64 // ← from F*R / threads_per_dim // thr_m = tidx/16, thr_n = tidx%16 ← from atoms_layout dim3 grid(M/128, N/128, 1); dim3 block(256, 1, 1); gemm_naive<<<grid, block>>>(A, B, C, M, N, K);
KEY tiled_mma is a compile-time descriptor β€” it encodes the thread↔element mapping rules. In CUDA, these rules become your hardcoded index formulas (tidx/16, *4, +64). CuTe makes them composable and swappable.

3. local_tile β€” "Give This CTA Its Slice"

CuTe DSL
CUDA
gA = cute.local_tile(mA, tiler=cta_tiler, # (128, 128, 8) coord=(bidx, bidy, None), proj=(1, None, 1)) # keep M, drop N, keep K # Result: Tensor<gmem, (128, 8, 4):(1, M, M*BK)> # Still global-memory pointer, just new layout
const float* A_cta = A + bidx * BM; // Then index as: A_cta[m + k*M] // k_tile selection: kt*BK offset in k // That's it β€” pointer + manual offset math // No "tensor object" in CUDA
VIEW Zero hardware work. CuTe returns a new tensor with adjusted pointer + layout that encodes the tiling. In CUDA, you just do A + bidx*BM and remember to multiply k*M later. Same addresses, different bookkeeping.
CuTe's local_tile also splits the K dimension into tiles and returns a 3D tensor (BM, BK, num_k_tiles). In CUDA you'd handle this with the k_tile loop variable. CuTe makes the "K is tiled" fact part of the type.

4. get_slice + partition_C β€” "Which Output Elements Are Mine?"

CuTe DSL
CUDA
thr_mma = tiled_mma.get_slice(tidx) tCgC = thr_mma.partition_C(gC) # tCgC: Tensor<gmem, (1,(4,2),(4,2)):(0,(128,8192),(1,64))> # V M_rest N_rest # This IS the ownership pattern β€” encoded as a Layout
int thr_m = tidx / 16; int thr_n = tidx % 16; int m_pos[8], n_pos[8]; for (g = 0; g < 2; g++) for (i = 0; i < 4; i++) { m_pos[g*4+i] = thr_m*4 + g*64 + i; n_pos[g*4+i] = thr_n*4 + g*64 + i; } // You manually decode the pattern // CuTe encodes it as the stride tuple
VIEW No data moved. partition_C returns a tensor that, when indexed, gives you the global address of your element. The nested shape (4,2) = "4 consecutive, then repeat with gap". In CUDA, you wrote that as the loop structure.
Why (4,2) instead of flat 8?
The nested shape preserves the structure of the permutation. The inner 4 = R (consecutive group), outer 2 = repetitions. This matters because:
β€’ Inner-4 elements have stride 1 (or 128 for M) β†’ vectorizable as LDG.128
β€’ The gap between groups (64 columns, 8192 bytes for M) is a separate access

CuTe keeps this structure so autovec_copy can emit vector loads for the consecutive parts.

5. make_fragment_C + fill β€” "Allocate My Accumulators"

CuTe DSL
CUDA
tCrC = tiled_mma.make_fragment_C(tCgC) tCrC.fill(0.0) # tCrC: Tensor<rmem, (1,(4,2),(4,2))> # Same shape as tCgC but in register file # 64 floats = 256 bytes of registers
float acc[8][8]; for (m = 0; m < 8; m++) for (n = 0; n < 8; n++) acc[m][n] = 0.0f; // 64 floats, same thing // Compiler puts in registers
ALLOC Real work: register allocation + zero-init. make_fragment_C uses tCgC only for its shape β€” it allocates a register tensor with identical logical structure but different (packed) strides.

6. The K-Loop: partition_A/B + copy + gemm

CuTe DSL
CUDA
for k_tile in range(num_k_tiles): # 4 tCgA = thr_mma.partition_A(gA[:,:,k_tile]) tCrA = make_fragment_like(tCgA) cute.autovec_copy(tCgA, tCrA) # ← LOAD tCgB = thr_mma.partition_B(gB[:,:,k_tile]) tCrB = make_fragment_like(tCgB) cute.autovec_copy(tCgB, tCrB) # ← LOAD for k_block in range(8): # BK cute.gemm(tiled_mma, tCrC, tCrA[:,:,k_block], tCrB[:,:,k_block], tCrC) # ← COMPUTE
for (kt = 0; kt < K/BK; kt++) { // 4 float rA[8][8], rB[8][8]; for (m = 0; m < 8; m++) for (k = 0; k < 8; k++) rA[m][k] = A_cta[m_pos[m] + (kt*8+k)*M]; for (n = 0; n < 8; n++) for (k = 0; k < 8; k++) rB[n][k] = B_cta[n_pos[n] + (kt*8+k)*N]; for (k = 0; k < 8; k++) for (m = 0; m < 8; m++) for (n = 0; n < 8; n++) acc[m][n] += rA[m][k] * rB[n][k]; }
LOAD autovec_copy = the LDG instructions. This is where bandwidth is consumed.
COMPUTE cute.gemm = the FFMA instructions. For scalar atom, it's just the triple-nested multiply-add.
partition_A gives shape (1, (4,2), 8) = (V, M_rest, K). The K=8 dimension is the full BK β€” this thread needs ALL K values for its M-positions. In CUDA that's your inner k loop.
Why does CuTe separate partition from copy?

In CUDA you fuse "compute address" and "load" into one line: rA[m][k] = A[addr].
CuTe splits them:
1. partition_A β†’ creates a gmem tensor with your addresses (VIEW, free)
2. autovec_copy β†’ issues the loads (REAL WORK)

This separation lets you later insert shared memory between them (step2) or use async copy (step3) without changing the partition logic. The "what's mine" question stays the same; only "where do I read from" changes.

7. Epilogue: copy back

CuTe DSL
CUDA
atom = make_copy_atom(CopyUniversalOp(), f32) cute.copy(atom, tCrC, tCgC) # tCrC (registers) β†’ tCgC (global memory) # Uses the SAME layout/strides from partition_C # so each value lands at its correct global position
for (m = 0; m < 8; m++) { int gm = bidx*128 + m_pos[m]; for (n = 0; n < 8; n++) { int gn = bidy*128 + n_pos[n]; C[gm*N + gn] = acc[m][n]; } } // STG instructions, 64 stores
STORE cute.copy walks both tensors element-by-element: reads tCrC[i] (register), writes to tCgC[i] (global address computed from partition_C's strides). Same as your nested store loops.

8. Summary: What CuTe Abstracts Away

Your CUDA responsibilityCuTe equivalentWhy it matters
Compute thread β†’ (thr_m, thr_n)atoms_layout + get_sliceChange atom count? Layout auto-adjusts
Compute m_pos[], n_pos[] arrayspermutation + partition_CChange grouping? Just change R
A_cta pointer + k*M offsetlocal_tile + partition_AChange tile size or memory layout? Same API
Manage rA[8][8] declarationmake_fragment_likeShape auto-derived from partition
LDG loopautovec_copyVectorizes when strides allow it
FMA triple loopcute.gemmSwap atom to tensor core β†’ same call
STG loopcute.copySame copy API for any epilogue pattern
The promise of CuTe: Your CUDA kernel has index math baked for one specific configuration. To switch from scalar FMA to mma.sync.m16n8k16, you'd rewrite every index formula.

In CuTe: change MmaUniversalOp β†’ MmaF16BF16Op, adjust atoms_layout to (2,2,1), and the kernel body stays identical. The partitions, copies, and gemm calls all auto-adapt because they're driven by the TiledMMA's layout algebra, not hardcoded constants.