Step 1: CuTe DSL Explained via CUDA Mapping

You know CUDA. This maps each CuTe abstraction to the CUDA code it replaces — so you see what CuTe "means" operationally.

1. The Big Picture: What CuTe Hides

In CUDA you manually compute: "which M,N positions does this thread own?" and "what's the global address for position (m,k)?"

In CuTe, both questions are answered by Layout objects. The kernel never writes index arithmetic — it calls partition_C/A/B and gets back tensors pointing to exactly its data.

The generated PTX is the same. CuTe's value: you can swap the MMA atom (scalar→tensor core) and the partitioning auto-adjusts. In CUDA you'd rewrite all index math.

2. Host Setup: What You'd Put Before the <<<>>> Launch

CuTe DSL (host_gemm)

CUDA equivalent

op = cute.nvgpu.MmaUniversalOp(Float32)
atoms_layout = make_layout((16,16,1), (16,1,0))
perm_M = make_layout((16,4), (4,1))
perm_N = make_layout((16,4), (4,1))
tiled_mma = make_tiled_mma(op, atoms_layout,
    permutation_mnk=(perm_M, perm_N, None))

grid = ceil_div(mC.shape, (BM, BN))
block = (256, 1, 1)
gemm_kernel(..., tiled_mma).launch(grid, block)

// All encoded as compile-time constants:
#define GROUP_SIZE 4   // ← from R in permutation
#define GROUP_GAP 64   // ← from F*R / threads_per_dim
// thr_m = tidx/16, thr_n = tidx%16  ← from atoms_layout

dim3 grid(M/128, N/128, 1);
dim3 block(256, 1, 1);
gemm_naive<<<grid, block>>>(A, B, C, M, N, K);

KEY tiled_mma is a compile-time descriptor — it encodes the thread↔element mapping rules. In CUDA, these rules become your hardcoded index formulas (tidx/16, *4, +64). CuTe makes them composable and swappable.

3. local_tile — "Give This CTA Its Slice"

CuTe DSL

CUDA

gA = cute.local_tile(mA,
    tiler=cta_tiler,      # (128, 128, 8)
    coord=(bidx, bidy, None),
    proj=(1, None, 1))    # keep M, drop N, keep K

# Result: Tensor<gmem, (128, 8, 4):(1, M, M*BK)>
# Still global-memory pointer, just new layout

const float* A_cta = A + bidx * BM;
// Then index as: A_cta[m + k*M]
// k_tile selection: kt*BK offset in k

// That's it — pointer + manual offset math
// No "tensor object" in CUDA

VIEW Zero hardware work. CuTe returns a new tensor with adjusted pointer + layout that encodes the tiling. In CUDA, you just do A + bidx*BM and remember to multiply k*M later. Same addresses, different bookkeeping.

CuTe's local_tile also splits the K dimension into tiles and returns a 3D tensor (BM, BK, num_k_tiles). In CUDA you'd handle this with the k_tile loop variable. CuTe makes the "K is tiled" fact part of the type.

4. get_slice + partition_C — "Which Output Elements Are Mine?"

CuTe DSL

CUDA

thr_mma = tiled_mma.get_slice(tidx)
tCgC = thr_mma.partition_C(gC)

# tCgC: Tensor<gmem, (1,(4,2),(4,2)):(0,(128,8192),(1,64))>
#                      V  M_rest N_rest
# This IS the ownership pattern — encoded as a Layout

int thr_m = tidx / 16;
int thr_n = tidx % 16;
int m_pos[8], n_pos[8];
for (g = 0; g < 2; g++)
  for (i = 0; i < 4; i++) {
    m_pos[g*4+i] = thr_m*4 + g*64 + i;
    n_pos[g*4+i] = thr_n*4 + g*64 + i;
  }

// You manually decode the pattern
// CuTe encodes it as the stride tuple

VIEW No data moved. partition_C returns a tensor that, when indexed, gives you the global address of your element. The nested shape (4,2) = "4 consecutive, then repeat with gap". In CUDA, you wrote that as the loop structure.

Why (4,2) instead of flat 8?
The nested shape preserves the structure of the permutation. The inner 4 = R (consecutive group), outer 2 = repetitions. This matters because:
• Inner-4 elements have stride 1 (or 128 for M) → vectorizable as LDG.128
• The gap between groups (64 columns, 8192 bytes for M) is a separate access

CuTe keeps this structure so autovec_copy can emit vector loads for the consecutive parts.

5. make_fragment_C + fill — "Allocate My Accumulators"

CuTe DSL

CUDA

tCrC = tiled_mma.make_fragment_C(tCgC)
tCrC.fill(0.0)

# tCrC: Tensor<rmem, (1,(4,2),(4,2))>
# Same shape as tCgC but in register file
# 64 floats = 256 bytes of registers

float acc[8][8];
for (m = 0; m < 8; m++)
  for (n = 0; n < 8; n++)
    acc[m][n] = 0.0f;

// 64 floats, same thing
// Compiler puts in registers

ALLOC Real work: register allocation + zero-init. make_fragment_C uses tCgC only for its shape — it allocates a register tensor with identical logical structure but different (packed) strides.

6. The K-Loop: partition_A/B + copy + gemm

CuTe DSL

CUDA

for k_tile in range(num_k_tiles):      # 4
  tCgA = thr_mma.partition_A(gA[:,:,k_tile])
  tCrA = make_fragment_like(tCgA)
  cute.autovec_copy(tCgA, tCrA)       # ← LOAD

  tCgB = thr_mma.partition_B(gB[:,:,k_tile])
  tCrB = make_fragment_like(tCgB)
  cute.autovec_copy(tCgB, tCrB)       # ← LOAD

  for k_block in range(8):             # BK
    cute.gemm(tiled_mma, tCrC,
      tCrA[:,:,k_block],
      tCrB[:,:,k_block], tCrC)         # ← COMPUTE

for (kt = 0; kt < K/BK; kt++) {       // 4
  float rA[8][8], rB[8][8];

  for (m = 0; m < 8; m++)
    for (k = 0; k < 8; k++)
      rA[m][k] = A_cta[m_pos[m] + (kt*8+k)*M];

  for (n = 0; n < 8; n++)
    for (k = 0; k < 8; k++)
      rB[n][k] = B_cta[n_pos[n] + (kt*8+k)*N];

  for (k = 0; k < 8; k++)
    for (m = 0; m < 8; m++)
      for (n = 0; n < 8; n++)
        acc[m][n] += rA[m][k] * rB[n][k];
}

COMPUTE cute.gemm = the FFMA instructions. For scalar atom, it's just the triple-nested multiply-add.

partition_A gives shape (1, (4,2), 8) = (V, M_rest, K). The K=8 dimension is the full BK — this thread needs ALL K values for its M-positions. In CUDA that's your inner k loop.

Why does CuTe separate partition from copy?

In CUDA you fuse "compute address" and "load" into one line: rA[m][k] = A[addr].
CuTe splits them:
1. partition_A → creates a gmem tensor with your addresses (VIEW, free)
2. autovec_copy → issues the loads (REAL WORK)

This separation lets you later insert shared memory between them (step2) or use async copy (step3) without changing the partition logic. The "what's mine" question stays the same; only "where do I read from" changes.

7. Epilogue: copy back

CuTe DSL

CUDA

atom = make_copy_atom(CopyUniversalOp(), f32)
cute.copy(atom, tCrC, tCgC)

# tCrC (registers) → tCgC (global memory)
# Uses the SAME layout/strides from partition_C
# so each value lands at its correct global position

for (m = 0; m < 8; m++) {
  int gm = bidx*128 + m_pos[m];
  for (n = 0; n < 8; n++) {
    int gn = bidy*128 + n_pos[n];
    C[gm*N + gn] = acc[m][n];
  }
}

// STG instructions, 64 stores

STORE cute.copy walks both tensors element-by-element: reads tCrC[i] (register), writes to tCgC[i] (global address computed from partition_C's strides). Same as your nested store loops.

8. Summary: What CuTe Abstracts Away

Your CUDA responsibility	CuTe equivalent	Why it matters
Compute thread → (thr_m, thr_n)	atoms_layout + get_slice	Change atom count? Layout auto-adjusts
Compute m_pos[], n_pos[] arrays	permutation + partition_C	Change grouping? Just change R
A_cta pointer + k*M offset	local_tile + partition_A	Change tile size or memory layout? Same API
Manage rA[8][8] declaration	make_fragment_like	Shape auto-derived from partition
LDG loop	autovec_copy	Vectorizes when strides allow it
FMA triple loop	cute.gemm	Swap atom to tensor core → same call
STG loop	cute.copy	Same copy API for any epilogue pattern

The promise of CuTe: Your CUDA kernel has index math baked for one specific configuration. To switch from scalar FMA to mma.sync.m16n8k16, you'd rewrite every index formula.

In CuTe: change MmaUniversalOp → MmaF16BF16Op, adjust atoms_layout to (2,2,1), and the kernel body stays identical. The partitions, copies, and gemm calls all auto-adapt because they're driven by the TiledMMA's layout algebra, not hardcoded constants.

Step 1: Understanding CuTe DSL Through Its CUDA Equivalent

1. The Big Picture: What CuTe Hides

2. Host Setup: What You'd Put Before the <<<>>> Launch

3. local_tile — "Give This CTA Its Slice"

4. get_slice + partition_C — "Which Output Elements Are Mine?"

5. make_fragment_C + fill — "Allocate My Accumulators"

6. The K-Loop: partition_A/B + copy + gemm

7. Epilogue: copy back

8. Summary: What CuTe Abstracts Away