CuTe DSL vs Pure CUDA GEMM

1. Overall Structure

Aspect	Pure CUDA	CuTe DSL
Language	C++ / CUDA C	Python (compiles to PTX via MLIR)
Thread↔Data mapping	Manual index arithmetic	Encoded in TiledMMA layouts — automatic
Tiling	Manual offset calculation per block	`cute.local_tile(tensor, tiler, coord, proj)`
MMA dispatch	Explicit PTX inline asm or wmma API	Swap the Atom — everything re-derives
Shared memory layout	Manual padding / swizzle for bank conflicts	`make_composed_layout(swizzle, ...)`
Changing arch	Rewrite kernel from scratch	Change atom + copy atoms, rest adapts

2. Kernel Code — Side by Side

🔴 Pure CUDA — Naive GEMM

// Manual index math everywhere
__global__ void gemm_naive(
    float *A, float *B, float *C,
    int M, int N, int K)
{
  // Step A: Which tile am I?
  int bx = blockIdx.x;
  int by = blockIdx.y;
  int tx = threadIdx.x;

  // Step B: Manual thread→element mapping
  int row = bx * BM + (tx / 16) * 4;
  int col = by * BN + (tx % 16) * 4;

  // Step C: K-loop
  float acc[4][4] = {0};
  for (int k = 0; k < K; k += BK) {
    for (int i = 0; i < 4; i++)
      for (int j = 0; j < 4; j++)
        for (int kk = 0; kk < BK; kk++)
          // YOU compute the address
          acc[i][j] += A[(row+i)*K + k+kk]
                     * B[(col+j)*K + k+kk];
  }

  // Step D: Store — manual global offset
  for (int i = 0; i < 4; i++)
    for (int j = 0; j < 4; j++)
      C[(row+i)*N + col+j] = acc[i][j];
}

🟢 CuTe DSL — Same GEMM

# Layouts encode the mapping
@cute.kernel
def gemm_kernel(mA, mB, mC, tiled_mma):
    tidx = cute.arch.thread_idx()[0]
    bidx, bidy = cute.arch.block_idx()[:2]

    # Step A: CuTe tiles for you
    gA = cute.local_tile(mA, tiler, (bidx,bidy,None),
                         proj=(1,None,1))
    gB = cute.local_tile(mB, tiler, (bidx,bidy,None),
                         proj=(None,1,1))
    gC = cute.local_tile(mC, tiler, (bidx,bidy,None),
                         proj=(1,1,None))

    # Step B: TiledMMA knows the mapping
    thr_mma = tiled_mma.get_slice(tidx)
    tCgC = thr_mma.partition_C(gC)
    tCrC = tiled_mma.make_fragment_C(tCgC)
    tCrC.fill(0.0)

    # Step C: K-loop — no index math
    for k in range(num_k_tiles):
        tCrA = cute.make_fragment_like(...)
        cute.copy(tCgA, tCrA)
        cute.gemm(tiled_mma, tCrC, tCrA, tCrB, tCrC)

    # Step D: Epilogue — copy handles layout
    cute.copy(atom, tCrC, tCgC)

The CUDA version has index math scattered everywhere — and it's wrong the moment you change the data layout, tile size, or MMA instruction. In CuTe, the layout algebra handles all of this. You describe "what" not "how."

3. The Key Difference: Who Computes Addresses?

Pure CUDA: Thread 42 wants A[row][k]: → programmer manually computes: ptr + row*K + k → changes layout? rewrite formula → changes arch? rewrite kernel CuTe DSL: Thread 42 wants its A elements: → TiledMMA layout says: "thread 42 owns these (V,M,K) coords" → partition_A applies layout composition: coords → offset → changes layout? layout algebra adapts → changes arch? swap the atom, rest re-derives

4. TiledMMA Construction — The "Missing Piece" in CUDA

In pure CUDA, the thread↔data mapping is implicit in your index math. CuTe makes it explicit and composable:

# Host side (@cute.jit) — compile-time constant
tiled_mma = cute.make_tiled_mma(
    atom,            # What hardware instruction? (shape + thread↔element contract)
    atoms_layout,    # How many, tiled how in (M,N,K)?
    permutation_mnk  # Reorder for memory access patterns
)

Component	CUDA Equivalent	CuTe DSL
atom	The FMA/HMMA/WMMA you call in inline asm	`MmaUniversalOp(f32)` or `MmaF16BF16Op(f16,f32,(16,8,16))`
atoms_layout	Your mental picture of "256 threads form a 16×16 grid"	`make_layout((16,16,1), stride=(16,1,0))`
permutation	That tricky index remapping you hardcode for coalescing	`((16,4):(4,1), (16,4):(4,1), None)`
partition_A/B/C	`row = bxBM + (tx/16)4;` etc.	`thr_mma.partition_C(gC)` — derived from layouts

In CUDA, you are the layout algebra — you do the math in your head and encode it as index expressions. CuTe makes the algebra explicit so the compiler can derive everything from the atom.

5. What Happens When You Scale Up (SMEM + Tensor Cores)

🔴 CUDA: Everything Changes

Add __shared__ arrays with manual padding
Rewrite all load loops for cp.async
Manually compute swizzled SMEM addresses
Rewrite compute loop for mma.sync PTX
Fragment layout is arch-specific — hardcoded
Pipeline stages = copy-paste with index offsets

Result: ~500 lines, fragile, arch-locked

🟢 CuTe: Swap Components

Add SMEM allocator + composed swizzle layout
Swap copy atom: CopyUniversalOp → CopyG2SOp
Swap MMA atom: MmaUniversalOp → MmaF16BF16Op
Add S2R copy atom: LdMatrix8x8x16bOp
Pipeline = extra SMEM dimension + commit/wait
Kernel structure (A→B→C→D) unchanged

Result: Same structure, new atoms, works

6. The Progression Summarized

Pure CUDA CuTe DSL ───────── ──────── Step 1: Manual index math local_tile + partition + gemm Step 2: + manual SMEM + syncthreads + SmemAllocator + tiled_copy Step 3: + cp.async inline asm + CopyG2SOp atom + commit/wait Step 4: + mma.sync PTX + swizzle hack + MmaF16BF16Op + composed_layout Step 5: + vectorized asm + reg staging + 128b copy atom + retile Each CUDA step = rewrite. Each CuTe step = swap a component.

7. Mental Model

┌─────────────────────────────────────────────────────────────────────┐ │ CUDA programmer's job: │ │ │ │ "I am the compiler. I manually derive addresses from tile coords,│ │ thread IDs, warp lanes, and hardware instruction layouts." │ │ │ │ CuTe programmer's job: │ │ │ │ "I describe the hardware contract (atom) and the tiling strategy │ │ (atoms_layout). CuTe derives all the address math for me." │ │ │ │ The atom IS the hardware manual page, encoded as layouts. │ │ Everything else is layout composition. │ └─────────────────────────────────────────────────────────────────────┘