CuTe DSL vs Pure CUDA GEMM

How CuTe abstracts away the manual index math you'd write in raw CUDA

1. Overall Structure

Aspect Pure CUDA CuTe DSL
Language C++ / CUDA C Python (compiles to PTX via MLIR)
Thread↔Data mapping Manual index arithmetic Encoded in TiledMMA layouts β€” automatic
Tiling Manual offset calculation per block cute.local_tile(tensor, tiler, coord, proj)
MMA dispatch Explicit PTX inline asm or wmma API Swap the Atom β€” everything re-derives
Shared memory layout Manual padding / swizzle for bank conflicts make_composed_layout(swizzle, ...)
Changing arch Rewrite kernel from scratch Change atom + copy atoms, rest adapts

2. Kernel Code β€” Side by Side

πŸ”΄ Pure CUDA β€” Naive GEMM

// Manual index math everywhere
__global__ void gemm_naive(
    float *A, float *B, float *C,
    int M, int N, int K)
{
  // Step A: Which tile am I?
  int bx = blockIdx.x;
  int by = blockIdx.y;
  int tx = threadIdx.x;

  // Step B: Manual thread→element mapping
  int row = bx * BM + (tx / 16) * 4;
  int col = by * BN + (tx % 16) * 4;

  // Step C: K-loop
  float acc[4][4] = {0};
  for (int k = 0; k < K; k += BK) {
    for (int i = 0; i < 4; i++)
      for (int j = 0; j < 4; j++)
        for (int kk = 0; kk < BK; kk++)
          // YOU compute the address
          acc[i][j] += A[(row+i)*K + k+kk]
                     * B[(col+j)*K + k+kk];
  }

  // Step D: Store β€” manual global offset
  for (int i = 0; i < 4; i++)
    for (int j = 0; j < 4; j++)
      C[(row+i)*N + col+j] = acc[i][j];
}

🟒 CuTe DSL β€” Same GEMM

# Layouts encode the mapping
@cute.kernel
def gemm_kernel(mA, mB, mC, tiled_mma):
    tidx = cute.arch.thread_idx()[0]
    bidx, bidy = cute.arch.block_idx()[:2]

    # Step A: CuTe tiles for you
    gA = cute.local_tile(mA, tiler, (bidx,bidy,None),
                         proj=(1,None,1))
    gB = cute.local_tile(mB, tiler, (bidx,bidy,None),
                         proj=(None,1,1))
    gC = cute.local_tile(mC, tiler, (bidx,bidy,None),
                         proj=(1,1,None))

    # Step B: TiledMMA knows the mapping
    thr_mma = tiled_mma.get_slice(tidx)
    tCgC = thr_mma.partition_C(gC)
    tCrC = tiled_mma.make_fragment_C(tCgC)
    tCrC.fill(0.0)

    # Step C: K-loop β€” no index math
    for k in range(num_k_tiles):
        tCrA = cute.make_fragment_like(...)
        cute.copy(tCgA, tCrA)
        cute.gemm(tiled_mma, tCrC, tCrA, tCrB, tCrC)

    # Step D: Epilogue β€” copy handles layout
    cute.copy(atom, tCrC, tCgC)
The CUDA version has index math scattered everywhere β€” and it's wrong the moment you change the data layout, tile size, or MMA instruction. In CuTe, the layout algebra handles all of this. You describe "what" not "how."

3. The Key Difference: Who Computes Addresses?

Pure CUDA: Thread 42 wants A[row][k]: β†’ programmer manually computes: ptr + row*K + k β†’ changes layout? rewrite formula β†’ changes arch? rewrite kernel CuTe DSL: Thread 42 wants its A elements: β†’ TiledMMA layout says: "thread 42 owns these (V,M,K) coords" β†’ partition_A applies layout composition: coords β†’ offset β†’ changes layout? layout algebra adapts β†’ changes arch? swap the atom, rest re-derives

4. TiledMMA Construction β€” The "Missing Piece" in CUDA

In pure CUDA, the thread↔data mapping is implicit in your index math. CuTe makes it explicit and composable:

# Host side (@cute.jit) β€” compile-time constant
tiled_mma = cute.make_tiled_mma(
    atom,            # What hardware instruction? (shape + thread↔element contract)
    atoms_layout,    # How many, tiled how in (M,N,K)?
    permutation_mnk  # Reorder for memory access patterns
)
Component CUDA Equivalent CuTe DSL
atom The FMA/HMMA/WMMA you call in inline asm MmaUniversalOp(f32) or MmaF16BF16Op(f16,f32,(16,8,16))
atoms_layout Your mental picture of "256 threads form a 16Γ—16 grid" make_layout((16,16,1), stride=(16,1,0))
permutation That tricky index remapping you hardcode for coalescing ((16,4):(4,1), (16,4):(4,1), None)
partition_A/B/C row = bx*BM + (tx/16)*4; etc. thr_mma.partition_C(gC) β€” derived from layouts
In CUDA, you are the layout algebra β€” you do the math in your head and encode it as index expressions. CuTe makes the algebra explicit so the compiler can derive everything from the atom.

5. What Happens When You Scale Up (SMEM + Tensor Cores)

πŸ”΄ CUDA: Everything Changes

  • Add __shared__ arrays with manual padding
  • Rewrite all load loops for cp.async
  • Manually compute swizzled SMEM addresses
  • Rewrite compute loop for mma.sync PTX
  • Fragment layout is arch-specific β€” hardcoded
  • Pipeline stages = copy-paste with index offsets

Result: ~500 lines, fragile, arch-locked

🟒 CuTe: Swap Components

  • Add SMEM allocator + composed swizzle layout
  • Swap copy atom: CopyUniversalOp β†’ CopyG2SOp
  • Swap MMA atom: MmaUniversalOp β†’ MmaF16BF16Op
  • Add S2R copy atom: LdMatrix8x8x16bOp
  • Pipeline = extra SMEM dimension + commit/wait
  • Kernel structure (Aβ†’Bβ†’Cβ†’D) unchanged

Result: Same structure, new atoms, works

6. The Progression Summarized

Pure CUDA CuTe DSL ───────── ──────── Step 1: Manual index math local_tile + partition + gemm Step 2: + manual SMEM + syncthreads + SmemAllocator + tiled_copy Step 3: + cp.async inline asm + CopyG2SOp atom + commit/wait Step 4: + mma.sync PTX + swizzle hack + MmaF16BF16Op + composed_layout Step 5: + vectorized asm + reg staging + 128b copy atom + retile Each CUDA step = rewrite. Each CuTe step = swap a component.

7. Mental Model

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ CUDA programmer's job: β”‚ β”‚ β”‚ β”‚ "I am the compiler. I manually derive addresses from tile coords,β”‚ β”‚ thread IDs, warp lanes, and hardware instruction layouts." β”‚ β”‚ β”‚ β”‚ CuTe programmer's job: β”‚ β”‚ β”‚ β”‚ "I describe the hardware contract (atom) and the tiling strategy β”‚ β”‚ (atoms_layout). CuTe derives all the address math for me." β”‚ β”‚ β”‚ β”‚ The atom IS the hardware manual page, encoded as layouts. β”‚ β”‚ Everything else is layout composition. β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜