Steps 2–5: CuTe DSL vs Pure CUDA Comparison

Each step introduces one optimization. Here's how CuTe abstracts what you'd write by hand in CUDA.

Step 2: Shared Memory Tiling

New CuTe concepts → CUDA equivalents

CuTe DSL

CUDA

smem = SmemAllocator() sA = smem.allocate_tensor(dtype, sA_layout, align) sB = smem.allocate_tensor(dtype, sB_layout, align)

__shared__ float sA[128 * 8]; // (BM, BK) flattened __shared__ float sB[128 * 8]; // (BN, BK) flattened

VIEW SmemAllocator = __shared__ declaration. CuTe adds layout metadata; CUDA just gets raw bytes.

CuTe DSL

CUDA

thr_copy = tiled_copy.get_slice(tidx) tAgA = thr_copy.partition_S(gA) # source (gmem) tAsA = thr_copy.partition_D(sA) # dest (smem) cute.copy(tiled_copy, tAgA, tAsA)

// Thread layout (32, 8): 32 in M, 8 in K int copy_m = (tidx % 32) * 4; // 4 consecutive M-rows int copy_k = tidx / 32; // 1 K-column for (int i = 0; i < 4; i++) sA[(copy_m+i) + copy_k*BM] = A_cta[(copy_m+i) + k*M];

LOAD make_tiled_copy_tv + partition_S/D encodes the cooperative thread→element mapping. In CUDA, you manually compute which elements each thread loads. CuTe derives it from the thread layout.

Step 3: Multi-Stage Async Pipeline

New CuTe concepts → CUDA equivalents

CuTe DSL

CUDA

# Copy atom with cp.async hardware DMA atom = make_copy_atom(CopyG2SOp(), dtype, num_bits_per_copy=32) # Issues cp.async — thread doesn't stall! cute.copy(tiled_copy, tAgA[..., k_tile], tAsA[..., stage]) cute.arch.cp_async_commit_group() cute.arch.cp_async_wait_group(NUM_STAGES - 2)

// cp.async PTX inline — bypasses thread data path asm("cp.async.ca.shared.global [%0], [%1], 4;" :: "r"(smem_addr), "l"(gmem_ptr)); asm("cp.async.commit_group;"); asm("cp.async.wait_group 1;"); // wait until ≤1 group inflight

LOAD CopyG2SOp = cp.async instruction. The GPU's DMA engine copies data without occupying thread execution slots. Thread can compute on old data while new data arrives.

CuTe DSL

CUDA

# SMEM has extra dimension for stages sA_layout = tile_to_shape(atom, (BM, BK, NUM_STAGES)) # Index with stage: tAsA[None, None, None, stage] # select stage

__shared__ float sA[NUM_STAGES][BM * BK]; // Index: sA[stage][m + k*BM] // Circular: read_stage, write_stage advance modulo NUM_STAGES

Pipeline overlaps: while computing on stage[read], DMA fills stage[write]. With 3 stages and NUM_STAGES-2 wait, at most 1 group finishes before compute starts → full overlap of copy+compute after warmup.

What CuTe hides: The circular buffer pointer management (read_stage = (read_stage+1) % 3) and commit/wait bookkeeping. In CuTe, the copy just targets the right partition and the framework tracks pipeline state. In CUDA, you manage the state machine manually.

Step 4: Tensor Cores + Swizzled SMEM

Data path: GMEM →(cp.async)→ SMEM[swizzled] →(ldmatrix)→ RMEM → HMMA → GMEM

New CuTe concepts → CUDA equivalents

CuTe DSL

CUDA

# MMA atom: hardware tensor core instruction op = MmaF16BF16Op(FP16, FP32, (16, 8, 16)) tC = make_layout((2, 2, 1)) # atom_layout: 4 warps tiled_mma = make_tiled_mma(op, tC, permutation_mnk=(32, 32, 16)) # Partition from SMEM (not GMEM like step1-3) tCsA = thr_mma.partition_A(sA) tCsB = thr_mma.partition_B(sB)

// wmma (or raw mma.sync.m16n8k16 PTX) // 4 warps in (2,2) layout: warp_m = warp_id/2 // Each warp computes 64×64 of the 128×128 tile // = 4×4 wmma tiles (each 16×16) wmma::fragment<matrix_a, 16,16,16, half, col_major> fa; wmma::load_matrix_sync(fa, &sA[offset], BM); wmma::mma_sync(acc, fa, fb, acc);

COMPUTE MmaF16BF16Op wraps the hardware's mma.sync.m16n8k16. CuTe's atom_layout (2,2,1) + permutation (32,32,16) tiles this across all 128 threads. In CUDA: you manually assign warp→tile mapping. The fragment load pattern (which SMEM address each lane reads) is what ldmatrix + CuTe's TV layout encode automatically.

CuTe DSL

CUDA

# Swizzled SMEM layout — bank-conflict-free sA_layout = make_composed_layout( make_swizzle(3, 3, 3), # 3-bit XOR 0, # applied to mode 0 layout_atom(64, 8):(1, 64)) # G2S copy writes to swizzled addresses cute.copy(tiled_copy, gA_partition, sA_partition)

// Swizzle: XOR-based address transform // Within each 64×8 atom: k_swizzled = k ^ ((m>>3) & 7) // This spreads bank accesses across 32 banks int smem_idx(int m, int k, int stage) { int ml = m % 64, kl = k % 8; int ks = kl ^ ((ml >> 3) & 7); // THE SWIZZLE return ... + ml + ks * 64; }

VIEW make_composed_layout wraps a regular layout with a swizzle function. CuTe applies it transparently to all reads/writes. In CUDA, you XOR address bits manually in every indexing expression.

CuTe DSL

CUDA

# S2R via ldmatrix (warp-level cooperative load) atom_s2r = make_copy_atom(LdMatrix8x8x16bOp(True, 4), dtype) tiled_copy_s2r = make_tiled_copy_A(atom_s2r, tiled_mma) # Retile register to match ldmatrix output layout tCrA_copy = thr_s2r.retile(tCrA) cute.copy(tiled_copy_s2r, tCsA_copy, tCrA_copy)

// In our CUDA version: de-swizzle into linear buffer, // then wmma::load_matrix_sync (which internally uses ldmatrix) for (int i = lane; i < 256; i += 32) tempA[i] = sA[smem_idx(row, col, stage)]; __syncwarp(); wmma::load_matrix_sync(frag_a, tempA, ldm);

LOAD ldmatrix is a warp-cooperative instruction — 32 threads collectively load a matrix fragment from SMEM. CuTe's make_tiled_copy_A(atom, tiled_mma) DERIVES the correct lane→address mapping from the MMA's TV layout. Without CuTe, you'd need to understand the exact PTX fragment layout per thread.

This is CuTe's biggest value proposition: when you change the MMA atom (e.g., m16n8k16 → m16n8k32), the derived S2R copy automatically adjusts. In CUDA, you'd rewrite all the fragment loading logic.

Step 5: Vectorized Loads + SMEM Epilogue

G2S: 128-bit cp.async (8 FP16/copy). Epilogue: RMEM → SMEM → retile → 128-bit STG

New CuTe concepts → CUDA equivalents

CuTe DSL

CUDA

# 128-bit cp.async (8 FP16 per transfer) atom = make_copy_atom(CopyG2SOp(), dtype, num_bits_per_copy=128) # Thread layout (16, 8) with val_layout (8, 1) # → each thread copies 8 consecutive elements per call

// cp.async.cg 16 bytes — loads 8 FP16 at once asm("cp.async.cg.shared.global [%0], [%1], 16;" :: "r"(smem_addr), "l"(gmem_ptr)); // Thread layout: (16, 8) → 16 groups of 8 M-elements // 128 threads × 8 elements = 1024 per round, need 4 rounds

LOAD 128-bit cp.async = 4× fewer instructions than step4's 32-bit version. Requires 16B-aligned source and contiguous elements. CuTe's val_layout (8,1) tells the compiler "8 consecutive elements per copy atom."

CuTe DSL

CUDA

# SMEM Epilogue: retile for coalesced stores # 1. Write fragments to SMEM (fragment order) autovec_copy(tCrC_fp16, sC_partition) sync_threads() # 2. Re-read from SMEM in coalesced order autovec_copy(sC_retiled, tCrC_retiled) # 3. Vectorized 128-bit store to GMEM cute.copy(tiled_copy_C, tCrC_retiled, gC_partition)

// 1. wmma::store to SMEM (row-major) wmma::store_matrix_sync(&sC[offset], acc_fp16, BN, row_major); __syncthreads(); // 2. Each thread reads 8 consecutive elements from SMEM // 3. 128-bit store to GMEM (int4 = 16 bytes = 8 FP16) *(int4*)&C[row*N + col] = *(int4*)&sC[flat_idx];

STORE Without SMEM epilogue, tensor core fragments are scattered (the TV layout ≠ coalesced gmem layout). Writing through SMEM lets you "retile" — rearrange from fragment order to coalesced order. Result: 128-bit STG instructions instead of scattered 32-bit stores.

Why SMEM epilogue matters:

After mma.sync, thread 0 holds C[0,0], C[0,1], C[8,0], C[8,1] — scattered positions.
For coalesced 128-bit stores, you need 8 consecutive N-elements in one thread.

The retile through SMEM reshuffles: each thread deposits its fragments into SMEM at their logical positions, then re-reads 8 consecutive elements it didn't originally compute. Now stores are perfectly coalesced.

CuTe's make_tiled_copy_C + retile handles this reshuffling automatically based on the output layout.

Summary: Progressive Optimization Map

Step	CUDA technique	CuTe abstraction	Speedup source
1→2	__shared__ + cooperative copy + __syncthreads	SmemAllocator + make_tiled_copy_tv	Data reuse (÷16 GMEM reads)
2→3	cp.async + commit/wait + multi-stage buffer	CopyG2SOp + pipeline state (auto)	Overlap copy + compute
3→4	mma.sync + ldmatrix + XOR swizzle	MmaF16BF16Op + make_composed_layout + make_tiled_copy_A	16× compute (TC) + no bank conflicts
4→5	128b cp.async + SMEM retile + 128b STG	num_bits_per_copy=128 + tiled_copy_C + retile	4× fewer copy instr + coalesced stores

The CuTe promise realized: From step1 to step5, the kernel body structure stays nearly identical — local_tile → partition → copy → gemm → copy. What changes is the atoms and layouts passed in from the host. The same cute.copy becomes scalar load (step1), cooperative SMEM load (step2), cp.async (step3), or ldmatrix (step4) — determined entirely by the copy atom. Same cute.gemm becomes scalar FMA (step1) or mma.sync (step4).

In CUDA, each step requires fundamentally rewriting the kernel: new __shared__ declarations, new index math, new PTX inline assembly, new sync patterns. CuTe's layout algebra makes these orthogonal, composable choices.

Steps 2–5: CuTe DSL Explained via CUDA Equivalents

Step 2: Shared Memory Tiling

New CuTe concepts → CUDA equivalents

Step 3: Multi-Stage Async Pipeline

New CuTe concepts → CUDA equivalents

Step 4: Tensor Cores + Swizzled SMEM

New CuTe concepts → CUDA equivalents

Step 5: Vectorized Loads + SMEM Epilogue

New CuTe concepts → CUDA equivalents

Summary: Progressive Optimization Map