Steps 2–5: CuTe DSL Explained via CUDA Equivalents

Each step introduces one optimization. Here's how CuTe abstracts what you'd write by hand in CUDA.

Step 2: Shared Memory Tiling

Data path: GMEM β†’ SMEM (cooperative) β†’ sync β†’ RMEM β†’ FMA β†’ GMEM

New CuTe concepts β†’ CUDA equivalents

CuTe DSL
CUDA
smem = SmemAllocator() sA = smem.allocate_tensor(dtype, sA_layout, align) sB = smem.allocate_tensor(dtype, sB_layout, align)
__shared__ float sA[128 * 8]; // (BM, BK) flattened __shared__ float sB[128 * 8]; // (BN, BK) flattened
VIEW SmemAllocator = __shared__ declaration. CuTe adds layout metadata; CUDA just gets raw bytes.
CuTe DSL
CUDA
thr_copy = tiled_copy.get_slice(tidx) tAgA = thr_copy.partition_S(gA) # source (gmem) tAsA = thr_copy.partition_D(sA) # dest (smem) cute.copy(tiled_copy, tAgA, tAsA)
// Thread layout (32, 8): 32 in M, 8 in K int copy_m = (tidx % 32) * 4; // 4 consecutive M-rows int copy_k = tidx / 32; // 1 K-column for (int i = 0; i < 4; i++) sA[(copy_m+i) + copy_k*BM] = A_cta[(copy_m+i) + k*M];
LOAD make_tiled_copy_tv + partition_S/D encodes the cooperative thread→element mapping. In CUDA, you manually compute which elements each thread loads. CuTe derives it from the thread layout.
CuTe DSL
CUDA
cute.arch.sync_threads()
__syncthreads();
Step 2's main win: each A[m,k] loaded from GMEM once, shared by 16 threads with same M-row. Bandwidth reduced ~16Γ— for A (and ~16Γ— for B), at cost of __syncthreads barriers.

Step 3: Multi-Stage Async Pipeline

Data path: GMEM β†’(cp.async)β†’ SMEM[stage] β†’ sync β†’ RMEM β†’ FMA β†’ GMEM

New CuTe concepts β†’ CUDA equivalents

CuTe DSL
CUDA
# Copy atom with cp.async hardware DMA atom = make_copy_atom(CopyG2SOp(), dtype, num_bits_per_copy=32) # Issues cp.async β€” thread doesn't stall! cute.copy(tiled_copy, tAgA[..., k_tile], tAsA[..., stage]) cute.arch.cp_async_commit_group() cute.arch.cp_async_wait_group(NUM_STAGES - 2)
// cp.async PTX inline β€” bypasses thread data path asm("cp.async.ca.shared.global [%0], [%1], 4;" :: "r"(smem_addr), "l"(gmem_ptr)); asm("cp.async.commit_group;"); asm("cp.async.wait_group 1;"); // wait until ≀1 group inflight
LOAD CopyG2SOp = cp.async instruction. The GPU's DMA engine copies data without occupying thread execution slots. Thread can compute on old data while new data arrives.
CuTe DSL
CUDA
# SMEM has extra dimension for stages sA_layout = tile_to_shape(atom, (BM, BK, NUM_STAGES)) # Index with stage: tAsA[None, None, None, stage] # select stage
__shared__ float sA[NUM_STAGES][BM * BK]; // Index: sA[stage][m + k*BM] // Circular: read_stage, write_stage advance modulo NUM_STAGES
Pipeline overlaps: while computing on stage[read], DMA fills stage[write]. With 3 stages and NUM_STAGES-2 wait, at most 1 group finishes before compute starts β†’ full overlap of copy+compute after warmup.
What CuTe hides: The circular buffer pointer management (read_stage = (read_stage+1) % 3) and commit/wait bookkeeping. In CuTe, the copy just targets the right partition and the framework tracks pipeline state. In CUDA, you manage the state machine manually.

Step 4: Tensor Cores + Swizzled SMEM

Data path: GMEM β†’(cp.async)β†’ SMEM[swizzled] β†’(ldmatrix)β†’ RMEM β†’ HMMA β†’ GMEM

New CuTe concepts β†’ CUDA equivalents

CuTe DSL
CUDA
# MMA atom: hardware tensor core instruction op = MmaF16BF16Op(FP16, FP32, (16, 8, 16)) tC = make_layout((2, 2, 1)) # atom_layout: 4 warps tiled_mma = make_tiled_mma(op, tC, permutation_mnk=(32, 32, 16)) # Partition from SMEM (not GMEM like step1-3) tCsA = thr_mma.partition_A(sA) tCsB = thr_mma.partition_B(sB)
// wmma (or raw mma.sync.m16n8k16 PTX) // 4 warps in (2,2) layout: warp_m = warp_id/2 // Each warp computes 64Γ—64 of the 128Γ—128 tile // = 4Γ—4 wmma tiles (each 16Γ—16) wmma::fragment<matrix_a, 16,16,16, half, col_major> fa; wmma::load_matrix_sync(fa, &sA[offset], BM); wmma::mma_sync(acc, fa, fb, acc);
COMPUTE MmaF16BF16Op wraps the hardware's mma.sync.m16n8k16. CuTe's atom_layout (2,2,1) + permutation (32,32,16) tiles this across all 128 threads. In CUDA: you manually assign warp→tile mapping. The fragment load pattern (which SMEM address each lane reads) is what ldmatrix + CuTe's TV layout encode automatically.
CuTe DSL
CUDA
# Swizzled SMEM layout β€” bank-conflict-free sA_layout = make_composed_layout( make_swizzle(3, 3, 3), # 3-bit XOR 0, # applied to mode 0 layout_atom(64, 8):(1, 64)) # G2S copy writes to swizzled addresses cute.copy(tiled_copy, gA_partition, sA_partition)
// Swizzle: XOR-based address transform // Within each 64Γ—8 atom: k_swizzled = k ^ ((m>>3) & 7) // This spreads bank accesses across 32 banks int smem_idx(int m, int k, int stage) { int ml = m % 64, kl = k % 8; int ks = kl ^ ((ml >> 3) & 7); // THE SWIZZLE return ... + ml + ks * 64; }
VIEW make_composed_layout wraps a regular layout with a swizzle function. CuTe applies it transparently to all reads/writes. In CUDA, you XOR address bits manually in every indexing expression.
CuTe DSL
CUDA
# S2R via ldmatrix (warp-level cooperative load) atom_s2r = make_copy_atom(LdMatrix8x8x16bOp(True, 4), dtype) tiled_copy_s2r = make_tiled_copy_A(atom_s2r, tiled_mma) # Retile register to match ldmatrix output layout tCrA_copy = thr_s2r.retile(tCrA) cute.copy(tiled_copy_s2r, tCsA_copy, tCrA_copy)
// In our CUDA version: de-swizzle into linear buffer, // then wmma::load_matrix_sync (which internally uses ldmatrix) for (int i = lane; i < 256; i += 32) tempA[i] = sA[smem_idx(row, col, stage)]; __syncwarp(); wmma::load_matrix_sync(frag_a, tempA, ldm);
LOAD ldmatrix is a warp-cooperative instruction — 32 threads collectively load a matrix fragment from SMEM. CuTe's make_tiled_copy_A(atom, tiled_mma) DERIVES the correct lane→address mapping from the MMA's TV layout. Without CuTe, you'd need to understand the exact PTX fragment layout per thread.
This is CuTe's biggest value proposition: when you change the MMA atom (e.g., m16n8k16 β†’ m16n8k32), the derived S2R copy automatically adjusts. In CUDA, you'd rewrite all the fragment loading logic.

Step 5: Vectorized Loads + SMEM Epilogue

G2S: 128-bit cp.async (8 FP16/copy). Epilogue: RMEM β†’ SMEM β†’ retile β†’ 128-bit STG

New CuTe concepts β†’ CUDA equivalents

CuTe DSL
CUDA
# 128-bit cp.async (8 FP16 per transfer) atom = make_copy_atom(CopyG2SOp(), dtype, num_bits_per_copy=128) # Thread layout (16, 8) with val_layout (8, 1) # β†’ each thread copies 8 consecutive elements per call
// cp.async.cg 16 bytes β€” loads 8 FP16 at once asm("cp.async.cg.shared.global [%0], [%1], 16;" :: "r"(smem_addr), "l"(gmem_ptr)); // Thread layout: (16, 8) β†’ 16 groups of 8 M-elements // 128 threads Γ— 8 elements = 1024 per round, need 4 rounds
LOAD 128-bit cp.async = 4Γ— fewer instructions than step4's 32-bit version. Requires 16B-aligned source and contiguous elements. CuTe's val_layout (8,1) tells the compiler "8 consecutive elements per copy atom."
CuTe DSL
CUDA
# SMEM Epilogue: retile for coalesced stores # 1. Write fragments to SMEM (fragment order) autovec_copy(tCrC_fp16, sC_partition) sync_threads() # 2. Re-read from SMEM in coalesced order autovec_copy(sC_retiled, tCrC_retiled) # 3. Vectorized 128-bit store to GMEM cute.copy(tiled_copy_C, tCrC_retiled, gC_partition)
// 1. wmma::store to SMEM (row-major) wmma::store_matrix_sync(&sC[offset], acc_fp16, BN, row_major); __syncthreads(); // 2. Each thread reads 8 consecutive elements from SMEM // 3. 128-bit store to GMEM (int4 = 16 bytes = 8 FP16) *(int4*)&C[row*N + col] = *(int4*)&sC[flat_idx];
STORE Without SMEM epilogue, tensor core fragments are scattered (the TV layout β‰  coalesced gmem layout). Writing through SMEM lets you "retile" β€” rearrange from fragment order to coalesced order. Result: 128-bit STG instructions instead of scattered 32-bit stores.
Why SMEM epilogue matters:

After mma.sync, thread 0 holds C[0,0], C[0,1], C[8,0], C[8,1] β€” scattered positions.
For coalesced 128-bit stores, you need 8 consecutive N-elements in one thread.

The retile through SMEM reshuffles: each thread deposits its fragments into SMEM at their logical positions, then re-reads 8 consecutive elements it didn't originally compute. Now stores are perfectly coalesced.

CuTe's make_tiled_copy_C + retile handles this reshuffling automatically based on the output layout.

Summary: Progressive Optimization Map

StepCUDA techniqueCuTe abstractionSpeedup source
1β†’2__shared__ + cooperative copy + __syncthreadsSmemAllocator + make_tiled_copy_tvData reuse (Γ·16 GMEM reads)
2β†’3cp.async + commit/wait + multi-stage bufferCopyG2SOp + pipeline state (auto)Overlap copy + compute
3β†’4mma.sync + ldmatrix + XOR swizzleMmaF16BF16Op + make_composed_layout + make_tiled_copy_A16Γ— compute (TC) + no bank conflicts
4β†’5128b cp.async + SMEM retile + 128b STGnum_bits_per_copy=128 + tiled_copy_C + retile4Γ— fewer copy instr + coalesced stores
The CuTe promise realized: From step1 to step5, the kernel body structure stays nearly identical β€” local_tile β†’ partition β†’ copy β†’ gemm β†’ copy. What changes is the atoms and layouts passed in from the host. The same cute.copy becomes scalar load (step1), cooperative SMEM load (step2), cp.async (step3), or ldmatrix (step4) β€” determined entirely by the copy atom. Same cute.gemm becomes scalar FMA (step1) or mma.sync (step4).

In CUDA, each step requires fundamentally rewriting the kernel: new __shared__ declarations, new index math, new PTX inline assembly, new sync patterns. CuTe's layout algebra makes these orthogonal, composable choices.