VIEWSmemAllocator = __shared__ declaration. CuTe adds layout metadata; CUDA just gets raw bytes.
CuTe DSL
CUDA
thr_copy = tiled_copy.get_slice(tidx)
tAgA = thr_copy.partition_S(gA) # source (gmem)
tAsA = thr_copy.partition_D(sA) # dest (smem)
cute.copy(tiled_copy, tAgA, tAsA)
// Thread layout (32, 8): 32 in M, 8 in K
int copy_m = (tidx % 32) * 4; // 4 consecutive M-rows
int copy_k = tidx / 32; // 1 K-column
for (int i = 0; i < 4; i++)
sA[(copy_m+i) + copy_k*BM] = A_cta[(copy_m+i) + k*M];
LOADmake_tiled_copy_tv + partition_S/D encodes the cooperative threadβelement mapping. In CUDA, you manually compute which elements each thread loads. CuTe derives it from the thread layout.
CuTe DSL
CUDA
cute.arch.sync_threads()
__syncthreads();
Step 2's main win: each A[m,k] loaded from GMEM once, shared by 16 threads with same M-row. Bandwidth reduced ~16Γ for A (and ~16Γ for B), at cost of __syncthreads barriers.
# Copy atom with cp.async hardware DMA
atom = make_copy_atom(CopyG2SOp(), dtype,
num_bits_per_copy=32)
# Issues cp.async β thread doesn't stall!
cute.copy(tiled_copy, tAgA[..., k_tile], tAsA[..., stage])
cute.arch.cp_async_commit_group()
cute.arch.cp_async_wait_group(NUM_STAGES - 2)
// cp.async PTX inline β bypasses thread data path
asm("cp.async.ca.shared.global [%0], [%1], 4;"
:: "r"(smem_addr), "l"(gmem_ptr));
asm("cp.async.commit_group;");
asm("cp.async.wait_group 1;"); // wait until β€1 group inflight
LOADCopyG2SOp = cp.async instruction. The GPU's DMA engine copies data without occupying thread execution slots. Thread can compute on old data while new data arrives.
CuTe DSL
CUDA
# SMEM has extra dimension for stages
sA_layout = tile_to_shape(atom, (BM, BK, NUM_STAGES))
# Index with stage:
tAsA[None, None, None, stage] # select stage
Pipeline overlaps: while computing on stage[read], DMA fills stage[write]. With 3 stages and NUM_STAGES-2 wait, at most 1 group finishes before compute starts β full overlap of copy+compute after warmup.
What CuTe hides: The circular buffer pointer management (read_stage = (read_stage+1) % 3) and commit/wait bookkeeping. In CuTe, the copy just targets the right partition and the framework tracks pipeline state. In CUDA, you manage the state machine manually.
// wmma (or raw mma.sync.m16n8k16 PTX)
// 4 warps in (2,2) layout: warp_m = warp_id/2
// Each warp computes 64Γ64 of the 128Γ128 tile
// = 4Γ4 wmma tiles (each 16Γ16)
wmma::fragment<matrix_a, 16,16,16, half, col_major> fa;
wmma::load_matrix_sync(fa, &sA[offset], BM);
wmma::mma_sync(acc, fa, fb, acc);
COMPUTEMmaF16BF16Op wraps the hardware's mma.sync.m16n8k16. CuTe's atom_layout (2,2,1) + permutation (32,32,16) tiles this across all 128 threads. In CUDA: you manually assign warpβtile mapping. The fragment load pattern (which SMEM address each lane reads) is what ldmatrix + CuTe's TV layout encode automatically.
// Swizzle: XOR-based address transform
// Within each 64Γ8 atom: k_swizzled = k ^ ((m>>3) & 7)
// This spreads bank accesses across 32 banks
int smem_idx(int m, int k, int stage) {
int ml = m % 64, kl = k % 8;
int ks = kl ^ ((ml >> 3) & 7); // THE SWIZZLE
return ... + ml + ks * 64;
}
VIEWmake_composed_layout wraps a regular layout with a swizzle function. CuTe applies it transparently to all reads/writes. In CUDA, you XOR address bits manually in every indexing expression.
CuTe DSL
CUDA
# S2R via ldmatrix (warp-level cooperative load)
atom_s2r = make_copy_atom(LdMatrix8x8x16bOp(True, 4), dtype)
tiled_copy_s2r = make_tiled_copy_A(atom_s2r, tiled_mma)
# Retile register to match ldmatrix output layout
tCrA_copy = thr_s2r.retile(tCrA)
cute.copy(tiled_copy_s2r, tCsA_copy, tCrA_copy)
// In our CUDA version: de-swizzle into linear buffer,
// then wmma::load_matrix_sync (which internally uses ldmatrix)
for (int i = lane; i < 256; i += 32)
tempA[i] = sA[smem_idx(row, col, stage)];
__syncwarp();
wmma::load_matrix_sync(frag_a, tempA, ldm);
LOADldmatrix is a warp-cooperative instruction β 32 threads collectively load a matrix fragment from SMEM. CuTe's make_tiled_copy_A(atom, tiled_mma) DERIVES the correct laneβaddress mapping from the MMA's TV layout. Without CuTe, you'd need to understand the exact PTX fragment layout per thread.
This is CuTe's biggest value proposition: when you change the MMA atom (e.g., m16n8k16 β m16n8k32), the derived S2R copy automatically adjusts. In CUDA, you'd rewrite all the fragment loading logic.
# 128-bit cp.async (8 FP16 per transfer)
atom = make_copy_atom(CopyG2SOp(), dtype,
num_bits_per_copy=128)
# Thread layout (16, 8) with val_layout (8, 1)
# β each thread copies 8 consecutive elements per call
// cp.async.cg 16 bytes β loads 8 FP16 at once
asm("cp.async.cg.shared.global [%0], [%1], 16;"
:: "r"(smem_addr), "l"(gmem_ptr));
// Thread layout: (16, 8) β 16 groups of 8 M-elements
// 128 threads Γ 8 elements = 1024 per round, need 4 rounds
LOAD 128-bit cp.async = 4Γ fewer instructions than step4's 32-bit version. Requires 16B-aligned source and contiguous elements. CuTe's val_layout (8,1) tells the compiler "8 consecutive elements per copy atom."
CuTe DSL
CUDA
# SMEM Epilogue: retile for coalesced stores
# 1. Write fragments to SMEM (fragment order)
autovec_copy(tCrC_fp16, sC_partition)
sync_threads()
# 2. Re-read from SMEM in coalesced order
autovec_copy(sC_retiled, tCrC_retiled)
# 3. Vectorized 128-bit store to GMEM
cute.copy(tiled_copy_C, tCrC_retiled, gC_partition)
// 1. wmma::store to SMEM (row-major)
wmma::store_matrix_sync(&sC[offset], acc_fp16, BN, row_major);
__syncthreads();
// 2. Each thread reads 8 consecutive elements from SMEM
// 3. 128-bit store to GMEM (int4 = 16 bytes = 8 FP16)
*(int4*)&C[row*N + col] = *(int4*)&sC[flat_idx];
STORE Without SMEM epilogue, tensor core fragments are scattered (the TV layout β coalesced gmem layout). Writing through SMEM lets you "retile" β rearrange from fragment order to coalesced order. Result: 128-bit STG instructions instead of scattered 32-bit stores.
Why SMEM epilogue matters:
After mma.sync, thread 0 holds C[0,0], C[0,1], C[8,0], C[8,1] β scattered positions.
For coalesced 128-bit stores, you need 8 consecutive N-elements in one thread.
The retile through SMEM reshuffles: each thread deposits its fragments into SMEM at their logical positions, then re-reads 8 consecutive elements it didn't originally compute. Now stores are perfectly coalesced.
CuTe's make_tiled_copy_C + retile handles this reshuffling automatically based on the output layout.
The CuTe promise realized: From step1 to step5, the kernel body structure stays nearly identical β local_tile β partition β copy β gemm β copy. What changes is the atoms and layouts passed in from the host. The same cute.copy becomes scalar load (step1), cooperative SMEM load (step2), cp.async (step3), or ldmatrix (step4) β determined entirely by the copy atom. Same cute.gemm becomes scalar FMA (step1) or mma.sync (step4).
In CUDA, each step requires fundamentally rewriting the kernel: new __shared__ declarations, new index math, new PTX inline assembly, new sync patterns. CuTe's layout algebra makes these orthogonal, composable choices.