How CuTe abstracts away the manual index math you'd write in raw CUDA
| Aspect | Pure CUDA | CuTe DSL |
|---|---|---|
| Language | C++ / CUDA C | Python (compiles to PTX via MLIR) |
| ThreadβData mapping | Manual index arithmetic | Encoded in TiledMMA layouts β automatic |
| Tiling | Manual offset calculation per block | cute.local_tile(tensor, tiler, coord, proj) |
| MMA dispatch | Explicit PTX inline asm or wmma API | Swap the Atom β everything re-derives |
| Shared memory layout | Manual padding / swizzle for bank conflicts | make_composed_layout(swizzle, ...) |
| Changing arch | Rewrite kernel from scratch | Change atom + copy atoms, rest adapts |
// Manual index math everywhere
__global__ void gemm_naive(
float *A, float *B, float *C,
int M, int N, int K)
{
// Step A: Which tile am I?
int bx = blockIdx.x;
int by = blockIdx.y;
int tx = threadIdx.x;
// Step B: Manual threadβelement mapping
int row = bx * BM + (tx / 16) * 4;
int col = by * BN + (tx % 16) * 4;
// Step C: K-loop
float acc[4][4] = {0};
for (int k = 0; k < K; k += BK) {
for (int i = 0; i < 4; i++)
for (int j = 0; j < 4; j++)
for (int kk = 0; kk < BK; kk++)
// YOU compute the address
acc[i][j] += A[(row+i)*K + k+kk]
* B[(col+j)*K + k+kk];
}
// Step D: Store β manual global offset
for (int i = 0; i < 4; i++)
for (int j = 0; j < 4; j++)
C[(row+i)*N + col+j] = acc[i][j];
}
# Layouts encode the mapping
@cute.kernel
def gemm_kernel(mA, mB, mC, tiled_mma):
tidx = cute.arch.thread_idx()[0]
bidx, bidy = cute.arch.block_idx()[:2]
# Step A: CuTe tiles for you
gA = cute.local_tile(mA, tiler, (bidx,bidy,None),
proj=(1,None,1))
gB = cute.local_tile(mB, tiler, (bidx,bidy,None),
proj=(None,1,1))
gC = cute.local_tile(mC, tiler, (bidx,bidy,None),
proj=(1,1,None))
# Step B: TiledMMA knows the mapping
thr_mma = tiled_mma.get_slice(tidx)
tCgC = thr_mma.partition_C(gC)
tCrC = tiled_mma.make_fragment_C(tCgC)
tCrC.fill(0.0)
# Step C: K-loop β no index math
for k in range(num_k_tiles):
tCrA = cute.make_fragment_like(...)
cute.copy(tCgA, tCrA)
cute.gemm(tiled_mma, tCrC, tCrA, tCrB, tCrC)
# Step D: Epilogue β copy handles layout
cute.copy(atom, tCrC, tCgC)
In pure CUDA, the threadβdata mapping is implicit in your index math. CuTe makes it explicit and composable:
# Host side (@cute.jit) β compile-time constant
tiled_mma = cute.make_tiled_mma(
atom, # What hardware instruction? (shape + threadβelement contract)
atoms_layout, # How many, tiled how in (M,N,K)?
permutation_mnk # Reorder for memory access patterns
)
| Component | CUDA Equivalent | CuTe DSL |
|---|---|---|
| atom | The FMA/HMMA/WMMA you call in inline asm | MmaUniversalOp(f32) or MmaF16BF16Op(f16,f32,(16,8,16)) |
| atoms_layout | Your mental picture of "256 threads form a 16Γ16 grid" | make_layout((16,16,1), stride=(16,1,0)) |
| permutation | That tricky index remapping you hardcode for coalescing | ((16,4):(4,1), (16,4):(4,1), None) |
| partition_A/B/C | row = bx*BM + (tx/16)*4; etc. |
thr_mma.partition_C(gC) β derived from layouts |
__shared__ arrays with manual paddingmma.sync PTXResult: ~500 lines, fragile, arch-locked
CopyUniversalOp β CopyG2SOpMmaUniversalOp β MmaF16BF16OpLdMatrix8x8x16bOpResult: Same structure, new atoms, works