Understanding CuTe GEMM: A Visual Study Guide

Introduction

When I started learning NVIDIA’s CuTe (the layout algebra engine inside CUTLASS 3.x), the biggest challenge wasn’t the math; it was building visual intuition for how abstract layouts map to real threads and real data.
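To make "layout" concrete before the visual pages, here is a minimal sketch in plain Python. It is not the CuTe DSL API: `layout_offset`, `shape`, `col_major`, and `row_major` are hypothetical names I chose for illustration. It only shows the core idea that a layout pairs a shape with strides and maps a logical coordinate to a linear offset.

```python
# Plain-Python illustration only; these names are hypothetical, not CuTe DSL calls.

def layout_offset(coord, stride):
    """Map a logical coordinate to a linear offset: sum of coord[k] * stride[k]."""
    return sum(c * s for c, s in zip(coord, stride))

shape = (4, 8)          # 4 rows, 8 columns
col_major = (1, 4)      # stepping one row moves 1 element, one column moves 4
row_major = (8, 1)      # stepping one row moves 8 elements, one column moves 1

# The same logical coordinate lands at different offsets under each layout.
coord = (2, 3)
offset_col = layout_offset(coord, col_major)
offset_row = layout_offset(coord, row_major)

# Enumerate every coordinate to see the full mapping for each layout.
all_col = sorted(layout_offset((i, j), col_major)
                 for i in range(shape[0]) for j in range(shape[1]))
all_row = sorted(layout_offset((i, j), row_major)
                 for i in range(shape[0]) for j in range(shape[1]))
```

Both layouts cover the same 32 offsets exactly once; only the order in which coordinates reach them differs. The later pages apply the same idea with a (thread, value) coordinate in place of (row, column).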

These interactive pages are the study notes I created along the way. They trace how a simple Ampere GEMM kernel is designed using the CuTe DSL, starting from the high-level architecture and drilling down to individual thread behavior. Each page is self-contained and interactive: you can modify parameters and see results update live.

Reading Order

I recommend going through these in order. Each page builds on concepts introduced in the previous one.

Prerequisites

To get the most out of these notes, you should be comfortable with:

Basic CUDA programming (kernels, threads, blocks, shared memory)
Matrix multiplication and tiling concepts
What a GEMM is and why it matters for GPU performance

No prior CuTe or CUTLASS experience is needed; that is what these pages teach.

Reading Order

1. Kernel Design Overview

2. TV Layout & MMA Atom Explained

3. TV Layout Visualizer

4. Partition C Permutation Explained

5. Step 1 Thread Trace

6. CuTe DSL vs Pure CUDA Comparison

7. Step 1: CuTe vs CUDA Deep Dive

8. Step 1: Thread Ownership Visualization

9. Partition C Derivation

10. Step 2–5: CuTe vs CUDA Progression
