partition_C: The Problem, The Algebra, The Derivation

How CuTe transforms (128, 128):(128, 1) → (1, (4,2), (4,2)):(0, (128,8192), (1,64)) in 5 algebraic steps

🔴 The Problem partition_C Solves

Given:

A 128×128 output tile gC in global memory (row-major)
256 threads that need to collaboratively compute this tile
A Matrix Multiply-Accumulate (MMA) instruction (here: a scalar FMA, Fused Multiply-Add)

Need:

For each thread, produce a view (pointer + layout) into gC that selects exactly the 64 elements this thread is responsible for — without moving any data.

Why 64? 128 × 128 = 16,384 elements ÷ 256 threads = 64 elements per thread.

📖 Glossary: Abbreviations → Full Names

Abbreviation	Full Name	What It Is
MMA	Matrix Multiply-Accumulate	Hardware instruction: D = A × B + C
FMA	Fused Multiply-Add	Scalar MMA: d = a × b + c
CTA	Cooperative Thread Array	= thread block in CUDA
TV	Thread-Value	Layout mapping (thread_idx, value_idx) → element offset
Thr	Thread	Thread dimension in a layout
Frg / Frag	Fragment	Per-thread owned data elements
V / Val	Value	Elements within one atom execution
Perm	Permutation	Reordering of element positions
GMEM	Global Memory	GPU DRAM (HBM)
SMEM	Shared Memory	On-chip per-CTA scratchpad
RMEM	Register Memory	Per-thread register file
BM/BN/BK	Block M/N/K	CTA tile dimensions
TiledMma	Tiled MMA	MMA atom replicated across threads + positions
thrfrg_C	Thread-Fragment C	Algorithm: partition C into thread and fragment dims

🟣 Core Principle: Layout Composition as Function Composition

The Key Idea

A CuTe Layout = Shape + Stride is a function from logical coordinates to memory offsets.

Layout(coord) = dot(coord, stride) → memory offset

For example, (128, 128):(128, 1) is the function:

f(m, n) = m × 128 + n × 1
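As a minimal sketch of the layout-as-function idea, the layout (128, 128):(128, 1) can be evaluated in plain C++. The `Layout2D` struct and the `gC_layout` name are hypothetical stand-ins for illustration, not CuTe types.

```cpp
// Hypothetical helper, not part of CuTe: evaluates a rank-2 layout
// Shape:Stride at a logical coordinate (m, n).
struct Layout2D {
    int shape_m, shape_n;    // extent of each mode
    int stride_m, stride_n;  // element step of each mode
    constexpr int operator()(int m, int n) const {
        return m * stride_m + n * stride_n;  // dot(coord, stride)
    }
};

// (128, 128):(128, 1), the row-major gC tile.
constexpr Layout2D gC_layout{128, 128, 128, 1};

static_assert(gC_layout(0, 0) == 0, "origin maps to offset 0");
static_assert(gC_layout(1, 0) == 128, "one row down skips 128 elements");
static_assert(gC_layout(2, 5) == 261, "2*128 + 5*1");
```

Swapping the strides to (1, 128) would describe the same shape stored column-major, which is why the stride pair, not the shape, carries the memory order.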

Why Layout Algebra?

partition_C needs to answer: "Given that thread 0 owns M-positions {0,1,2,3,64,65,66,67} and N-positions {0,1,2,3,64,65,66,67}, what is the layout that maps a flat index 0..63 to the correct memory offsets?"

Instead of computing 64 offsets explicitly, CuTe composes layout functions. The output is a new layout (pure metadata, no data movement) whose strides encode the answer.

Three Primitive Operations

logical_divide(A, B)

Splits layout A using tiler B.
Math: A ∘ (B, B*) where B* = complement(B, size(A)), the layout that reaches the positions B leaves out.

Shape transforms: (M) → (TileM, RestM)

Like reshape(128) → (64, 2) but respecting the stride structure
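The reshape analogy can be sketched for a single mode. This is a simplified model that assumes a contiguous tiler B = 64:1 (the pipeline below uses a permuted tiler instead); the function names are hypothetical and not CuTe API.

```cpp
// Hypothetical 1-D model of logical_divide, not the CuTe implementation.
// A  = 128:s   one mode of size 128 with element stride s
// B  = 64:1    assumed contiguous tiler, as in the reshape analogy
// B* = 2:64    complement of B in 128: the two tile origins 0 and 64

// (B, B*) maps (tile_idx, rest_idx) back to the original coordinate in [0, 128).
constexpr int original_coord(int tile_idx, int rest_idx) {
    return tile_idx * 1 + rest_idx * 64;
}

// A composed with (B, B*): the divided layout (64, 2):(1*s, 64*s).
constexpr int divided_offset(int tile_idx, int rest_idx, int s) {
    return original_coord(tile_idx, rest_idx) * s;
}

static_assert(divided_offset(5, 0, 128) == 5 * 128, "tile 0, element 5 is row 5");
static_assert(divided_offset(5, 1, 128) == 69 * 128, "tile 1, element 5 is row 69");
static_assert(divided_offset(63, 1, 1) == 127, "last element with unit stride");
```

The divided strides are just the tiler and complement strides scaled by the base stride, which is the sense in which the split respects the stride structure.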

zipped_divide(A, B)

= logical_divide + zip: groups all tile dims together and all rest dims together.

((TileM,RestM),(TileN,RestN))
→ ((TileM,TileN),(RestM,RestN))

Makes the "atom" part and "rest" part independently addressable
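A small sketch of the regrouping, assuming a contiguous 64×64 tile of the row-major 128×128 tensor. `Coord2`, `logical_form` and `zipped_form` are hypothetical names; plain structs stand in for CuTe coordinates.

```cpp
// Hypothetical model of the zip step; the memory offset never changes,
// only the grouping of the coordinate does.
struct Coord2 { int m, n; };

// logical_divide form: coordinate ((tile_m, rest_m), (tile_n, rest_n)).
constexpr int logical_form(int tile_m, int rest_m, int tile_n, int rest_n) {
    return (tile_m + 64 * rest_m) * 128 + (tile_n + 64 * rest_n) * 1;
}

// zipped_divide form: coordinate ((tile_m, tile_n), (rest_m, rest_n)).
constexpr int zipped_form(Coord2 tile, Coord2 rest) {
    return logical_form(tile.m, rest.m, tile.n, rest.n);
}

static_assert(zipped_form(Coord2{3, 5}, Coord2{0, 0}) == 3 * 128 + 5,
              "element (3,5) inside tile (0,0)");
static_assert(zipped_form(Coord2{3, 5}, Coord2{1, 0}) == 67 * 128 + 5,
              "same element, next tile in M");
```

With the zipped grouping, the first argument alone addresses a position inside the tile and the second alone picks the tile, so either can be transformed or sliced without touching the other.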

compose(A, B)

Function composition: new_layout(x) = A(B(x))

Transforms coordinate space without touching data.

Like: B maps (thread, value) → atom coords; A maps atom coords → memory
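To make the composition concrete, here is a sketch with two layouts written as plain functions. `A`, `B` and `A_of_B` are hypothetical stand-ins; the numbers come from the M mode of gC and the permutation used later in this walkthrough.

```cpp
// Hypothetical stand-ins for two layouts, written as plain functions.
constexpr int A(int m) { return m * 128; }               // 128:128, the M mode of gC
constexpr int B(int f, int r) { return f * 4 + r * 1; }  // (16, 4):(4, 1)

// compose(A, B): feed B's output into A.
constexpr int A_of_B(int f, int r) { return A(B(f, r)); }

// The composed function behaves like the layout (16, 4):(512, 128).
static_assert(A_of_B(1, 0) == 512, "stride of f is 4 * 128");
static_assert(A_of_B(0, 1) == 128, "stride of r is 1 * 128");
static_assert(A_of_B(3, 2) == 3 * 512 + 2 * 128, "linear in (f, r)");
```

Because both functions are linear in their coordinates, the composition is again a shape and stride pair, so no table of offsets is ever materialized.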

📥 Concrete Inputs (Step 1 Example)

The Tensor

gC = (128, 128):(128, 1)
Shape (128, 128): 128 rows × 128 cols
Stride (128, 1): row-major, f(m,n) = m*128 + n

The TiledMma

Atom: 1×1×1 scalar FMA
AtomLayoutC: (1, 1):(0, 0) ← TV layout: 1 thread, 1 value per atom
atoms_layout: (16, 16, 1):(16, 1, 0) ← 256 threads as a 16×16 M×N grid
permutation_M: (16, 4):(4, 1)
permutation_N: (16, 4):(4, 1)
product = 16×4 = 64 positions per dim
128/64 = 2 repetitions needed

What does permutation (16, 4):(4, 1) mean?

It maps a 2D coordinate (f, r) where f∈[0,16), r∈[0,4) to position: f*4 + r*1

For thread 0 (f=0): positions are {0, 1, 2, 3} — the first 4 consecutive elements.

For thread 1 (f=1): positions are {4, 5, 6, 7} — the next 4 elements.

This covers 64 positions total (0..63). The tile is 128, so there's a "Rest" factor of 2 for positions {64..127}.
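The positions each thread-group reaches along one dimension can be sketched as a single function. `position` is a hypothetical helper that combines the permutation (16, 4):(4, 1) with the Rest factor of 2 at stride 64.

```cpp
// Hypothetical helper: the position along one dimension (M or N) selected by
// thread-group f in [0,16), element r in [0,4), and repetition rest in [0,2).
constexpr int position(int f, int r, int rest) {
    return f * 4 + r * 1 + rest * 64;  // permutation strides (4, 1), Rest stride 64
}

static_assert(position(0, 0, 0) == 0 && position(0, 3, 0) == 3, "thread 0: {0..3}");
static_assert(position(1, 0, 0) == 4 && position(1, 3, 0) == 7, "thread 1: {4..7}");
static_assert(position(0, 0, 1) == 64 && position(0, 3, 1) == 67,
              "thread 0 again at {64..67}");
static_assert(position(15, 3, 1) == 127, "last thread-group reaches the end of the tile");
```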

🔄 The 5-Step Pipeline

① logical_divide — Split Layout by Permutation

C++ Source (mma_atom.hpp:256-258)

auto t_tile = make_tile(permutation_mnk<0>(),    // permutation_M
                        permutation_mnk<1>());   // permutation_N
auto t_tensor = logical_divide(ctensor, t_tile); // (PermM, PermN)

What It Does

Splits each dimension of gC according to the permutation layout. The permutation (16,4):(4,1) covers 64 of the 128 positions, leaving a "rest" factor of 2.

Math: A ∘ (B, B*)

For the M dimension (size 128, stride 128):

B = permutation_M = (16, 4):(4, 1), which covers 64 positions: {0,4,8,...,60, 1,5,...,61, 2,...,62, 3,...,63}
B* = complement(B, 128) = (2):(64), the origins of the 2 tiles: {0, 64}
(B, B*) = ((16, 4), 2):((4, 1), 64), a full factorization of the 128 positions
Compose with M-stride = 128:
stride((16, 4), 2) = ((4×128, 1×128), 64×128) = ((512, 128), 8192)

For the N dimension (size 128, stride 1):

Same structure, but the base stride is 1 instead of 128:
stride((16, 4), 2) = ((4×1, 1×1), 64×1) = ((4, 1), 64)

Input: (128, 128):(128, 1)

Output: ((16,4,2), (16,4,2)):((512,128,8192), (4,1,64)), writing each nested mode ((16,4),2) flat as (16,4,2) for readability

Meaning: M split → (F=16 thread-groups, R=4 consecutive, Rest=2 repetitions)
N split → (F=16 thread-groups, R=4 consecutive, Rest=2 repetitions)
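One way to sanity-check the step 1 output is to confirm that the divided strides name the same memory offset as the original layout for every coordinate of a mode. The sketch below is plain C++ with hypothetical helper names, not CuTe code.

```cpp
// Hypothetical check: for one mode, the divided strides must reproduce the
// original offset of the permuted position.

// Position reached through the permutation (16,4):(4,1) and its complement 2:64.
constexpr int position(int f, int r, int rest) { return f * 4 + r * 1 + rest * 64; }

// True if strides (sf, sr, srest) agree with base_stride * position(...) everywhere.
constexpr bool mode_agrees(int sf, int sr, int srest, int base_stride) {
    for (int f = 0; f < 16; ++f)
        for (int r = 0; r < 4; ++r)
            for (int rest = 0; rest < 2; ++rest)
                if (f * sf + r * sr + rest * srest != position(f, r, rest) * base_stride)
                    return false;
    return true;
}

static_assert(mode_agrees(512, 128, 8192, 128), "M mode: (16,4,2):(512,128,8192)");
static_assert(mode_agrees(4, 1, 64, 1), "N mode: (16,4,2):(4,1,64)");
```

Checking each mode separately is enough here because the full offset is the sum of an M term and an N term.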

Visualization: M-dimension split

[Figure: the 128 M-positions factored as (16 groups × 4 consecutive) × 2 repetitions. Legend: Thread 0 (F=0), Thread 1 (F=1), Thread 15 (F=15); the Rest boundary falls at position 64.]

② zipped_divide — Group Atom Dimensions Together

C++ Source (mma_atom.hpp:261-263)

auto c_tile = make_tile(make_layout(size<0>(AtomShape_MNK{})),   // AtomM = 1
                        make_layout(size<1>(AtomShape_MNK{})));  // AtomN = 1
auto c_tensor = zipped_divide(t_tensor, c_tile);                 // ((AtomM,AtomN),(RestM,RestN))

What It Does

Groups "atom" dimensions (the elements consumed by one MMA instruction) into one mode, and "rest" dimensions (how many atom invocations are needed) into another.

Why?

Step 3 needs to apply the TV (Thread-Value) layout within the atom. By zipping atom dimensions together, the atom mode becomes a self-contained unit we can transform independently.

Derivation

c_tile = (1, 1): the atom covers 1×1 elements (scalar FMA)
zipped_divide splits by (1, 1), which is trivial:
  Each (16,4,2) subgroup in M: atom takes 1, rest gets (16,4,2)
  Each (16,4,2) subgroup in N: atom takes 1, rest gets (16,4,2)
Then zip: atom parts → mode 0, rest parts → mode 1

Input: ((16,4,2), (16,4,2)):((512,128,8192), (4,1,64))

Output: ((1, 1), (16,4,2, 16,4,2)):((★, ★), (512,128,8192, 4,1,64))

Modes: Mode 0 = Atom (1,1): one MMA invocation processes 1 element
Mode 1 = Rest (16,4,2,16,4,2): 16384 total atom invocations to cover the tile

★ Note: Atom strides are "don't care" since atom is size 1×1.

For a tensor core atom (e.g., 16×8×8 HMMA), the atom mode would be (16, 8) with meaningful strides, and the rest would be smaller. The structure of the algorithm stays the same — only the numbers change. This is the power of CuTe's generality.

③ compose — Transform Atom from (M,N) to (Thread, Value)

C++ Source (mma_atom.hpp:265-266)

// Transform the Atom mode from (M,N) to (Thr,Val)
auto tv_tensor = c_tensor.compose(AtomLayoutC_TV{}, _);  // ((ThrV,FrgV),(RestM,RestN))

What It Does

Applies the atom's TV (Thread-Value) layout to mode 0 (the atom). This re-labels atom coordinates from spatial (M, N) to functional (which thread, which value within that thread).

Why This Step Exists

The atom layout knows how a hardware MMA instruction distributes its inputs/outputs across threads. For example, a tensor core MMA might spread a 16×8 output across 32 threads, each holding 4 values. The TV layout encodes that hardware-specific mapping.

For our scalar FMA: 1 thread computes 1 value, so TV = (1,1):(0,0) — trivial.

Derivation

AtomLayoutC_TV = (1, 1):(0, 0)
  1 thread (ThrV), 1 value per atom (FrgV = Fragment Value)
compose(atom_mode=(1,1), TV=(1,1)) → (ThrV=1, FrgV=1)
This is just relabeling, with no numeric change.
The "_" in compose(..., _) means: leave mode 1 (Rest) untouched.

Input: ((1, 1), (16,4,2, 16,4,2))

Output: ((ThrV=1, FrgV=1), (16,4,2, 16,4,2))

Strides: ((0, 0), (512,128,8192, 4,1,64)) — unchanged numerically

Key: ThrV = threads within atom (intra-atom thread index)
FrgV = values per thread within atom (fragment values)

For a 16×8 tensor core (HMMA): AtomLayoutC_TV would be e.g., (32, 4):(some_strides) — 32 threads each holding 4 values. Then compose would transform (16,8) → (32, 4) with non-trivial strides derived from the hardware's register layout. This is where CuTe encodes hardware knowledge.

④ zipped_divide — Separate Thread-Indexing from Thread-Owned

C++ Source (mma_atom.hpp:268-272)

// Tile the tensor for the C-threads
auto thr_tile = make_tile(_,                                                 // keep ThrV as-is
                          make_tile(make_layout(size<1>(thr_layout_vmnk_)),  // ThrM = 16
                                    make_layout(size<2>(thr_layout_vmnk_)))); // ThrN = 16
auto thr_tensor = zipped_divide(tv_tensor, thr_tile);  // ((ThrV,(ThrM,ThrN)),(FrgV,(RestM,RestN)))

What It Does

The "Rest" mode from step 3 contains both thread-indexing dimensions (which of the 16×16 threads) and per-thread-owned dimensions (the (4,2) × (4,2) = 64 elements each thread computes). This step separates them.

Why?

After this step, mode 0 = "which thread am I?" and mode 1 = "what data do I own?". The final slice (step 5) uses mode 0 to pick a thread and returns mode 1 as the result.

Derivation

thr_tile for mode 1 (Rest) = (ThrM=16, ThrN=16)
Mode 1 was: (16, 4, 2, 16, 4, 2):(512, 128, 8192, 4, 1, 64), where the first three modes are the M-part and the last three are the N-part
Divide M-rest (16, 4, 2) by ThrM=16:
• Tile: 16 positions (strides from F dimension) → stride 512
• Rest: (4, 2) → strides (128, 8192)
Divide N-rest (16, 4, 2) by ThrN=16:
• Tile: 16 positions → stride 4
• Rest: (4, 2) → strides (1, 64)
Zip → thread parts together, fragment parts together:

Input: ((ThrV=1, FrgV=1), (16,4,2, 16,4,2))

Output: ((1, (16, 16)), (1, (4,2), (4,2)))

Strides: ((0, (512, 4)), (0, (128,8192), (1,64)))

Meaning: Mode 0 = Thread selector: (ThrV=1, (ThrM=16, ThrN=16)) — picks one of 256 threads
Mode 1 = Fragment: (FrgV=1, (RestM=(4,2)), (RestN=(4,2))) — 64 elements this thread owns

Thread Indexing Strides Explained

Thread (tm, tn) where tm ∈ [0,16), tn ∈ [0,16): offset = tm × 512 + tn × 4
tm=0, tn=0: offset = 0 → row 0, col 0
tm=0, tn=1: offset = 4 → row 0, col 4
tm=1, tn=0: offset = 512 → row 4, col 0 (512/128 = row 4)
tm=1, tn=1: offset = 516 → row 4, col 4

Thread 0 (tm=0, tn=0) starts at (row=0, col=0)
Thread 1 (tm=0, tn=1) starts at (row=0, col=4)
Thread 16 (tm=1, tn=0) starts at (row=4, col=0)
Thread 255 (tm=15, tn=15) starts at (row=60, col=60)
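The thread-selector arithmetic can be sketched as follows. The helper names are hypothetical, and the flat index is assumed to decompose as thread_idx = tm × 16 + tn, matching the atoms_layout strides (16, 1, 0) given in the inputs.

```cpp
// Hypothetical helpers mirroring the thread-selector strides (512, 4) from step 4.
struct RowCol { int row, col; };

// Flat thread index -> base offset, assuming thread_idx = tm * 16 + tn.
constexpr int thread_base_offset(int thread_idx) {
    return (thread_idx / 16) * 512 + (thread_idx % 16) * 4;
}

// Offset -> (row, col) in the row-major 128x128 tile.
constexpr RowCol to_row_col(int offset) { return RowCol{offset / 128, offset % 128}; }

static_assert(thread_base_offset(1) == 4, "thread 1 starts at row 0, col 4");
static_assert(thread_base_offset(16) == 512, "thread 16 starts at row 4, col 0");
static_assert(to_row_col(thread_base_offset(255)).row == 60, "thread 255 row");
static_assert(to_row_col(thread_base_offset(255)).col == 60, "thread 255 col");
```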

⑤ slice — Pick One Thread's Fragment

C++ Source (mma_atom.hpp:469-472)

// In partition_C:
auto thr_vmn = make_coord(get<0>(thr_vmnk_),               // thr_v = 0
                          make_coord(get<1>(thr_vmnk_),    // thr_m
                                     get<2>(thr_vmnk_)));  // thr_n
return thr_tensor(thr_vmn, make_coord(_, repeat<...>(_))); // → fragment view

What It Does

Plugs the concrete thread coordinates into mode 0 (the thread selector). This yields a pointer offset for this thread, and the remaining mode 1 (the fragment) becomes the result layout.

Derivation for Thread 0

Thread 0: thr_vmnk = (0, 0, 0, 0) → thr_vmn = (v=0, (m=0, n=0))
Pointer offset = 0×0 + 0×512 + 0×4 = 0 (no shift: thread 0 starts at gC's base)
Remaining layout (mode 1) = the fragment: (1, (4,2), (4,2)):(0, (128,8192), (1,64))

Input: ((1,(16,16)), (1,(4,2),(4,2))) — full thrfrg_C result

Slice: thread_coord = (0, (0, 0)), fragment_coord = (_, (_, _), (_, _))

Output: (1, (4,2), (4,2)):(0, (128,8192), (1,64)) — thread 0's view of gC
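Slicing can be read as partial application: fix the thread coordinate, fold its contribution into a base offset, and keep the fragment layout as the remaining function. The sketch below models only this scalar-FMA example; `FragmentView` and `slice_thread` are hypothetical names, not CuTe types.

```cpp
// Hypothetical model of thr_tensor after step 4, for the scalar-FMA example only.
// Mode 0 (thread selector) strides: (0, (512, 4)).
// Mode 1 (fragment) strides:        (0, (128, 8192), (1, 64)).
struct FragmentView {
    int base;  // pointer offset contributed by the fixed thread coordinate
    // Evaluate the fragment layout (1,(4,2),(4,2)):(0,(128,8192),(1,64)).
    constexpr int operator()(int v, int m0, int m1, int n0, int n1) const {
        return base + v * 0 + m0 * 128 + m1 * 8192 + n0 * 1 + n1 * 64;
    }
};

// "Slicing": fix (thr_v, (thr_m, thr_n)) and keep every fragment mode free.
constexpr FragmentView slice_thread(int thr_v, int thr_m, int thr_n) {
    return FragmentView{thr_v * 0 + thr_m * 512 + thr_n * 4};
}

static_assert(slice_thread(0, 0, 0).base == 0, "thread 0 starts at gC's base");
static_assert(slice_thread(0, 0, 0)(0, 1, 1, 2, 1) == 65 * 128 + 66, "row 65, col 66");
static_assert(slice_thread(0, 1, 1)(0, 0, 0, 0, 0) == 516, "thread (1,1): row 4, col 4");
```

Every thread gets the same fragment strides; only the base offset differs, which is why the result is a view and no data moves.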

What the Output Strides Mean Physically

tCgC[v, (m0,m1), (n0,n1)] → gC offset = v×0 + m0×128 + m1×8192 + n0×1 + n1×64

The 64 elements thread 0 owns (m0∈[0,4), m1∈[0,2), n0∈[0,4), n1∈[0,2)):
(m0,m1,n0,n1) → (row, col) = (m0 + m1×64, n0 + n1×64)

m1	m0	n1=0, n0=0..3	n1=1, n0=0..3
0	0	(0,0) (0,1) (0,2) (0,3)	(0,64) (0,65) (0,66) (0,67)
0	1	(1,0) (1,1) (1,2) (1,3)	(1,64) (1,65) (1,66) (1,67)
0	2	(2,0) (2,1) (2,2) (2,3)	(2,64) (2,65) (2,66) (2,67)
0	3	(3,0) (3,1) (3,2) (3,3)	(3,64) (3,65) (3,66) (3,67)
1	0	(64,0) (64,1) (64,2) (64,3)	(64,64) (64,65) (64,66) (64,67)
1	1	(65,0) (65,1) (65,2) (65,3)	(65,64) (65,65) (65,66) (65,67)
1	2	(66,0) (66,1) (66,2) (66,3)	(66,64) (66,65) (66,66) (66,67)
1	3	(67,0) (67,1) (67,2) (67,3)	(67,64) (67,65) (67,66) (67,67)
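The (row, col) listing for thread 0 can be re-derived mechanically from the fragment strides. This is a sketch with hypothetical helper names that decodes each offset back into a row and column of the row-major tile.

```cpp
// Hypothetical check of thread 0's fragment (1,(4,2),(4,2)):(0,(128,8192),(1,64)).
constexpr int fragment_offset(int m0, int m1, int n0, int n1) {
    return m0 * 128 + m1 * 8192 + n0 * 1 + n1 * 64;
}

// Positions thread 0 owns along either dimension: {0..3} and {64..67}.
constexpr bool owned_by_thread0(int p) {
    return (p >= 0 && p < 4) || (p >= 64 && p < 68);
}

// Decode every fragment offset and compare with (m0 + m1*64, n0 + n1*64).
constexpr bool fragment_matches_listing() {
    for (int m1 = 0; m1 < 2; ++m1)
        for (int m0 = 0; m0 < 4; ++m0)
            for (int n1 = 0; n1 < 2; ++n1)
                for (int n0 = 0; n0 < 4; ++n0) {
                    const int off = fragment_offset(m0, m1, n0, n1);
                    const int row = off / 128;
                    const int col = off % 128;
                    if (row != m0 + m1 * 64 || col != n0 + n1 * 64) return false;
                    if (!owned_by_thread0(row) || !owned_by_thread0(col)) return false;
                }
    return true;
}

static_assert(fragment_matches_listing(), "all 64 offsets decode as listed");
static_assert(fragment_offset(3, 1, 3, 1) == 67 * 128 + 67, "last owned element");
```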

✓ Verification: 16×16 Thread Ownership Grid

Each of the 256 threads owns four 4×4 blocks, located at (tm×4 + i×64, tn×4 + j×64) for i, j ∈ {0, 1}. Four 4×4 blocks = 64 elements.

[Figure: thread ownership for the first 16×16 corner of the 128×128 C tile. Each colored 4×4 block is one thread's first sub-block, labeled with its thread number. The other three sub-blocks, shifted by 64 in M, in N, or in both, follow the same pattern.]

Element Count Check

Thread 0 owns: 1 × (4×2) × (4×2) = 1 × 8 × 8 = 64 elements
256 threads × 64 = 16,384 = 128 × 128 ✓
No element is claimed by two threads (permutation is a bijection).
No element is unclaimed (complement covers the remainder).
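The no-overlap and no-gap claims can be checked per dimension. Since every offset is an M term plus an N term, it is enough to show that 16 thread-groups × 8 values hit each of the 128 positions of one dimension exactly once. The sketch below uses hypothetical helper names and is not CuTe code.

```cpp
// Position along one dimension for thread-group t, element r, repetition rest.
constexpr int position(int t, int r, int rest) { return t * 4 + r * 1 + rest * 64; }

// Hypothetical coverage check: count how often each of the 128 positions is hit.
constexpr bool covers_dimension_exactly_once() {
    int hits[128] = {};
    for (int t = 0; t < 16; ++t)
        for (int rest = 0; rest < 2; ++rest)
            for (int r = 0; r < 4; ++r)
                ++hits[position(t, r, rest)];
    for (int p = 0; p < 128; ++p)
        if (hits[p] != 1) return false;  // 0 would be a gap, 2 or more an overlap
    return true;
}

static_assert(covers_dimension_exactly_once(),
              "16 thread-groups x 8 values tile 128 positions with no gap or overlap");
```

Applying the same argument to M and to N gives 128 × 128 distinct (row, col) pairs, matching the 256 × 64 count.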

🎯 Summary: Why Each Step Exists

Step	Operation	Problem It Solves	Output Shape
①	logical_divide	Factorize the tile according to the thread permutation — separate "which group" from "position within group"	((16,4,2),(16,4,2))
②	zipped_divide	Isolate the atom's footprint (what one MMA instruction touches) from the rest (how many times we repeat)	((1,1),(16,4,2,16,4,2))
③	compose	Re-label atom coordinates from spatial (M,N) to functional (Thread, Value) using hardware knowledge	((ThrV,FrgV),(Rest...))
④	zipped_divide	Separate thread-indexing dimensions from per-thread fragment dimensions across the "rest" mode	((Thr),(Frg))
⑤	slice	Plug in a concrete thread ID and return just that thread's fragment layout	(1,(4,2),(4,2))

The Design Principle

CuTe decomposes the "which thread owns which elements" problem into composable algebraic primitives — divide, zip, compose, slice — each handling one concern:

Spatial structure (how elements are arranged in memory)
Hardware structure (how the MMA instruction maps threads to elements)
Tiling structure (how many atoms tile across the output)

By keeping these orthogonal, the same code works for a scalar FMA atom, a 16×8×8 tensor core, or a 64×256×32 Blackwell UMMA: only the atom's shape and TV layout change. The algorithm itself does not depend on the hardware.
