Video Diffusion · Efficiency Research · 6 papers, 2 lineages

One Goal, Two Lineages, a Fork at the End

Six papers chase one goal — cheaply predict which attention blocks matter, then skip the rest. The video-specialized SVG line predicts with clustering centroids; the universal Sparge line predicts with a fast attention-map approximation. The newest paper forks the field: training-free vs. trainable.

SVG line: video-specialized, centroid-based
Sparge line: universal, prediction/mask-based
Shared primitive: predict importance, skip

1 The Unifying Primitive — and Where It Splits

Within the SVG line, video latents are deeply redundant — neighboring frames and regions produce near-identical tokens — so one trick (cluster → centroid) is mined four different ways. The Sparge line reaches the same "predict importance cheaply" goal with a different estimator, so the centroid is the SVG line's signature, not a universal law.

[Diagram: redundant tokens → K-means → centroid, which feeds four uses: criticality, error routing, compensation, 2-bit residual]

  • Estimate criticality (SVG2): centroid ≈ cheap proxy for a block's attention score
  • Estimate & route by error (EAR): compare centroid-vs-exact to find blocks that reconstruct worst
  • Compensate skipped blocks (EAR): replace dropped keys/values with the centroid; parameter-free
  • Quantize the residual (QVG): subtract centroid → tiny leftover squeezes to 2 bits
  • Different estimator entirely (Sparge): the Sparge line skips clustering and predicts the attention map directly (v1) or learns a mask (v2)
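The "centroid as criticality proxy" idea can be sketched in a few lines of NumPy. This is an illustrative sketch, not SVG2's kernel: the function names `kmeans` and `select_clusters`, the size-weighted softmax estimate, and the `top_p` budget are all assumptions made for the example.

```python
import numpy as np

def kmeans(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means (hypothetical helper); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(0)
    # Final assignment so labels match the returned centroids.
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, dist.argmin(1)

def select_clusters(q, centroids, counts, top_p=0.9):
    """Estimate each key cluster's attention mass from its centroid alone,
    then keep the smallest set of clusters whose estimated mass reaches top_p."""
    logits = centroids @ q / np.sqrt(q.shape[-1])
    weights = counts * np.exp(logits - logits.max())  # one centroid stands for `count` keys
    mass = weights / weights.sum()
    order = np.argsort(-mass)
    cum = np.cumsum(mass[order])
    n_keep = min(len(order), int(np.searchsorted(cum, top_p)) + 1)
    return order[:n_keep], mass

# Toy usage: 256 random keys, 8 clusters, one query.
rng = np.random.default_rng(0)
keys = rng.normal(size=(256, 16))
query = rng.normal(size=16)
centroids, labels = kmeans(keys, k=8)
counts = np.bincount(labels, minlength=8)
keep, mass = select_clusters(query, centroids, counts)
print("clusters kept:", keep.tolist())
```

The point of the sketch is the cost profile: scoring 8 centroids replaces scoring 256 keys, and only the clusters in `keep` would then be attended exactly.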

2 Two Lineages, Each Re-Aiming Its Own Machinery

Not one story but two parallel ones. The SVG line keeps refining the centroid; the Sparge line keeps refining direct prediction. What changes from card to card is each paper's new contribution.

▌ SVG LINE · video-specialized · centroid-based · training-free
ICML 2025

SVG 1

Sparse attention · bidirectional
Sparsity exists & has structure
  • Heads are spatial (within frame) or temporal (across frames)
  • Online profiling on ~1% of rows
  • Layout transform → tensor-core friendly
~2.3× faster
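The online-profiling bullet can be illustrated with a rough sketch: sample a small fraction of query rows, compute their dense output, and see which candidate mask (within-frame or across-frame) reproduces it better. The frame-major token layout, the exact mask shapes, and the name `classify_head` are assumptions for this example, not details confirmed by the card above.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def classify_head(q, k, v, frames, per_frame, sample_frac=0.01, seed=0):
    """Label one head 'spatial' or 'temporal' from a small sample of query rows.
    Assumes tokens are stored frame-major: index = frame * per_frame + position."""
    n, d = q.shape
    assert n == frames * per_frame
    rng = np.random.default_rng(seed)
    rows = rng.choice(n, size=max(1, int(n * sample_frac)), replace=False)
    idx = np.arange(n)
    groups = {
        "spatial": idx // per_frame,   # same frame
        "temporal": idx % per_frame,   # same position across frames
    }
    logits = q[rows] @ k.T / np.sqrt(d)
    dense = softmax(logits) @ v        # reference output on sampled rows only
    errors = {}
    for name, gid in groups.items():
        allowed = gid[rows][:, None] == gid[None, :]
        sparse = softmax(np.where(allowed, logits, -np.inf)) @ v
        errors[name] = float(((sparse - dense) ** 2).mean())
    return min(errors, key=errors.get), errors

# Toy head whose queries and keys only encode the frame index,
# so attention concentrates within each frame.
rng = np.random.default_rng(0)
frames, per_frame = 8, 16
frame_id = np.arange(frames * per_frame) // per_frame
q = k = 5.0 * np.eye(frames)[frame_id]
v = rng.normal(size=(frames * per_frame, frames))
kind, errors = classify_head(q, k, v, frames, per_frame, sample_frac=0.1)
print(kind, errors)
```

Because only the sampled rows are evaluated densely, the profiling cost scales with `sample_frac` rather than with the full attention map.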
NeurIPS 2025

SVG 2

Sparse attention · bidirectional
Structure is semantic, not positional
  • K-means clusters replace position blocks
  • Lossless permutation → dense blocks
  • Centroid cache: K-means 76× cheaper
~2.3× faster
2026

SVG-EAR

Sparse attention + compensation
Optimize reconstruction error, not score
  • Route by expected error, not attention mass
  • Don't drop skipped blocks — compensate
  • Has a theoretical MSE bound
1.6–1.9× faster
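A minimal sketch of "compensate instead of drop", under stated assumptions: each skipped cluster is replaced by a single centroid key/value pair, weighted by the cluster size inside the softmax. The size weighting and the names `attend` and `sparse_attend` are assumptions for illustration; the card above only states that skipped blocks are replaced by their centroid.

```python
import numpy as np

def attend(q, k, v, weight=None):
    """Single-query softmax attention. `weight` lets one row stand for many tokens."""
    logits = k @ q / np.sqrt(q.shape[-1])
    w = np.exp(logits - logits.max())
    if weight is not None:
        w = w * weight
    return (w[:, None] * v).sum(axis=0) / w.sum()

def sparse_attend(q, k, v, labels, keep, compensate):
    """Attend exactly to clusters in `keep`; drop or centroid-compensate the rest."""
    kept = np.isin(labels, keep)
    ks, vs, ws = [k[kept]], [v[kept]], [np.ones(int(kept.sum()))]
    if compensate:
        for c in np.unique(labels[~kept]):
            members = labels == c
            ks.append(k[members].mean(0, keepdims=True))    # centroid key
            vs.append(v[members].mean(0, keepdims=True))    # centroid value
            ws.append(np.array([members.sum()], dtype=float))
    return attend(q, np.concatenate(ks), np.concatenate(vs), np.concatenate(ws))

# Toy data: 4 tight clusters of 32 tokens; only cluster 0 is computed exactly.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(4), 32)
k = np.eye(4)[labels] + 0.01 * rng.normal(size=(128, 4))
v = np.eye(4)[labels] + 0.01 * rng.normal(size=(128, 4))
q = np.array([1.0, 0.5, 0.0, 0.0])

exact = attend(q, k, v)
err_drop = float(np.linalg.norm(sparse_attend(q, k, v, labels, [0], False) - exact))
err_comp = float(np.linalg.norm(sparse_attend(q, k, v, labels, [0], True) - exact))
print(f"drop error {err_drop:.4f}, compensated error {err_comp:.4f}")
```

In this toy setting the skipped clusters still carry substantial attention mass, which is exactly where dropping hurts and a centroid stand-in helps. Comparing the same centroid-vs-exact gap per cluster is also one way to think about error-based routing.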
ICML 2026

Quant VideoGen

KV-cache quant · autoregressive
Same atom solves memory, not compute
  • Centroid-subtract → 2-bit residuals
  • Progressive recursive quantization
  • 8B model on a single RTX 4090
~7× smaller
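The centroid-subtract step can be sketched as follows. This is a simplified illustration that assumes cluster labels are already available and uses plain per-tensor uniform 2-bit quantization; it does not model the progressive recursive scheme mentioned above, and `quantize_dequantize` is a hypothetical helper.

```python
import numpy as np

def quantize_dequantize(x, bits=2):
    """Uniform per-tensor quantization to 2**bits levels; returns (reconstruction, codes)."""
    levels = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes * scale + lo, codes

# Toy KV-cache slice: 8 clusters of 64 keys, tight around well-spread centers.
rng = np.random.default_rng(0)
n_clusters, per_cluster, dim = 8, 64, 16
labels = np.repeat(np.arange(n_clusters), per_cluster)
centers = rng.uniform(-4.0, 4.0, size=(n_clusters, dim))
keys = centers[labels] + 0.1 * rng.normal(size=(labels.size, dim))

# Centroids stay in full precision; only the residual is squeezed to 2 bits.
centroids = np.stack([keys[labels == j].mean(0) for j in range(n_clusters)])
residual = keys - centroids[labels]

direct, _ = quantize_dequantize(keys)            # 2-bit on raw keys
res_hat, codes = quantize_dequantize(residual)   # 2-bit on residuals
via_centroid = centroids[labels] + res_hat

mse_direct = float(((direct - keys) ** 2).mean())
mse_residual = float(((via_centroid - keys) ** 2).mean())
print(f"direct 2-bit MSE {mse_direct:.5f}, centroid + 2-bit residual MSE {mse_residual:.5f}")
```

The residual has a much narrower range than the raw keys, so the same four quantization levels are spaced far more finely. The cost is storing one full-precision centroid and one label per cluster or token.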
▌ SPARGE LINE · universal (any model) · prediction/mask-based · the training-free→trainable fork
ICML 2025

SpargeAttention

Universal sparse attn · any model
Predict the map, skip — no clustering, no fixed pattern
  • Stage 1: fast attention-map prediction
  • Stage 2: no-overhead softmax-aware filter
  • Sparse and quantized; still training-free
universal · training-free
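Stage 1 can be pictured as computing attention on a heavily downsampled problem. The sketch below assumes block-mean pooling of queries and keys as the cheap estimator; that choice, the block size, and the name `predict_block_map` are assumptions for illustration rather than the method's confirmed internals.

```python
import numpy as np

def predict_block_map(q, k, block=64):
    """Coarse (query-block x key-block) attention map from mean-pooled blocks."""
    n_q, d = q.shape
    n_k = k.shape[0]
    assert n_q % block == 0 and n_k % block == 0
    q_blocks = q.reshape(n_q // block, block, d).mean(axis=1)
    k_blocks = k.reshape(n_k // block, block, d).mean(axis=1)
    logits = q_blocks @ k_blocks.T / np.sqrt(d)
    logits = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q = rng.normal(size=(256, 32))
k = rng.normal(size=(256, 32))
block_map = predict_block_map(q, k, block=64)
print(block_map.shape)
```

With a block size of 64, the predicted map has 64 × 64 = 4096 times fewer entries than the full attention map, which is what makes the prediction cheap enough to run before every attention call.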
2026

SpargeAttention2

Trainable sparse attn · the fork
Make the mask trainable — break training-free for a higher ceiling
  • Hybrid Top-k + Top-p mask (covers diffuse tail)
  • Distillation fine-tune from a dense teacher
  • Trades zero-cost adoption for performance
95% sparse · 16.2×
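A hybrid Top-k + Top-p rule can be sketched on block-level attention estimates. How the two rules are combined is not stated above; this example assumes a simple union (keep a block if either rule keeps it), and `hybrid_mask` is a hypothetical name.

```python
import numpy as np

def hybrid_mask(probs, k, p):
    """probs: (rows, blocks), each row summing to 1.
    Keep a block if it is among the row's top-k OR is needed to reach mass p."""
    order = np.argsort(-probs, axis=-1, kind="stable")
    sorted_p = np.take_along_axis(probs, order, axis=-1)
    cum = np.cumsum(sorted_p, axis=-1)
    keep_top_p = (cum - sorted_p) < p   # mass accumulated before this block is still short of p
    keep_top_k = np.arange(probs.shape[-1])[None, :] < k
    keep_sorted = keep_top_p | keep_top_k
    mask = np.zeros_like(keep_sorted)
    np.put_along_axis(mask, order, keep_sorted, axis=-1)  # undo the sort
    return mask

probs = np.array([
    [0.70, 0.10, 0.10, 0.10],   # peaked row: one block dominates
    [0.25, 0.25, 0.25, 0.25],   # diffuse row: mass spread evenly
])
mask = hybrid_mask(probs, k=1, p=0.6)
print(mask.astype(int))
```

On the peaked row both rules agree on a single block. On the diffuse row, Top-k with k=1 alone would retain only a quarter of the mass, while the Top-p term extends the mask until the budget is met, which is one reading of "covers diffuse tail".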

3 Three Axes Advancing in Parallel

Read across the SVG lineage and three independent dials are each turning. The endpoint of every dial (the final entry in each progression) is the current frontier.

What gets grouped (the selection unit)
head-type (spatial / temporal) → semantic token cluster → error-weighted cluster
How you decide what to keep (the criterion)
positional heuristic → attention score / top-p → reconstruction error / cost ratio
What happens to the discarded mass
drop entirely → drop entirely → compensate with centroid
SVG1 fixed the what · SVG2 fixed the where · EAR fixed the why and the leftovers.

4 Where It's Heading — Merging the Two Branches

The two branches here are sparse attention and KV-cache quantization. They save different resources (time vs. space) but share the same costly first step. The obvious next move: cluster once, then let one decision drive both savings.

[Diagram] Cluster once → centroids (the one expensive step, shared)
  • SPARSE ATTENTION · saves TIME: skip the token comparisons that don't matter; centroid = criticality proxy + cheap stand-in
  • QUANTIZATION · saves SPACE: store the KV-cache history in low precision; centroid = baseline to subtract → tiny residual
One judgment → two savings: "unimportant" = skip the math AND keep only the average

Level 1 · Thrift

Run clustering a single time; both machines feed off the same centroids instead of paying for it twice.

Level 2 · Elegance

One "important / not" label governs both budgets — skipping a block and collapsing it to its average are the same decision.

The catch

The two branches target different model types (bidirectional vs. streaming), and stacking two approximations compounds their errors, so both must be managed under one combined quality budget.

5 The Fork: Training-Free vs. Trainable

The newest paper forces the biggest revision. With only the first five papers in view, "everything is training-free" looked like a law. SpargeAttention2 shows it was just phase one: fine-tuning the mask buys a far higher ceiling, at a cost.

Training-Free Track
SVG1/2/EAR · QVG · SpargeAttention
  • Adopt cost: zero — drop into a pretrained model
  • Ceiling: modest (~1.6–2.3× end-to-end)
  • Sells: convenience, instant deployment
  • Risk: quality gets fragile at high sparsity
Best for: accelerating an off-the-shelf model today
Trainable Track
SpargeAttention2
  • Adopt cost: one-time distillation fine-tune
  • Ceiling: high — 95% sparse, 16.2×
  • Sells: raw performance ceiling
  • Risk: fine-tuning cost + per-model effort
Best for: a provider shipping a flagship who can afford a fine-tune
Likely future: hybridize. Use a training-free method to initialize the mask, then fine-tune briefly, for most of the ceiling at a fraction of the training cost.

6 One Constant, One Broken Assumption

STILL MANDATORY: System–Algorithm Co-Design

An algorithm-only contribution is not viable in either lineage. Each ships systems pieces:

  • Centroid cache (SVG line): warm-start K-means from the previous step/chunk. Cuts clustering cost 76× (SVG2), 3× (QVG). The Sparge line has no clustering, so it leans instead on its no-overhead Stage-2 filter.
  • Custom kernel (both) — turn scattered sparsity into contiguous tensor-core tiles. Layout transform (SVG1) · dynamic block-sparse (SVG2) · fused routing (EAR) · fused dequant (QVG) · SageAttention kernels (Sparge).

BROKEN by paper 6: Training-Free Is No Longer a Law

Papers 1–5 all drop into a pretrained model with no fine-tuning — the original analysis called this the field's permanent signature.

SpargeAttention2 breaks it: a trainable mask + distillation fine-tune reaches 95% sparsity / 16.2×, far past the training-free ceiling. The field has forked into two tracks (see section 5).

Common misread: "K-means runs once for the whole video."

In the SVG line it runs at every denoising step × every layer × every head (SVG2) or every frame-chunk (QVG): thousands of times per video. It only feels free because each run warm-starts from the last and converges in a few iterations. The tell: the cache gives a finite 76× speedup, not the unbounded one a true compute-once scheme would give. (The Sparge line has no K-means at all.)
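The warm-start effect can be reproduced in a toy setting. This sketch assumes plain Lloyd's k-means with "assignments stopped changing" as the convergence test, and models consecutive denoising steps as the same tokens with a small perturbation; the function name `lloyd` and all sizes are illustrative.

```python
import numpy as np

def lloyd(x, init, max_iters=100):
    """Lloyd's k-means from given initial centroids.
    Returns (centroids, iterations used); stops once assignments repeat."""
    centroids = init.copy()
    prev = None
    for it in range(1, max_iters + 1):
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        if prev is not None and np.array_equal(labels, prev):
            return centroids, it
        prev = labels
        for j in range(len(centroids)):
            members = x[labels == j]
            if len(members):
                centroids[j] = members.mean(0)
    return centroids, max_iters

rng = np.random.default_rng(0)
k, per, d = 4, 64, 8
labels = np.repeat(np.arange(k), per)
x_prev = 10.0 * np.eye(k, d)[labels] + 0.1 * rng.normal(size=(k * per, d))
x_next = x_prev + 0.05 * rng.normal(size=x_prev.shape)  # "next step": nearly the same tokens

cached, _ = lloyd(x_prev, x_prev[::per])        # previous step's centroids (the cache)
_, warm_iters = lloyd(x_next, cached)           # warm start from the cache
cold_init = x_next[rng.choice(len(x_next), size=k, replace=False)]
_, cold_iters = lloyd(x_next, cold_init)        # cold start from random points
print("warm:", warm_iters, "cold:", cold_iters)
```

The warm start still pays for an assignment pass and a confirmation pass on every call, which is why the saving is a finite factor rather than a one-time cost. It also shows the failure mode: if `x_next` stopped resembling `x_prev`, the cached centroids would be no better than a random initialization.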

7 Trajectory & Open Tensions

Where it's heading

  • Unify attention + quant in one pipeline — cluster once, spend on both.
  • "Error-aware everything" — route quantization bit-allocation by error too.
  • Adaptive budgets (top-p, top-k, per-head density) replacing fixed sparsity ratios.
  • The field forks — training-free (convenience) vs. trainable (95% ceiling) now coexist; hybrids likely.

Tensions & open questions

  • The SVG line rests on the centroid cache, which breaks if consecutive steps stop resembling each other (long / fast-changing content, fewer sampler steps).
  • Thin theory — only EAR offers a bound; the rest are empirical.
  • High-sparsity quality is fragile: PSNR still drops (≈ 25 dB on Wan2.2-T2V).
  • Hyperparameter zoo — cluster count, per-head vs. shared, top-p are tuned, not derived.