Video Diffusion · Efficiency Research · 6 papers, 2 lineages
One Goal, Two Lineages, a Fork at the End
Six papers chase one goal — cheaply predict which attention blocks matter, then skip the rest. The video-specialized SVG line predicts with clustering centroids; the universal Sparge line predicts with a fast attention-map approximation. The newest paper forks the field: training-free vs. trainable.
SVG line: video-specialized, centroid-based
Sparge line: universal, prediction/mask-based
Shared primitive: predict importance, skip
1 The Unifying Primitive — and Where It Splits
Within the SVG line, video latents are deeply redundant — neighboring frames and regions produce near-identical tokens — so one trick (cluster → centroid) is mined four different ways. The Sparge line reaches the same "predict importance cheaply" goal with a different estimator, so the centroid is the SVG line's signature, not a universal law.
Estimate criticality: centroid ≈ cheap proxy for a block's attention score
SVG2
Estimate & route by error: compare centroid-vs-exact to find blocks that reconstruct worst
EAR
Compensate skipped blocks: replace dropped keys/values with the centroid (parameter-free)
EAR
Quantize the residual: subtract centroid → tiny leftover squeezes to 2 bits
QVG
Different estimator entirely: the Sparge line skips clustering and predicts the attention map directly (v1) or learns a mask (v2)
Sparge
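The "centroid ≈ cheap proxy" idea can be sketched in a few lines of NumPy. This is a minimal illustration, not SVG2's kernel: the function names are hypothetical, clusters are assumed non-empty, and the estimate weights each key centroid by its cluster size so that one centroid stands in for all of its members.

```python
import numpy as np

def estimated_cluster_mass(q_centroids, k_centroids, k_sizes):
    """Estimate the softmax mass each key cluster receives, using centroids only.

    q_centroids: (Cq, d), k_centroids: (Ck, d), k_sizes: (Ck,) member counts (> 0).
    Cost is O(Cq * Ck * d) instead of O(Nq * Nk * d) for the exact map.
    """
    d = q_centroids.shape[1]
    logits = q_centroids @ k_centroids.T / np.sqrt(d)
    # One centroid represents k_sizes[j] keys, so add log(size) to its logit.
    logits = logits + np.log(k_sizes)[None, :]
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def exact_cluster_mass(Q, K, labels, n_clusters):
    """Reference: exact attention probabilities summed per key cluster."""
    d = Q.shape[1]
    logits = Q @ K.T / np.sqrt(d)
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    onehot = np.eye(n_clusters)[labels]  # (n_keys, n_clusters)
    return p @ onehot
```

When every key sits exactly on its centroid the two functions agree; the gap grows with within-cluster spread, which is the redundancy assumption the SVG line relies on.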
2 Two Lineages, Each Re-Aiming Its Own Machinery
Not one story but two parallel ones. The SVG line keeps refining the centroid; the Sparge line keeps refining direct prediction. The shift is what each paper newly contributes.
▌ SVG LINE · video-specialized · centroid-based · training-free
ICML 2025
SVG 1
Sparse attention · bidirectional
Sparsity exists & has structure
Heads are spatial (within frame) or temporal (across frames)
Online profiling on ~1% of rows
Layout transform → tensor-core friendly
~2.3× faster
NeurIPS 2025
SVG 2
Sparse attention · bidirectional
Structure is semantic, not positional
K-means clusters replace position blocks
Lossless permutation → dense blocks
Centroid cache: K-means 76× cheaper
~2.3× faster
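Why the permutation is lossless: attention does not care about the order of keys, and permuting queries only permutes output rows. A minimal NumPy sketch under assumed names (`cluster_permutation` and `permuted_attention` are illustrative, not SVG2's API) sorts tokens so that each cluster becomes contiguous, runs attention, and undoes the query permutation.

```python
import numpy as np

def dense_attention(Q, K, V):
    """Plain softmax attention, used here as the reference."""
    logits = Q @ K.T / np.sqrt(Q.shape[1])
    logits = logits - logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    return (w / w.sum(axis=1, keepdims=True)) @ V

def cluster_permutation(labels):
    """Return (perm, inverse) so that labels[perm] is sorted by cluster id."""
    perm = np.argsort(labels, kind="stable")
    inverse = np.empty_like(perm)
    inverse[perm] = np.arange(len(perm))
    return perm, inverse

def permuted_attention(Q, K, V, q_labels, k_labels):
    """Attention on cluster-sorted tokens, mapped back to the original order."""
    q_perm, q_inv = cluster_permutation(q_labels)
    k_perm, _ = cluster_permutation(k_labels)
    # Keys and values must share one permutation; queries may use another.
    out_sorted = dense_attention(Q[q_perm], K[k_perm], V[k_perm])
    return out_sorted[q_inv]
```

After sorting, every (query cluster, key cluster) pair occupies one contiguous rectangle of the attention map, which is what lets a block-sparse kernel skip whole tiles.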
2026
SVG-EAR
Sparse attention + compensation
Optimize reconstruction error, not score
Route by expected error, not attention mass
Don't drop skipped blocks — compensate
Has a theoretical MSE bound
1.6–1.9× faster
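"Don't drop skipped blocks, compensate" can be illustrated for a single query. The sketch below is an assumption-laden toy, not SVG-EAR's implementation: each skipped key cluster is replaced by one pseudo-key (its key centroid, with `log(size)` added to the logit) and one pseudo-value (its mean value). The function names are hypothetical.

```python
import numpy as np

def dense_attention(q, K, V):
    """Exact softmax attention for one query vector q of shape (d,)."""
    logits = K @ q / np.sqrt(q.shape[0])
    w = np.exp(logits - logits.max())
    return (w[:, None] * V).sum(0) / w.sum()

def attention_with_compensation(q, K, V, labels, keep):
    """Kept clusters are computed exactly; skipped ones collapse to a centroid.

    labels: (n,) cluster id per key, keep: (n_clusters,) bool.
    """
    d = q.shape[0]
    logits, values = [], []
    for j in range(len(keep)):
        idx = np.flatnonzero(labels == j)
        if len(idx) == 0:
            continue
        if keep[j]:
            logits.append(K[idx] @ q / np.sqrt(d))
            values.append(V[idx])
        else:
            centroid = K[idx].mean(0)
            # One pseudo-key stands in for len(idx) keys, hence the log term.
            logits.append(np.array([centroid @ q / np.sqrt(d) + np.log(len(idx))]))
            values.append(V[idx].mean(0, keepdims=True))
    logits = np.concatenate(logits)
    values = np.concatenate(values)
    w = np.exp(logits - logits.max())
    return (w[:, None] * values).sum(0) / w.sum()
```

The compensation is exact when all keys in a skipped cluster coincide with their centroid, and its error grows with within-cluster spread, which is why routing by expected error rather than attention mass is a natural companion.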
ICML 2026
Quant VideoGen
KV-cache quant · autoregressive
Same atom solves memory, not compute
Centroid-subtract → 2-bit residuals
Progressive recursive quantization
8B model on a single RTX 4090
~7× smaller
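The centroid-subtract step can be sketched as follows. The quantizer here is an assumption (per-token uniform min/max quantization), and the progressive recursive stage is not shown; the point is only that residuals around a centroid span a much smaller range than raw vectors, so 2 bits go further.

```python
import numpy as np

def quantize_residual(x, centroids, labels, bits=2):
    """Subtract each token's centroid, then uniformly quantize the residual.

    x: (n, d), centroids: (k, d), labels: (n,) cluster id per token.
    Returns integer codes plus the per-token offset and step needed to decode.
    """
    residual = x - centroids[labels]
    levels = 2 ** bits - 1
    lo = residual.min(axis=1, keepdims=True)
    hi = residual.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    codes = np.clip(np.round((residual - lo) / scale), 0, levels).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale, centroids, labels):
    """Rebuild the token: centroid plus decoded residual."""
    return centroids[labels] + codes * scale + lo
```

Passing all-zero centroids turns this into plain 2-bit quantization of the raw vectors, which makes a convenient baseline for comparing reconstruction error.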
▌ SPARGE LINE · universal (any model) · prediction/mask-based · the training-free→trainable fork
ICML 2025
SpargeAttention
Universal sparse attn · any model
Predict the map, skip — no clustering, no fixed pattern
Stage 1: fast attention-map prediction
Stage 2: no-overhead softmax-aware filter
Sparse and quantized; still training-free
universal · training-free
2026
SpargeAttention2
Trainable sparse attn · the fork
Make the mask trainable — break training-free for a higher ceiling
Hybrid Top-k + Top-p mask (covers diffuse tail)
Distillation fine-tune from a dense teacher
Trades zero-cost adoption for performance
95% sparse · 16.2×
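A hybrid Top-k + Top-p mask can be sketched in NumPy. The card names the two rules but not how they are combined, so this sketch assumes their union; `hybrid_mask` and its parameters are hypothetical, and the training and distillation side is not shown.

```python
import numpy as np

def hybrid_mask(probs, k, p):
    """Keep an entry if it is in the row's top-k OR inside its top-p nucleus.

    probs: (rows, n) non-negative scores with each row summing to 1.
    Assumption: the hybrid is the union of the two selections.
    """
    order = np.argsort(-probs, axis=-1)
    sorted_probs = np.take_along_axis(probs, order, axis=-1)
    cum = np.cumsum(sorted_probs, axis=-1)
    rank = np.arange(probs.shape[-1])
    in_top_k = rank < k
    # An entry is in the nucleus if the mass ranked before it is still below p.
    in_top_p = (cum - sorted_probs) < p
    keep_sorted = in_top_k | in_top_p
    mask = np.zeros(probs.shape, dtype=bool)
    np.put_along_axis(mask, order, keep_sorted, axis=-1)
    return mask
```

On a sharply peaked row Top-p alone keeps very few entries and Top-k sets the floor; on a diffuse row Top-k alone covers little mass and Top-p extends the selection into the tail.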
3 Three Axes Advancing in Parallel
Read across the SVG lineage and three independent dials are each turning. The endpoint of every dial (the last entry on each axis) is the current frontier.
What gets grouped (the selection unit): head-type (spatial / temporal) → semantic token cluster → error-weighted cluster
How you decide what to keep (the criterion): positional heuristic → attention score / top-p → reconstruction error / cost ratio
What happens to the discarded mass: drop entirely → drop entirely → compensate with centroid
SVG1 fixed the what · SVG2 fixed the where · EAR fixed the why and the leftovers.
4 Where It's Heading — Merging the Two Branches
Within the SVG line, sparse attention (SVG2, SVG-EAR) saves compute while KV-cache quantization (QVG) saves memory, yet both pay for the same costly first step: clustering. The obvious next move: cluster once, then let one decision drive both savings.
Level 1 · Thrift
Run clustering a single time; the sparse-attention path and the quantization path both reuse the same centroids instead of paying for them twice.
Level 2 · Elegance
One "important / not" label governs both budgets — skipping a block and collapsing it to its average are the same decision.
The catch
The two methods target different model types (bidirectional vs. autoregressive streaming), and stacking two approximations compounds their errors, so both have to be managed under one combined quality budget.
5 The Fork: Training-Free vs. Trainable
The newest paper forces the biggest revision. With only the first papers in view, "everything is training-free" looked like a law. SpargeAttention2 shows it was just phase one — fine-tuning the mask buys a far higher ceiling, at a cost.
Training-Free Track
SVG1/2/EAR · QVG · SpargeAttention
Adopt cost: zero — drop into a pretrained model
Ceiling: modest (~1.6–2.3× end-to-end)
Sells: convenience, instant deployment
Risk: quality gets fragile at high sparsity
Best for: accelerating an off-the-shelf model today
Trainable Track
SpargeAttention2
Adopt cost: one-time distillation fine-tune
Ceiling: high — 95% sparse, 16.2×
Sells: raw performance ceiling
Risk: fine-tuning cost + per-model effort
Best for: a provider shipping a flagship who can afford a fine-tune
Likely future: hybridize. Use a training-free method to initialize the mask, then fine-tune briefly, for most of the ceiling at a fraction of the training cost.
6 One Constant, One Broken Assumption
STILL MANDATORY: System–Algorithm Co-Design
An algorithm-only contribution is not viable in either lineage. Each ships systems pieces:
Centroid cache (SVG line): warm-start K-means from the previous step/chunk. Cuts clustering cost 76× (SVG2), 3× (QVG). The Sparge line has no clustering, so it leans instead on its no-overhead Stage-2 filter.
BROKEN by paper 6: Training-Free Is No Longer a Law
Papers 1–5 all drop into a pretrained model with no fine-tuning — the original analysis called this the field's permanent signature.
SpargeAttention2 breaks it: a trainable mask + distillation fine-tune reaches 95% sparsity / 16.2×, far past the training-free ceiling. The field has forked into two tracks (see section 5).
✗ Common misread: "K-means runs once for the whole video."
In the SVG line it runs at every denoising step × every layer × every head (SVG2) or at every frame-chunk (QVG), thousands of times per video. It only feels free because each run warm-starts from the last and converges in a few iterations. The tell: the cache gives a finite 76× speedup, not the unbounded one a true compute-once scheme would give. (The Sparge line has no K-means at all.)
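The warm-start effect is easy to reproduce with plain Lloyd's K-means. This is a minimal sketch, not the SVG2 or QVG centroid cache: the `init` argument plays the role of the cached centroids, and convergence is declared when assignments stop changing (both are assumptions of this illustration).

```python
import numpy as np

def kmeans(x, k, init=None, max_iters=100, seed=0):
    """Lloyd's K-means. Returns (centroids, labels, iterations_used).

    init: optional (k, d) centroids from a previous run (the "cache").
    Without it, centroids start from k randomly chosen points.
    """
    rng = np.random.default_rng(seed)
    if init is None:
        centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    else:
        centroids = np.array(init, dtype=float)
    prev = None
    for it in range(1, max_iters + 1):
        # Assign every point to its nearest centroid.
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        if prev is not None and np.array_equal(labels, prev):
            break  # assignments are stable: converged
        for j in range(k):
            members = x[labels == j]
            if len(members):  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(0)
        prev = labels
    return centroids, labels, it
```

Calling `kmeans(x_next, k, init=previous_centroids)` on slowly drifting data typically stops after very few iterations, while a cold start may need many more. Each call still does real work, so the saving is a finite factor rather than a one-time cost.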
7 Trajectory & Open Tensions
Where it's heading
Unify attention + quant in one pipeline — cluster once, spend on both.
"Error-aware everything" — route quantization bit-allocation by error too.