Video Diffusion · Efficiency Research · 6 papers, 2 lineages

One Goal, Two Lineages, a Fork at the End

Six papers chase one goal — cheaply predict which attention blocks matter, then skip the rest. The video-specialized SVG line predicts with clustering centroids; the universal Sparge line predicts with a fast attention-map approximation. The newest paper forks the field: training-free vs. trainable.

Legend: SVG line (video-specialized, centroid-based) · Sparge line (universal, prediction/mask-based) · Shared primitive (predict importance, skip)

1 The Unifying Primitive — and Where It Splits

Within the SVG line, video latents are deeply redundant — neighboring frames and regions produce near-identical tokens — so one trick (cluster → centroid) is mined four different ways. The Sparge line reaches the same "predict importance cheaply" goal with a different estimator, so the centroid is the SVG line's signature, not a universal law.

Estimate criticality: centroid ≈ cheap proxy for a block's attention score

SVG2

Estimate & route by error: compare centroid vs. exact to find the blocks that reconstruct worst

EAR

Compensate skipped blocks: replace dropped keys/values with the centroid (parameter-free)

EAR

Quantize the residual: subtract the centroid, and the tiny leftover squeezes into 2 bits

QVG

Different estimator entirely: the Sparge line skips clustering and predicts the attention map directly (v1) or learns a mask (v2)

Sparge
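The first mechanism above, using a centroid as a cheap stand-in for a block's attention score, can be sketched roughly as follows. This is an illustration only, not code from SVG2: the function names, the size-weighted softmax, and the top-p selection rule are assumptions made for the example, and the clusters are taken as given rather than produced by K-means.

```python
import math

def mean_vector(vectors):
    """Centroid of a cluster: the element-wise mean of its member vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def estimate_cluster_mass(query_centroid, key_clusters):
    """Approximate the softmax attention mass each key cluster receives,
    using one dot product per cluster instead of one per key."""
    scale = 1.0 / math.sqrt(len(query_centroid))
    logits = [
        scale * sum(q * c for q, c in zip(query_centroid, mean_vector(cluster)))
        for cluster in key_clusters
    ]
    top = max(logits)
    # Assumption: a centroid stands in for every member, so weight by cluster size.
    weights = [len(cl) * math.exp(l - top) for cl, l in zip(key_clusters, logits)]
    total = sum(weights)
    return [w / total for w in weights]

def select_top_p(mass, p):
    """Keep the highest-mass clusters until their cumulative mass reaches p."""
    order = sorted(range(len(mass)), key=lambda i: mass[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += mass[i]
        if cumulative >= p:
            break
    return sorted(kept)

key_clusters = [
    [[4.0, 0.0], [4.2, 0.1]],    # aligned with the query
    [[-4.0, 0.0], [-3.9, 0.0]],  # opposite direction
    [[0.0, 4.0], [0.0, 3.8]],    # orthogonal
]
mass = estimate_cluster_mass([4.0, 0.0], key_clusters)
kept = select_top_p(mass, p=0.9)  # indices of clusters that get exact attention
```

Everything outside `kept` is skipped, so the estimate costs one dot product per cluster rather than one per key.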

2 Two Lineages, Each Re-Aiming Its Own Machinery

Not one story but two parallel ones. The SVG line keeps refining the centroid; the Sparge line keeps refining direct prediction. The shift is what each paper newly contributes.

▌ SVG LINE · video-specialized · centroid-based · training-free

ICML 2025

SVG 1

Sparse attention · bidirectional

Sparsity exists & has structure

Heads are spatial (within frame) or temporal (across frames)
Online profiling on ~1% of rows
Layout transform → tensor-core friendly

~2.3× faster
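One way to picture the online profiling step: sample a small fraction of query rows, compute exact attention for just those rows, and check which pattern captures more of it. The sketch below is a simplified assumption, not SVG's actual procedure: it scores each pattern by captured attention mass, and `classify_head`, the frame-by-frame token layout, and the default 1% `sample_fraction` are illustrative choices.

```python
import math
import random

def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_head(queries, keys, tokens_per_frame, sample_fraction=0.01, seed=0):
    """Label a head 'spatial' or 'temporal' from a small sample of query rows.

    Assumes tokens are laid out frame by frame, so token i sits in frame
    i // tokens_per_frame at position i % tokens_per_frame.
    """
    n = len(queries)
    rng = random.Random(seed)
    rows = rng.sample(range(n), max(1, int(n * sample_fraction)))
    scale = 1.0 / math.sqrt(len(keys[0]))
    spatial_mass = temporal_mass = 0.0
    for r in rows:
        logits = [scale * sum(a * b for a, b in zip(queries[r], k)) for k in keys]
        frame, pos = divmod(r, tokens_per_frame)
        for j, prob in enumerate(softmax(logits)):
            key_frame, key_pos = divmod(j, tokens_per_frame)
            if key_frame == frame:
                spatial_mass += prob   # mass a same-frame mask would keep
            if key_pos == pos:
                temporal_mass += prob  # mass a same-position mask would keep
    return "spatial" if spatial_mass >= temporal_mass else "temporal"
```

Only the sampled rows pay for full attention; the resulting label then decides which sparse pattern the remaining rows use.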

NeurIPS 2025

SVG 2

Sparse attention · bidirectional

Structure is semantic, not positional

K-means clusters replace position blocks
Lossless permutation → dense blocks
Centroid cache: K-means 76× cheaper

~2.3× faster
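The "lossless permutation" idea rests on a standard property: softmax attention gives the same result if keys and values are reordered together, and reordering queries only reorders the output rows. A minimal sketch of that property, with hypothetical cluster labels and helper names (`cluster_permutation`, `permuted_attention`) that are not taken from the paper:

```python
import math

def attention(queries, keys, values):
    """Plain softmax attention, one output row per query."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        top = max(logits)
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))
        ])
    return outputs

def cluster_permutation(labels):
    """Token order that makes each cluster contiguous (stable within a cluster)."""
    return sorted(range(len(labels)), key=lambda i: labels[i])

def permuted_attention(queries, keys, values, q_labels, k_labels):
    """Group tokens by cluster, attend, then restore the original query order."""
    q_perm = cluster_permutation(q_labels)
    k_perm = cluster_permutation(k_labels)
    out_perm = attention(
        [queries[i] for i in q_perm],
        [keys[i] for i in k_perm],
        [values[i] for i in k_perm],
    )
    # Undo the query permutation so outputs line up with the original tokens.
    out = [None] * len(queries)
    for position, original_index in enumerate(q_perm):
        out[original_index] = out_perm[position]
    return out
```

Because the result is unchanged, the reordering is free in accuracy terms; its payoff is that each cluster becomes a contiguous run that a block-sparse kernel can process as a dense tile.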

2026

SVG-EAR

Sparse attention + compensation

Optimize reconstruction error, not score

Route by expected error, not attention mass
Don't drop skipped blocks — compensate
Has a theoretical MSE bound

1.6–1.9× faster
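Compensation can be sketched for a single query as below. This is a rough illustration under stated assumptions, not SVG-EAR's implementation: `attend_with_compensation` is a hypothetical helper, and weighting the centroid's softmax term by block size is an assumption made so that one centroid stands in for all of the block's keys.

```python
import math

def mean_vector(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def attend_with_compensation(query, blocks, keep):
    """Attention output for one query.

    blocks: list of (keys, values) pairs, one per block.
    keep:   indices of blocks computed exactly; every other block is
            replaced by its key centroid and value centroid.
    """
    scale = 1.0 / math.sqrt(len(query))
    terms = []  # (logit, multiplicity, value vector)
    for index, (keys, values) in enumerate(blocks):
        if index in keep:
            for k, v in zip(keys, values):
                logit = scale * sum(a * b for a, b in zip(query, k))
                terms.append((logit, 1, v))
        else:
            centroid_key = mean_vector(keys)
            logit = scale * sum(a * b for a, b in zip(query, centroid_key))
            # Assumption: the centroid counts once per key it replaces.
            terms.append((logit, len(keys), mean_vector(values)))
    top = max(logit for logit, _, _ in terms)
    weights = [count * math.exp(logit - top) for logit, count, _ in terms]
    total = sum(weights)
    dim = len(terms[0][2])
    return [
        sum(w * value[i] for w, (_, _, value) in zip(weights, terms)) / total
        for i in range(dim)
    ]
```

When a skipped block's tokens are identical, the compensated output matches exact attention; the less uniform the block, the larger the gap, which is the error that routing tries to keep small.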

ICML 2026

Quant VideoGen

KV-cache quant · autoregressive

Same atom solves memory, not compute

Centroid-subtract → 2-bit residuals
Progressive recursive quantization
8B model on a single RTX 4090

~7× smaller
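Why subtracting the centroid first helps can be seen in a toy comparison. The sketch below uses a plain uniform 4-level quantizer and keeps centroids in full precision; both choices, along with the helper names, are assumptions for illustration and do not reproduce Quant VideoGen's progressive scheme.

```python
def quantize_2bit(values):
    """Round each value to one of 4 evenly spaced levels spanning min..max
    and return the dequantized values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    step = (hi - lo) / 3  # 4 levels leave 3 gaps
    return [lo + round((v - lo) / step) * step for v in values]

def max_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))

def direct_error(tokens):
    """Worst-case error when the raw values are quantized to 2 bits."""
    flat = [x for token in tokens for x in token]
    return max_error(flat, quantize_2bit(flat))

def residual_error(tokens, labels):
    """Worst-case error when each token's cluster centroid is subtracted
    first and only the residual is quantized to 2 bits."""
    worst = 0.0
    for cluster in set(labels):
        members = [t for t, label in zip(tokens, labels) if label == cluster]
        dim = len(members[0])
        centroid = [sum(m[i] for m in members) / len(members) for i in range(dim)]
        residuals = [m[i] - centroid[i] for m in members for i in range(dim)]
        worst = max(worst, max_error(residuals, quantize_2bit(residuals)))
    return worst

tokens = [
    [10.2, -9.9], [9.8, -10.3], [10.0, -9.8],   # cluster 0
    [-10.1, 10.3], [-9.7, 9.9], [-10.2, 9.8],   # cluster 1
]
labels = [0, 0, 0, 1, 1, 1]
raw = direct_error(tokens)                # levels must span roughly -10 to 10
residual = residual_error(tokens, labels) # levels only span a few tenths
```

The residuals occupy a much narrower range than the raw values, so the same four levels sit far closer together and the rounding error shrinks accordingly.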

▌ SPARGE LINE · universal (any model) · prediction/mask-based · the training-free→trainable fork

ICML 2025

SpargeAttention

Universal sparse attn · any model

Predict the map, skip — no clustering, no fixed pattern

Stage 1: fast attention-map prediction
Stage 2: no-overhead softmax-aware filter
Sparse and quantized; still training-free

universal · training-free

2026

SpargeAttention2

Trainable sparse attn · the fork

Make the mask trainable — break training-free for a higher ceiling

Hybrid Top-k + Top-p mask (covers diffuse tail)
Distillation fine-tune from a dense teacher
Trades zero-cost adoption for performance

95% sparse · 16.2×
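A hybrid Top-k + Top-p mask can be illustrated on one row of block-level attention probabilities. The sketch assumes the two rules are combined by union, which is a guess at the combination rule rather than something stated above, and it omits the trainable part entirely; `hybrid_mask` and its parameters are hypothetical names.

```python
def top_k_indices(probs, k):
    """Indices of the k largest probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])

def top_p_indices(probs, p):
    """Smallest set of highest-probability indices whose total reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept

def hybrid_mask(probs, k, p):
    """Keep a block if either rule selects it (union is an assumption)."""
    keep = top_k_indices(probs, k) | top_p_indices(probs, p)
    return [i in keep for i in range(len(probs))]

peaked = hybrid_mask([0.90, 0.05, 0.03, 0.02], k=2, p=0.7)
diffuse = hybrid_mask([0.30, 0.26, 0.24, 0.20], k=2, p=0.7)
```

On the peaked row Top-p alone would keep a single block and Top-k widens it to two; on the diffuse row Top-k alone would cover only 56% of the mass and Top-p extends the mask to a third block.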

3 Three Axes Advancing in Parallel

Read across the SVG lineage and three independent dials are turning at once. The last entry on each dial is the current frontier.

What gets grouped — the selection unit

head-type (spatial / temporal) › semantic token cluster › error-weighted cluster

How you decide what to keep — the criterion

positional heuristic › attention score / top-p › reconstruction error / cost ratio

What happens to the discarded mass

drop entirely › compensate with centroid

SVG1 fixed the what · SVG2 fixed the where · EAR fixed the why and the leftovers.

4 Where It's Heading — Merging the Two Branches

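The two centroid-based branches save different resources (sparse attention saves compute, KV-cache quantization saves memory) but share the same costly first step: clustering. The obvious next move is to cluster once, then let one decision drive both savings.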

Level 1 · Thrift

Run clustering a single time; the attention and quantization paths both reuse the same centroids instead of each paying for its own.

Level 2 · Elegance

One "important / not" label governs both budgets — skipping a block and collapsing it to its average are the same decision.

The catch

The two branches target different model types (bidirectional vs. streaming), and stacking two approximations compounds their errors, so both must be managed under one combined quality budget.

5 The Fork: Training-Free vs. Trainable

The newest paper forces the biggest revision. With only the first papers in view, "everything is training-free" looked like a law. SpargeAttention2 shows it was just phase one — fine-tuning the mask buys a far higher ceiling, at a cost.

Training-Free Track

SVG1/2/EAR · QVG · SpargeAttention

Adopt cost: zero — drop into a pretrained model
Ceiling: modest (~1.6–2.3× end-to-end)
Sells: convenience, instant deployment
Risk: quality gets fragile at high sparsity

Best for: accelerating an off-the-shelf model today

Trainable Track

SpargeAttention2

Adopt cost: one-time distillation fine-tune
Ceiling: high — 95% sparse, 16.2×
Sells: raw performance ceiling
Risk: fine-tuning cost + per-model effort

Best for: a provider shipping a flagship who can afford a fine-tune

Likely future: hybridize. Use a training-free method to initialize the mask, then fine-tune briefly, for most of the ceiling at a fraction of the training cost.

6 One Constant, One Broken Assumption

STILL MANDATORY · System–Algorithm Co-Design

An algorithm-only contribution is not viable in either lineage. Each ships systems pieces:

Centroid cache (SVG line): warm-start K-means from the previous step/chunk. Cuts clustering cost 76× (SVG2) and 3× (QVG). The Sparge line has no clustering, so it leans instead on its no-overhead Stage-2 filter.
Custom kernel (both): turn scattered sparsity into contiguous tensor-core tiles. Layout transform (SVG1) · dynamic block-sparse (SVG2) · fused routing (EAR) · fused dequant (QVG) · SageAttention kernels (Sparge).

BROKEN by paper 6 · Training-Free Is No Longer a Law

Papers 1–5 all drop into a pretrained model with no fine-tuning — the original analysis called this the field's permanent signature.

SpargeAttention2 breaks it: a trainable mask + distillation fine-tune reaches 95% sparsity / 16.2×, far past the training-free ceiling. The field has forked into two tracks (see section 5).


Common misread: "K-means runs once for the whole video."

In the SVG line it runs every denoising step × every layer × every head (SVG2) or every frame-chunk (QVG) — thousands of times per video. It only feels free because each run warm-starts from the last and converges in a few iterations. The tell: the cache gives a finite 76× speedup, not the infinite one a true compute-once would. (The Sparge line has no K-means at all.)
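The warm-start effect can be reproduced on toy data. The sketch below is a plain Lloyd's-algorithm K-means written for illustration; it is not either paper's kernel, and the function names and the naive "first k points" cold initialization are assumptions of the example.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, init_centroids, max_iters=50):
    """Lloyd's algorithm. Returns (centroids, number of centroid updates
    performed before the assignments stopped changing)."""
    centroids = [list(c) for c in init_centroids]
    assignment = None
    for iteration in range(1, max_iters + 1):
        new_assignment = [
            min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            return centroids, iteration - 1
        assignment = new_assignment
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                dim = len(members[0])
                centroids[j] = [
                    sum(m[i] for m in members) / len(members) for i in range(dim)
                ]
    return centroids, max_iters

# Two consecutive "steps": the same two groups, shifted slightly.
step_t = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0], [11.0, 10.0], [10.0, 11.0]]
step_t1 = [[x + 0.1, y + 0.1] for x, y in step_t]

cached_centroids, _ = kmeans(step_t, step_t[:2])
_, cold_updates = kmeans(step_t1, step_t1[:2])        # naive init: first k points
_, warm_updates = kmeans(step_t1, cached_centroids)   # init from the cache
```

The warm-started run still performs an assignment pass and a centroid update, so its cost is reduced rather than eliminated, consistent with a finite speedup from the cache.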

7 Trajectory & Open Tensions

Where it's heading

Unify attention + quant in one pipeline — cluster once, spend on both.
"Error-aware everything" — route quantization bit-allocation by error too.
Adaptive budgets (top-p, top-k, per-head density) replacing fixed sparsity ratios.
The field forks: training-free (convenience) and trainable (higher ceiling, up to 95% sparsity) now coexist; hybrids are likely.

Tensions & open questions

The SVG line rests on the centroid cache, which breaks if consecutive steps stop resembling each other (long or fast-changing content, fewer sampler steps).
Thin theory — only EAR offers a bound; the rest are empirical.
High-sparsity quality is fragile: PSNR still drops (≈ 25 dB on Wan2.2-T2V).
Hyperparameter zoo — cluster count, per-head vs. shared, top-p are tuned, not derived.

Sources: Sparse VideoGen (ICML 2025) · Sparse VideoGen 2 (NeurIPS 2025) · SVG-EAR (2026) · Quant VideoGen (ICML 2026) · SpargeAttention (ICML 2025) · SpargeAttention2 (2026). Built from the six summaries in knowledge/. Companion to trend_analysis.md.