A visual walkthrough of low-bit weight-only quantization: start from FP weights, add trainable rounding offsets V and clipping scalars α, β, reconstruct each block output, and update only those quantization parameters with SignSGD.
Only the quantization parameters are trained: the rounding offset V, the upper clip α, and the lower clip β. The weights themselves are not updated.
After tuning, export ordinary integer weights plus their scale and zero-point.
RTN rounds each weight independently. SignRound instead uses calibration data to make the quantized block output match the FP block output.
RTN is fast, but it rounds each weight locally and never looks at how the block output changes.
SignRound trains V, α, and β so that W̃X reconstructs WX on the calibration inputs.
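As a baseline for comparison, round-to-nearest can be sketched in a few lines of NumPy. This is a minimal illustration, assuming asymmetric 4-bit quantization with one scale and zero-point per row; the function names `rtn_quantize` and `dequantize` are hypothetical and not taken from any library.

```python
import numpy as np

def rtn_quantize(w, bits=4):
    """Round-to-nearest, asymmetric, one scale/zero-point per row (assumed layout)."""
    qmax = 2 ** bits - 1
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-8) / qmax
    zero_point = np.round(-w_min / scale)
    # Each weight is rounded on its own; no calibration data is involved.
    q = np.clip(np.round(w / scale) + zero_point, 0, qmax)
    return q.astype(np.int32), scale, zero_point

def dequantize(q, scale, zero_point):
    """Map integer levels back to floating point."""
    return (q - zero_point) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
q, scale, zp = rtn_quantize(W)
W_rtn = dequantize(q, scale, zp)
print("max per-weight error:", np.abs(W - W_rtn).max())
```

Each weight lands within half a quantization step of its original value, but nothing here measures what that does to the block output.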
Feed a few samples through the FP model and cache the input to the current block.
Compute the reference output: Y_fp = W X.
Quantize + dequantize weights using trainable rounding and clipping.
Compute Y_q = W̃ X through the same block.
Minimize ||Y_fp - Y_q||², not the direct weight error.
Step each parameter by the sign of its gradient, and keep the parameters that gave the lowest loss so far.
This demo uses a tiny 4×4 weight matrix and a small calibration input. The update is a simplified SignSGD-style reconstruction loop to show the mechanism; real LLM quantization applies the same idea block by block at large scale.
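The steps above can be sketched on a 4×4 matrix in NumPy. This is a simplified illustration, not a reference implementation, and it rests on several assumptions: rounding is treated as the identity in the backward pass (a straight-through estimator), the zero-point is treated as a constant when differentiating, α and β are per-row scalars kept in [0.5, 1], and the step size decays linearly from 1/steps. The names `qdq` and `tune` are hypothetical.

```python
import numpy as np

def qdq(W, V, alpha, beta, bits=4):
    """Quantize then dequantize W row by row with offsets V and clips alpha, beta."""
    qmax = 2 ** bits - 1
    w_max = np.maximum(W.max(axis=1, keepdims=True), 0.0)
    w_min = np.minimum(W.min(axis=1, keepdims=True), 0.0)
    scale = np.maximum(alpha * w_max - beta * w_min, 1e-8) / qmax
    zp = np.round(-beta * w_min / scale)
    q_raw = np.round(W / scale + V) + zp
    q = np.clip(q_raw, 0, qmax)
    inside = q_raw == q  # True where clipping did not change the level
    return scale * (q - zp), q, scale, zp, inside, w_max, w_min

def tune(W, X, bits=4, steps=200):
    """SignSGD-style loop on V, alpha, beta; returns the best (loss, V, alpha, beta)."""
    qmax = 2 ** bits - 1
    V = np.zeros_like(W)
    alpha = np.ones((W.shape[0], 1))
    beta = np.ones((W.shape[0], 1))
    Y_fp = W @ X  # reference block output
    best = None
    for t in range(steps):
        W_q, q, scale, zp, inside, w_max, w_min = qdq(W, V, alpha, beta, bits)
        err = W_q @ X - Y_fp
        loss = float(np.sum(err ** 2))
        if best is None or loss < best[0]:
            best = (loss, V.copy(), alpha.copy(), beta.copy())
        G = 2.0 * err @ X.T  # dLoss/dW_q
        # Straight-through gradients (zero-point held constant, an assumption).
        g_V = G * scale * inside
        dWq_dscale = (q - zp) - np.where(inside, W / scale, 0.0)
        g_scale = np.sum(G * dWq_dscale, axis=1, keepdims=True)
        g_alpha = g_scale * w_max / qmax
        g_beta = -g_scale * w_min / qmax
        # SignSGD: use only the sign, then project back into the allowed range.
        lr = (1.0 / steps) * (1.0 - t / steps)
        V = np.clip(V - lr * np.sign(g_V), -0.5, 0.5)
        alpha = np.clip(alpha - lr * np.sign(g_alpha), 0.5, 1.0)
        beta = np.clip(beta - lr * np.sign(g_beta), 0.5, 1.0)
    return best

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
X = rng.normal(size=(4, 8))  # small calibration input
ones = np.ones((4, 1))
W_rtn = qdq(W, np.zeros_like(W), ones, ones)[0]
rtn_loss = float(np.sum((W_rtn @ X - W @ X) ** 2))
best_loss, V_best, alpha_best, beta_best = tune(W, X)
print("RTN loss:", rtn_loss, "tuned loss:", best_loss)
```

The first iteration evaluates plain RTN (V = 0, α = β = 1), so the saved best loss can never be worse than the RTN loss; whether it is strictly better depends on the data.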
When the offset V pushes a scaled weight across the midpoint between two integer levels, the final rounding decision changes. SignSGD only needs the direction of the gradient, not its exact magnitude.
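A scalar example makes the threshold effect concrete. The sketch below assumes a scale of 0.1 and uses a hypothetical helper, `quantize_level`; the numbers are chosen for illustration only.

```python
def quantize_level(w, scale, v=0.0):
    """Integer level chosen for weight w with rounding offset v in [-0.5, 0.5]."""
    return round(w / scale + v)

scale = 0.1
w_border = 0.246  # w / scale is about 2.46, just below the 2.5 midpoint
base = quantize_level(w_border, scale)             # rounds down to 2
nudged = quantize_level(w_border, scale, v=0.05)   # about 2.51, rounds up to 3

w_center = 0.20   # w / scale is 2.0, far from any midpoint
stable = quantize_level(w_center, scale, v=0.4)    # 2.4 still rounds to 2

print(base, nudged, stable)  # → 2 3 2
```

A small offset flips the weight sitting near the border, while the weight in the middle of its bin keeps its level even under a much larger offset.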
Rounding is threshold-based. For a weight near a rounding border, a small shift in V can flip the integer choice.
Output reconstruction is block-aware. Some local weight errors matter less if the block output remains close.
SignSGD is lightweight. It ignores gradient magnitude and only follows the direction, which is enough for bounded rounding/clipping parameters.
Serving stays simple. The tuned parameters are folded into the final quantized checkpoint.
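The folding step can be sketched as follows. It assumes the same per-row asymmetric 4-bit scheme as a stand-in for the real checkpoint layout; `fold` and `dequantize` are hypothetical names, the values of V, α, and β are placeholders rather than tuned results, and real export formats typically pack several low-bit values into each byte.

```python
import numpy as np

def fold(W, V, alpha, beta, bits=4):
    """Bake V, alpha, beta into integer weights plus scale and zero-point."""
    qmax = 2 ** bits - 1
    w_max = np.maximum(W.max(axis=1, keepdims=True), 0.0)
    w_min = np.minimum(W.min(axis=1, keepdims=True), 0.0)
    scale = np.maximum(alpha * w_max - beta * w_min, 1e-8) / qmax
    zero_point = np.round(-beta * w_min / scale)
    q = np.clip(np.round(W / scale + V) + zero_point, 0, qmax)
    return q.astype(np.uint8), scale.astype(np.float32), zero_point.astype(np.uint8)

def dequantize(q, scale, zero_point):
    """Serving-side reconstruction: needs only q, scale, zero_point."""
    return (q.astype(np.float32) - zero_point.astype(np.float32)) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
V = rng.uniform(-0.5, 0.5, size=W.shape)  # placeholder for tuned offsets
alpha = np.full((4, 1), 0.9)              # placeholder clip values
beta = np.full((4, 1), 0.95)

q, scale, zp = fold(W, V, alpha, beta)
W_served = dequantize(q, scale, zp)
print(q.dtype, W_served.shape)
```

Once `fold` has run, V, α, and β are no longer needed: the serving path sees the same kind of tensors that RTN would have produced.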