The Universal Algorithm
// Packs 32 weights of kNumBits each into kNumBits 32-bit words.
// out_arr must be zero-initialized by the caller, because bits are OR-ed in.
template <uint32_t kNumBits>
void common_pack_weight(uint32_t *in_arr, uint32_t *out_arr) {
    constexpr uint32_t mask = (1u << kNumBits) - 1;  // unsigned literal avoids signed-shift issues
    for (uint32_t i = 0; i < 32; i++) {
        uint32_t index = i * kNumBits;      // absolute bit position in output stream
        uint32_t word_idx = index / 32;     // which int32 word
        uint32_t bit_offset = index % 32;   // position within that word
        uint32_t val = in_arr[i] & mask;
        out_arr[word_idx] |= (val << bit_offset);  // place low part
        if (bit_offset + kNumBits > 32) {          // spans word boundary?
            uint32_t part1_bits = 32 - bit_offset; // bits that fit in the current word
            out_arr[word_idx + 1] |= (val >> part1_bits);  // place high part in next word
        }
    }
}
// Boundary spanning occurs when: 32 % kNumBits ≠ 0
// Clean packing (no spans): 1, 2, 4, 8 bit (powers of 2 that divide 32)
// Spanning required: 3, 5, 6, 7 bit
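To check that the boundary-spanning logic loses no bits, it helps to pair the packer with an inverse and run a round trip. The sketch below is not taken from Humming: `pack32`, `unpack32`, and `round_trip_ok` are hypothetical names, `pack32` restates the routine above, and the `static_assert` limiting the width to 1..8 bits is an assumption made only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Restatement of the packing routine above (hypothetical name).
// out_arr must hold kNumBits words and start zeroed.
template <uint32_t kNumBits>
void pack32(const uint32_t *in_arr, uint32_t *out_arr) {
    static_assert(kNumBits >= 1 && kNumBits <= 8, "sketch covers 1..8 bits");
    constexpr uint32_t mask = (1u << kNumBits) - 1;
    for (uint32_t i = 0; i < 32; i++) {
        uint32_t index = i * kNumBits;
        uint32_t word_idx = index / 32;
        uint32_t bit_offset = index % 32;
        uint32_t val = in_arr[i] & mask;
        out_arr[word_idx] |= val << bit_offset;
        if (bit_offset + kNumBits > 32) {
            out_arr[word_idx + 1] |= val >> (32 - bit_offset);
        }
    }
}

// Hypothetical inverse: read each weight back from the same bit position,
// stitching the high part from the next word when the weight spans a boundary.
template <uint32_t kNumBits>
void unpack32(const uint32_t *in_arr, uint32_t *out_arr) {
    static_assert(kNumBits >= 1 && kNumBits <= 8, "sketch covers 1..8 bits");
    constexpr uint32_t mask = (1u << kNumBits) - 1;
    for (uint32_t i = 0; i < 32; i++) {
        uint32_t index = i * kNumBits;
        uint32_t word_idx = index / 32;
        uint32_t bit_offset = index % 32;
        uint32_t val = in_arr[word_idx] >> bit_offset;
        if (bit_offset + kNumBits > 32) {
            val |= in_arr[word_idx + 1] << (32 - bit_offset);
        }
        out_arr[i] = val & mask;
    }
}

// Packs 32 deterministic test values, unpacks them, and compares.
template <uint32_t kNumBits>
bool round_trip_ok() {
    constexpr uint32_t mask = (1u << kNumBits) - 1;
    uint32_t weights[32];
    uint32_t packed[kNumBits] = {0};  // zeroed, as the packer requires
    uint32_t restored[32];
    for (uint32_t i = 0; i < 32; i++) {
        weights[i] = (i * 7 + 3) & mask;
    }
    pack32<kNumBits>(weights, packed);
    unpack32<kNumBits>(packed, restored);
    for (uint32_t i = 0; i < 32; i++) {
        if (restored[i] != weights[i]) return false;
    }
    return true;
}
```

Calling `round_trip_ok<3>()` or `round_trip_ok<7>()` exercises the spanning branch, while `round_trip_ok<4>()` never enters it. The shift by `32 - bit_offset` is safe because that branch is only taken when `bit_offset` is nonzero.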
All Bit Widths at a Glance
| Bits | Words / 32 weights | Weights / word | 32 % bits | Spanning weights | Status |
Rule: If 32 % bits == 0, packing is clean (no weight spans a word boundary). Otherwise, some weights are split across two adjacent int32 words. Within a block of 32 weights, the number of spanning weights is the number of indices i in 0..31 for which (i * bits) % 32 + bits > 32.
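The rule can be evaluated directly for any bit width. The sketch below counts spanning weights by brute force using the condition stated above; the function names are hypothetical. The closed form `bits - gcd(bits, 32)` does not come from the source: it is an observation that follows from counting which of the `bits - 1` internal word boundaries in a 32-weight block fall between two weights, and it is assumed here only for widths from 1 to 32.

```cpp
#include <cassert>
#include <cstdint>

// Brute-force count: weights in a block of 32 that straddle a word boundary.
inline uint32_t spanning_weights(uint32_t bits) {
    uint32_t count = 0;
    for (uint32_t i = 0; i < 32; i++) {
        if ((i * bits) % 32 + bits > 32) {
            count++;
        }
    }
    return count;
}

// Euclid's algorithm, written out to keep the sketch dependency-free.
inline uint32_t gcd_u32(uint32_t a, uint32_t b) {
    while (b != 0) {
        uint32_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Closed form for 1 <= bits <= 32 (an observation, not from the source).
inline uint32_t spanning_weights_closed_form(uint32_t bits) {
    return bits - gcd_u32(bits, 32);
}

// A block of 32 weights always occupies exactly `bits` 32-bit words.
inline uint32_t words_per_block(uint32_t bits) {
    return (32 * bits) / 32;
}
```

For widths 1 through 8 the brute-force count gives 0, 0, 2, 0, 4, 4, 6, 0, which matches the clean/spanning split listed earlier: only 3, 5, 6, and 7 bits produce split weights.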
AutoRound comparison: AutoRound's pack_248_bits handles 2/4/8-bit (all clean), and AutoRound has a separate pack_3bits for 3-bit. It does not support 5/6/7-bit in the GPTQ export path. Humming handles all of these widths uniformly with one template.