Bit Packing Patterns (1–8 bit) — Humming Generic Template

common_pack_weight<kNumBits> — one algorithm handles all bit widths
Source: humming/include/humming/kernel/pack_weight.cuh

Packing Layout

Diagram legend:
Full weight (fits in one word)
Split weight, low bits (end of current word)
Split weight, high bits (start of next word)
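The three legend cases can be made concrete with a small sketch. The `Placement` struct and the `place` helper below are hypothetical names introduced here for illustration; they are not part of Humming. They only restate the index arithmetic used by the packing loop, assuming 32-bit output words.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type (not part of Humming): where weight i lands
// in the packed bit stream.
struct Placement {
    uint32_t word;       // index of the 32-bit word that holds the low part
    uint32_t offset;     // bit offset of the low part inside that word
    uint32_t low_bits;   // number of bits stored at the end of `word`
    uint32_t high_bits;  // number of bits stored at the start of `word + 1`
                         // (0 when the weight fits in one word)
};

// Computes the placement of weight i for a given bit width.
Placement place(uint32_t i, uint32_t bits) {
    assert(bits >= 1 && bits <= 32);
    uint32_t start = i * bits;    // absolute bit position in the stream
    uint32_t offset = start % 32; // position inside the word
    // If the weight would run past bit 31, only the remaining bits fit here.
    uint32_t low = (offset + bits > 32) ? 32 - offset : bits;
    return {start / 32, offset, low, bits - low};
}
```

For 3-bit packing, `place(10, 3)` reports a split weight: it starts at bit 30 of word 0, so 2 low bits sit at the end of word 0 and 1 high bit sits at the start of word 1. `place(0, 3)` is a full weight with `high_bits == 0`.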

The Universal Algorithm

#include <cstdint>

// Packs 32 weights of kNumBits each into kNumBits consecutive 32-bit words.
// out_arr must be zero-initialized by the caller, because bits are OR-ed in.
template <uint32_t kNumBits>
void common_pack_weight(uint32_t *in_arr, uint32_t *out_arr) {
    constexpr uint32_t mask = (1u << kNumBits) - 1;
    for (uint32_t i = 0; i < 32; i++) {
        uint32_t index = i * kNumBits;     // absolute bit position in output stream
        uint32_t word_idx = index / 32;    // which int32 word
        uint32_t bit_offset = index % 32;  // position within that word
        uint32_t val = in_arr[i] & mask;
        out_arr[word_idx] |= (val << bit_offset);  // place low part
        if (bit_offset + kNumBits > 32) {          // spans word boundary?
            uint32_t part1_bits = 32 - bit_offset; // bits already placed in this word
            out_arr[word_idx + 1] |= (val >> part1_bits);  // place high part in next word
        }
    }
}

// Boundary spanning occurs when: 32 % kNumBits ≠ 0
// Clean packing (no spans): 1, 2, 4, 8 bit (powers of 2 that divide 32)
// Spanning required: 3, 5, 6, 7 bit
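A round trip is a quick way to check the packing loop. The sketch below reproduces the loop so that it compiles on its own, then adds an inverse. `unpack_weight`, `round_trip_ok`, and `check_all_widths` are hypothetical names written for this example, not functions from Humming, and the test pattern is arbitrary.

```cpp
#include <cassert>
#include <cstdint>

// Packing loop as described in the text, reproduced for self-containment.
template <uint32_t kNumBits>
void common_pack_weight(uint32_t *in_arr, uint32_t *out_arr) {
    constexpr uint32_t mask = (1u << kNumBits) - 1;
    for (uint32_t i = 0; i < 32; i++) {
        uint32_t index = i * kNumBits;
        uint32_t word_idx = index / 32;
        uint32_t bit_offset = index % 32;
        uint32_t val = in_arr[i] & mask;
        out_arr[word_idx] |= (val << bit_offset);
        if (bit_offset + kNumBits > 32) {
            out_arr[word_idx + 1] |= (val >> (32 - bit_offset));
        }
    }
}

// Hypothetical inverse (an assumption, not taken from Humming):
// reads 32 weights of kNumBits each back out of the packed words.
template <uint32_t kNumBits>
void unpack_weight(const uint32_t *packed, uint32_t *out) {
    constexpr uint32_t mask = (1u << kNumBits) - 1;
    for (uint32_t i = 0; i < 32; i++) {
        uint32_t index = i * kNumBits;
        uint32_t word_idx = index / 32;
        uint32_t bit_offset = index % 32;
        uint32_t val = packed[word_idx] >> bit_offset;  // low part
        if (bit_offset + kNumBits > 32) {
            // High part lives in the lowest bits of the next word.
            val |= packed[word_idx + 1] << (32 - bit_offset);
        }
        out[i] = val & mask;
    }
}

// Packs an arbitrary test pattern, unpacks it, and compares.
template <uint32_t kNumBits>
bool round_trip_ok() {
    uint32_t in[32], out[32];
    uint32_t packed[kNumBits] = {0};  // must start zeroed: packing uses |=
    for (uint32_t i = 0; i < 32; i++) {
        in[i] = (i * 7u + 3u) & ((1u << kNumBits) - 1);
    }
    common_pack_weight<kNumBits>(in, packed);
    unpack_weight<kNumBits>(packed, out);
    for (uint32_t i = 0; i < 32; i++) {
        if (in[i] != out[i]) return false;
    }
    return true;
}

// Aborts via assert if any of the eight bit widths fails the round trip.
void check_all_widths() {
    assert(round_trip_ok<1>()); assert(round_trip_ok<2>());
    assert(round_trip_ok<3>()); assert(round_trip_ok<4>());
    assert(round_trip_ok<5>()); assert(round_trip_ok<6>());
    assert(round_trip_ok<7>()); assert(round_trip_ok<8>());
}
```

As a spot check of the spanning case: with 3-bit packing, all inputs zero except weight 10 set to 5 (binary 101), the low two bits land in bits 30 and 31 of word 0 and the remaining high bit lands in bit 0 of word 1, so the packed words are 0x40000000, 0x00000001, and 0.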

All Bit Widths at a Glance

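Bits  Words / 32 weights  Weights / word  32 % bits  Spanning weights  Status
1     1                   32              0          0                 Clean
2     2                   16              0          0                 Clean
3     3                   10.67           2          2                 Spanning
4     4                   8               0          0                 Clean
5     5                   6.4             2          4                 Spanning
6     6                   5.33            2          4                 Spanning
7     7                   4.57            4          6                 Spanning
8     8                   4               0          0                 Clean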

Rule: If 32 % bits == 0, packing is clean (no weight spans a word boundary). Otherwise, some weights are split across two adjacent 32-bit words. The number of spanning weights is the number of indices i in 0..31 for which (i * bits) % 32 + bits > 32.
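The rule translates directly into a counting loop. `count_spanning` and `is_clean` below are hypothetical helper names used only for this sketch; they assume a group of 32 weights and 32-bit words, as in the template.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of weights (out of 32) that straddle a
// 32-bit word boundary for the given bit width.
uint32_t count_spanning(uint32_t bits) {
    assert(bits >= 1 && bits <= 32);
    uint32_t n = 0;
    for (uint32_t i = 0; i < 32; i++) {
        // Weight i starts at bit (i * bits) % 32 of its word and needs
        // `bits` bits; it spans if it would run past bit 31.
        if ((i * bits) % 32 + bits > 32) n++;
    }
    return n;
}

// Hypothetical helper: packing is clean when the width divides 32.
bool is_clean(uint32_t bits) {
    return 32 % bits == 0;
}
```

With these definitions, `count_spanning` returns 0 for 1, 2, 4, and 8 bits, and 2, 4, 4, and 6 for 3, 5, 6, and 7 bits respectively. The 6-bit count is lower than the 5 interior word boundaries might suggest because the boundary at bit 96 coincides with the start of weight 16.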

AutoRound comparison: AutoRound's pack_248_bits handles 2/4/8-bit (all clean), and has a separate pack_3bits for 3-bit. It does not support 5/6/7-bit in the GPTQ export path. Humming handles all uniformly with one template.