LLM Inference with Codebook-based Q4X Quantization using the Llama.cpp Framework on RISC-V Vector CPUs
We use QuantX [MN25], our in-house hardware-aware quantization platform, to generate a CPU-friendly codebook-based quantization technique, Q4X, and demonstrate its effectiveness by integrating it into a private fork of the popular Llama.cpp [GtLc25] framework. Accounting for the memory-bound nature of LLM inference, we design a compact 64-element codebook that is kept in the CPU register file during dequantization, saving costly far-memory accesses. Q4X is integrated
into Llama.cpp using hardware-friendly data packing and cache-aware vectorized kernels optimized for
the RISC-V vector extension (RVV). The results are validated on a Milk-V Jupiter RISC-V board, where Q4X achieves a better trade-off among tokens/sec, model size, and perplexity than Llama.cpp's built-in quantization techniques of comparable bit-width.
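To make the in-register codebook lookup concrete, below is a minimal, hypothetical C sketch using standard RVV intrinsics. The block layout (`q4x_block_t`, `QK`, a per-block float scale, one byte per index) is an assumption for illustration only; the actual Q4X packing is not described here. The idiom shown is loading the 64-entry codebook once into a vector register group and replacing every per-weight table load with an in-register vrgather.

```c
/* Illustrative sketch only: the real Q4X packing, block size, and
 * kernel structure are not public. q4x_block_t, QK, and the one-
 * byte-per-index layout below are assumptions for this example. */
#include <riscv_vector.h>
#include <stdint.h>
#include <stddef.h>

#define QK 32  /* weights per quantized block (assumed) */

typedef struct {
    float   scale;    /* per-block scale factor (assumed) */
    uint8_t idx[QK];  /* codebook indices in 0..63, one byte each */
} q4x_block_t;

/* Dequantize one block. The 64-entry codebook is loaded once into a
 * vector register group, so every lookup in the hot loop is an
 * in-register vrgather rather than a load from a memory-resident
 * table. Assumes VLEN >= 128 so that e8/m4 holds all 64 entries. */
static void q4x_dequant_block(const q4x_block_t *b,
                              const int8_t codebook[64],
                              float *out)
{
    size_t vl = __riscv_vsetvl_e8m4(64);
    vint8m4_t cb = __riscv_vle8_v_i8m4(codebook, vl);

    int8_t codes[QK];
    for (size_t i = 0; i < QK; i += vl) {
        vl = __riscv_vsetvl_e8m4(QK - i);
        vuint8m4_t idx = __riscv_vle8_v_u8m4(b->idx + i, vl);
        /* one vrgather replaces vl scalar table loads */
        vint8m4_t q = __riscv_vrgather_vv_i8m4(cb, idx, vl);
        __riscv_vse8_v_i8m4(codes + i, q, vl);
    }
    /* apply the per-block scale (kept scalar here for clarity) */
    for (size_t i = 0; i < QK; i++)
        out[i] = b->scale * (float)codes[i];
}
```

Because vrgather indexes only within the register group, the lookup touches no memory after the initial codebook load, which is the property that matters for a memory-bound workload like LLM token generation.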