Why Low-Bit Quantization Needs to Happen in Training

Thesis

If you want a language model small enough to run on a thirty-dollar single-board computer, you have to shrink each weight to a few bits or fewer. There are two ways to get there. Post-training quantization (PTQ) takes a finished full-precision model and rounds its weights down to low precision afterward. Quantization-aware training (QAT) exposes the model to the low-precision arithmetic during training, so it learns weights that survive the rounding. The Veritate experiments are unambiguous about which one works at the aggressive end of the precision range: below roughly four bits per weight, post-hoc quantization of our byte-level models is catastrophic, and quantization-aware training recovers nearly all of the lost quality. This article reports the measured cliff and the recoveries.

The Public Frame: Ternary as the Destination

The most striking public result in this space is BitNet b1.58 (Ma et al., 2024), which constrains every weight to one of three values: -1, 0, or +1. Three levels carry log2(3), or about 1.58, bits per weight. The key claim, validated at the three-billion-parameter scale, is that a ternary model trained natively in that regime matches a full-precision model of the same size and token budget on perplexity and downstream accuracy, while using several times less memory and running faster. The operative word is natively. The model is born ternary; it is not rounded to ternary after the fact. Our results explain why that distinction is not a detail but the whole game.
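The 1.58 figure is an information-theoretic bound, and a simple storage layout gets close to it. The sketch below is an illustration only, not the BitNet or Veritate storage format: it packs five ternary weights into one byte as a base-3 number, which works because 3^5 = 243 fits in 256, and costs 1.6 bits per weight. The function names are hypothetical.

```python
import math

# Information content of one ternary weight: log2(3) bits.
BITS_PER_TERNARY_WEIGHT = math.log2(3)


def pack_trits(trits):
    """Pack five values from {-1, 0, +1} into one byte as a base-3 number."""
    assert len(trits) == 5 and all(t in (-1, 0, 1) for t in trits)
    byte = 0
    for t in reversed(trits):
        byte = byte * 3 + (t + 1)  # map -1, 0, +1 to base-3 digits 0, 1, 2
    return byte  # at most 3**5 - 1 = 242, so it fits in 8 bits


def unpack_trits(byte):
    """Inverse of pack_trits: recover the five ternary values."""
    trits = []
    for _ in range(5):
        trits.append(byte % 3 - 1)
        byte //= 3
    return trits


if __name__ == "__main__":
    print(f"log2(3) = {BITS_PER_TERNARY_WEIGHT:.3f} bits per weight")
    print(f"5 trits per byte = {8 / 5:.1f} bits per weight as stored")
    weights = [-1, 0, 1, 1, -1]
    assert unpack_trits(pack_trits(weights)) == weights
```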

The Post-Hoc Cliff, Measured

On our trained 85M-parameter byte-level model, naive data-free post-training quantization falls off a cliff as precision drops. Ternary PTQ applied to the feed-forward layers alone inflated cross-entropy by 266 percent; applied to all layers, by 344 percent. Stacking ternary with structured 2:4 sparsity post-hoc produced a 410 percent regression, which is to say, garbage. Sub-one-bit PTQ schemes were similarly destructive. The diagnosis is not that the dynamic range is wrong. It is that the available level set (three values, or two) cannot represent the weights a full-precision objective settled on. The information lives in distinctions the coarse grid cannot make.
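To make the level-set argument concrete, here is a minimal sketch of data-free ternary rounding. It assumes the absmean scheme described in the BitNet b1.58 paper (scale by the mean absolute weight, round, clip to -1, 0, +1); the exact rounding rule used in the Veritate PTQ runs is not stated here, so treat this as one plausible instance. The function names and the sample weights are hypothetical.

```python
def ternarize_absmean(weights):
    """Round floats to integer codes in {-1, 0, +1} with one shared scale.

    Assumption: absmean scaling as in BitNet b1.58, applied after training.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0] * len(weights), 0.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map ternary codes back to floats."""
    return [c * scale for c in codes]


def relative_error(weights, approx):
    """Root of the squared error divided by the squared weight norm."""
    num = sum((w - a) ** 2 for w, a in zip(weights, approx))
    den = sum(w ** 2 for w in weights)
    return (num / den) ** 0.5


if __name__ == "__main__":
    weights = [0.02, -0.31, 0.15, 0.90, -0.07, 0.44]  # made-up values
    codes, scale = ternarize_absmean(weights)
    print("codes:", codes, "scale:", scale)
    print("relative error:", relative_error(weights, dequantize(codes, scale)))
```

In this toy row, 0.02, 0.15, and -0.07 all collapse to zero, and 0.44 and 0.90 collapse to the same value. Rescaling cannot undo that: the grid simply has no level for the distinction.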

A related and instructive failure: swapping the activation function on a trained model also breaks it. Replacing GELU with ReLU on the finished 85M model, with no retraining, drove cross-entropy up by 302 percent even though it produced 81 percent zeros. The model's weights were baked around GELU's smooth negative regime; structurally changing the operator after training destroys the very distinctions the weights encode. It is the same lesson in a different costume: aggressive structural change is a training-time decision, not a deployment-time one.
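A small numerical illustration of that negative regime, using the exact erf-based GELU definition (whether the 85M model uses the exact or the tanh-approximate form is not stated, so this is an assumption): for negative inputs GELU passes a small nonzero signal, while ReLU returns exactly zero.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x):
    """ReLU: zero for every negative input."""
    return max(0.0, x)


if __name__ == "__main__":
    for x in (-2.0, -1.0, -0.5, -0.1, 0.5):
        print(f"x={x:+.1f}  gelu={gelu(x):+.4f}  relu={relu(x):+.4f}")
```

Downstream weights trained against the GELU outputs have learned to use those small negative values; a post-hoc swap to ReLU removes them all at once.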

Quantization-Aware Training Recovers It

When the model trains with the quantization in the loop, the cliff bends. On the 85M model, ternary quantization that was garbage as PTQ (cross-entropy of 3.56 versus a 0.43 baseline) recovered to 0.56 after only one thousand fine-tuning steps with quantization in the forward pass. The resulting model is 17 megabytes on disk versus 326 megabytes for full precision, a nineteenfold reduction, and about five times smaller than INT8. A second QAT pass that fixes the layer-norm fold ordering dropped per-column INT8 perplexity from 7.88 to 4.44, a 44 percent improvement.
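"Quantization in the forward pass" is commonly implemented with a straight-through estimator: the forward pass uses the rounded weights, and the gradient is applied to full-precision latent weights as if the rounding were the identity. The sketch below shows that mechanism on a toy linear model in plain Python. It is not the Veritate training loop; the absmean ternarizer, the learning rate, the data, and all function names are assumptions for illustration.

```python
def ternarize(weights):
    """Absmean ternary rounding: each weight becomes scale * {-1, 0, +1}."""
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0.0] * len(weights)
    return [scale * max(-1, min(1, round(w / scale))) for w in weights]


def mse_and_grad(w_q, data):
    """Mean squared error of a linear model and its gradient w.r.t. w_q."""
    n = len(data)
    loss = 0.0
    grad = [0.0] * len(w_q)
    for x, y in data:
        err = sum(wi * xi for wi, xi in zip(w_q, x)) - y
        loss += err * err / n
        for i, xi in enumerate(x):
            grad[i] += 2.0 * err * xi / n
    return loss, grad


def qat_step(latent, data, lr):
    """One SGD step with quantization in the forward pass."""
    w_q = ternarize(latent)  # the loss only ever sees quantized weights
    loss, grad = mse_and_grad(w_q, data)
    # Straight-through estimator: treat d(w_q)/d(latent) as the identity,
    # so the gradient for w_q is applied directly to the latent weights.
    new_latent = [w - lr * g for w, g in zip(latent, grad)]
    return new_latent, loss


if __name__ == "__main__":
    # Made-up regression data: (inputs, target) pairs.
    data = [([1.0, 0.0, 0.5], 0.9),
            ([0.0, 1.0, -0.5], -0.4),
            ([1.0, 1.0, 0.0], 0.5)]
    latent = [0.3, -0.1, 0.2]
    for step in range(100):
        latent, loss = qat_step(latent, data, lr=0.05)
    print("quantized weights:", ternarize(latent), "last loss:", loss)
```

The design point is that the latent weights stay in full precision and accumulate small updates, while the model is only ever evaluated on the grid it will ship with.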

We then pushed below ternary on a per-layer basis. The earliest layers turned out to be 22 to 25 times more sensitive to binarization than the later ones, so we ran quantization-aware training that binarized only the first three layers (holding the rest at ternary) for two thousand steps. The post-quantization cliff of +2.18 cross-entropy bent down to +0.021. The combined shipping recipe (binary on the first three layers, ternary on the rest, with five percent of the most sensitive weights kept in higher precision) lands at an effective 1.98 bits per weight at a quality cost of +0.021 cross-entropy. That strictly dominates the best post-hoc result we had (2.31 bits per weight at +0.06). For the higher-precision INT4 default the story is gentler still: per-row INT4 shrinks the model from 326 to 41 megabytes at roughly a 0.4 percent quality cost versus INT8, because at four bits the level set is rich enough that careful post-hoc rounding mostly survives.
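For reference, per-row INT4 can be sketched as follows. This assumes plain symmetric round-to-nearest with one floating-point scale per weight-matrix row and codes in [-8, 7]; the production recipe may differ (the Hadamard rotation mentioned under Limitations is not included), and the function names are hypothetical.

```python
def quantize_row_int4(row):
    """Symmetric INT4: integer codes in [-8, 7] plus one float scale."""
    max_abs = max(abs(w) for w in row)
    if max_abs == 0.0:
        return [0] * len(row), 1.0
    scale = max_abs / 7.0  # the largest magnitude maps to code +/-7
    codes = [max(-8, min(7, round(w / scale))) for w in row]
    return codes, scale


def quantize_per_row(matrix):
    """Quantize each row independently so every row gets its own scale."""
    return [quantize_row_int4(row) for row in matrix]


def dequantize_per_row(quantized):
    """Reconstruct float rows from (codes, scale) pairs."""
    return [[c * scale for c in codes] for codes, scale in quantized]


if __name__ == "__main__":
    # Two made-up rows with the same shape but very different magnitudes.
    matrix = [[0.7, -0.3, 0.1],
              [14.0, -6.0, 2.0]]
    for codes, scale in quantize_per_row(matrix):
        print("codes:", codes, "scale:", scale)
```

Because each row carries its own scale, a row with large weights does not force a coarse step size onto a row with small ones, which is part of why sixteen levels are enough for post-hoc rounding to hold up.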

Limitations

Every number above is at 85M parameters and on a byte-level model, so the cliff location and the recovery margins will move with scale. The public BitNet result suggests the picture improves with width, but we have not yet reproduced the sub-two-bit recipe at our 800M shape. The five-percent high-precision outlier set is a hyperparameter we tuned at h=768 and should re-tune at larger hidden sizes. The INT4 engine kernel for bit-exact execution still depends on a Hadamard-rotation step that is wired into our experiments but not yet into the production engine. The durable conclusion is the one BitNet states and our cliff confirms from the other direction: at the aggressive end of low-bit modeling, the precision constraint must be present during training. Bolted-on quantization below four bits does not work; trained-in quantization does. The full set of these experiments lives in the open Veritate project.

References

Ma et al., The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (BitNet b1.58), 2024.

Ashkboos et al., QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs, 2024.

Frantar et al., GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers, 2023.