Activation Sparsity for Energy-Efficient Inference

Thesis

A dense feed-forward layer computes every neuron for every input, then mostly throws the result away. If a model could be coaxed to leave most of its feed-forward neurons at exactly zero on any given byte, an inference engine with the right kernel could skip the corresponding arithmetic and the energy that goes with it. This is activation sparsity, and it is one of the cleanest energy levers available to a model that must run on consumer or embedded hardware. The Veritate finding is specific and, we think, important: you cannot retrofit sparsity onto a finished model, but you can install it with a very short, cheap retraining pass, and the savings compose with weight pruning to reach a small fraction of the original feed-forward compute.
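To make the skipping concrete, here is a minimal, framework-free sketch of a down-projection that ignores neurons whose activation is exactly zero. It is an illustration rather than the engine's kernel: the function names, the toy shapes, and the row-major layout of w_down are all assumptions made for the example.

```python
def dense_down_projection(activations, w_down):
    """y[j] = sum_i activations[i] * w_down[i][j], touching every row."""
    d_out = len(w_down[0])
    y = [0.0] * d_out
    for a, row in zip(activations, w_down):
        for j in range(d_out):
            y[j] += a * row[j]
    return y


def sparse_down_projection(activations, w_down):
    """Same result, but rows whose activation is exactly zero are skipped."""
    d_out = len(w_down[0])
    y = [0.0] * d_out
    rows_used = 0
    for a, row in zip(activations, w_down):
        if a == 0.0:
            continue  # a silent neuron costs no multiply-adds
        rows_used += 1
        for j in range(d_out):
            y[j] += a * row[j]
    return y, rows_used


# Toy example (hypothetical sizes): 8 hidden neurons, 6 silent, output width 3.
acts = [0.0, 1.5, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0]
w_down = [[float(i + j) for j in range(3)] for i in range(8)]

dense_y = dense_down_projection(acts, w_down)
sparse_y, rows_used = sparse_down_projection(acts, w_down)
print(sparse_y, rows_used)
```

The sparse version reads only two of the eight weight rows here and produces the same output. A real kernel would also have to make that skip cheap on the target hardware, which this sketch does not address.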

The Public Frame

The relevant public line of work is ReLUfication and its successors. ProSparse (Song et al., 2024) showed that LLaMA-2-7B can be pushed to roughly 89 percent activation sparsity with quality comparable to its original Swish-activated version, using a progressive regularization schedule, and reports up to a 4.5-times inference speedup. dReLU and TurboSparse (Song et al., 2024) diagnosed why a naive ReLU swap fails on gated activations and proposed a corrected activation that reaches up to 90 percent sparsity with two-to-five-times decoding speedups. The common thread across all of this work is that the sparsity has to be trained for. Simply replacing the activation function on a finished model does not produce useful sparsity. Our results reproduce that lesson directly at byte level.

The Retrofit Fails

We first confirmed the negative result on our own model. Naive magnitude gating of the GELU outputs on a trained 85M-parameter model, zeroing small activations at inference time, saved only about 17 percent of the down-projection compute (under six percent end-to-end) before quality degraded, because the long tail of small GELU values turns out to carry information through the down-projection. Going further and hard-swapping GELU for ReLU on the trained model produced 81 percent zeros but inflated cross-entropy by 302 percent: roughly the right number of zeros, in the wrong places. The weights were optimized for GELU's smooth negative regime, and a post-hoc structural change discards exactly the distinctions they encode. We log both as refuted retrofit approaches.
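The two retrofits can be sketched on a handful of pre-activation values. This is a toy illustration, not the measurement code: the pre-activations and the gating threshold of 0.05 are invented for the example, and only the zero counts are shown, not the quality impact.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x):
    return max(0.0, x)


def magnitude_gate(activations, threshold):
    """Retrofit 1: zero every activation whose magnitude is below the threshold."""
    return [a if abs(a) >= threshold else 0.0 for a in activations]


def zero_fraction(values):
    return sum(1 for v in values if v == 0.0) / len(values)


# Hypothetical pre-activations for one input; the threshold is an assumption.
pre_acts = [-3.0, -1.0, -0.2, 0.0, 0.1, 0.5, 2.0, -0.05]
GATE_THRESHOLD = 0.05

gelu_out = [gelu(x) for x in pre_acts]
gated_out = magnitude_gate(gelu_out, GATE_THRESHOLD)  # retrofit 1
relu_out = [relu(x) for x in pre_acts]                # retrofit 2: hard swap

gate_zeros = zero_fraction(gated_out)
relu_zeros = zero_fraction(relu_out)

# Values GELU passed on (small but nonzero) that the hard swap erases.
dropped_by_relu = [g for g, r in zip(gelu_out, relu_out) if r == 0.0 and g != 0.0]
print(gate_zeros, relu_zeros, dropped_by_relu)
```

On this toy input the hard swap produces more zeros than gating does, but it gets them by erasing every negative-side GELU value, including ones such as gelu(-1.0) that are far from negligible. That is the signal the trained down-projection weights were relying on.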

The Trained-In Version Works, Cheaply

The constructive finding is that the sparsity does not require a from-scratch retrain. Starting from the trained 85M model at its canonical shape (hidden size 768, twelve layers), we monkey-patched the feed-forward layers to use ReLU with a small L1 penalty on the post-activation values, then fine-tuned for only 500 steps on CPU, about three minutes of wall-clock time. Immediately after the swap the model sat at the expected catastrophe (plus 2.92 cross-entropy, 95.7 percent zeros), and the short retrain walked it back to plus 0.037 cross-entropy at 87.5 percent zeros. That is far under our quality budget and far above our sparsity bar. A tiny-scale from-scratch control had already confirmed the shape (83.2 percent zeros at plus 0.008 cross-entropy at hidden size 256). The takeaway is that the model can re-learn to place its zeros correctly for a tiny fraction of the original training cost, which matters enormously when the whole project is about energy and consumer hardware.
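The objective behind that retrain can be sketched as the task loss plus a weighted L1 term on the post-ReLU activations. The sketch below is framework-free and covers only the forward pass. The coefficient value, the function names, and the toy weights are assumptions for illustration (the text above says only that the penalty was small), and a real run would use an autodiff framework to obtain gradients.

```python
def relu(vec):
    return [max(0.0, v) for v in vec]


def matvec(x, w):
    """Row vector x (length n) times matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def relu_ffn_forward(x, w_up, w_down):
    """Feed-forward block with ReLU; also returns the hidden activations."""
    hidden = relu(matvec(x, w_up))
    return matvec(hidden, w_down), hidden


def l1_activation_penalty(hidden):
    """Mean absolute post-activation value; pushes neurons toward exact zero."""
    return sum(abs(h) for h in hidden) / len(hidden)


def zero_fraction(hidden):
    return sum(1 for h in hidden if h == 0.0) / len(hidden)


def sparse_training_loss(task_loss, hidden, l1_coeff):
    """Task loss (e.g. cross-entropy) plus the weighted sparsity penalty."""
    return task_loss + l1_coeff * l1_activation_penalty(hidden)


# Toy shapes (hypothetical): model width 2, feed-forward width 4.
x = [1.0, -2.0]
w_up = [[1.0, -1.0, 0.5, 2.0],
        [1.0, 0.5, -0.5, 0.5]]
w_down = [[1.0], [1.0], [1.0], [1.0]]

y, hidden = relu_ffn_forward(x, w_up, w_down)
penalty = l1_activation_penalty(hidden)
loss = sparse_training_loss(task_loss=1.0, hidden=hidden, l1_coeff=0.01)
print(hidden, zero_fraction(hidden), penalty, loss)
```

In a multi-layer model the penalty would presumably be accumulated over every feed-forward layer. The coefficient sets the trade between added cross-entropy and the share of zeros.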

Better still, the two sparsity dimensions compose. We stacked a 50 percent magnitude weight prune (itself recoverable to plus 0.0049 cross-entropy with 500 retrain steps, consistent with the lottery-ticket result of Frankle and Carbin, 2019) with the ReLU-plus-L1 activation sparsity. After a single 500-step retrain that held the pruned weights at zero, the combined model retained the 50 percent weight prune and 88.1 percent post-ReLU activation sparsity at plus 0.105 cross-entropy. The two dimensions are orthogonal: sparse weights and sparse activations multiply. With half the weights and 11.9 percent of the activations live, the feed-forward path does 0.5 times 0.119, roughly six percent, of its dense floating-point work. That is about a seventeen-times theoretical speedup, available once both a sparse-weight and a sparse-activation kernel exist in the engine.
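The arithmetic behind the combined figure, and the idea of holding pruned weights at zero during the retrain, can be sketched as follows. The sparsity numbers come from the text above. The helper names, the flat weight list, and the plain SGD step are assumptions for illustration and not the training code.

```python
def remaining_ffn_fraction(weight_sparsity, activation_sparsity):
    """Share of dense multiply-adds left when both kinds of zeros are skipped."""
    return (1.0 - weight_sparsity) * (1.0 - activation_sparsity)


def magnitude_prune_mask(weights, sparsity):
    """0/1 mask that drops the smallest-magnitude fraction of the weights."""
    k = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1.0] * len(weights)
    for i in order[:k]:
        mask[i] = 0.0
    return mask


def masked_sgd_step(weights, grads, mask, lr):
    """Plain SGD update, then re-apply the mask so pruned weights stay zero."""
    return [(w - lr * g) * m for w, g, m in zip(weights, grads, mask)]


# Figures reported above: 50 percent weight prune, 88.1 percent zero activations.
fraction = remaining_ffn_fraction(0.50, 0.881)
speedup = 1.0 / fraction
print(round(fraction, 4), round(speedup, 1))

# Toy weights (hypothetical) to show the mask surviving an update.
weights = [0.9, -0.05, 0.3, 0.01, -0.7, 0.02]
mask = magnitude_prune_mask(weights, 0.5)
updated = masked_sgd_step(weights, [1.0] * 6, mask, lr=0.1)
print(mask, updated)
```

The product comes out just under six percent, which is where the roughly seventeen-times figure comes from. It counts floating-point operations only and says nothing about memory traffic or kernel overhead.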

Why Dense Models Do Not Give This For Free

It is worth noting why this has to be engineered rather than discovered. We measured the natural activation behavior of the dense 85M: about 18.6 percent of feed-forward neurons fire per byte under a small gating threshold, but 92.6 percent of the feed-forward width is needed to hold 95 percent of the activation energy. Dense models do not naturally concentrate their work into a few neurons. The mixture-of-experts style of sparsity, hard top-k routing, requires explicit training too; you do not get it by converting a dense model after the fact.
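The two measurements in that paragraph can disagree, and a small sketch shows how. The definitions here are assumptions: "fires" is taken to mean magnitude above a threshold, and "activation energy" is taken to be the sum of squared activations, which may differ from the exact definitions used for the reported numbers. The toy activation vector is invented.

```python
def firing_fraction(activations, threshold):
    """Share of neurons whose magnitude exceeds the gating threshold."""
    return sum(1 for a in activations if abs(a) > threshold) / len(activations)


def width_for_energy(activations, energy_share=0.95):
    """Smallest share of neurons, largest first, holding energy_share of sum(a^2)."""
    energies = sorted((a * a for a in activations), reverse=True)
    total = sum(energies)
    if total == 0.0:
        return 0.0
    running = 0.0
    for count, energy in enumerate(energies, start=1):
        running += energy
        if running >= energy_share * total:
            return count / len(energies)
    return 1.0


# Hypothetical dense-style activations: a few larger values, a broad tail.
spread = [0.6, 0.6] + [0.4] * 8
# Hypothetical concentrated activations: one neuron dominates.
concentrated = [10.0] + [0.1] * 9

print(firing_fraction(spread, 0.5), width_for_energy(spread))
print(firing_fraction(concentrated, 0.5), width_for_energy(concentrated))
```

In the spread case only a fifth of the neurons clear the threshold, yet the full width is needed to reach 95 percent of the energy, which is the dense-model pattern described above. In the concentrated case a single neuron is enough.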

Limitations

These are 85M-scale results, and the explicit next step in our log is to re-validate at the hidden size of the 800M model, where the constants may shift. The seventeen-times feed-forward speedup is a theoretical floating-point count, not a measured wall-clock number; it is gated on sparse-matmul and sparse-activation kernels that we have not yet shipped in the engine. The combined-sparsity quality cost of plus 0.105 cross-entropy, about 11 percent relative perplexity if cross-entropy is measured in nats (exp(0.105) is roughly 1.11), is material rather than negligible, and whether that is acceptable depends on the downstream task. What the data establishes cleanly is the method: activation sparsity is a training-time property, installable with minutes of fine-tuning, and it stacks with weight pruning to dramatically cut the feed-forward compute that dominates a small model's inference cost. The same energy-first thinking runs through our work on quantization-aware training and speculative decoding, and the full method lives in Veritate.

References

Song et al., ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models, 2024.

Song et al., Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters (dReLU / TurboSparse), 2024.

Frankle and Carbin, The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, 2019.