To Compress or Not? Pushing the Frontier of Lossless GenAI Model Weights Compression with Exponent Concentration
Zeyu Yang, Tianyi Zhang, Jianwen Xie, Chuan Li, Zhaozhuo Xu, Anshumali Shrivastava
International Conference on Learning Representations (ICLR), 2026
Overview
The scaling of Generative AI (GenAI) models into the hundreds of billions of parameters makes low-precision computation indispensable for efficient deployment. This work develops low-precision floating-point formats that provide numerical stability and memory savings without dequantization overhead. We identify a key statistical phenomenon — exponent concentration — where exponents of model weights consistently exhibit low entropy across different architectures and modalities.
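The phenomenon is easy to observe directly. The sketch below (illustrative, not from the paper: the Gaussian weight distribution and the `exponent_entropy` helper are assumptions for demonstration) extracts the 8-bit biased exponent field of float32 values and measures its Shannon entropy:

```python
import numpy as np

def exponent_entropy(weights: np.ndarray) -> float:
    """Shannon entropy (in bits) of the biased exponent field of float32 values."""
    bits = weights.astype(np.float32).view(np.uint32)
    exponents = (bits >> 23) & 0xFF  # isolate the 8-bit exponent field
    counts = np.bincount(exponents, minlength=256)
    probs = counts[counts > 0] / counts.sum()
    return float(-(probs * np.log2(probs)).sum())

# Trained weights are typically small and tightly scaled, so their exponents
# occupy only a narrow band of the 256 possible values.
w = np.random.normal(0.0, 0.02, size=1_000_000).astype(np.float32)
print(exponent_entropy(w))  # well below the 8 bits the field can hold
```

Because the entropy of the exponent field is far below its 8-bit width, an entropy coder can shrink it substantially with no loss of information.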
Key Contributions
- Exponent Concentration Phenomenon: We discover that the exponent bits in floating-point representations of trained model weights consistently show low entropy, regardless of the model architecture or modality. This is a fundamental statistical property of trained neural networks.
- Theoretical Foundation: We prove that exponent concentration emerges naturally from the α-stable distributions induced by stochastic gradient descent (SGD), and we establish tight entropy bounds. Our analysis places the theoretical compression boundary near FP4.67.
- ECF8 Framework: We propose Exponent-Concentrated FP8 (ECF8), a practical lossless compression framework that combines entropy-aware encoding of the low-entropy exponent distribution with GPU-optimized decoding for efficient hardware implementation.
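The encode/decode split can be sketched on CPU. The example below is a rough stand-in, not the ECF8 codec or its GPU decoder: it separates the low-entropy exponent plane from the sign-and-mantissa plane and entropy-codes each with zlib (an assumed general-purpose substitute for the paper's entropy coder), demonstrating a bit-exact lossless round trip:

```python
import zlib
import numpy as np

def compress_planes(weights: np.ndarray) -> dict:
    """Split float32 weights into bit-planes and entropy-code each losslessly."""
    bits = weights.astype(np.float32).view(np.uint32)
    exp = ((bits >> 23) & 0xFF).astype(np.uint8)  # low-entropy exponent plane
    rest = bits & 0x807FFFFF                      # sign + mantissa plane
    return {"exp": zlib.compress(exp.tobytes(), 9),
            "rest": zlib.compress(rest.tobytes(), 9)}

def decompress_planes(blob: dict) -> np.ndarray:
    """Reassemble the original float32 array, bit for bit."""
    exp = np.frombuffer(zlib.decompress(blob["exp"]), dtype=np.uint8)
    rest = np.frombuffer(zlib.decompress(blob["rest"]), dtype=np.uint32)
    return (rest | (exp.astype(np.uint32) << 23)).view(np.float32)

w = np.random.normal(0.0, 0.02, size=1 << 16).astype(np.float32)
assert np.array_equal(decompress_planes(compress_planes(w)), w)  # bit-exact
```

Because the concentrated exponent plane compresses well while the near-uniform mantissa plane does not, coding the two planes separately is what yields memory savings without any change to the decoded values.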
Results
We evaluate ECF8 across large language models and diffusion transformers with up to 671B parameters:
- Up to 26.9% memory reduction in model weight storage
- Up to 177.1% throughput acceleration for model inference
- Perfectly lossless — zero deviation in model outputs compared to uncompressed models
Why It Matters
Unlike lossy quantization methods (e.g., GPTQ, AWQ), ECF8 guarantees identical outputs to the original model while still achieving significant memory and speed improvements. This is particularly valuable for deployment scenarios where output fidelity is critical, such as scientific computing, medical AI, and safety-critical applications.
The discovery of exponent concentration as a fundamental property of SGD-trained models also opens new directions for principled low-precision floating-point format design in modern AI systems.