

To Compress or Not? Pushing the Frontier of Lossless GenAI Model Weights Compression with Exponent Concentration

· 5 min read
Zeyu Yang
PhD student at Rice University

Zeyu Yang, Tianyi Zhang, Jianwen Xie, Chuan Li, Zhaozhuo Xu, Anshumali Shrivastava

International Conference on Learning Representations (ICLR), 2026

[arXiv] [OpenReview] [Code]

[Video demo] Single GH200 GPU (96 GB) · DiffSynth · playback at 4× speed. ECF8 (ours) finished generation in 13 s, 85% faster than the FP8 baseline (24 s), with pixel-to-pixel identical outputs.

Modern GenAI models are enormous. DeepSeek-R1 alone has 671 billion parameters. Even after converting to FP8, serving these models eats up massive amounts of GPU memory and bandwidth. The standard response is lossy quantization: throw away some precision and hope the outputs don't degrade too much. But what if you didn't have to lose anything at all?

In this work, we show that you can losslessly compress FP8 model weights by exploiting a simple observation about how neural networks store information: the exponents of trained weights concentrate on just a few values. The result is ECF8, a format that saves up to 26.9% of memory and speeds up inference by up to 177.1%, while producing outputs that are bit-for-bit identical to the original model.
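To get an intuition for why exponent concentration leaves room for lossless compression, here is a minimal sketch (a hypothetical helper, not the paper's implementation) that measures the Shannon entropy of the power-of-two exponents in a Gaussian-initialized weight tensor. FP8 E4M3 spends 4 bits on the exponent field, so if the empirical exponent entropy is well below 4 bits, a lossless entropy coder can shrink the exponent bits without changing a single weight.

```python
import numpy as np

def exponent_entropy_fp8(weights):
    """Estimate the Shannon entropy (in bits) of the power-of-two
    exponents of a weight tensor, restricted to a 4-bit (16-code)
    exponent window as in FP8 E4M3.

    Hypothetical illustration: if the result is well below 4 bits,
    the exponent field is compressible by a lossless entropy coder.
    """
    w = np.asarray(weights, dtype=np.float32)
    # frexp gives |w| = m * 2**e with m in [0.5, 1); we keep e.
    # Map exact zeros to 1.0 so frexp stays well-defined.
    exps = np.frexp(np.where(w == 0, 1.0, np.abs(w)))[1]
    # Model FP8's limited dynamic range: keep the 16 exponent codes
    # anchored at the largest observed exponent; smaller values are
    # flushed into the bottom code (akin to subnormal flushing).
    exps = np.clip(exps, exps.max() - 15, exps.max())
    _, counts = np.unique(exps, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Gaussian-initialized weights: magnitudes cluster around the scale,
# so the exponents occupy only a handful of the 16 available codes.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=100_000)
h = exponent_entropy_fp8(w)
```

On a tensor like this the measured entropy sits well under the 4 bits E4M3 allocates to the exponent, which is exactly the slack a lossless coder can exploit.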