

To Compress or Not? Pushing the Frontier of Lossless GenAI Model Weights Compression with Exponent Concentration

· 5 min read
Zeyu Yang
PhD student at Rice University

Zeyu Yang, Tianyi Zhang, Jianwen Xie, Chuan Li, Zhaozhuo Xu, Anshumali Shrivastava

International Conference on Learning Representations (ICLR), 2026

[arXiv] [OpenReview] [Code]

[Video demo] Single GH200 GPU (96 GB) · DiffSynth · playback at 4× speed. ECF8 (ours) finished generation in 13 s, 85% faster than the FP8 baseline (24 s), with pixel-to-pixel identical outputs.

Modern GenAI models are enormous. DeepSeek-R1 alone has 671 billion parameters. Even after converting to FP8, serving these models eats up massive amounts of GPU memory and bandwidth. The standard response is lossy quantization: throw away some precision and hope the outputs don't degrade too much. But what if you didn't have to lose anything at all?

In this work, we show that you can losslessly compress FP8 model weights by exploiting a simple observation about how neural networks store information: the exponents of trained weights concentrate on just a few values. The result is ECF8, a format that saves up to 26.9% of memory and speeds up inference by up to 177.1%, while producing outputs that are bit-for-bit identical to the original model.
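To get an intuition for why exponent concentration leaves room for lossless compression, here is a minimal sketch (a hypothetical helper, not the paper's implementation) that measures the Shannon entropy of the power-of-two exponents in a Gaussian-initialized weight tensor. FP8 E4M3 spends 4 bits on the exponent field, so if the empirical exponent entropy is well below 4 bits, a lossless entropy coder can shrink the exponent bits without changing a single weight.

```python
import numpy as np

def exponent_entropy_fp8(weights):
    """Estimate the Shannon entropy (in bits) of the power-of-two
    exponents of a weight tensor, restricted to a 4-bit (16-code)
    exponent window as in FP8 E4M3.

    Hypothetical illustration: if the result is well below 4 bits,
    the exponent field is compressible by a lossless entropy coder.
    """
    w = np.asarray(weights, dtype=np.float32)
    # frexp gives |w| = m * 2**e with m in [0.5, 1); we keep e.
    # Map exact zeros to 1.0 so frexp stays well-defined.
    exps = np.frexp(np.where(w == 0, 1.0, np.abs(w)))[1]
    # Model FP8's limited dynamic range: keep the 16 exponent codes
    # anchored at the largest observed exponent; smaller values are
    # flushed into the bottom code (akin to subnormal flushing).
    exps = np.clip(exps, exps.max() - 15, exps.max())
    _, counts = np.unique(exps, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Gaussian-initialized weights: magnitudes cluster around the scale,
# so the exponents occupy only a handful of the 16 available codes.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=100_000)
h = exponent_entropy_fp8(w)
```

On a tensor like this the measured entropy sits well under the 4 bits E4M3 allocates to the exponent, which is exactly the slack a lossless coder can exploit.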