Close Read: LeWorldModel, a JEPA That Trains From Pixels Without the Tricks
LeWorldModel (LeWM) claims to be the first Joint-Embedding Predictive Architecture that trains stably end to end from raw pixels using only two loss terms: next-embedding prediction plus a single regularizer that forces the latent distribution to be an isotropic Gaussian. The claim mostly holds, and the reason it holds is the cleanest idea in the paper: replace the usual pile of anti-collapse heuristics (stop-gradient, EMA, frozen foundation encoders, seven-term VICReg objectives) with one distribution-matching penalty borrowed from LeJEPA. The headline "one hyperparameter" is real for the loss, but it quietly leans on architectural and quadrature choices that are themselves tuned. This is a close read of the paper from the first equation to the last.