Pretraining Vision-Language-Action (VLA) policies on internet-scale video is appealing, yet current latent-action objectives often learn the wrong thing: they remain anchored to pixel variation rather than action-relevant state transitions, making them vulnerable to appearance bias, nuisance motion, and information leakage. We introduce VLA-JEPA, a JEPA-style pretraining framework that sidesteps these pitfalls by design. The key idea is leakage-free state prediction: a target encoder produces latent representations from future frames, while the student pathway sees only the current observation—future information is used solely as supervision targets, never as input. By predicting in latent space rather than pixel space, VLA-JEPA learns dynamics abstractions that are robust to camera motion and irrelevant background changes. This yields a simple two-stage recipe—JEPA pretraining followed by action-head fine-tuning—without the multi-stage complexity of prior latent-action pipelines. Experiments on LIBERO, LIBERO-Plus, SimplerEnv and real-world manipulation tasks show that VLA-JEPA achieves consistent gains in sample efficiency and generalization over existing methods.
VLA-JEPA model architecture. A target encoder produces latent targets from future frames, while the student pathway sees only the current observation through a VLM backbone. A predictor maps the history of latent states and the latent action representations to future latent states and is trained as a latent world model with a JEPA alignment loss. Future frames are never provided as inputs to the VLM backbone; they are used solely to construct training targets.
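To make the training signal concrete, here is a minimal PyTorch sketch of this pretraining step under stated assumptions: the module names, tensor shapes, the smooth-L1 alignment loss, one predicted latent per latent action token, and a stop-gradient (frozen or EMA-updated) target encoder are illustrative choices, not the released implementation.

```python
# Minimal sketch of the leakage-free pretraining step described above.
# Module names, shapes, the smooth-L1 loss, and the frozen (stop-gradient)
# target encoder are illustrative assumptions, not the released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentWorldModelPredictor(nn.Module):
    """Maps history latent states + latent action tokens to future latents."""

    def __init__(self, dim: int = 512, n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, z_hist: torch.Tensor, z_act: torch.Tensor) -> torch.Tensor:
        # Concatenate along the token axis; the outputs at the action-token
        # positions are read out as the predicted future latent states
        # (one predicted latent per latent action token is assumed).
        x = torch.cat([z_hist, z_act], dim=1)            # (B, H + A, D)
        out = self.blocks(x)
        return out[:, -z_act.shape[1]:, :]               # (B, A, D)


def jepa_pretrain_step(student_vlm, target_encoder, predictor, optimizer,
                       obs_current, instruction, future_frames):
    """One leakage-free training step: future frames never enter the student."""
    # 1) Latent targets come from future frames only, under stop-gradient.
    with torch.no_grad():
        z_target = target_encoder(future_frames)         # (B, A, D)

    # 2) Student pathway sees only the current observation + instruction and
    #    emits history latents plus latent action tokens (this split and the
    #    matching latent dimension D are assumptions).
    z_hist, z_act = student_vlm(obs_current, instruction)

    # 3) The latent world model predicts future latent states; the loss
    #    aligns them in latent space, with no pixel reconstruction term.
    z_pred = predictor(z_hist, z_act)
    loss = F.smooth_l1_loss(z_pred, z_target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because `future_frames` only ever pass through the target encoder under `torch.no_grad()`, future information supervises the student without leaking into its inputs; whether the target encoder is frozen or EMA-updated is a standard JEPA design choice and is not prescribed by the description above.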
1. We analyze why many latent-action pretraining objectives remain pixel-tethered, becoming biased toward appearance, vulnerable to nuisance motion, and prone to information leakage when future context enters the learner.
2. We propose VLA-JEPA, a JEPA-style latent predictive alignment scheme that learns action-relevant transition semantics by predicting and aligning future latent states, with no pixel reconstruction, no information leakage, and a single-stage pretraining pipeline.
3. VLA-JEPA yields consistent gains in sample efficiency, robustness, and generalization across embodied control benchmarks (LIBERO, LIBERO-Plus, SimplerEnv) and real-world settings, while simplifying training relative to prior multi-stage latent-action pipelines.
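The two-stage recipe described in the abstract ends with action-head fine-tuning on robot demonstrations. Below is a minimal sketch of that second stage under stated assumptions: a chunked continuous action space, mean-pooled action-token features, a small MLP head, and an L1 regression loss are illustrative choices; the actual head design, loss, and which backbone parameters remain trainable are not specified here.

```python
# Minimal sketch of stage two (action-head fine-tuning). The head design,
# pooling, loss, and trainable-parameter choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ActionHead(nn.Module):
    """Regresses a chunk of future actions from the VLM latent state."""

    def __init__(self, dim: int = 512, action_dim: int = 7, chunk: int = 8):
        super().__init__()
        self.chunk, self.action_dim = chunk, action_dim
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, chunk * action_dim)
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, D) pooled latent state -> (B, chunk, action_dim) actions.
        return self.mlp(z).view(-1, self.chunk, self.action_dim)


def finetune_step(student_vlm, action_head, optimizer,
                  obs, instruction, expert_actions):
    """Supervised fine-tuning on demonstrations; the JEPA predictor from
    stage one is no longer needed at this point."""
    z_hist, z_act = student_vlm(obs, instruction)
    z_pooled = z_act.mean(dim=1)                    # (B, D); pooling assumed
    pred_actions = action_head(z_pooled)            # (B, chunk, action_dim)
    loss = F.l1_loss(pred_actions, expert_actions)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```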
We identify four failure modes in existing latent-action pretraining pipelines: pixel-level objectives biased toward appearance, sensitivity to nuisance motion in real-world videos, information leakage that causes latent-action collapse, and fragile multi-stage training pipelines. VLA-JEPA addresses all of these issues by design.
Visualization of the attention weight matrix of latent action tokens attending to image tokens. LAPA's latent actions attend to excessively dense visual information, including operation-irrelevant details. UniVLA overemphasizes semantics, attending to irrelevant background elements. VLA-JEPA focuses precisely on the robotic arm, the hand, and the objects to be manipulated.
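A comparison like the one above can be produced by reading out the cross-attention weights between latent action tokens and image tokens. The sketch below is illustrative only: it assumes the weights of one layer are available as a (heads, queries, keys) tensor, that the latent action tokens sit at known query positions, and that the image is tokenized into a 16×16 patch grid; random weights stand in for a real forward pass.

```python
# Illustrative attention readout: average the latent-action-token attention
# over heads, keep only image-token keys, and reshape onto the patch grid.
# Tensor layout, token indices, and the 16x16 grid are assumptions.
import matplotlib.pyplot as plt
import torch

num_heads, num_tokens = 8, 300
grid = 16                                   # assumed 16x16 image patch grid
image_slice = slice(0, grid * grid)         # assumed image-token positions
action_slice = slice(296, 300)              # assumed latent-action positions

# Stand-in for attention weights from one transformer layer: (H, Q, K).
attn = torch.rand(num_heads, num_tokens, num_tokens).softmax(dim=-1)

# Latent-action queries attending to image-token keys, averaged over heads
# and action tokens, then reshaped to the patch grid.
heatmap = attn[:, action_slice, image_slice].mean(dim=(0, 1)).reshape(grid, grid)

plt.imshow(heatmap.numpy(), cmap="viridis")
plt.title("Latent action tokens -> image tokens (illustrative)")
plt.colorbar()
plt.savefig("latent_action_attention.png", dpi=150)
```

In practice the heatmap would be upsampled and overlaid on the corresponding camera frame rather than plotted on its own.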
Experimental setup on LIBERO, LIBERO-Plus, SimplerEnv, and a real-world Franka robot. We evaluate VLA-JEPA on three simulation benchmarks and one real-world environment.
VLA-JEPA achieves state-of-the-art performance on the LIBERO benchmark with the highest average success rate, outperforming methods that rely on extensive robot datasets for pre-training.
| Method | Spatial | Object | Goal | LIBERO-10 | Avg |
|---|---|---|---|---|---|
| LAPA | 73.8 | 74.6 | 58.8 | 55.4 | 65.7 |
| UniVLA | 96.5 | 96.8 | 95.6 | 92.0 | 95.2 |
| OpenVLA-OFT | 97.6 | 98.4 | 97.9 | 94.5 | 97.1 |
| π0 | 96.8 | 98.8 | 95.8 | 85.2 | 94.2 |
| π0-Fast | 96.4 | 96.8 | 88.6 | 60.2 | 85.5 |
| CoT-VLA | 87.5 | 91.6 | 87.6 | 69.0 | 81.1 |
| WorldVLA | 87.6 | 96.2 | 83.4 | 60.0 | 81.8 |
| villa-X | 97.5 | 97.0 | 91.5 | 74.5 | 90.1 |
| GR00T N1 | 94.4 | 97.6 | 93.0 | 90.6 | 93.9 |
| π0.5 | 98.8 | 98.2 | 98.0 | 92.4 | 96.9 |
| VLA-JEPA (ours) | 96.2 | 99.6 | 97.2 | 95.8 | 97.2 |
On SimplerEnv, VLA-JEPA achieves the highest average success rate on the Google Robot and competitive results on WidowX Robot, while using less than 1% of the training data compared to methods like villa-X.
| Method | Pick (Google) | Move (Google) | Drawer (Google) | Place (Google) | Avg (Google) | Spoon (WidowX) | Carrot (WidowX) | Block (WidowX) | Eggplant (WidowX) | Avg (WidowX) |
|---|---|---|---|---|---|---|---|---|---|---|
| LAPA* | - | - | - | - | - | 70.8 | 45.8 | 54.2 | 58.3 | 57.3 |
| villa-X | 81.7 | 55.4 | 38.4 | 4.2 | 44.9 | 48.3 | 24.2 | 19.2 | 71.7 | 40.8 |
| RoboVLMs | 77.3 | 61.7 | 43.5 | 24.1 | 51.7 | 45.8 | 20.8 | 4.2 | 79.2 | 37.5 |
| π0 | 72.7 | 65.3 | 38.3 | - | - | 29.1 | 0 | 16.6 | 62.5 | 40.1 |
| π0-Fast | 75.3 | 67.5 | 42.9 | - | - | 29.1 | 21.9 | 10.8 | 66.7 | 48.3 |
| VLA-JEPA (ours) | 88.3 | 64.1 | 59.3 | 49.1 | 65.2 | 75.0 | 70.8 | 12.5 | 70.8 | 57.3 |
VLA-JEPA achieves the best performance on 5 out of 7 perturbation types in LIBERO-Plus, with clear advantages under Robot, Language, Light, Background, and Layout perturbations. This confirms that the learned latent actions effectively handle task-agnostic disturbances.
| Method | Camera | Robot | Language | Light | Background | Noise | Layout | Avg |
|---|---|---|---|---|---|---|---|---|
| UniVLA | 1.8 | 46.2 | 69.6 | 69.0 | 81.0 | 21.2 | 31.9 | 42.9 |
| OpenVLA-OFT | 56.4 | 31.9 | 79.5 | 88.7 | 93.3 | 75.8 | 74.2 | 69.6 |
| π0 | 13.8 | 6.0 | 58.8 | 85.0 | 81.4 | 79.0 | 68.9 | 53.6 |
| π0-Fast | 65.1 | 21.6 | 61.0 | 73.2 | 73.2 | 74.4 | 68.8 | 61.6 |
| WorldVLA | 0.1 | 27.9 | 41.6 | 43.7 | 17.1 | 10.9 | 38.0 | 25.0 |
| VLA-JEPA (ours) | 63.3 | 67.1 | 85.4 | 95.6 | 93.6 | 66.3 | 85.1 | 79.5 |
Effect of the proportion of human video data in pre-training on success rates across different perturbation dimensions on the LIBERO-Plus benchmark.
Real-world experimental results comparing VLA-JEPA with π0 and π0.5 across in-distribution, task OOD, and layout OOD settings.
The model achieves its best performance when the video horizon is close to the predefined action horizon. When T is too small, the encoded information is insufficient; when T is too large, redundant information is introduced.
| Video horizon T | Spatial | Object | Goal | LIBERO-10 | Avg |
|---|---|---|---|---|---|
| 4 | 95.0 | 99.2 | 95.8 | 89.0 | 94.8 |
| 8 | 94.8 | 99.8 | 95.8 | 94.0 | 96.1 |
| 16 | 92.8 | 98.8 | 98.0 | 92.2 | 95.5 |
@article{vlajepa2025,
title={VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model},
author={Sun, Jingwen and Zhang, Wenyao and Qi, Zekun and Ren, Shaojie and Liu, Zezhi and Zhu, Hanxin and Sun, Guangzhong and Jin, Xin and Chen, Zhibo},
journal={arXiv preprint arXiv:25xx.xxxxx},
year={2025}
}