VLA-JEPA
Enhancing Vision-Language-Action Model
with Latent World Model

1University of Science and Technology of China 2Zhongguancun Academy 3Shanghai Jiao Tong University
4Tsinghua University 5Eastern Institute of Technology 6University of Chinese Academy of Sciences 7Nankai University

Abstract

Pretraining Vision-Language-Action (VLA) policies on internet-scale video is appealing, yet current latent-action objectives often learn the wrong thing: they remain anchored to pixel variation rather than action-relevant state transitions, making them vulnerable to appearance bias, nuisance motion, and information leakage. We introduce VLA-JEPA, a JEPA-style pretraining framework that sidesteps these pitfalls by design. The key idea is leakage-free state prediction: a target encoder produces latent representations from future frames, while the student pathway sees only the current observation—future information is used solely as supervision targets, never as input. By predicting in latent space rather than pixel space, VLA-JEPA learns dynamics abstractions that are robust to camera motion and irrelevant background changes. This yields a simple two-stage recipe—JEPA pretraining followed by action-head fine-tuning—without the multi-stage complexity of prior latent-action pipelines. Experiments on LIBERO, LIBERO-Plus, SimplerEnv and real-world manipulation tasks show that VLA-JEPA achieves consistent gains in sample efficiency and generalization over existing methods.

VLA-JEPA Architecture

VLA-JEPA model architecture. A target encoder produces latent targets from future frames, while the student pathway sees only the current observation through a VLM backbone. A predictor maps the history latent states and the latent action representations to future latent states, trained as a latent world model using a JEPA alignment loss. Future frames are never provided as inputs to the VLM backbone; they are used solely to construct training targets.
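Below is a minimal sketch of this objective in PyTorch. The module names, tensor interfaces, and the choice of an L2 alignment loss are illustrative assumptions rather than the exact implementation; the property the sketch is meant to show is that future frames only ever reach the frozen target encoder, so no future information can leak into the student pathway.

```python
# Hedged sketch of the leakage-free latent prediction objective.
# Assumptions (not the authors' exact code): module interfaces, tensor shapes,
# an L2 alignment loss, and a frozen pretrained video encoder as the target.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentWorldModel(nn.Module):
    def __init__(self, vlm_backbone: nn.Module, target_encoder: nn.Module, predictor: nn.Module):
        super().__init__()
        self.vlm_backbone = vlm_backbone      # student pathway: sees only the current observation
        self.target_encoder = target_encoder  # e.g., a pretrained V-JEPA2-style video encoder
        self.predictor = predictor            # maps (history latents, latent actions) -> future latents
        for p in self.target_encoder.parameters():
            p.requires_grad_(False)           # future frames supply targets only, never gradients or inputs

    def forward(self, current_obs, instruction, latent_action_tokens, future_frames):
        # Student pathway: the VLM backbone never receives future frames.
        history_latents, action_latents = self.vlm_backbone(current_obs, instruction, latent_action_tokens)
        pred_future = self.predictor(history_latents, action_latents)

        # Target pathway: latent targets from future frames, detached from the graph.
        with torch.no_grad():
            target_future = self.target_encoder(future_frames)

        # JEPA alignment loss in latent space; no pixel reconstruction.
        return F.mse_loss(pred_future, target_future)
```

Because gradients never reach the target encoder, the latent targets cannot drift toward a trivial constant, and the prediction happens entirely in latent space rather than pixel space.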

Highlights

1. We analyze why many latent-action pretraining objectives remain pixel-tethered, becoming biased toward appearance, vulnerable to nuisance motion, and prone to information leakage when future context enters the learner.

2. We propose VLA-JEPA, a JEPA-style latent predictive alignment scheme that learns action-relevant transition semantics by predicting and aligning future latent states, without pixel reconstruction or information leakage, and with a single-stage pretraining pipeline.

3. VLA-JEPA yields consistent gains in sample efficiency, robustness, and generalization across embodied control benchmarks (LIBERO, LIBERO-Plus, SimplerEnv) and real-world settings, while simplifying training relative to prior multi-stage latent-action pipelines.

Method Overview

VLA-JEPA Framework

VLA-JEPA Framework. We adopt a VLM backbone with learnable latent action tokens. For action-free human videos, VLA-JEPA extracts latent actions via a world-model-based state transition objective using a V-JEPA2 encoder. For robot demonstrations, a flow-matching action head generates precise end-effector trajectories. During fine-tuning, both objectives are jointly optimized, enabling learned state-transition dynamics to be effectively leveraged for downstream robotic control.
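As a rough illustration of how the two objectives could be combined during fine-tuning, the sketch below pairs the latent-alignment loss from the architecture sketch above with a standard flow-matching loss on action chunks. The linear interpolation path, the `action_head(x_t, t, action_latents)` interface, and the weighting `lambda_fm` are assumptions for illustration, not the exact training code.

```python
# Hedged sketch of a joint fine-tuning objective: JEPA latent alignment
# plus flow matching on robot action chunks (interfaces are illustrative).
import torch
import torch.nn.functional as F


def flow_matching_loss(action_head, action_latents, actions):
    """actions: (B, H, D) ground-truth action chunk from a robot demonstration."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1, 1, device=actions.device)  # one time per sample
    x_t = (1.0 - t) * noise + t * actions                          # straight-line path from noise to data
    target_velocity = actions - noise                              # constant velocity along that path
    pred_velocity = action_head(x_t, t, action_latents)            # head conditioned on latent action features
    return F.mse_loss(pred_velocity, target_velocity)


def joint_loss(l_jepa, action_head, action_latents, actions, lambda_fm=1.0):
    # Joint fine-tuning objective: latent world-model alignment + flow matching.
    return l_jepa + lambda_fm * flow_matching_loss(action_head, action_latents, actions)
```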

Comparison with Prior VLA Methods

We identify four failure modes in existing latent-action pretraining pipelines: pixel-level objectives biased toward appearance, amplified noisy motion from real-world videos, information leakage causing latent-action collapse, and fragile multi-stage training pipelines. VLA-JEPA addresses all of these issues by design.

Attention Map Visualization

Visualization of the attention weight matrix of latent action tokens attending to image tokens. LAPA's latent actions attend to excessively dense visual information, including operation-irrelevant details. UniVLA overemphasizes semantics, attending to irrelevant background elements. VLA-JEPA focuses precisely on the robotic arm, the hand, and the objects to be manipulated.
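For readers who want to reproduce this kind of visualization, the snippet below shows one generic way to slice a cached attention matrix into an action-token-to-image-patch heatmap. The token index ranges and patch grid size are hypothetical placeholders that depend on the backbone's tokenization.

```python
# Sketch of extracting an attention heatmap from latent action tokens to image
# patches, assuming per-layer attention weights are available after a forward pass.
import torch


def action_to_image_attention(attn, image_slice, action_slice, grid_hw):
    """attn: (num_heads, seq_len, seq_len) attention weights from one layer.
    Returns a (grid_h, grid_w) map of how strongly the latent action tokens
    attend to each image patch, averaged over heads and action tokens."""
    weights = attn[:, action_slice, image_slice]   # heads x action tokens x image tokens
    weights = weights.mean(dim=(0, 1))             # average over heads and action tokens
    grid_h, grid_w = grid_hw
    return weights.reshape(grid_h, grid_w)


# Example with dummy shapes: 16 heads, 256 image patches (16x16 grid), 8 action tokens.
attn = torch.softmax(torch.randn(16, 300, 300), dim=-1)
heatmap = action_to_image_attention(attn, image_slice=slice(0, 256),
                                     action_slice=slice(290, 298), grid_hw=(16, 16))
print(heatmap.shape)  # torch.Size([16, 16])
```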

Experiments Setup


Experimental setup on LIBERO, LIBERO-Plus, SimplerEnv, and a real-world Franka robot. We evaluate VLA-JEPA on three simulation benchmarks and one real-world environment.

LIBERO Benchmark Results

VLA-JEPA achieves state-of-the-art performance on the LIBERO benchmark with the highest average success rate, outperforming methods that rely on extensive robot datasets for pre-training.

Method Spatial Object Goal LIBERO-10 Avg
LAPA 73.8 74.6 58.8 55.4 65.7
UniVLA 96.5 96.8 95.6 92.0 95.2
OpenVLA-OFT 97.6 98.4 97.9 94.5 97.1
π0 96.8 98.8 95.8 85.2 94.2
π0-Fast 96.4 96.8 88.6 60.2 85.5
CoT-VLA 87.5 91.6 87.6 69.0 81.1
WorldVLA 87.6 96.2 83.4 60.0 81.8
villa-X 97.5 97.0 91.5 74.5 90.1
GR00T N1 94.4 97.6 93.0 90.6 93.9
π0.5 98.8 98.2 98.0 92.4 96.9
VLA-JEPA (ours) 96.2 99.6 97.2 95.8 97.2

SimplerEnv Benchmark Results

On SimplerEnv, VLA-JEPA achieves the highest average success rate on the Google Robot and competitive results on the WidowX Robot, while using less than 1% of the training data of methods such as villa-X.

Method | Google Robot (Pick Move Drawer Place Avg) | WidowX Robot (Spoon Carrot Block Eggplant Avg)
LAPA* | - - - - - | 70.8 45.8 54.2 58.3 57.3
villa-X | 81.7 55.4 38.4 4.2 44.9 | 48.3 24.2 19.2 71.7 40.8
RoboVLMs | 77.3 61.7 43.5 24.1 51.7 | 45.8 20.8 4.2 79.2 37.5
π0 | 72.7 65.3 38.3 - - | 29.1 0 16.6 62.5 40.1
π0-Fast | 75.3 67.5 42.9 - - | 29.1 21.9 10.8 66.7 48.3
VLA-JEPA (ours) | 88.3 64.1 59.3 49.1 65.2 | 75.0 70.8 12.5 70.8 57.3

LIBERO-Plus Robustness Results

VLA-JEPA achieves the best performance on 5 out of 7 perturbation categories in LIBERO-Plus, with particularly large advantages under the Language, Light, Background, and Layout perturbations. This indicates that the learned latent actions effectively handle task-agnostic disturbances.

Method Camera Robot Language Light Background Noise Layout Avg
UniVLA 1.8 46.2 69.6 69.0 81.0 21.2 31.9 42.9
OpenVLA-OFT 56.4 31.9 79.5 88.7 93.3 75.8 74.2 69.6
π0 13.8 6.0 58.8 85.0 81.4 79.0 68.9 53.6
π0-Fast 65.1 21.6 61.0 73.2 73.2 74.4 68.8 61.6
WorldVLA 0.1 27.9 41.6 43.7 17.1 10.9 38.0 25.0
VLA-JEPA (ours) 63.3 67.1 85.4 95.6 93.6 66.3 85.1 79.5

Impact of Human Video Data

Human video data primarily enhances the robustness and stability of the VLA model by strengthening its existing skill repertoire, rather than introducing new action execution capabilities. As the scale of human video data increases, the robustness of the resulting policy consistently improves.

Human Video Proportion Analysis

Effect of the proportion of human video data in pre-training on success rates across different perturbation dimensions on the LIBERO-Plus benchmark.

Real-World Experiments

We evaluate VLA-JEPA on table-top manipulation tasks using a Franka Research 3 arm. VLA-JEPA achieves state-of-the-art performance under both in-distribution and object layout out-of-distribution settings. Notably, VLA-JEPA acquires the skill of repeated grasping—reopening the gripper to attempt another grasp after a failure—a capability not observed in π0 or π0.5.

Real-World Experimental Results

Real-world experimental results comparing VLA-JEPA with π0 and π0.5 across in-distribution, task OOD, and layout OOD settings.

Ablation: Video Horizon

The model achieves its best performance when the video horizon T is close to the predefined action horizon. When T is too small, the encoded future information is insufficient; when T is too large, redundant information is introduced.

T Spatial Object Goal LIBERO-10 Avg
4 95.0 99.2 95.8 89.0 94.8
8 94.8 99.8 95.8 94.0 96.1
16 92.8 98.8 98.0 92.2 95.5
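As a concrete illustration of how the video horizon T enters data preparation, the sketch below slices T future frames as target-encoder input alongside an H-step action chunk for the action head. The array layout, variable names, and slicing scheme are assumptions for illustration.

```python
# Hedged sketch of sampling a training example with video horizon T and an
# action chunk of length H (variable names and layout are illustrative).
import numpy as np


def build_sample(frames, actions, k, video_horizon=8, action_horizon=8):
    """frames: (N, H, W, 3) video; actions: (N, D) per-step actions."""
    current_obs = frames[k]                               # student input: current frame only
    future_clip = frames[k + 1 : k + 1 + video_horizon]   # target-encoder input (supervision only)
    action_chunk = actions[k : k + action_horizon]        # flow-matching supervision
    return current_obs, future_clip, action_chunk


# Example with dummy data; the ablation favors T close to the action horizon.
frames = np.zeros((64, 224, 224, 3), dtype=np.uint8)
actions = np.zeros((64, 7), dtype=np.float32)
obs, clip, chunk = build_sample(frames, actions, k=10, video_horizon=8, action_horizon=8)
print(clip.shape, chunk.shape)  # (8, 224, 224, 3) (8, 7)
```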

BibTeX

@article{vlajepa2025,
  title={VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model},
  author={Sun, Jingwen and Zhang, Wenyao and Qi, Zekun and Ren, Shaojie and Liu, Zezhi and Zhu, Hanxin and Sun, Guangzhong and Jin, Xin and Chen, Zhibo},
  journal={arXiv preprint arXiv:25xx.xxxxx},
  year={2025}
}