VLA-JEPA
Enhancing Vision-Language-Action Model
with Latent World Model

1University of Science and Technology of China 2Zhongguancun Academy 3Shanghai Jiao Tong University
4Tsinghua University 5Eastern Institute of Technology 6University of Chinese Academy of Sciences 7Nankai University

Abstract

Pretraining Vision-Language-Action (VLA) policies on internet-scale video is appealing, yet current latent-action objectives often learn the wrong thing: they remain anchored to pixel variation rather than action-relevant state transitions, making them vulnerable to appearance bias, nuisance motion, and information leakage. We introduce VLA-JEPA, a JEPA-style pretraining framework that sidesteps these pitfalls by design. The key idea is leakage-free state prediction: a target encoder produces latent representations from future frames, while the student pathway sees only the current observation—future information is used solely as supervision targets, never as input. By predicting in latent space rather than pixel space, VLA-JEPA learns dynamics abstractions that are robust to camera motion and irrelevant background changes. This yields a simple two-stage recipe—JEPA pretraining followed by action-head fine-tuning—without the multi-stage complexity of prior latent-action pipelines. Experiments on LIBERO, LIBERO-Plus, SimplerEnv and real-world manipulation tasks show that VLA-JEPA achieves consistent gains in sample efficiency and generalization over existing methods.

VLA-JEPA Architecture

VLA-JEPA model architecture. A target encoder produces latent targets from future frames, while the student pathway sees only the current observation through a VLM backbone. A predictor maps the history latent states and the latent action representations to future latent states, trained as a latent world model using a JEPA alignment loss. Future frames are never provided as inputs to the VLM backbone; they are used solely to construct training targets.
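Below is a minimal sketch of this objective in PyTorch. The module names, tensor interfaces, and the choice of an L2 alignment loss are illustrative assumptions rather than the exact implementation; the property the sketch is meant to show is that future frames only ever reach the frozen target encoder, so no future information can leak into the student pathway.

```python
# Hedged sketch of the leakage-free latent prediction objective.
# Assumptions (not the authors' exact code): module interfaces, tensor shapes,
# an L2 alignment loss, and a frozen pretrained video encoder as the target.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentWorldModel(nn.Module):
    def __init__(self, vlm_backbone: nn.Module, target_encoder: nn.Module, predictor: nn.Module):
        super().__init__()
        self.vlm_backbone = vlm_backbone      # student pathway: sees only the current observation
        self.target_encoder = target_encoder  # e.g., a pretrained V-JEPA2-style video encoder
        self.predictor = predictor            # maps (history latents, latent actions) -> future latents
        for p in self.target_encoder.parameters():
            p.requires_grad_(False)           # future frames supply targets only, never gradients or inputs

    def forward(self, current_obs, instruction, latent_action_tokens, future_frames):
        # Student pathway: the VLM backbone never receives future frames.
        history_latents, action_latents = self.vlm_backbone(current_obs, instruction, latent_action_tokens)
        pred_future = self.predictor(history_latents, action_latents)

        # Target pathway: latent targets from future frames, detached from the graph.
        with torch.no_grad():
            target_future = self.target_encoder(future_frames)

        # JEPA alignment loss in latent space; no pixel reconstruction.
        return F.mse_loss(pred_future, target_future)
```

Because gradients never reach the target encoder, the latent targets cannot drift toward a trivial constant, and the prediction happens entirely in latent space rather than pixel space.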

Highlights

1. We analyze why many latent-action pretraining objectives remain pixel-tethered, becoming biased toward appearance, vulnerable to nuisance motion, and prone to information leakage when future context enters the learner.

2. We propose VLA-JEPA, a JEPA-style latent predictive alignment scheme that learns action-relevant transition semantics by predicting and aligning future latent states, without pixel reconstruction or information leakage, and with a single-stage pretraining pipeline.

3. VLA-JEPA yields consistent gains in sample efficiency, robustness, and generalization across embodied control benchmarks (LIBERO, LIBERO-Plus, SimplerEnv) and real-world settings, while simplifying training relative to prior multi-stage latent-action pipelines.

Method Overview

VLA-JEPA Framework

VLA-JEPA Framework. We adopt a VLM backbone with learnable latent action tokens. For action-free human videos, VLA-JEPA extracts latent actions via a world-model-based state transition objective using a V-JEPA2 encoder. For robot demonstrations, a flow-matching action head generates precise end-effector trajectories. During fine-tuning, both objectives are jointly optimized, enabling learned state-transition dynamics to be effectively leveraged for downstream robotic control.
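As a rough illustration of how the two objectives could be combined during fine-tuning, the sketch below pairs the latent-alignment loss from the architecture sketch above with a standard flow-matching loss on action chunks. The linear interpolation path, the `action_head(x_t, t, action_latents)` interface, and the weighting `lambda_fm` are assumptions for illustration, not the exact training code.

```python
# Hedged sketch of a joint fine-tuning objective: JEPA latent alignment
# plus flow matching on robot action chunks (interfaces are illustrative).
import torch
import torch.nn.functional as F


def flow_matching_loss(action_head, action_latents, actions):
    """actions: (B, H, D) ground-truth action chunk from a robot demonstration."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1, 1, device=actions.device)  # one time per sample
    x_t = (1.0 - t) * noise + t * actions                          # straight-line path from noise to data
    target_velocity = actions - noise                              # constant velocity along that path
    pred_velocity = action_head(x_t, t, action_latents)            # head conditioned on latent action features
    return F.mse_loss(pred_velocity, target_velocity)


def joint_loss(l_jepa, action_head, action_latents, actions, lambda_fm=1.0):
    # Joint fine-tuning objective: latent world-model alignment + flow matching.
    return l_jepa + lambda_fm * flow_matching_loss(action_head, action_latents, actions)
```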

Comparison with Prior VLA Methods

We identify four failure modes in existing latent-action pretraining pipelines: pixel-level objectives biased toward appearance, amplified noisy motion from real-world videos, information leakage causing latent-action collapse, and fragile multi-stage training pipelines. VLA-JEPA addresses all of these issues by design.

Attention Map Visualization

Visualization of the attention weight matrix of latent action tokens attending to image tokens. LAPA's latent actions attend to excessively dense visual information, including operation-irrelevant details. UniVLA overemphasizes semantics, attending to irrelevant background elements. VLA-JEPA focuses precisely on the robotic arm, the hand, and the objects to be manipulated.
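For readers who want to reproduce this kind of visualization, the snippet below shows one generic way to slice a cached attention matrix into an action-token-to-image-patch heatmap. The token index ranges and patch grid size are hypothetical placeholders that depend on the backbone's tokenization.

```python
# Sketch of extracting an attention heatmap from latent action tokens to image
# patches, assuming per-layer attention weights are available after a forward pass.
import torch


def action_to_image_attention(attn, image_slice, action_slice, grid_hw):
    """attn: (num_heads, seq_len, seq_len) attention weights from one layer.
    Returns a (grid_h, grid_w) map of how strongly the latent action tokens
    attend to each image patch, averaged over heads and action tokens."""
    weights = attn[:, action_slice, image_slice]   # heads x action tokens x image tokens
    weights = weights.mean(dim=(0, 1))             # average over heads and action tokens
    grid_h, grid_w = grid_hw
    return weights.reshape(grid_h, grid_w)


# Example with dummy shapes: 16 heads, 256 image patches (16x16 grid), 8 action tokens.
attn = torch.softmax(torch.randn(16, 300, 300), dim=-1)
heatmap = action_to_image_attention(attn, image_slice=slice(0, 256),
                                     action_slice=slice(290, 298), grid_hw=(16, 16))
print(heatmap.shape)  # torch.Size([16, 16])
```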

Experiments Setup


Experimental setup on LIBERO, LIBERO-Plus, SimplerEnv, and a real-world Franka robot. We evaluate VLA-JEPA on three simulation benchmarks and one real-world environment.

LIBERO Benchmark Results

VLA-JEPA achieves state-of-the-art performance on the LIBERO benchmark with the highest average success rate, outperforming methods that rely on extensive robot datasets for pre-training.

Method Spatial Object Goal LIBERO-10 Avg
LAPA 73.8 74.6 58.8 55.4 65.7
UniVLA 96.5 96.8 95.6 92.0 95.2
OpenVLA-OFT 97.6 98.4 97.9 94.5 97.1
π0 96.8 98.8 95.8 85.2 94.2
π0-Fast 96.4 96.8 88.6 60.2 85.5
CoT-VLA 87.5 91.6 87.6 69.0 81.1
WorldVLA 87.6 96.2 83.4 60.0 81.8
villa-X 97.5 97.0 91.5 74.5 90.1
GR00T N1 94.4 97.6 93.0 90.6 93.9
π0.5 98.8 98.2 98.0 92.4 96.9
VLA-JEPA (ours) 96.2 99.6 97.2 95.8 97.2

SimplerEnv Benchmark Results

On SimplerEnv, VLA-JEPA achieves the highest average success rate on the Google Robot and competitive results on the WidowX Robot, while using less than 1% of the training data of methods such as villa-X.

Method | Google Robot (Pick Move Drawer Place Avg) | WidowX Robot (Spoon Carrot Block Eggplant Avg)
LAPA* | - - - - - | 70.8 45.8 54.2 58.3 57.3
villa-X | 81.7 55.4 38.4 4.2 44.9 | 48.3 24.2 19.2 71.7 40.8
RoboVLMs | 77.3 61.7 43.5 24.1 51.7 | 45.8 20.8 4.2 79.2 37.5
π0 | 72.7 65.3 38.3 - - | 29.1 0 16.6 62.5 40.1
π0-Fast | 75.3 67.5 42.9 - - | 29.1 21.9 10.8 66.7 48.3
VLA-JEPA (ours) | 88.3 64.1 59.3 49.1 65.2 | 75.0 70.8 12.5 70.8 57.3

LIBERO-Plus Robustness Results

VLA-JEPA achieves the best performance on 5 out of 7 perturbation categories in LIBERO-Plus, with particularly large advantages under the Language, Light, Background, and Layout perturbations. This indicates that the learned latent actions effectively handle task-agnostic disturbances.

Method Camera Robot Language Light Background Noise Layout Avg
UniVLA 1.8 46.2 69.6 69.0 81.0 21.2 31.9 42.9
OpenVLA-OFT 56.4 31.9 79.5 88.7 93.3 75.8 74.2 69.6
π0 13.8 6.0 58.8 85.0 81.4 79.0 68.9 53.6
π0-Fast 65.1 21.6 61.0 73.2 73.2 74.4 68.8 61.6
WorldVLA 0.1 27.9 41.6 43.7 17.1 10.9 38.0 25.0
VLA-JEPA (ours) 63.3 67.1 85.4 95.6 93.6 66.3 85.1 79.5

Impact of Human Video Data

Human video data primarily enhances the robustness and stability of the VLA model by strengthening its existing skill repertoire, rather than introducing new action execution capabilities. As the scale of human video data increases, the robustness of the resulting policy consistently improves.

Human Video Proportion Analysis

Effect of the proportion of human video data in pre-training on success rates across different perturbation dimensions on the LIBERO-Plus benchmark.

Real-World Experiments

We evaluate VLA-JEPA on table-top manipulation tasks using a Franka Research 3 arm. VLA-JEPA achieves state-of-the-art performance under both in-distribution and object layout out-of-distribution settings. Notably, VLA-JEPA acquires the skill of repeated grasping—reopening the gripper to attempt another grasp after a failure—a capability not observed in π0 or π0.5.

Real-World Experimental Results

Real-world experimental results comparing VLA-JEPA with π0 and π0.5 across in-distribution, task OOD, and layout OOD settings.

Ablation: Video Horizon

The model achieves its best performance when the video horizon T is close to the predefined action horizon. When T is too small, the encoded future information is insufficient; when T is too large, redundant information is introduced.

T Spatial Object Goal LIBERO-10 Avg
4 95.0 99.2 95.8 89.0 94.8
8 94.8 99.8 95.8 94.0 96.1
16 92.8 98.8 98.0 92.2 95.5
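As a concrete illustration of how the video horizon T enters data preparation, the sketch below slices T future frames as target-encoder input alongside an H-step action chunk for the action head. The array layout, variable names, and slicing scheme are assumptions for illustration.

```python
# Hedged sketch of sampling a training example with video horizon T and an
# action chunk of length H (variable names and layout are illustrative).
import numpy as np


def build_sample(frames, actions, k, video_horizon=8, action_horizon=8):
    """frames: (N, H, W, 3) video; actions: (N, D) per-step actions."""
    current_obs = frames[k]                               # student input: current frame only
    future_clip = frames[k + 1 : k + 1 + video_horizon]   # target-encoder input (supervision only)
    action_chunk = actions[k : k + action_horizon]        # flow-matching supervision
    return current_obs, future_clip, action_chunk


# Example with dummy data; the ablation favors T close to the action horizon.
frames = np.zeros((64, 224, 224, 3), dtype=np.uint8)
actions = np.zeros((64, 7), dtype=np.float32)
obs, clip, chunk = build_sample(frames, actions, k=10, video_horizon=8, action_horizon=8)
print(clip.shape, chunk.shape)  # (8, 224, 224, 3) (8, 7)
```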

BibTeX

@article{vlajepa2025,
  title={VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model},
  author={Sun, Jingwen and Zhang, Wenyao and Qi, Zekun and Ren, Shaojie and Liu, Zezhi and Zhu, Hanxin and Sun, Guangzhong and Jin, Xin and Chen, Zhibo},
  journal={arXiv preprint arXiv:25xx.xxxxx},
  year={2025}
}