We introduce WEAVER, a world model architecture that satisfies three desiderata: (i) fidelity, (ii) consistency, and (iii) efficiency. WEAVER achieves state-of-the-art performance in policy evaluation (ρ = 0.870 correlation with real-world success rate), policy improvement (real-world success rate improvement of 38% on top of the π0.5 robot foundation model), and test-time planning (real-world success rate improvement of 14% with a 5–10× speedup over prior world models).

WEAVER overview showing world-model uses for policy evaluation, policy improvement, and test-time planning.

(i) Fidelity

Physically accurate predictions that correlate with reality; WEAVER generates multiple views used by modern visuomotor policies.

(ii) Consistency

Coherent predictions over long horizons; WEAVER uses sparse memory and short-term history to handle occlusions.

(iii) Efficiency

Fast enough for downstream planning; reward and critic heads on imagined latents avoid expensive decoding and external judges.

Method

WEAVER architecture diagram.

WEAVER is an action-conditioned latent world model for long-horizon robot manipulation. It predicts future latent states, decodes future observations when needed, and plans in latent space. The key design decisions are:

Architecture and design. WEAVER predicts multiple views and robot proprioceptive state. The sparse long-term memory preserves scene context across occlusions, short-term history captures recent motion, and an efficient spatio-temporal Transformer generates future latents conditioned on the action tokens.

Training and fast inference. The latent dynamics model is trained with flow matching and Diffusion Forcing, which applies different noise levels to different future steps and supports consistent long-horizon prediction. At inference time, KV caching and rectified-flow post-training reduce the cost of planning.
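As a rough illustration of the training signal described above, the sketch below builds flow-matching inputs and velocity targets with an independent noise level per future step, in the spirit of Diffusion Forcing. The interpolation convention (`x_t = (1 - t) * noise + t * data`, target `data - noise`), the uniform sampling of `t`, and the function names `noised_sequence` and `flow_matching_loss` are assumptions made for illustration; the source does not specify them.

```python
import random


def noised_sequence(latents, rng):
    """Build per-step flow-matching inputs and velocity targets.

    latents: list of future latent vectors (each a list of floats).
    Every step draws its own noise level t in [0, 1), as in Diffusion Forcing.
    Assumed convention: x_t = (1 - t) * noise + t * data, target = data - noise.
    """
    levels, inputs, targets = [], [], []
    for z in latents:
        t = rng.random()  # independent noise level for this step
        eps = [rng.gauss(0.0, 1.0) for _ in z]
        x_t = [(1.0 - t) * e + t * d for e, d in zip(eps, z)]
        v = [d - e for e, d in zip(eps, z)]
        levels.append(t)
        inputs.append(x_t)
        targets.append(v)
    return levels, inputs, targets


def flow_matching_loss(predicted, targets):
    """Mean squared error between predicted and target velocities."""
    total, count = 0.0, 0
    for p_step, v_step in zip(predicted, targets):
        for p, v in zip(p_step, v_step):
            total += (p - v) ** 2
            count += 1
    return total / count
```

In a real model, `predicted` would come from the Transformer given the noised inputs, the per-step noise levels, and the action tokens; here it is left to the caller.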

Accurate reward and value estimation. To efficiently score action chunks without decoding latents into images or querying an external VLM judge, WEAVER uses a lightweight reward head that operates directly on imagined latent states and the language instruction. A critic estimates returns beyond the imagined horizon.
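One common way to combine per-step rewards with a critic is an n-step advantage estimate; the sketch below assumes that form. The discount factor `gamma`, the function name `chunk_advantage`, and the exact formula are illustrative assumptions, not details confirmed by the source.

```python
def chunk_advantage(rewards, value_start, value_end, gamma=0.99):
    """n-step advantage of one imagined action chunk (assumed form).

    rewards:     reward-head outputs for each imagined step of the chunk.
    value_start: critic estimate at the state where the chunk begins.
    value_end:   critic estimate at the last imagined state, standing in
                 for returns beyond the imagined horizon.
    """
    ret = 0.0
    for k, r in enumerate(rewards):
        ret += (gamma ** k) * r  # discounted reward inside the horizon
    ret += (gamma ** len(rewards)) * value_end  # bootstrap past the horizon
    return ret - value_start
```

For example, `chunk_advantage([1.0, 1.0], 0.0, 0.0, gamma=0.5)` evaluates to `1.5`, since the rewards contribute `1.0 + 0.5` and both critic terms are zero.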

World Model Evaluation

WEAVER outperforms Ctrl-World on DROID validation and out-of-distribution (OOD) task data while requiring significantly less inference time. As we decrease the number of function evaluations (NFE) to reduce latency, Ctrl-World's quality degrades more than WEAVER's; both models incur their highest error when predicting wrist-camera viewpoints.
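To make the NFE trade-off concrete, the sketch below integrates a velocity field with a fixed-step Euler solver, where each step costs one function evaluation. The choice of Euler integration and the name `euler_sample` are assumptions for illustration; the source does not state which solver either model uses.

```python
import math


def euler_sample(velocity, x0, nfe):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 in `nfe` Euler steps."""
    x = list(x0)
    dt = 1.0 / nfe
    for i in range(nfe):
        t = i * dt
        v = velocity(x, t)  # one function evaluation per step
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def decay(x, t):
    """Toy curved field: few-step Euler is inaccurate on it."""
    return [-xi for xi in x]


exact = math.exp(-1.0)
coarse_error = abs(euler_sample(decay, [1.0], nfe=2)[0] - exact)
fine_error = abs(euler_sample(decay, [1.0], nfe=20)[0] - exact)
```

On this toy field the 2-step error is larger than the 20-step error. When the field is constant along each trajectory, which is the idealized goal of rectified flows, a single Euler step is already exact; this is the usual motivation for straightening trajectories before cutting NFE.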

FID and FVD versus H100 inference time for WEAVER and Ctrl-World.

Latent Verifier and Planning

WEAVER's latent reward head is trained to match RoboMeter's task-progress rewards. On OOD real trajectories, the distilled reward model can distinguish action chunks by their predicted advantage. We observe that the action sample with the highest predicted advantage corresponds to the best imagined outcome.

Reward prediction and advantage filtering plot.

Downstream applications

By jointly satisfying (i) fidelity, (ii) consistency, and (iii) efficiency, WEAVER supports various downstream world-model applications: policy evaluation, policy improvement, and test-time planning.

Policy Evaluation

To evaluate policies offline, we replay recorded real-world action trajectories open-loop in WEAVER and label the imagined rollouts. This turns the world model into a sandbox for ranking policies without additional robot execution. We show that pretrained world models often underestimate policy success, while WEAVER better matches real rollouts than Ctrl-World. The setting is difficult because rollouts span up to 40 seconds and require long-horizon visual prediction; Pour Beans is especially challenging due to granular dynamics. After finetuning, WEAVER-FT further improves agreement and is better at capturing outcomes across policies.
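Agreement between imagined and real outcomes can be summarized with a correlation coefficient across policies. The sketch below computes a Pearson correlation; whether the reported ρ is Pearson or another coefficient is not stated in the source, so that choice is an assumption, and the function name `pearson` is illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Here `xs` would hold the imagined success rate of each policy and `ys` the matching real-world success rate; a value near 1 means the world model ranks and scales policies much as real rollouts do.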

Policy evaluation plot comparing imagined and real-world success rates.

Finetuning Policy

For policy improvement, WEAVER samples candidate action chunks, simulates their long-horizon outcomes, and estimates the advantage using the latent reward and critic heads. Only high-advantage rollouts are distilled back into the base policy, avoiding updates from plans predicted to be worse than the current behavior. We observe that finetuning π0.5 on real data together with synthetic data generated by WEAVER yields the highest success rates.
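The filtering step can be sketched as follows. The rollout representation (a dict with an `"advantage"` key), the zero threshold, and the name `build_finetuning_set` are hypothetical choices for illustration; the source does not give the threshold or the data format.

```python
def build_finetuning_set(real_demos, imagined_rollouts, min_advantage=0.0):
    """Combine real demonstrations with high-advantage imagined rollouts.

    Imagined rollouts whose predicted advantage is not above `min_advantage`
    are dropped, so the policy is never updated toward plans predicted to be
    no better than its current behavior.
    """
    kept = [r for r in imagined_rollouts if r["advantage"] > min_advantage]
    return list(real_demos) + kept
```

With a threshold of zero, a rollout with advantage `0.3` is kept while rollouts with advantage `-0.1` or exactly `0.0` are discarded.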

Policy finetuning success rates across tasks.

Test-time Steering

At test time, WEAVER performs single-chunk best-of-N search: it samples multiple candidate action chunks, imagines their outcomes, and executes the one with the highest predicted advantage. To reduce latency, we post-train the world model with rectified flows, enabling fast latent rollouts and reward scoring. This makes test-time planning practical without pixel reconstruction or external VLM-as-a-judge scoring.
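A minimal sketch of single-chunk best-of-N selection is shown below. The callables `imagine` and `advantage` stand in for the world-model rollout and the latent reward and critic heads; their signatures, and the name `best_of_n`, are hypothetical and chosen only to keep the example self-contained.

```python
from typing import Callable, List, Sequence, Tuple

Chunk = List[float]


def best_of_n(
    candidates: Sequence[Chunk],
    imagine: Callable[[Chunk], float],
    advantage: Callable[[float], float],
) -> Tuple[Chunk, float]:
    """Return the candidate chunk with the highest predicted advantage."""
    if not candidates:
        raise ValueError("need at least one candidate action chunk")
    best_chunk, best_adv = candidates[0], float("-inf")
    for chunk in candidates:
        outcome = imagine(chunk)  # imagined outcome of this chunk
        adv = advantage(outcome)  # score without decoding to pixels
        if adv > best_adv:
            best_chunk, best_adv = chunk, adv
    return best_chunk, best_adv
```

As a toy usage, with `imagine=sum` and `advantage=lambda s: -abs(s - 1.0)`, the candidates `[[0.2, 0.2], [0.5, 0.5], [1.0, 1.0]]` yield `[0.5, 0.5]` as the selected chunk, because its imagined outcome lands exactly on the goal value of 1.0.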

Test-time steering success rates.
Per-chunk last-frame predictions: predicted last views of 4 candidate action samples (chunk size = 15) with estimated advantage.

BibTeX

@article{jain2026weaver,
  title={WEAVER: Efficient World Models for Robot Video Prediction},
  author={Arnav Kumar Jain and Yilin Wu and Jesse Farebrother and Gokul Swamy and Andrea Bajcsy},
  journal={CoRR},
  volume={abs/2606.13672},
  year={2026}
}