dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model

Evaluating robotic manipulation policies at scale is one of the most persistent bottlenecks in embodied AI research. Every candidate policy, every training checkpoint, every architectural variant ideally needs to be tested across hundreds of environments and task configurations. In practice, this means thousands of physical robot hours — a cost that grows linearly with the number of policies under evaluation. dWorldEval proposes a fundamentally different approach: use a learned world model as a faithful proxy for real-world execution, enabling policy evaluation to happen entirely in imagination.
The Problem with Existing World Models
Prior work has attempted to use video generation models — such as WorldEval, WorldGym, and Ctrl-World — as evaluation substrates. The idea is appealing: condition a generative model on a policy's actions and observe whether the resulting video depicts task success. But these approaches share a critical architectural flaw. They adapt video generation backbones (typically diffusion models pre-trained on internet video) and inject actions as auxiliary conditioning signals. The result is that the model's strong visual priors frequently override the action signal. When a policy issues a subtly incorrect grasp command, the video model may hallucinate a successful grasp anyway — because "objects being picked up" is a high-probability visual pattern in its training distribution.
This failure mode is not merely theoretical. In our experiments on the LIBERO benchmark, existing baselines degrade sharply when evaluated on failure trajectories. WorldGym's ∆LPIPS rises from 0.347 on expert data to 0.650 on failure data, an increase of roughly 87%. The model cannot faithfully represent what happens when a policy fails, which is precisely the regime where evaluation matters most.
A New Architecture: Discrete Diffusion from Scratch
dWorldEval takes a different path. Rather than adapting a pre-trained video backbone, we train a Masked Discrete Diffusion (MDD) model from scratch on robotic interaction data. The key design choice is to unify actions and visual observations into a single discrete token space. Actions are not auxiliary conditions — they are first-class tokens, denoised jointly with visual tokens through the same transformer architecture.
This unified tokenization has a concrete consequence: the model cannot ignore actions. Every denoising step must attend to both the visual context and the action tokens, because they occupy the same representational space. The result is substantially improved action controllability — the generated rollouts faithfully reflect the input action sequence, whether that sequence leads to success or failure.
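To make the idea of a shared token space concrete, here is a minimal sketch of how visual codes and discretized actions could sit in one vocabulary and be masked together. The vocabulary sizes, the uniform action binning, and all function names are illustrative assumptions, not details taken from the paper.

```python
import random

# Hypothetical vocabulary layout (an assumption, not the paper's):
# visual codes first, then action bins, then one shared [MASK] id.
NUM_VISUAL_CODES = 1024
NUM_ACTION_BINS = 256
ACTION_OFFSET = NUM_VISUAL_CODES
MASK_ID = NUM_VISUAL_CODES + NUM_ACTION_BINS


def discretize_action(action, low=-1.0, high=1.0):
    """Map each continuous action dimension to a token id in the shared vocabulary."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)
        bin_index = int((clipped - low) / (high - low) * (NUM_ACTION_BINS - 1))
        tokens.append(ACTION_OFFSET + bin_index)
    return tokens


def build_sequence(visual_codes, action):
    """Concatenate visual and action tokens into a single token sequence."""
    return list(visual_codes) + discretize_action(action)


def mask_tokens(sequence, mask_ratio, rng):
    """Replace a random subset of positions with MASK_ID.

    Returns the masked copy and the sorted list of masked positions.
    Visual and action positions are treated identically.
    """
    num_masked = round(mask_ratio * len(sequence))
    positions = sorted(rng.sample(range(len(sequence)), num_masked))
    masked = list(sequence)
    for position in positions:
        masked[position] = MASK_ID
    return masked, positions


rng = random.Random(0)
sequence = build_sequence([5, 17, 300, 42, 999], [-1.0, 0.0, 1.0])
masked_sequence, masked_positions = mask_tokens(sequence, 0.5, rng)
```

Because both modalities share one id space and one masking routine, a denoiser trained on such sequences has no separate conditioning pathway through which actions could be down-weighted.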

Sparse Keyframe Memory for Long Horizons
Robotic manipulation tasks often unfold over dozens of steps. A world model that drifts after five steps is useless for evaluating a 30-step assembly sequence. To maintain spatiotemporal consistency over long horizons, dWorldEval introduces a sparse keyframe memory mechanism. The model caches a small set of keyframe observations from earlier in the rollout and attends to them during generation of subsequent frames. This acts as an anchor, preventing the gradual accumulation of visual errors that plagues autoregressive generation.
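As a rough illustration of the caching idea, the sketch below keeps a small, bounded set of earlier frames and prepends them to the recent context. The selection rule (every `stride`-th frame), the eviction rule (oldest first), and the class name are assumptions made for this example; the source does not specify how keyframes are chosen.

```python
from collections import deque


class SparseKeyframeMemory:
    """Toy keyframe cache: keep every `stride`-th frame, at most `capacity` of them.

    Hypothetical selection and eviction rules, for illustration only.
    """

    def __init__(self, stride=5, capacity=4):
        self.stride = stride
        # A bounded deque silently evicts the oldest keyframe when full.
        self.frames = deque(maxlen=capacity)

    def maybe_store(self, step, frame):
        """Cache the frame if its step index falls on the keyframe stride."""
        if step % self.stride == 0:
            self.frames.append((step, frame))

    def context(self, recent_frames):
        """Return cached keyframes followed by the recent frame window."""
        return [frame for _, frame in self.frames] + list(recent_frames)


memory = SparseKeyframeMemory(stride=5, capacity=4)
for step in range(30):
    memory.maybe_store(step, f"frame{step}")

conditioning = memory.context(["frame28", "frame29"])
```

In a real model the cached entries would be token grids or embeddings that later frames attend to; strings stand in for them here so the bookkeeping is easy to inspect.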
We validate this with a round-trip consistency protocol: apply a sequence of forward actions, then apply the corresponding inverse actions, and measure how closely the final observation matches the initial one. Without memory, LPIPS error grows to 0.411 at horizon H=20. With memory, it stays at 0.243, a reduction of about 41%.
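The round-trip protocol can be sketched with a toy world model. Everything here is a stand-in: observations are 2D points rather than images, Euclidean distance replaces LPIPS, and `exact_step` and `drifting_step` are hypothetical models invented for the example. The last line reproduces the relative reduction from the reported numbers.

```python
import math


def round_trip_error(step_fn, invert_fn, distance_fn, initial_obs, actions):
    """Apply `actions`, then their inverses in reverse order, and compare to the start."""
    obs = initial_obs
    for action in actions:
        obs = step_fn(obs, action)
    for action in reversed(actions):
        obs = step_fn(obs, invert_fn(action))
    return distance_fn(initial_obs, obs)


def invert(action):
    """Inverse of a displacement action is its negation."""
    return tuple(-a for a in action)


def exact_step(obs, action):
    """Toy world model with perfect dynamics."""
    return tuple(o + a for o, a in zip(obs, action))


def drifting_step(obs, action, drift=0.25):
    """Toy world model that accumulates a small error on every step."""
    moved = exact_step(obs, action)
    return (moved[0] + drift,) + moved[1:]


actions = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
start = (0.0, 0.0)
exact_error = round_trip_error(exact_step, invert, math.dist, start, actions)
drift_error = round_trip_error(drifting_step, invert, math.dist, start, actions)

# Relative reduction in round-trip LPIPS at H=20, from the reported values.
relative_reduction = (0.411 - 0.243) / 0.411
```

A model with exact dynamics returns to its starting observation, while per-step drift shows up directly as round-trip error, which is what makes the protocol a useful probe of long-horizon consistency.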
Automatic Success Detection via Progress Tokens
A world model that generates realistic rollouts is only half the solution. Someone — or something — still needs to judge whether the rollout depicts task success. dWorldEval addresses this by jointly predicting a discrete progress token alongside each visual frame. This token represents the model's estimate of task completion at each timestep, trained on both successful and failed trajectories with explicit progress annotations.
The progress token enables fully automatic evaluation: roll out a policy in the world model, read the terminal progress score, and classify the episode as success or failure. In our experiments, this automatic scoring closely tracks both human judgment and ground-truth execution results, capturing even non-monotonic performance fluctuations across training checkpoints.
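A minimal sketch of the scoring step, assuming progress tokens are integer bins from 0 to `num_bins - 1` and that an episode counts as a success when the terminal token reaches a fixed threshold. The bin count, the threshold, and the function names are hypothetical choices for illustration.

```python
def episode_succeeded(progress_tokens, num_bins=10, threshold=0.9):
    """Classify an episode from its terminal progress token.

    The token is rescaled from a bin index to a fraction in [0, 1]
    (assumed encoding) and compared against `threshold`.
    """
    terminal = progress_tokens[-1]
    return terminal / (num_bins - 1) >= threshold


def predicted_success_rate(rollouts, num_bins=10, threshold=0.9):
    """Fraction of imagined rollouts classified as successful."""
    successes = sum(
        episode_succeeded(tokens, num_bins, threshold) for tokens in rollouts
    )
    return successes / len(rollouts)


# Per-frame progress tokens for four imagined rollouts of one policy.
rollouts = [
    [0, 3, 6, 9],  # steady progress, finishes
    [0, 2, 5, 4],  # regresses before the end
    [1, 4, 8, 9],  # finishes
    [0, 1, 1, 2],  # stalls early
]
success_rate = predicted_success_rate(rollouts)
```

Only the terminal token decides the outcome in this sketch, so a rollout whose progress rises and then falls back is counted as a failure.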
Results: Strong Correlation with Real-World Performance
The central question for any evaluation proxy is: does it rank policies the same way the real world does? Across three evaluation domains (LIBERO with multi-view observations, RoboTwin with heterogeneous policies, and real-world manipulation tasks), dWorldEval achieves Pearson correlations of 0.910, 0.927, and 0.918 respectively between predicted and actual success rates. The mean max rank violation (MMRV) stays below 0.015, meaning that, averaged over policies, the worst ranking error involving each policy corresponds to only a small gap in real success rate.
| Domain | Pearson r | MMRV ↓ |
|---|---|---|
| LIBERO (Multi-view) | 0.910 | 0.013 |
| RoboTwin (Heterogeneous) | 0.927 | 0.011 |
| Real-World Tasks | 0.918 | 0.014 |
These results suggest that dWorldEval can serve as a reliable proxy for real-world policy evaluation — enabling researchers to screen thousands of policy variants without deploying a single physical robot.
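For readers who want to compute the two metrics on their own policy sets, here is a small sketch. The MMRV function follows the definition commonly used in the sim-to-real evaluation literature (for each policy, take the largest real-performance gap among pairs ranked in the wrong order, then average over policies); we assume that matches the metric reported above. The success rates in the example are made up.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mmrv(predicted, real):
    """Mean max rank violation (assumed definition, see lead-in).

    A pair (i, j) is a violation when the predicted ordering disagrees
    with the real ordering; its size is the real performance gap.
    """
    n = len(real)
    total = 0.0
    for i in range(n):
        worst = 0.0
        for j in range(n):
            misranked = (predicted[i] < predicted[j]) != (real[i] < real[j])
            if misranked:
                worst = max(worst, abs(real[i] - real[j]))
        total += worst
    return total / n


# Illustrative (made-up) success rates for four policies.
real_rates = [0.2, 0.5, 0.6, 0.9]
predicted_rates = [0.25, 0.55, 0.5, 0.85]

r = pearson_r(predicted_rates, real_rates)
violation = mmrv(predicted_rates, real_rates)
```

In this example the second and third policies are swapped by the proxy, but their real success rates differ by only 0.1, so the violation stays small even though the ranking is not perfect.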
Why It Matters
The practical implication is a shift in how robotic policy development can work. Instead of the current paradigm — train a policy, deploy it on hardware, observe failures, iterate — researchers can now evaluate candidate policies in a learned world model with high fidelity. This doesn't replace real-world testing entirely, but it can dramatically reduce the number of physical deployments needed. Checkpoint selection, hyperparameter sweeps, and architecture comparisons can all happen in imagination, with real-world validation reserved for the most promising candidates.
dWorldEval also serves as the foundation for our companion work, Hi-WM, which extends the world model from a passive evaluator to an interactive corrective workspace for human-in-the-loop post-training.
For more details, see the full paper at dworldeval.github.io. If you're interested in working on world models for robotics, we're hiring — get in touch.