MolmoAct2: Action Reasoning Models for Real-world Deployment

By Haoquan Fang, Jiafei Duan, Donovan Clay, Sam Wang, Shuo Liu

HF Daily Papers · May 4, 2026

MolmoAct2 is a fully open vision-language-action (VLA) model for robotics. It is built on MolmoER, a VLM backbone specialized for spatial and embodied reasoning and trained on a 3.3M-sample corpus, and it is released alongside three new robot datasets.

Categories: Model Releases, OSS & Tools, Research

Excerpt

Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today's systems fall short on the criteria that matter for real-world deployment. Frontier models are closed, open-weight alternatives are tied to expensive hardware, reasoning-augmented policies pay prohibitive latency for their grounding, and fine-tuned success rates remain below the threshold for dependable use. We present MolmoAct2, a fully open action reasoning model built for practical deployment, advancing its predecessor along five axes. We introduce MolmoER, a VLM backbone specialized for spatial and embodied reasoning, trained on a 3.3M-sample corpus with a specialize-then-rehearse recipe. We release three new datasets spanning low-to-medium cost platforms, including MolmoAct2-BimanualYAM, 720 hours of teleoperated bimanual trajectories that constitute the largest open bimanual dataset to date, together with quality-filtered Franka (DROID) and SO100/101 subsets. We provide OpenFAST, an open-weight, open-data action tokenizer trained on millions of trajectories across five embodiments. We redesign the architecture to graft a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning. Finally, we propose MolmoThink, an adaptive-depth reasoning variant that re-predicts depth tokens only for scene regions that change between timesteps, retaining geometric grounding at a f
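The adaptive-depth idea attributed to MolmoThink above (re-predicting depth tokens only where the scene changes) can be illustrated with a toy sketch. This is not the authors' method: the excerpt does not say how changed regions are detected, so the per-patch mean-absolute-difference test, the patch size, the threshold, and the names `changed_patches`, `update_depth_tokens`, and `predict` are all hypothetical choices made for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Frame = Sequence[Sequence[float]]   # grayscale image as rows of pixel values
PatchIndex = Tuple[int, int]        # (patch_row, patch_col)


def changed_patches(prev: Frame, curr: Frame, patch: int,
                    threshold: float) -> List[PatchIndex]:
    """Return indices of patches whose mean absolute pixel change exceeds threshold.

    ASSUMPTION: pixel differencing is a stand-in; the source does not state
    how changed regions are identified.
    """
    height, width = len(curr), len(curr[0])
    changed: List[PatchIndex] = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            total, count = 0.0, 0
            for r in range(top, min(top + patch, height)):
                for c in range(left, min(left + patch, width)):
                    total += abs(curr[r][c] - prev[r][c])
                    count += 1
            if total / count > threshold:
                changed.append((top // patch, left // patch))
    return changed


def update_depth_tokens(cache: Dict[PatchIndex, int], prev: Frame, curr: Frame,
                        predict: Callable[[Frame, PatchIndex], int],
                        patch: int = 2, threshold: float = 0.1
                        ) -> Dict[PatchIndex, int]:
    """Reuse cached depth tokens and call `predict` only for changed patches.

    `predict` is a hypothetical placeholder for whatever model produces a
    depth token for one patch.
    """
    updated = dict(cache)  # copy so the previous timestep's cache is untouched
    for idx in changed_patches(prev, curr, patch, threshold):
        updated[idx] = predict(curr, idx)
    return updated


if __name__ == "__main__":
    prev = [[0.0] * 4 for _ in range(4)]
    curr = [row[:] for row in prev]
    curr[0][3] = 1.0  # a single pixel changes, inside the top-right patch
    cache = {(r, c): 0 for r in range(2) for c in range(2)}
    print(update_depth_tokens(cache, prev, curr, lambda frame, idx: 7))
```

In this toy setup the expensive `predict` call runs once per changed patch rather than once per patch, which is the kind of saving the excerpt describes; the real model's token layout and change criterion may differ substantially.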

Read at source: https://arxiv.org/abs/2605.02881