
We present the Galaxea Open-World Dataset, a large-scale, diverse collection of robot behaviors recorded in authentic human living and working environments. All demonstrations are gathered with a single, consistent robot embodiment and paired with precise subtask-level language annotations to facilitate both training and evaluation.
Building on this dataset, we introduce G0, a dual-system framework that couples a Vision-Language Model (VLM) for multimodal planning with a Vision-Language-Action (VLA) model for fine-grained execution. G0 is trained using a three-stage curriculum: cross-embodiment pre-training, single-embodiment pre-training, and task-specific post-training.
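To make the division of labor between the two systems concrete, below is a minimal control-loop sketch of how a dual-system agent of this kind could be wired together. Every name in it (DualSystemAgent, plan, act, get_observation, execute) is a hypothetical placeholder for illustration, not the released Galaxea interface:

class DualSystemAgent:
    # A sketch of a VLM-planner / VLA-executor loop; interfaces are assumed.
    def __init__(self, vlm_planner, vla_policy):
        self.vlm = vlm_planner   # System 2: multimodal planning (G0-VLM)
        self.vla = vla_policy    # System 1: fine-grained execution (G0-VLA)

    def run(self, robot, instruction, max_steps=500):
        for _ in range(max_steps):
            obs = robot.get_observation()       # camera images + proprioception
            # The planner decomposes the high-level instruction into the
            # next subtask-level language command.
            subtask = self.vlm.plan(obs, instruction)
            if subtask is None:                 # planner signals completion
                break
            # The policy maps the observation and subtask text to an action chunk.
            actions = self.vla.act(obs, subtask)
            robot.execute(actions)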
A comprehensive benchmark—spanning tabletop manipulation, few-shot learning, and long-horizon mobile manipulation—demonstrates the effectiveness of our approach. In particular, we find that the single-embodiment pre-training stage, together with the Galaxea Open-World Dataset, plays a critical role in achieving strong performance.
Hardware Platform
Our robot platform Galaxea R1 Lite is a mobile dual-arm manipulator designed for daily tasks that demand high mobility, extended end-effector reachability, and efficient whole-body teleoperation.
The dataset is collected at 11 physical sites spanning residential, catering, retail, and office spaces; objects and tasks are defined by the scenes themselves.
Dataset statistics of the Galaxea Open-World Dataset, including the distributions of skills, locations, trajectory lengths, and manipulated objects.
G0-VLA architecture and training pipeline. Stage 1 pre-trains a vision-language model on cross-embodiment data in an autoregressive manner. Stage 2 and the post-training stage share the same model structure and are trained on the Galaxea Open-World Dataset with embodiment-specific views and both high-level and subtask instructions, supervising the Action Transformer's action reconstruction with a flow-matching loss.
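As a concrete reference for the flow-matching supervision mentioned above, the sketch below shows one common rectified-flow formulation of an action-reconstruction loss. The action_transformer call signature, the tensor shapes, and the linear noise-to-action interpolant are assumptions for illustration; the exact formulation in G0 may differ:

import torch
import torch.nn.functional as F

def flow_matching_loss(action_transformer, obs_tokens, actions):
    # Rectified-flow style action-reconstruction loss (a sketch, assuming a
    # linear interpolation schedule and a hypothetical transformer interface).
    # obs_tokens: (B, T, D) fused vision-language features
    # actions:    (B, H, A) ground-truth action chunk of horizon H
    noise = torch.randn_like(actions)            # x_0 ~ N(0, I)
    t = torch.rand(actions.shape[0], 1, 1)       # per-sample flow time in [0, 1)
    x_t = (1.0 - t) * noise + t * actions        # point on the straight path
    target_v = actions - noise                   # constant target velocity dx_t/dt
    pred_v = action_transformer(obs_tokens, x_t, t)  # predicted velocity field
    return F.mse_loss(pred_v, target_v)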
Fine-tuning benchmark results of different pre-trained VLAs. G0 (Full) achieves the highest average progress score, excelling in object-picking tasks such as Table Bussing, Microwave Operation, and Bed Making. G0 (Stage-2) leads in language following, action consistency, and whole-body control. G0 (Stage-1) performs the worst among pre-trained models, highlighting the necessity of single-embodiment pre-training.
Few-shot transfer performance of VLAs on Table Bussing and Microwave Operation. Stage-2 pre-training markedly improves success rates and execution smoothness, while Stage-1 pre-training alone offers no clear advantage over training from scratch.
Per-skill progress scores on the Bed Making task. Stage-2 single-embodiment pre-training substantially improves chassis and torso control, while cross-embodiment pre-training (Stage-1, π0) yields weaker performance, in some cases worse than training from scratch.
Instruction accuracy in benchmark tasks (%). G0-VLM serves as the task planner, processing human instructions and environmental observations to generate executable commands for the downstream VLA module.
@article{galaxea2025,
  title   = {Galaxea G0: Open-World Dataset and Dual-System VLA Model},
  author  = {Galaxea Team},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2025}
}