Galaxea Open-World Dataset and G0 Dual-System VLA Model

[Figure: teaser]

We introduce the Galaxea Open-World Dataset, a high-quality robot behavior dataset collected in the open world. G0 is our dual-system VLA model, pre-trained on this dataset.

Abstract

We present the Galaxea Open-World Dataset, a large-scale, diverse collection of robot behaviors recorded in authentic human living and working environments. All demonstrations are gathered using a consistent robotic embodiment, paired with precise subtask-level language annotations to facilitate both training and evaluation.

Building on this dataset, we introduce G0, a dual-system framework that couples a Vision-Language Model (VLM) for multimodal planning with a Vision-Language-Action (VLA) model for fine-grained execution. G0 is trained using a three-stage curriculum: cross-embodiment pre-training, single-embodiment pre-training, and task-specific post-training.
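To make the division of labor concrete, below is a minimal sketch of the dual-system control loop: a slow VLM planner proposes subtask instructions while a fast VLA policy executes action chunks. The class names, method signatures, chunk length, and replanning rate are illustrative assumptions, not the released G0 interface.

import numpy as np

class VLMPlanner:
    """System 2: maps a high-level instruction and the current observation
    to the next subtask instruction (stubbed here for illustration)."""
    def plan(self, instruction: str, obs: dict) -> str:
        return "pick up the cup"  # a real planner would query the VLM

class VLAPolicy:
    """System 1: maps a subtask instruction and observation to a short
    chunk of continuous robot actions (stubbed here for illustration)."""
    def act(self, subtask: str, obs: dict) -> np.ndarray:
        return np.zeros((16, 14))  # assumed: 16-step chunk of 14-DoF actions

def run_episode(planner, policy, get_obs, apply_action, instruction, n_chunks=10):
    """Replan once per chunk (slow loop); execute each chunk step by step (fast loop)."""
    for _ in range(n_chunks):
        obs = get_obs()
        subtask = planner.plan(instruction, obs)
        for action in policy.act(subtask, obs):
            apply_action(action)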

A comprehensive benchmark—spanning tabletop manipulation, few-shot learning, and long-horizon mobile manipulation—demonstrates the effectiveness of our approach. In particular, we find that the single-embodiment pre-training stage, together with the Galaxea Open-World Dataset, plays a critical role in achieving strong performance.

Dataset

Hardware Platform

Our robot platform, Galaxea R1 Lite, is a mobile dual-arm manipulator designed for daily tasks that demand high mobility, extended end-effector reachability, and efficient whole-body teleoperation.

[Figure: Galaxea R1 Lite]

Data Samples

The dataset was collected at 11 physical sites, covering residential, catering, retail, and office spaces. Objects and tasks are determined by each scene.

[Figure: dataset samples]
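Each trajectory is paired with the subtask-level language annotations noted in the abstract. A single annotated episode might be organized as follows; the field names and frame indices are our illustrative assumptions, not the released schema.

# Hypothetical layout of one annotated episode (all field names assumed).
episode = {
    "site": "residential_03",              # one of the 11 physical sites
    "task": "clear the dining table",      # high-level instruction
    "subtasks": [                          # subtask-level annotations
        {"instruction": "pick up the plate", "start_frame": 0, "end_frame": 180},
        {"instruction": "place the plate in the bin", "start_frame": 180, "end_frame": 330},
    ],
    "observations": "per-frame RGB views and proprioception",
    "actions": "per-frame dual-arm and whole-body commands",
}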

Statistics

Dataset statistics of the Galaxea Open-World Dataset, including the distributions of skills, locations, trajectory lengths, and manipulated objects.

[Figures: distributions of scenes, objects, task complexity, trajectory lengths, and body-part usage]

G0 Model

G0-VLA Pre-training

G0-VLA architecture and training pipeline. Stage 1 pre-trains a vision-language model on cross-embodiment data in an autoregressive manner. Stage 2 and post-training share the same model structure: both are trained on Galaxea open-world data with embodiment-specific views and both high-level and subtask instructions, supervising the Action Transformer's action reconstruction with a flow-matching loss.

[Figure: G0-VLA architecture and training pipeline]
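As a rough illustration of the flow-matching objective, the sketch below supervises an action transformer to predict the velocity of a linear noise-to-action path (the rectified-flow parameterization); this parameterization and all names are assumptions, not the exact G0 loss.

import torch

def flow_matching_loss(action_transformer, actions, cond):
    """actions: (B, H, D) ground-truth action chunk; cond: VLM features."""
    noise = torch.randn_like(actions)        # x_0 ~ N(0, I)
    t = torch.rand(actions.shape[0], 1, 1)   # per-sample time in [0, 1]
    x_t = (1.0 - t) * noise + t * actions    # point on the linear path
    v_target = actions - noise               # constant velocity of that path
    v_pred = action_transformer(x_t, t.flatten(), cond)
    return torch.mean((v_pred - v_target) ** 2)

At inference time, an action chunk is recovered by integrating the predicted velocity field from Gaussian noise over a few Euler steps.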

Benchmarking Pre-training

Fine-tuning benchmark results of different pre-trained VLAs. G0 (Full) achieves the highest average progress score, excelling in object-picking tasks such as Table Bussing, Microwave Operation, and Bed Making. G0 (Stage-2) leads in language following, action consistency, and whole-body control. G0 (Stage-1) performs the worst among pre-trained models, highlighting the necessity of single-embodiment pre-training.

[Figure: benchmark results of pre-trained VLAs]
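The average progress score above rewards partial completion rather than binary success. Assuming progress is the fraction of a task's subtasks completed (our assumption; the benchmark's exact definition may differ), it can be computed as:

def progress_score(completed: int, total: int) -> float:
    """Fraction of subtasks completed in one rollout (assumed definition)."""
    return completed / total

def average_progress(rollouts):
    """Mean progress over (completed, total) pairs from evaluation rollouts."""
    return sum(progress_score(c, t) for c, t in rollouts) / len(rollouts)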

Few-shot Transfer

Few-shot transfer performance of VLAs on Table Bussing and Microwave Operation. Stage-2 pre-training markedly improves success rates and execution smoothness, while Stage-1 pre-training alone offers no clear advantage over training from scratch.

[Figure: few-shot transfer results]

Embodiment-specific Actions

Per-skill progress scores on the Bed Making task. Stage-2 single-embodiment pre-training substantially improves chassis and torso control, while cross-embodiment pre-training (Stage-1, π0) yields weaker performance, in some cases worse than training from scratch.

[Figure: per-skill progress on Bed Making]

G0-VLM: Task Planner

Instruction accuracy in benchmark tasks (%). G0-VLM serves as the task planner, processing human instructions and environmental observations to generate executable commands for the downstream VLA module.

[Figure: G0-VLM instruction accuracy in benchmark tasks]
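Schematically, the planner consumes the human instruction, the current camera views, and the subtasks completed so far, and emits the next executable command for the VLA. The prompt wording and interface below are illustrative assumptions, not the actual G0-VLM API.

def plan_next_subtask(vlm, images, instruction: str, done: list) -> str:
    # Hypothetical planner query; `vlm.generate` stands in for any
    # multimodal generation API.
    prompt = (
        f"Task: {instruction}\n"
        f"Completed subtasks: {done}\n"
        "Given the current camera views, output the next subtask "
        "instruction for the manipulation policy."
    )
    return vlm.generate(images=images, text=prompt)  # e.g. "open the microwave door"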

BibTeX

@article{galaxea2025,
    title={Galaxea G0: Open-World Dataset and Dual-System VLA Model},
    author={Galaxea Team},
    journal={arXiv preprint arXiv:XXXX.XXXXX},
    year={2025}
}