Start with a VLM pretrained on an internet-scale dataset
Continue training on cross-embodiment robot data
Uses both Open X-Embodiment data and proprietary data
Train action head with continuous action outputs using flow matching (a variant of diffusion)
Outputs actions at 50 Hz
They also trained a smaller version of the model from scratch; it performs much better than the baselines but only about half as well as the larger model with internet-scale pretraining
Seems decently fast and dexterous; demos are shown at 1x speed
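
Rough sketch of how a flow matching action head could work, to make the note on continuous action outputs concrete. This is not the authors' code. Assumptions: a straight-line path between Gaussian noise and the target actions, Euler integration at sampling time, and a hypothetical action chunk of 50 steps x 7 dims (1 second of control at 50 Hz). The names flow_matching_pair, sample_actions and make_oracle_velocity are made up for illustration, and the "oracle" velocity stands in for a trained network.

```python
import numpy as np


def flow_matching_pair(actions, rng):
    """Build one training example for a velocity network (assumed linear path)."""
    noise = rng.standard_normal(actions.shape)
    t = rng.uniform()  # t=0 is pure noise, t=1 is the clean action chunk
    x_t = (1.0 - t) * noise + t * actions
    target_velocity = actions - noise  # what the network would regress onto
    return x_t, t, target_velocity


def sample_actions(velocity_fn, shape, num_steps, rng):
    """Integrate dx/dt = velocity_fn(x, t) from noise (t=0) to actions (t=1)."""
    x = rng.standard_normal(shape)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)  # forward Euler step
    return x


def make_oracle_velocity(target):
    """Stand-in for a trained network: exact velocity toward a single target."""
    def velocity_fn(x, t):
        # On the straight path, target - x_t = (1 - t) * (target - noise).
        return (target - x) / (1.0 - t)
    return velocity_fn


rng = np.random.default_rng(0)
target_chunk = np.ones((50, 7))  # hypothetical: 50 steps x 7 action dims
x_t, t, u = flow_matching_pair(target_chunk, rng)
sampled = sample_actions(make_oracle_velocity(target_chunk), target_chunk.shape, 10, rng)
```

In a real model the velocity function would be a network conditioned on the VLM's features and trained with a squared error against target_velocity; a handful of Euler steps is what keeps sampling cheap enough for high-rate control.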