Training the model with an action head that produces continuous-valued actions is faster than the autoregressive approach, but because of the noise, the head's gradient updates end up destroying the knowledge in the vision-language model (VLM)
Knowledge insulation works by cutting off gradient propagation from the continuous action head to the LM, then training on both discrete (autoregressive) actions, whose gradients do propagate into the LM, and the higher-fidelity continuous actions
With the coarser discrete actions, the VLM can still be tuned to better capture the information that matters for motor control
Without gradients from the continuous action head, the LM retains its knowledge of the world, because the cross-entropy loss on the discrete actions doesn’t cause the same forgetting
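
The gradient cut described above can be sketched with a toy model and hand-written gradients. This is only an illustration under loud assumptions: the "LM" is a single linear layer `W`, the discrete head `U` is trained with cross-entropy on one action token, the continuous head `V` is trained with a squared error (a stand-in for whatever continuous-action loss is actually used), and every name here (`insulated_grads`, `W`, `U`, `V`, `insulate`) is hypothetical rather than taken from any real implementation.

```python
import numpy as np


def softmax(z):
    # Numerically stable softmax over a 1-D vector of logits.
    e = np.exp(z - z.max())
    return e / e.sum()


def insulated_grads(W, U, V, x, token, action, insulate=True):
    """Gradients for one example of a toy two-head model (all names hypothetical).

    W : backbone weights standing in for the LM, features h = W @ x
    U : discrete-action head, logits = U @ h, cross-entropy against `token`
    V : continuous-action head, a_hat = V @ h, 0.5 * squared error against `action`
    """
    h = W @ x

    # Discrete head: gradient of cross-entropy w.r.t. logits is (p - one_hot).
    d_logits = softmax(U @ h)
    d_logits[token] -= 1.0
    dU = np.outer(d_logits, h)
    dh_discrete = U.T @ d_logits

    # Continuous head: gradient of 0.5 * ||a_hat - action||^2 w.r.t. a_hat.
    err = V @ h - action
    dV = np.outer(err, h)          # the head itself is always trained
    dh_continuous = V.T @ err

    # Insulation: the continuous head's gradient is not passed back into h,
    # so the backbone is updated by the discrete loss alone.
    dh = dh_discrete if insulate else dh_discrete + dh_continuous
    dW = np.outer(dh, x)
    return dW, dU, dV


# Toy shapes: 3 input features, 4 hidden features, 5 action tokens, 2-D action.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
U = rng.normal(size=(5, 4))
V = rng.normal(size=(2, 4))
x = rng.normal(size=3)

# Same input and token, two very different continuous targets.
dW_a, _, dV_a = insulated_grads(W, U, V, x, token=2, action=np.array([0.1, -0.3]))
dW_b, _, dV_b = insulated_grads(W, U, V, x, token=2, action=np.array([5.0, -7.0]))
print("backbone gradient unchanged:", np.allclose(dW_a, dW_b))
print("continuous head gradient unchanged:", np.allclose(dV_a, dV_b))
```

With `insulate=True`, changing the continuous target changes the update to `V` but leaves the update to `W` untouched, which is the property the notes rely on. In an autodiff framework the same effect would normally be obtained with a stop-gradient or detach operation on the features fed to the continuous head, rather than by hand-written gradients.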