Knowledge Insulation

May 26, 2026

  • Training a model whose action head outputs continuous-valued actions is faster than the autoregressive alternative, but the gradients from that head are noisy, and the resulting updates end up destroying the knowledge in the vision-language model (VLM)
  • Knowledge insulation works by cutting off gradient propagation from the action head to the LM, then training on two targets at once: discrete actions (predicted autoregressively), whose gradients do propagate into the LM, and the higher-fidelity continuous actions, whose gradients do not
  • With the coarser discrete actions, the VLM can still be tuned to better capture the information that matters for motor control
  • Because no gradients arrive from the continuous action head, the LM’s knowledge of the world is retained: the cross-entropy loss on the discrete actions doesn’t cause the same forgetting
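
The gradient cut described above can be sketched in a few lines of JAX. This is only a toy illustration, not the actual architecture: a single linear layer with a tanh stands in for the VLM backbone, a plain mean-squared-error regression stands in for whatever continuous objective the action head really uses, and every name and size here (`init_params`, `losses`, `token_head`, `action_head`, the dimensions) is hypothetical. The one real mechanism it relies on is `jax.lax.stop_gradient`, which blocks gradients from flowing back through the features fed to the continuous head.

```python
import jax
import jax.numpy as jnp


def init_params(key, d_in=8, d_hidden=16, n_tokens=32, d_action=4):
    """Random weights for a toy backbone and two heads (sizes are illustrative)."""
    k1, k2, k3 = jax.random.split(key, 3)
    return {
        # Stand-in for the VLM backbone (assumption: one linear layer).
        "backbone": 0.1 * jax.random.normal(k1, (d_in, d_hidden)),
        # Predicts discrete action tokens.
        "token_head": 0.1 * jax.random.normal(k2, (d_hidden, n_tokens)),
        # Predicts continuous actions.
        "action_head": 0.1 * jax.random.normal(k3, (d_hidden, d_action)),
    }


def losses(params, obs, action_tokens, actions, insulate=True):
    """Return (cross-entropy on discrete tokens, MSE on continuous actions)."""
    feats = jnp.tanh(obs @ params["backbone"])

    # Discrete path: gradients from this loss reach the backbone.
    logits = feats @ params["token_head"]
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    picked = jnp.take_along_axis(log_probs, action_tokens[:, None], axis=-1)
    ce = -jnp.mean(picked)

    # Continuous path: with insulation, the head sees the features but
    # its gradients are stopped before they reach the backbone.
    head_in = jax.lax.stop_gradient(feats) if insulate else feats
    pred = head_in @ params["action_head"]
    mse = jnp.mean((pred - actions) ** 2)
    return ce, mse


def total_loss(params, obs, action_tokens, actions, insulate=True):
    ce, mse = losses(params, obs, action_tokens, actions, insulate)
    return ce + mse


# Toy batch of 5 examples (random data, purely for illustration).
k_p, k_o, k_t, k_a = jax.random.split(jax.random.PRNGKey(0), 4)
params = init_params(k_p)
obs = jax.random.normal(k_o, (5, 8))
action_tokens = jax.random.randint(k_t, (5,), 0, 32)
actions = jax.random.normal(k_a, (5, 4))


def mse_only(p, insulate):
    return losses(p, obs, action_tokens, actions, insulate)[1]


# Gradient of the continuous loss alone, with and without insulation.
g_insulated = jax.grad(mse_only)(params, True)
g_naive = jax.grad(mse_only)(params, False)

# Gradient of the combined loss with insulation on.
g_total = jax.grad(total_loss)(params, obs, action_tokens, actions)
```

With insulation on, `g_insulated["backbone"]` is all zeros while `g_insulated["action_head"]` is not, so the continuous head still learns but cannot disturb the backbone. The backbone entry of `g_total` is nonzero and comes entirely from the cross-entropy term, which matches the bullets above: only the discrete-action loss shapes the LM.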