Robots must produce actions in real time, but ideally they also need to think about future actions before performing them.
So inference and execution need to run asynchronously.
The model may predict actions that are discontinuous with the robot's current state, so you need to interpolate smoothly to avoid dangerous accelerations or invalid states.
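One simple way to picture "smooth interpolation" is a per-step rate limit. This is only my own minimal sketch, not something the paper prescribes: `rate_limited_step` and `max_delta` are hypothetical names, and the numbers are made up.

```python
def rate_limited_step(current, target, max_delta):
    """Move each joint toward its target by at most max_delta per control step.

    Hypothetical helper for illustration; real controllers usually also
    limit velocity and acceleration, not just position change.
    """
    out = []
    for c, t in zip(current, target):
        # Clamp the requested change to [-max_delta, +max_delta].
        delta = max(-max_delta, min(max_delta, t - c))
        out.append(c + delta)
    return out


state = [0.0, 0.0]
target = [1.0, -0.05]  # first joint is asked to jump far from the current state
state = rate_limited_step(state, target, max_delta=0.1)
print(state)  # first joint moves only 0.1 toward 1.0, second reaches -0.05
```

The far-away joint creeps toward the prediction instead of jumping, at the cost of lagging behind what the model asked for.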
If you execute actions synchronously with inference (take action, observe, predict, take action), the robot pauses while it waits for each prediction. That makes execution slow, could be dangerous, and deviates from the training data, which doesn’t have these pauses.
Pauses also discourage scaling up the model: a bigger model has higher latency, so even if its actions would have been better, they arrive stale and execution is slower, which makes overall performance worse.
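To make the cost of pausing concrete, here is a back-of-the-envelope sketch of how much of the wall-clock time the robot is actually moving under synchronous execution. The function name and all the numbers (25 actions per chunk, 20 ms control period, 100 ms vs 300 ms inference) are my own assumptions, not figures from the paper.

```python
def sync_duty_cycle(chunk_steps, control_period_ms, inference_latency_ms):
    """Fraction of wall-clock time spent moving when the robot executes a
    whole chunk and then stops to wait for the next inference call."""
    moving_ms = chunk_steps * control_period_ms
    return moving_ms / (moving_ms + inference_latency_ms)


# Hypothetical setup: 25 actions per chunk at a 20 ms control period (50 Hz).
small_model = sync_duty_cycle(25, 20, inference_latency_ms=100)
large_model = sync_duty_cycle(25, 20, inference_latency_ms=300)
print(f"small model: moving {small_model:.0%} of the time")
print(f"large model: moving {large_model:.0%} of the time")
```

With these made-up numbers the slower model spends noticeably more time standing still, which is the scaling penalty described above.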
When you generate an action sequence, begin executing it, and start inference on the next action sequence, the new prediction is conditioned on the state at the moment inference started. But inference takes time to complete, and the robot keeps executing the old sequence in the meantime. So the new sequence can disagree with the actions the robot actually performed (and the observations that resulted) while inference was running. Naively blending between the old and new sequences could result in disaster.
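The staleness problem is easier to see with indices. This sketch only illustrates the timing issue, not RTC itself; `delay_in_steps`, `usable_actions`, and the 100 ms / 20 ms numbers are hypothetical.

```python
def delay_in_steps(inference_latency_ms, control_period_ms):
    """Number of control steps that elapse while inference runs (rounded up)."""
    return -(-inference_latency_ms // control_period_ms)  # integer ceiling


def usable_actions(new_chunk, delay_steps):
    """Drop the leading actions of a new chunk whose time slots have already
    passed by the time inference finished."""
    return new_chunk[delay_steps:]


# Hypothetical numbers: 100 ms inference at a 20 ms control period.
d = delay_in_steps(100, 20)

# Label each predicted action by its offset from the step where inference started.
new_chunk = list(range(10))
fresh = usable_actions(new_chunk, d)
print(d)      # steps executed from the OLD chunk while waiting
print(fresh)  # the only part of the new chunk that can still be executed
```

The first `d` actions of the new chunk describe a past that was actually filled in by the old chunk, so the two sequences have to be consistent at that seam.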
They propose an algorithm called real-time chunking (RTC).
It enables real-time action without discontinuities.
It works on diffusion / flow matching models without changes to training.
Seems like asynchronicity is absolutely essential to avoid jerkiness and is basically THE thing that leads to smooth-looking demos.
I don’t fully understand their algorithm, though.
Using real-time chunking, task performance remains high even with latencies up to 200 ms.