RoPE, softmax, and attention

May 28, 2026

Implemented RoPE. This was pretty tricky. My first attempt leaned heavily on einsum, and it got confusing trying to get the rotation to apply correctly. I eventually got it working, but that version needed batched 2x2 matmuls to rotate each pair of components. Once it worked, I rewrote it to apply the rotation formula directly to each component of the (x, y) subvector, which was simpler and more efficient. I also stopped storing every 2x2 matrix and kept only the sin and cos values, since the rotation needs nothing beyond cos, sin, and the negative of sin.
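For reference, here is a minimal sketch of the component-wise approach, not necessarily what rope.py looks like. It assumes NumPy, a base of 10000, and that adjacent dimensions (0, 1), (2, 3), ... form the rotated pairs. The names rope_tables and apply_rope are made up for illustration. Each pair (x, y) at angle a becomes (x cos a - y sin a, x sin a + y cos a), so only the cos and sin tables need to be stored.

```python
import numpy as np

def rope_tables(max_len, d_k, theta=10000.0):
    # One frequency per pair of dimensions; the base theta=10000 is an assumption.
    inv_freq = theta ** (-np.arange(0, d_k, 2) / d_k)          # (d_k/2,)
    angles = np.arange(max_len)[:, None] * inv_freq[None, :]   # (max_len, d_k/2)
    # Store only cos and sin instead of a full 2x2 matrix per position and pair.
    return np.cos(angles), np.sin(angles)

def apply_rope(x, cos, sin, positions):
    # x: (..., seq, d_k) with even d_k; positions: (seq,) integer indices.
    c = cos[positions]   # (seq, d_k/2)
    s = sin[positions]   # (seq, d_k/2)
    # Assumed pairing: adjacent dimensions (0,1), (2,3), ... form each (x, y).
    x_even = x[..., 0::2]
    x_odd = x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * c - x_odd * s   # x' = x cos(a) - y sin(a)
    out[..., 1::2] = x_even * s + x_odd * c   # y' = x sin(a) + y cos(a)
    return out
```

A quick sanity check is that position 0 leaves the vector unchanged and that the rotation preserves the norm of every vector.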

I also implemented softmax, which was pretty straightforward. Scaled dot-product attention was also easy, though I found it easier to write with matmul and transpose than with einsum. I suspect the RoPE shenanigans threw off my usual einsum fluency, and I kept tripping up.
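A rough sketch of both pieces in the matmul-and-transpose style, again assuming NumPy and not necessarily matching softmax.py or attention.py. The max subtraction in softmax is a standard numerical-stability step, and masking is left out to keep the example short.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis so exp() cannot overflow.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    # q: (..., n_q, d_k), k: (..., n_k, d_k), v: (..., n_k, d_v)
    d_k = q.shape[-1]
    # Transpose the last two axes of k so matmul yields (..., n_q, n_k) scores.
    scores = np.matmul(q, np.swapaxes(k, -1, -2)) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)   # each query row sums to 1
    return np.matmul(weights, v)         # (..., n_q, d_v)
```

The einsum equivalent of the score computation would be np.einsum("...qd,...kd->...qk", q, k), which avoids the explicit transpose but is easier to get wrong when the subscripts are not fresh in mind.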

rope.py

softmax.py

attention.py