BPE tokenizer
I completed the BPE tokenizer part of assignment 1. It was good to get the Python flowing again. The trickiest part was pinning down the precise definition of the tokenization algorithm. One thing I got a little confused about was whether merges are applied sequentially within a pretoken. Another tricky part was that chr doesn’t give you the byte for a number: it returns a Unicode character (a str), and for values of 128 and above that character encodes to multiple bytes in UTF-8. And then bytes(number) yielded a byte string of that many zero bytes. Took a minute to remember to do bytes([number]) to get the single byte.
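To make both of those concrete, here is a minimal sketch, not my actual assignment code. The function name encode_pretoken and the toy merge list are made up for illustration, and it assumes merges are applied in the order they were learned, each one sweeping left to right across a single pretoken.

```python
def encode_pretoken(pretoken: str, merges: list[tuple[bytes, bytes]]) -> list[bytes]:
    """Apply merges to one pretoken, in the order the merges were learned."""
    # Start from single bytes. bytes([b]) is a one-byte string;
    # bytes(b) would instead be b zero bytes.
    parts = [bytes([b]) for b in pretoken.encode("utf-8")]
    for left, right in merges:
        merged = []
        i = 0
        while i < len(parts):
            # Merge every adjacent (left, right) pair found in this sweep.
            if i + 1 < len(parts) and parts[i] == left and parts[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(parts[i])
                i += 1
        parts = merged
    return parts


# The byte pitfalls, side by side:
print(chr(200).encode("utf-8"))  # b'\xc3\x88' (two bytes, not the byte 200)
print(bytes(3))                  # b'\x00\x00\x00' (three zero bytes)
print(bytes([200]))              # b'\xc8' (the single byte I wanted)

# Hypothetical toy merges: the second merge builds on the output of the first.
toy_merges = [(b"t", b"h"), (b"th", b"e")]
print(encode_pretoken(" the", toy_merges))  # [b' ', b'the']
```

The second merge only fires because the first has already produced b"th", which is why the order matters within a pretoken.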
It was useful to implement slow versions of training and encoding first, then identify the bottlenecks and optimize for roughly a 100x speedup. The final implementations are far faster than the problem requires, but they still have plenty of room for improvement, for example by cutting unnecessary memory allocations or moving off Python. But I feel it wouldn’t be the best use of time to chase the most optimal versions.
I was surprised that the pretokenization regex bakes in some big priors about the dataset: splitting on punctuation, whitespace, etc. I kind of expected modern BPE to be more general and less specific to text, but it turns out it isn’t.
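To show what those priors look like, here is a rough sketch of a GPT-2-style pretokenization pattern. It is an approximation written for the standard library re module: the GPT-2 pattern uses Unicode classes like \p{L} and \p{N}, which need the third-party regex package, so the letter and digit classes below are stand-ins. The names PRETOKEN_PATTERN and pretokenize are made up for illustration.

```python
import re

# Approximation of a GPT-2-style pattern using only stdlib `re`.
# [^\W\d_] stands in for \p{L} (letters) and \d stands in for \p{N} (numbers).
PRETOKEN_PATTERN = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"  # English contractions get their own pretoken
    r"| ?[^\W\d_]+"             # optional leading space, then a run of letters
    r"| ?\d+"                   # optional leading space, then a run of digits
    r"| ?(?:[^\s\w]|_)+"        # optional leading space, then punctuation/symbols
    r"|\s+(?!\S)"               # whitespace, leaving one space for the next word
    r"|\s+"                     # any remaining whitespace
)


def pretokenize(text: str) -> list[str]:
    """Split text into pretokens; BPE merges never cross these boundaries."""
    return PRETOKEN_PATTERN.findall(text)


print(pretokenize("Hello, world! It's 2024."))
# ['Hello', ',', ' world', '!', ' It', "'s", ' 2024', '.']
```

Every alternative encodes an assumption about written English: contractions, words preceded by a single space, and digits kept separate from letters.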
Next up is implementing the transformer language model components and the surrounding training pieces. I’m excited to build these “from scratch” in PyTorch, using only parameters and modules. I imagine it’ll be a little less tedious than the BPE training, but we’ll see.