
> why byte pair encoding is so much more successful than other tokenization schemes.

What's the evidence for that, please? Just asking because I don't know, not because I disagree. I've read a bunch of BPE explainers, but nobody has bothered to explain why or how we landed on BPE.



I'm not an AI expert, so I don't know what research has been done to verify it, but this comment, https://news.ycombinator.com/item?id=35454839, helped me understand it, and intuitively I think it makes sense.

That is, byte pair encoding builds its vocabulary from how often particular characters appear next to each other in the training data. If a sequence of characters shows up together very frequently (as the characters of common words obviously do), that sequence gets merged into a single token. Given how an LLM works, this makes sense: the model looks for statistical relationships among sequences of tokens, so BPE is essentially a pre-processing step that has already optimized for statistical relationships among individual characters.
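For the curious, here's a minimal sketch of that merge loop as I understand it. The toy corpus and the num_merges count are made up for illustration; real tokenizers run tens of thousands of merges over an actual training corpus:

    from collections import Counter

    def get_pair_counts(words):
        """Count adjacent symbol pairs, weighted by word frequency."""
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        return pairs

    def merge_pair(words, pair):
        """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
        merged = {}
        for symbols, freq in words.items():
            out = []
            i = 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            merged[key] = merged.get(key, 0) + freq
        return merged

    # Toy corpus: word -> frequency; each word starts as a tuple of characters.
    corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    words = {tuple(w): f for w, f in corpus.items()}

    num_merges = 10  # vocabulary-size knob; real tokenizers use tens of thousands
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair wins
        words = merge_pair(words, best)
        print("merged", best)

Each iteration counts adjacent symbol pairs weighted by word frequency and merges the most common pair into a single symbol, which is exactly why frequent words end up as single tokens.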



