Interesting that the benchmarks they show have it outperforming Gemma 2 9B and Llama 3 8B, but it does a lot worse on my NYT Connections benchmark: 5.1, vs 16.3 for Gemma 2 9B and 12.3 for Llama 3 8B. The new GPT-4o mini also does better, at 14.3. It's just one benchmark though, so I'm looking forward to additional scores.
Can you help me understand why people seem to think of Connections as a more robust indicator of (general) performance than the benchmarks typically used for evals?
It seems to me that while the game is very challenging for people, it's not necessarily an indicator of generalization. I can see how it's useful, but I have trouble seeing how a low score on it would indicate low performance on most tasks.
Thanks, and hopefully this isn't perceived as offensive. Just trying to learn more about it.
edit: I realize you yourself indicate that it's "just one benchmark" - I'm asking more about the broader usage I've seen from several people in comments here on HN.
The most interesting thing about it is that it's the type of task you'd expect LLMs to do well at, yet the best models only score around 30%, while top humans get 100%. Meanwhile, many other benchmarks are getting close to saturation.
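For anyone who hasn't played: each puzzle is 16 words hiding four groups of four, and a natural way to grade a model is the fraction of groups it recovers exactly. Here's a minimal sketch of that idea in Python - the example words and the scoring rule are my own illustration, not necessarily how the benchmark above is actually scored:

```python
# Hypothetical sketch of scoring one Connections-style puzzle.
# Assumption: the model outputs four groups of four words, and we count
# how many of the four hidden groups it matched exactly.

def score_connections(predicted: list[set[str]], solution: list[set[str]]) -> float:
    """Return the fraction of hidden groups the model got exactly right."""
    correct = sum(1 for group in predicted if group in solution)
    return correct / len(solution)

# Invented puzzle for illustration
solution = [
    {"bass", "flounder", "salmon", "trout"},   # fish
    {"ant", "drill", "island", "opal"},        # fire ___
    {"strike", "out", "ball", "walk"},         # baseball calls
    {"sharp", "flat", "natural", "rest"},      # music notation
]
predicted = [
    {"bass", "flounder", "salmon", "trout"},
    {"ant", "drill", "island", "opal"},
    {"strike", "out", "ball", "sharp"},        # two groups confused
    {"walk", "flat", "natural", "rest"},
]
print(score_connections(predicted, solution))  # 0.5
```

The last two predicted groups show why the game is hard for models: one misplaced word ruins two groups at once, since the categories often share plausible members.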