Interesting that the benchmarks they show have it outperforming Gemma 2 9B and Llama 3 8B, but it does a lot worse on my NYT Connections benchmark: 5.1, vs 16.3 for Gemma 2 9B and 12.3 for Llama 3 8B. The new GPT-4o mini also does better, at 14.3. It's just one benchmark though, so I'm looking forward to additional scores.
Can you help me understand why people seem to think of Connections as a more robust indicator of (general) performance than the benchmarks typically used for evals?
It seems to me that while the game is very challenging for people, it's not necessarily an indicator of generalization. I can see how it's useful, but I have trouble seeing how a low score on it would indicate low performance on most tasks.
Thanks, and hopefully this isn't perceived as offensive. Just trying to learn more about it.
edit: I realize you yourself indicate that it's "just one benchmark" - I'm asking more about the broader usage I've seen from several people in comments here on HN.
The most interesting thing about it is that it's the type of task you'd expect LLMs to do well at, yet the best models only score around 30%, while top humans get 100%. Meanwhile, many other benchmarks are getting close to saturation.
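For anyone who hasn't played: each puzzle is 16 words hiding four groups of four, and a natural way to grade a model is the fraction of groups it recovers exactly. Here's a minimal sketch of that idea in Python - the example words and the scoring rule are my own illustration, not necessarily how the benchmark above is actually scored:

```python
# Hypothetical sketch of scoring one Connections-style puzzle.
# Assumption: the model outputs four groups of four words, and we count
# how many of the four hidden groups it matched exactly.

def score_connections(predicted: list[set[str]], solution: list[set[str]]) -> float:
    """Return the fraction of hidden groups the model got exactly right."""
    correct = sum(1 for group in predicted if group in solution)
    return correct / len(solution)

# Invented puzzle for illustration
solution = [
    {"bass", "flounder", "salmon", "trout"},   # fish
    {"ant", "drill", "island", "opal"},        # fire ___
    {"strike", "out", "ball", "walk"},         # baseball calls
    {"sharp", "flat", "natural", "rest"},      # music notation
]
predicted = [
    {"bass", "flounder", "salmon", "trout"},
    {"ant", "drill", "island", "opal"},
    {"strike", "out", "ball", "sharp"},        # two groups confused
    {"walk", "flat", "natural", "rest"},
]
print(score_connections(predicted, solution))  # 0.5
```

The last two predicted groups show why the game is hard for models: one misplaced word ruins two groups at once, since the categories often share plausible members.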