Here's a pair of quick sanity-check questions I've been asking LLMs: "家系ラーメンについて教えて" ("tell me about iekei ramen") and "カレーの作り方教えて" ("tell me how to make curry"). It's a silly test, but surprisingly many models fail at it, and Chinese models are especially bad at it. The models that do okay-ish on these questions tend to have at least one thing in common: Google-made, OR >70B, OR straight-up commercial (so >200B or whatever).
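For what it's worth, this kind of spot check is easy to automate against any OpenAI-compatible server (llama.cpp, Ollama, vLLM, etc.). A minimal sketch, with the endpoint URL and model name as placeholders you'd swap for your own setup:

```python
import json
import urllib.request

# Hypothetical endpoint: any OpenAI-compatible local server would work.
ENDPOINT = "http://localhost:8080/v1/chat/completions"

PROMPTS = [
    "家系ラーメンについて教えて",  # "tell me about iekei ramen"
    "カレーの作り方教えて",        # "tell me how to make curry"
]

def build_request(prompt, model="local-model"):
    """Build the JSON body for one sanity-check question."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,  # keep it deterministic-ish for comparing models
    }

def ask(prompt):
    """Send one prompt and return the model's reply text."""
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(build_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

You'd then call `ask(p)` for each prompt and eyeball the answers for the usual tells (made-up shops, wrong broth, nonsense steps).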
I'd say gpt-oss-20b sits between Qwen3 30B-A3B-2507 and Gemma 3n E4B (with 30B-A3B at the lower end). This means it's not obsoleting GPT-4o-mini for all purposes.
The free-beer commercial ChatGPT or Gemini can read them and point out major errors. Larger Gemma models and huge Chinese models like full DeepSeek or Kimi K2 may work too. Sometimes the answer is odd enough that even some 7B models can notice it. Technically there's no guarantee that models sharing a name across sizes, like Qwen 3 0.6B and 32B, use the same dataset, but it still tells you a bit about the quality and composition of the dataset their creator owns.
I don't actually need accurate answers to those questions; they're just an expectation adjuster for me, so to speak. There are probably better questions for other languages/use cases, but these seem to correlate better with model size and company scale than Flappy Bird tests do.
I'm guessing the issue is just model size. If you're testing sub-30B models and finding errors, well, they're probably not large enough to memorize everything in the training set, so there are inaccuracies, and they might hallucinate a bit on factoids that don't appear very often in the training data.
Commercial models are presumably much larger than the smaller open models, so it sounds like the issue is mainly just model size...
What those prompts mean isn't too important; they could just as well be "how to make flatbread" in Amharic or "what counts as drifting" in Finnish or something like that.
What's interesting is that these questions are simultaneously well understood by most closed models and not so well understood by most open models for some reason, including this one. Even GLM-4.5 full and Air on chat.z.ai (355B-A32B and 106B-A12B respectively) aren't very accurate on the first one.