Here's a pair of quick sanity-check questions I've been asking LLMs: "家系ラーメンについて教えて" ("tell me about iekei ramen") and "カレーの作り方教えて" ("tell me how to make curry"). It's a silly test, but surprisingly many models fail at it, and Chinese models are especially bad at it. The models that do okay-ish on these questions tend to have at least one thing in common: Google-made, OR >70B, OR straight-up commercial (so >200B or whatever).
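For what it's worth, this kind of spot check is easy to automate against any OpenAI-compatible server (llama.cpp, Ollama, vLLM, etc.). A minimal sketch, with the endpoint URL and model name as placeholders you'd swap for your own setup:

```python
import json
import urllib.request

# Hypothetical endpoint: any OpenAI-compatible local server would work.
ENDPOINT = "http://localhost:8080/v1/chat/completions"

PROMPTS = [
    "家系ラーメンについて教えて",  # "tell me about iekei ramen"
    "カレーの作り方教えて",        # "tell me how to make curry"
]

def build_request(prompt, model="local-model"):
    """Build the JSON body for one sanity-check question."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,  # keep it deterministic-ish for comparing models
    }

def ask(prompt):
    """Send one prompt and return the model's reply text."""
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(build_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

You'd then call `ask(p)` for each prompt and eyeball the answers for the usual tells (made-up shops, wrong broth, nonsense steps).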
I'd say gpt-oss-20b sits between Qwen3 30B-A3B-2507 and Gemma 3n E4B (with 30B-A3B at the lower end). This means it's not obsoleting GPT-4o-mini for all purposes.
The free-beer commercial ChatGPT or Gemini can read them and point out major errors. Larger Gemma models and huge Chinese models like full DeepSeek or Kimi K2 may work too. Sometimes the answer is odd enough that even some 7B models can notice it. Technically there's no guarantee that models sharing a name across sizes, like Qwen 3 0.6B and 32B, use the same dataset, but it still tells you a bit about the quality and composition of the dataset their creator owns.
I don't actually need accurate answers to those questions; they're just an expectation adjuster for me, so to speak. There are probably better questions for other languages/use cases, but these seem to correlate better with model size and company scale than Flappy Bird tests do.
I'm guessing the issue is just model size. If you're testing sub-30B models and finding errors, well, they're probably not large enough to memorize everything in the training set, so there are inaccuracies, and they might hallucinate a bit on factoids that don't appear very often in the training data.
Commercial models are presumably much larger than the smaller open models, so it sounds like the issue is mainly just model size...
What those prompts mean isn't too important; they could just as well be "how to make flatbread" in Amharic or "what counts as drifting" in Finnish or something like that.
What's interesting is that these questions are simultaneously well understood by most closed models and not so well understood by most open models for some reason, including this one. Even GLM-4.5 full and Air on chat.z.ai (355B-A32B and 106B-A12B respectively) aren't very accurate on the first one.