Do Open Large Language Models Know What, Where, and When? A Case Study with Quiz-Style Questions

Authors

A. V. Kuznetsova, V. A. Byzov, I. V. Aslanov, E. V. Kotelnikov

DOI:

https://doi.org/10.14529/jsfi250307

Keywords:

Large Language Models (LLMs), question answering, reasoning, evaluation metrics, quiz datasets, LLM-as-a-judge, human-AI comparison

Abstract

Large language models (LLMs) are increasingly tested on reasoning-intensive benchmarks, yet their performance on complex quiz-style tasks remains underexplored. In this paper, we evaluate modern open-source LLMs on the Russian intellectual game What? Where? When?, a challenging format that requires fact recall, associative reasoning, and the interpretation of hidden clues. We introduce a new dataset of 2600 questions (2018–2025), enriched with empirical human team success rates and annotated with structural and thematic clusters. We benchmark 14 recent open models accessible via API using both automatic metrics (Exact Match, BLEU, ROUGE) and an LLM-as-a-Judge framework. The best system, Qwen3-235B-A22B-Thinking, achieved 32.4% accuracy but still lagged behind the average human team success rate (45.8%). Large-scale reasoning-enabled models consistently outperformed non-reasoning or smaller counterparts, particularly in domains such as technology, the ancient world, psychology, and nature. However, omission, wordplay, and proper-name questions remained difficult for all systems. Comparison with CheGeKa (MERA leaderboard) shows that our dataset is substantially harder: while leading proprietary and open models reach EM scores of 0.534–0.645 and 0.442 on CheGeKa, respectively, the strongest model in our benchmark achieves only 0.255 EM. Correlation analysis indicates that human and model perceptions of difficulty align only weakly, suggesting different problem-solving strategies. Qualitative case studies further show that models are stronger at fact recall than at reconstructing hidden logic. Our findings highlight both the progress of open LLMs and their current limitations in quiz-style reasoning. The new dataset offers a complementary and more challenging benchmark for Russian-language evaluation.
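
For illustration, the sketch below (Python; not taken from the paper) shows how two of the quantitative steps named in the abstract could be implemented: a normalized Exact Match score over model answers, and a Spearman rank correlation between empirical human team success rates and per-question model correctness. The record fields, the example questions, the normalization choices, and the use of SciPy's spearmanr are assumptions made for the sake of the example; BLEU and ROUGE scores would be computed analogously with standard packages, and LLM-as-a-Judge verdicts would replace the exact-match column.

# A minimal sketch, not the authors' code: normalized Exact Match scoring and a rank
# correlation between human team success rates and per-question model correctness.
# The field names and the three example records are purely illustrative.
import re
import string

from scipy.stats import spearmanr  # assumed dependency; any rank-correlation routine works


def normalize(answer: str) -> str:
    """Lowercase, strip ASCII punctuation, collapse whitespace (one plausible normalization)."""
    answer = answer.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", answer).strip()


def exact_match(prediction: str, gold: str) -> int:
    """Return 1 if the normalized prediction coincides with the normalized gold answer."""
    return int(normalize(prediction) == normalize(gold))


# Hypothetical per-question records: gold answer, model prediction, and the empirical
# share of human teams that answered the question correctly (as in the dataset above).
records = [
    {"gold": "Шерлок Холмс",   "pred": "Шерлок Холмс",   "human_rate": 0.62},
    {"gold": "Эйфелева башня", "pred": "Статуя Свободы", "human_rate": 0.31},
    {"gold": "метроном",       "pred": "Метроном",       "human_rate": 0.48},
]

em_per_question = [exact_match(r["pred"], r["gold"]) for r in records]
print("EM:", sum(em_per_question) / len(em_per_question))

# Weak alignment between human and model difficulty shows up as a small |rho| here.
rho, p_value = spearmanr([r["human_rate"] for r in records], em_per_question)
print("Spearman rho:", rho, "p-value:", p_value)
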

References

MERA Leaderboard. https://mera.a-ai.ru/en/text/leaderboard, accessed: 2025-09-08

What? Where? When? https://en.wikipedia.org/wiki/What%3F_Where%3F_When%3F, accessed: 2025-09-08

Aßenmacher, M., Karrlein, L., Schiele, P., et al.: Introducing wwm-german-18k – can LLMs crack the million? (or win at least 500 euros?). In: Proceedings of the 7th International Conference on Natural Language and Speech Processing (ICNLSP 2024). pp. 287–296 (2024), https://aclanthology.org/2024.icnlsp-1.31/

Campello, R.J.G.B., Moulavi, D., Sander, J.: Density-based clustering based on hierarchical density estimates. In: Advances in Knowledge Discovery and Data Mining. Lecture Notes in Computer Science, vol. 7819, pp. 160–172. Springer (2013). https://doi.org/10.1007/978-3-642-37456-2_14

Chen, A., Stanovsky, G., Singh, S., et al.: Evaluating question answering evaluation. In: Proceedings of the 2nd Workshop on Machine Reading for Question Answering. pp. 119–124 (2019). https://doi.org/10.18653/v1/D19-5817

Chi, N., Malchev, T., Kong, R., et al.: ModeLing: A Novel Dataset for Testing Linguistic Reasoning in Language Models. In: Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP. pp. 113–119 (2024), https://aclanthology.org/2024.sigtyp-1.14/

Cobbe, K., Kosaraju, V., Bavarian, M., et al.: Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021), https://arxiv.org/abs/2110.14168

Foster, E.J., Friedlander, K.J., Fine, P.A.: Mastermind and expert mind: A qualitative study of elite quizzers. Journal of Expertise 8(1), 38–71 (2025), https://www.journalofexpertise.org/articles/volume8_issue1/JoE_8_1_Foster_etal.html

Grootendorst, M.: BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794 (2022), https://arxiv.org/abs/2203.05794

Hendrycks, D., Burns, C., Basart, S., et al.: Measuring massive multitask language understanding. In: Proceedings of the International Conference on Learning Representations (ICLR) (2021), https://openreview.net/forum?id=d7KBjmI3GmQ

Hu, L., Li, Q., Xie, A., et al.: GameArena: Evaluating LLM reasoning through live computer games. In: The Thirteenth International Conference on Learning Representations (ICLR) (2025), https://openreview.net/forum?id=SeQ8l8xo1r

Joshi, M., Choi, E., Weld, D.S., et al.: TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. pp. 1601–1611 (2017). https://doi.org/10.18653/v1/P17-1147

Khan, M.A., Yadav, N., Masud, S., et al.: QUENCH: Measuring the gap between Indic and non-Indic contextual general reasoning in LLMs. In: Proceedings of the 31st International Conference on Computational Linguistics. pp. 4493–4509 (2025), https://aclanthology.org/2025.coling-main.303/

Lifar, M., Protsenko, B., Kupriianenko, D., et al.: LlaMa meets Cheburashka: impact of cultural background for LLM quiz reasoning. In: Language Gamification - NeurIPS 2024 Workshop (2024), https://openreview.net/forum?id=xCAzTXumhh

Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out. pp. 74–81. Association for Computational Linguistics, Barcelona, Spain (Jul 2004), https://aclanthology.org/W04-1013/

McInnes, L., Healy, J., Saul, N., et al.: UMAP: Uniform manifold approximation and projection for dimension reduction. The Journal of Open Source Software 3(29), 861 (2018), https://joss.theoj.org/papers/10.21105/joss.00861

Mikhalkova, E., Khlyupin, A.A.: Russian Jeopardy! Data Set for Question-Answering Systems. In: Proceedings of the Thirteenth Language Resources and Evaluation Conference. pp. 508–514 (2022), https://aclanthology.org/2022.lrec-1.53/

Papineni, K., Roukos, S., Ward, T., et al.: BLEU: a Method for Automatic Evaluation of Machine Translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. pp. 311–318 (2002). https://doi.org/10.3115/1073083.1073135

Rodriguez, P., Feng, S., Iyyer, M., et al.: Quizbowl: The case for incremental question answering. arXiv preprint arXiv:1904.04792 (2021), https://arxiv.org/abs/1904.04792

Srivastava, A., Rastogi, A., Rao, A., et al.: Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research (2023), https://openreview.net/forum?id=uyTL5Bvosj

Taktasheva, E., Shavrina, T., Fenogenova, A., et al.: TAPE: Assessing few-shot Russian language understanding. In: Findings of the Association for Computational Linguistics: EMNLP 2022. pp. 2472–2497 (2022). https://doi.org/10.18653/v1/2022.findings-emnlp.183

Xian, N., Fan, Y., Zhang, R., et al.: An empirical study of evaluating long-form question answering. In: Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 1141–1151. SIGIR ’25 (2025). https://doi.org/10.1145/3726302.3729895

Yang, Z., Qi, P., Zhang, S., et al.: HotpotQA: A dataset for diverse, explainable multi-hop question answering. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. pp. 2369–2380 (2018). https://doi.org/10.18653/v1/D18-1259

Zhang, Y., Wang, M., Li, X., et al.: TurnBench-MS: A benchmark for evaluating multi-turn, multi-step reasoning in large language models. In: Christodoulopoulos, C., Chakraborty, T., Rose, C., Peng, V. (eds.) Findings of the Association for Computational Linguistics: EMNLP 2025. pp. 19892–19924. Association for Computational Linguistics, Suzhou, China (Nov 2025). https://doi.org/10.18653/v1/2025.findings-emnlp.1084

Zheng, L., Chiang, W.L., Sheng, Y., et al.: Judging LLM-as-a-judge with MT-bench and Chatbot Arena. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. NIPS ’23, Curran Associates Inc. (2023). https://doi.org/10.5555/3666122.3668142

Published

2025-12-25

How to Cite

Kuznetsova, A. V., Byzov, V. A., Aslanov, I. V., & Kotelnikov, E. V. (2025). Do Open Large Language Models Know What, Where, and When? A Case Study with Quiz-Style Questions. Supercomputing Frontiers and Innovations, 12(3), 90–107. https://doi.org/10.14529/jsfi250307