RuParam: a Russian Parametric Dataset for LLM Evaluation
DOI: https://doi.org/10.14529/jsfi250301

Keywords: Large Language Models, linguistic evaluation, minimal pairs, Russian, linguistic parameters, language acquisition

Abstract
We introduce RuParam, a parametric dataset designed to evaluate the acquisition of Russian by large language models (LLMs). The corpus mirrors the structure of the BLiMP family of datasets in that it consists of minimal pairs of sentences, but we aimed to broaden its scope as much as possible by incorporating diverse phenomena from several domains of Russian grammar. A significant portion of the data originates from the Tests of Russian as a Foreign Language (TORFL); sources of this kind have not previously been used for the linguistic evaluation of LLMs. In addition, the study reports experimental findings for six LLMs. These models, sourced from multiple developers, vary in size and pretraining data, which affects their proficiency in Russian. We investigate how effectively they handle universal, typological, and Russian-specific grammatical features. Our results indicate that while most of the models demonstrate relatively high performance, they struggle significantly with some of the Russian-specific categories.
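As background for readers unfamiliar with the minimal-pair protocol used by BLiMP-style benchmarks, the sketch below shows how such an evaluation is typically scored: a causal language model assigns a log-probability to each member of a pair, and the pair counts as correct when the grammatical sentence scores higher. This is a minimal illustration, not the paper's evaluation code; the model name and the example pair are placeholders chosen for the sketch.

# Sketch of minimal-pair scoring with a causal LM (Hugging Face transformers).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "ai-forever/rugpt3small_based_on_gpt2"  # placeholder: any causal LM with Russian coverage
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def sentence_logprob(sentence: str) -> float:
    # Sum of log P(token_t | tokens_<t) over the whole sentence.
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Logits at position t predict token t+1, hence the shift.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    return log_probs.gather(1, targets.unsqueeze(1)).sum().item()

# An illustrative (invented) minimal pair testing subject-verb agreement:
good = "Мальчик читает книгу."  # "The boy reads a book."
bad = "Мальчик читают книгу."   # same sentence with a plural verb (agreement violation)
print(sentence_logprob(good) > sentence_logprob(bad))  # True = pair scored correctly

Accuracy on a benchmark of this kind is then simply the share of pairs for which the comparison comes out in favor of the grammatical sentence.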
References
Adeeba, F., Dillon, B., Sajjad, H., Bhatt, R.: UrBLiMP: A Benchmark for Evaluating the Linguistic Competence of Large Language Models in Urdu (2025), https://arxiv.org/abs/2508.01006
Başar, E., Padovani, F., Jumelet, J., Bisazza, A.: TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 16506–16521. Association for Computational Linguistics, Suzhou, China (2025)
Bel, N., Punsola, M., Ruiz-Fernández, V.: CatCoLA, Catalan Corpus of Linguistic Acceptability. Procesamiento del Lenguaje Natural 73, 177–190 (2024). https://doi.org/10.34810/data1393
Bel, N., Punsola, M., Ruiz-Fernández, V.: EsCoLA: Spanish corpus of Linguistic Acceptability. In: Calzolari, N., Kan, M.Y., Hoste, V., et al. (eds.) Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). pp. 6268–6277. ELRA and ICCL, Torino, Italia (May 2024). https://doi.org/10.34810/data1138
Daultani, V., Martínez, H.J.V., Okazaki, N.: Acceptability Evaluation of Naturally Written Sentences. Journal of Information Processing 32, 652–666 (2024). https://doi.org/10.2197/ipsjjip.32.652
Featherston, S.: Response Methods in Acceptability Experiments, pp. 39–61. Cambridge Handbooks in Language and Linguistics, Cambridge University Press (2021). https://doi.org/10.1017/9781108569620
Grashchenkov, P.: RuConst: A Treebank for Russian. Lomonosov Philology Journal. Series 9. Philology 3, 94–112 (2024). https://doi.org/10.55959/MSU0130-0075-9-2024-47-03-7, (in Russian)
Grashchenkov, P., Pasko, L., Studenikina, K., Tikhomirov, M.: Russian parametric corpus RuParam. Scientific and Technical Journal of Information Technologies, Mechanics and Optics 24(6), 991–998 (2024). https://doi.org/10.17586/2226-1494-2024-24-6-991-998, (in Russian)
Hendrycks, D., Burns, C., Basart, S., et al.: Measuring Massive Multitask Language Understanding. In: Proceedings of the International Conference on Learning Representations (ICLR) (2021), https://arxiv.org/abs/2009.03300
Hu, H., Zhang, Z., Huang, W., et al.: Revisiting Acceptability Judgements (May 2023). https://doi.org/10.48550/arXiv.2305.14091
Jentoft, M., Samuel, D.: NoCoLA: The Norwegian Corpus of Linguistic Acceptability. In: Alumäe, T., Fishel, M. (eds.) Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa). pp. 610–617. University of Tartu Library, Tórshavn, Faroe Islands (May 2023), https://aclanthology.org/2023.nodalida-1.60/
Jumelet, J., Weissweiler, L., Bisazza, A.: MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs (Apr 2025). https://doi.org/10.48550/arXiv.2504.02768
Ligeti-Nagy, N., Ferenczi, G., Héja, E., et al.: HuLU: Hungarian Language Understanding Benchmark Kit. In: Calzolari, N., Kan, M.Y., Hoste, V., et al. (eds.) Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). pp. 8360–8371. ELRA and ICCL, Torino, Italia (May 2024), https://aclanthology.org/2024.lrec-main.733/
Mikhailov, V., Shamardina, T., Ryabinin, M., et al.: RuCoLA: Russian Corpus of Linguistic Acceptability. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. pp. 5207–5227. Association for Computational Linguistics (2022). https://doi.org/10.18653/v1/2022.emnlp-main.348
Savchuk, S.O., Arkhangelskiy, T., Bonch-Osmolovskaya, A.A., et al.: Russian National Corpus 2.0: New opportunities and development prospects. Voprosy Jazykoznanija 2, 7–34 (2024). https://doi.org/10.31857/0373-658X.2024.2.7-34, (in Russian)
Someya, T., Oseki, Y.: JBLiMP: Japanese Benchmark of Linguistic Minimal Pairs. In: Vlachos, A., Augenstein, I. (eds.) Findings of the Association for Computational Linguistics: EACL 2023. pp. 1581–1594. Association for Computational Linguistics, Dubrovnik, Croatia (May 2023). https://doi.org/10.18653/v1/2023.findings-eacl.117
Someya, T., Sugimoto, Y., Oseki, Y.: JCoLA: Japanese Corpus of Linguistic Acceptability. In: Calzolari, N., Kan, M.Y., Hoste, V., Lenci, A., Sakti, S., Xue, N. (eds.) Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). pp. 9477–9488. ELRA and ICCL, Torino, Italia (May 2024), https://aclanthology.org/2024.lrec-main.828/
Song, S., Hu, J., Mahowald, K.: Language Models Fail to Introspect About Their Knowledge of Language (2025), https://arxiv.org/abs/2503.07513
Song, Y., Krishna, K., Bhatt, R., Iyyer, M.: SLING: Sino Linguistic Evaluation of Large Language Models. In: Goldberg, Y., Kozareva, Z., Zhang, Y. (eds.) Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. pp. 4606–4634. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Dec 2022). https://doi.org/10.18653/v1/2022.emnlp-main.305
Suijkerbuijk, M., Prins, Z., Kloots, M.d.H., et al.: BLiMP-NL: A Corpus of Dutch Minimal Pairs and Acceptability Judgments for Language Model Evaluation. Computational Linguistics, pp. 1–35 (May 2025). https://doi.org/10.1162/coli_a_00559
Taktasheva, E., Bazhukov, M., Koncha, K., et al.: RuBLiMP: Russian Benchmark of Linguistic Minimal Pairs. In: Al-Onaizan, Y., Bansal, M., Chen, Y.N. (eds.) Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 9268–9299. Association for Computational Linguistics, Miami, Florida, USA (Nov 2024). https://doi.org/10.18653/v1/2024.emnlp-main.522
Qwen Team: Qwen2.5: A Party of Foundation Models (September 2024), https://qwenlm.github.io/blog/qwen2.5/
Tikhomirov, M., Chernyshev, D.: Impact of Tokenization on LLaMa Russian Adaptation. In: 2023 Ivannikov Ispras Open Conference (ISPRAS). pp. 163–168 (2023). https://doi.org/10.1109/ISPRAS60948.2023.10508177
Tikhomirov, M., Chernyshev, D.: Facilitating large language model Russian adaptation with Learned Embedding Propagation. Journal of Language and Education 10(4), 130–145 (2024). https://doi.org/10.17323/jle.2024.22224
Trotta, D., Guarasci, R., Leonardelli, E., Tonelli, S.: Monolingual and Cross-Lingual Acceptability Judgments with the Italian CoLA corpus. In: Findings of the Association for Computational Linguistics: EMNLP 2021. pp. 2929–2940. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.findings-emnlp.250
Vázquez Martínez, H.J., Heuser, A., Yang, C., Kodner, J.: Evaluating Neural Language Models as Cognitive Models of Language Acquisition. In: Hupkes, D., Dankers, V., Batsuren, K., et al. (eds.) Proceedings of the 1st GenBench Workshop on (Benchmarking) Generalisation in NLP. pp. 48–64. Association for Computational Linguistics, Singapore (Dec 2023). https://doi.org/10.18653/v1/2023.genbench-1.4
Volodina, E., Mohammed, Y.A., Klezl, J.: DaLAJ - a dataset for linguistic acceptability judgments for Swedish. In: Proceedings of the 10th Workshop on NLP for Computer Assisted Language Learning. pp. 28–37. LiU Electronic Press (2021), https://aclanthology.org/2021.nlp4call-1.3/
Warstadt, A., Parrish, A., Liu, H., et al.: BLiMP: The Benchmark of Linguistic Minimal Pairs for English. Transactions of the Association for Computational Linguistics 8, 377–392 (Jul 2020). https://doi.org/10.1162/tacl_a_00321
Warstadt, A., Singh, A., Bowman, S.R.: Neural Network Acceptability Judgments. Transactions of the Association for Computational Linguistics 7, 625–641 (Sep 2019). https://doi.org/10.1162/tacl_a_00290
Xiang, B., Yang, C., Li, Y., et al.: CLiMP: A Benchmark for Chinese Language Model Evaluation. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. pp. 2784–2790. Association for Computational Linguistics, Online (Apr 2021). https://doi.org/10.18653/v1/2021.eacl-main.242
Zheng, L., Chiang, W.L., Sheng, Y., et al.: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Information Processing Systems. vol. 36, pp. 46595–46623. Curran Associates, Inc. (2023). https://doi.org/10.5555/3666122.3668142