RuBookSum: Dataset for Russian Literature Abstractive Summarization
DOI: https://doi.org/10.14529/jsfi250306

Keywords: large language model, summarization, literature, books

Abstract
The majority of existing Russian document summarization datasets focus on short-form source documents that do not require complex causal analysis or coreference resolution. Furthermore, processing longer multi-page texts poses a serious challenge to the current generation of language models, as a limited context window forces the task to be partitioned before a response can be generated. To lay the groundwork for future research on this problem, we introduce RuBookSum, a dataset for abstractive summarization of long-form Russian narratives. The dataset covers documents from several literary domains, including fiction, classics, children's books, and popular science, and provides high-quality human-written summaries. To establish a baseline, we evaluate popular open-source large language models and give a comprehensive analysis of their performance. Additionally, we propose optimized algorithms for long-document summarization that speed up summary generation by up to 300% without a significant drop in quality.
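The abstract alludes to partitioning long documents so they fit a limited context window. The sketch below illustrates one common approach, hierarchical (map-reduce style) summarization: summarize each chunk, then summarize the concatenation of those summaries. This is a minimal illustration, not the paper's algorithm; the `summarize` callable, the character-based chunk limit, and the merging strategy are all assumptions made for the example.

```python
# Minimal sketch of hierarchical long-document summarization.
# `summarize` stands in for any LLM call; chunk sizes and the
# merge step are illustrative, not the paper's actual method.

from typing import Callable, List


def split_into_chunks(text: str, max_chars: int = 8000) -> List[str]:
    """Split on paragraph boundaries so each chunk fits the model's window.

    A single paragraph longer than max_chars becomes its own
    (oversized) chunk; a real implementation would split further.
    """
    paragraphs = text.split("\n\n")
    chunks, current = [], ""
    for p in paragraphs:
        if current and len(current) + len(p) + 2 > max_chars:
            chunks.append(current)
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        chunks.append(current)
    return chunks


def hierarchical_summary(
    text: str,
    summarize: Callable[[str], str],
    max_chars: int = 8000,
) -> str:
    """Recursively summarize chunks, then summarize the joined summaries."""
    if len(text) <= max_chars:
        return summarize(text)
    partial = [summarize(chunk) for chunk in split_into_chunks(text, max_chars)]
    joined = "\n\n".join(partial)
    # Guard against summarizers that fail to shrink the input.
    if len(joined) >= len(text):
        return summarize(joined[:max_chars])
    return hierarchical_summary(joined, summarize, max_chars)
```

Because the chunk summaries in the map step are mutually independent, they can be batched or generated in parallel, which is one plausible source of throughput gains of the magnitude reported above; the paper's actual optimizations may differ.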
License
Authors retain copyright and grant the journal the right of first publication, with the work simultaneously licensed under a Creative Commons Attribution-NonCommercial 3.0 License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.