Information and communication and chemical technologies

No. 2 (23) - 2024 / 2024-06-30

RESEARCH OF REPRESENTATIVENESS OF KAZAKH LANGUAGE CORPORA BY WORD STEMS FOR THE SUMMARIZATION

Authors

Al-Farabi Kazakh National University
Al-Farabi Kazakh National University

Keywords

neural language modeling, NLP, text summarization, Kazakh language, representativity, synthetic datasets

Link to DOI:

https://doi.org/10.58805/kazutb.v.2.23-366

How to cite

Zhabaev, T., and U. Tukeyev. “RESEARCH OF REPRESENTATIVENESS OF KAZAKH LANGUAGE CORPORA BY WORD STEMS FOR THE SUMMARIZATION”. Vestnik KazUTB, vol. 2, no. 23, June 2024, doi:10.58805/kazutb.v.2.23-366.

Abstract

In this work, we investigate how the quality of a summarization model depends on the number of word stems in its training dataset. The experiments use a synthetic summarization dataset for the Kazakh language. Taking the number of word stems as a metric of representativeness, we analyze the quality of three summarization models as a function of the number of word stems in their training datasets. To obtain the three datasets, we divided the training dataset into three parts. BLEU scores were obtained for each model on the test files. The experiments show that the model trained on the dataset with the largest number of stems achieves the highest BLEU score. However, the score does not depend directly on the number of word stems: two models trained on datasets of different sizes achieve approximately the same scores.
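
The stem-based representativeness measure described in the abstract could be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `toy_stem` and its suffix list are hypothetical placeholders for a real Kazakh stemmer, and the split fractions passed to `split_by_fractions` are arbitrary, since the abstract does not say how the training dataset was divided.

```python
from typing import Callable, Iterable, List, Sequence, Set

# Hypothetical, deliberately tiny suffix list (a few plural endings).
# This is NOT a real Kazakh stemmer; substitute a proper one in practice.
TOY_SUFFIXES = ("лар", "лер", "дар", "дер", "тар", "тер")


def toy_stem(word: str) -> str:
    """Strip one known suffix if at least two characters of stem remain."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


def unique_stems(
    texts: Iterable[str], stem: Callable[[str], str] = toy_stem
) -> Set[str]:
    """Collect the set of distinct stems over whitespace-separated tokens."""
    stems: Set[str] = set()
    for text in texts:
        for token in text.lower().split():
            token = token.strip(".,!?;:\"'()«»")  # drop surrounding punctuation
            if token:
                stems.add(stem(token))
    return stems


def split_by_fractions(
    items: Sequence[str], fractions: Sequence[float]
) -> List[List[str]]:
    """Cut items into consecutive parts sized by the given fractions.

    The last part takes whatever remains, so no item is lost to rounding.
    """
    parts: List[List[str]] = []
    total = len(items)
    start = 0
    cumulative = 0.0
    for index, fraction in enumerate(fractions):
        cumulative += fraction
        is_last = index == len(fractions) - 1
        end = total if is_last else round(cumulative * total)
        parts.append(list(items[start:end]))
        start = end
    return parts


if __name__ == "__main__":
    # Illustrative sentences only; not taken from the paper's dataset.
    corpus = [
        "Балалар кітаптар оқиды.",
        "Бала кітап оқиды.",
        "Қалалар үлкен.",
        "Қала үлкен.",
    ]
    for number, part in enumerate(split_by_fractions(corpus, (0.5, 0.25, 0.25)), 1):
        print(f"part {number}: {len(part)} texts, {len(unique_stems(part))} stems")
```

The stem count of each part can then be set against the BLEU score of the model trained on that part, which is the comparison the abstract reports.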