Doctor of Technical Sciences, Professor, Head of Laboratory, V.I. Romanovsky Institute of Mathematics of the Academy of Sciences of the Republic of Uzbekistan, Tashkent, Republic of Uzbekistan
UZROBERTA: AN UZBEK LANGUAGE PRE-TRAINED MODEL
ABSTRACT
This paper presents the Uzbek Language Understanding Evaluation (UzLUE) framework, a benchmark for assessing natural language understanding in Uzbek that includes tasks such as message categorization. The benchmark was constructed from a wide-ranging source corpus while ensuring copyright compliance, so that it remains broadly accessible. Alongside it we provide UzLUE-RoBERTa, a pretrained language model, to make the UzLUE baseline reproducible and to encourage follow-up studies. Our findings show that UzLUE-RoBERTa-base surpasses the other benchmark models, including multilingual ones.
ANNOTATION
The article presents the Uzbek Language Understanding Evaluation (UzLUE) framework. UzLUE is a test for assessing natural language understanding in Uzbek, which includes a message categorization task. UzLUE was built from an extensive corpus of sources while guaranteeing copyright compliance and broad accessibility. To this end, we implemented the UzLUE test on the basis of RoBERTa, a pretrained language model, obtaining a more reproducible UzLUE-RoBERTa baseline model to stimulate follow-up research. Our results show that UzLUE-RoBERTa outperforms other benchmark models, including multilingual ones.
Keywords: UzRoBERTa, masked language model, pre-training, Latin-Cyrillic script, BERT, transformers.
Keywords: UzRoBERTa, masked language model, pre-training, Latin-Cyrillic script, BERT, transformers.
Introduction
Pretrained language models based on the Transformer [2] have achieved state-of-the-art results across a range of natural language processing (NLP) applications. Publicly available models for high-resource languages include BERT [3] and RoBERTa [4]. For Uzbek, however, comparable monolingual models are scarce because the language has limited resources.
The multilingual BERT [2], XLM [5], and XLM-R [6] models are designed to transfer knowledge from languages with abundant resources to those with scarce resources, and they are trained on many languages, including Uzbek. These multilingual models produce impressive results in zero-shot cross-lingual transfer, but on downstream tasks they underperform their monolingual counterparts. Because they have larger vocabularies and more parameters than monolingual models, fine-tuning them requires high-memory GPUs. As a result, monolingual models have been pretrained and released for a variety of languages.
This article presents the first published Uzbek model built on the RoBERTa architecture. Language resources for Uzbek are scarce: public language models, labeled datasets, and even sizable amounts of raw text are lacking. We first assemble a high-quality message corpus of about 300 million words. We then pretrain the model, which we call UzRoBERTa-base. To measure its performance and assess the usefulness of fine-tuned BERT-style models for categorization, we compare it against the multilingual xlm-roberta-base and distilbert-base-multilingual-cased. In this comparison, UzRoBERTa-base performs significantly better than xlm-roberta-base and distilbert-base-multilingual-cased.
Previous work
There have been several attempts to create word embeddings for Uzbek. Embeddings have been trained on Wikipedia for more than 100 languages, including Uzbek, keeping the 100K most frequent words in the vocabulary [7]. Distributed word representations for 157 languages were built from Wikipedia and Common Crawl [8]; the authors' Uzbek Wikipedia model contains 110K words, while their Common Crawl model contains 830K words. Another work produces fastText word embeddings for Uzbek (as well as other Turkic languages) and aligns them with embeddings of other Turkic languages; their model has 200K words, and the 24M words of Uzbek training data were taken from websites [9].
All of the embeddings above were created for the Latin script of Uzbek. Word embeddings for the Cyrillic script were produced by Mansurov and Mansurov (2020) using the word2vec [10], GloVe [11], and fastText [12] techniques. The authors crawled websites in the "uz" domain to gather data; their training corpus contained more than 79M words.
The fundamental limitation of these embeddings is that each word receives a single vector, regardless of how many meanings it may have, and because word2vec and GloVe are word-level models they cannot encode rare or out-of-vocabulary words. A Transformer-based BERT model has also been trained for the Cyrillic script of Uzbek [B. Mansurov et al., 2021].
To the best of our knowledge, no Transformer-based Uzbek language model for the Latin script has been made public. The two main contributions of this paper are the collection of a high-quality Latin-script corpus and the use of this corpus to train a RoBERTa-based model, UzRoBERTa, for the Uzbek language.
Methods
Corpora Selection Criteria
When screening a collection of corpora for source text, from which task-specific corpora are created and annotated, we take two criteria into account. The first criterion is accessibility: UzLUE's primary goal is to support ongoing NLP research and development, so we make sure the data it contains can be used and shared as freely as possible. The second criterion is quality and diversity: by eliminating low-quality text, we ensure that each sample in these corpora meets a certain standard and that the balance between formal and informal writing is maintained.
Diversity and high quality. We choose a subset of ten out of the 20 candidate source corpora to create the source corpus and the UzLUE benchmark. In doing so, we take the following criteria into account: 1) the corpus should not be specific to narrow domains (diversity); 2) the corpus must be written in contemporary Uzbek; and 3) the corpus should not be predominantly composed of content with privacy or harmfulness problems.
Preprocessing
Because these source corpora come from different origins with varying degrees of quality and curation, we carefully preprocess them before generating a subset for each downstream task. Each document is broken down into individual sentences using a sentence splitter [13]; this section details our preprocessing procedures. In addition to the steps listed below, manual inspection and filtering are applied during the annotation phase of each UzLUE task.
Noise filtering. We remove noisy and/or non-Uzbek text from the selected source corpora [14]. We first remove hashtags (e.g., #JMT), HTML tags (e.g., <br>), bad characters (e.g., U+200B (zero-width space), U+FEFF (byte order mark)), empty parentheses (e.g., ()), and consecutive blanks. We then filter out sentences containing more than 10 non-Uzbek characters. For corpora derived from news articles, we also remove information about reporters and the press, images, source tags, and copyright tags (e.g., ©).
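To make these filtering rules concrete, the sketch below implements similar regular-expression filters in Python. The specific patterns, the character set treated as "Uzbek", and the threshold of 10 foreign characters are our own illustrative assumptions rather than the exact rules used for UzLUE.

```python
import re

# Hypothetical filters approximating the noise-removal step described above.
HASHTAG = re.compile(r"#\w+")
HTML_TAG = re.compile(r"<[^>]+>")
BAD_CHARS = re.compile(r"[\u200b\ufeff]")          # zero-width space, byte order mark
EMPTY_PARENS = re.compile(r"\(\s*\)")
MULTI_SPACE = re.compile(r"\s{2,}")
# Assumed "foreign" characters: anything outside Latin letters, Uzbek apostrophes,
# digits, whitespace and common punctuation.
NON_UZBEK = re.compile(r"[^A-Za-z'ʻʼ0-9\s.,!?;:()\-\"%]")

def clean_sentence(text: str) -> str:
    """Strip hashtags, HTML tags, bad characters, empty parentheses and extra blanks."""
    for pattern in (HASHTAG, HTML_TAG, BAD_CHARS, EMPTY_PARENS):
        text = pattern.sub(" ", text)
    return MULTI_SPACE.sub(" ", text).strip()

def keep_sentence(text: str, max_foreign: int = 10) -> bool:
    """Drop sentences with more than `max_foreign` characters outside the expected alphabet."""
    return len(NON_UZBEK.findall(text)) <= max_foreign

sentences = [
    "Bugun #JMT <br> yangiliklari ()  e'lon qilindi.",
    "Это не узбекское предложение и будет отфильтровано.",
]
cleaned = [clean_sentence(s) for s in sentences if keep_sentence(clean_sentence(s))]
print(cleaned)  # only the cleaned Uzbek sentence survives
```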
Removal of harmful content. To avoid introducing undesirable content and biases into UzLUE, we employ a number of automatic methods to exclude such sentences from the source corpus.
Removal of private information. We eliminate sentences that include private information in order to reduce potential privacy concerns. These sentences are found with regular expressions that match URLs, email addresses, and user mentions such as "@xxxx".
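A minimal sketch of this privacy filter is shown below; the regular expressions are illustrative assumptions, not the exact patterns used for UzLUE.

```python
import re

# Illustrative patterns for sentences that should be dropped for privacy reasons.
URL = re.compile(r"https?://\S+|www\.\S+")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
MENTION = re.compile(r"@\w+")          # user mentions such as "@xxxx"

def contains_private_info(sentence: str) -> bool:
    """Return True if the sentence matches any URL, e-mail or user-mention pattern."""
    return any(p.search(sentence) for p in (URL, EMAIL, MENTION))

corpus = [
    "Murojaat uchun: someone@example.com yoki @xxxx bilan bog'laning.",
    "Toshkentda yangi metro bekati ochildi.",
]
filtered = [s for s in corpus if not contains_private_info(s)]
print(filtered)  # only the second sentence survives
```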
Pretrained Language Models
Historically, downstream NLP tasks have been solved by extracting word embeddings from raw text and feeding them into task-specific architectures, frequently based on recurrent neural networks (RNNs). The current trend is instead to pretrain Transformer-based language models, such as BERT and RoBERTa, on a large amount of unannotated text and then fine-tune them for a downstream task with far less labeled data. This strategy outperforms previous approaches on a range of tasks, such as language comprehension and question answering [15][9].
The Transformer uses the attention mechanism without recurrence and is composed of encoder and decoder stacks. BERT consists solely of an encoder stack and is pretrained with two objectives: a masked language model (MLM) and next sentence prediction (NSP). In the MLM objective, some input tokens are randomly masked and the model learns to predict the original masked tokens. The goal of NSP is to determine, given two text sequences, whether the second follows the first in the original text. After pretraining, the language model can be augmented with an additional task-specific layer and fine-tuned to handle a variety of downstream tasks, such as part-of-speech tagging.
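To make the MLM objective concrete, the snippet below queries a masked-language model through the Hugging Face fill-mask pipeline. We use the multilingual xlm-roberta-base baseline purely as an example, and the Uzbek test sentence is our own; any RoBERTa-style Uzbek checkpoint could be substituted for the model name.

```python
from transformers import pipeline

# Fill-mask pipeline: the model predicts the token hidden behind <mask>,
# which is exactly the masked-language-modelling objective described above.
fill_mask = pipeline("fill-mask", model="xlm-roberta-base")

# "Tashkent is the <mask> of Uzbekistan."
predictions = fill_mask("Toshkent O'zbekistonning <mask>.")
for p in predictions[:3]:
    print(f"{p['token_str']!r}  score={p['score']:.3f}")
```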
To support further study with UzLUE, we provide strong baselines for all of its benchmark tasks. As part of this effort, we pretrain and release large-scale language models for Uzbek, in the hope that this will reduce the burden on individual researchers of retraining such models. Specifically, we pretrain a RoBERTa [16] language model (PLM) from scratch.
Language Models
We pretrain several Uzbek language models using different training configurations. This lets us investigate the best conditions for pretraining Uzbek models and develop simple yet effective baseline models for UzLUE. We train UzLUE-RoBERTa while varying the preprocessing method, the pretraining corpus, and other training configurations.
Table 1.
Statistics of the pretraining corpus
| | Uzbek news | c4 | Total |
|---|---|---|---|
| # Sentences | 12M | 18.6M | 30.6M |
| # Words | 120,562,356 | 179,563,563 | 300,125,919 |
| Size (GB) | 1.25 | 1.85 | 3.1 |
Corpora for pretraining. The following two publicly accessible Uzbek corpora, compiled from a variety of sources, cover a wide range of topics and writing styles. We combine them to form the final pretraining corpus, which is about 3.1 GB in size; summary statistics are given in Table 1:
Uzbek news. It includes both formal articles (news and books) and colloquial text (dialogues).
c4. A colossal, cleaned version of Common Crawl's web crawl corpus, based on the Common Crawl dataset (https://commoncrawl.org).
Ethical considerations. Because we gather and use as much publicly accessible data as we can for pretraining, these corpora frequently contain unwanted social biases. Furthermore, even though all of these corpora are publicly accessible, we observed that they contain a considerable amount of PII. Both issues are problematic: a language model may learn social biases present in the corpus, and it may memorize PII from the corpus, which can then be extracted by adversarial attacks.
We do not filter out all socially biased content or hate speech, for the following reasons. First, the size of the pretraining corpus makes full human inspection impractical. Second, automatically detecting hate speech or socially biased content is a difficult problem in its own right, because both depend heavily on the context in which they appear. We did, however, identify harmful words in the Russian and English text and deleted them.
Table 2.
Implementation details of UzRoBERTa. WWM refers to the whole word masking strategy
| Model | Parameters | Masking | Training Steps | Batch Size | Learning Rate | Device |
|---|---|---|---|---|---|---|
| UzRoBERTa-base | 110M | Dynamic, WWM | 1M | 2048 | 10^-4 | 4x V100 GPUs |
Training configuration. We settle on the RoBERTa [16] architecture for our language model; implementation details are provided in Table 2. Following the original training procedure, all models are pretrained with dynamic masking on sequences of at most 512 tokens. We employ whole word masking (WWM), which masks every subword token belonging to a single word. (Next sentence prediction, NSP, is an additional objective used in BERT.) Hyperparameters not listed in Table 2 or in the pretraining instructions follow the original configuration from [4]. In line with this, we lower the learning rate; for RoBERTa we set it to 10^-5.
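A condensed sketch of such a pretraining run with the Hugging Face Trainer is given below. The tokenizer directory, the corpus file, and several hyperparameters (per-device batch size, warmup, mixed precision) are placeholders rather than the exact values used for UzRoBERTa, and the standard dynamic-masking collator is used; reproducing the whole-word-masking setting from Table 2 would require a WWM-aware collator.

```python
from datasets import load_dataset
from transformers import (
    RobertaConfig, RobertaForMaskedLM, RobertaTokenizerFast,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

# Placeholder paths: a BPE tokenizer trained on the Uzbek corpus and the raw text file.
tokenizer = RobertaTokenizerFast.from_pretrained("uz-tokenizer")        # hypothetical
dataset = load_dataset("text", data_files={"train": "uz_corpus.txt"})   # hypothetical

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# roberta-base-sized configuration (~110M parameters, as in Table 2).
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=768, num_hidden_layers=12, num_attention_heads=12,
    max_position_embeddings=514,
)
model = RobertaForMaskedLM(config)

# Dynamic masking: tokens are re-masked on every pass over the data.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="uzroberta-base",
    max_steps=1_000_000,                 # 1M steps, as in Table 2
    per_device_train_batch_size=64,      # 64 x 8 accumulation x 4 GPUs = 2048 effective batch
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    warmup_steps=10_000,
    fp16=True,
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```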
Existing Language Models
In addition to our own language models, we evaluate two existing multilingual language models and an Uzbek monolingual language model on our benchmark:
xlm-roberta-base [16]. XLM-RoBERTa is a multilingual version of RoBERTa, pretrained on 2.5 TB of filtered CommonCrawl data covering 100 languages.
distilbert-base-multilingual-cased [17]. The model is trained on the concatenation of Wikipedia in 104 languages.
Results
Table 3 reports the classification accuracy, F1 score, and cross-entropy loss. Using the News Category Dataset from the darayo.uz website, we fine-tune four models to identify the category of a news item from its headline and a short description. The collection includes 17,243 news headlines collected between 2021 and 2022.
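The evaluation follows the standard sequence-classification fine-tuning recipe. The sketch below shows one way to reproduce it; the CSV file names, column names, and hyperparameters are placeholders, not the exact setup used for the news dataset described above.

```python
import numpy as np
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
from transformers import (
    AutoModelForSequenceClassification, AutoTokenizer,
    Trainer, TrainingArguments,
)

# Hypothetical CSV files with a "text" column (headline + description) and an integer "label".
dataset = load_dataset("csv", data_files={"train": "news_train.csv", "test": "news_test.csv"})
num_labels = len(set(dataset["train"]["label"]))

checkpoint = "xlm-roberta-base"   # or the UzRoBERTa checkpoint under comparison
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_labels)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, preds),
            "f1": f1_score(labels, preds, average="weighted")}

args = TrainingArguments(output_dir="news-classifier", num_train_epochs=3,
                         per_device_train_batch_size=32, learning_rate=2e-5)

trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"], eval_dataset=tokenized["test"],
                  tokenizer=tokenizer, compute_metrics=compute_metrics)
trainer.train()
print(trainer.evaluate())   # reports eval_loss, accuracy and F1 on the test split
```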
Table 3.
The classification accuracy, F1 score, and cross-entropy loss of four models on the news dataset
| Model | Loss | F1 | Accuracy |
|---|---|---|---|
| xlm-roberta-base | 0.27 | 0.90 | 0.91 |
| distilbert-base-multilingual-cased | 0.37 | 0.88 | 0.90 |
| UzRoBERTa-base | 0.13 | 0.96 | 0.96 |
Discussion
Recall that xlm-roberta-base and distilbert-base-multilingual-cased were trained on Wikipedia and CommonCrawl, while UzRoBERTa-base was trained on news articles. As a result, on the news evaluation set, UzRoBERTa-base performs significantly better than xlm-roberta-base and distilbert-base-multilingual-cased.
In our assessment, UzRoBERTa-base's strong performance is mostly due to the following factors:
- The UzRoBERTa-base training data is of higher quality than that of the other models, and we pre-processed it thoroughly.
- Transfer learning to Uzbek from other languages may not have been effective for xlm-roberta-base and distilbert-base-multilingual-cased.
Conclusion
The goal of this study was to build a RoBERTa-based monolingual pretrained Uzbek language model. The result, UzRoBERTa-base, is the first such model that is openly accessible. Despite being trained on a smaller corpus, our model's masked language modeling accuracy is significantly higher than that of multilingual BERT.
Because UzRoBERTa-base was trained exclusively on Uzbek text, it has a theoretical advantage over xlm-roberta-base and distilbert-base-multilingual-cased: its vocabulary is more focused and better reflects the nuances of the language. When task-specific fine-tuning data is available in a language other than Uzbek, however, xlm-roberta-base and distilbert-base-multilingual-cased are preferable. Even then, xlm-roberta-base and distilbert-base-multilingual-cased would need to be trained on Uzbek text of substantially higher quality than that found in Uzbek Wikipedia and Common Crawl, since their MLM accuracy falls significantly short of UzRoBERTa-base's.
Future work on UzRoBERTa-base should consider training on additional texts from a wider range of genres. UzRoBERTa-base was trained on 3.1 GB of text; a model trained on twice as much text might match the accuracy of models built on tens of gigabytes. Since no public Uzbek datasets were available for downstream tasks, we could not assess its performance on them; producing such datasets and evaluating UzRoBERTa on them is another direction for future work.
Finally, similar to the work in [18], it would be interesting to investigate how the tokenizer affects model performance. Because Uzbek is morphologically rich, with many inflectional patterns, we want to create a tokenizer that can accurately separate words into stems and suffixes.
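As a starting point for such experiments, the subword tokenizer can be retrained on the pretraining corpus. The minimal byte-level BPE sketch below (with a placeholder corpus path and vocabulary size) does not yet enforce a stem/suffix segmentation; that would require coupling the tokenizer with a morphological analyser.

```python
import os
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on the pretraining corpus (placeholder path).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["uz_corpus.txt"], vocab_size=52_000, min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])

# Inspect how an inflected word is segmented, e.g. "kitoblarimizda" ("in our books").
print(tokenizer.encode("kitoblarimizda").tokens)

os.makedirs("uz-tokenizer", exist_ok=True)
tokenizer.save_model("uz-tokenizer")
```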
References:
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser and I. Polosukhin, arXiv preprint arXiv:1706.03762, 2017.
- J. Devlin, M. Chang, K. Lee and K. Toutanova, arXiv preprint arXiv:1810.04805, 2019.
- Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer and V. Stoyanov, arXiv preprint arXiv:1907.11692, 2019.
- G. Lample and A. Conneau, arXiv preprint arXiv:1901.07291, 2019.
- A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer and V. Stoyanov, arXiv preprint arXiv:1911.02116, 2019.
- R. Al-Rfou, B. Perozzi and S. Skiena, ‘Proceedings of the Seventeenth Conference on Computational Natural Language Learning,’ 2013, p. 183.
- E. Grave, P. Bojanowski, P. Gupta, A. Joulin and T. Mikolov, ‘Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018),’ 2018.
- V. Baisa and V. Suchomel, ‘Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12),’ 2012, p. 28.
- T. Mikolov, K. Chen, G. Corrado and J. Dean, arXiv preprint arXiv:1301.3781, 2013.
- J. Pennington, R. Socher and Ch. Manning, ‘Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP),’ 2014, p. 1532.
- P. Bojanowski, E. Grave, A. Joulin and T. Mikolov, ‘Enriching Word Vectors with Subword Information,’ Transactions of the Association for Computational Linguistics, 2017.
- J. Devlin, M. Chang, K. Lee and K. Toutanova, arXiv preprint arXiv:1810.04805, 2018.
- S. Park, J. Moon, S. Kim, W. Cho, J. Han, J. Park, Ch. Song, J. Kim, Y. Song, T. Oh, J. Lee, J. Oh, S. Lyu, Y. Jeong, I. Lee, S. Seo, D. Lee, H. Kim, M. Lee, S. Jang, S. Do, S. Kim, K. Lim, J. Lee, K. Park, J. Shin, S. Kim, L. Park, A. Oh, J. Ha and K. Cho, arXiv preprint arXiv:2105.09680, 2021.
- R. R. Davronov, R. A. Safarov and Sh. Q. Abdumalikov, ‘International Conference on Information Science and Communications Technologies (ICISCT),’ 2021.
- R. R. Davronov, R. A. Safarov and Sh. Q. Abdumalikov, ‘International Conference on Information Science and Communications Technologies (ICISCT),’ 2021.
- URL: https://huggingface.co/rifkat/uztext-3Gb-BPE-Roberta
- URL: https://huggingface.co/xlm-roberta-base
- URL: https://huggingface.co/distilbert-base-multilingual-cased
- URL: https://www.nltk.org/index.html
- URL: https://spacy.io/