The impact of lemmatization for morphologically-rich languages


Are there ways to improve the performance of language models beyond increases in size, whether in the number of model parameters or in the size of the training corpora?

Our benchmarks show that another way to increase accuracy is to leverage linguistic data sources, such as lemmatization data, synonym-antonym dictionaries, entity lists, or phrase lists.

This benchmark focuses on how linguistic data affects the performance of embeddings for morphologically-rich languages, Modern Standard Arabic (MSA) in this case.

Results show that enriching embeddings with lemmatization yields better topic models and can increase Topic Coherence scores by up to 15%.

We compare the results of applying lemmatization across multiple topic modeling techniques, and of using lemma-based embeddings rather than traditional word-level embeddings.


With the high availability of textual data that lacks structure or labels, text mining techniques like topic modeling have become crucial. Topic modeling tries to summarize documents by extracting their most important topics in an unsupervised way.

We investigate the impact of lemmatization on the performance of embedding-based topic modeling techniques for morphologically-rich languages like Modern Standard Arabic (MSA).

As an example, the Arabic root كتب “he wrote” can be used to form more than 250 word forms, such as سيكتبون “they will write”, مكتوب “written”, and فكتبن “then they (feminine) wrote”.

For this reason, leveraging a text normalization technique like lemmatization, which maps different word forms to their base form, can be very effective.
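As a toy illustration, lemmatization can be thought of as a mapping from inflected forms to a base form. The lookup table below covers only the three example forms mentioned above; a real lemmatizer (such as the Bitext lemmatizer used in this benchmark) handles the full morphology of the language rather than a fixed dictionary:

```python
# Toy lemmatizer: a lookup table mapping inflected forms to a base form.
# Illustrative only -- real lemmatizers cover the full morphology.
TOY_LEMMA_MAP = {
    "سيكتبون": "كتب",   # "they will write"
    "مكتوب": "كتب",     # "written"
    "فكتبن": "كتب",     # "then they (feminine) wrote"
}

def lemmatize(token: str) -> str:
    """Return the base form of a token, or the token itself if unknown."""
    return TOY_LEMMA_MAP.get(token, token)
```

After this mapping, all three surface forms count as occurrences of the same vocabulary item, which is exactly what benefits downstream frequency-based models.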

We trained two variants, with non-lemmatized and lemmatized input text, of each of the following topic models: CTM, ETM, and BERTopic. For BERTopic, we trained two additional variants initialized with custom word-based and lemma-based embeddings.


We worked with NADiA, an Arabic dataset that contains 35,416 articles extracted from the SkyNewsArabia news website. The dataset was originally used for multi-label classification, where every article can belong to more than one of 24 categories. For the purposes of this benchmark, we worked only with the articles and discarded the labels.

We’ve published the following resources on GitHub:


We filtered the articles to keep only those with a word count between 15 and 500 words. We cleaned the text to remove numbers, extraneous whitespace, stop words, and diacritics, and then lemmatized it.
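A minimal sketch of this filtering and cleaning step in Python; the stop-word set here is an illustrative subset, and the exact cleaning rules used in the benchmark may differ:

```python
import re

# Arabic diacritic (tashkeel) marks plus the superscript alef
ARABIC_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
STOP_WORDS = {"في", "من", "على"}  # illustrative subset; a full list is much larger

def clean(text: str) -> str:
    """Strip diacritics and numbers, drop stop words, collapse whitespace."""
    text = ARABIC_DIACRITICS.sub("", text)   # remove diacritics
    text = re.sub(r"\d+", " ", text)         # remove numbers
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)                  # split/join collapses whitespace

def keep_article(text: str, lo: int = 15, hi: int = 500) -> bool:
    """Keep only articles whose word count falls in [lo, hi]."""
    return lo <= len(text.split()) <= hi
```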

For lemmatization, we used the Bitext lemmatizer, which connects roots and word forms based on linguistic principles and uses extensive morphological attributes.

We then created a vocabulary using the most frequent 10,000 words in the corpus and processed the articles to keep only these words.
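The vocabulary restriction step can be sketched as follows, assuming whitespace-tokenized articles (the benchmark's exact tokenization may differ):

```python
from collections import Counter

def build_vocab(docs, size=10_000):
    """Keep the `size` most frequent words across the corpus."""
    counts = Counter(tok for doc in docs for tok in doc.split())
    return {w for w, _ in counts.most_common(size)}

def restrict(doc, vocab):
    """Drop every token that is not in the vocabulary."""
    return " ".join(t for t in doc.split() if t in vocab)
```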


For all of our experiments, we trained models with the following numbers of topics: 5, 10, 25, 50, 75, and 100. For each configuration, we trained the model 5 times and averaged the resulting evaluation scores. We trained two variants of each model, using non-lemmatized and lemmatized text.
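The experimental grid can be sketched as a simple loop; `train_and_score` is a hypothetical placeholder for training one model (CTM, ETM, or BERTopic) on the corpus and returning its coherence score:

```python
from statistics import mean

TOPIC_COUNTS = [5, 10, 25, 50, 75, 100]
N_RUNS = 5

def run_grid(train_and_score, docs):
    """For each topic count, train N_RUNS models and average their scores."""
    results = {}
    for k in TOPIC_COUNTS:
        scores = [train_and_score(docs, num_topics=k, seed=run)
                  for run in range(N_RUNS)]
        results[k] = mean(scores)
    return results
```

Averaging over several runs smooths out the run-to-run variance that stochastic topic models typically exhibit.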

For BERTopic, we trained two additional variants with custom word-based and lemma-based embeddings.
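One way to initialize BERTopic with custom embeddings is to pass precomputed document vectors to `fit_transform`, which BERTopic supports. The sketch below averages per-document word2vec vectors; the vector file name and dimensionality are illustrative assumptions, not the benchmark's exact setup:

```python
import numpy as np

def doc_embedding(tokens, kv, dim=300):
    """Average the word vectors of a document's tokens (zero vector if none)."""
    vecs = [kv[t] for t in tokens if t in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Hypothetical usage with lemma-level word2vec vectors:
# from gensim.models import KeyedVectors
# from bertopic import BERTopic
# kv = KeyedVectors.load("lemma_w2v.kv")  # illustrative file name
# embeddings = np.vstack([doc_embedding(d.split(), kv) for d in docs])
# topics, _ = BERTopic().fit_transform(docs, embeddings)
```

With this pattern, the word-based and lemma-based variants differ only in which word2vec model supplies the vectors.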


We evaluated our models using the Topic Coherence (NPMI) metric on the test set, which contains 3,416 articles. Topic Coherence measures the interpretability of the generated topics by assessing the coherence of the top n words of each topic.
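For reference, NPMI for a word pair is its PMI normalized by the negative log of the joint probability. The sketch below computes a topic's coherence from document-level co-occurrence counts; in practice a standard implementation such as gensim's `CoherenceModel` with `coherence="c_npmi"` would typically be used:

```python
import math
from itertools import combinations

def npmi_coherence(top_words, docs, eps=1e-12):
    """Average NPMI over all pairs of a topic's top words."""
    n = len(docs)
    doc_sets = [set(d.split()) for d in docs]
    def p(*words):
        # fraction of documents containing all the given words
        return sum(all(w in s for w in words) for s in doc_sets) / n
    scores = []
    for wi, wj in combinations(top_words, 2):
        p_ij = p(wi, wj)
        if p_ij == 0:
            scores.append(-1.0)  # words never co-occur: minimum score
            continue
        pmi = math.log(p_ij / (p(wi) * p(wj) + eps))
        scores.append(pmi / (-math.log(p_ij + eps)))  # normalize to [-1, 1]
    return sum(scores) / len(scores)
```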

The resulting score is a decimal value between -1.0 and 1.0, where a higher score indicates more coherent topics. Table 1 shows the resulting Topic Coherence scores of the different models.


Table 1 Topic Coherence scores of our models with different numbers of topics. Each score is calculated by averaging the resulting scores of 5 runs

The results show that training topic models on lemmatized text leads to better performance. In addition, we see that initializing BERTopic with lemma-based embeddings leads to better performance than using either AraBERT or word-level embeddings.

This highlights the importance of lemmatization to normalize text when working with languages with complex morphology like Arabic.


Figure 1 A comparison of Topic Coherence scores of different models. The dotted lines refer to models trained on lemmatized text, whereas the solid lines refer to models trained on non-lemmatized text


Figure 2 A comparison of the Topic Coherence scores of the two variants of BERTopic that were trained using word-level and lemma-based embeddings


In this work, we investigated the effects of leveraging lemmatization, using the Bitext lemmatizer, for the task of topic modeling in Arabic. We worked with a high-quality Arabic dataset of news articles to train three models: CTM, ETM, and BERTopic.

For the BERTopic model, we trained 4 variants: two initialized with AraBERT, and two with word2vec word-level and lemma-based embeddings that we trained on Wikipedia text. We evaluated our models on a separate test set using the Topic Coherence (NPMI) metric.

Our results show that applying the Bitext lemmatizer to the text yields better topic models and higher Topic Coherence scores. We also showed that initializing BERTopic with lemma-based embeddings leads to better performance than using word-level embeddings.

Are you interested in downloading the full benchmark? Click on the button below!