Lemmatization is the process of determining the lemma (the dictionary form) of each word in a text according to its intended meaning.
The lemma is used to increase search relevancy and to reduce indexing needs in databases.
The main difference from stemming is that lemmatization takes context into account to resolve ambiguity.

Contact us for more information!

The software is currently available for over 50 languages: Afrikaans, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bulgarian, Catalan, Czech, Danish, Dutch, English, Esperanto, Estonian, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Kannada, Kazakh, Korean, Kyrgyz, Macedonian, Malay, Malayalam, Mongolian, Nepali, Norwegian Bokmål, Norwegian Nynorsk, Persian, Portuguese, Punjabi, Russian, Serbian Latinica, Slovak, Spanish, Swahili, Swedish, Tagalog, Telugu, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, and Zulu.
In any text, a word may appear in different inflected forms. For example, the verb “play” can appear as “playing”, “played”, or “plays”. All of these should be classified in the same category because, although they are different surface forms, they share the same core meaning.
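As a rough illustration of the grouping described above, the sketch below maps the inflected forms of “play” to a single lemma and counts them together. The lookup table and the function names are hypothetical and chosen only for this example; they are not the Bitext API, and a real lemmatizer would rely on a full lexicon and on context rather than a three-entry dictionary.

```python
from collections import Counter

# Hypothetical toy lookup table, for illustration only.
LEMMA_TABLE = {
    "playing": "play",
    "played": "play",
    "plays": "play",
}


def lemmatize(token: str) -> str:
    """Return the lemma of a token, falling back to the lowercased token."""
    lowered = token.lower()
    return LEMMA_TABLE.get(lowered, lowered)


def count_lemmas(tokens):
    """Count occurrences per lemma instead of per surface form."""
    return Counter(lemmatize(token) for token in tokens)


tokens = ["She", "plays", "chess", "and", "played", "yesterday",
          "while", "playing", "music"]
counts = count_lemmas(tokens)
print(counts["play"])  # 3: all three inflected forms fall into one category
```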
Machine learning classification projects have an accuracy target that is often hard, if not impossible, to reach. Even when the main parameters (feature selection, learning algorithm, and so on) are well chosen, accuracy eventually hits a wall that is hard to overcome. Training cycles get too long, deadlines slip, and the project stalls. The problem gets worse when the project is multilingual, since other languages may have richer morphology.
There are two solutions. The classical one is to enlarge the training corpus, which means tagging new data; even when that is possible, it typically increases cost and slows down development. The second is to increase the quality of the existing training set, which can be achieved through lemmatization.
Bitext lemmatization software lets you group the different forms of a word in a text under the same root word (its lemma). It has been applied to improve database indexing, text categorization tools, and machine learning pipelines.
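To make the indexing benefit concrete, here is a minimal sketch that builds a small inverted index twice, once keyed by surface form and once keyed by lemma. The lemma table, the sample documents, and the function names are assumptions made for this example and do not describe how Bitext or any particular database implements indexing.

```python
from collections import defaultdict

# Hypothetical toy lemma table, for illustration only.
LEMMAS = {"playing": "play", "played": "play", "plays": "play"}


def normalize(token: str) -> str:
    """Map a token to its lemma, or to its lowercased form if unknown."""
    lowered = token.lower()
    return LEMMAS.get(lowered, lowered)


def build_index(docs, use_lemmas):
    """Build an inverted index: key -> set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            key = normalize(token) if use_lemmas else token.lower()
            index[key].add(doc_id)
    return index


docs = {
    1: "children played outside",
    2: "she plays piano",
    3: "they are playing cards",
}

surface_index = build_index(docs, use_lemmas=False)
lemma_index = build_index(docs, use_lemmas=True)

print(len(surface_index), len(lemma_index))  # 10 8: fewer keys to store
print(sorted(lemma_index["play"]))           # [1, 2, 3]: one key finds all three
```

In this toy case a search for “play” matches no key in the surface-form index, while the lemma-keyed index returns all three documents from a single entry.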

Benefits for Machine Learning and Deep Learning algorithms

Bitext lemmatization software helps to disambiguate and group words by considering their context. Take the word “book” as an example: depending on the surrounding text, it can mean two different things.
  • “I enjoy booking my trips online, it helps me to save money”: here “booking” means making a reservation, and the lemma is the verb “book”.
  • “I bought three new books last week on my trip to Dublin”: here “books” refers to printed works such as novels, and the lemma is the noun “book”.
If the algorithm can treat these two senses of “book” as different words, the error margin will be lower, so the accuracy of the results can increase even with a smaller training corpus.
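One simple way to let a classifier keep the two senses of “book” apart is to pair each lemma with its part of speech and use the pair as the feature. The sketch below assumes the part-of-speech tags are already supplied by an upstream tagger; the lookup table, the tag names, and the `lemma_feature` helper are hypothetical and are not taken from the Bitext software.

```python
# Hypothetical lookup keyed by (surface form, part-of-speech tag).
LEMMA_BY_POS = {
    ("booking", "VERB"): "book",
    ("books", "NOUN"): "book",
}


def lemma_feature(token: str, pos: str) -> str:
    """Build a feature string that keeps the lemma and its part of speech."""
    lowered = token.lower()
    lemma = LEMMA_BY_POS.get((lowered, pos), lowered)
    return f"{lemma}/{pos}"


# Tags are assumed to come from a tagger that has already looked at context.
tagged_tokens = [("booking", "VERB"), ("books", "NOUN")]
features = [lemma_feature(token, pos) for token, pos in tagged_tokens]
print(features)  # ['book/VERB', 'book/NOUN']
```

Both forms reduce to the lemma “book”, but the verb and the noun remain distinct features, so a downstream model does not conflate making a reservation with a printed book.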


Bitext lemmatization software is used as part of MarkLogic’s “Ask Anything” Universal Index, where we help MarkLogic provide advanced language support.

If you are interested in this service or need more information, please contact us!
