Lemmatization is the process of determining the lemma (the dictionary form) of each word in a text according to its intended meaning.
Unlike stemming, lemmatization takes the context of each word into account in order to resolve ambiguity.

The software is currently available for over 50 languages: Afrikaans, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bulgarian, Catalan, Czech, Danish, Dutch, English, Esperanto, Estonian, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Kannada, Kazakh, Korean, Kyrgyz, Macedonian, Malay, Malayalam, Mongolian, Nepali, Norwegian Bokmal, Norwegian Nynorsk, Persian, Portuguese, Punjabi, Russian, Serbian Latinica, Slovak, Spanish, Swahili, Swedish, Tagalog, Telugu, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, and Zulu.
In any text, a word may appear in different inflected forms. For example, the verb “play” can appear as “playing”, “played”, or “plays”. All of these should be classified in the same category because, although they are different forms, they share the same meaning. Our lemmatization software allows you to group the different forms of a word in a text under the same root word. This has been applied to improve database indexing, text categorization tools, and machine learning pipelines.
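
The grouping idea can be illustrated with a minimal Python sketch. This is not the product's API: the lexicon `TOY_LEXICON` and the helpers `lemma_of` and `group_by_lemma` are hypothetical names invented for this illustration, and a real lemmatizer relies on full morphological dictionaries rather than a four-entry table.

```python
from collections import defaultdict

# Hypothetical toy lexicon (an assumption for illustration only):
# maps each inflected form to its lemma.
TOY_LEXICON = {
    "play": "play",
    "plays": "play",
    "playing": "play",
    "played": "play",
}


def lemma_of(token):
    """Return the lemma of a token, falling back to the lowercased token."""
    word = token.lower()
    return TOY_LEXICON.get(word, word)


def group_by_lemma(tokens):
    """Group surface forms under their shared lemma, preserving text order."""
    groups = defaultdict(list)
    for token in tokens:
        groups[lemma_of(token)].append(token)
    return dict(groups)


tokens = "She plays today , he played yesterday and they are playing now".split()
groups = group_by_lemma(tokens)
print(groups["play"])  # ['plays', 'played', 'playing']
```

An index or classifier built on `groups` would then count the three forms as occurrences of the single entry “play”.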

Benefits for Machine Learning and Deep Learning algorithms

Our lemmatization software helps to disambiguate and group words by considering the context. Take the word “book” as an example: depending on the surrounding text, it can mean two different things.
  • I enjoy booking my trips online, it helps me to save money: in this case, “booking” means making a reservation, and the lemma is the verb “book”.
  • I bought three new books last week on my trip to Dublin: in this case, “books” refers to printed works such as novels, and the lemma is the noun “book”.
If the algorithm can treat these two uses of “book” as different words, the margin of error will be lower, so the accuracy of the results increases even when using a smaller training corpus.
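
As a rough sketch of how the two uses of “book” could reach a learning algorithm as distinct features, the Python snippet below pairs each lemma with its part of speech. Everything in it is an assumption made for illustration: the table `TOY_LEMMAS`, the function `lemma_feature`, and the `VERB`/`NOUN` labels are hypothetical, and the part-of-speech tags are supplied by hand here, whereas a real system would infer them from the surrounding text.

```python
# Hypothetical toy table (an assumption for illustration only):
# keyed by (surface form, part of speech), valued by lemma.
TOY_LEMMAS = {
    ("booking", "VERB"): "book",
    ("books", "NOUN"): "book",
}


def lemma_feature(form, pos):
    """Build a feature string that keeps the lemma and its part of speech apart."""
    lemma = TOY_LEMMAS.get((form.lower(), pos), form.lower())
    return f"{lemma}/{pos}"


# Tags are given manually in this sketch; a tagger would derive them from context.
verb_feature = lemma_feature("booking", "VERB")
noun_feature = lemma_feature("books", "NOUN")

print(verb_feature)  # book/VERB
print(noun_feature)  # book/NOUN
```

Both forms share the lemma “book”, but the resulting features differ, so a downstream model can keep reservations and printed works apart.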


Bitext Lemmatization software is part of MarkLogic’s “Ask Anything” Universal Index. We help MarkLogic to provide advanced language support.

If you are interested in this service or need more information, please contact us!


José Echegaray 8, building 3, office 4
Parque Empresarial Las Rozas
28232 Las Rozas



1700 Montgomery Street, Suite 101
CA 94111