Bitext Linguistic Analysis Platform: the Lemmatizer

Bitext provides a multilingual platform for full text analysis and tagging, including both lexical and syntactic analysis. The Bitext Lemmatizer is the core of the lexical analysis component. For an overview of our platform, see Bitext Deep Linguistic Analysis Platform.

Main Features of Bitext Lemmatizer:

- covers 100+ languages and variants: 77 languages and 25 variants
- processes 60,000,000+ words per second on a standard server
- source code available, also in escrow
- optimized versions for search & indexing, for chatbots and for SEO
- integrated with complementary tools:
  - decompounding for German, Korean…
  - word segmentation for Japanese, Chinese…

How the Lemmatizer works

The Bitext Lemmatization service identifies all potential lemmas (also called roots) for any word, using morphological analysis and lexicons curated by computational linguists.

The service receives a word as input and returns:

1. if the word is an inflected form, all the lemmas that form can correspond to
2. if the word is a lemma, the lemma itself

If no lemma is found for a word, the service returns the input word as the lemma.

For example, for the word “spoke”, the Lemmatization service will return the lemmas “speak” and “spoke”. For the sentence “We spoke yesterday about fixing the bike.”, the service will output a set of lemmas for every word in the sentence.
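The lookup behaviour described above can be illustrated with a minimal Python sketch. It is not the Bitext API: the `TOY_LEXICON` table and the `lemmatize` and `lemmatize_sentence` functions are hypothetical names invented for this example, and the lexicon holds only a handful of illustrative entries.

```python
from typing import Dict, List, Tuple

# Toy lexicon (illustrative entries only): word -> every lemma it can belong to.
TOY_LEXICON: Dict[str, List[str]] = {
    "spoke": ["speak", "spoke"],  # past tense of "speak", or the noun "spoke"
    "fixing": ["fix"],            # an inflected form maps to its lemma
    "bike": ["bike"],             # a lemma maps to itself
    "we": ["we"],
}


def lemmatize(word: str) -> List[str]:
    """Return all candidate lemmas, or the input word itself if none is found."""
    return TOY_LEXICON.get(word.lower(), [word])


def lemmatize_sentence(sentence: str) -> List[Tuple[str, List[str]]]:
    """Naive whitespace tokenization, then one lemma set per token."""
    tokens = [t.strip(".,;:!?\"'") for t in sentence.split()]
    return [(t, lemmatize(t)) for t in tokens if t]


print(lemmatize("spoke"))  # ['speak', 'spoke']
for token, lemmas in lemmatize_sentence("We spoke yesterday about fixing the bike."):
    print(token, lemmas)
```

Words missing from the toy lexicon, such as “yesterday”, come back unchanged, which mirrors the fallback rule stated above.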

The service uses two main components: a powerful software engine and the most comprehensive lexicons in the market.

1. The software. Bitext Lemmatizer runs on any platform and has a powerful engine capable of processing millions of words per second.

2. The Lexicons. Bitext Lexical Dictionaries contain linguistically curated wordlists that cover all possible words of each language with their morphological and semantic attributes. They are constantly updated against real language corpora.

Languages and Dictionaries

The Lemmatization Service supports 77+ languages and 25 language variants. The lexical coverage for each language is highly comprehensive. The number of lemmas and forms varies across languages: from around 60,000 lemmas and 130,000 forms for English, to 65,000 lemmas and 3,000,000 forms for Spanish, and 70,000 lemmas and 32,000,000 forms for Finnish. Lemmas and forms for all languages are enriched with comprehensive morphosyntactic features.

77 languages currently available:

- Afrikaans
- Albanian
- Amharic
- Arabic
- Armenian
- Assamese
- Azeri
- Basque
- Belarusian
- Bengali
- Bulgarian
- Burmese
- Catalan
- Chinese
- Croatian
- Czech
- Danish
- Dutch
- English
- Esperanto
- Estonian
- Finnish
- French
- Galician
- Georgian
- German
- Greek
- Gujarati
- Hebrew
- Hindi
- Hungarian
- Icelandic
- Indonesian
- Irish Gaelic
- Italian
- Japanese
- Kannada
- Kazakh
- Khmer
- Korean
- Kyrgyz
- Lao
- Latvian
- Lithuanian
- Macedonian
- Malay
- Malayalam
- Marathi
- Mongolian
- Nepali
- Norwegian Bokmal
- Norwegian Nynorsk
- Oriya
- Persian
- Polish
- Portuguese
- Punjabi
- Romanian
- Russian
- Serbian
- Sindhi
- Sinhala
- Slovak
- Slovenian
- Spanish
- Swahili
- Swedish
- Tagalog
- Tamil
- Telugu
- Thai
- Turkish
- Ukrainian
- Urdu
- Uzbek
- Vietnamese
- Zulu

Variants:

- Arabic (MSA)
- Arabic (Gulf)
- Arabic (Najdi)
- Chinese (Simplified)
- Chinese (Traditional)
- Dutch (Netherlands)
- Dutch (Belgium)
- English (US)
- English (UK)
- English (India)
- Finnish (Standard)
- Finnish (Colloquial)
- French (France)
- French (Canada)
- French (Switzerland)
- German (Germany)
- German (Switzerland)
- Italian (Italy)
- Italian (Switzerland)
- Portuguese (Portugal)
- Portuguese (Brazil)
- Spanish (Spain)
- Spanish (North America)
- Spanish (Central America)
- Spanish (Andes)
- Spanish (Southern Cone)

See Lexical Resources

Lemmatization and Stemming

Sometimes, lemmatization and stemming are used as interchangeable terms; however, there are differences that are important to note.

See our blog post on the topic

Demos

We have both API and on-premise demo versions of the Bitext Lemmatizer. Let us know about your needs (languages, OS/platform, programming language…) and we will compile a customized evaluation version for you.

Request Lemmatizer Demo

Technical Specifications

Software

The Lemmatization service is powered by Bitext’s Lexical Analyzer, which uses a proprietary implementation of Finite State Automata (FSA) that allows for high performance and high levels of compression. Compression ratios can reach up to 100:1 (100 MB of raw data can be compressed into 1 MB) for morphologically complex languages like Finnish or Turkish.
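Bitext’s FSA implementation is proprietary, so the following is only a minimal sketch of the general idea, not the actual engine. It stores a lexicon as a trie, which is a simple acyclic deterministic finite-state automaton over characters; the `TrieLexicon` class and its methods are hypothetical names. A production automaton would also merge shared suffixes (minimization), which is where most of the compression comes from and which this sketch omits.

```python
class TrieLexicon:
    """A trie: an acyclic deterministic finite-state automaton over characters."""

    def __init__(self):
        self.transitions = [{}]  # state id -> {character: next state id}
        self.outputs = {}        # accepting state id -> list of lemmas

    def add(self, form, lemma):
        """Insert a form and attach a lemma to its accepting state."""
        state = 0
        for ch in form:
            nxt = self.transitions[state].get(ch)
            if nxt is None:
                nxt = len(self.transitions)
                self.transitions.append({})
                self.transitions[state][ch] = nxt
            state = nxt
        self.outputs.setdefault(state, []).append(lemma)

    def lookup(self, form):
        """Follow one transition per character; return lemmas or an empty list."""
        state = 0
        for ch in form:
            state = self.transitions[state].get(ch)
            if state is None:
                return []
        return self.outputs.get(state, [])


lex = TrieLexicon()
for form, lemma in [("speak", "speak"), ("speaks", "speak"),
                    ("spoke", "speak"), ("spoke", "spoke"), ("spokes", "spoke")]:
    lex.add(form, lemma)

print(lex.lookup("spoke"))  # ['speak', 'spoke']
```

A lookup costs one transition per character regardless of lexicon size, which is one reason automaton-based lexicons are fast.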

How can you use third-party software and still be on the safe side? We can provide the source code upon request under different modalities, such as escrow.

Footprint

Lexical data are encoded in a compressed format that allows direct lookups without decompression. For example, the run-time version for English uses 1 MB for 130,000 forms; for Spanish, 3,000,000 forms take up 2 MB; and for Finnish, 32,000,000 forms take up 3 MB. These figures include full morphosyntactic information.
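To put these footprint figures in perspective, the short calculation below derives the average storage cost per form. It assumes 1 MB means 10**6 bytes, which the figures above do not specify; `bytes_per_form` is a helper name introduced only for this example.

```python
MB = 1_000_000  # assumption: 1 MB = 10**6 bytes

# (number of forms, run-time size in bytes), taken from the figures quoted above
footprints = {
    "English": (130_000, 1 * MB),
    "Spanish": (3_000_000, 2 * MB),
    "Finnish": (32_000_000, 3 * MB),
}


def bytes_per_form(forms, size_bytes):
    """Average number of bytes used to store one form."""
    return size_bytes / forms


for language, (forms, size_bytes) in footprints.items():
    print(f"{language}: {bytes_per_form(forms, size_bytes):.2f} bytes per form")
```

Under that assumption, the averages come to roughly 7.7 bytes per form for English, 0.67 for Spanish and 0.09 for Finnish: the more inflected forms a language has, the more structure the encoding can share between them.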

Throughput

On an 18-core Intel® Xeon® W-2295 @ 3.00GHz with 128GB of RAM and a 512 GB SSD, the Lexical Analyzer can process:

~60,000,000 lookups per second, or
~750MB of text per second
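The two throughput figures can be related with a small calculation. This is a back-of-the-envelope sketch: it assumes 1 MB means 10**6 bytes and, for the per-core estimate, that all 18 cores are in use and throughput scales linearly across them, neither of which is stated above. The function names are introduced only for this example.

```python
LOOKUPS_PER_SECOND = 60_000_000
BYTES_PER_SECOND = 750 * 1_000_000  # assumption: 1 MB = 10**6 bytes
CORES = 18


def average_bytes_per_lookup(bytes_per_second, lookups_per_second):
    """Average amount of text consumed per lookup."""
    return bytes_per_second / lookups_per_second


def lookups_per_core(lookups_per_second, cores):
    """Per-core rate, assuming linear scaling across all cores (an assumption)."""
    return lookups_per_second / cores


print(average_bytes_per_lookup(BYTES_PER_SECOND, LOOKUPS_PER_SECOND))  # 12.5
print(round(lookups_per_core(LOOKUPS_PER_SECOND, CORES)))              # 3333333
```

The two figures are consistent with an average of 12.5 bytes of text per lookup.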

Scalability

The software is provided as a thread-safe library. The software is completely “self-contained” with minimal OS dependencies (just the standard C libraries) and with no need for complex installation procedures.

OS Independence

The Lexical Analyzer is written in platform-independent C. As a result, it can be run on any OS that can compile C: Windows, Linux, Solaris… It can also be run on mobile devices, given their current processing power.

Complementary tools: decompounding and segmentation

The Bitext Lemmatizer is fully integrated with additional tools that are required by some languages. A lemmatizer usually processes words, rather than compounds or parts of words; as a result, in these languages, running a lemmatizer without preprocessing would yield very poor results.

There are two main tools that solve these problems:

Decompounding Tool. Some languages, like German, Korean, Dutch, Norwegian or Swedish, can create new tokens/strings by joining words, as in the German word “Diskussionsthemen”. In these languages, decompounding is a necessary step before lemmatizing; otherwise, the error rate of the lemmatizer will be too high. In our example, “Diskussionsthemen” would be wrongly lemmatized unless the compound is first split into the noun “Diskussion” and the noun “Themen”, whose lemma is “Thema”.
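The decompounding step for the German example can be sketched as follows. This is a simplified illustration, not Bitext’s tool: the `NOUNS` table, the `LINKING_ELEMENTS` tuple and the `decompound` function are hypothetical, the lexicon contains only the entries needed for the example, and only two-part compounds are handled.

```python
# Toy noun lexicon (illustrative entries only): lowercase surface form -> lemma.
NOUNS = {
    "diskussion": "Diskussion",
    "thema": "Thema",
    "themen": "Thema",
}

# A few common German linking elements that can sit between compound parts.
LINKING_ELEMENTS = ("", "s", "es", "n", "en")


def decompound(word):
    """Return the lemmas of (first part, second part) for the first split found, else None."""
    w = word.lower()
    for i in range(1, len(w)):
        head, rest = w[:i], w[i:]
        if head not in NOUNS:
            continue
        for link in LINKING_ELEMENTS:
            tail = rest[len(link):]
            if rest.startswith(link) and tail in NOUNS:
                return NOUNS[head], NOUNS[tail]
    return None


print(decompound("Diskussionsthemen"))  # ('Diskussion', 'Thema')
print(decompound("Thema"))              # None
```

Here the linking “s” between “Diskussion” and “Themen” is skipped, and each part is then lemmatized on its own.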

Word Segmentation Tool. Some languages, like Chinese, Japanese, Vietnamese or Thai, do not separate words the way Romance or Germanic languages such as Spanish, French or English do. In short, words in these languages are not separated by spaces, so the lemmatizer needs to identify and separate them (although segmentation does not work the same way in each of these languages). For example, the Chinese sentence “把音量调低一点” will be split as 把 | 音量 | 调低 | 一点.
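One classic baseline for this task is forward maximum matching: at each position, take the longest dictionary word that matches. The sketch below applies it to the Chinese example. It is offered only as an illustration of dictionary-based segmentation and is not a description of Bitext’s method; `TOY_DICTIONARY` and `segment` are hypothetical names, and the dictionary holds only the words needed here.

```python
# Toy dictionary (illustrative entries only).
TOY_DICTIONARY = {"把", "音量", "调低", "一点"}
MAX_WORD_LENGTH = max(len(word) for word in TOY_DICTIONARY)


def segment(text):
    """Forward maximum matching: greedily take the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        longest = min(MAX_WORD_LENGTH, len(text) - i)
        for length in range(longest, 0, -1):
            candidate = text[i:i + length]
            # Fall back to a single character when nothing in the dictionary matches.
            if length == 1 or candidate in TOY_DICTIONARY:
                words.append(candidate)
                i += length
                break
    return words


print(" | ".join(segment("把音量调低一点")))  # 把 | 音量 | 调低 | 一点
```

Greedy matching can pick the wrong boundary when dictionary words overlap, which is one reason real segmenters are more sophisticated.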

Use Case

Bitext provides MarkLogic customers with the leading-edge benefits of Bitext’s Deep Linguistics Analysis Platform.

Benchmark

The report presents a comparison among NLTK, Stanford and Bitext. Check out the results now!

Demo

Do you have questions?

Schedule a conversation with one of our experts to find out which NLP solution works best for you.

Whitepaper

On how lemmatization and POS tagging can facilitate Machine Learning projects, using less data and less time.

Benchmark. We have run a benchmark against the most popular enterprise lemmatizers in the market: NLTK, Stanford, TwinWord, CST, spaCy and Simplemma (please contact us if you know of other lemmatizers that should be included here). The benchmark focuses on three main points:

- Linguistic accuracy in identifying the proper lemma (for now, English only).
- Processing speed: how many millions of words per second, and with what hardware resources.
- Customization and maintenance tasks.

You can download the benchmark here.

MADRID, SPAIN

Camino de las Huertas, 20, 28223 Pozuelo
Madrid, Spain

SAN FRANCISCO, USA

541 Jefferson Ave Ste 100, Redwood City
CA 94063, USA
