Bitext NLP Data Overview

Market leaders trust Bitext to develop comprehensive NLP datasets and multilingual tools, such as lexical, syntactic, and semantic annotation tools, in up to 77 languages.

Bitext’s Deep Linguistic Analysis Platform

Bitext offers a multilingual platform, designed for enterprise use, to analyze and tag text at three levels:

  • Lexical
  • Syntactic
  • Semantic

Lexical Level and Lemmatization

At the lexical level, the main component is the lemmatizer, which integrates tools for decompounding and word segmentation (steps that some languages require for proper lemmatization).
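To illustrate why decompounding matters for lemmatization, here is a minimal sketch in Python. It is not Bitext's implementation or API: the lexicon, the function names `decompound` and `lemmatize`, and the longest-prefix splitting strategy are all assumptions made for the example. It shows that a compound missing from the lexicon can still be lemmatized once it is split into known parts.

```python
# Toy lexicon mapping surface forms to lemmas (hypothetical data, illustration only).
LEXICON = {
    "haus": "haus",
    "häuser": "haus",
    "tür": "tür",
    "türen": "tür",
    "mice": "mouse",
}


def decompound(word, lexicon):
    """Split a word into known lexicon entries, trying the longest prefix first.

    Returns a list of parts, or None if the word cannot be fully split.
    """
    if word in lexicon:
        return [word]
    for i in range(len(word) - 1, 0, -1):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            rest = decompound(tail, lexicon)
            if rest is not None:
                return [head] + rest
    return None


def lemmatize(word, lexicon=LEXICON):
    """Return the lemma of a word, falling back to decompounding for unknown words."""
    word = word.lower()
    if word in lexicon:
        return lexicon[word]
    parts = decompound(word, lexicon)
    if parts is None:
        return word  # unknown word: leave it unchanged
    # Assumption: only the final component carries inflection, so only it is lemmatized.
    return "".join(parts[:-1] + [lexicon[parts[-1]]])


print(lemmatize("Haustüren"))  # haustür
print(lemmatize("mice"))       # mouse
```

A production lemmatizer would also handle linking elements between compound parts, ambiguous splits, and part-of-speech context, none of which this sketch attempts.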

The lemmatizer can also be packaged to cover the full language-analysis pipeline, from sentence segmentation to full parsing, and includes tools such as spell-checking.

The two components of the lemmatizer, data and software, can be distributed together or separately. All these tools are available in 77 languages and 25 language variants.

Syntactic Level and Parsing

At the syntactic level, the main component is the parser. It analyzes the structure of the sentences in a text and is used for tasks such as POS tagging and phrase extraction. It also serves as the base component for various semantic-level tasks, such as Named Entity Recognition (NER), topic-level sentiment analysis, and synthetic text generation. We have developed parsers for 21 languages and continue to add more.
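As a rough illustration of how phrase extraction can build on POS tagging, the sketch below collects simple noun phrases from tokens that are already tagged. It is not the Bitext parser: the function name `extract_noun_phrases`, the tag labels, and the pattern (optional adjectives followed by one or more nouns) are assumptions chosen for the example, and a full parser would work from sentence structure rather than a flat pattern.

```python
def extract_noun_phrases(tagged):
    """Collect maximal runs of the form ADJ* NOUN+ from (token, tag) pairs."""
    phrases = []
    current = []      # tokens of the candidate phrase being built
    has_noun = False  # True once the candidate contains at least one noun
    for token, tag in tagged:
        if tag == "NOUN":
            current.append(token)
            has_noun = True
        elif tag == "ADJ" and not has_noun:
            current.append(token)
        else:
            # Any other tag (or an adjective after a noun) closes the candidate.
            if has_noun:
                phrases.append(" ".join(current))
            current = [token] if tag == "ADJ" else []
            has_noun = False
    if has_noun:
        phrases.append(" ".join(current))
    return phrases


# Hand-tagged input (hypothetical); in practice the tags would come from a POS tagger.
sentence = [
    ("The", "DET"), ("new", "ADJ"), ("parser", "NOUN"), ("extracts", "VERB"),
    ("noun", "NOUN"), ("phrases", "NOUN"), ("quickly", "ADV"),
]
print(extract_noun_phrases(sentence))  # ['new parser', 'noun phrases']
```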

For a full list of services at the lexical, syntactic, and semantic levels, see our linguistic services.



Camino de las Huertas, 20, 28223 Pozuelo
Madrid, Spain


541 Jefferson Ave Ste 100, Redwood City
CA 94063, USA