Bitext NLP Data Overview

Bitext develops NLP datasets and multilingual tools, including lexical, syntactic, and semantic annotation tools, in up to 77 languages. These products are trusted by market leaders.


Bitext Deep Linguistic Analysis Platform

Bitext offers a multilingual platform to analyze and tag text at three levels:

  • lexical
  • syntactic
  • semantic

The platform has been designed for enterprise use.

Lexical Level and Lemmatization

At the lexical level, the main component is the lemmatizer, integrated with tools for decompounding or word segmentation (required by some languages to perform proper lemmatization).
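To illustrate why decompounding matters for lemmatization, here is a minimal sketch in Python. It is not Bitext's implementation or API: the `LEXICON` data and the `decompound` and `lemmatize` functions are hypothetical names invented for this example, and the approach (greedy dictionary lookup, lemmatizing only the last part of a compound) is a simplifying assumption.

```python
# Toy lexicon mapping surface forms to lemmas.
# Hypothetical data for illustration only, not Bitext's lexical data.
LEXICON = {
    "auto": "auto", "autos": "auto",
    "bahn": "bahn", "bahnen": "bahn",
    "tisch": "tisch", "tische": "tisch",
    "lampe": "lampe", "lampen": "lampe",
}


def decompound(word, lexicon):
    """Split a word into known lexicon entries; return None if impossible."""
    word = word.lower()
    if word in lexicon:
        return [word]
    # Try the longest known prefix first, then split the remainder recursively.
    for i in range(len(word) - 1, 0, -1):
        head = word[:i]
        if head in lexicon:
            rest = decompound(word[i:], lexicon)
            if rest is not None:
                return [head] + rest
    return None


def lemmatize(word, lexicon=LEXICON):
    """Lemmatize a word, decompounding it first when it is not in the lexicon."""
    parts = decompound(word, lexicon)
    if parts is None:
        return word.lower()  # unknown word: fall back to the lowercased form
    # Keep the modifiers as written and lemmatize only the final part.
    return "".join(parts[:-1]) + lexicon[parts[-1]]


print(lemmatize("Autobahnen"))
print(lemmatize("Tischlampen"))
```

Without the decompounding step, a compound such as "Autobahnen" would simply be missing from the lexicon; splitting it into "auto" and "bahnen" lets the lookup succeed on the final part.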

The lemmatizer can also be packaged as part of a full language-analysis pipeline, from sentence segmentation to full parsing, including tools such as spell checking.

The two components of the lemmatizer, data and software, can be distributed together or separately. All these tools are available in 77 languages and 25 language variants.

Bitext Lemmatizer

This page describes how the Bitext Lemmatizer works.

Syntactic Level and Parsing

At the syntactic level, the parser is the main component. It analyzes the structure of the sentences in the text and is used for tasks such as POS tagging and phrase extraction. It also serves as the base component for various semantic-level tasks, such as Named Entity Recognition (NER), topic-level sentiment analysis, and synthetic text generation. We have developed parsers for 21 languages and are always adding new languages.
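As a rough illustration of how phrase extraction can build on POS tags, the sketch below pulls simple noun phrases out of a sequence of already tagged tokens. It is not Bitext's parser: the function name `extract_noun_phrases`, the tag names (`DET`, `ADJ`, `NOUN`, `VERB`), and the pattern (an optional determiner, any adjectives, then one or more nouns) are assumptions chosen for this example.

```python
def extract_noun_phrases(tagged):
    """Return phrases matching DET? ADJ* NOUN+ from (token, tag) pairs."""
    phrases = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        # Optional determiner.
        if tagged[j][1] == "DET":
            j += 1
        # Any number of adjectives.
        while j < n and tagged[j][1] == "ADJ":
            j += 1
        # One or more nouns.
        k = j
        while k < n and tagged[k][1] == "NOUN":
            k += 1
        if k > j:  # at least one noun was found, so the pattern matched
            phrases.append(" ".join(token for token, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return phrases


# Hypothetical tagged input; a real pipeline would get these tags from a tagger.
sentence = [
    ("The", "DET"), ("parser", "NOUN"), ("analyzes", "VERB"),
    ("the", "DET"), ("long", "ADJ"), ("input", "NOUN"), ("sentences", "NOUN"),
]
print(extract_noun_phrases(sentence))
```

A full parser produces richer structure than this flat pattern match, but the example shows how tagged tokens can feed a downstream extraction step.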

For a full list of services at the lexical, syntactic, and semantic levels, see our linguistic services.