Linguistic Services

Bitext provides core tools to automatically pre-annotate custom corpora and datasets. These tools annotate both at the word level (inflection, etc.) and at the sentence level (Topic-Based Sentiment Analysis, Categorization, Parsing, etc.). We provide:

Lexical Services (No Grammar)

Sentence Segmentation

- Splits text into sentences, according to language-specific punctuation rules.
- Available in all languages.
- Example: Hello! How are you doing? → Hello! | How are you doing?
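
The segmentation example above can be approximated with a minimal sketch. This is not Bitext's implementation: the function name `split_sentences` is hypothetical, and the single rule used here (split after `.`, `!` or `?` followed by whitespace) is a simplifying assumption that ignores abbreviations and language-specific punctuation.

```python
import re

# Assumption: a sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences at terminal punctuation followed by whitespace."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


print(" | ".join(split_sentences("Hello! How are you doing?")))
# prints: Hello! | How are you doing?
```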

Tokenization

- Splits a sentence into words, according to language-specific space and punctuation rules.
- Available in all languages (except Chinese, Japanese, Vietnamese, Thai…)
- Example: How are you doing? → How | are | you | doing | ?
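
A rough sketch of space-and-punctuation tokenization follows. The name `tokenize` and the regular expression are illustrative assumptions, not the product's rules: runs of word characters become one token and every other non-space character becomes its own token.

```python
import re

# Assumption: a token is either a run of word characters or one punctuation mark.
_TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenize(sentence: str) -> list[str]:
    """Split a sentence into word and punctuation tokens."""
    return _TOKEN.findall(sentence)


print(" | ".join(tokenize("How are you doing?")))
# prints: How | are | you | doing | ?
```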

Word Segmentation (No-space Tokenization)

- Splits text into words for languages that do not use spaces to separate them.
- Available in Chinese, Japanese, Vietnamese.
- Example: 把音量调低一点 → 把 | 音量 | 调低 | 一点
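
One classic way to segment text without spaces is forward maximum matching against a word list. The sketch below uses that technique for illustration only; the source does not say which method Bitext uses. The lexicon `TOY_LEXICON` and the function `segment` are hypothetical.

```python
def segment(text: str, lexicon: set[str]) -> list[str]:
    """Forward maximum matching: take the longest lexicon word at each position."""
    max_len = max((len(w) for w in lexicon), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Toy lexicon (assumption) covering only the example sentence.
TOY_LEXICON = {"把", "音量", "调低", "一点"}
print(" | ".join(segment("把音量调低一点", TOY_LEXICON)))
# prints: 把 | 音量 | 调低 | 一点
```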

Decompounding

- Splits compound words/tokens into their individual component words.
- Available in German, Dutch, Norwegian, Swedish, Korean and Finnish.
- Example: Rindfleischetikettierung → Rind | Fleisch | Etikettierung
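
Decompounding can be sketched as a recursive dictionary lookup. The code below is an assumption-laden illustration, not the actual service: `decompound` and `TOY_LEXICON` are hypothetical, the lexicon holds only the three words from the example, and linking elements between components (common in German compounds) are not handled.

```python
from typing import Optional


def decompound(word: str, lexicon: set[str]) -> Optional[list[str]]:
    """Return the component words of a compound, or None if no full split exists."""
    lower = word.lower()
    if not lower:
        return []
    # Try the longest known prefix first, then recurse on the remainder.
    for end in range(len(lower), 0, -1):
        head = lower[:end]
        if head in lexicon:
            rest = decompound(lower[end:], lexicon)
            if rest is not None:
                return [head.capitalize()] + rest
    return None


# Toy lexicon (assumption), stored in lowercase.
TOY_LEXICON = {"rind", "fleisch", "etikettierung"}
print(" | ".join(decompound("Rindfleischetikettierung", TOY_LEXICON)))
# prints: Rind | Fleisch | Etikettierung
```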

Lemmatization

- Returns the possible roots for a word form
- Available in most languages (except Chinese, Vietnamese, Thai and other languages without inflection)
- Example: running → run

POS Tagging (Ambiguous)

- Returns the possible parts of speech (and optionally other attributes) of a word
- Available in all languages
- Example: run → verb (infinitive), verb (1st person singular, present tense), noun (singular)

Inflection

- Returns all forms of a root word
- Available in most languages (except Chinese, Vietnamese, Thai, and other languages without inflection)
- Example: run → run, runs, ran, running
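
Inflection and lemmatization are inverse lookups, which a small table can illustrate. The sketch below is not the service's API: `INFLECTIONS`, `inflect` and `lemmatize` are hypothetical names, and the table contains only the `run` example from this page.

```python
from collections import defaultdict

# Toy inflection table (assumption): root -> all of its forms.
INFLECTIONS = {"run": ["run", "runs", "ran", "running"]}

# Invert the table so each surface form maps to its possible roots.
LEMMAS = defaultdict(set)
for root, forms in INFLECTIONS.items():
    for form in forms:
        LEMMAS[form].add(root)


def inflect(root: str) -> list[str]:
    """Return all known forms of a root word."""
    return INFLECTIONS.get(root, [])


def lemmatize(form: str) -> list[str]:
    """Return the possible roots of a word form (empty if unknown)."""
    return sorted(LEMMAS.get(form, set()))


print(inflect("run"))        # prints: ['run', 'runs', 'ran', 'running']
print(lemmatize("running"))  # prints: ['run']
```

`lemmatize` returns a list because a single form can belong to more than one root, which matches the "possible roots" wording above.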

Language identification

- Detects the language(s) used in each sentence of a longer input text
- Available in all languages
- Example: Oui! I love Paris → “Oui!” – French, “I love Paris” – English
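
Per-sentence language identification can be sketched with a vocabulary-overlap score. Everything here is a simplifying assumption rather than the product's method: `TOY_VOCAB` lists a handful of words per language, and `identify` and `identify_per_sentence` are hypothetical names.

```python
import re

# Toy vocabularies (assumption); a real system would use far richer models.
TOY_VOCAB = {
    "French": {"oui", "non", "je", "merci"},
    "English": {"i", "love", "you", "the"},
}


def identify(sentence: str) -> str:
    """Pick the language whose toy vocabulary overlaps most with the sentence."""
    words = set(re.findall(r"\w+", sentence.lower()))
    scores = {lang: len(words & vocab) for lang, vocab in TOY_VOCAB.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def identify_per_sentence(text: str) -> list[tuple[str, str]]:
    """Split text into sentences and label each one with a language."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [(s, identify(s)) for s in sentences if s]


print(identify_per_sentence("Oui! I love Paris"))
# prints: [('Oui!', 'French'), ('I love Paris', 'English')]
```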

Spell Checking

- Checks if a word is spelled correctly
- Available in all languages
- Example: excelent → incorrect

Spell Suggestions

- Suggests corrections for incorrectly spelled words
- Available in all languages
- Example: excelent → excellent
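
Spell checking and spell suggestions can be sketched with Python's standard `difflib` module. The word list `TOY_DICTIONARY`, the function names, and the `cutoff=0.8` similarity threshold are illustrative assumptions; the source does not describe how the service ranks suggestions.

```python
import difflib

# Toy dictionary (assumption); a real checker would load a full lexicon.
TOY_DICTIONARY = ["excellent", "excel", "except", "element"]


def is_correct(word: str) -> bool:
    """A word counts as correctly spelled if it appears in the dictionary."""
    return word.lower() in TOY_DICTIONARY


def suggest(word: str, n: int = 3) -> list[str]:
    """Return up to n dictionary words that closely resemble the input."""
    return difflib.get_close_matches(word.lower(), TOY_DICTIONARY, n=n, cutoff=0.8)


print(is_correct("excelent"))  # prints: False
print(suggest("excelent"))
```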

Syntactic and Semantic Services (Grammar and Meaning)

Entity Extraction

- Detects proper names (like people and places) and other special text (like phones and URLs)
- Available in Dutch, English, French, German, Italian, Portuguese, Spanish.
- Example: John lives in New York → “John” – person name, “New York” – place

Offensive Language Detection

- Detects offensive or vulgar expressions in text
- Available in all languages.
- Example: tell John to f*ck off → “f*ck off” – offensive

Anonymization

- Removes sensitive or personal information (PII) from text
- Available in Dutch, English, French, German, Italian, Portuguese, Spanish
- Example: My name is John and my account number is 1234567 → My name is XXXX and my account number is XXXX
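
The anonymization example above can be reproduced with a minimal masking sketch. This is an illustration under stated assumptions, not the actual service: names are found with a toy gazetteer (`KNOWN_NAMES`) rather than real entity extraction, and any run of four or more digits is treated as a sensitive number.

```python
import re

# Toy gazetteer (assumption); a real system would rely on entity extraction.
KNOWN_NAMES = {"John"}

# Assumption: any standalone run of 4+ digits is a sensitive number.
_DIGITS = re.compile(r"\b\d{4,}\b")
_NAMES = re.compile(r"\b(?:" + "|".join(map(re.escape, sorted(KNOWN_NAMES))) + r")\b")


def anonymize(text: str, mask: str = "XXXX") -> str:
    """Replace known names and long digit sequences with a mask."""
    text = _DIGITS.sub(mask, text)
    return _NAMES.sub(mask, text)


print(anonymize("My name is John and my account number is 1234567"))
# prints: My name is XXXX and my account number is XXXX
```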

POS-Tagging (Disambiguated)

- Returns the part of speech for each word in a sentence
- Available in English, Dutch, Danish, Czech, Catalan
- Example: John runs back home → “John” – proper noun, “runs” – verb, “back” – preposition, “home” – noun

Phrase Extraction

- Returns the constituents (like noun phrases and verb phrases) of a sentence
- Available in English, French, German, Dutch, Italian, Portuguese, Spanish, Catalan.
- Example: John’s sister was performing in the theatre → “John’s sister” – NP, “was performing” – VP, “in the theatre” – PP

Topic-Based Sentiment Analysis

- Returns the sentiment and corresponding topic of opinions in text
- Available in Catalan, Dutch, English, French, German, Italian, Portuguese, Spanish.
- Example: I hate my old phone → opinion: “hate” (negative), topic: “my old phone”

Categorization

- Returns the categories applicable to a text, based on pre-defined rules
- Available in Dutch, English, French, German, Italian, Portuguese, Spanish.
- Example: John is feeling great. → HAPPINESS [RULE: feel + great → HAPPINESS]
- Example: John was weeping like a willow. → SADNESS [RULE: weep + like + willow → SADNESS]
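
The two rules above can be sketched as sets of lemmas that must all occur in the text. The rule format, the `TOY_LEMMAS` table and the function `categorize` are hypothetical. This sketch also ignores word order, which the real rules may take into account.

```python
import re

# Toy lemma table (assumption) covering only the inflected words in the examples.
TOY_LEMMAS = {"feeling": "feel", "weeping": "weep", "is": "be", "was": "be"}

# Each rule: (lemmas that must all be present, category to assign).
RULES = [
    (("feel", "great"), "HAPPINESS"),
    (("weep", "like", "willow"), "SADNESS"),
]


def categorize(text: str) -> list[str]:
    """Return every category whose required lemmas all appear in the text."""
    lemmas = {TOY_LEMMAS.get(w, w) for w in re.findall(r"\w+", text.lower())}
    return [category for required, category in RULES if set(required) <= lemmas]


print(categorize("John is feeling great."))           # prints: ['HAPPINESS']
print(categorize("John was weeping like a willow."))  # prints: ['SADNESS']
```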

Parsing

- Produces a tree with the hierarchical constituent parts of a sentence (words, phrases, clauses, etc.)
- Available in Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, French, German, Hungarian, Italian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Ukrainian

Languages

- Afrikaans
- Albanian
- Amharic
- Arabic
- Armenian
- Assamese
- Azeri
- Basque
- Belarusian
- Bengali
- Bulgarian
- Burmese
- Catalan
- Chinese
- Croatian
- Czech
- Danish
- Dutch
- English
- Esperanto
- Estonian
- Finnish
- French
- Galician
- Georgian
- German
- Greek
- Gujarati
- Hebrew
- Hindi
- Hungarian
- Icelandic
- Indonesian
- Irish Gaelic
- Italian
- Japanese
- Kannada
- Kazakh
- Khmer
- Korean
- Kyrgyz
- Lao
- Latvian
- Lithuanian
- Macedonian
- Malay
- Malayalam
- Marathi
- Mongolian
- Nepali
- Norwegian Bokmal
- Norwegian Nynorsk
- Oriya
- Persian
- Polish
- Portuguese
- Punjabi
- Romanian
- Russian
- Serbian
- Sindhi
- Sinhala
- Slovak
- Slovenian
- Spanish
- Swahili
- Swedish
- Tagalog
- Tamil
- Telugu
- Thai
- Turkish
- Ukrainian
- Urdu
- Uzbek
- Vietnamese
- Zulu

Data Samples & Language Specifications

- Kazakh: Data Sample, Language Specifications
- Armenian: Data Sample, Language Specifications
- Slovak: Data Sample, Language Specifications
- Mongolian: Data Sample, Language Specifications
- Russian: Data Sample, Language Specifications
- Portuguese: Data Sample, Language Specifications
- Malayalam: Data Sample, Language Specifications
- Urdu: Data Sample, Language Specifications
- Catalan: Data Sample, Language Specifications

Variants

- Arabic (MSA)
- Arabic (Gulf)
- Arabic (Najdi)
- Chinese (Simplified)
- Chinese (Traditional)
- Dutch (Netherlands)
- Dutch (Belgium)
- English (US)
- English (UK)
- English (India)
- Finnish (Standard)
- Finnish (Colloquial)
- French (France)
- French (Canada)
- French (Switzerland)
- German (Germany)
- German (Switzerland)
- Italian (Italy)
- Italian (Switzerland)
- Portuguese (Portugal)
- Portuguese (Brazil)
- Spanish (Spain)
- Spanish (North America)
- Spanish (Central America)
- Spanish (Andes)
- Spanish (Southern Cone)

Contact us for more information about our evaluation and training data

MADRID, SPAIN

Camino de las Huertas, 20, 28223 Pozuelo
Madrid, Spain

SAN FRANCISCO, USA

541 Jefferson Ave Ste 100, Redwood City
CA 94063, USA