Linguistic Services

Bitext provides core tools to automatically pre-annotate custom corpora & datasets. These tools annotate both at the word level (lemmatization/stemming, inflection, etc.) and at the sentence level (Topic-Based Sentiment Analysis, Categorization, Parsing, etc.). We provide:

 

Lexical Services (No Grammar)

Sentence Segmentation

    • Splits text into sentences, according to language-specific punctuation rules.
    • Available in most languages (except Chinese, Japanese, Vietnamese, Thai, and some others).
    • Example: Hello! How are you doing? → Hello! | How are you doing?
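
As a rough illustration of punctuation-based splitting, the sketch below applies one regular expression for English-style sentence-final punctuation. It is not Bitext's implementation: the function name split_sentences and the rule itself are assumptions made for this example, and real language-specific rules also have to handle abbreviations, quotes, and similar cases.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Split after '.', '!' or '?' when whitespace follows (toy rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(" | ".join(split_sentences("Hello! How are you doing?")))
# prints: Hello! | How are you doing?
```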

Tokenization

    • Splits a sentence into words, according to language-specific space and punctuation rules.
    • Available in most languages (except Chinese, Japanese, Vietnamese, Thai…)
    • Example: How are you doing? → How | are | you | doing | ?
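
A minimal sketch of space-and-punctuation tokenization, assuming a simple rule where a token is either a run of word characters or a single punctuation mark. The function name tokenize is hypothetical and the rule is far simpler than production language-specific rules (it would split "don't" into three tokens, for instance).

```python
import re

def tokenize(sentence: str) -> list[str]:
    # A token is a run of word characters or one non-space, non-word symbol.
    return re.findall(r"\w+|[^\w\s]", sentence)

print(" | ".join(tokenize("How are you doing?")))
# prints: How | are | you | doing | ?
```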

Word Segmentation (No-space Tokenization)

    • Splits text into words for languages that do not use spaces to separate them.
    • Available in Chinese, Japanese, Vietnamese.
    • Example: 把音量调低一点 → 把 | 音量 | 调低 | 一点
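
One classic baseline for segmenting text without spaces is forward maximum matching against a word list. The sketch below uses that technique with a tiny hypothetical lexicon (LEXICON) chosen to cover the example sentence; it is meant only to show the idea and makes no claim about how the Bitext service works.

```python
# Toy lexicon, assumed for this example only.
LEXICON = {"把", "音量", "调低", "一点"}

def segment(text: str, lexicon: set[str], max_len: int = 4) -> list[str]:
    """Forward maximum matching: take the longest known word at each position."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing longer is known.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words

print(" | ".join(segment("把音量调低一点", LEXICON)))
# prints: 把 | 音量 | 调低 | 一点
```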

Decompounding

    • Splits compound words/tokens into their individual component words.
    • Available in German, Dutch, Norwegian, Swedish, Korean.
    • Example: Rindfleischetikettierung → Rind | Fleisch | Etikettierung
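
Decompounding can be approximated with a dictionary lookup that recursively peels known words off the front of the compound. The sketch below assumes a tiny hypothetical lexicon (LEXICON) and ignores linking elements such as the German "s" or "n"; it illustrates the idea and is not the method used by the service.

```python
# Toy lexicon of lowercase component words, assumed for this example only.
LEXICON = {"rind", "fleisch", "etikettierung"}

def decompound(word: str, lexicon: set[str]) -> list[str]:
    """Return the component words, or [word] if no full split is found."""
    def split(rest: str):
        if not rest:
            return []
        # Try the longest known prefix first, then backtrack if needed.
        for end in range(len(rest), 0, -1):
            head = rest[:end]
            if head in lexicon:
                tail = split(rest[end:])
                if tail is not None:
                    return [head] + tail
        return None

    parts = split(word.lower())
    return [p.capitalize() for p in parts] if parts else [word]

print(" | ".join(decompound("Rindfleischetikettierung", LEXICON)))
# prints: Rind | Fleisch | Etikettierung
```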

Lemmatization (Ambiguous)

    • Returns the possible roots for a word form
    • Available in most languages (except Chinese, Vietnamese, Thai, and other languages with little or no inflection)
    • Example: running → run

POS Tagging (Ambiguous)

    • Returns the possible parts of speech (and optionally other attributes) of a word
    • Available in all languages
    • Example: run → verb (infinitive), verb (1st person singular, present tense), noun (singular)

Inflection

    • Returns all forms of a root word
    • Available in most languages (except Chinese, Vietnamese, Thai, and other languages with little or no inflection)
    • Example: run → run, runs, ran, running
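
Inflection and ambiguous lemmatization can be viewed as inverse lookups over the same lexicon. The sketch below assumes a small hand-written table (INFLECTIONS) and hypothetical function names; it shows why a lemmatizer without sentence context may return more than one root for a form such as "saw".

```python
from collections import defaultdict

# Toy inflection table (lemma -> forms), assumed for this example only.
INFLECTIONS = {
    "run": ["run", "runs", "ran", "running"],
    "see": ["see", "sees", "saw", "seen", "seeing"],
    "saw": ["saw", "saws", "sawed", "sawing"],
}

# Invert the table so each word form maps to every lemma that can produce it.
LEMMAS = defaultdict(set)
for lemma, forms in INFLECTIONS.items():
    for form in forms:
        LEMMAS[form].add(lemma)

def inflect(lemma: str) -> list[str]:
    return INFLECTIONS.get(lemma, [])

def lemmatize(form: str) -> list[str]:
    return sorted(LEMMAS.get(form, ()))

print(inflect("run"))        # ['run', 'runs', 'ran', 'running']
print(lemmatize("running"))  # ['run']
print(lemmatize("saw"))      # ['saw', 'see']
```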

Language Identification

    • Detects the language(s) used in each sentence of a longer input text
    • Available in all languages
    • Example: Oui! I love Paris → “Oui!” – French, “I love Paris” – English
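
To make the per-sentence behavior concrete, here is a toy sketch that splits the input into sentences and scores each one against small cue-word lists. The word lists (CUE_WORDS) and the function name identify are assumptions for illustration; production language identifiers typically rely on much richer statistics, and this is not a description of the Bitext service.

```python
import re

# Tiny cue-word lists, assumed for this example only.
CUE_WORDS = {
    "French": {"oui", "non", "je", "le", "la"},
    "English": {"i", "yes", "no", "the", "love"},
}

def identify(text: str) -> list[tuple[str, str]]:
    """Label each sentence with the language whose cue words match most."""
    results = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        tokens = set(re.findall(r"\w+", sentence.lower()))
        scores = {lang: len(tokens & words) for lang, words in CUE_WORDS.items()}
        best = max(scores, key=scores.get)
        results.append((sentence, best if scores[best] > 0 else "unknown"))
    return results

print(identify("Oui! I love Paris"))
# prints: [('Oui!', 'French'), ('I love Paris', 'English')]
```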

Spell Checking

    • Checks if a word is spelled correctly
    • Available in all languages
    • Example: excelent → incorrect

Spell Suggestions

    • Suggests corrections for incorrectly spelled words
    • Available in all languages
    • Example: excelent → excellent
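
Spell checking and spell suggestions can both be sketched as operations on a vocabulary: a membership test, and a search for similar entries. The example below uses Python's standard difflib module with a tiny hypothetical word list (VOCABULARY); it is an illustration under those assumptions, not the service's algorithm.

```python
import difflib

# Toy vocabulary, assumed for this example only.
VOCABULARY = ["excellent", "excel", "except", "example"]

def is_correct(word: str) -> bool:
    return word.lower() in VOCABULARY

def suggest(word: str, limit: int = 3) -> list[str]:
    # Keep only candidates whose similarity ratio is at least 0.8.
    return difflib.get_close_matches(word.lower(), VOCABULARY, n=limit, cutoff=0.8)

print(is_correct("excelent"))  # False
print(suggest("excelent"))     # ['excellent']
```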

Syntactic and Semantic Services (Grammar and Meaning)

Entity Extraction

    • Detects proper names (like people and places) and other special text (like phone numbers and URLs)
    • Available in Dutch, English, French, German, Italian, Portuguese, Spanish.
    • Example: John lives in New York → “John” – person name, “New York” – place

Offensive Language Detection

    • Detects offensive or vulgar expressions in text
    • Available in all languages.
    • Example: tell John to f*ck off → “f*ck off” – offensive

Anonymization

    • Removes sensitive or personally identifiable information (PII) from text
    • Available in Dutch, English, French, German, Italian, Portuguese, Spanish
    • Example: My name is John and my account number is 1234567 → My name is XXXX and my account number is XXXX
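
The masking step can be sketched with regular expressions. The example below assumes a hypothetical name list (KNOWN_NAMES) and treats any run of four or more digits as sensitive; a real anonymizer would depend on entity extraction rather than a fixed list, so this is only an illustration of the output format.

```python
import re

# Toy gazetteer of personal names, assumed for this example only.
KNOWN_NAMES = {"John"}

def anonymize(text: str, mask: str = "XXXX") -> str:
    # Mask long digit runs (account numbers and similar identifiers).
    text = re.sub(r"\b\d{4,}\b", mask, text)
    # Mask any known personal name that appears as a whole word.
    names = "|".join(re.escape(n) for n in sorted(KNOWN_NAMES))
    return re.sub(rf"\b({names})\b", mask, text)

print(anonymize("My name is John and my account number is 1234567"))
# prints: My name is XXXX and my account number is XXXX
```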

POS Tagging (Disambiguated)

    • Returns the part of speech for each word in a sentence
    • Available in English, Dutch, Danish, Czech, Catalan
    • Example: John runs back home → “John” – proper noun, “runs” – verb, “back” – adverb, “home” – noun

Phrase Extraction

    • Returns the constituents (like noun phrases and verb phrases) of a sentence
    • Available in English, French, German, Dutch, Italian, Portuguese, Spanish, Catalan.
    • Example: John’s sister was performing in the theatre → “John’s sister” – NP, “was performing” – VP, “in the theatre” – PP

Topic-Based Sentiment Analysis

    • Returns the sentiment and corresponding topic of opinions in text
    • Available in Catalan, Dutch, English, French, German, Italian, Portuguese, Spanish.
    • Example: I hate my old phone → opinion: “hate” (negative), topic: “my old phone”

Categorization

    • Returns the categories applicable to a text, based on pre-defined rules
    • Available in Dutch, English, French, German, Italian, Portuguese, Spanish. 
    • Example: John is feeling great. → HAPPINESS [RULE: feel + great → HAPPINESS]
    • Example: John was weeping like a willow. → SADNESS [RULE: weep + like + willow → SADNESS]
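
The two rules shown in the examples can be read as "assign the category when every keyword lemma occurs in the text". The sketch below encodes that reading with a toy lemma map; the data structures (RULES, LEMMA_MAP) and the matching logic are assumptions for illustration, and the actual rule formalism may be richer (word order, distance, negation).

```python
import re

# Rules taken from the examples above: all keyword lemmas must be present.
RULES = [
    ({"feel", "great"}, "HAPPINESS"),
    ({"weep", "like", "willow"}, "SADNESS"),
]

# Toy lemma map standing in for a real lemmatizer (assumed for this example).
LEMMA_MAP = {"feeling": "feel", "weeping": "weep"}

def categorize(text: str) -> list[str]:
    tokens = re.findall(r"\w+", text.lower())
    lemmas = {LEMMA_MAP.get(t, t) for t in tokens}
    # A rule fires when its keyword set is a subset of the text's lemmas.
    return [label for keywords, label in RULES if keywords <= lemmas]

print(categorize("John is feeling great."))           # ['HAPPINESS']
print(categorize("John was weeping like a willow."))  # ['SADNESS']
```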

Parsing

    • Produces a tree with the hierarchical constituent parts of a sentence (words, phrases, clauses, etc.)
    • Available in Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, French, German, Hungarian, Italian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Ukrainian.
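
To show what a constituent tree looks like as data, the sketch below hand-builds one for the sentence "John runs back home" as nested (label, children) tuples and renders it in bracketed form. The tree shape, the labels, and the helper names are assumptions for illustration; the service's actual output format is not specified here.

```python
# A hand-built tree: each node is (label, children); leaves are plain words.
TREE = ("S", [
    ("NP", ["John"]),
    ("VP", ["runs", ("ADVP", ["back", "home"])]),
])

def bracket(node) -> str:
    """Render a tree in the common bracketed notation."""
    if isinstance(node, str):
        return node
    label, children = node
    return "(" + label + " " + " ".join(bracket(c) for c in children) + ")"

def leaves(node) -> list[str]:
    """Collect the words of the tree in sentence order."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1] for word in leaves(child)]

print(bracket(TREE))  # (S (NP John) (VP runs (ADVP back home)))
print(leaves(TREE))   # ['John', 'runs', 'back', 'home']
```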

Languages

    • Afrikaans
    • Albanian
    • Amharic
    • Arabic
    • Armenian
    • Assamese
    • Azeri
    • Basque
    • Belarusian
    • Bengali
    • Bulgarian
    • Burmese
    • Catalan
    • Chinese
    • Croatian
    • Czech
    • Danish
    • Dutch
    • English
    • Esperanto
    • Estonian
    • Finnish
    • French
    • Galician
    • Georgian
    • German
    • Greek
    • Gujarati
    • Hebrew
    • Hindi
    • Hungarian
    • Icelandic
    • Indonesian
    • Irish Gaelic
    • Italian
    • Japanese
    • Kannada
    • Kazakh
    • Khmer
    • Korean
    • Kyrgyz
    • Lao
    • Latvian
    • Lithuanian
    • Macedonian
    • Malay
    • Malayalam
    • Marathi
    • Mongolian
    • Nepali
    • Norwegian Bokmål
    • Norwegian Nynorsk
    • Oriya
    • Persian
    • Polish
    • Portuguese
    • Punjabi
    • Romanian
    • Russian
    • Serbian
    • Sindhi
    • Sinhala
    • Slovak
    • Slovenian
    • Spanish
    • Swahili
    • Swedish
    • Tagalog
    • Tamil
    • Telugu
    • Thai
    • Turkish
    • Ukrainian
    • Urdu
    • Uzbek
    • Vietnamese
    • Zulu

Data Samples & Language Specifications

Variants

    • Arabic (MSA)
    • Arabic (Gulf)
    • Arabic (Najdi)
    • Chinese (Simplified)
    • Chinese (Traditional)
    • Dutch (Netherlands)
    • Dutch (Belgium)
    • English (US)
    • English (UK)
    • English (India)
    • Finnish (Standard)
    • Finnish (Colloquial)
    • French (France)
    • French (Canada)
    • French (Switzerland)
    • German (Germany)
    • German (Switzerland)
    • Italian (Italy)
    • Italian (Switzerland)
    • Portuguese (Portugal)
    • Portuguese (Brazil)
    • Spanish (Spain)
    • Spanish (North America)
    • Spanish (Central America)
    • Spanish (Andes)
    • Spanish (Southern Cone)

 

Contact us for more information about our evaluation and training data

MADRID, SPAIN

Camino de las Huertas, 20, 28223 Pozuelo
Madrid, Spain

SAN FRANCISCO, USA

541 Jefferson Ave Ste 100, Redwood City
CA 94063, USA