Linguistic Services

Bitext provides core tools to automatically pre-annotate custom corpora & datasets. These tools annotate both at the word level (lemmatization/stemming, inflection…) and at the sentence level (Topic-Based Sentiment Analysis, Categorization, Parsing…). We provide:

Lexical services (no grammar)

Sentence segmentation

    • Splits text into sentences, according to language-specific punctuation rules.
    • Available in all languages.
    • Example: Hello! How are you doing? → Hello! | How are you doing?
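
As a rough illustration of the idea (not Bitext's implementation), a naive splitter can break text after sentence-final punctuation. The function name `split_sentences` and the single punctuation rule are assumptions made for this sketch; real language-specific rules also have to handle abbreviations, quotes and ellipses.

```python
import re

def split_sentences(text):
    """Naively split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(" | ".join(split_sentences("Hello! How are you doing?")))
# prints: Hello! | How are you doing?
```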

Tokenization

    • Splits a sentence into words, according to language-specific space and punctuation rules.
    • Available in all languages (except Chinese, Japanese, Vietnamese, Thai…)
    • Example: How are you doing? → How | are | you | doing | ?
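
A minimal sketch of space-and-punctuation tokenization, assuming a single generic rule rather than the language-specific rules the service applies. The name `tokenize` is hypothetical and chosen only for this example.

```python
import re

def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    # A run of word characters is one token; any other
    # non-space character becomes a token of its own.
    return re.findall(r"\w+|[^\w\s]", sentence)

print(" | ".join(tokenize("How are you doing?")))
# prints: How | are | you | doing | ?
```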

Word segmentation (no-space tokenization)

    • Split text into words for languages that do not use spaces to separate them.
    • Available in Chinese, Japanese, Vietnamese.
    • Example: 把音量调低一点 → 把 | 音量 | 调低 | 一点
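
One classic way to segment text without spaces is forward maximum matching against a word list. The sketch below is an assumption-laden toy, not a description of how Bitext segments text: the function `max_match`, the tiny `LEXICON`, and the `max_len` limit are all invented for illustration.

```python
# Toy lexicon; a real system would use a large dictionary or a statistical model.
LEXICON = {"把", "音量", "调低", "一点"}

def max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching: take the longest known word at each position."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing in the lexicon matches.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words

print(" | ".join(max_match("把音量调低一点", LEXICON)))
# prints: 把 | 音量 | 调低 | 一点
```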

Decompounding

    • Split compound words/tokens into their individual component words.
    • Available in German, Dutch, Norwegian, Swedish, Korean
    • Example: Rindfleischetikettierung → Rind | Fleisch | Etikettierung
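
Decompounding can be pictured as a dictionary-driven search for a sequence of known words that covers the whole compound. The following is a simplified sketch under stated assumptions: `decompound` and `LEXICON` are hypothetical names, the lexicon holds only the three words from the example, and linking elements (such as the German "s" between components) are ignored.

```python
# Toy lexicon of lowercase component words (assumption for this example).
LEXICON = {"rind", "fleisch", "etikettierung"}

def decompound(word, lexicon):
    """Return one split of `word` into lexicon entries, or None if no split exists."""
    lower = word.lower()
    if not lower:
        return []
    # Try the longest prefix first, then recurse on the remainder.
    for end in range(len(lower), 0, -1):
        prefix = lower[:end]
        if prefix in lexicon:
            rest = decompound(lower[end:], lexicon)
            if rest is not None:
                return [prefix.capitalize()] + rest
    return None

print(" | ".join(decompound("Rindfleischetikettierung", LEXICON)))
# prints: Rind | Fleisch | Etikettierung
```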

Lemmatization (ambiguous)

    • Return the possible roots for a word form
    • Available in most languages (except Chinese, Vietnamese, Thai…)
    • Example: running → run

POS Tagging (ambiguous)

    • Return the possible parts of speech (and optionally other attributes) of a word
    • Available in all languages
    • Example: run → verb (infinitive), verb (1st person singular, present tense), noun (singular)

Inflection

    • Return all forms of a root word
    • Available in all languages (except Chinese, Vietnamese, Thai…)
    • Example: run → run, runs, ran, running
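
Lemmatization and inflection can be seen as two directions of the same lookup: from a form to its possible roots, and from a root to all of its forms. The sketch below assumes a hand-written table with a single entry; the names `INFLECTIONS`, `inflect` and `lemmatize` are illustrative and do not correspond to any Bitext API.

```python
from collections import defaultdict

# Toy morphological table: root -> all forms (assumption for this example).
INFLECTIONS = {"run": ["run", "runs", "ran", "running"]}

# Reverse index: form -> set of possible roots (a form may be ambiguous).
LEMMAS = defaultdict(set)
for root, forms in INFLECTIONS.items():
    for form in forms:
        LEMMAS[form].add(root)

def inflect(root):
    """Return all known forms of a root word."""
    return INFLECTIONS.get(root, [])

def lemmatize(form):
    """Return the possible roots of a word form, sorted alphabetically."""
    return sorted(LEMMAS.get(form, set()))

print(lemmatize("running"))  # prints: ['run']
print(inflect("run"))        # prints: ['run', 'runs', 'ran', 'running']
```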

Language identification

    • Detect the language(s) used in each sentence of a longer input text
    • Available in all languages
    • Example: Oui! I love Paris → “Oui!” – French, “I love Paris” – English
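
Per-sentence language identification can be approximated by counting hits against small per-language word lists. This is only a sketch: the word lists in `HINT_WORDS` and the function `identify` are invented for the example, and production systems typically rely on character n-gram statistics rather than a handful of words.

```python
import re

# Tiny hint-word lists (assumption for this example).
HINT_WORDS = {
    "French": {"oui", "non", "je", "le", "la", "et"},
    "English": {"i", "you", "the", "and", "love", "is"},
}

def identify(sentence):
    """Return the language whose hint words occur most often, or None if none occur."""
    tokens = set(re.findall(r"\w+", sentence.lower()))
    scores = {lang: len(tokens & words) for lang, words in HINT_WORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

for sentence in re.split(r"(?<=[.!?])\s+", "Oui! I love Paris"):
    print(sentence, "-", identify(sentence))
# prints: Oui! - French
# prints: I love Paris - English
```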

Spell checking

    • Check if a word is spelled correctly
    • Available in all languages
    • Example: excelent → incorrect

Spell suggestions

    • Suggest corrections for incorrectly spelled words
    • Available in all languages
    • Example: excelent → excellent
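
Spell checking and spell suggestion can both be sketched with a vocabulary list and a string-similarity search. The example below uses `difflib.get_close_matches` from the Python standard library; the `VOCABULARY` list, the 0.8 cutoff, and the function names are assumptions for illustration and say nothing about how the Bitext services work internally.

```python
import difflib

# Toy vocabulary (assumption for this example).
VOCABULARY = ["excellent", "excel", "exceed", "example"]

def is_correct(word):
    """A word counts as correct if it appears in the vocabulary."""
    return word.lower() in VOCABULARY

def suggest(word, n=3):
    """Return up to n vocabulary words that are similar to the input."""
    return difflib.get_close_matches(word.lower(), VOCABULARY, n=n, cutoff=0.8)

print(is_correct("excelent"))  # prints: False
print(suggest("excelent"))     # prints: ['excellent']
```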

Syntactic and Semantic services (grammar and meaning)

Entity extraction

    • Detect proper names (people, places…) and other special text (phones, URLs…)
    • Available in Dutch, English, French, German, Italian, Portuguese, Spanish.
    • Example: John lives in New York → “John” – person name, “New York” – place

Offensive language detection

    • Detect offensive or vulgar expressions in text
    • Available in all languages.
    • Example: tell John to f*ck off → “f*ck off” – offensive

Anonymization

    • Remove sensitive or personal information (PII) from text
    • Available in Dutch, English, French, German, Italian, Portuguese, Spanish.
    • Example: My name is John and my account number is 1234567 → My name is XXXX and my account number is XXXX.
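
A simplified picture of anonymization is masking spans that were flagged as sensitive. In the sketch below the person names are passed in by hand (in practice they would come from an entity extraction step), and any run of four or more digits is treated as an account-like number. Both rules, and the name `anonymize`, are assumptions made for this example.

```python
import re

def anonymize(text, names, mask="XXXX"):
    """Mask the given names and any run of 4 or more digits."""
    for name in names:
        text = re.sub(r"\b" + re.escape(name) + r"\b", mask, text)
    return re.sub(r"\b\d{4,}\b", mask, text)

print(anonymize("My name is John and my account number is 1234567", ["John"]))
# prints: My name is XXXX and my account number is XXXX
```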

POS Tagging (disambiguated)

    • Return the parts of speech for each word in a sentence.
    • Available in English, Dutch, Danish, Czech, Catalan.
    • Example: John runs back home → “John” – proper noun, “runs” – verb, “back” – adverb, “home” – noun

Phrase Extraction

    • Returns the constituents (noun phrases, verb phrases…) of a sentence
    • Available in English, French, German, Dutch, Italian, Portuguese, Spanish, Catalan.
    • Example: John’s sister was performing in the theatre → “John’s sister” – NP, “was performing” – VP, “in the theatre” – PP

Topic-Based Sentiment Analysis

    • Returns the sentiment and corresponding topic of opinions in text
    • Available in Catalan, Dutch, English, French, German, Italian, Portuguese, Spanish.
    • Example: I hate my old phone → opinion: “hate” (negative), topic: “my old phone”

Categorization

    • Returns the categories applicable to a text, based on pre-defined rules
    • Available in Dutch, English, French, German, Italian, Portuguese, Spanish. 
    • Example: John is feeling great. → HAPPINESS [RULE: feel + great → HAPPINESS]
    • Example: John was weeping like a willow. → SADNESS [RULE: weep + like + willow → SADNESS]
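
The rule notation in the examples above can be read as "if all of these lemmas occur in the text, assign this category". A minimal sketch of that reading follows; the tiny lemma map, the `RULES` list and the function `categorize` are hypothetical, and real rules may also constrain word order or syntax.

```python
import re

# Toy lemma map so that inflected forms match the rules (assumption).
LEMMA = {"feeling": "feel", "weeping": "weep"}

# Each rule: a set of required lemmas and the category it triggers.
RULES = [
    ({"feel", "great"}, "HAPPINESS"),
    ({"weep", "like", "willow"}, "SADNESS"),
]

def categorize(text):
    """Return every category whose required lemmas all occur in the text."""
    tokens = re.findall(r"\w+", text.lower())
    lemmas = {LEMMA.get(token, token) for token in tokens}
    return [category for required, category in RULES if required <= lemmas]

print(categorize("John is feeling great."))           # prints: ['HAPPINESS']
print(categorize("John was weeping like a willow."))  # prints: ['SADNESS']
```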

Parsing

  • Produce a tree with the hierarchical constituent parts of a sentence (words, phrases, clauses…)
  • Available in Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, French, German, Hungarian, Italian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Ukrainian.

Languages

    • Afrikaans
    • Albanian
    • Amharic
    • Arabic
    • Armenian
    • Assamese
    • Azeri
    • Basque
    • Belarusian
    • Bengali
    • Bulgarian
    • Burmese
    • Catalan
    • Chinese
    • Croatian
    • Czech
    • Danish
    • Dutch
    • English
    • Esperanto
    • Estonian
    • Finnish
    • French
    • Galician
    • Georgian
    • German
  • Greek
  • Gujarati
  • Hebrew
  • Hindi
  • Hungarian
  • Icelandic
  • Indonesian
  • Irish Gaelic
  • Italian
  • Japanese
  • Kannada
  • Kazakh
  • Khmer
  • Korean
  • Kyrgyz
  • Lao
  • Latvian
  • Lithuanian
  • Macedonian
  • Malay
  • Malayalam
  • Marathi
  • Mongolian
  • Nepali
  • Norwegian Bokmål
  • Norwegian Nynorsk
  • Oriya
  • Persian
  • Polish
  • Portuguese
  • Punjabi
  • Romanian
  • Russian
  • Serbian
  • Sindhi
  • Sinhala
  • Slovak
  • Slovenian
  • Spanish
  • Swahili
  • Swedish
  • Tagalog
  • Tamil
  • Telugu
  • Thai
  • Turkish
  • Ukrainian
  • Urdu
  • Uzbek
  • Vietnamese
  • Zulu

Data Samples & Languages Specifications

  • Afrikaans
  • Albanian
  • Amharic
  • Arabic
  • Armenian
  • Assamese
  • Azeri
  • Basque
  • Belarusian
  • Bengali
  • Bulgarian
  • Burmese
  • Catalan
  • Chinese
  • Croatian
  • Czech
  • Danish
  • Dutch
  • English
  • Esperanto
  • Estonian
  • Finnish
  • French
  • Galician
  • Georgian
  • German
  • Greek
  • Gujarati
  • Hebrew
  • Hindi
  • Hungarian
  • Icelandic
  • Indonesian
  • Irish Gaelic
  • Italian
  • Japanese
  • Kannada
  • Kazakh
  • Khmer
  • Korean
  • Kyrgyz
  • Lao
  • Latvian
  • Lithuanian
  • Macedonian
  • Malay
  • Malayalam
  • Marathi
  • Mongolian
  • Nepali
  • Norwegian Bokmål
  • Norwegian Nynorsk
  • Oriya
  • Persian
  • Polish
  • Portuguese
  • Punjabi
  • Romanian
  • Russian
  • Serbian
  • Sindhi
  • Sinhala
  • Slovak
  • Slovenian
  • Spanish
  • Swahili
  • Swedish
  • Tagalog
  • Tamil
  • Telugu
  • Thai
  • Turkish
  • Ukrainian
  • Urdu
  • Uzbek
  • Vietnamese
  • Zulu

Variants

    • Arabic (MSA)
    • Arabic (Gulf)
    • Arabic (Najdi)
    • Chinese (Simplified)
    • Chinese (Traditional)
    • Dutch (Netherlands)
    • Dutch (Belgium)
    • English (US)
    • English (UK)
  • English (India)
  • Finnish (Standard)
  • Finnish (Colloquial)
  • French (France)
  • French (Canada)
  • French (Switzerland)
  • German (Germany)
  • German (Switzerland)
  • Italian (Italy)
  • Italian (Switzerland)
  • Portuguese (Portugal)
  • Portuguese (Brazil)
  • Spanish (Spain)
  • Spanish (North America)
  • Spanish (Central America)
  • Spanish (Andes)
  • Spanish (Southern Cone)

 


      Contact us for more information about our evaluation and training data.

      MADRID, SPAIN

      SAN FRANCISCO, USA