Linguistic Services

Bitext provides core tools to automatically pre-annotate custom corpora and datasets. These tools annotate at the word level (lemmatization/stemming, inflection…) and at the sentence level (topic-based sentiment analysis, categorization, parsing…). We provide:

 

Lexical services (no grammar)

Sentence segmentation

    • Splits text into sentences, according to language-specific punctuation rules.
    • Available in all languages.
    • Example: Hello! How are you doing? → Hello! | How are you doing?
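To illustrate the idea, here is a minimal Python sketch of punctuation-based sentence splitting. It is not Bitext's implementation: the function name `split_sentences` and the single English-style rule (split after ".", "!" or "?" followed by whitespace) are assumptions made for the example, and real language-specific rules also need to handle abbreviations, quotes and other cases.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split text after '.', '!' or '?' when followed by whitespace (toy rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


print(" | ".join(split_sentences("Hello! How are you doing?")))
```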

Tokenization

    • Splits a sentence into words, according to language-specific space and punctuation rules.
    • Available in all languages (except Chinese, Japanese, Vietnamese, Thai…)
    • Example: How are you doing? → How | are | you | doing | ?
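As a rough illustration only, space-and-punctuation tokenization can be approximated with one regular expression. The `tokenize` helper below is a hypothetical name, and the rule (runs of word characters, or single punctuation marks) is a simplification that ignores language-specific cases such as clitics or hyphenated words.

```python
import re


def tokenize(sentence: str) -> list[str]:
    """Return word tokens and punctuation marks as separate items (toy rule)."""
    # \w+      : a run of letters, digits or underscores
    # [^\w\s]  : any single character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", sentence)


print(" | ".join(tokenize("How are you doing?")))
```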

Word segmentation (no-space tokenization)

    • Split text into words for languages that do not use spaces to separate them.
    • Available in Chinese, Japanese, Vietnamese.
    • Example: 把音量调低一点→ 把 | 音量 | 调低 | 一点
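One classic way to segment text without spaces is forward maximum matching against a word list. The sketch below uses that technique purely as an illustration; it is not a description of how the Bitext service works, and the `segment` function and the four-word `LEXICON` are assumptions made for the example.

```python
def segment(text: str, lexicon: set[str], max_len: int = 4) -> list[str]:
    """Forward maximum matching: at each position take the longest known word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Toy lexicon covering only the example sentence (hypothetical data).
LEXICON = {"把", "音量", "调低", "一点"}
print(" | ".join(segment("把音量调低一点", LEXICON)))
```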

Decompounding

    • Split compound words/tokens into their individual component words.
    • Available in German, Dutch, Norwegian, Swedish, Korean
    • Example: Rindfleischetikettierung → Rind | Fleisch | Etikettierung
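A simple way to picture decompounding is a dictionary-driven split that backtracks when a prefix leads to a dead end. The sketch below is an assumption-laden toy, not the Bitext algorithm: the `decompound` function and the three-word `LEXICON` are hypothetical, linking elements (such as the German "-s-") are not handled, and parts are re-capitalized because German nouns are written with an initial capital.

```python
from typing import Optional


def decompound(word: str, lexicon: set[str]) -> list[str]:
    """Split a compound into known words, preferring longer components."""

    def split(rest: str) -> Optional[list[str]]:
        if not rest:
            return []
        for end in range(len(rest), 0, -1):  # longest prefix first
            head = rest[:end]
            if head in lexicon:
                tail = split(rest[end:])
                if tail is not None:  # the remainder could be split too
                    return [head] + tail
        return None  # no valid split from this position

    parts = split(word.lower())
    # Fall back to the unsplit word when no analysis is found.
    return [p.capitalize() for p in parts] if parts else [word]


# Toy lexicon (hypothetical data), stored in lowercase.
LEXICON = {"rind", "fleisch", "etikettierung"}
print(" | ".join(decompound("Rindfleischetikettierung", LEXICON)))
```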

Lemmatization (ambiguous)

    • Return the possible roots for a word form
    • Available in most languages (except Chinese, Vietnamese, Thai…)
    • Example: running → run

POS Tagging (ambiguous)

    • Return the possible parts of speech (and optionally other attributes) of a word
    • Available in all languages
    • Example: run → verb (infinitive), verb (1st person singular, present tense), noun (singular)

Inflection

    • Return all forms of a root word
    • Available in all languages (except Chinese, Vietnamese, Thai…)
    • Example: run → run, runs, ran, running
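Lemmatization, ambiguous POS tagging and inflection can all be pictured as lookups over one table of (form, lemma, part of speech) entries. The sketch below is only an illustration under that assumption: the `ENTRIES` table and the `lemmas`, `pos_tags` and `inflect` helpers are hypothetical and cover only the word "run".

```python
# Each entry is (word form, lemma, part of speech). Hypothetical toy data.
ENTRIES = [
    ("run", "run", "verb"),
    ("runs", "run", "verb"),
    ("ran", "run", "verb"),
    ("running", "run", "verb"),
    ("run", "run", "noun"),
    ("runs", "run", "noun"),
]


def _unique(items):
    """Remove duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))


def lemmas(form: str) -> list[str]:
    """All possible roots of a word form (no context, so possibly several)."""
    return _unique(lemma for f, lemma, _ in ENTRIES if f == form)


def pos_tags(form: str) -> list[str]:
    """All possible parts of speech of a word form."""
    return _unique(pos for f, _, pos in ENTRIES if f == form)


def inflect(lemma: str) -> list[str]:
    """All word forms that share the given root."""
    return _unique(f for f, l, _ in ENTRIES if l == lemma)


print(lemmas("running"), pos_tags("run"), inflect("run"))
```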

Language identification

    • Detect the language(s) used in each sentence of a longer input text
    • Available in all languages
    • Example: Oui! I love Paris → “Oui!” – French, “I love Paris” – English
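Per-sentence language identification can be sketched by splitting the text into sentences and scoring each one against small lists of frequent words. This is a deliberately naive illustration, not the method used by the service: the `MARKERS` word lists and the `identify` and `identify_per_sentence` helpers are assumptions, and production systems typically rely on far richer evidence.

```python
import re

# Tiny marker-word lists per language (hypothetical toy data).
MARKERS = {
    "French": {"oui", "non", "je", "le", "la", "et"},
    "English": {"i", "the", "love", "and", "yes", "no"},
}


def identify(sentence: str) -> str:
    """Pick the language whose marker words overlap most with the sentence."""
    words = set(re.findall(r"\w+", sentence.lower()))
    scores = {lang: len(words & markers) for lang, markers in MARKERS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def identify_per_sentence(text: str) -> list[tuple[str, str]]:
    """Split on sentence-final punctuation, then label each sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [(s, identify(s)) for s in sentences if s]


for sentence, language in identify_per_sentence("Oui! I love Paris"):
    print(f"{sentence!r}: {language}")
```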

Spell checking

    • Check if a word is spelled correctly
    • Available in all languages
    • Example: excelent → incorrect

Spell suggestions

    • Suggest corrections for incorrectly spelled words
    • Available in all languages
    • Example: excelent → excellent
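Spell checking and spell suggestion can be illustrated together with Python's standard `difflib` module: a word is "correct" if it is in a vocabulary, and suggestions are the closest vocabulary entries by string similarity. The `VOCABULARY` list, the helper names and the 0.8 similarity cutoff are assumptions for this sketch, not details of the Bitext service.

```python
import difflib

# Toy vocabulary (hypothetical data).
VOCABULARY = ["excel", "excellent", "example"]


def is_correct(word: str) -> bool:
    """A word counts as correctly spelled if it appears in the vocabulary."""
    return word.lower() in VOCABULARY


def suggest(word: str, limit: int = 3) -> list[str]:
    """Return the most similar vocabulary words, best match first."""
    return difflib.get_close_matches(word.lower(), VOCABULARY, n=limit, cutoff=0.8)


word = "excelent"
print(word, "is correct:", is_correct(word))
print("suggestions:", suggest(word))
```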

Syntactic and Semantic services (grammar and meaning)

Entity extraction

    • Detect proper names (people, places…) and other special text (phones, URLs…)
    • Available in Dutch, English, French, German, Italian, Portuguese, Spanish.
    • Example: John lives in New York → “John” – person name, “New York” – place

Offensive language detection

    • Detect offensive or vulgar expressions in text
    • Available in all languages.
    • Example: tell John to f*ck off → “f*ck off” – offensive

Anonymization

    • Remove sensitive or personal information (PII) from text
    • Available in Dutch, English, French, German, Italian, Portuguese, Spanish.
    • Example: My name is John and my account number is 1234567 → My name is XXXX and my account number is XXXX.
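The masking step of anonymization can be sketched with regular expressions, assuming the sensitive spans are already known or easy to pattern-match. In the toy below, `KNOWN_NAMES` stands in for the output of an entity extractor and long digit runs stand in for account numbers; both, like the `anonymize` function itself, are assumptions made for illustration.

```python
import re

# Stand-in for names found by an entity extractor (hypothetical data).
KNOWN_NAMES = {"John"}

NAME_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(n) for n in sorted(KNOWN_NAMES)) + r")\b"
)
NUMBER_PATTERN = re.compile(r"\b\d{4,}\b")  # runs of four or more digits


def anonymize(text: str, mask: str = "XXXX") -> str:
    """Replace known names and long digit sequences with a fixed mask."""
    text = NUMBER_PATTERN.sub(mask, text)
    return NAME_PATTERN.sub(mask, text)


print(anonymize("My name is John and my account number is 1234567"))
```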

POS-Tagging (disambiguated)

    • Return the parts of speech for each word in a sentence.
    • Available in English, Dutch, Danish, Czech, Catalan.
    • Example: John runs back home → “John” – proper noun, “runs” – verb, “back” – adverb, “home” – noun

Phrase Extraction

    • Return the constituents (noun phrases, verb phrases…) of a sentence
    • Available in English, French, German, Dutch, Italian, Portuguese, Spanish, Catalan.
    • Example: John’s sister was performing in the theatre → “John’s sister” – NP, “was performing” – VP, “in the theatre” – PP

Topic-Based Sentiment Analysis

    • Return the sentiment and corresponding topic of opinions in text
    • Available in Catalan, Dutch, English, French, German, Italian, Portuguese, Spanish.
    • Example: I hate my old phone → opinion: “hate” (negative), topic: “my old phone”

Categorization

    • Return the categories applicable to a text, based on pre-defined rules
    • Available in Dutch, English, French, German, Italian, Portuguese, Spanish. 
    • Example: John is feeling great. → HAPPINESS [RULE: feel + great → HAPPINESS]
    • Example: John was weeping like a willow. → SADNESS [RULE: weep + like + willow → SADNESS]
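The two rules in the examples above can be read as ordered sequences of lemmas that must appear in the text. The sketch below implements that reading as an assumption; the `LEMMAS` table, the `RULES` list and the matching strategy (an in-order subsequence match over lemmatized tokens) are hypothetical and are not taken from the actual rule format.

```python
import re

# Minimal form-to-lemma table (hypothetical data).
LEMMAS = {"feeling": "feel", "weeping": "weep", "is": "be", "was": "be"}

# Each rule: (sequence of lemmas that must appear in order, category).
RULES = [
    (("feel", "great"), "HAPPINESS"),
    (("weep", "like", "willow"), "SADNESS"),
]


def is_subsequence(pattern, tokens) -> bool:
    """True if all pattern items occur in tokens, in the same order."""
    remaining = iter(tokens)
    return all(item in remaining for item in pattern)


def categorize(text: str) -> list[str]:
    """Return the category of every rule whose lemmas match the text."""
    tokens = [LEMMAS.get(t, t) for t in re.findall(r"\w+", text.lower())]
    return [category for pattern, category in RULES if is_subsequence(pattern, tokens)]


print(categorize("John is feeling great."))
print(categorize("John was weeping like a willow."))
```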

Parsing

    • Produce a tree with the hierarchical constituent parts of a sentence (words, phrases, clauses…)
    • Available in Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, French, German, Hungarian, Italian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Ukrainian.
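To make the notion of a parse tree concrete, the sketch below represents one plausible constituent analysis of an English sentence as nested tuples and reads words and phrases back out of it. The tree shape, the labels and the `leaves` and `phrases` helpers are assumptions for illustration; they do not reproduce the output format of the parser.

```python
# A node is (label, child, child, ...); a leaf is a plain string.
# This particular analysis is hand-written for the example (hypothetical).
TREE = (
    "S",
    ("NP", "John's", "sister"),
    ("VP", "was", "performing",
     ("PP", "in", ("NP", "the", "theatre"))),
)


def leaves(node) -> list[str]:
    """Collect the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def phrases(node, label: str) -> list[str]:
    """Return the text of every constituent with the given label."""
    if isinstance(node, str):
        return []
    found = [" ".join(leaves(node))] if node[0] == label else []
    for child in node[1:]:
        found.extend(phrases(child, label))
    return found


print(" ".join(leaves(TREE)))
print(phrases(TREE, "NP"))
print(phrases(TREE, "PP"))
```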

Languages

    • Afrikaans
    • Albanian
    • Amharic
    • Arabic
    • Armenian
    • Assamese
    • Azeri
    • Basque
    • Belarusian
    • Bengali
    • Bulgarian
    • Burmese
    • Catalan
    • Chinese
    • Croatian
    • Czech
    • Danish
    • Dutch
    • English
    • Esperanto
    • Estonian
    • Finnish
    • French
    • Galician
    • Georgian
    • German
    • Greek
    • Gujarati
    • Hebrew
    • Hindi
    • Hungarian
    • Icelandic
    • Indonesian
    • Irish Gaelic
    • Italian
    • Japanese
    • Kannada
    • Kazakh
    • Khmer
    • Korean
    • Kyrgyz
    • Lao
    • Latvian
    • Lithuanian
    • Macedonian
    • Malay
    • Malayalam
    • Marathi
    • Mongolian
    • Nepali
    • Norwegian Bokmål
    • Norwegian Nynorsk
    • Oriya
    • Persian
    • Polish
    • Portuguese
    • Punjabi
    • Romanian
    • Russian
    • Serbian
    • Sindhi
    • Sinhala
    • Slovak
    • Slovenian
    • Spanish
    • Swahili
    • Swedish
    • Tagalog
    • Tamil
    • Telugu
    • Thai
    • Turkish
    • Ukrainian
    • Urdu
    • Uzbek
    • Vietnamese
    • Zulu

Data Samples & Languages Specifications


Variants

    • Arabic (MSA)
    • Arabic (Gulf)
    • Arabic (Najdi)
    • Chinese (Simplified)
    • Chinese (Traditional)
    • Dutch (Netherlands)
    • Dutch (Belgium)
    • English (US)
    • English (UK)
    • English (India)
    • Finnish (Standard)
    • Finnish (Colloquial)
    • French (France)
    • French (Canada)
    • French (Switzerland)
    • German (Germany)
    • German (Switzerland)
    • Italian (Italy)
    • Italian (Switzerland)
    • Portuguese (Portugal)
    • Portuguese (Brazil)
    • Spanish (Spain)
    • Spanish (North America)
    • Spanish (Central America)
    • Spanish (Andes)
    • Spanish (Southern Cone)

 


      Contact us for more information about our evaluation and training data.

      MADRID, SPAIN

      Camino de las Huertas, 20, 28223 Pozuelo
      Madrid, Spain

      SAN FRANCISCO, USA

      541 Jefferson Ave Ste 100, Redwood City
      CA 94063, USA