Linguistic Services

Bitext provides core tools to automatically pre-annotate custom corpora and datasets. These tools annotate at both the word level (lemmatization/stemming, inflection…) and the sentence level (Topic-Based Sentiment Analysis, Categorization, Parsing…). We provide:

 

Lexical services (no grammar)

Sentence segmentation
Splits text into sentences, according to language-specific punctuation rules. Applicable to all languages. Example: Hello! How are you doing? → Hello! | How are you doing?
Tokenization
Splits a sentence into words, according to language-specific space and punctuation rules. Applicable to most languages (except Chinese, Japanese, Vietnamese, Thai…). Example: How are you doing? → How | are | you | doing | ?
Word segmentation (no-space tokenization)
Splits text into words for languages that do not use spaces to separate them. Applicable to Chinese, Japanese, Vietnamese, Thai… Example: 把音量调低一点 → 把 | 音量 | 调低 | 一点
Decompounding
Splits compound words/tokens into their individual component words. Applicable to German, Dutch, Norwegian, Swedish, Korean… Example: Rindfleischetikettierung → Rind | Fleisch | Etikettierung
Lemmatization (ambiguous)
Returns the possible roots for a word form. Applicable to most languages (except Chinese, Vietnamese, Thai…). Example: running → run
POS Tagging (ambiguous)
Returns the possible parts of speech (and optionally other attributes) of a word. Applicable to all languages. Example: run → verb (infinitive), verb (1st person singular, present tense), noun (singular)
Inflection
Returns all forms of a root word. Applicable to most languages (except Chinese, Vietnamese, Thai…). Example: run → run, runs, ran, running
Language identification
Detects the language(s) used in each sentence of a longer input text. Applicable to all languages. Example: Oui! I love Paris → “Oui!” – French, “I love Paris” – English
Spell checking
Checks whether a word is spelled correctly. Applicable to all languages. Example: excelent → incorrect
Spell suggestions
Suggests corrections for incorrectly spelled words. Applicable to all languages. Example: excelent → excellent
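
As a rough illustration of the first two services, the sketch below reproduces the sentence segmentation and tokenization examples above with plain regular expressions. It is not Bitext's implementation or API: the function names `split_sentences` and `tokenize` are hypothetical, and the single English-style punctuation rule is an assumption that stands in for the language-specific rules a real segmenter needs (abbreviations, quotes, languages written without spaces…).

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive segmentation: a sentence ends at '.', '!' or '?' followed by whitespace.

    Assumption: no abbreviations, quotes or other language-specific cases.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def tokenize(sentence: str) -> List[str]:
    """Naive tokenization: runs of word characters (keeping internal apostrophes)
    become one token, and every other non-space character is its own token."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)


if __name__ == "__main__":
    print(" | ".join(split_sentences("Hello! How are you doing?")))
    print(" | ".join(tokenize("How are you doing?")))
```

A space-based approach like this one cannot handle Chinese, Japanese or Thai input, which is why word segmentation is listed as a separate service.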

Syntactic services (grammar) 

Entity extraction

Detects proper names (people, places…) and other special text (phones, URLs…). Applicable to all languages. Example: John lives in New York → “John” – person name, “New York” – place

Offensive language detection

Detects offensive or vulgar expressions in text. Applicable to all languages. Example: tell John to f*ck off → “f*ck off” – offensive

Anonymization

Removes sensitive or personal information (PII) from text. Applicable to all languages. Example: My name is John and my account number is 1234567 → My name is XXXX and my account number is XXXX.
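
To make the masking step concrete, here is a minimal sketch that reproduces the example above. It is only an illustration under stated assumptions, not how the Bitext service works: `KNOWN_NAMES` is a hypothetical stand-in for the output of an entity extractor, and treating any run of five or more digits as an account-like number is an arbitrary rule chosen for this example.

```python
import re
from typing import Iterable

# Hypothetical stand-in for names found by an entity extraction step.
KNOWN_NAMES = frozenset({"John"})


def anonymize(text: str, names: Iterable[str] = KNOWN_NAMES, mask: str = "XXXX") -> str:
    """Replace known names and long digit runs with a fixed mask."""
    # Assumption: any standalone run of 5+ digits is sensitive (account-like).
    text = re.sub(r"\b\d{5,}\b", mask, text)
    # Mask each known name when it appears as a whole word.
    for name in names:
        text = re.sub(r"\b" + re.escape(name) + r"\b", mask, text)
    return text


if __name__ == "__main__":
    print(anonymize("My name is John and my account number is 1234567"))
```

In practice the list of names would come from a detection step such as entity extraction rather than a fixed set.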

 

POS Tagging (disambiguated)

Returns the part of speech of each word in a sentence. Applicable to all languages. Example: John runs back home → “John” – proper noun, “runs” – verb, “back” – preposition, “home” – noun

Phrase Extraction

Returns the constituents (noun phrases, verb phrases…) of a sentence. Applicable to all languages. Example: John’s sister was performing in the theatre → “John’s sister” – NP, “was performing” – VP, “in the theatre” – PP

Topic-Based Sentiment Analysis

Returns the sentiment and corresponding topic of opinions in text. Applicable to all languages. Example: I hate my old phone → opinion: “hate” (negative), topic: “my old phone”

Categorization

Returns the categories applicable to a text, based on pre-defined rules. Applicable to all languages. Examples: John is feeling great. → HAPPINESS [RULE: feel + great → HAPPINESS]; John was weeping like a willow. → SADNESS [RULE: weep + like + willow → SADNESS]
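
The two rules in the examples can be read as ordered lists of lemmas that must all appear in the text. The sketch below works through them under that assumption. It is not Bitext's rule engine or rule syntax: the tiny `LEMMAS` table is a hypothetical substitute for a real lemmatizer, and interpreting "+" as "appears later in the sentence, with other words allowed in between" is a guess made for this illustration.

```python
import re
from typing import List, Sequence

# Toy lemma table (hypothetical); a real system would call a lemmatizer.
LEMMAS = {"feeling": "feel", "weeping": "weep", "is": "be", "was": "be"}

# Each rule: an ordered tuple of lemmas and the category it assigns.
RULES = [
    (("feel", "great"), "HAPPINESS"),
    (("weep", "like", "willow"), "SADNESS"),
]


def lemmatize(text: str) -> List[str]:
    """Lowercase the text, split it into words and map each word to its lemma."""
    return [LEMMAS.get(word, word) for word in re.findall(r"\w+", text.lower())]


def matches(pattern: Sequence[str], lemmas: Sequence[str]) -> bool:
    """True if the pattern lemmas occur in order (gaps allowed) in the lemma list."""
    remaining = iter(lemmas)
    # 'term in remaining' consumes the iterator, which enforces the ordering.
    return all(term in remaining for term in pattern)


def categorize(text: str) -> List[str]:
    """Return the category of every rule whose pattern matches the text."""
    lemmas = lemmatize(text)
    return [category for pattern, category in RULES if matches(pattern, lemmas)]


if __name__ == "__main__":
    print(categorize("John is feeling great."))
    print(categorize("John was weeping like a willow."))
```

Matching on lemmas rather than surface forms is what lets one rule such as feel + great cover "feeling great", "feels great" and similar variants, provided the lemmatizer knows those forms.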

Parsing

Produces a tree with the hierarchical constituent parts of a sentence (words, phrases, clauses…). Applicable to all languages.

Languages

    • Afrikaans
    • Albanian
    • Amharic
    • Arabic
    • Armenian
    • Assamese
    • Azeri
    • Basque
    • Belarusian
    • Bengali
    • Bulgarian
    • Burmese
    • Catalan
    • Chinese
    • Croatian
    • Czech
    • Danish
    • Dutch
    • English
    • Esperanto
    • Estonian
    • Finnish
    • French
    • Galician
    • Georgian
    • German
    • Greek
    • Gujarati
    • Hebrew
    • Hindi
    • Hungarian
    • Icelandic
    • Indonesian
    • Irish Gaelic
    • Italian
    • Japanese
    • Kannada
    • Kazakh
    • Khmer
    • Korean
    • Kyrgyz
    • Lao
    • Latvian
    • Lithuanian
    • Macedonian
    • Malay
    • Malayalam
    • Marathi
    • Mongolian
    • Nepali
    • Norwegian Bokmal
    • Norwegian Nynorsk
    • Oriya
    • Persian
    • Polish
    • Portuguese
    • Punjabi
    • Romanian
    • Russian
    • Serbian
    • Sindhi
    • Sinhala
    • Slovak
    • Slovenian
    • Spanish
    • Swahili
    • Swedish
    • Tagalog
    • Tamil
    • Telugu
    • Thai
    • Turkish
    • Ukrainian
    • Urdu
    • Uzbek
    • Vietnamese
    • Zulu

Variants

    • French
    • Dutch
    • Portuguese
    • Spanish
    • English
    • Italian
    • German
    • Turkish
    • Polish

    Under Preparation:

      • Danish
      • Swedish
      • Korean
      • Chinese
      • Japanese

    Contact us for more information about our evaluation and training data.

    SAN FRANCISCO, USA

    541 Jefferson Ave., Ste. 100

    Redwood City

    CA 94063

    MADRID, SPAIN

    José Echegaray 8, Building 3

    Parque Empresarial Las Rozas

    28232 Las Rozas