Lexical Resources

Bitext Lexical Data Resources are the most comprehensive and consistent set of language resources in the world, with support for more than 100 languages and dialects. This proprietary data has been developed to meet the highest quality standards in the field of computational linguistics.

 

77 languages, 26 variants

 

Bitext data is used in production by some of the world’s largest and most successful software companies, including 3 out of the top 5 NASDAQ companies.

Bitext lexical data contains full morphological descriptions of 77 languages and 26 regional language variants, with up to 18 different tags or attributes, such as:

  • lemma, form, POS, voice, tense, aspect, person, gender…

Bitext lexical data uses a consistent set of descriptive tags for all languages, regardless of their morphological typology (fusional, agglutinative…). This lets the same source code handle different applications across languages, which simplifies extending any application to new languages.
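The benefit of a shared tag set can be illustrated with a minimal sketch. It assumes a hypothetical in-memory lexicon keyed by language code and surface form; the dictionary layout, the tag spellings and the function names are illustrative assumptions, not the actual Bitext delivery format.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (language code, surface form) -> list of analyses.
# The tag names (lemma, pos, tense, ...) are the same for every language.
Analysis = Dict[str, str]
LEXICON: Dict[Tuple[str, str], List[Analysis]] = {
    ("en", "walked"): [{"lemma": "walk", "pos": "verb", "tense": "past"}],
    ("es", "caminó"): [{"lemma": "caminar", "pos": "verb", "tense": "past",
                        "person": "3", "number": "singular"}],
    ("tr", "yürüdü"): [{"lemma": "yürümek", "pos": "verb", "tense": "past",
                        "person": "3", "number": "singular"}],
}


def lemmas(lang: str, form: str) -> List[str]:
    """Return candidate lemmas for a form; no language-specific branches."""
    return [a["lemma"] for a in LEXICON.get((lang, form.lower()), [])]


def has_tag(lang: str, form: str, tag: str, value: str) -> bool:
    """Check whether any analysis of the form carries tag == value."""
    return any(a.get(tag) == value
               for a in LEXICON.get((lang, form.lower()), []))
```

Because every language uses the same tag names, a call such as `has_tag("tr", "yürüdü", "tense", "past")` needs no code beyond what already serves English or Spanish; adding a language only means adding data.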

This morphological information is enriched with:

  • related phenomena like contractions or clitic pronouns
  • complementary entity dictionaries, categorized in 16 types: places, names…
  • frequency information in large representative language corpora
  • other features like offensiveness or formality information

Two types of companies benefit from this data:

  • Companies that already own data and software for mainstream languages (English, Spanish…) can expand the language coverage of existing applications to 77 languages and 26 regional language variants.
  • Companies that don’t have significant assets in the NLP space can quickly build a suite of Natural Language Processing (NLP) components (tokenizers, lemmatizers, POS taggers, phrase extractors, parsers, etc.), since Bitext Lexical Data can be delivered with source code to perform a full analysis from raw text to full parsing.

We currently offer 77 languages and 26 variants (and we regularly add support for additional languages as we develop new resources):

Languages

  • Afrikaans
  • Albanian
  • Amharic
  • Arabic
  • Armenian
  • Assamese
  • Azeri
  • Basque
  • Belarusian
  • Bengali
  • Bulgarian
  • Burmese
  • Catalan
  • Chinese
  • Croatian
  • Czech
  • Danish
  • Dutch
  • English
  • Esperanto
  • Estonian
  • Finnish
  • French
  • Galician
  • Georgian
  • German
  • Greek
  • Gujarati
  • Hebrew
  • Hindi
  • Hungarian
  • Icelandic
  • Indonesian
  • Irish Gaelic
  • Italian
  • Japanese
  • Kannada
  • Kazakh
  • Khmer
  • Korean
  • Kyrgyz
  • Lao
  • Latvian
  • Lithuanian
  • Macedonian
  • Malay
  • Malayalam
  • Marathi
  • Mongolian
  • Nepali
  • Norwegian Bokmål
  • Norwegian Nynorsk
  • Oriya
  • Persian
  • Polish
  • Portuguese
  • Punjabi
  • Romanian
  • Russian
  • Serbian
  • Sindhi
  • Sinhala
  • Slovak
  • Slovenian
  • Spanish
  • Swahili
  • Swedish
  • Tagalog
  • Tamil
  • Telugu
  • Thai
  • Turkish
  • Ukrainian
  • Urdu
  • Uzbek
  • Vietnamese
  • Zulu

Variants

  • Arabic (MSA)
  • Arabic (Gulf)
  • Arabic (Najdi)
  • Chinese (Simplified)
  • Chinese (Traditional)
  • Dutch (Netherlands)
  • Dutch (Belgium)
  • English (US)
  • English (UK)
  • English (India)
  • Finnish (Standard)
  • Finnish (Colloquial)
  • French (France)
  • French (Canada)
  • French (Switzerland)
  • German (Germany)
  • German (Switzerland)
  • Italian (Italy)
  • Italian (Switzerland)
  • Portuguese (Portugal)
  • Portuguese (Brazil)
  • Spanish (Spain)
  • Spanish (North America)
  • Spanish (Central America)
  • Spanish (Andes)
  • Spanish (Southern Cone)

 

Data Samples & Language Specifications

Bitext’s Lexical Data Resources

Download a full description of the features available in each language

Features

  • Lemma: the canonical form of the inflected word.
  • POS: the part of speech, such as noun, verb, adjective, etc.
  • Voice: whether a verb form is active or passive.
  • Tense: when the action takes place, such as past, present, future, etc.
  • Aspect: whether the action is complete, ongoing, habitual, etc.
  • Mood: the modality of the verb form: indicative, subjunctive, imperative, etc.
  • Person: whether a verb or pronoun refers to the first, second or third person.
  • Number: whether the form is singular, dual or plural.
  • Gender: the gender of a noun, verb or adjective form: masculine, feminine, neuter, etc.
  • Case: the function that the noun or adjective plays within a sentence.
  • Degree: whether an adjective is in its positive, comparative or superlative form.
  • Definiteness: whether a noun or adjective refers to a concrete or general concept.
  • Polarity: whether a verb, adjective or noun is in a negative form.
  • Contractions: shortened forms of a word or group of words.
  • Pronominal Clitics: clitic pronouns are identified and tagged.
  • Formality: the social status of the speaker in relation to the context.
  • Frequency: the relative frequency of the form, based on a large general-purpose corpus.
  • Named Entities: pre-defined entities are tagged as person names, places, organizations, etc.
  • Offensive: whether the form might be considered offensive in certain contexts.
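To make the feature list concrete, the sketch below models one entry as a record and uses the frequency and offensiveness attributes to choose among competing analyses of an ambiguous form. The `LexicalEntry` class, its field names and the frequency numbers are hypothetical placeholders chosen for illustration; they do not come from Bitext data or documentation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LexicalEntry:
    """Hypothetical record covering a few of the features listed above."""
    form: str
    lemma: str
    pos: str
    tense: Optional[str] = None
    number: Optional[str] = None
    frequency: float = 0.0   # made-up illustrative value, not corpus data
    offensive: bool = False


# English "saw" is ambiguous: past tense of "see", or the noun "saw".
ENTRIES: List[LexicalEntry] = [
    LexicalEntry("saw", "see", "verb", tense="past", frequency=0.8),
    LexicalEntry("saw", "saw", "noun", number="singular", frequency=0.2),
]


def best_analysis(form: str, entries: List[LexicalEntry],
                  allow_offensive: bool = True) -> Optional[LexicalEntry]:
    """Pick the most frequent analysis of a form, optionally skipping
    entries flagged as offensive. Returns None if nothing matches."""
    candidates = [e for e in entries
                  if e.form == form and (allow_offensive or not e.offensive)]
    return max(candidates, key=lambda e: e.frequency, default=None)
```

A frequency-based choice like this is only a baseline; a real tagger would also use context, but the same record shape could feed either approach.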

 
