Lexical Resources

Bitext Lexical Data Resources are the most comprehensive and consistent set of language resources in the world, with support for more than 100 languages and dialects. This proprietary data has been developed to meet the highest quality standards in the field of computational linguistics.

 


Bitext data is used in production by some of the world’s largest and most successful software companies, including 3 out of the top 5 NASDAQ companies.

Bitext lexical data contains full morphological descriptions of 77 languages and 26 regional language variants, with up to 18 different tags or attributes such as:

  • lemma, form, POS, voice, tense, aspect, person, gender, …

Bitext lexical data uses a consistent set of descriptive tags for all languages, regardless of their morphological typology (fusional, agglutinative…). As a result, the same source code can manage applications consistently across languages, which simplifies extending any application to new languages.
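To illustrate why a shared tag set matters, here is a minimal sketch in Python. The entry layout, tag names and sample words are assumptions made for illustration; they are not Bitext's actual data format. The point is only that one lookup function can serve any language once every lexicon uses the same keys.

```python
from collections import defaultdict

# Hypothetical sample entries; the (form, lemma, tags) layout is an assumption,
# not Bitext's delivery format.
ENTRIES = {
    "en": [
        ("went", "go", {"POS": "verb", "tense": "past"}),
        ("mice", "mouse", {"POS": "noun", "number": "plural"}),
    ],
    "es": [
        ("fui", "ir", {"POS": "verb", "tense": "past", "person": "1"}),
        ("fui", "ser", {"POS": "verb", "tense": "past", "person": "1"}),
    ],
}


def build_index(entries):
    """Map each lowercased surface form to all of its (lemma, tags) analyses."""
    index = defaultdict(list)
    for form, lemma, tags in entries:
        index[form.lower()].append((lemma, tags))
    return index


def lemmatize(index, token):
    """Return the sorted candidate lemmas for a token, or the token itself if unknown."""
    analyses = index.get(token.lower())
    if not analyses:
        return [token]
    return sorted({lemma for lemma, _ in analyses})


# The same two functions are reused for every language.
INDEXES = {lang: build_index(rows) for lang, rows in ENTRIES.items()}

print(lemmatize(INDEXES["en"], "Went"))
print(lemmatize(INDEXES["es"], "fui"))
```

Adding a language in this sketch means adding another list of entries; `build_index` and `lemmatize` stay unchanged. An ambiguous form such as Spanish "fui" returns every candidate lemma, leaving disambiguation to a later step.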

This morphological information is enriched with:

  • morphologically related phenomena like contractions or clitic pronouns
  • complementary entity dictionaries, categorized in 16 types: places, names…
  • frequency information in large representative language corpora
  • other features like offensiveness or formality information

Bitext lexical data serves two kinds of companies:

  • Companies that already own data and software for mainstream languages (English, Spanish…) can expand the language coverage of existing applications to 77 languages and 26 regional language variants.
  • Companies that don’t have significant assets in the NLP space can quickly build a suite of Natural Language Processing (NLP) components (tokenizers, lemmatizers, POS taggers, phrase extractors, parsers, etc.), since Bitext Lexical Data can be delivered with source code to perform a full analysis from raw text to full parsing.

We currently offer 77 languages and 26 variants (and we regularly add support for additional languages as we develop new resources):

Languages

    • Afrikaans
    • Albanian
    • Amharic
    • Arabic
    • Armenian
    • Assamese
    • Azeri
    • Basque
    • Belarusian
    • Bengali
    • Bulgarian
    • Burmese
    • Catalan
    • Chinese
    • Croatian
    • Czech
    • Danish
    • Dutch
    • English
    • Esperanto
    • Estonian
    • Finnish
    • French
    • Galician
    • Georgian
    • German
    • Greek
    • Gujarati
    • Hebrew
    • Hindi
    • Hungarian
    • Icelandic
    • Indonesian
    • Irish Gaelic
    • Italian
    • Japanese
    • Kannada
    • Kazakh
    • Khmer
    • Korean
    • Kyrgyz
    • Lao
    • Latvian
    • Lithuanian
    • Macedonian
    • Malay
    • Malayalam
    • Marathi
    • Mongolian
    • Nepali
    • Norwegian Bokmal
    • Norwegian Nynorsk
    • Oriya
    • Persian
    • Polish
    • Portuguese
    • Punjabi
    • Romanian
    • Russian
    • Serbian
    • Sindhi
    • Sinhala
    • Slovak
    • Slovenian
    • Spanish
    • Swahili
    • Swedish
    • Tagalog
    • Tamil
    • Telugu
    • Thai
    • Turkish
    • Ukrainian
    • Urdu
    • Uzbek
    • Vietnamese
    • Zulu

Variants

    • Arabic (MSA)
    • Arabic (Gulf)
    • Arabic (Najdi)
    • Chinese (Simplified)
    • Chinese (Traditional)
    • Dutch (Netherlands)
    • Dutch (Belgium)
    • English (US)
    • English (UK)
    • English (India)
    • Finnish (Standard)
    • Finnish (Colloquial)
    • French (France)
    • French (Canada)
    • French (Switzerland)
    • German (Germany)
    • German (Switzerland)
    • Italian (Italy)
    • Italian (Switzerland)
    • Portuguese (Portugal)
    • Portuguese (Brazil)
    • Spanish (Spain)
    • Spanish (North America)
    • Spanish (Central America)
    • Spanish (Andes)
    • Spanish (Southern Cone)

 

Bitext’s Lexical Data Resources

Download a full description document of Bitext’s Lexical Data Resources

Features

  • Lemma: the canonical form of the inflected word is provided.
  • POS: the part of speech (noun, verb, adjective, etc.) is defined.
  • Voice: the verb form is classified as active or passive.
  • Tense: specifies when the action takes place (past, present, future, etc.).
  • Aspect: indicates whether the action is complete, ongoing, habitual, etc.
  • Mood: the modality of the verb form is provided: indicative, subjunctive, imperative, etc.
  • Person: indicates whether the verb or pronoun refers to the first, second or third person.
  • Number: indicates whether the form is singular, dual or plural.
  • Gender: the gender of noun, verb or adjective forms is provided: masculine, feminine, neuter, etc.
  • Case: the function that the noun or adjective plays within a sentence.
  • Degree: specifies whether an adjective is in its positive, comparative or superlative form.
  • Definiteness: specifies whether a noun or adjective refers to a concrete or general concept.
  • Polarity: indicates whether a verb, adjective or noun is in a negative form.
  • Contractions: shortened forms of a word or group of words are provided.
  • Pronominal Clitics: clitic pronouns are identified and tagged.
  • Formality: indicates the social status of the speaker in relation to the context.
  • Frequency: relative frequency of the form based on a large general-purpose corpus.
  • Named Entities: pre-defined entities are tagged as person names, places, organizations, etc.
  • Offensive: indicates whether the form might be considered offensive in certain contexts.
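As a rough sketch of how these features could be held in code, the Python record below gives each feature an optional field. The class name, field names, value spellings and the sample entry are assumptions for illustration only; the actual file format is not described here. Contractions, clitics and named entities are left out to keep the sketch short.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class LexicalEntry:
    """One inflected form with optional morphological features (hypothetical schema)."""
    form: str
    lemma: str
    pos: str
    voice: Optional[str] = None
    tense: Optional[str] = None
    aspect: Optional[str] = None
    mood: Optional[str] = None
    person: Optional[str] = None
    number: Optional[str] = None
    gender: Optional[str] = None
    case: Optional[str] = None
    degree: Optional[str] = None
    definiteness: Optional[str] = None
    polarity: Optional[str] = None
    formality: Optional[str] = None
    frequency: Optional[float] = None
    offensive: bool = False

    def features(self):
        """Return only the features that are set for this form."""
        return {k: v for k, v in asdict(self).items() if v is not None and v is not False}


# Sample values are illustrative, not taken from Bitext data.
entry = LexicalEntry(form="bigger", lemma="big", pos="adjective", degree="comparative")
print(entry.features())
```

Features that do not apply to a form simply stay unset, so one record type can describe languages that use only a few of these features as well as languages that use most of them.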

 

Per-language specifications

Download a file with the specifications of 20+ of our languages. Each language includes several lists of forms (inflectional, derivational, extended, named entities, offensive words…) along with a frequency indication. The total number of forms per language goes up to 80 million. Some examples:

 

  • Thai: 40,000 forms
  • English: 180,000 forms
  • Japanese: 500,000 forms
  • Spanish: 2,500,000 forms
  • Finnish: 80,000,000 forms

 
