Linguistic resources

Lexical resources

Bitext Lexical Data Resources are a comprehensive and consistent set of language resources, with support for more than 100 languages and variants. This proprietary data has been developed to meet the highest quality standards in the field of computational linguistics.

Bitext data is used in production by some of the world’s largest and most successful software companies, including three of the top five NASDAQ companies.

A full description document of Bitext’s Lexical Data Resources is available for download.

We currently offer 77 languages and 26 variants (and we regularly add support for additional languages as we develop new resources):

Languages

    • Afrikaans
    • Albanian
    • Amharic
    • Arabic
    • Armenian
    • Assamese
    • Azeri
    • Basque
    • Belarusian
    • Bengali
    • Bulgarian
    • Burmese
    • Catalan
    • Chinese
    • Croatian
    • Czech
    • Danish
    • Dutch
    • English
    • Esperanto
    • Estonian
    • Finnish
    • French
    • Galician
    • Georgian
    • German
    • Greek
    • Gujarati
    • Hebrew
    • Hindi
    • Hungarian
    • Icelandic
    • Indonesian
    • Irish Gaelic
    • Italian
    • Japanese
    • Kannada
    • Kazakh
    • Khmer
    • Korean
    • Kyrgyz
    • Lao
    • Latvian
    • Lithuanian
    • Macedonian
    • Malay
    • Malayalam
    • Marathi
    • Mongolian
    • Nepali
    • Norwegian Bokmål
    • Norwegian Nynorsk
    • Oriya
    • Persian
    • Polish
    • Portuguese
    • Punjabi
    • Romanian
    • Russian
    • Serbian
    • Sindhi
    • Sinhala
    • Slovak
    • Slovenian
    • Spanish
    • Swahili
    • Swedish
    • Tagalog
    • Tamil
    • Telugu
    • Thai
    • Turkish
    • Ukrainian
    • Urdu
    • Uzbek
    • Vietnamese
    • Zulu

Variants

    • Arabic (MSA)
    • Arabic (Gulf)
    • Arabic (Najdi)
    • Chinese (Simplified)
    • Chinese (Traditional)
    • Dutch (Netherlands)
    • Dutch (Belgium)
    • English (US)
    • English (UK)
    • English (India)
    • Finnish (Standard)
    • Finnish (Colloquial)
    • French (France)
    • French (Canada)
    • French (Switzerland)
    • German (Germany)
    • German (Switzerland)
    • Italian (Italy)
    • Italian (Switzerland)
    • Portuguese (Portugal)
    • Portuguese (Brazil)
    • Spanish (Spain)
    • Spanish (North America)
    • Spanish (Central America)
    • Spanish (Andes)
    • Spanish (Southern Cone)

      Features

      • Lemma: the canonical form of the inflected word.
      • POS: the part of speech, such as noun, verb or adjective.
      • Voice: whether the verb form is active or passive.
      • Tense: when the action takes place (past, present, future, etc.).
      • Aspect: whether the action is complete, ongoing, habitual, etc.
      • Mood: the modality of the verb form (indicative, subjunctive, imperative, etc.).
      • Person: whether the verb or pronoun refers to the first, second or third person.
      • Number: whether the form is singular, dual or plural.
      • Gender: the gender of the noun, verb or adjective form (masculine, feminine, neuter, etc.).
      • Case: the grammatical function that the noun or adjective plays within a sentence.
      • Degree: whether an adjective is in its positive, comparative or superlative form.
      • Definiteness: whether a noun or adjective refers to a specific or a general entity.
      • Polarity: whether a verb, adjective or noun is in a negative form.
      • Contractions: shortened forms of a word or group of words.
      • Pronominal Clitics: clitic pronouns are identified and tagged.
      • Formality: the register of the form, reflecting the social standing of the speaker relative to the context.
      • Frequency: the relative frequency of the form in a large general-purpose corpus.
      • Named Entities: predefined entities are tagged as person names, places, organizations, etc.
      • Offensive: whether the form might be considered offensive in certain contexts.
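
To make the feature list above concrete, here is a minimal Python sketch of how one annotated form could be represented in memory. The class name, field names and tag values (`LexicalEntry`, `"VERB"`, `"present"`, and so on) are assumptions made for illustration only; they are not the actual Bitext schema or file format, and only a subset of the features is modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LexicalEntry:
    """One inflected form with a few of the features listed above.

    Hypothetical schema: field names and tag values are illustrative only.
    """
    form: str                          # inflected surface form
    lemma: str                         # canonical form
    pos: str                           # part of speech
    tense: Optional[str] = None        # only meaningful for verb forms
    person: Optional[str] = None
    number: Optional[str] = None
    frequency: Optional[float] = None  # relative corpus frequency, if known
    offensive: bool = False


def index_by_form(entries):
    """Group entries by surface form; one form may have several analyses."""
    index = {}
    for entry in entries:
        index.setdefault(entry.form, []).append(entry)
    return index


# Invented sample data for the sketch.
entries = [
    LexicalEntry("runs", "run", "VERB", tense="present", person="3", number="singular"),
    LexicalEntry("runs", "run", "NOUN", number="plural"),
    LexicalEntry("ran", "run", "VERB", tense="past"),
]
index = index_by_form(entries)
lemmas = sorted({e.lemma for e in index["runs"]})
```

Indexing by surface form rather than by lemma reflects the fact that a single form such as "runs" can carry more than one analysis, each with its own feature values.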

       

      Per-language specifications

      A file with the specifications of more than 20 of our languages is available for download. Each language includes several lists of forms (inflectional, derivational, extended, named entities, offensive words, etc.), each with a frequency indication. The total number of forms per language ranges up to 80 million. Some examples:

      • Thai: 40,000 forms
      • English: 180,000 forms
      • Japanese: 500,000 forms
      • Spanish: 2,500,000 forms
      • Finnish: 80,000,000 forms
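
As a rough illustration of how a form list with frequency indications might be consumed, the sketch below reads a small tab-separated sample and picks the most frequent analysis of an ambiguous form. The column layout (`form`, `lemma`, `pos`, `frequency`), the helper names and the frequency values are all invented for this example; the real delivery format is described in the specification files, not here.

```python
import csv
import io

# Hypothetical tab-separated sample; columns and values are invented.
SAMPLE = (
    "form\tlemma\tpos\tfrequency\n"
    "saw\tsee\tVERB\t0.8\n"
    "saw\tsaw\tNOUN\t0.2\n"
    "seen\tsee\tVERB\t1.0\n"
)


def load_forms(text):
    """Parse the tab-separated text into dicts with a numeric frequency."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for row in reader:
        row["frequency"] = float(row["frequency"])
        rows.append(row)
    return rows


def most_frequent_analysis(rows, form):
    """Return the highest-frequency analysis of a form, or None if unknown."""
    candidates = [r for r in rows if r["form"] == form]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["frequency"])


rows = load_forms(SAMPLE)
best = most_frequent_analysis(rows, "saw")
```

For lists with millions of forms, a linear scan like this would be replaced by a dictionary or an on-disk index; the scan is used here only to keep the sketch short.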
