Lexical Resources
Bitext Lexical Data Resources are the most comprehensive and consistent set of language resources in the world, with support for more than 100 languages and dialects. This proprietary data has been developed to meet the highest quality standards in the field of computational linguistics.
Bitext data is used in production by some of the world’s largest and most successful software companies, including 3 out of the top 5 NASDAQ companies.
A full description document of Bitext’s Lexical Data Resources is available for download.

We currently offer 77 languages and 26 variants (and we regularly add support for additional languages as we develop new resources):
Languages
- Afrikaans
- Albanian
- Amharic
- Arabic
- Armenian
- Assamese
- Azeri
- Basque
- Belarusian
- Bengali
- Bulgarian
- Burmese
- Catalan
- Chinese (Simplified)
- Chinese (Traditional)
- Croatian
- Czech
- Danish
- Dutch
- English
- Esperanto
- Estonian
- Finnish
- French
- Galician
- Georgian
- German
- Greek
- Gujarati
- Hebrew
- Hindi
- Hungarian
- Icelandic
- Indonesian
- Irish Gaelic
- Italian
- Japanese
- Kannada
- Kazakh
- Khmer
- Korean
- Kyrgyz
- Lao
- Latvian
- Lithuanian
- Macedonian
- Malay
- Malayalam
- Marathi
- Mongolian
- Nepali
- Norwegian Bokmål
- Norwegian Nynorsk
- Oriya
- Persian
- Polish
- Portuguese
- Punjabi
- Romanian
- Russian
- Serbian
- Sindhi
- Sinhala
- Slovak
- Slovenian
- Spanish
- Swahili
- Swedish
- Tagalog
- Tamil
- Telugu
- Thai
- Turkish
- Ukrainian
- Urdu
- Uzbek
- Vietnamese
- Zulu
Variants
- Arabic (MSA)
- Arabic (Gulf)
- Arabic (Najdi)
- Dutch (Netherlands)
- Dutch (Belgium)
- English (US)
- English (UK)
- English (India)
- Finnish (Standard)
- Finnish (Colloquial)
- French (France)
- French (Canada)
- French (Switzerland)
- German (Germany)
- German (Switzerland)
- Italian (Italy)
- Italian (Switzerland)
- Portuguese (Portugal)
- Portuguese (Brazil)
- Spanish (Spain)
- Spanish (North America)
- Spanish (Central America)
- Spanish (Andes)
- Spanish (Southern Cone)
Features
- Lemma: the canonical form of the inflected word is provided.
- POS: the part of speech (noun, verb, adjective, etc.) is defined.
- Voice: the verb form is classified as active or passive.
- Tense: specifies when the action takes place (past, present, future, etc.).
- Aspect: indicates whether the action is complete, ongoing, habitual, etc.
- Mood: the modality of the verb form is provided (indicative, subjunctive, imperative, etc.).
- Person: indicates whether a verb or pronoun refers to the first, second or third person.
- Number: indicates whether the form is singular, dual or plural.
- Gender: noun, verb or adjective forms are marked as masculine, feminine, neuter, etc.
- Case: the function that the noun or adjective plays within a sentence.
- Degree: specifies whether an adjective is in its positive, comparative or superlative form.
- Definiteness: specifies whether a noun or adjective refers to a specific or general concept.
- Polarity: indicates whether a verb, adjective or noun is in a negative form.
- Contractions: shortened forms of a word or group of words are provided.
- Pronominal Clitics: clitic pronouns are identified and tagged.
- Formality: indicates the social status of the speaker in relation to the context.
- Frequency: relative frequency of the form based on a large general-purpose corpus.
- Named Entities: pre-defined entities are tagged as person names, places, organizations, etc.
- Offensive: indicates whether the form might be considered offensive in certain contexts.
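To make the features above more concrete, here is a minimal sketch of how a form-level lexicon with these annotations might be modeled and queried. This is not Bitext's actual file format or API: the `LexicalEntry` class, its field names, the helper functions and the toy entries (including the frequency values) are all assumptions invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalEntry:
    """One inflected form with its annotations (hypothetical layout)."""
    form: str                 # surface form as it appears in text
    lemma: str                # canonical form
    pos: str                  # part of speech
    features: tuple = ()      # (name, value) pairs, e.g. (("Number", "plural"),)
    frequency: float = 0.0    # relative frequency; values below are made up


def build_index(entries):
    """Map each lowercased surface form to all of its analyses."""
    index = defaultdict(list)
    for entry in entries:
        index[entry.form.lower()].append(entry)
    return index


def lemmas(index, form):
    """Return the distinct lemmas recorded for a surface form, sorted."""
    return sorted({e.lemma for e in index.get(form.lower(), [])})


# Toy English data, invented for illustration only.
ENTRIES = [
    LexicalEntry("mice", "mouse", "noun", (("Number", "plural"),), 0.4),
    LexicalEntry("saw", "see", "verb", (("Tense", "past"),), 0.7),
    LexicalEntry("saw", "saw", "noun", (("Number", "singular"),), 0.2),
    LexicalEntry("better", "good", "adjective", (("Degree", "comparative"),), 0.8),
]

INDEX = build_index(ENTRIES)
print(lemmas(INDEX, "saw"))  # ['saw', 'see']
```

Keeping one record per analysis, rather than one per form, lets an ambiguous form such as "saw" carry both its verb and its noun reading, each with its own lemma and features.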
Per-language specifications
Download a file with the specifications of 20+ of our languages. Each language includes several lists of forms (inflectional, derivational, extended, named entities, offensive words…), along with a frequency indication. The total number of forms per language ranges up to 80 million. Some examples:
- Thai: 40,000 forms
- English: 180,000 forms
- Japanese: 500,000 forms
- Spanish: 2,500,000 forms
- Finnish: 80,000,000 forms