Arabic (AR) Language Data
Volume of Language Data
Total number of forms
- Verbs: 10,000,000 forms (59%)
- Nouns: 5,000,000 forms (29%)
- Adjectives: 1,500,000 forms (9%)
- Other: 500,000 forms (3%)
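The percentages above can be cross-checked against the listed counts. The short Python sketch below assumes the four categories are exhaustive, so the total is simply their sum; the variable names are illustrative only.

```python
# Form counts by part of speech, copied from the list above.
form_counts = {
    "Verbs": 10_000_000,
    "Nouns": 5_000_000,
    "Adjectives": 1_500_000,
    "Other": 500_000,
}

# Assumption: the four categories cover every form, so the total is their sum.
total_forms = sum(form_counts.values())

# Share of each category, rounded to the nearest whole percent.
shares = {pos: round(100 * n / total_forms) for pos, n in form_counts.items()}

print(f"Total: {total_forms:,} forms")
for pos, pct in shares.items():
    print(f"{pos}: {pct}%")
```

Under that assumption the counts sum to 17,000,000 forms, and the rounded shares match the percentages listed.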
Total number of lemmas
Each form is annotated with its corresponding lemma, POS, and morphological attributes: voice, tense, mood, number, person, gender, case, state, possessive-number, possessive-person and possessive-gender.
- Lemma: The canonical form of the inflected word.
- POS: Part of speech, such as noun, verb, adjective, etc.
- Voice: Whether the verb form is active or passive.
- Tense: When the action takes place, such as past, present, future, etc.
- Mood: Modality of the verb form: indicative, subjunctive, imperative, etc.
- Person: Whether the verb or pronoun refers to the first, second or third person.
- Number: Whether the form is singular, dual or plural.
- Gender: Whether the noun, verb or adjective form is masculine, feminine, neuter, etc.
- Case: The function that the noun or adjective plays within a sentence.
- State: Whether the noun or adjective refers to a concrete or general concept.
- Possessive (number, person, gender): Clitic pronouns are identified and tagged.
- Frequency: Relative frequency of the form based on a large general-purpose corpus.
- Named entity: Pre-defined entities are tagged as person names, places, organizations, etc.
- Offensive: Indicates whether the form might be considered offensive in certain contexts.
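To make the annotation scheme concrete, one annotated form could be modelled as a record along the following lines. This is only a sketch: the class name `AnnotatedForm`, the field names and the value vocabulary ("active", "third", and so on) are assumptions made for illustration, not the resource's actual delivery format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnnotatedForm:
    """Hypothetical record for one form; field names are illustrative only."""
    form: str
    lemma: str
    pos: str
    voice: Optional[str] = None
    tense: Optional[str] = None
    mood: Optional[str] = None
    person: Optional[str] = None
    number: Optional[str] = None
    gender: Optional[str] = None
    case: Optional[str] = None
    state: Optional[str] = None
    possessive_number: Optional[str] = None
    possessive_person: Optional[str] = None
    possessive_gender: Optional[str] = None
    frequency: Optional[float] = None   # left unset: no real figures are shown here
    named_entity: Optional[str] = None
    offensive: Optional[bool] = None


# One reading of the unvocalized form: kataba, "he wrote".
kataba = AnnotatedForm(
    form="كتب", lemma="كتب", pos="verb",
    voice="active", tense="past",
    person="third", number="singular", gender="masculine",
)

# kitab + -ha, "her book": the clitic pronoun fills the possessive attributes.
# Case and state are left unset because the unvocalized form does not show them.
her_book = AnnotatedForm(
    form="كتابها", lemma="كتاب", pos="noun",
    number="singular", gender="masculine",
    possessive_number="singular", possessive_person="third",
    possessive_gender="feminine",
)

print(kataba)
print(her_book)
```

Attributes that do not apply to a given part of speech simply stay unset (`None`) in this sketch.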
Inflectional Morphology Data
The Lexical Resource for Modern Standard Arabic (MSA) contains all forms for every part of speech: nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, numerals and particles.
Derivational Morphology Data
The resource also includes all derivational forms, including adjectives derived from nouns (nisba), verbal nouns, and adverbs derived from adjectives.
Extended Morphology Data
The resource also contains extended lists of inflectional and derivational forms, obtained by accounting for additional morphological phenomena such as common combinations of productive prefixes.
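As a rough illustration of what extending a form list with prefix combinations can look like, the Python sketch below attaches every combination of three common Arabic proclitics (wa- "and", bi- "with/in", al- "the") to a base form. The slot inventory and the function name `extend_with_prefixes` are assumptions chosen for the example and do not describe how the resource itself was built. The sketch uses naive concatenation and ignores orthographic adjustments, such as li- followed by al- being written لل.

```python
from itertools import product

# Hypothetical proclitic slots, ordered from outermost to innermost.
# The empty string means the slot is left unfilled.
CONJUNCTIONS = ["", "و"]   # wa- "and"
PREPOSITIONS = ["", "ب"]   # bi- "with/in"
ARTICLES = ["", "ال"]      # al- "the"


def extend_with_prefixes(base):
    """Return the base form with every combination of the slots above attached."""
    forms = []
    for conj, prep, art in product(CONJUNCTIONS, PREPOSITIONS, ARTICLES):
        # Naive concatenation: no orthographic or grammatical filtering is applied.
        forms.append(conj + prep + art + base)
    return forms


extended = extend_with_prefixes("كتاب")  # kitab, "book"
for form in extended:
    print(form)
```

With two options in each of the three slots, the sketch yields eight strings per base form, the first being the bare form itself. A real extension step would also filter out combinations that are not grammatical.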
Contains the relative frequency of each word in the above lists within the language.
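Relative frequency is conventionally the number of occurrences of a form divided by the total number of tokens in the corpus. The sketch below shows that calculation on a four-token toy list; the token list and the function name `relative_frequencies` are made up for illustration and say nothing about the corpus or method actually used for this resource.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each distinct token to its share of all tokens in the list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {form: n / total for form, n in counts.items()}


# Toy input only: four tokens, not real corpus data.
toy_corpus = ["في", "كتاب", "في", "من"]
freqs = relative_frequencies(toy_corpus)

for form, freq in freqs.items():
    print(form, freq)
```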
Complementary Semantic Annotations
Contains named-entity data covering person names, places, companies and organizations.
Indicates, for each word, whether it might be considered offensive in certain contexts.