Arabic (AR) Language Data

Volume of Language Data

Total number of forms

17 million forms

    • Verbs: 10,000,000 forms (59%)
    • Nouns: 5,000,000 forms (29%)
    • Adjectives: 1,500,000 forms (9%)
    • Other: 500,000 forms (3%)

Total number of lemmas

22,000 lemmas
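
As a quick sanity check on the figures above, the following sketch recomputes the share of each part of speech and the average number of forms per lemma. The counts are the ones quoted in this data sheet; the variable names are illustrative only.

```python
# Counts as quoted in this data sheet.
forms_by_pos = {
    "Verbs": 10_000_000,
    "Nouns": 5_000_000,
    "Adjectives": 1_500_000,
    "Other": 500_000,
}
total_lemmas = 22_000

total_forms = sum(forms_by_pos.values())

# Share of each part of speech, rounded to the nearest whole percent.
shares = {pos: round(100 * n / total_forms) for pos, n in forms_by_pos.items()}

# Average number of forms per lemma.
forms_per_lemma = total_forms / total_lemmas

print(total_forms)
print(shares)
print(round(forms_per_lemma))
```

The rounded shares match the percentages listed above, and the ratio works out to roughly 770 forms per lemma.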

Features

Each form is annotated with its corresponding lemma, POS, and morphological attributes: voice, tense, mood, number, person, gender, case, state, possessive-number, possessive-person and possessive-gender.
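
To make the annotation scheme concrete, here is a minimal sketch of how one annotated form could be represented. The class name, field names and value labels are assumptions made for illustration; the actual delivery format of the Lexical Resource is not described here.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class FormEntry:
    """One annotated form. All field names are hypothetical."""
    form: str
    lemma: str
    pos: str
    voice: Optional[str] = None
    tense: Optional[str] = None
    mood: Optional[str] = None
    number: Optional[str] = None
    person: Optional[int] = None
    gender: Optional[str] = None
    case: Optional[str] = None
    state: Optional[str] = None
    possessive_number: Optional[str] = None
    possessive_person: Optional[int] = None
    possessive_gender: Optional[str] = None


# Example: "they (masculine) wrote", a past-tense active verb form.
entry = FormEntry(
    form="كتبوا",
    lemma="كتب",
    pos="verb",
    voice="active",
    tense="past",
    number="plural",
    person=3,
    gender="masculine",
)

# Keep only the attributes that apply to this form.
applicable = {k: v for k, v in asdict(entry).items() if v is not None}
print(applicable)
```

Attributes that do not apply to a given form (for example, case on a verb) are simply left as `None` in this sketch.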

Lemma

The canonical form for the inflected word.

POS

Part of Speech such as noun, verb, adjective, etc.

Voice

Verb form is classified as active or passive.

Tense

Specifies when the action takes place: past, present, future, etc.

Aspect

Not applicable.

Mood

Modality of the verb form: indicative, subjunctive, imperative, etc.

Person

Indicates whether the verb or pronoun refers to the first, second or third person.

Number

State of being singular, dual or plural.

Gender

Grammatical gender of the noun, verb or adjective form: masculine, feminine, neuter, etc.

Case

The grammatical function that the noun or adjective plays within a sentence: nominative, accusative or genitive.

Degree

Not applicable.

Definiteness State

Specifies whether a noun or adjective refers to a specific (definite) or general (indefinite) concept.

Negative

Not applicable.

Contractions

Not applicable.

Pronominal Clitics

Clitic pronouns are identified and tagged.

Formality

Not applicable.

Frequency

Relative frequency of the form based on a large general-purpose corpus.

Named Entities

Pre-defined entities are tagged as person names, places, organizations, etc.

Offensive

Indicates whether the form might be considered offensive in certain contexts.

Regional Variants

In addition to the lexical data for Modern Standard Arabic (MSA), the Lexical Resource also contains the equivalent lexical data for the following dialects:

    • Egyptian Arabic
    • Gulf Arabic
    • Najdi Arabic

Inflectional Morphology Data

The Lexical Resource for Modern Standard Arabic (MSA) contains all the forms for all parts of speech: nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, numerals and particles.

Derivational Morphology Data

It also includes all derivational forms, such as adjectives derived from nouns (nisba), verbal nouns, and adverbs derived from adjectives.

Extended Morphology Data

It also contains extended lists of inflectional and derivational forms that account for additional morphological phenomena, such as common combinations of productive prefixes.
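
The idea of combining productive prefixes can be sketched as follows. This is a naive illustration, not the method used to build the resource: it assumes only three common proclitics (the conjunction و, the preposition ب and the definite article ال), stacks them in that fixed order, and ignores orthographic adjustments that some combinations require. The function name is hypothetical.

```python
from itertools import product

# Each slot is optional; the empty string means "slot not filled".
CONJUNCTIONS = ["", "و"]   # wa-, "and"
PREPOSITIONS = ["", "ب"]   # bi-, "with, by"
ARTICLES = ["", "ال"]      # al-, "the"


def prefixed_forms(base: str) -> list[str]:
    """Return the base form with every non-empty prefix combination."""
    forms = []
    for conj, prep, art in product(CONJUNCTIONS, PREPOSITIONS, ARTICLES):
        prefix = conj + prep + art
        if prefix:  # skip the bare form, which is already in the lexicon
            forms.append(prefix + base)
    return forms


forms = prefixed_forms("كتاب")  # "book"
for f in forms:
    print(f)
```

With two options per slot, a single base form yields seven prefixed variants, which gives a sense of how quickly such extensions multiply the number of forms.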

Frequency Indication

Contains the relative frequency with which each word in the above lists appears in the language.
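
The source does not state how relative frequency is computed. One common definition, shown here purely as an assumption, is the number of occurrences of a form divided by the total number of tokens in the corpus. The toy token list and the function name are illustrative.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each distinct form to its share of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {form: n / total for form, n in counts.items()}


# Toy corpus of ten tokens; a real corpus would be far larger.
tokens = ["في", "كتاب", "في", "بيت", "في", "كتاب", "من", "بيت", "في", "من"]
freqs = relative_frequencies(tokens)

# List forms from most to least frequent.
for form, freq in sorted(freqs.items(), key=lambda item: -item[1]):
    print(form, freq)
```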

Complementary Semantic Annotations

Named Entities

Contains named-entity data covering person names, places, companies and organizations.

Offensive Language

Contains information per word indicating if the word might be considered offensive in certain contexts.

SAN FRANCISCO, USA

541 Jefferson Ave., Ste. 100

Redwood City

CA 94063

MADRID, SPAIN

José Echegaray 8, Building 3

Parque Empresarial Las Rozas

28232 Las Rozas