Najdi Arabic Language Data
Najdi is one of the three main Arabic dialects spoken in Saudi Arabia, along with Hejazi and Gulf Arabic. Of the three main Najdi sub-dialects (Northern, Central and Southern), the Central variant, spoken in Riyadh, is the most widely used.
As with Modern Standard Arabic, the main orthographic convention for Najdi is to omit tashkil (the diacritics that indicate short vowels) and to write only the consonant-distinguishing dots (i’jam); long vowels are written with letters.
Volume of Language Data
Total number of forms
The total number of forms is approximately 18,000,000.
The following is a breakdown of the approximate number of forms in the Najdi Arabic Language Data:
- Non-inflectional morphology:
  - About 1,000 forms for determiners, pronouns, prepositions…
  - Note: this will include forms constructed using the pronominal suffixes, but will not include the definite article or any other prefixes.
- Inflectional and derivational morphology:
  - Verbs: 1,000,000 forms
  - Nouns: 750,000 forms
  - Adjectives: 250,000 forms
Total number of lemmas
The data contains all the forms, lemmas, POS and morphological attributes (voice, tense, mood, transitivity, number, person, gender, case, state, pronominal-number, pronominal-person and pronominal-gender) for non-inflectional POSs (determiners, pronouns, prepositions…), i.e. POSs not covered under Inflectional and Derivational Morphology.
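As an illustration of how a single entry with the fields listed above might be represented, here is a minimal Python sketch. The attribute names come from the list above; the record layout, the class name FormEntry and the example values are assumptions made for illustration, not the delivery format of the data.

```python
from dataclasses import dataclass, field
from typing import Dict

# Attribute names as listed in this document; everything else is hypothetical.
MORPH_ATTRIBUTES = (
    "voice", "tense", "mood", "transitivity", "number", "person",
    "gender", "case", "state", "pronominal-number",
    "pronominal-person", "pronominal-gender",
)


@dataclass
class FormEntry:
    """One surface form with its lemma, POS and morphological attributes."""
    form: str
    lemma: str
    pos: str
    attributes: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self):
        # Reject attribute names that are not in the documented list.
        unknown = set(self.attributes) - set(MORPH_ATTRIBUTES)
        if unknown:
            raise ValueError(f"unknown attributes: {sorted(unknown)}")


# Example values are invented for illustration only.
entry = FormEntry(
    form="هذا",
    lemma="هذا",
    pos="determiner",
    attributes={"number": "singular", "gender": "masculine"},
)
```

Only the attributes that apply to a given POS would be filled in; the sketch leaves the rest absent rather than storing empty values.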
Inflectional Morphology Data
The Lexical Resource for Najdi contains all the forms for all POSs: nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, numerals and particles.
Derivational Morphology Data
The Lexical Resource also includes all derivational forms, such as adjectives derived from nouns (nisba), verbal nouns, and adverbs derived from adjectives.
Extended Morphology Data
The data also covers the forms obtained by extending the inflectional and derivational form lists with additional morphological phenomena, such as common combinations of productive prefixes.
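The idea of extending a form list with prefix combinations can be sketched in Python as follows. The prefix inventory (two conjunctions, two prepositions and the definite article) and the function name extend_forms are assumptions chosen for illustration; the set of combinations actually covered by the data is not specified here. Plain concatenation is also a simplification, since some prefix combinations involve spelling adjustments.

```python
from itertools import product

# Hypothetical prefix slots in attachment order; "" means the slot is empty.
CONJUNCTIONS = ["", "و", "ف"]   # wa- "and", fa- "so"
PREPOSITIONS = ["", "ب", "ك"]   # bi- "with/in", ka- "like"
ARTICLE = ["", "ال"]            # al- "the"


def extend_forms(base_form):
    """Return base_form with every combination of the prefix slots attached."""
    forms = []
    for conj, prep, art in product(CONJUNCTIONS, PREPOSITIONS, ARTICLE):
        forms.append(conj + prep + art + base_form)
    return forms


# The bare form is included because every slot may be empty.
extended = extend_forms("بيت")
```

With three slots of sizes 3, 3 and 2, each base form yields 18 candidates, which suggests how a few productive prefixes can multiply the size of a form list.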
Complementary Semantic Annotations
Information regarding the usage of specific forms or lemmas is included. Words that are rarely used will be tagged as “rare”. Words borrowed from foreign languages that are not widely understood, or which are only meaningful as part of a larger phrase, will be tagged by Licensor as “foreign”.
Words or spelling variations which are not officially recognized but are widely used in texting communication will be tagged by Licensor as “SMS”.
Data on offensive, vulgar and sensitive words, with all their lemmas, POS and attributes, is included. Words that are meant to demean or express hatred for a specific person or group based on race, ethnicity, sexual orientation, etc. will be tagged by Licensor as “offensive”.
Words that make explicit and offensive references to sex or bodily functions, or that are rude or in bad taste (including profanity), will be tagged as “vulgar”.
Words that are not themselves offensive or vulgar, but could be part of potentially vulgar, offensive or discomforting phrases will be tagged by Licensor as “sensitive”.
Words that could potentially be vulgar or offensive when used in a particular context will also be tagged by Licensor as “sensitive”.
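A consumer of the data might use these usage tags to filter a word list. The following Python sketch assumes each entry is a dictionary with a "form" and a list of "tags"; that layout, the function name filter_entries and the default blocked set are hypothetical, and the example words are placeholders.

```python
# Tag names as described in this document.
USAGE_TAGS = {"rare", "foreign", "SMS", "offensive", "vulgar", "sensitive"}

# Hypothetical default: which tags an application chooses to exclude.
BLOCKED_BY_DEFAULT = frozenset({"offensive", "vulgar"})


def filter_entries(entries, blocked=BLOCKED_BY_DEFAULT):
    """Keep only entries that carry none of the blocked tags."""
    return [e for e in entries if not (set(e["tags"]) & set(blocked))]


# Placeholder entries for illustration only.
sample = [
    {"form": "word_a", "tags": []},
    {"form": "word_b", "tags": ["rare"]},
    {"form": "word_c", "tags": ["vulgar", "SMS"]},
    {"form": "word_d", "tags": ["sensitive"]},
]
kept = [e["form"] for e in filter_entries(sample)]
```

Passing a larger blocked set, for example one that also contains "sensitive", gives a stricter filter without changing the function.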
Data on frequently used words, with all their lemmas, POS and attributes, is also included. Each frequently used word falls under, and will be tagged by Licensor with, one of the following categories:
- Body part
- Family name
- Female first name
- Greetings (including multi-word expressions)
- Male first name
The data regarding the transitivity of verbs, which determines the applicability of pronominal suffixes to verbal forms, is included.
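To illustrate how transitivity can gate pronominal suffixes, here is a minimal Python sketch. It assumes transitivity is available as a boolean, uses an illustrative suffix inventory written in standard Arabic orthography (the inventory in the Najdi data may differ), and attaches suffixes by plain concatenation, which ignores any stem changes. The function name suffixed_forms is hypothetical.

```python
# Illustrative object-pronoun suffixes; not necessarily the set used in the data.
OBJECT_SUFFIXES = ["ني", "ك", "ه", "ها", "نا", "كم", "هم"]


def suffixed_forms(verb_form, transitive):
    """Return verb_form with each object suffix attached, if the verb allows it."""
    if not transitive:
        # Intransitive verbs take no object suffixes in this sketch.
        return []
    return [verb_form + suffix for suffix in OBJECT_SUFFIXES]


with_objects = suffixed_forms("شاف", transitive=True)    # "saw"
without_objects = suffixed_forms("نام", transitive=False)  # "slept"
```

Checking the transitivity attribute before generating suffixed forms keeps forms such as an intransitive verb with an object pronoun out of the list.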