Gulf Arabic Language Data

Gulf Arabic, also called Khaliji, is a continuum of Arabic dialects spoken around the coasts of the Persian Gulf in Kuwait, Bahrain, Qatar, the UAE and parts of Saudi Arabia, Oman and Iraq. There is no standardized version of the dialect, but its sub-dialects are highly mutually intelligible.

As with Modern Standard Arabic, the main orthographic convention for Gulf Arabic is to omit tashkil (the diacritics that indicate short vowels) and to write only the letters with their consonant pointing (i’jam); long vowels are written with letters rather than diacritics.
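To illustrate the convention of omitting tashkil, here is a minimal Python sketch that strips the short-vowel diacritics from a vocalized string. The function name `strip_tashkil` and the exact set of code points treated as tashkil are assumptions made for this example; they are not taken from the Gulf Arabic Language Data itself.

```python
import re

# Assumed tashkil range: U+064B (fathatan) through U+0652 (sukun),
# plus U+0670 (superscript alef). Other marks may need adding.
TASHKIL = re.compile("[\u064B-\u0652\u0670]")


def strip_tashkil(text: str) -> str:
    """Remove tashkil marks, leaving the bare letters untouched."""
    return TASHKIL.sub("", text)


# kaf + fatha, ta + fatha, ba + fatha (a fully vocalized three-letter word)
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
bare = strip_tashkil(vocalized)
print(len(vocalized), len(bare))  # 6 code points before, 3 after
```

Text that contains no tashkil passes through unchanged, so the function can be applied safely to unvocalized input.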

Volume of Language Data

Total number of forms

The total number of forms is approximately 18,000,000.

The following is a breakdown of the approximate number of forms in the Gulf Arabic Language Data:

Non-inflectional morphology:
  - About 5,000 forms for prepositions, conjunctions, interjections…
  - Note: this includes forms constructed using the pronominal suffixes, but does not include forms with the definite article or any other prefixes.
Inflectional and derivational morphology:
  - Nouns: 750,000 forms
  - Verbs: 1,000,000 forms
  - Adjectives: 250,000 forms

Total number of lemmas

20,000 lemmas

Features

Each form is annotated with its corresponding lemma, POS, and morphological attributes: voice, tense, mood, number, person, gender, case, state, possessive-number, possessive-person and possessive-gender.
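As a rough illustration of this annotation scheme, the sketch below models a single annotated form as a Python dataclass. The class name `FormEntry`, the field names, the tag values and the sample entry are all hypothetical; the actual file format and tag inventory of the Gulf Arabic Language Data are not specified here.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class FormEntry:
    """One inflected form with its lemma, POS and morphological attributes."""
    form: str
    lemma: str
    pos: str
    voice: Optional[str] = None
    tense: Optional[str] = None
    mood: Optional[str] = None
    number: Optional[str] = None
    person: Optional[str] = None
    gender: Optional[str] = None
    case: Optional[str] = None
    state: Optional[str] = None
    possessive_number: Optional[str] = None
    possessive_person: Optional[str] = None
    possessive_gender: Optional[str] = None

    def attributes(self) -> dict:
        """Return only the morphological attributes that are set."""
        core = {"form", "lemma", "pos"}
        return {k: v for k, v in asdict(self).items()
                if k not in core and v is not None}


# Illustrative entry only (a noun with a first-person singular suffix);
# the tag values are assumed, not copied from the dataset.
entry = FormEntry(
    form="كتابي", lemma="كتاب", pos="noun",
    number="singular", gender="masculine",
    possessive_number="singular", possessive_person="first",
)
print(entry.attributes())
```

Leaving inapplicable attributes as `None` keeps one record type usable across all parts of speech, for example voice and tense for verbs, and case and state for nouns and adjectives.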

Lemma

The canonical form for the inflected word.

POS

Part of Speech such as noun, verb, adjective, etc.

Voice

Verb form is classified as active or passive.

Tense

Specifies when the action takes place such as past, present, future, etc.



Aspect

Not applicable.



Mood

Modality of the verb form: indicative, subjunctive, imperative, etc.



Person

Verb or pronoun refers to the first, second or third person.



Number

State of being singular, dual or plural.



Gender

Grammatical gender of the noun, verb or adjective form: masculine, feminine, neuter, etc.



Case

The function that the noun or adjective plays within a sentence.

Degree

Not applicable.

Definiteness State

Specifies whether a noun or adjective refers to a concrete or general concept.

Negative

Not applicable.

Contractions

Not applicable.



Pronominal Clitics

Clitic pronouns are identified and tagged.

Formality

Not applicable.



Frequency

Relative frequency of the form based on a large general-purpose corpus.



Named Entities

Pre-defined entities are tagged as person names, places, organizations, etc.

Offensive

Indicates whether the form might be considered offensive in certain contexts.

Inflectional Morphology Data

The Lexical Resource for Gulf Arabic contains all the forms for all POSs: nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, numerals and particles.

Derivational Morphology Data

The Lexical Resource also includes all derivational forms, such as adjectives derived from nouns (nisba) and verbal nouns or adverbs derived from adjectives.

Extended Morphology Data

The data also covers the extended lists of inflectional and derivational forms obtained by considering additional morphological phenomena, such as common combinations of productive prefixes.

Frequency Indication

Frequency Indication provides relative frequency information for words in a corpus. Frequency is expressed as logarithmic “buckets” or “frequency groups”, from 255 (the most common forms) down to 0 (the least common forms).
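The exact formula behind the frequency groups is not given here. As one possible reading, the Python sketch below maps a raw corpus count to a bucket between 0 and 255 on a logarithmic scale. The function name `frequency_bucket`, its parameters and the scaling rule are assumptions for illustration only and may differ from how the data was actually produced.

```python
import math


def frequency_bucket(count: int, max_count: int, top: int = 255) -> int:
    """Map a corpus count to a log-scaled bucket in the range 0..top.

    Assumed scheme: a count of 1 lands in bucket 0 and the most frequent
    form (count == max_count) lands in bucket `top`.
    """
    if count < 1 or max_count < 1 or count > max_count:
        raise ValueError("expected 1 <= count <= max_count")
    if max_count == 1:
        return top
    return round(top * math.log(count) / math.log(max_count))


# Hypothetical counts; a tenfold change in count moves the bucket by a
# roughly constant step, which is what "logarithmic" means here.
for c in (1, 100, 10_000, 1_000_000):
    print(c, frequency_bucket(c, max_count=1_000_000))
```

A logarithmic scale keeps the buckets meaningful across the long tail of rare forms, where linear scaling would collapse almost everything into bucket 0.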

Complementary Semantic Annotations

Usage

Usage information is included for specific forms or lemmas. Words that are rarely used will be tagged as “rare”.

Words borrowed from foreign languages that are not widely understood, or which are only meaningful as part of a larger phrase, will be tagged as “foreign”.

Words or spelling variations which are not officially recognized but are widely used in texting communication will be tagged as “non-standard”.

Offensive Language

Bitext provides data on offensive, vulgar and sensitive words, with all their lemmas, POS and attributes. Offensive words will be marked as “offensive” (highly offensive derogatory slurs based on race, sexual orientation, etc.). Vulgar words will be marked as “vulgar” (strong curse words).

Words that are sensitive will be marked as “sensitive” (clinical names of genitalia, politically charged words, light curse words, etc.). Words that are not themselves offensive, vulgar, or sensitive, but could be part of potentially disturbing phrases, will be marked as “sensitive-in-context”.
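A consumer of these labels will typically want to filter forms by sensitivity level. The following Python sketch shows one way to do that, assuming each form arrives as a pair of a form string and a set of labels. The function `filter_forms`, the pair layout and the placeholder forms are hypothetical; only the four label names come from the description above.

```python
from typing import Iterable, List, Set, Tuple

# Label names as described in this section.
SENSITIVITY_LABELS = {"offensive", "vulgar", "sensitive", "sensitive-in-context"}


def filter_forms(entries: Iterable[Tuple[str, Set[str]]],
                 blocked: Set[str]) -> List[str]:
    """Return the forms that carry none of the blocked labels."""
    unknown = set(blocked) - SENSITIVITY_LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [form for form, labels in entries if not (labels & set(blocked))]


# Placeholder entries; real data would pair actual forms with their labels.
entries = [
    ("form_a", set()),
    ("form_b", {"vulgar"}),
    ("form_c", {"sensitive-in-context"}),
]

strict = filter_forms(entries, SENSITIVITY_LABELS)        # block every label
lenient = filter_forms(entries, {"offensive", "vulgar"})  # block only the strongest
print(strict, lenient)
```

Passing the blocked set as a parameter lets each application choose its own threshold, for example a strict filter for autocomplete suggestions and a lenient one for search indexing.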
