Egyptian Arabic Language Data

Volume of Language Data

Total number of forms

The total number of forms is approximately 18,000,000.

The following is a breakdown of the approximate number of forms in the Egyptian Arabic Language Data:

  • Non-inflectional morphology:
      • About 5,000 forms for prepositions, conjunctions, interjections…
      • Note: this includes forms constructed using the pronominal suffixes, but does not include forms with the definite article or any other prefixes.
  • Inflectional and derivational morphology:
      • Nouns: 5,000,000 forms (29%)
      • Verbs: 10,000,000 forms (59%)
      • Adjectives: 1,500,000 forms (9%)
      • Other: 500,000 forms (3%)
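
As a quick arithmetic check, the four inflectional and derivational counts above sum to 17,000,000, and each quoted percentage is that category's share of this subtotal. A minimal Python sketch (the variable names are illustrative only):

```python
# Approximate form counts per POS, copied from the breakdown above.
counts = {
    "Nouns": 5_000_000,
    "Verbs": 10_000_000,
    "Adjectives": 1_500_000,
    "Other": 500_000,
}

subtotal = sum(counts.values())

# Share of each category, rounded to the nearest whole percent.
shares = {pos: round(100 * n / subtotal) for pos, n in counts.items()}

print(subtotal)  # 17000000
print(shares)    # {'Nouns': 29, 'Verbs': 59, 'Adjectives': 9, 'Other': 3}
```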

Total number of lemmas

20,000 lemmas

Features

Each form is annotated with its corresponding lemma, POS, and morphological attributes: voice, tense, mood, transitivity, number, person, gender, case, state, pronominal-number, pronominal-person and pronominal-gender.

Lemma

The canonical form for the inflected word.

POS

Part of Speech such as noun, verb, adjective, etc.

Voice

Verb form is classified as active or passive.

Tense

Specifies when the action takes place such as past, present, future, etc.

Aspect

Not applicable.

Mood

Modality of the verb form: indicative, subjunctive, imperative, etc.

Person

Verb or pronoun refers to the first, second or third person.

Number

State of being singular, dual or plural.

Gender

Grammatical gender of noun, verb or adjective forms: masculine, feminine, neuter, etc.

Case

The function that the noun or adjective plays within a sentence.

Degree

Not applicable.

Definiteness State

Specifies whether a noun or adjective is indefinite (a general concept), definite (a specific one) or in the construct state.

Negative

Not applicable.

Contractions

Not applicable.

Pronominal Clitics

Clitic pronouns are identified and tagged.

Formality

Not applicable.

Frequency

Relative frequency of the form based on a large general-purpose corpus.

Named Entities

Pre-defined entities are tagged as person names, places, organizations, etc.

Offensive

Indicates whether the form might be considered offensive in certain contexts.

Non-inflectional POSs

This includes the data with all the forms, lemmas, POS and morphological attributes (voice, tense, mood, transitivity, number, person, gender, case, state, pronominal-number, pronominal-person and pronominal-gender) for non-inflectional POSs (determiners, pronouns, prepositions…), i.e. POSs not included in Feature Sets 2 and 3. This includes the forms constructed using the pronominal suffixes, but does not include forms with the definite article or any other prefixes.

Inflectional and Derivational Morphology Data

This includes the data with all the forms, lemmas, POS and morphological attributes (voice, tense, mood, transitivity, number, person, gender, case, state, pronominal-number, pronominal-person and pronominal-gender) for the POSs verb, noun and adjective. This includes the forms constructed using the pronominal suffixes, but does not include forms with preposition, conjunction, future-tense, definite-article or any other prefixes.

Frequency Indication

This includes relative frequency information for words in a corpus. Frequency is expressed as logarithmic “buckets” or “frequency groups”, ranging from 255 (the most common forms) down to 0 (the least common forms).
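
The exact bucketing formula is not specified here. Purely as an illustration of what a logarithmic scale onto 0 to 255 could look like, the sketch below maps a raw corpus count to a bucket; the function name `freq_group`, its parameters and the formula itself are assumptions, not the method used to produce the delivered data.

```python
import math


def freq_group(count: int, max_count: int) -> int:
    """Map a raw corpus count to a bucket in 0..255 on a log scale.

    Hypothetical formula: the specification only states that the
    buckets are logarithmic, with 255 for the most common forms.
    """
    if count <= 1 or max_count <= 1:
        return 0
    scaled = math.log(count) / math.log(max_count)
    # Clip so that counts above max_count still land in the top bucket.
    return min(255, round(255 * scaled))


print(freq_group(1_000_000, 1_000_000))  # 255
print(freq_group(1, 1_000_000))          # 0
```

On such a scale, each tenfold increase in count moves a form up by a roughly constant number of buckets, which is why rare and common forms can share a single one-byte range.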

 

Usage

This includes information regarding the usage of specific forms or lemmas. Words that are rarely used will be tagged as “rare”. Words borrowed from foreign languages that are not widely understood, or which are only meaningful as part of a larger phrase, will be tagged as “foreign”.

Words or spelling variations that are not officially recognized but are widely used in text messaging will be tagged as “non-standard”.

Offensive Language

This includes the data regarding offensive, vulgar and sensitive words with all the lemmas, POS and attributes. Offensive words will be marked as “offensive” (highly offensive derogatory slurs based on race, sexual orientation, etc.). Vulgar words will be marked as “vulgar” (strong curse words).

Sensitive words will be marked as “sensitive” (clinical names of genitalia, politically charged words, light curse words, etc.). Words that are not themselves offensive, vulgar or sensitive, but could be part of potentially disturbing phrases, will be marked as “sensitive-in-context”.

Categories

This includes the data regarding frequently used words with all the lemmas, POS and attributes. Frequently used words are considered to fall under, and will be tagged with, one of the following categories:

    • Animal
    • Apple Product (Apple-owned trademarks and product names)
    • Brandname (Well-known brands and products)
    • Body part
    • Cities
    • Clothing
    • Color
    • Computer
    • Country
    • Family name
    • Female first name
    • Fruit/vegetable
    • Georegion (Names of regions not covered by the geopolitical names in the country/state/city/waterway categories)
    • Greetings (including multi-word expressions)
    • Male first name
    • Measures
    • Organization
    • Plant
    • Professions
    • Relation
    • Seasons
    • Sport
    • States
    • Transportation
    • Waterway (Names of rivers, lakes, oceans, etc.)
    • Weather

Delivery Format

The data is delivered in two separate files: one main file for Feature Sets 1, 2, 3, 5, 6 and 7, and one for Feature Set 4 (Frequency). The main delivery file will be a UTF-8 tab-delimited file with the following structure:

    • Header: the header identifies each column or field, such as “form”, “lemma”, “POS”, etc. Every field has a set of possible values; for example, “number” can be “singular”, “plural” or a combination of these. Where a field is not applicable to a particular POS or form, the field will contain the value “n/a”.
    • Rest of the lines: the rest of the file contains one word form per line, together with the attributes for this word form, in the order given by the header fields. Every line contains a unique combination of word form and attributes. Even if two forms are formally identical (i.e., they are the same string), they are listed on two separate lines if their attributes differ. For Egyptian Arabic, the fields are:
        • form: the actual word described in the line
        • lemma: the lemma of the “form”
        • POS: adjective, adverb, conjunction, noun, numeral, particle, preposition, pronoun, verb
        • voice: active, passive (only for verbal forms)
        • tense: past, non-past (only for verbal forms)
        • mood: indicative, imperative (only for verbal forms)
        • transitivity: transitive, intransitive (only for verbal forms)
        • number: singular, dual, plural
        • person: 1, 2, 3 (only for verbal forms)
        • gender: feminine, masculine
        • case: nominative (only for nouns, adjectives and some verbal forms)
        • state: indefinite, definite, construct
        • pronominal-number: singular, plural
        • pronominal-person: 1, 2, 3
        • pronominal-gender: masculine, feminine
        • offensive: n/a, sensitive, vulgar, offensive
        • usage: n/a, rare, foreign, non-standard
        • category: animal, appleproduct, brandname, bodypart, city, clothing, color, computer, country, familyname, femalepersonalname, fruit/vegetable, georegion, greetings, malepersonalname, measures, organization, plant, professions, relation, seasons, sport, state, transportation, waterway, weather
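
A minimal Python sketch of how a consumer might read the main file, assuming the structure described above. The three sample rows and the abbreviated six-column header are hypothetical, chosen only to show two identical strings with different attributes, and `read_main_file` is an illustrative helper rather than part of the delivery.

```python
import csv
import io

# Hypothetical sample with an abbreviated header (the real file has more fields).
SAMPLE = (
    "form\tlemma\tPOS\tnumber\tgender\tstate\n"
    "كتاب\tكتاب\tnoun\tsingular\tmasculine\tindefinite\n"
    "كتب\tكتاب\tnoun\tplural\tmasculine\tindefinite\n"
    "كتب\tكتب\tverb\tsingular\tmasculine\tn/a\n"
)


def read_main_file(handle):
    """Return rows as dicts, mapping "n/a" to None and rejecting duplicate lines."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows, seen = [], set()
    for record in reader:
        key = tuple(record.values())
        if key in seen:
            # Every line must be a unique combination of form and attributes.
            raise ValueError(f"duplicate line: {key}")
        seen.add(key)
        rows.append({k: (None if v == "n/a" else v) for k, v in record.items()})
    return rows


rows = read_main_file(io.StringIO(SAMPLE))
# The same string appears twice because its attributes differ.
homographs = [r for r in rows if r["form"] == "كتب"]
print(len(rows), len(homographs))  # 3 2
```

With a real delivery, the `io.StringIO` object would be replaced by a file opened with `encoding="utf-8"` and `newline=""`.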

The frequency file will be a UTF-8 tab-delimited file with the following structure:

    • Header: the header identifies each column or field. The only two fields present in the file are “form”, corresponding to the word form, and “freq_group”, corresponding to the frequency assigned to the form.
    • Rest of the lines: the rest of the file contains one word form per line, and the assigned frequency for this word form, according to the header fields. Every line contains a unique combination of word form and frequency.
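
The frequency file can be loaded in the same way and used as a lookup keyed by form. The sketch below is illustrative only: the sample values are invented, `read_freq_file` is a hypothetical helper, and it assumes each form appears once in the file. Because the file is keyed by the form string alone, identical strings with different attributes in the main file would share one frequency group.

```python
import csv
import io

# Hypothetical sample; the freq_group values are invented for illustration.
FREQ_SAMPLE = (
    "form\tfreq_group\n"
    "كتاب\t201\n"
    "كتب\t187\n"
)


def read_freq_file(handle):
    """Return a dict mapping each form to its integer frequency group (0..255)."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    groups = {}
    for record in reader:
        group = int(record["freq_group"])
        if not 0 <= group <= 255:
            raise ValueError(f"freq_group out of range: {group}")
        groups[record["form"]] = group
    return groups


groups = read_freq_file(io.StringIO(FREQ_SAMPLE))
print(groups.get("كتب"))  # 187
```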

Research notes

Egyptian Arabic, also called Masri, is a collection of Arabic dialects spoken in Egypt. The main variant is sometimes called Standard Egyptian Arabic (although it is not officially standardized), and is primarily based on Cairene Arabic (the dialect from Cairo), with some loanwords from Modern Standard Arabic.

This dialect is used in media not only throughout Egypt but also across the Arabic-speaking world. Regarding spelling variants, the Language Data will cover the main orthographic convention used to write Arabic: tashkil (the diacritics that indicate short vowels) is omitted, and only i’jam (the dots that distinguish consonants) is kept, with long vowels written as full letters.
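
To illustrate the convention, the sketch below strips tashkil from a vocalized string while leaving the letters (and therefore their i’jam dots) untouched. It is an assumption about how input text could be normalized to match the data, not a description of how the data was produced; `strip_tashkil` is a hypothetical helper, and the character range covers the Unicode marks from fathatan (U+064B) to sukun (U+0652).

```python
import re

# Tashkil marks: fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun.
TASHKIL = re.compile("[\u064B-\u0652]")


def strip_tashkil(text: str) -> str:
    """Remove short-vowel and related diacritics, keeping the base letters."""
    return TASHKIL.sub("", text)


# Meem + fatha, sad + sukun, reh: a vocalized spelling of "Masr".
vocalized = "\u0645\u064E\u0635\u0652\u0631"
print(strip_tashkil(vocalized))  # مصر
```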

We propose to focus on Standard Egyptian Arabic.
