Egyptian Arabic Language Data
Volume of Language Data

Total number of forms
The following is a breakdown of the approximate number of forms in the Egyptian Arabic Language Data:
- Non-inflectional morphology:
- About 5,000 forms for prepositions, conjunctions, interjections…
- Note: this count includes forms constructed with the pronominal suffixes, but excludes forms with the definite article or any other prefix.
- Inflectional and derivational morphology:
- Nouns: 5,000,000 forms (29%)
- Verbs: 10,000,000 forms (59%)
- Adjectives: 1,500,000 forms (9%)
- Other: 500,000 forms (3%)
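The percentages above can be checked against the stated counts. The following is a minimal sketch, assuming the four categories listed make up the whole inflectional and derivational total (the roughly 5,000 non-inflectional forms are left out of the denominator). The names FORM_COUNTS and shares are illustrative only.

```python
# Approximate form counts taken from the breakdown above.
FORM_COUNTS = {
    "Nouns": 5_000_000,
    "Verbs": 10_000_000,
    "Adjectives": 1_500_000,
    "Other": 500_000,
}


def shares(counts):
    """Return each category's share of the total, rounded to a whole percent."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


total_forms = sum(FORM_COUNTS.values())
percentages = shares(FORM_COUNTS)
print(total_forms)
print(percentages)
```

Under this assumption the counts add up to about 17,000,000 forms, and the rounded shares match the figures given in the list.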

Total number of lemmas
Features
Lemma
POS
Voice
Tense
Aspect
Mood
Person
Number
Gender
Case
Degree
Definiteness State
Negative
Contractions
Pronominal Clitics
Formality
Frequency
Named Entities
Offensive
Non-inflectional POSs
Inflectional and Derivational Morphology Data
Frequency Indication
Usage
Words or spelling variants that are not officially recognized but are widely used in text messaging will be tagged as “non-standard”.
Offensive Language
Words that are sensitive (clinical names of genitalia, politically charged words, light curse words, etc.) should be marked “sensitive”. Words that are not themselves offensive, vulgar, or sensitive, but that could form part of potentially disturbing phrases, will be marked “sensitive-in-context”.
Categories
- Animal
- Apple Product (Apple-owned trademarks and product names)
- Brandname (Well-known brands and products)
- Body part
- Cities
- Clothing
- Color
- Computer
- Country
- Family name
- Female first name
- Fruit/vegetable
- Georegion (Names of regions not covered by the geopolitical names in the country/state/city/waterway categories)
- Greetings (including multi-word expressions)
- Male first name
- Measures
- Organization
- Plant
- Professions
- Relation
- Seasons
- Sport
- States
- Transportation
- Waterway (Names of rivers, lakes, oceans, etc.)
- Weather
Delivery Format
- Header: the header identifies each column or field, such as “form”, “lemma”, “POS”, etc. Every field has a fixed set of possible values; for example, “number” can be “singular”, “plural”, or a combination of these. Where a field does not apply to a particular POS or form, it will contain the value “n/a”.
- Rest of the lines: the rest of the file contains one word form per line, together with the attributes of that word form, in the order given by the header fields. Every line contains a unique combination of word form and attributes. Even if two forms are formally identical (i.e., they are the same string), they are listed on two separate lines if their attributes differ. For Egyptian Arabic, the fields are:
- form: the actual word described in the line
- lemma: the lemma of the “form”
- POS: adjective, adverb, conjunction, noun, numeral, particle, preposition, pronoun, verb
- voice: active, passive (only for verbal forms)
- tense: past, non-past (only for verbal forms)
- mood: indicative, imperative (only for verbal forms)
- transitivity: transitive, intransitive (only for verbal forms)
- number: singular, dual, plural
- person: 1, 2, 3 (only for verbal forms)
- gender: feminine, masculine
- case: nominative (only for nouns, adjectives and some verbal forms)
- state: indefinite, definite, construct
- pronominal-number: singular, plural
- pronominal-person: 1, 2, 3
- pronominal-gender: masculine, feminine
- offensive: n/a, sensitive, vulgar, offensive
- usage: n/a, rare, foreign, non-standard
- category: animal, appleproduct, brandname, bodypart, city, clothing, color, computer, country, familyname, femalepersonalname, fruit/vegetable, georegion, greetings, malepersonalname, measures, organization, plant, professions, relation, seasons, sport, state, transportation, waterway, weather
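The field list above lends itself to a simple automated check. The following is a minimal sketch, assuming the main data file is tab-delimited UTF-8 text with a header line (this is stated only for the frequency file, so it is an assumption here). The function name validate, the abbreviated five-column header, and the sample rows are hypothetical and for illustration only. Only three fields are checked; the others could be added to ALLOWED in the same way.

```python
import csv
import io

# Allowed values copied from the field list above. "n/a" marks a field that
# does not apply to the form and is accepted in every checked field.
ALLOWED = {
    "POS": {"adjective", "adverb", "conjunction", "noun", "numeral",
            "particle", "preposition", "pronoun", "verb"},
    "offensive": {"n/a", "sensitive", "vulgar", "offensive"},
    "usage": {"n/a", "rare", "foreign", "non-standard"},
}


def validate(lines):
    """Return a list of (line_number, message) problems found in the data."""
    problems = []
    seen = set()
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        # Every line must be a unique combination of form and attributes.
        key = tuple(row.get(name) for name in reader.fieldnames)
        if key in seen:
            problems.append((line_no, "duplicate form/attribute combination"))
        seen.add(key)
        for field, allowed in ALLOWED.items():
            value = row.get(field)
            if value not in allowed and value != "n/a":
                problems.append((line_no, f"bad {field} value: {value!r}"))
    return problems


# Hypothetical sample: an abbreviated header, a valid row, an exact
# duplicate of it, and a row with a misspelled POS value.
SAMPLE = (
    "form\tlemma\tPOS\toffensive\tusage\n"
    "كتاب\tكتاب\tnoun\tn/a\tn/a\n"
    "كتاب\tكتاب\tnoun\tn/a\tn/a\n"
    "كتاب\tكتاب\tnoum\tn/a\tn/a\n"
)
problems = validate(io.StringIO(SAMPLE))
for line_no, message in problems:
    print(line_no, message)
```

For a real delivery, the same function could be given a file opened with encoding="utf-8" and newline="" in place of the in-memory sample.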
The frequency file will be a UTF-8 tab-delimited file with the following structure:
- Header: the header identifies each column or field. The only two fields present in the file are “form”, corresponding to the word form, and “freq_group”, corresponding to the frequency assigned to the form.
- Rest of the lines: the rest of the file contains one word form per line, and the assigned frequency for this word form, according to the header fields. Every line contains a unique combination of word form and frequency.
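The frequency file structure described above can be read with a few lines of code. The following is a minimal sketch: the function name load_frequencies is hypothetical, and the freq_group values "A" and "B" in the sample are placeholders, since the actual set of frequency groups is not specified here.

```python
import csv
import io


def load_frequencies(lines):
    """Read (form, freq_group) pairs from tab-delimited text with a header."""
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    if reader.fieldnames != ["form", "freq_group"]:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    entries = []
    seen = set()
    for row in reader:
        pair = (row["form"], row["freq_group"])
        # Every line must be a unique combination of form and frequency.
        if pair in seen:
            raise ValueError(f"duplicate line: {pair}")
        seen.add(pair)
        entries.append(pair)
    return entries


# Hypothetical sample with placeholder frequency groups.
SAMPLE = "form\tfreq_group\nكتاب\tA\nبيت\tB\n"
entries = load_frequencies(io.StringIO(SAMPLE))
print(entries)
```

Pairs are kept in a list rather than a dictionary keyed by form, because the description only guarantees that each combination of form and frequency is unique, not that each form appears once.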
Research notes
This dialect is used in media not only throughout Egypt but also across the Arabic-speaking world. Regarding spelling variants, the Language Data will follow the main orthographic convention for written Arabic: tashkil (the diacritics that indicate short vowels) is omitted, and only the consonant pointing (i’jam) is kept; long vowels are written with full letters.
We propose to focus on Standard Egyptian Arabic.
