Egyptian Arabic Language Data

Egyptian Arabic, also called Masri, is a collection of Arabic dialects spoken in Egypt. The main variant is sometimes called Standard Egyptian Arabic (although it is not officially standardized), and is primarily based on Cairene Arabic (the dialect from Cairo), with some loanwords from Modern Standard Arabic. This dialect is used in media not only throughout Egypt but also across the Arabic-speaking world.

As with Modern Standard Arabic, the main orthographic convention for Egyptian Arabic is to omit tashkil (the diacritics that indicate short vowels) and to keep only i’jam (the dots that distinguish consonants) together with the letters that indicate long vowels.

Volume of Language Data


Total number of forms

The total number of forms is approximately 18,000,000.

The following is a breakdown of the approximate number of forms in the Egyptian Arabic Language Data:

  • Non-inflectional morphology:
      • About 5,000 forms for prepositions, conjunctions, interjections…
      • Note: this includes forms constructed using the pronominal suffixes, but does not include the definite article or any other prefixes.
  • Inflectional and derivational morphology:
      • Nouns: 5,000,000 forms (29%)
      • Verbs: 10,000,000 forms (59%)
      • Adjectives: 1,500,000 forms (9%)
      • Other: 500,000 forms (3%)
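The percentages in the breakdown can be reproduced from the listed counts. The short sketch below recomputes them; note that they are relative to the sum of the four listed categories (17,000,000), not to the approximate overall total. The names `FORM_COUNTS` and `category_shares` are illustrative only and are not part of the data.

```python
# Approximate form counts per category, copied from the breakdown above.
FORM_COUNTS = {
    "Nouns": 5_000_000,
    "Verbs": 10_000_000,
    "Adjectives": 1_500_000,
    "Other": 500_000,
}


def category_shares(counts):
    """Return each category's share of the summed counts as a rounded percentage."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


shares = category_shares(FORM_COUNTS)
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```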

Total number of lemmas

20,000 lemmas


Each form is annotated with its corresponding lemma, POS, and morphological attributes: voice, tense, mood, number, person, gender, case, state, possessive-number, possessive-person and possessive-gender.
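As an illustration of what one annotated form might look like in memory, the sketch below models a record with the fields named above. The delivery format is not described here, so the class `FormEntry`, its field names, and the attribute values ("active", "past", and so on) are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict

# Attribute names taken from the list above; the record layout itself is assumed.
ATTRIBUTES = (
    "voice", "tense", "mood", "number", "person", "gender", "case", "state",
    "possessive-number", "possessive-person", "possessive-gender",
)


@dataclass
class FormEntry:
    """One inflected form with its lemma, POS and morphological attributes."""
    form: str
    lemma: str
    pos: str
    attributes: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self):
        # Reject attribute names that are not in the documented list.
        unknown = set(self.attributes) - set(ATTRIBUTES)
        if unknown:
            raise ValueError(f"unknown attributes: {sorted(unknown)}")


# Illustrative entry; the attribute values are hypothetical labels.
example = FormEntry(
    form="كتب",
    lemma="كتب",
    pos="verb",
    attributes={
        "voice": "active",
        "tense": "past",
        "person": "third",
        "number": "singular",
        "gender": "masculine",
    },
)
print(example.lemma, example.pos, example.attributes["tense"])
```

Attributes that do not apply to a given form are simply left out of the dictionary in this sketch.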


Lemma

The canonical form of the inflected word.


POS

Part of speech, such as noun, verb, adjective, etc.


Voice

Whether the verb form is active or passive.


Tense

Specifies when the action takes place, such as past, present, future, etc.


Not applicable.


Mood

Modality of the verb form: indicative, subjunctive, imperative, etc.


Person

Whether the verb or pronoun refers to the first, second or third person.


Number

Whether the form is singular, dual or plural.


Gender

Grammatical gender of noun, verb or adjective forms: masculine, feminine, neuter, etc.


Case

The function that the noun or adjective plays within a sentence.


Not applicable.

Definiteness State

Specifies whether a noun or adjective is definite or indefinite, that is, whether it refers to a specific or a general entity.


Not applicable.


Not applicable.

Pronominal Clitics

Clitic pronouns are identified and tagged.


Not applicable.


Relative frequency of the form based on a large general-purpose corpus.

Named Entities

Pre-defined entities are tagged as person names, places, organization, etc.


Indicates whether the form might be considered offensive in certain contexts.

Non-inflectional POSs

The data contains all the forms, with their lemma, POS and morphological attributes (voice, tense, mood, transitivity, number, person, gender, case, state, pronominal-number, pronominal-person and pronominal-gender), for non-inflectional POSs (determiners, pronouns, prepositions…), i.e. POSs not covered under Inflectional and derivational morphology.

Inflectional Morphology Data

The Lexical Resource for Egyptian Arabic contains all the forms for all POSs: nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, numerals and particles.

Derivational Morphology Data

The data also includes all derivational forms, including adjectives derived from nouns (nisba), verbal nouns, and adverbs derived from adjectives.

Extended Morphology Data

The data also covers forms obtained by extending the inflectional and derivational form lists with additional morphological phenomena, such as common combinations of productive prefixes.

Frequency Indication

Relative frequency information for words in a corpus is included. Frequency is expressed as logarithmic “buckets” or “frequency groups” (from 255, the most common forms, to 0, the least common forms).
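The exact bucketing formula is not given here. As a rough sketch of how a logarithmic scale from 0 to 255 could be derived from raw corpus counts, the function below scales the logarithm of a form's count against the logarithm of the highest count in the corpus. The function name, its parameters and the formula are assumptions for illustration, not the method used to build the data.

```python
import math


def frequency_bucket(count, max_count, top=255):
    """Map a raw corpus count to a logarithmic bucket between 0 and `top`.

    Assumed scheme: a count of 1 maps to 0 and `max_count` maps to `top`.
    """
    if count < 1 or max_count < 1 or count > max_count:
        raise ValueError("expected 1 <= count <= max_count")
    if max_count == 1:
        return top
    return round(top * math.log(count) / math.log(max_count))


# Hypothetical counts, with the most frequent form seen 1,000,000 times.
for c in (1, 10, 1_000, 100_000, 1_000_000):
    print(c, frequency_bucket(c, 1_000_000))
```

On a logarithmic scale, each bucket step corresponds to a constant multiplicative change in frequency, so rare and common forms are both resolved with a single byte.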

Complementary Semantic Annotations


Usage includes information regarding the usage of specific forms or lemmas. Words that are rarely used will be tagged as “rare”. Words borrowed from foreign languages that are not widely understood, or which are only meaningful as part of a larger phrase, will be tagged as “foreign”.

Words or spelling variations which are not officially recognized but are widely used in texting communication will be tagged as “non-standard”.

Offensive Language

Bitext provides data regarding offensive, vulgar and sensitive words with all the lemmas, POS and attributes. Offensive words will be marked as “offensive” (highly offensive derogatory slurs based on race, sexual orientation, etc.). Vulgar words will be marked as “vulgar” (strong curse words).

Words that are sensitive will be marked as “sensitive” (clinical names of genitalia, politically charged words, light curse words, etc.). Words that are not themselves offensive, vulgar, or sensitive, but could be part of potentially disturbing phrases, will be marked as “sensitive-in-context”.
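One way such tags might be used downstream is to filter a word list by sensitivity level. The sketch below assumes each entry is a dictionary with a hypothetical "sensitivity" key holding one of the four tags, or no key at all for untagged words; the key name, the function `filter_entries` and the placeholder entries are not part of the data.

```python
# The four tags described above.
SENSITIVITY_TAGS = ("offensive", "vulgar", "sensitive", "sensitive-in-context")


def filter_entries(entries, blocked=("offensive", "vulgar")):
    """Return the entries whose sensitivity tag is not in `blocked`."""
    unknown = set(blocked) - set(SENSITIVITY_TAGS)
    if unknown:
        raise ValueError(f"unknown tags: {sorted(unknown)}")
    return [e for e in entries if e.get("sensitivity") not in blocked]


# Placeholder entries; real forms are deliberately not shown.
entries = [
    {"form": "form_a"},
    {"form": "form_b", "sensitivity": "vulgar"},
    {"form": "form_c", "sensitivity": "sensitive-in-context"},
]
kept = filter_entries(entries)
print([e["form"] for e in kept])
```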


Categories covers frequently used words, with all their lemmas, POS and attributes. Each of these words is tagged with one of the following categories:


    • Animal
    • Apple Product (Apple-owned trademarks and product names)
    • Brandname (Well-known brands and products)
    • Body part
    • Cities
    • Clothing
    • Color
    • Computer
    • Family name
    • Female first name
    • Fruit/vegetable
    • Georegion (Names of regions not covered by the geopolitical names in the country/state/city/waterway categories)
    • Greetings (including multi-word expressions)
    • Male first name
    • Measures
    • Organization
    • Plant
    • Professions
    • Relation
    • Seasons
    • Sport
    • States
    • Transportation
    • Waterway (Names of rivers, lakes, oceans, etc.)
    • Weather

