Enriched Arabic Text Embeddings

Bitext provides linguistically enriched text embeddings for Arabic that outperform traditional embeddings in a wide range of downstream tasks, such as text classification, topic modeling and semantic search.

We offer both:

    • Pre-trained embeddings for morphologically-complex languages like Arabic 
    • Services to create custom embeddings for Arabic 

Our Customers

Working with 3 of the Top 5 Largest Companies in NASDAQ



Linguistically enriched embeddings have been shown to increase accuracy in a range of downstream tasks. In standard semantic similarity tests, our enriched embeddings outperform regular embeddings by 0.07 (from 0.47 to 0.54, roughly a 15% relative improvement).
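
To make the kind of score quoted above concrete, here is a minimal sketch of how a word similarity test is commonly scored: the cosine similarity of each word pair is compared with a human rating, and the agreement is summarized as a Spearman rank correlation. The metric, the toy vectors and the ratings are all assumptions made for illustration; the page does not state which test set or metric produced the 0.47 and 0.54 figures.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def similarity_score(vectors, gold_pairs):
    """Correlate model similarities with human ratings for each word pair."""
    model = [cosine(vectors[a], vectors[b]) for a, b, _ in gold_pairs]
    human = [rating for _, _, rating in gold_pairs]
    return spearman(model, human)

# Hypothetical 2-dimensional vectors and human ratings, for illustration only.
toy_vectors = {"w1": [1.0, 0.0], "w2": [0.9, 0.1], "w3": [0.0, 1.0]}
toy_gold = [("w1", "w2", 9.0), ("w1", "w3", 1.0), ("w2", "w3", 2.0)]
score = similarity_score(toy_vectors, toy_gold)
```

On a real benchmark the same function would be run once per embedding model over the same gold pairs, and the two resulting scores compared.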

As a broader showcase of the performance improvements offered by our embeddings for morphologically-rich languages, we have benchmarked the performance of our enriched embeddings on more complex downstream tasks like:

  • Topic Modeling: topic coherence increases by up to 15%
  • Semantic Textual Similarity (STS): F1-score increases by 4%
  • Question Answering (QA): exact match scores increase by 14%


We provide a service to generate enriched embeddings for Arabic that extends traditional unsupervised embeddings into semi-supervised ones by leveraging Bitext’s linguistic resources:

  • lexical resources for MSA with 15M words…
  • dialectal/regional variants like Egyptian, Gulf or Najdi Arabic
  • rich morphological attribute tagging: POS, tense, gender, number, aspect…
  • extensive corpora 
  • named-entity dictionaries
  • offensive language tags  

Bitext also offers pre-trained word embeddings and full pipelines for Arabic designed to take full advantage of these embeddings.


Our enriched embeddings offer the following features: 

  • Ready to use with virtually any platform/pipeline: spaCy, Gensim, BERTopic…
  • Wide vocabulary coverage: 50K lemmas, covering up to 15M word forms from Bitext’s balanced 5B-word corpus
  • Compact size: 150 dimensions 
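
As an illustration of what "ready to use" can look like in practice, the sketch below reads vectors from the plain-text word2vec layout (a header line with vocabulary size and dimension, then one word and its values per line) and looks up nearest neighbours by cosine similarity. The file format, the function names and the 3-dimensional toy vectors are assumptions made for this example; the actual distribution format of the 150-dimensional embeddings is not stated here.

```python
import io
import math

def load_text_vectors(handle, expected_dim=None):
    """Read word2vec-style text vectors from an open text handle."""
    header = handle.readline().split()
    dim = int(header[1])  # header is "<vocab_size> <dimension>"
    if expected_dim is not None and dim != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, found {dim}")
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def most_similar(vectors, word, topn=3):
    """Return the topn (word, similarity) pairs closest to `word`."""
    target = vectors[word]
    scored = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy data: three lemmas with made-up 3-dimensional vectors.
rows = [
    ("كتاب", [0.9, 0.1, 0.0]),   # book
    ("مكتبة", [0.8, 0.2, 0.1]),  # library
    ("سيارة", [0.0, 0.1, 0.9]),  # car
]
sample = "3 3\n" + "\n".join(
    word + " " + " ".join(str(x) for x in vec) for word, vec in rows
) + "\n"

vectors = load_text_vectors(io.StringIO(sample), expected_dim=3)
neighbours = most_similar(vectors, "كتاب", topn=1)
```

If the vectors are indeed available in this layout, Gensim users would typically load them with `KeyedVectors.load_word2vec_format(path, binary=False)` rather than a hand-written parser.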


The core strengths of the Bitext embeddings & pipelines are: 

  • Based on large corpora. Embeddings are built on corpora in the range of 25-50 GB of text (5-10 billion tokens). These corpora are designed to be well balanced with respect to text typology, text sources and verticals.
  • Extensive vocabulary coverage. All pipeline components use comprehensive lexical resources. For example, our lexicon for Arabic covers 15 million word forms. These lexical resources: 
    • ensure high quality linguistic tagging functions, like POS tagging and lemmatization 
    • reduce the number of unknown (OOV) words in embeddings  
    • provide rich morphological features like tense, person, number, gender, case…  
  • Full pipeline coverage. The pipeline allows for control of all the usual steps: 
    • Sentence segmentation. Splits texts into sentences. 
    • Word segmentation. Covers languages that do not put spaces between words, like Japanese or Chinese, and languages that put spaces between syllables, like Vietnamese 
    • Tokenization. Splits sentences into individual tokens (words, numbers…)  
    • POS tagging. Provides better word type segmentation, grouping word classes that have similar behavior, like nouns, verbs, adjectives or adverbs 
    • Lemmatization. Increases the vocabulary coverage of the embedding, particularly for morphologically rich languages like Spanish, French or Italian 
    • Decompounding. Reduces vocabulary size by splitting compounds into known words, reducing data sparsity 
    • Named Entity Recognition. NER uses entity dictionaries that reduce the number of unknown words 
    • Dependency Parsing.  
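
The first steps of such a pipeline, and the effect of lemmatization on unknown words, can be sketched as follows. This is a toy illustration under stated assumptions, not Bitext's implementation: the regular expressions, the four-entry lexicon and the two-lemma embedding vocabulary are all hypothetical stand-ins for the real components.

```python
import re

# Hypothetical toy lexicon: inflected Arabic form -> lemma.
TOY_LEXICON = {
    "قرأت": "قرأ",     # "I read" -> "read"
    "الكتاب": "كتاب",  # "the book" -> "book"
    "الكتب": "كتاب",   # "the books" -> "book"
    "كتب": "كتاب",     # "books" -> "book"
}

# Hypothetical embedding vocabulary, keyed by lemma.
EMBEDDING_VOCAB = {"قرأ", "كتاب"}

def split_sentences(text):
    """Naive sentence segmentation on Latin and Arabic end punctuation."""
    return [s for s in re.split(r"(?<=[.!?؟])\s+", text.strip()) if s]

def tokenize(sentence):
    """Naive tokenization: runs of word characters (covers Arabic letters)."""
    return re.findall(r"\w+", sentence)

def lemmatize(token, lexicon):
    """Look the token up in the lexicon; fall back to the surface form."""
    return lexicon.get(token, token)

def oov_rate(tokens, vocab):
    """Fraction of tokens that have no entry in the embedding vocabulary."""
    return sum(1 for t in tokens if t not in vocab) / len(tokens)

text = "قرأت الكتاب. الكتب مفيدة."  # "I read the book. Books are useful."
sentences = split_sentences(text)
tokens = [tok for sent in sentences for tok in tokenize(sent)]
lemmas = [lemmatize(tok, TOY_LEXICON) for tok in tokens]

surface_oov = oov_rate(tokens, EMBEDDING_VOCAB)
lemma_oov = oov_rate(lemmas, EMBEDDING_VOCAB)
```

In this toy case none of the four surface forms is in the vocabulary, while three of the four lemmas are, which is the mechanism by which lexicon-backed lemmatization reduces unknown words for a morphologically rich language.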

            Have questions? Don't hesitate to contact us.

            MADRID, SPAIN

            Camino de las Huertas, 20, 28223 Pozuelo
            Madrid, Spain

            SAN FRANCISCO, USA

            541 Jefferson Ave Ste 100, Redwood City
            CA 94063, USA