Enriched Arabic Text Embeddings

Bitext provides linguistically enriched text embeddings for Arabic that outperform traditional embeddings in a wide range of downstream tasks, such as text classification, topic modeling and semantic search.

We offer both:


Our Customers

Working with 3 of the 5 largest companies on NASDAQ


Advantages

Linguistically enriched embeddings have been shown to increase accuracy in a range of downstream tasks. In standard semantic similarity tests, our enriched embeddings for Arabic outperform regular embeddings by 0.07 (from 0.47 to 0.54, a relative gain of about 15%).
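The page does not say which metric lies behind the 0.47 and 0.54 scores. A common setup for word-similarity benchmarks is the Spearman correlation between human similarity ratings and the cosine similarity of the word vectors. The following is a minimal sketch under that assumption. The toy vectors, word labels and ratings are invented for illustration and are not Bitext data.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; assumes neither list is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def similarity_score(vectors, rated_pairs):
    """Correlate human ratings with model similarities, skipping OOV pairs."""
    gold, predicted = [], []
    for w1, w2, rating in rated_pairs:
        if w1 in vectors and w2 in vectors:
            gold.append(rating)
            predicted.append(cosine(vectors[w1], vectors[w2]))
    if len(gold) < 2:
        raise ValueError("need at least two in-vocabulary pairs")
    return spearman(gold, predicted)

# Toy data (hypothetical): word labels, 2-dimensional vectors, human ratings.
vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
rated_pairs = [("a", "b", 9.0), ("a", "d", 5.0), ("a", "c", 1.0), ("a", "zzz", 7.0)]

score = similarity_score(vectors, rated_pairs)
print(f"Spearman correlation: {score:.2f}")
```

Pairs with an out-of-vocabulary word are skipped here. Some benchmarks instead penalize them, which would favor embeddings with broader vocabulary coverage.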

As a broader showcase of the performance improvements offered by our embeddings for morphologically rich languages, we have benchmarked their performance on more complex downstream tasks like topic modeling, where our enriched embeddings improve topic coherence results by up to 10%.
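The page does not name the coherence measure used. One widely used option is NPMI coherence, which averages the normalized pointwise mutual information over all pairs of a topic's top words, based on how often the words co-occur in documents. The sketch below assumes that measure and document-level co-occurrence. The toy documents and word labels are invented for illustration.

```python
import math
from itertools import combinations

def npmi_coherence(topic_words, documents):
    """Mean NPMI over all pairs of topic words.

    documents: list of token lists. Co-occurrence is counted per document.
    Requires at least two topic words.
    """
    if len(topic_words) < 2:
        raise ValueError("need at least two topic words")
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing all the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words)) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p12 = prob(w1, w2)
        if p12 == 0.0:
            scores.append(-1.0)   # never co-occur: NPMI lower bound
        elif p12 == 1.0:
            scores.append(0.0)    # degenerate case (0/0), treated as neutral
        else:
            pmi = math.log(p12 / (prob(w1) * prob(w2)))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)

# Toy corpus (hypothetical): four tiny documents.
docs = [["a", "b"], ["a", "b"], ["c"], ["c", "d"]]

print(npmi_coherence(["a", "b"], docs))  # words that always appear together
print(npmi_coherence(["a", "c"], docs))  # words that never appear together
```

Scores range from -1 (the words never co-occur) to 1 (they only ever appear together), so a higher average indicates a more coherent topic.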

Features

We provide a service that generates enriched embeddings for Arabic, extending traditional unsupervised embeddings into semi-supervised ones by leveraging Bitext’s linguistic resources:

  • lexical resources for MSA with 15M words…
  • dialectal/regional variants like Egyptian, Gulf or Najdi Arabic <Arabic dict url>
  • rich morphological attribute tagging: POS, tense, gender, number, aspect…
  • extensive corpora 
  • named entity dictionaries
  • offensive language tags  

Bitext also offers pre-trained word embeddings and full pipelines for Arabic designed to take full advantage of these embeddings. These pipelines include: 

 

Our enriched embeddings offer the following features: 

  • Ready to use with virtually any platform/pipeline: spaCy, Gensim, BERTopic…
  • Wide vocabulary coverage: 50K lemmas, covering up to 15M word forms from Bitext’s balanced 5B word corpus 
  • Compact size: 150 dimensions 
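To illustrate how a vocabulary of 50K lemmas can cover a far larger set of word forms, the sketch below looks a word form up through a form-to-lemma lexicon before fetching its vector. It rests on several assumptions: the lexicon entries, the vectors, and the function names are hypothetical and do not reflect Bitext's data or API, and the vectors have 3 dimensions rather than 150 to keep the example short.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical toy lexicon: inflected Arabic word form -> lemma.
# Glosses, in order: "the book", "the books", "two books",
# "they write", "she wrote".
FORM_TO_LEMMA: Dict[str, str] = {
    "الكتاب": "كتاب",
    "الكتب": "كتاب",
    "كتابان": "كتاب",
    "يكتبون": "كتب",
    "كتبت": "كتب",
}

# Hypothetical lemma-keyed vectors (3 dimensions for illustration only).
LEMMA_VECTORS: Dict[str, List[float]] = {
    "كتاب": [0.2, 0.7, 0.1],
    "كتب": [0.6, 0.1, 0.3],
}

def lookup_direct(form: str) -> Optional[List[float]]:
    """Look the surface form up as-is; inflected forms are usually missing."""
    return LEMMA_VECTORS.get(form)

def lookup_via_lemma(form: str) -> Optional[List[float]]:
    """Map the form to its lemma first, then fetch the lemma's vector."""
    lemma = FORM_TO_LEMMA.get(form, form)
    return LEMMA_VECTORS.get(lemma)

def covered(forms: List[str], lookup: Callable[[str], Optional[List[float]]]) -> int:
    """Count how many forms receive a vector under the given lookup."""
    return sum(1 for form in forms if lookup(form) is not None)

forms = ["الكتاب", "كتابان", "يكتبون"]
print("direct lookup:", covered(forms, lookup_direct), "of", len(forms))
print("via lemma:    ", covered(forms, lookup_via_lemma), "of", len(forms))
```

In this toy setup none of the inflected forms has its own vector, but all of them resolve to one through their lemma.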

Features

The core strengths of the Bitext embeddings & pipelines are: 

  • Based on large corpora. Embeddings are built on corpora in the range of 25-50 GB of text (5-10 billion tokens). These corpora are designed to be well balanced with respect to text typology, text sources and verticals.
  • Extensive vocabulary coverage. All pipeline components use comprehensive lexical resources. For example, our lexicon for Arabic covers 15 million word forms. These lexical resources: 
    • ensure high-quality linguistic tagging functions, like POS tagging and lemmatization
    • reduce the number of unknown (OOV) words in embeddings  
    • provide rich morphological features like tense, person, number, gender, case…  
  • Full pipeline coverage. The pipeline allows for control of all the usual steps: 
    • Sentence segmentation. Splits texts into sentences. 
    • Word segmentation. Covers languages without spaces between words, like Japanese or Chinese, and languages that put spaces between syllables, like Vietnamese
    • Tokenization. Splits sentences into individual tokens (words, numbers…)  
    • POS tagging. Distinguishes word types, grouping words that behave similarly into classes like nouns, verbs, adjectives or adverbs
    • Lemmatization. Increases the vocabulary coverage of the embedding, particularly for morphologically rich languages like Spanish, French or Italian 
    • Decompounding. Reduces vocabulary size by splitting compounds into known words, reducing data sparsity 
    • Named Entity Recognition. NER uses entity dictionaries that reduce the number of unknown words 
    • Dependency Parsing.  
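The first few steps listed above can be sketched as a chain of small functions. This is only an illustration of how the stages feed into each other: the regex-based splitting, the toy lexicon and all function names are stand-ins chosen for this example, not Bitext's components, and the later steps (decompounding, NER, dependency parsing) are left out.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: word form -> (lemma, POS tag).
# Glosses, in order: "wrote", "writes", "the student", "the lesson".
LEXICON: Dict[str, Tuple[str, str]] = {
    "كتب": ("كتب", "VERB"),
    "يكتب": ("كتب", "VERB"),
    "الطالب": ("طالب", "NOUN"),
    "الدرس": ("درس", "NOUN"),
}

def split_sentences(text: str) -> List[str]:
    """Sentence segmentation: split after ., !, ? or the Arabic question mark."""
    parts = re.split(r"(?<=[.!?؟])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence: str) -> List[str]:
    """Tokenization: keep runs of word characters, drop punctuation."""
    return re.findall(r"\w+", sentence)

def tag_and_lemmatize(tokens: List[str]) -> List[Tuple[str, str, str]]:
    """POS tagging and lemmatization by lexicon lookup.

    Unknown forms keep their surface form as lemma and get the tag "UNK".
    """
    return [(tok, *LEXICON.get(tok, (tok, "UNK"))) for tok in tokens]

def run_pipeline(text: str) -> List[List[Tuple[str, str, str]]]:
    """Run the steps in order: sentences -> tokens -> (form, lemma, POS)."""
    return [tag_and_lemmatize(tokenize(s)) for s in split_sentences(text)]

# "The student wrote the lesson. Does the student write?"
sample = "كتب الطالب الدرس. هل يكتب الطالب؟"

for sentence in run_pipeline(sample):
    print(sentence)
```

Keeping each step as a separate function mirrors the idea of controlling the steps individually: any stage can be swapped for a stronger component without touching the others.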


            SAN FRANCISCO, USA

            541 Jefferson Ave., Ste. 100

            Redwood City

            CA 94063

            MADRID, SPAIN

            José Echegaray 8, Building 3

            Parque Empresarial Las Rozas

            28232 Las Rozas