Enriched Arabic Text Embeddings

Bitext provides linguistically enriched text embeddings for Arabic that outperform traditional embeddings in a wide range of downstream tasks, such as text classification, topic modeling and semantic search.

We offer both:

- Pre-trained embeddings for morphologically-complex languages like Arabic
- Services to create custom embeddings for Arabic

Request a Demo

Our Customers

Working with 3 of the Top 5 Largest Companies on NASDAQ

Advantages

Linguistically enriched embeddings have been shown to increase accuracy in a range of downstream tasks. In standard semantic similarity tests, our enriched embeddings outperform regular embeddings by 0.07 (raising the score from 0.47 to 0.54).
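
The gain from 0.47 to 0.54 can be expressed either as an absolute difference or as a relative improvement. The short sketch below works out both figures; the function name `score_gain` is purely illustrative and not part of any Bitext tool.

```python
def score_gain(baseline: float, enriched: float) -> tuple[float, float]:
    """Return (absolute gain, relative gain in percent) between two scores."""
    absolute = enriched - baseline
    relative_pct = 100.0 * absolute / baseline
    return absolute, relative_pct


# Scores quoted in the text: regular vs. enriched embeddings.
absolute, relative_pct = score_gain(0.47, 0.54)
print(f"absolute gain: {absolute:.2f}")       # absolute gain: 0.07
print(f"relative gain: {relative_pct:.1f}%")  # relative gain: 14.9%
```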

As a broader showcase of the performance improvements offered by our embeddings for morphologically-rich languages, we have benchmarked the performance of our enriched embeddings on more complex downstream tasks like:

- Topic Modeling: topic coherence increases by up to 15%
- Semantic Textual Similarity (STS): F1-score increases by 4%
- Question Answering (QA): exact match score increases by 14%

Enriched Text Embeddings

Features

We provide a service to generate enriched embeddings for Arabic. These embeddings extend traditional unsupervised embeddings into semi-supervised ones by leveraging Bitext’s linguistic resources:

- lexical resources for MSA with 15M words…
- dialectal/regional variants like Egyptian, Gulf or Najdi Arabic
- rich morphological attribute tagging: POS, tense, gender, number, aspect…
- extensive corpora
- named-entity dictionaries
- offensive language tags

Bitext also offers pre-trained word embeddings and full pipelines for Arabic designed to take full advantage of these embeddings. These pipelines include:

- high-quality lemmatization
- full pipeline of linguistic services such as tokenization, entity extraction, etc.

Our enriched embeddings offer the following features:

- Ready to use with virtually any platform/pipeline: spaCy, Gensim, BERTopic…
- Wide vocabulary coverage: 50K lemmas, covering up to 15M word forms from Bitext’s balanced 5B word corpus
- Compact size: 150 dimensions
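
The distribution format of the embeddings is not stated here. As a minimal sketch, assume they ship in the word2vec text format (a header line with vocabulary size and dimension, then one lemma per line followed by its values), which Gensim's `KeyedVectors.load_word2vec_format` can read. The example below parses that format in plain Python and compares lemmas by cosine similarity. The three lemmas and their 3-dimensional values are invented for illustration; real vectors would have 150 dimensions.

```python
import io
import math


def load_text_vectors(handle):
    """Parse word2vec text format: 'count dim' header, then 'token v1 v2 ...' rows."""
    count, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed or blank rows
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != count:
        raise ValueError(f"expected {count} vectors, found {len(vectors)}")
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy stand-in for an embeddings file (hypothetical values, 3 dimensions, not 150).
toy_file = io.StringIO(
    "3 3\n"
    "كتاب 0.9 0.1 0.0\n"   # "book"
    "مكتبة 0.8 0.2 0.1\n"  # "library"
    "سيارة 0.0 0.1 0.9\n"  # "car"
)
vectors = load_text_vectors(toy_file)
sim_related = cosine(vectors["كتاب"], vectors["مكتبة"])
sim_unrelated = cosine(vectors["كتاب"], vectors["سيارة"])
print(sim_related > sim_unrelated)  # True
```

Because the vectors are keyed by lemma, inflected word forms would need to be lemmatized before lookup.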

Features

The core strengths of the Bitext embeddings & pipelines are:

Based on large corpora. Embeddings are built on corpora in the range of 25-50 GB of text (5-10 billion tokens). These corpora are designed to be well-balanced with respect to text typology, text sources and verticals.
Extensive vocabulary coverage. All pipeline components use comprehensive lexical resources. For example, our lexicon for Arabic covers 15 million word forms. These lexical resources:
- ensure high-quality linguistic tagging functions, like POS tagging and lemmatization
- reduce the number of unknown (OOV) words in embeddings
- provide rich morphological features like tense, person, number, gender, case…
Full pipeline coverage. The pipeline allows for control of all the usual steps:
- Sentence segmentation. Splits texts into sentences.
- Word Segmentation. Covers languages without spaces between words, like Japanese or Chinese, and languages with spaces between syllables, like Vietnamese
- Tokenization. Splits sentences into individual tokens (words, numbers…)
- POS tagging. Provides better word type segmentation, grouping word classes that have similar behavior, like nouns, verbs, adjectives or adverbs
- Lemmatization. Increases the vocabulary coverage of the embedding, particularly for morphologically rich languages like Spanish, French or Italian
- Decompounding. Reduces vocabulary size by splitting compounds into known words, reducing data sparsity
- Named Entity Recognition. NER uses entity dictionaries that reduce the number of unknown words
- Dependency Parsing.
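
To illustrate why lemmatization backed by a lexicon reduces unknown (OOV) words, here is a toy sketch of three of the steps above: sentence segmentation, tokenization and lemmatization, followed by a vocabulary lookup. Everything in it is a hypothetical stand-in rather than Bitext's pipeline: the three-entry lexicon, the regular expressions and the function names are assumptions made for the example.

```python
import re

# Hypothetical mini-lexicon mapping inflected word forms to lemmas.
LEXICON = {
    "الكتب": "كتاب",   # "the books" -> "book"
    "كتابين": "كتاب",  # "two books" -> "book"
    "درست": "درس",     # "I studied" -> "study"
}
# Hypothetical embedding vocabulary, keyed by lemma.
EMBEDDING_VOCAB = {"كتاب", "درس"}


def split_sentences(text):
    """Split on sentence-final punctuation, including the Arabic question mark."""
    return [s.strip() for s in re.split(r"[.!?؟]+", text) if s.strip()]


def tokenize(sentence):
    """Split a sentence into word tokens (runs of Unicode word characters)."""
    return re.findall(r"\w+", sentence)


def lemmatize(token):
    """Look the token up in the lexicon; fall back to the surface form."""
    return LEXICON.get(token, token)


def oov_rate(tokens, vocab):
    """Fraction of tokens that are missing from the vocabulary."""
    return sum(t not in vocab for t in tokens) / len(tokens)


text = "درست الكتب. درست كتابين؟"
tokens = [tok for sent in split_sentences(text) for tok in tokenize(sent)]
lemmas = [lemmatize(tok) for tok in tokens]

print(oov_rate(tokens, EMBEDDING_VOCAB))  # 1.0: no surface form is in the vocabulary
print(oov_rate(lemmas, EMBEDDING_VOCAB))  # 0.0: every lemma is in the vocabulary
```

In this toy setup, looking words up by lemma brings the OOV rate from 1.0 to 0.0. A production lexicon, such as the 15 million word forms mentioned above, plays the role of the small dictionary here.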

Have questions? Don't hesitate to contact us.

MADRID, SPAIN

Camino de las Huertas, 20, 28223 Pozuelo
Madrid, Spain

SAN FRANCISCO, USA

541 Jefferson Ave Ste 100, Redwood City
CA 94063, USA