Enriched Arabic Text Embeddings
Bitext provides linguistically enriched text embeddings for Arabic that outperform traditional embeddings in a wide range of downstream tasks, such as text classification, topic modeling and semantic search.
We offer both:
- Pre-trained embeddings for morphologically-complex languages like Arabic
- Services to create custom embeddings for Arabic
Working with 3 of the Top 5 largest companies in NASDAQ
Linguistically enriched embeddings have been shown to increase accuracy in a range of downstream tasks. In standard semantic similarity tests, our enriched embeddings outperform regular embeddings by 7 points (from 0.47 to 0.54).
To show the gains for morphologically rich languages more broadly, we have also benchmarked our enriched embeddings on more complex downstream tasks:
- Topic Modeling: topic coherence increases by up to 15%
- Semantic Textual Similarity (STS): F1-score increases by 4%
- Question Answering (QA): exact match scores increase by 14%
We provide a service to generate enriched embeddings for Arabic that extends traditional unsupervised embeddings into semi-supervised ones by leveraging Bitext’s linguistic resources.
Bitext also offers pre-trained word embeddings and full pipelines for Arabic designed to take full advantage of these embeddings. These pipelines include:
- high-quality lemmatization
- a full pipeline of linguistic services, such as tokenization and entity extraction
Our enriched embeddings offer the following features:
- Ready to use with virtually any platform/pipeline: spaCy, Gensim, BERTopic…
- Wide vocabulary coverage: 50K lemmas, covering up to 15M word forms from Bitext’s balanced 5B word corpus
- Compact size: 150 dimensions
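As an illustration of how such embeddings might be consumed, here is a minimal sketch that parses vectors stored in the plain-text word2vec layout and compares entries by cosine similarity. The file layout, the toy 3-dimensional vectors, and the function names `load_word2vec_text` and `cosine` are assumptions made for this example; the actual delivery format of the Bitext embeddings is not described here.

```python
import io
import math


def load_word2vec_text(handle):
    """Parse embeddings from the plain-text word2vec layout.

    Assumed layout: a header line "<vocab_size> <dimensions>", then one
    line per entry with the key followed by its vector components.
    """
    vocab_size, dims = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dims + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError("header vocabulary size does not match file contents")
    return vectors, dims


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy 3-dimensional stand-in for a real 150-dimensional file (made-up values).
# Keys are lemmas: "book", "library", "car".
SAMPLE = (
    "3 3\n"
    "كتاب 0.9 0.1 0.0\n"
    "مكتبة 0.8 0.2 0.1\n"
    "سيارة 0.0 0.1 0.9\n"
)

vectors, dims = load_word2vec_text(io.StringIO(SAMPLE))
sim_book_library = cosine(vectors["كتاب"], vectors["مكتبة"])
sim_book_car = cosine(vectors["كتاب"], vectors["سيارة"])
print(dims, sim_book_library > sim_book_car)
```

In practice, a file in this layout can usually be loaded directly with Gensim's `KeyedVectors.load_word2vec_format`, which avoids hand-written parsing.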
The core strengths of the Bitext embeddings & pipelines are:
- Based on large corpora. Embeddings are built on corpora in the range of 25-50 GB of text (5-10 billion tokens). These corpora are designed to be well-balanced with respect to text typology, text sources and verticals.
- Extensive vocabulary coverage. All pipeline components use comprehensive lexical resources. For example, our lexicon for Arabic covers 15 million word forms. These lexical resources:
  - ensure high quality linguistic tagging functions, like POS tagging and lemmatization
  - reduce the number of unknown (OOV) words in embeddings
  - provide rich morphological features like tense, person, number, gender, case…
- Full pipeline coverage. The pipeline allows for control of all the usual steps:
  - Sentence segmentation. Splits texts into sentences.
  - Word segmentation. Covers languages without spaces between words, like Japanese or Chinese, or languages that put spaces between syllables, like Vietnamese.
  - Tokenization. Splits sentences into individual tokens (words, numbers…).
  - POS tagging. Provides better word type segmentation, grouping word classes that have similar behavior, like nouns, verbs, adjectives or adverbs.
  - Lemmatization. Increases the vocabulary coverage of the embedding, particularly for morphologically rich languages like Spanish, French or Italian.
  - Decompounding. Reduces vocabulary size by splitting compounds into known words, reducing data sparsity.
  - Named Entity Recognition. NER uses entity dictionaries that reduce the number of unknown words.
  - Dependency parsing.
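The effect of lemmatization on vocabulary coverage can be sketched with a toy example: word forms are mapped to lemmas before the embedding lookup, so inflected forms that are missing from the vocabulary still resolve to a known entry. The lexicon, the lemma-keyed vocabulary, and the helper names below are hypothetical and only illustrate the idea; they are not Bitext's resources or API, and whitespace tokenization stands in for a real tokenizer.

```python
# Hypothetical toy lexicon mapping inflected word forms to lemmas.
# Glosses, in order: "the book", "my book", "and the book", "the library".
FORM_TO_LEMMA = {
    "الكتاب": "كتاب",
    "كتابي": "كتاب",
    "والكتاب": "كتاب",
    "المكتبة": "مكتبة",
}

# Hypothetical lemma-keyed embedding vocabulary: "book", "library", "in".
EMBEDDING_VOCAB = {"كتاب", "مكتبة", "في"}


def tokenize(sentence):
    """Whitespace tokenization; a stand-in for a real tokenizer."""
    return sentence.split()


def lemmatize(token):
    """Return the lemma for a known word form, else the token unchanged."""
    return FORM_TO_LEMMA.get(token, token)


def oov_rate(tokens, vocab):
    """Fraction of tokens that have no entry in the embedding vocabulary."""
    if not tokens:
        return 0.0
    missing = sum(1 for t in tokens if t not in vocab)
    return missing / len(tokens)


# "My book is in the library."
sentence = "كتابي في المكتبة"
tokens = tokenize(sentence)
lemmas = [lemmatize(t) for t in tokens]

raw_oov = oov_rate(tokens, EMBEDDING_VOCAB)    # lookup on surface forms
lemma_oov = oov_rate(lemmas, EMBEDDING_VOCAB)  # lookup after lemmatization
print(raw_oov, lemma_oov)
```

In this toy case, two of the three surface forms are unknown to the vocabulary, while all three lemmas are found. A production lexicon plays the same role at a much larger scale.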