Enriched Text Embeddings
Bitext provides linguistically enriched text embeddings that outperform traditional embeddings in a wide range of downstream tasks, such as text classification, topic modeling and semantic search.
We offer both:
- Pre-trained embeddings for morphologically complex languages like Arabic, and
- Services to create custom embeddings for different languages and verticals, with your custom data
Working with 3 of the Top 5 largest companies in NASDAQ
Linguistically enriched embeddings have been shown to increase accuracy in different downstream tasks. In standard semantic similarity tests, our enriched embeddings for Arabic outperform regular embeddings by 7 percentage points (from 0.47 to 0.54, a relative gain of about 15%).
As a broader showcase of the performance improvements offered by our embeddings for morphologically-rich languages, we have benchmarked the performance of our enriched Arabic embeddings on more complex downstream tasks like:
- Topic Modeling: topic coherence increases by up to 15%
- Semantic Textual Similarity (STS): F1-score increases by 4%
- Question Answering (QA): exact match scores increase by 14%
Additionally, these enriched embeddings offer the following advantages:
- Minimizing out-of-vocabulary (OOV) issues by increasing the effective vocabulary size, especially for morphologically rich languages
- Reducing data sparsity by combining inflected forms of words, so less training data is needed to achieve the same or greater accuracy
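The sparsity point can be illustrated with a minimal Python sketch. The small form-to-lemma table and the toy corpus below are invented for illustration (English rather than Arabic, for readability) and are not Bitext data or a Bitext API. The sketch only shows that mapping inflected forms to one lemma leaves fewer distinct vocabulary entries, each with more observations to learn from.

```python
from collections import Counter

# Hypothetical form-to-lemma table; a real lexicon would cover millions of forms.
LEMMAS = {
    "walks": "walk", "walked": "walk", "walking": "walk",
    "houses": "house",
    "ran": "run", "running": "run", "runs": "run",
}


def lemmatize(token: str) -> str:
    """Return the lemma for a known form, or the token itself otherwise."""
    return LEMMAS.get(token, token)


def vocabulary_counts(tokens, normalize=None):
    """Count occurrences per vocabulary entry, optionally after normalization."""
    if normalize is not None:
        tokens = [normalize(t) for t in tokens]
    return Counter(tokens)


corpus = "she walks he walked they walk running runs ran house houses".split()

surface = vocabulary_counts(corpus)                      # one entry per surface form
merged = vocabulary_counts(corpus, normalize=lemmatize)  # one entry per lemma

print(len(surface), "surface forms ->", len(merged), "lemma entries")
print("observations for 'walk':", surface["walk"], "->", merged["walk"])
```

In this toy corpus every surface form occurs once, while the merged entry for "walk" pools three occurrences. This is the effect the bullet above describes, at a very small scale.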
Our enriched embeddings extend traditional unsupervised embeddings into semi-supervised ones by leveraging Bitext’s linguistic resources:
- lexical resources for 77 languages and 25 dialectal/regional variants
- rich morphological attribute tagging: POS, tense, gender, number, aspect…
- semantic resources including synonyms and antonyms
These are complemented by other resources such as extensive corpora, named entity dictionaries, offensive language tags and more.
Bitext also offers pre-trained NLP pipelines designed to take full advantage of these embeddings. These pipelines include:
- high-quality multilingual lemmatization
- full pipeline of linguistic services such as POS tagging, entity extraction, etc.
Our enriched embeddings offer the following features:
- Ready to use with virtually any platform/pipeline: spaCy, Gensim, BERTopic…
- Wide vocabulary coverage: 50K lemmas, covering up to 15M word forms from Bitext’s balanced 5B word corpus
- Compact size: 150 dimensions
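As a rough back-of-the-envelope check on the figures above, the following Python sketch estimates the size of a dense embedding table. It assumes one vector per entry stored as 32-bit floats, which is a common choice but is not stated in this document. The form-keyed table is a hypothetical comparison only, not something Bitext describes shipping.

```python
def embedding_table_bytes(entries: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Size of a dense table with one vector per entry (float32 assumed by default)."""
    return entries * dimensions * bytes_per_value


# Figures from the list above: 50K lemmas, 150 dimensions, up to 15M word forms.
lemma_table = embedding_table_bytes(50_000, 150)      # one vector per lemma
form_table = embedding_table_bytes(15_000_000, 150)   # hypothetical: one vector per word form

print(f"lemma-keyed table: {lemma_table / 1e6:.0f} MB")  # 30 MB
print(f"form-keyed table:  {form_table / 1e9:.0f} GB")   # 9 GB
```

Under these assumptions, keying vectors by lemma rather than by surface form is what keeps the table in the tens of megabytes while still covering the full set of word forms.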
The core strengths of Bitext’s text embeddings and pipelines are:
- Based on large corpora. Text embeddings are built on corpora in the range of 25-50 GB of text (5-10 billion tokens). These corpora are designed to be well-balanced with respect to text typology, text sources and verticals.
- Multilingual coverage. Text embeddings are provided in 77 languages (Arabic, Korean, Hindi, German…) and 25 language variants (Canadian French, Egyptian Arabic, US Spanish…)
- Extensive vocabulary coverage. All pipeline components use comprehensive lexical resources.
  For example, the lexicon for Finnish covers 80 million word forms; Hungarian covers 18 million word forms; Japanese more than 9 million; Korean more than 6 million; Turkish 3.5 million; and German 2.5 million.
  These lexical resources:
  - ensure high-quality linguistic tagging functions, like POS tagging and lemmatization
  - reduce the number of unknown (OOV) words in embeddings
  - provide rich morphological features like tense, person, number, gender, case…
- Full pipeline coverage. The pipeline allows for control of all the usual steps:
  - Sentence segmentation. Splits texts into sentences.
  - Word segmentation. Covers languages without spaces between words, like Japanese or Chinese, and languages with spaces between syllables rather than words, like Vietnamese
  - Tokenization. Splits sentences into individual tokens (words, numbers…)
  - POS tagging. Provides better word type segmentation, grouping word classes that have similar behavior, like nouns, verbs, adjectives or adverbs
  - Lemmatization. Increases the vocabulary coverage of the embedding, particularly for morphologically rich languages like Spanish, French or Italian
  - Decompounding. Reduces vocabulary size by splitting compounds into known words, reducing data sparsity
  - Named Entity Recognition. NER uses entity dictionaries that reduce the number of unknown words
  - Dependency Parsing.
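To make the decompounding step more concrete, here is a minimal Python sketch of dictionary-based compound splitting. It is not Bitext's implementation: the mini-lexicon, the `MIN_PART_LEN` threshold and the function names are all hypothetical, and the sketch ignores linking elements (such as the German "Fugen-s") that a production decompounder would need to handle.

```python
from functools import lru_cache

# Hypothetical mini-lexicon of known German words (lowercased).
LEXICON = frozenset({"haus", "tür", "schlüssel", "auto", "bahn"})

MIN_PART_LEN = 3  # assumed lower bound, avoids splitting into tiny fragments


@lru_cache(maxsize=None)
def split_compound(word: str):
    """Split `word` into known lexicon entries, or return None if impossible.

    Tries the longest known prefix first and backtracks when the
    remainder cannot be split.
    """
    if word in LEXICON:
        return (word,)
    for end in range(len(word) - MIN_PART_LEN, MIN_PART_LEN - 1, -1):
        head, tail = word[:end], word[end:]
        if head in LEXICON:
            rest = split_compound(tail)
            if rest is not None:
                return (head,) + rest
    return None


def decompound(word: str):
    """Return the parts of a compound, or the word unchanged if no split is found."""
    parts = split_compound(word.lower())
    return list(parts) if parts else [word]


print(decompound("Haustürschlüssel"))  # ['haus', 'tür', 'schlüssel']
print(decompound("Autobahn"))          # ['auto', 'bahn']
print(decompound("Computer"))          # ['Computer'] (no known split)
```

Each compound is replaced by parts that already exist in the lexicon, so a rare compound contributes to the statistics of its more frequent components instead of adding a new, sparsely observed vocabulary entry.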