Bitext NLP in the Progress Data Platform
Multilingual NLP for smarter indexing, better extraction, and reliable GenAI across MarkLogic and Semaphore.
Bitext brings a unique approach to Natural Language Processing by combining
symbolic computational linguistics and statistical machine learning.
Bitext supports 70+ languages and 25 language variants, and works with the world’s largest software companies,
including three of the five Big Tech companies.
In the Progress Data Platform, Bitext powers multilingual linguistic analysis and information extraction to improve relevance, consistency, and structured enrichment across large content collections—especially in morphologically rich and compound-heavy languages.
MarkLogic: Linguistic Intelligence for Indexing & Query Processing
In MarkLogic, Bitext provides advanced linguistic analysis used during indexing and query processing, including:
- Lemmatization: groups inflected word forms under a common base form, improving recall while preserving precision.
- Decompounding: splits compound words into meaningful components, critical for
German, Korean, Dutch, and the Nordic languages.
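To make the two operations above concrete, here is a minimal Python sketch of how lemmatization and decompounding can widen the set of terms stored for a token at indexing time. It is an illustration under stated assumptions, not the Bitext engine or a MarkLogic API: the LEMMAS table, the LEXICON set, and the functions lemmatize, decompound, and index_terms are hypothetical names, and the tiny hand-written tables stand in for a real linguistic lexicon.

```python
from typing import Optional

# Toy illustration only: hand-written tables stand in for a real lexicon.
# None of these names belong to the Bitext or MarkLogic APIs.

# Hypothetical lemma table: inflected form -> base form.
LEMMAS = {"ran": "run", "running": "run", "runs": "run", "mice": "mouse"}

# Hypothetical German lexicon used for decompounding.
LEXICON = {"kranken", "haus", "versicherung", "krankenhaus"}


def lemmatize(token: str) -> str:
    """Return the base form if known, otherwise the lowercased token."""
    return LEMMAS.get(token.lower(), token.lower())


def decompound(word: str) -> list:
    """Split a compound into lexicon words, trying the longest prefix first.

    Returns [word] (lowercased) when no complete split exists.
    """
    word = word.lower()

    def split(rest: str) -> Optional[list]:
        if not rest:
            return []
        # Try the longest candidate prefix first; backtrack on failure.
        for end in range(len(rest), 0, -1):
            head = rest[:end]
            if head in LEXICON:
                tail = split(rest[end:])
                if tail is not None:
                    return [head] + tail
        return None

    parts = split(word)
    return parts if parts else [word]


def index_terms(token: str) -> set:
    """Collect the surface form, its lemma, and any compound parts."""
    terms = {token.lower(), lemmatize(token)}
    terms.update(decompound(token))
    return terms
```

With these toy tables, index_terms("Krankenhausversicherung") yields the full compound plus "krankenhaus" and "versicherung", so a query for either component can match the document. A production system would rely on full morphological lexicons and language-specific rules rather than a greedy dictionary split.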
Semaphore: Information Extraction from Unstructured Text
Semaphore uses Bitext NLP to extract information from unstructured text, including:
- Part-of-Speech (POS) tagging: high-quality grammatical annotations to improve classification.
- Named Entity Recognition (NER): accurate identification of people, organizations, locations, companies/brands, and miscellaneous entities.
These features are fully integrated in 13 languages and form part of MarkLogic’s and Semaphore’s standard NLP stack.
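As a rough picture of what entity extraction hands to downstream systems, the Python sketch below turns raw text into structured records with a type label and character offsets. Everything in it is an assumption made for illustration: the GAZETTEER table, the Entity record, and extract_entities are hypothetical, and a plain dictionary lookup is swapped in for real NER, which relies on linguistic context rather than string matching.

```python
from dataclasses import dataclass

# Hypothetical gazetteer: surface string -> entity type.
# Not the Semaphore or Bitext API; a lookup table stands in for real NER.
GAZETTEER = {
    "Madrid": "LOCATION",
    "Progress": "ORGANIZATION",
    "Bitext": "ORGANIZATION",
}


@dataclass(frozen=True)
class Entity:
    text: str
    label: str
    start: int  # character offset where the mention begins
    end: int    # character offset just past the mention


def extract_entities(text: str) -> list:
    """Find every gazetteer entry in the text, ordered by position."""
    found = []
    for surface, label in GAZETTEER.items():
        pos = text.find(surface)
        while pos != -1:
            found.append(Entity(surface, label, pos, pos + len(surface)))
            pos = text.find(surface, pos + 1)
    return sorted(found, key=lambda e: e.start)
```

Records of this shape (text, label, offsets) are what make enrichment "structured": they can be stored alongside the source document, filtered by type, or linked to a taxonomy. Note that the naive substring match here would also fire inside longer words, which is one reason real systems tokenize and disambiguate first.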
Bitext Multilingual NLP SDK (Key Strengths)
Bitext provides linguistic knowledge to make Generative AI reliable, delivering one of the best-performing and most accurate
multilingual NLP SDKs on the market.
- Speed: 640,000 words/sec on an 8-core CPU
- Multiplatform: Linux, macOS, Windows; ARM & x64
- Multi-API: native C engine with C, Python, and Java APIs
- Ubiquitous deployment: on-premises or cloud
- Light footprint: ~50MB disk, ~200MB RAM, no external dependencies
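For capacity planning, the quoted speed can be turned into a back-of-the-envelope processing time. The Python sketch below assumes the 640,000 words/sec figure holds steadily on comparable 8-core hardware and ignores I/O, parsing, and indexing overhead; the corpus size is a made-up example, and estimated_seconds is a hypothetical helper, not part of any SDK.

```python
WORDS_PER_SECOND = 640_000  # throughput figure quoted above (8-core CPU)


def estimated_seconds(total_words: int, rate: int = WORDS_PER_SECOND) -> float:
    """Estimate seconds needed to analyze total_words at a steady rate."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return total_words / rate


# Hypothetical corpus: 10 million documents averaging 500 words each.
corpus_words = 10_000_000 * 500
print(f"{estimated_seconds(corpus_words) / 3600:.2f} hours")  # prints "2.17 hours"
```

Real throughput depends on language mix, document length, and the analysis features enabled, so treat the result as an optimistic lower bound rather than a guarantee.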
Core Capabilities
The Bitext NLP engine covers the full analysis pipeline, from language identification to advanced extraction, for 70+ languages and 25 variants:
- Sentence-level Language Identification
- Lemmatization & Word Segmentation (including Chinese & Japanese)
- Decompounding & Agglutination (German, Korean, Swedish, Turkish…)
- POS Tagging, including Phrase Structure Tagging
- Entity Extraction and Concept Extraction
The SDK combines deterministic morphosyntactic analysis with configurable rule pipelines and semantic disambiguation layers,
enabling explainable, scalable extraction across large document volumes and multilingual corpora.
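One way to picture a configurable, explainable rule pipeline is as an ordered list of deterministic stages, each of which records its name on the tokens it changes. The Python sketch below is only that picture: the stage names, the token dictionary layout, and run_pipeline are hypothetical and are not taken from the Bitext SDK, and the crude plural-stripping rule is a stand-in for real morphological analysis.

```python
# Toy pipeline: each stage is a plain function from token list to token list.
# All names here are hypothetical; this is not the Bitext SDK interface.


def tokenize(text: str) -> list:
    """Split on whitespace and attach an empty rule trace to each token."""
    return [{"text": w, "trace": []} for w in text.split()]


def lowercase_lemma(tokens: list) -> list:
    """Stage 1: start from the lowercased surface form."""
    for t in tokens:
        t["lemma"] = t["text"].lower()
        t["trace"].append("lowercase_lemma")
    return tokens


def strip_plural_s(tokens: list) -> list:
    """Stage 2: crude English plural rule, recorded only when it fires."""
    for t in tokens:
        lemma = t["lemma"]
        if len(lemma) > 3 and lemma.endswith("s") and not lemma.endswith("ss"):
            t["lemma"] = lemma[:-1]
            t["trace"].append("strip_plural_s")
    return tokens


def run_pipeline(text: str, stages: list) -> list:
    """Apply the configured stages in order and return annotated tokens."""
    tokens = tokenize(text)
    for stage in stages:
        tokens = stage(tokens)
    return tokens


result = run_pipeline("Banks process loans", [lowercase_lemma, strip_plural_s])
for tok in result:
    print(tok["text"], "->", tok["lemma"], tok["trace"])
```

Because the stages are deterministic and each token carries the list of rules applied to it, any output can be traced back to the rule that produced it, and behavior can be changed by reordering or swapping stages in the list passed to run_pipeline.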
Use Cases
- Semantic Search & Semantic RAG: more grounding and precision, less noise and fewer hallucinations.
- Entity & Concept Extraction: fast multilingual enrichment for vector search, graphs, and compliance workflows.
- Graph RAG: structured signals to accelerate Knowledge Graph creation from unstructured text.
Need More?
Bitext can support customers who require additional functionality beyond the standard MarkLogic/Semaphore feature set, including:
- Additional languages
- Extended entity types
- Customized linguistic behavior
- NLP components outside the standard MarkLogic/Semaphore stack
Get Started
If you are building multilingual indexing, information extraction, semantic retrieval, or GenAI pipelines on the Progress Data Platform,
Bitext can help you improve relevance, reduce noise, and accelerate structured enrichment at scale.
Learn more about Bitext as a trusted Progress Technology Alliance Partner on the official Progress Partner Locator:
View Bitext on Progress Partner Locator
MADRID, SPAIN
Camino de las Huertas, 20, 28223 Pozuelo
Madrid, Spain
SAN FRANCISCO, USA
541 Jefferson Ave Ste 100, Redwood City
CA 94063, USA