<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Bitext. We help AI understand humans. &#8211; chatbots that work</title>
	<atom:link href="https://www.bitext.com/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.bitext.com/</link>
	<description>We offer you all the tools needed to solve your NLP requests for chatbots and CX Analytics.</description>
	<lastBuildDate>Wed, 15 Apr 2026 18:26:41 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.6.5</generator>

<image>
	<url>https://www.bitext.com/wp-content/uploads/2020/04/favicon-150x150.ico</url>
	<title>Bitext. We help AI understand humans. &#8211; chatbots that work</title>
	<link>https://www.bitext.com/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Some of your RAG-related issues have an easy &#038; quick solution: lemmatization</title>
		<link>https://www.bitext.com/blog/some-of-your-rag-related-issues-have-an-easy-quick-solution-lemmatization/</link>
		
		<dc:creator><![CDATA[admin]]></dc:creator>
		<pubDate>Wed, 15 Apr 2026 14:34:11 +0000</pubDate>
				<category><![CDATA[AI]]></category>
		<category><![CDATA[Lemmatization]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[NER]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[semantic]]></category>
		<category><![CDATA[Stemming]]></category>
		<category><![CDATA[Entity extraction]]></category>
		<guid isPermaLink="false">https://www.bitext.com/?p=44378</guid>

					<description><![CDATA[<p>Some RAG issues have a simpler fix than people think: better text normalization.</p>
<p>One common culprit is stemming. Stemming is a blunt, error-prone approach: it strips word endings mechanically, without accounting for morphology, part of speech, or context. This often collapses unrelated words into the same stem simply because they look similar on the surface. The result is noisy normalization.</p>
<p>The post <a href="https://www.bitext.com/blog/some-of-your-rag-related-issues-have-an-easy-quick-solution-lemmatization/">Some of your RAG-related issues have an easy &#038; quick solution: lemmatization</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></description>
										<content:encoded><![CDATA[<div class="et_pb_section et_pb_section_0 et_pb_with_background et_section_regular" >
				
				
				
				
				
				
				<div class="et_pb_row et_pb_row_0">
				<div class="et_pb_column et_pb_column_4_4 et_pb_column_0  et_pb_css_mix_blend_mode_passthrough et-last-child">
				
				
				
				
				<div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_0  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
<div class="et_pb_text_inner"><style>
  .bitext-example-box {
    background: #fff5f5;
    border-left: 4px solid #b71c1c;
    padding: 14px 16px;
    margin: 14px 0 22px;
    border-radius: 6px;
  }
  .bitext-example-box p {
    margin: 0 0 10px;
    font-size: 16px;
    color: #333333;
    line-height: 1.6;
  }
  .bitext-example-box p:last-child {
    margin-bottom: 0;
  }
  .bitext-highlight {
    display: inline-block;
    background: #fdeaea;
    color: #b71c1c;
    font-weight: 700;
    padding: 2px 6px;
    border-radius: 4px;
  }
  .bitext-benefits {
    background: #fafafa;
    border: 1px solid #e6e6e6;
    padding: 14px 16px;
    margin: 18px 0 22px;
    border-radius: 6px;
  }
  .bitext-benefits ul {
    margin: 0;
    padding-left: 20px;
  }
  .bitext-benefits li {
    margin: 6px 0;
    font-size: 16px;
    color: #333333;
    line-height: 1.6;
  }
</style>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  Some RAG issues have a simpler fix than people think: <strong>better text normalization</strong>.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  One common culprit is <strong>stemming</strong>. Stemming is a blunt, error-prone approach: it strips word endings mechanically, without accounting for morphology, part of speech, or context. This often collapses unrelated words into the same stem simply because they look similar on the surface.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  The result is noisy normalization.
</p>
<div class="bitext-example-box">
<p><strong>For example, in English, according to the widely used Porter stemmer:</strong></p>
<p><span class="bitext-highlight">“organization”</span> is wrongly linked to <span class="bitext-highlight">“organ”</span></p>
<p><span class="bitext-highlight">“news”</span> is wrongly associated with <span class="bitext-highlight">“new”</span></p>
<p><span class="bitext-highlight">“united”</span> is wrongly connected to <span class="bitext-highlight">“unit”</span></p>
</div>
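<p style="font-size: 16px; color: #333333; line-height: 1.6;">To see how blunt this is in practice, here is a deliberately naive suffix-stripping stemmer (a toy sketch, not the actual Porter algorithm) that reproduces the same kind of collisions:</p>

```python
# A toy suffix stripper: it chops the first matching ending mechanically,
# with no morphology or part-of-speech awareness (illustration only).
SUFFIXES = ["ization", "ation", "ed", "s"]

def naive_stem(word):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(naive_stem("organization"))  # -> organ
print(naive_stem("news"))          # -> new
print(naive_stem("united"))        # -> unit
```

<p style="font-size: 16px; color: #333333; line-height: 1.6;">Unrelated words end up sharing a stem purely because of surface similarity.</p>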
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  In languages with richer morphology, such as <strong>Spanish, German, French, Italian</strong> and others, these problems get worse.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  Since stemming is performed at the beginning of the text analysis process, these errors affect every task that follows. The noise does not stay contained. It flows downstream into indexing, retrieval, and search, which means some of the “RAG problems” teams run into actually begin much earlier in the pipeline.
</p>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">
  Why lemmatization is different
</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  <strong>Lemmatization</strong> avoids these noisy associations. Instead of chopping words mechanically, lemmatization maps inflected forms to their correct dictionary form, typically using morphological analysis and part-of-speech information.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  That makes it much better at normalizing real linguistic variation while avoiding many of the false matches that stemming introduces.
</p>
<div class="bitext-example-box">
<p><strong>In the above examples:</strong></p>
<p><span class="bitext-highlight">“organizations”</span> is correctly linked to <span class="bitext-highlight">“organization”</span></p>
<p><span class="bitext-highlight">“news”</span> is <strong>not</strong> associated with <span class="bitext-highlight">“new”</span>; they are independent, unrelated words</p>
<p><span class="bitext-highlight">“united”</span> is properly connected to <span class="bitext-highlight">“unite”</span></p>
</div>
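<p style="font-size: 16px; color: #333333; line-height: 1.6;">Conceptually, lemmatization is a lookup against a morphological dictionary keyed by word form and part of speech. The tiny lexicon below is hypothetical, just to show the behavior:</p>

```python
# Minimal dictionary-based lemmatizer sketch. A real lemmatizer uses a full
# morphological lexicon; this hand-written one only covers the examples above.
LEXICON = {
    ("organizations", "NOUN"): "organization",
    ("news", "NOUN"): "news",   # stays "news": never collapsed into "new"
    ("united", "VERB"): "unite",
}

def lemmatize(word, pos):
    """Look up the lemma for (word, POS); unknown forms pass through unchanged."""
    return LEXICON.get((word, pos), word)

print(lemmatize("organizations", "NOUN"))  # -> organization
print(lemmatize("news", "NOUN"))           # -> news
```

<p style="font-size: 16px; color: #333333; line-height: 1.6;">Because the mapping is a dictionary lookup rather than a rewrite rule, unrelated words can never be merged by accident.</p>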
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  Lemmatization is also a fully deterministic, consistent, and reliable process, with clear benefits:
</p>
<div class="bitext-benefits">
<ul>
<li>fewer false positives</li>
<li>cleaner indexing</li>
<li>better retrieval quality</li>
<li>more robust multilingual search</li>
</ul>
</div>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  And since retrieval quality is critical for RAG, improving normalization upstream can have an outsized impact downstream.
</p>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">
  The real source of some RAG issues
</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  A lot of teams treat retrieval issues as if they were generation issues.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  Often, they are not.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  Sometimes the problem starts with <strong>stemming</strong>.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin-top: 20px;">For a deeper understanding of how normalization impacts search relevance, check out this post on <a href="https://www.bitext.com/blog/lemmatization-vs-stemming/" style="color: #b71c1c; font-weight: 600;">lemmatization vs stemming</a>.</p></div>
			</div>
			</div>
				
				
				
				
			</div>
				
				
			</div>
<p>The post <a href="https://www.bitext.com/blog/some-of-your-rag-related-issues-have-an-easy-quick-solution-lemmatization/">Some of your RAG-related issues have an easy &#038; quick solution: lemmatization</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>The Hidden Signal in Millions of News Articles That Reveals How Global Narratives Form</title>
		<link>https://www.bitext.com/blog/the-hidden-signal-in-millions-of-news-articles-that-reveals-how-global-narratives-form/</link>
		
		<dc:creator><![CDATA[admin]]></dc:creator>
		<pubDate>Thu, 12 Mar 2026 18:37:51 +0000</pubDate>
				<category><![CDATA[AI]]></category>
		<category><![CDATA[Generative AI]]></category>
		<category><![CDATA[knowledge graph]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[NER]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[Entity extraction]]></category>
		<guid isPermaLink="false">https://www.bitext.com/?p=44341</guid>

					<description><![CDATA[<p><strong>The Experiment</strong><br />
We tested this idea using the Leipzig English News corpora from the Wortschatz Project at Leipzig University. We analyzed datasets from 2023, 2024 and 2025.</p>
<p>Across these datasets, the pipeline processed roughly:</p>
<ul>
<li>2 million raw news articles</li>
<li>400K articles after topical filtering</li>
</ul>
<p>From these documents the pipeline extracted:</p>
<ul>
<li>millions of entity mentions</li>
<li>tens of millions of co-mention relationships</li>
</ul>
<p>To focus on economic and technology narratives, documents were filtered using the IPTC Media Topics taxonomy, keeping only:</p>
<ul>
<li>Economy, Business and Finance</li>
<li>Science and Technology</li>
</ul>
<p>The post <a href="https://www.bitext.com/blog/the-hidden-signal-in-millions-of-news-articles-that-reveals-how-global-narratives-form/">The Hidden Signal in Millions of News Articles That Reveals How Global Narratives Form</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></description>
										<content:encoded><![CDATA[<div class="et_pb_section et_pb_section_1 et_pb_with_background et_section_regular" >
				
				
				
				
				
				
				<div class="et_pb_row et_pb_row_1">
				<div class="et_pb_column et_pb_column_4_4 et_pb_column_1  et_pb_css_mix_blend_mode_passthrough et-last-child">
				
				
				
				
				<div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_14  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
<div class="et_pb_text_inner"><style>
  /* Bitext table styling */
  table.bitext-table {
    width:100%;
    border-collapse:collapse;
    font-size:15px;
    margin:10px 0 22px;
  }
  table.bitext-table th {
    background-color:#b71c1c !important;  /* Bitext red */
    color:#ffffff !important;
    padding:8px 10px;
    border:1px solid #9c1515;
    text-align:left;
  }
  table.bitext-table td {
    padding:8px 10px;
    border:1px solid #e0e0e0;
    color:#333333;
  }
  table.bitext-table tr:nth-child(even) td {
    background-color:#fafafa;
  }
</style>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Every day, millions of news articles are published about technology, business and geopolitics.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">But there is a signal hidden inside them that most analytics systems completely miss.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">It isn’t in what the articles say.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">It’s in which entities appear together.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Once you start measuring that signal, you can see how global narratives form.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">This phenomenon is called co-mentions, and it is widely used in knowledge graph construction and large-scale text analysis.</p>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">Why Co-mentions Matter</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Counting mentions tells you which entities are important.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">But co-mentions tell you something far more valuable: how those entities are connected.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">That distinction is crucial.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">For example: AI might appear in thousands of articles.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">But if AI increasingly appears alongside Nvidia, something deeper is happening. It reveals a narrative forming:</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin-left: 20px;"><strong>AI infrastructure → Nvidia</strong></p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Similarly, when AI increasingly appears together with the US or China, the story changes. AI is no longer just a technology topic. It has become a geopolitical one.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Co-mentions allow us to detect these narrative shifts early – before they become obvious.</p>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">The Experiment</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">We tested this idea using the Leipzig English News corpora from the Wortschatz Project at Leipzig University. We analyzed datasets from 2023, 2024 and 2025.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Across these datasets, the pipeline processed roughly:</p>
<ul style="font-size: 16px; color: #333333; line-height: 1.6; padding-left: 20px; margin-top: 10px;">
<li>2 million raw news articles</li>
<li>400K articles after topical filtering</li>
</ul>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">From these documents the pipeline extracted:</p>
<ul style="font-size: 16px; color: #333333; line-height: 1.6; padding-left: 20px; margin-top: 10px;">
<li>millions of entity mentions</li>
<li>tens of millions of co-mention relationships</li>
</ul>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">To focus on economic and technology narratives, documents were filtered using the IPTC Media Topics taxonomy, keeping only:</p>
<ul style="font-size: 16px; color: #333333; line-height: 1.6; padding-left: 20px; margin-top: 10px;">
<li>Economy, Business and Finance</li>
<li>Science and Technology</li>
</ul>
<table class="bitext-table">
<tbody>
<tr>
<th>Dataset Scope</th>
<th>Approximate Volume</th>
</tr>
<tr>
<td>Raw news articles processed</td>
<td>2 million</td>
</tr>
<tr>
<td>Articles after topical filtering</td>
<td>400K</td>
</tr>
<tr>
<td>Entity mentions extracted</td>
<td>Millions</td>
</tr>
<tr>
<td>Co-mention relationships generated</td>
<td>Tens of millions</td>
</tr>
</tbody>
</table>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">How the Analysis Works</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">The pipeline combines entity extraction with graph analysis:</p>
<ol style="font-size: 16px; color: #333333; line-height: 1.6; padding-left: 22px; margin-top: 10px;">
<li>Entity recognition using the Bitext NLP SDK (companies, countries, technologies)</li>
<li>Entity normalization (e.g. “US”, “United States”, “America” → United States)</li>
<li>Extraction of relationships between entities appearing in the same document</li>
<li>Aggregation of co-mentions across the corpus</li>
</ol>
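<p style="font-size: 16px; color: #333333; line-height: 1.6;">The normalization step can be pictured as an alias lookup. The alias table below is hypothetical (the Bitext NLP SDK maintains its own canonical entity inventory); it only illustrates the idea:</p>

```python
# Hypothetical alias table mapping surface forms to canonical entities.
ALIASES = {
    "US": "United States",
    "U.S.": "United States",
    "America": "United States",
}

def normalize(mention):
    """Map a surface form to its canonical entity; unknown forms pass through."""
    return ALIASES.get(mention, mention)

print(normalize("America"))  # -> United States
print(normalize("Nvidia"))   # -> Nvidia
```
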
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Relationships are generated by linking entities that appear in the same document, producing weighted co-mention edges.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">For example, if a document mentions US, China, Nvidia and AI, the system generates relationships such as:</p>
<ul style="font-size: 16px; color: #333333; line-height: 1.6; padding-left: 20px; margin-top: 10px;">
<li>US – China</li>
<li>US – AI</li>
<li>China – AI</li>
<li>Nvidia – AI</li>
</ul>
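<p style="font-size: 16px; color: #333333; line-height: 1.6;">As a rough sketch, pairing and aggregating co-mentions across documents looks like this (the per-document entity sets are hand-written stand-ins for real NER output):</p>

```python
from collections import Counter
from itertools import combinations

# Illustrative per-document entity sets; in the real pipeline these
# come from the entity recognition and normalization steps.
docs_entities = [
    {"US", "China", "Nvidia", "AI"},
    {"Nvidia", "AI"},
    {"US", "AI"},
]

edges = Counter()
for entities in docs_entities:
    # Every unordered pair of entities in the same document is one co-mention.
    for pair in combinations(sorted(entities), 2):
        edges[pair] += 1

print(edges[("AI", "Nvidia")])  # -> 2 (weight of the AI-Nvidia edge)
```

<p style="font-size: 16px; color: #333333; line-height: 1.6;">The resulting weighted edges are exactly what gets loaded into the knowledge graph.</p>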
<table class="bitext-table">
<tbody>
<tr>
<th>Pipeline Step</th>
<th>What It Does</th>
</tr>
<tr>
<td>Entity recognition</td>
<td>Extracts companies, countries, technologies and other entities from text</td>
</tr>
<tr>
<td>Normalization</td>
<td>Maps variants such as “US” and “America” to a canonical entity</td>
</tr>
<tr>
<td>Relationship extraction</td>
<td>Links entities appearing in the same document</td>
</tr>
<tr>
<td>Aggregation</td>
<td>Builds weighted co-mention patterns across the corpus</td>
</tr>
</tbody>
</table>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">From Text to Knowledge Graph</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">When these relationships are aggregated across hundreds of thousands of articles, they form a knowledge graph that reveals patterns in global narratives.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Even a tiny fragment already tells a story:</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin-left: 20px;"><strong>AI → Nvidia → U.S. → China</strong></p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Technology → infrastructure → geopolitics.</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Input</th>
<th>Transformation</th>
<th>Output</th>
</tr>
<tr>
<td>Unstructured news text</td>
<td>Entity extraction + co-mention analysis</td>
<td>Knowledge graph of entities and relationships</td>
</tr>
</tbody>
</table>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">Why This Matters</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Most of the world&#8217;s knowledge still lives in unstructured text. But once entities and relationships are extracted at scale, that text can be transformed into structured knowledge graphs ready for analysis.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">These graphs integrate naturally with platforms such as Neo4j, Stardog, Ontotext and MarkLogic, where the extracted entities and relationships can be explored and analyzed.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">In short: <strong>text → entities → relationships → knowledge graph</strong></p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">And once the graph exists, hidden signals start to appear.</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Stage</th>
<th>Result</th>
</tr>
<tr>
<td>Text</td>
<td>Raw unstructured articles</td>
</tr>
<tr>
<td>Entities</td>
<td>Normalized companies, countries, technologies and other concepts</td>
</tr>
<tr>
<td>Relationships</td>
<td>Weighted co-mentions between entities</td>
</tr>
<tr>
<td>Knowledge graph</td>
<td>Structured narrative map ready for analysis</td>
</tr>
</tbody>
</table>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">In Summary</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Co-mentions are one of the simplest signals you can extract from text.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">But at scale, they reveal how the world connects ideas, companies and countries.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">What other signals do you think could be extracted from large-scale news analysis?</p></div>
			</div>
			</div>
				
				
				
				
			</div>
				
				
			</div>
<p>The post <a href="https://www.bitext.com/blog/the-hidden-signal-in-millions-of-news-articles-that-reveals-how-global-narratives-form/">The Hidden Signal in Millions of News Articles That Reveals How Global Narratives Form</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Why LLMs Are the Wrong Tool for Enterprise-Grade Entity Extraction</title>
		<link>https://www.bitext.com/blog/why-llms-are-the-wrong-tool-for-enterprise-grade-entity-extraction/</link>
		
		<dc:creator><![CDATA[admin]]></dc:creator>
		<pubDate>Thu, 05 Feb 2026 15:18:01 +0000</pubDate>
				<category><![CDATA[AI]]></category>
		<category><![CDATA[Generative AI]]></category>
		<category><![CDATA[knowledge graph]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[NER]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[Entity extraction]]></category>
		<guid isPermaLink="false">https://www.bitext.com/?p=44325</guid>

					<description><![CDATA[<p>Large Language Models are powerful systems for language generation and reasoning.<br />
However, when they are used for entity extraction in enterprise environments, they introduce instability where reliability is required.<br />
Entity extraction is not about creativity or interpretation. It is infrastructure. In production systems, entities must be extracted in a way that is consistent, repeatable, and stable over time.</p>
<p>Tagging consistency is essential for smooth training. Contradictions and inconsistencies not only decrease accuracy but also generate hidden MLOps costs when debugging and fixing errors. Consistency is often taken for granted, yet it is rarely achieved, in these datasets or in any other manual tagging work.</p>
<p>Consistency starts with a solid, clear definition of what an entity is, and typically, if not always, that definition is missing.</p>
<p>And knowledge graphs are built using automatic data extraction tools: not only entity extraction but also concept extraction and relationships among entities or concepts.</p>
<p>The post <a href="https://www.bitext.com/blog/why-llms-are-the-wrong-tool-for-enterprise-grade-entity-extraction/">Why LLMs Are the Wrong Tool for Enterprise-Grade Entity Extraction</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></description>
										<content:encoded><![CDATA[<div class="et_pb_section et_pb_section_2 et_pb_with_background et_section_regular" >
				
				
				
				
				
				
				<div class="et_pb_row et_pb_row_2">
				<div class="et_pb_column et_pb_column_4_4 et_pb_column_2  et_pb_css_mix_blend_mode_passthrough et-last-child">
				
				
				
				
				<div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_28  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><style>
  /* Bitext table styling */
  table.bitext-table {
    width:100%;
    border-collapse:collapse;
    font-size:15px;
    margin:10px 0 22px;
  }
  table.bitext-table th {
    background-color:#b71c1c !important;  /* Bitext red */
    color:#ffffff !important;
    padding:8px 10px;
    border:1px solid #9c1515;
    text-align:left;
  }
  table.bitext-table td {
    padding:8px 10px;
    border:1px solid #e0e0e0;
    color:#333333;
  }
  table.bitext-table tr:nth-child(even) td {
    background-color:#fafafa;
  }
</style>
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">Entity Extraction Is an Infrastructure Task, Not a Generative Task</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  Large Language Models are powerful systems for language generation and reasoning. However, when they are used for entity extraction in enterprise environments, they introduce instability where reliability is required.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  Entity extraction is not about creativity or interpretation. It is infrastructure. In production systems, entities must be extracted in a way that is consistent, repeatable, and stable over time.
</p>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">Why Probabilistic Models Break Deterministic Enterprise Pipelines</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  In enterprise workflows, the same input must always produce the same entities. LLMs are probabilistic by design. Even with temperature set to zero, their outputs can change due to prompt phrasing, surrounding context, or model updates.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  This variability is incompatible with systems that require long-term guarantees, such as search platforms, analytics pipelines, compliance systems, or enterprise RAG architectures.
</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Enterprise Requirement</th>
<th>LLM Behavior</th>
<th>Impact</th>
</tr>
<tr>
<td>Same input → same output</td>
<td>Outputs can vary across runs</td>
<td>Breaks repeatability and auditability</td>
</tr>
<tr>
<td>Long-term guarantees</td>
<td>Model updates can change behavior</td>
<td>Pipeline drift over time</td>
</tr>
<tr>
<td>Stable extraction contracts</td>
<td>Sensitive to prompts/context</td>
<td>Hidden regressions in production</td>
</tr>
</tbody>
</table>
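<p style="font-size: 16px; color: #333333; line-height: 1.6;">The repeatability requirement in the first row can be enforced as an automated gate. Below is a minimal Python sketch; <code>extract_entities</code> here is only a hypothetical stand-in for whichever extractor is under test, and the gazetteer entries are illustrative:</p>

```python
import hashlib
import json

# Hypothetical deterministic extractor standing in for the system under test;
# the gazetteer entries are purely illustrative.
GAZETTEER = {"Bitext": "COMPANY", "GDPR": "REGULATION"}

def extract_entities(text):
    return [{"text": tok, "type": GAZETTEER[tok]}
            for tok in text.split() if tok in GAZETTEER]

def output_fingerprint(text):
    # Canonical serialization, then hash: any change in the output is detectable.
    payload = json.dumps(extract_entities(text), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def is_repeatable(text, runs=5):
    # Same input must yield the same fingerprint on every run.
    return len({output_fingerprint(text) for _ in range(runs)}) == 1

print(is_repeatable("Bitext must comply with GDPR"))  # True
```

<p style="font-size: 16px; color: #333333; line-height: 1.6;">The same fingerprint check, run against stored baselines, also catches the pipeline drift and hidden regressions described in the other two rows.</p>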
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">The Problem with “Interpretation” in Entity Classification</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  Enterprises do not need models that interpret what an entity might be. They need invariant behavior.
</p>
<ul style="font-size: 16px; color: #333333; line-height: 1.6; padding-left: 20px; margin-top: 10px;">
<li>A company name should always be classified as a company.</li>
<li>A regulation reference should never disappear because the model decided it was not important in that context.</li>
</ul>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  LLMs optimize for plausibility. Enterprise systems require strict rules and predictable outcomes.
</p>
<table class="bitext-table">
<tbody>
<tr>
<th>What Enterprises Need</th>
<th>What LLMs Optimize For</th>
</tr>
<tr>
<td>Invariant classification</td>
<td>Plausible interpretation</td>
</tr>
<tr>
<td>Predictable outputs</td>
<td>Context-dependent responses</td>
</tr>
<tr>
<td>Auditable behavior</td>
<td>Emergent, hard-to-verify behavior</td>
</tr>
</tbody>
</table>
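<p style="font-size: 16px; color: #333333; line-height: 1.6;">Invariant classification can be as simple as a lookup that never consults context. A toy sketch (names and labels are illustrative, not a real Bitext rule set):</p>

```python
# Invariant classification: the label depends only on the entry itself,
# never on the surrounding context. Entries here are illustrative.
ENTITY_RULES = {
    "acme corp": "COMPANY",
    "regulation (eu) 2016/679": "REGULATION",
}

def classify(span):
    # Case-insensitive exact lookup; unknown spans are explicitly rejected
    # rather than guessed, so the system fails conservatively.
    return ENTITY_RULES.get(span.lower(), "UNKNOWN")

# The same span gets the same label in any context, on any run.
print(classify("Acme Corp"))                 # COMPANY
print(classify("Regulation (EU) 2016/679"))  # REGULATION
print(classify("something unexpected"))      # UNKNOWN
```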
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">Hallucinated Entities Corrupt Downstream Systems</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  One of the most dangerous failure modes of LLM-based entity extraction is hallucinated structure. LLMs can infer entities that are not explicitly present, normalize them incorrectly, or over-generalize across domains.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  In downstream systems such as search indexes, knowledge graphs, analytics, or RAG pipelines, these hallucinated entities silently corrupt data.
</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Failure Mode</th>
<th>What Happens</th>
<th>Downstream Risk</th>
</tr>
<tr>
<td>Hallucinated entity</td>
<td>Entity appears without textual evidence</td>
<td>Polluted index / KG nodes</td>
</tr>
<tr>
<td>Incorrect normalization</td>
<td>Wrong canonical form or mapping</td>
<td>Broken linking &#038; analytics</td>
</tr>
<tr>
<td>Over-generalization</td>
<td>Entities merged across domains</td>
<td>False positives in retrieval</td>
</tr>
</tbody>
</table>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  Deterministic NLP systems tend to fail conservatively. LLMs fail confidently.
</p>
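<p style="font-size: 16px; color: #333333; line-height: 1.6;">A common defence against the first failure mode in the table is an evidence filter: accept only entities whose surface form literally occurs in the source text. A minimal sketch (the candidate list and field names are illustrative):</p>

```python
def ground_entities(candidates, source_text):
    """Keep only candidate entities whose surface form literally occurs
    in the source text; anything else is treated as hallucinated."""
    grounded, rejected = [], []
    for ent in candidates:
        if ent["text"] in source_text:
            grounded.append(ent)
        else:
            rejected.append(ent)
    return grounded, rejected

source = "Acme Corp must comply with Regulation (EU) 2016/679."
candidates = [
    {"text": "Acme Corp", "type": "COMPANY"},
    {"text": "Acme Corporation GmbH", "type": "COMPANY"},  # invented surface form
]
grounded, rejected = ground_entities(candidates, source)
print([e["text"] for e in grounded])  # ['Acme Corp']
print([e["text"] for e in rejected])  # ['Acme Corporation GmbH']
```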
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">Why LLMs Are a Poor Fit for High-Volume Entity Extraction at Scale</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  Entity extraction workloads are typically high-volume, low-latency, and CPU-friendly. Using LLMs for large-scale extraction introduces GPU dependency, variable latency, and unpredictable operational costs.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  This cost structure does not make sense when deterministic NLP systems can perform the same task faster, cheaper, and with zero variance.
</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Operational Dimension</th>
<th>Deterministic NLP</th>
<th>LLM-Based Extraction</th>
</tr>
<tr>
<td>Latency</td>
<td>Predictable</td>
<td>Variable</td>
</tr>
<tr>
<td>Cost</td>
<td>Stable, CPU-efficient</td>
<td>Unpredictable, often GPU-bound</td>
</tr>
<tr>
<td>Scaling</td>
<td>Linear &#038; controllable</td>
<td>Operationally complex</td>
</tr>
<tr>
<td>Variance</td>
<td>Zero</td>
<td>Non-zero</td>
</tr>
</tbody>
</table>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">When LLMs Do Make Sense in Enterprise Architectures</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  LLMs are extremely effective after entity extraction, not instead of it.
</p>
<ul style="font-size: 16px; color: #333333; line-height: 1.6; padding-left: 20px;">
<li><strong>Search platforms:</strong> deterministic NLP should extract and normalize entities before indexing. LLMs can then generate summaries, explanations, or conversational answers over clean, structured data.</li>
<li><strong>RAG systems:</strong> deterministic extraction ensures stable entities and metadata for retrieval. LLMs can reason over that context without inventing structure.</li>
<li><strong>Compliance and regulatory monitoring:</strong> deterministic NLP guarantees that organizations, legal references, and domain terms are always captured. LLMs can then explain changes or summarize impact.</li>
<li><strong>Analytics and knowledge graphs:</strong> deterministic extraction ensures consistent nodes and relationships. LLMs can sit on top as an insight or exploration layer, not as the source of truth.</li>
</ul>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">The Right Architecture: Deterministic NLP First, LLMs on Top</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  The most robust enterprise architectures separate concerns clearly. Deterministic NLP is responsible for structure, normalization, and linguistic guarantees. LLMs are responsible for reasoning, synthesis, and interaction.
</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Layer</th>
<th>Responsibility</th>
<th>Guarantee</th>
</tr>
<tr>
<td>Deterministic NLP</td>
<td>Structure, normalization, extraction</td>
<td>Stable, repeatable outputs</td>
</tr>
<tr>
<td>LLMs</td>
<td>Reasoning, synthesis, interaction</td>
<td>Helpful language generation</td>
</tr>
<tr>
<td>Rule of thumb</td>
<td>Consume structure</td>
<td>Do not invent structure</td>
</tr>
</tbody>
</table>
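<p style="font-size: 16px; color: #333333; line-height: 1.6;">The table's division of labour can be sketched as a two-layer pipeline. In this sketch, <code>answer_with_llm</code> is only a placeholder for a real generation call; the key property is that it consumes extracted structure and never produces it:</p>

```python
# Illustrative gazetteer; entries are not a real rule set.
GAZETTEER = {"Acme Corp": "COMPANY", "GDPR": "REGULATION"}

def deterministic_extract(text):
    """Layer 1: structure. Stable, repeatable, rule-based extraction."""
    return [{"text": k, "type": v} for k, v in GAZETTEER.items() if k in text]

def answer_with_llm(question, entities):
    """Layer 2: language. Placeholder for a real LLM call; it consumes
    the extracted structure but is never allowed to invent it."""
    labels = ", ".join(f"{e['text']} ({e['type']})" for e in entities)
    return f"Answer to {question!r} grounded in entities: {labels}"

doc = "Acme Corp must comply with GDPR."
entities = deterministic_extract(doc)  # the source of truth
print(answer_with_llm("Who must comply?", entities))
```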
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">Enterprise-Grade Entity Extraction Requires Determinism</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  LLMs are extraordinary tools, but they are not universal ones. If your system must be predictable, auditable, and stable over time, entity extraction should remain deterministic.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  That is how enterprise-grade systems stay reliable as they scale.
</p></div>
</div>
			</div>
				
				
				
				
			</div>
				
				
			</div>
<p>The post <a href="https://www.bitext.com/blog/why-llms-are-the-wrong-tool-for-enterprise-grade-entity-extraction/">Why LLMs Are the Wrong Tool for Enterprise-Grade Entity Extraction</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>German &#038; Korean Retrieval Fails Without Proper Decompounding</title>
		<link>https://www.bitext.com/blog/german-korean-retrieval-fails-without-proper-decompounding/</link>
		
		<dc:creator><![CDATA[admin]]></dc:creator>
		<pubDate>Mon, 08 Dec 2025 15:27:25 +0000</pubDate>
				<category><![CDATA[AI]]></category>
		<category><![CDATA[Generative AI]]></category>
		<category><![CDATA[knowledge graph]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[NER]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[Entity extraction]]></category>
		<guid isPermaLink="false">https://www.bitext.com/?p=44162</guid>

					<description><![CDATA[<p>German and Korean do not break retrieval because they are unusually complex; they break retrieval because most systems still treat complex words as monolithic strings. When compounds and eojeols remain opaque, search engines cannot align queries with documents—even when they contain the same meaning. Any team building multilingual search, vector search or RAG must incorporate reliable decompounding as a foundational step to avoid systematic retrieval failures.</p>
<p>The post <a href="https://www.bitext.com/blog/german-korean-retrieval-fails-without-proper-decompounding/">German &#038; Korean Retrieval Fails Without Proper Decompounding</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></description>
										<content:encoded><![CDATA[<div class="et_pb_section et_pb_section_3 et_pb_with_background et_section_regular" >
				
				
				
				
				
				
				<div class="et_pb_row et_pb_row_3">
				<div class="et_pb_column et_pb_column_4_4 et_pb_column_3  et_pb_css_mix_blend_mode_passthrough et-last-child">
				
				
				
				
				<div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_42  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
<div class="et_pb_text_inner"><style>
  /* Bitext table styling */
  table.bitext-table {
    width:100%;
    border-collapse:collapse;
    font-size:15px;
    margin:10px 0 22px;
  }
  table.bitext-table th {
    background-color:#b71c1c !important;  /* Bitext red */
    color:#ffffff !important;
    padding:8px 10px;
    border:1px solid #9c1515;
    text-align:left;
  }
  table.bitext-table td {
    padding:8px 10px;
    border:1px solid #e0e0e0;
    color:#333333;
  }
  table.bitext-table tr:nth-child(even) td {
    background-color:#fafafa;
  }
</style>
<h2 style="font-size: 26px; color: #333333; margin-bottom: 20px; font-weight: bold;">Why decompounding is a non-optional requirement for e-commerce search, vector search, and RAG</h2>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Search systems that work well in English, Spanish or French often collapse when they encounter German compounds or Korean eojeols. The issue is not ranking quality, not embedding quality, and not a lack of training data. The root cause is more fundamental: these languages pack meaning into compounds, and handling them requires tokenization, morphological analysis / lemmatization, and linking elements (German Fugenelemente). When a search or retrieval engine cannot see the internal structure of a word, it cannot align user queries with documents that contain the exact same meaning.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Below are examples where the query and the product documentation contain the same lexemes and the same intent; the only difference is the morphological form. Yet without decompounding, retrieval fails.</p>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h2 style="font-size: 24px; color: #424242; margin: 24px 0 12px; font-weight: bold;">German — Pure Decompounding Failures</h2>
<p><!-- 1 --></p>
<h3 style="font-size: 18px; color: #333333; font-weight: bold; margin-top: 18px; margin-bottom: 6px;">1. Query: Wasch Maschine Filter</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin: 0 0 6px 0;">Same lexemes and identical meaning, yet invisible without segmentation.</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Type</th>
<th>Value</th>
</tr>
<tr>
<td>Query</td>
<td>Wasch Maschine Filter</td>
</tr>
<tr>
<td>Product</td>
<td>Waschmaschinenfilter</td>
</tr>
<tr>
<td>Translation</td>
<td>“washing machine filter”</td>
</tr>
</tbody>
</table>
<p><!-- 2 --></p>
<h3 style="font-size: 18px; color: #333333; font-weight: bold; margin-top: 18px; margin-bottom: 6px;">2. Query: Staub Sauger Beutel</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin: 0 0 6px 0;">Users type separated words; systems that do not split the compound fail to match.</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Type</th>
<th>Value</th>
</tr>
<tr>
<td>Query</td>
<td>Staub Sauger Beutel</td>
</tr>
<tr>
<td>Product</td>
<td>Staubsaugerbeutel</td>
</tr>
<tr>
<td>Translation</td>
<td>“vacuum cleaner bag”</td>
</tr>
</tbody>
</table>
<p><!-- 3 --></p>
<h3 style="font-size: 18px; color: #333333; font-weight: bold; margin-top: 18px; margin-bottom: 6px;">3. Query: Kinder Wagen Zubehör</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin: 0 0 6px 0;">Separated input does not align with the glued compound form.</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Type</th>
<th>Value</th>
</tr>
<tr>
<td>Query</td>
<td>Kinder Wagen Zubehör</td>
</tr>
<tr>
<td>Product</td>
<td>Kinderwagenzubehör</td>
</tr>
<tr>
<td>Translation</td>
<td>“stroller accessories”</td>
</tr>
</tbody>
</table>
<p><!-- 4 --></p>
<h3 style="font-size: 18px; color: #333333; font-weight: bold; margin-top: 18px; margin-bottom: 6px;">4. Query: Tisch Lampe Schirm</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin: 0 0 6px 0;">Unless the engine identifies Tisch + Lampe(n) + Schirm, it cannot retrieve the item.</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Type</th>
<th>Value</th>
</tr>
<tr>
<td>Query</td>
<td>Tisch Lampe Schirm</td>
</tr>
<tr>
<td>Product</td>
<td>Tischlampenschirm</td>
</tr>
<tr>
<td>Translation</td>
<td>“table lamp shade”</td>
</tr>
</tbody>
</table>
<p><!-- 5 --></p>
<h3 style="font-size: 18px; color: #333333; font-weight: bold; margin-top: 18px; margin-bottom: 6px;">5. Query: Schnee Schuh Herren</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin: 0 0 6px 0;">Both sides refer to men’s snowshoes; the retrieval failure is purely morphological.</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Type</th>
<th>Value</th>
</tr>
<tr>
<td>Query</td>
<td>Schnee Schuh Herren</td>
</tr>
<tr>
<td>Product</td>
<td>Schneeschuhherren</td>
</tr>
<tr>
<td>Translation</td>
<td>“men’s snowshoes”</td>
</tr>
</tbody>
</table>
<p><!-- 6 --></p>
<h3 style="font-size: 18px; color: #333333; font-weight: bold; margin-top: 18px; margin-bottom: 6px;">6. Query: Bett Decke Bezug</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin: 0 0 6px 0;">A common pattern in German catalogues and enterprise documents.</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Type</th>
<th>Value</th>
</tr>
<tr>
<td>Query</td>
<td>Bett Decke Bezug</td>
</tr>
<tr>
<td>Product</td>
<td>Bettdeckenbezug</td>
</tr>
<tr>
<td>Translation</td>
<td>“bed duvet cover” / “bed comforter cover”</td>
</tr>
</tbody>
</table>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h2 style="font-size: 24px; color: #424242; margin: 24px 0 12px; font-weight: bold;">Korean — Pure Eojeol Segmentation Failures</h2>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Korean packs multiple morphemes into a single orthographic unit. If the system cannot segment the eojeol, retrieval breaks for both keyword and vector search, even when the meaning is identical.</p>
<p><!-- KR 1 --></p>
<h3 style="font-size: 18px; color: #333333; font-weight: bold; margin-top: 18px; margin-bottom: 6px;">1. Query: 세탁기 필터</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin: 0 0 6px 0;">Exact same lexemes; retrieval fails without splitting.</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Type</th>
<th>Value</th>
</tr>
<tr>
<td>Query</td>
<td>세탁기 필터</td>
</tr>
<tr>
<td>Product</td>
<td>세탁기필터</td>
</tr>
<tr>
<td>Translation</td>
<td>“washing machine filter”</td>
</tr>
</tbody>
</table>
<p><!-- KR 2 --></p>
<h3 style="font-size: 18px; color: #333333; font-weight: bold; margin-top: 18px; margin-bottom: 6px;">2. Query: 가습기 물통</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin: 0 0 6px 0;">The terms exist inside the eojeol but remain unreachable.</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Type</th>
<th>Value</th>
</tr>
<tr>
<td>Query</td>
<td>가습기 물통</td>
</tr>
<tr>
<td>Product</td>
<td>가습기물통</td>
</tr>
<tr>
<td>Translation</td>
<td>“humidifier water tank”</td>
</tr>
</tbody>
</table>
<p><!-- KR 3 --></p>
<h3 style="font-size: 18px; color: #333333; font-weight: bold; margin-top: 18px; margin-bottom: 6px;">3. Query: 블루투스 헤드폰</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin: 0 0 6px 0;">Without segmentation, it is treated as a single opaque token.</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Type</th>
<th>Value</th>
</tr>
<tr>
<td>Query</td>
<td>블루투스 헤드폰</td>
</tr>
<tr>
<td>Product</td>
<td>블루투스헤드폰</td>
</tr>
<tr>
<td>Translation</td>
<td>“Bluetooth headphones”</td>
</tr>
</tbody>
</table>
<p><!-- KR 4 --></p>
<h3 style="font-size: 18px; color: #333333; font-weight: bold; margin-top: 18px; margin-bottom: 6px;">4. Query: 기차표 가격</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin: 0 0 6px 0;">Even simple combinations cannot match unless morphemes are exposed.</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Type</th>
<th>Value</th>
</tr>
<tr>
<td>Query</td>
<td>기차표 가격</td>
</tr>
<tr>
<td>Product</td>
<td>기차표가격</td>
</tr>
<tr>
<td>Translation</td>
<td>“train ticket price”</td>
</tr>
</tbody>
</table>
<p><!-- KR 5 --></p>
<h3 style="font-size: 18px; color: #333333; font-weight: bold; margin-top: 18px; margin-bottom: 6px;">5. Query: 도어 손잡이</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin: 0 0 6px 0;">The user’s intention is present but hidden inside the long unit.</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Type</th>
<th>Value</th>
</tr>
<tr>
<td>Query</td>
<td>도어 손잡이</td>
</tr>
<tr>
<td>Product</td>
<td>도어손잡이</td>
</tr>
<tr>
<td>Translation</td>
<td>“door handle”</td>
</tr>
</tbody>
</table>
<p><!-- KR 6 --></p>
<h3 style="font-size: 18px; color: #333333; font-weight: bold; margin-top: 18px; margin-bottom: 6px;">6. Query: 휴대폰 케이스</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin: 0 0 6px 0;">A recurring cause of low recall in Korean e-commerce search.</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Type</th>
<th>Value</th>
</tr>
<tr>
<td>Query</td>
<td>휴대폰 케이스</td>
</tr>
<tr>
<td>Product</td>
<td>휴대폰케이스</td>
</tr>
<tr>
<td>Translation</td>
<td>“mobile phone case” / “cellphone case”</td>
</tr>
</tbody>
</table>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h2 style="font-size: 24px; color: #424242; margin: 24px 0 12px; font-weight: bold;">Why This Breaks Modern Retrieval Pipelines</h2>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Retrieval depends on aligning user input with textual content. Without decompounding, this alignment cannot happen.</p>
<ul style="font-size: 16px; color: #333333; line-height: 1.6; padding-left: 20px;">
<li><strong>Keyword Search:</strong> Split queries never match unsegmented compounds.</li>
<li><strong>Vector Search / Embeddings:</strong> Long compounds become single opaque tokens, harming embedding quality and preventing semantic alignment.</li>
<li><strong>RAG Pipelines:</strong> Relevant chunks are not retrieved, which leads to incomplete context and weaker answers.</li>
<li><strong>LLM Interpretation:</strong> When the model receives unsegmented tokens, internal semantic structure is lost.</li>
</ul>
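<p style="font-size: 16px; color: #333333; line-height: 1.6;">The keyword-search failure in the first bullet is easy to reproduce: with plain whitespace tokenization, the split query and the compound share no tokens at all. A small demonstration (note that decompounding the document side already recovers a shared content token; full alignment also needs query-side analysis):</p>

```python
def tokens(text):
    # Plain whitespace tokenization, as in a naive keyword index.
    return set(text.lower().split())

query = "Wasch Maschine Filter"
document = "Waschmaschinenfilter"

# No shared tokens, so the document is never retrieved.
print(tokens(query) & tokens(document))  # set()

# After decompounding the document side, a content token matches:
decompounded = "Waschmaschine Filter"
print(tokens(query) & tokens(decompounded))  # {'filter'}
```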
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Business Impact: In e-commerce, products remain hidden, recall drops, and conversion decreases. In enterprise search and RAG, relevant documents remain undiscovered, reducing accuracy and productivity.</p>
<h2 style="font-size: 24px; color: #424242; margin: 24px 0 12px; font-weight: bold;">A Practical Note on Decompounding</h2>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Any multilingual search or RAG system operating in German or Korean requires deterministic, high-accuracy decompounding. This is not a feature to add later; it is a foundational preprocessing layer. A proper decompounder should reliably segment forms such as:</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Original</th>
<th>Segmented</th>
<th>Translation</th>
</tr>
<tr>
<td>Waschmaschinenfilter</td>
<td>Waschmaschine Filter</td>
<td>Waschmaschine = “washing machine”<br />Filter = “filter”</td>
</tr>
<tr>
<td>Staubsaugerbeutel</td>
<td>Staubsauger Beutel</td>
<td>Staubsauger = “vacuum cleaner” Beutel = “bag”</td>
</tr>
<tr>
<td>세탁기필터</td>
<td>세탁기 필터</td>
<td>세탁기 = “washing machine” 필터 = “filter”</td>
</tr>
<tr>
<td>휴대폰케이스</td>
<td>휴대폰 케이스</td>
<td>휴대폰 = “mobile phone / cellphone” 케이스 = “case”</td>
</tr>
</tbody>
</table>
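<p style="font-size: 16px; color: #333333; line-height: 1.6;">The segmentation behaviour shown in the table can be sketched as a greedy longest-match split against a vocabulary. This toy version handles only the single German linking element <em>-n-</em> and uses a hand-picked vocabulary taken from the examples above; a production decompounder needs a full morphological lexicon:</p>

```python
# Toy decompounder: greedy longest-prefix split against a hand-picked
# vocabulary. Purely illustrative; a production system needs a full
# morphological lexicon and proper handling of all linking elements.
VOCAB = {"waschmaschine", "filter", "staubsauger", "beutel",
         "세탁기", "필터", "휴대폰", "케이스"}

def decompound(word, vocab=VOCAB):
    w = word.lower()

    def split(i):
        if i == len(w):
            return []
        # Try the longest vocabulary prefix first.
        for j in range(len(w), i, -1):
            if w[i:j] in vocab:
                rest = split(j)
                if rest is not None:
                    return [w[i:j]] + rest
                # Absorb a single German linking element -n- (Fugenelement).
                if j < len(w) and w[j] == "n":
                    rest = split(j + 1)
                    if rest is not None:
                        return [w[i:j]] + rest
        return None

    return split(0) or [w]

print(decompound("Waschmaschinenfilter"))  # ['waschmaschine', 'filter']
print(decompound("세탁기필터"))            # ['세탁기', '필터']
```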
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Segmented text leads to higher recall, more meaningful embeddings, more stable keyword and vector retrieval, and RAG systems that actually surface the right passages.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin-top: 18px;">Additionally, compounding is not limited to German and Korean; many other languages show compounding and related phenomena such as agglutination: Dutch, Swedish, Norwegian Bokmål / Nynorsk, Danish, Finnish, Russian, Ukrainian, Hungarian, Turkish, Estonian, Latvian, Lithuanian and Czech, among others.</p>
<h2 style="font-size: 24px; color: #424242; margin: 24px 0 12px; font-weight: bold;">Conclusion</h2>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">German and Korean do not break retrieval because they are unusually complex; they break retrieval because most systems still treat complex words as monolithic strings. When compounds and eojeols remain opaque, search engines cannot align queries with documents—even when they contain the same meaning. Any team building multilingual search, vector search or RAG must incorporate reliable decompounding as a foundational step to avoid systematic retrieval failures.</p></div>
</div>
			</div>
				
				
				
				
			</div>
				
				
			</div>
<p>The post <a href="https://www.bitext.com/blog/german-korean-retrieval-fails-without-proper-decompounding/">German &#038; Korean Retrieval Fails Without Proper Decompounding</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Lemmatization vs Stemming</title>
		<link>https://www.bitext.com/blog/lemmatization-vs-stemming/</link>
					<comments>https://www.bitext.com/blog/lemmatization-vs-stemming/#respond</comments>
		
		<dc:creator><![CDATA[admin]]></dc:creator>
		<pubDate>Mon, 17 Nov 2025 00:06:30 +0000</pubDate>
				<category><![CDATA[AI]]></category>
		<category><![CDATA[Lemmatization]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[text analysis]]></category>
		<category><![CDATA[artificial training data]]></category>
		<category><![CDATA[synthetic data]]></category>
		<category><![CDATA[synthetic text]]></category>
		<category><![CDATA[synthetic training data]]></category>
		<guid isPermaLink="false">https://www.bitext.com/?p=39051</guid>

					<description><![CDATA[<p>Almost all of us use a search engine in our daily working routine; it has become a key tool to get our tasks done.</p>
<p>The post <a href="https://www.bitext.com/blog/lemmatization-vs-stemming/">Lemmatization vs Stemming</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></description>
										<content:encoded><![CDATA[<div class="et_pb_section et_pb_section_4 et_pb_with_background et_section_regular" >
				
				
				
				
				
				
				<div class="et_pb_row et_pb_row_4">
				<div class="et_pb_column et_pb_column_4_4 et_pb_column_4  et_pb_css_mix_blend_mode_passthrough et-last-child">
				
				
				
				
				<div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_56  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
<div class="et_pb_text_inner"><style>
  .bitext-example-box {
    background: #fff5f5;
    border-left: 4px solid #b71c1c;
    padding: 14px 16px;
    margin: 14px 0 22px;
    border-radius: 6px;
  }
  .bitext-example-box p {
    margin: 0 0 10px;
    font-size: 16px;
    color: #333333;
    line-height: 1.6;
  }
  .bitext-example-box p:last-child {
    margin-bottom: 0;
  }
  .bitext-highlight {
    display: inline-block;
    background: #fdeaea;
    color: #b71c1c;
    font-weight: 700;
    padding: 2px 6px;
    border-radius: 4px;
  }
  .bitext-benefits {
    background: #fafafa;
    border: 1px solid #e6e6e6;
    padding: 14px 16px;
    margin: 18px 0 22px;
    border-radius: 6px;
  }
  .bitext-benefits ul {
    margin: 0;
    padding-left: 20px;
  }
  .bitext-benefits li {
    margin: 6px 0;
    font-size: 16px;
    color: #333333;
    line-height: 1.6;
  }
</style>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  Almost all of us use a search engine in our daily work. It has become a key tool to get things done.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  However, as the amount of data grows exponentially, providing high-quality results that truly match user queries becomes more complex.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  One of the issues that complicates this process is <strong>ambiguous words</strong>.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  These are terms that have different meanings depending on their role in the sentence.
</p>
<div class="bitext-example-box">
<p><strong>Example:</strong></p>
<p><span class="bitext-highlight">“Let’s take a five-minute break in this meeting.”</span></p>
<p><span class="bitext-highlight">“This vase made of glass can break easily.”</span></p>
<p>
    In both sentences we use <span class="bitext-highlight">“break”</span>, but with different meanings:<br />
    as a noun in the first case, and as a verb in the second.
  </p>
</div>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  When working with large datasets, this ambiguity introduces noise. Search results may include documents that match the same word form, but not the intended meaning.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  Some results are relevant, but many are not. This noise slows down the user and reduces search precision.
</p>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">
  Why ambiguity gets worse in multilingual environments<br />
</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  Ambiguity may not be the biggest issue in English, but it becomes much more critical in highly inflected languages such as <strong>French, Spanish or Polish</strong>.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  These languages rely heavily on:
</p>
<ul style="font-size: 16px; color: #333333; line-height: 1.6; padding-left: 20px;">
<li>declensions</li>
<li>adjective and noun inflections</li>
<li>pronoun variations</li>
</ul>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  This makes normalization much more complex and much more important.
</p>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">
  How normalization affects search<br />
</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  When a user enters a query, the system must normalize both the query and the indexed data so they can match correctly.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  There are two main approaches:
</p>
<div class="bitext-example-box">
<p><strong>Lemmatization</strong></p>
<p>Maps a word to its correct dictionary form based on its usage and context.</p>
<p><strong>Stemming</strong></p>
<p>Removes characters from the end of a word using predefined rules, without understanding context.</p>
</div>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  In weakly inflected languages, the choice may not significantly impact results.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  But in highly inflected languages, the normalization method directly determines the accuracy of search results.
</p>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">
  Why lemmatization performs better<br />
</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  The main advantage of lemmatization is that it takes context into account to determine the intended meaning of a word.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  This reduces ambiguity and significantly decreases noise in search results.
</p>
<div class="bitext-benefits">
<ul>
<li>more precise matching</li>
<li>less noise in results</li>
<li>better handling of ambiguity</li>
<li>faster and more efficient user experience</li>
</ul>
</div>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  In practice, when dealing with ambiguous words, stemming often produces the same root for different meanings, while lemmatization preserves the distinction between them.
</p>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">
  In summary<br />
</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  Ambiguity is a fundamental challenge in search, especially in multilingual and highly inflected environments.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  Choosing the right normalization strategy makes a significant difference in the quality of the results.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  And in many cases, improving normalization upstream is the simplest way to improve search performance overall.
</p></div>
			</div>
			</div>
				
				
				
				
			</div>
				
				
			</div>
<p>The post <a href="https://www.bitext.com/blog/lemmatization-vs-stemming/">Lemmatization vs Stemming</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.bitext.com/blog/lemmatization-vs-stemming/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>The Moment to Pay Attention to Hybrid NLP (Symbolic + ML)</title>
		<link>https://www.bitext.com/blog/the-moment-to-pay-attention-to-hybrid-nlp-symbolic-ml/</link>
		
		<dc:creator><![CDATA[admin]]></dc:creator>
		<pubDate>Fri, 07 Nov 2025 19:29:45 +0000</pubDate>
				<category><![CDATA[AI]]></category>
		<category><![CDATA[Generative AI]]></category>
		<category><![CDATA[knowledge graph]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[NER]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[Entity extraction]]></category>
		<guid isPermaLink="false">https://www.bitext.com/?p=44142</guid>

					<description><![CDATA[<p>Problem. There’s broad consensus today: LLMs are phenomenal personal productivity tools — they draft, summarize, and assist effortlessly.<br />
But there’s also growing recognition that they’re still not ready for enterprise-grade deployment.</p>
<p>The post <a href="https://www.bitext.com/blog/the-moment-to-pay-attention-to-hybrid-nlp-symbolic-ml/">The Moment to Pay Attention to Hybrid NLP (Symbolic + ML)</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></description>
										<content:encoded><![CDATA[<div class="et_pb_section et_pb_section_5 et_pb_with_background et_section_regular" >
				
				
				
				
				
				
				<div class="et_pb_row et_pb_row_5">
				<div class="et_pb_column et_pb_column_4_4 et_pb_column_5  et_pb_css_mix_blend_mode_passthrough et-last-child">
				
				
				
				
				<div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_57  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p>Problem. There’s broad consensus today: LLMs are phenomenal personal productivity tools — they draft, summarize, and assist effortlessly.<br />But there’s also growing recognition that they’re still not ready for enterprise-grade deployment.</p>
<p>Why? Because enterprises need more than good prose. They need structured, reliable, explainable data — not probabilistic text. An LLM that hallucinates a CEO name or mislabels a supplier can break compliance, contracts, and trust.</p>
<p>Solution. The way forward is to extract key data and structure it as Knowledge Graphs (KGs). These graphs become the <em>backbone knowledge</em> that LLMs can safely reason over — grounding their outputs in verified, linked data.</p>
<p>This architectural shift is emerging under the GraphRAG and NodeRAG paradigms:</p>
<ul>
<li>GraphRAG: retrieval-augmented generation where context comes from <em>relationships</em> between entities in a graph (not flat embeddings).</li>
<li>NodeRAG: fine-grained RAG where specific <em>nodes</em> and their properties are retrieved as context for the model.</li>
</ul>
<p><strong>Example:</strong></p>
<blockquote style="border-left: 4px solid #888; padding-left: 15px; margin-left: 0;">
<p>Instead of asking an LLM “Who supplies lithium to Tesla?” and hoping it guesses right, a GraphRAG pipeline retrieves verified entities and relations:</p>
<p>Tesla —[supplier]→ Albemarle Corporation —[product]→ Lithium hydroxide</p>
</blockquote>
<p>The LLM then uses this context to generate a grounded, auditable response.</p>
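<p>This retrieval step can be sketched as a traversal over stored triples; the graph content and the hop logic below are illustrative assumptions, not a real GraphRAG implementation:</p>

```python
# Minimal sketch of graph-based retrieval for RAG. The triples below are
# illustrative sample data, not verified facts.
TRIPLES = [
    ("Tesla", "supplier", "Albemarle Corporation"),
    ("Albemarle Corporation", "product", "Lithium hydroxide"),
    ("Tesla", "headquarters", "Austin"),
]

def retrieve_subgraph(entity, triples, hops=2):
    """Collect triples reachable from `entity` within `hops` steps."""
    frontier, found = {entity}, []
    for _ in range(hops):
        next_frontier = set()
        for s, p, o in triples:
            if s in frontier and (s, p, o) not in found:
                found.append((s, p, o))
                next_frontier.add(o)
        frontier = next_frontier
    return found

# Turn the retrieved structure into grounded context for the LLM prompt.
context = "\n".join(
    f"{s} -[{p}]-> {o}" for s, p, o in retrieve_subgraph("Tesla", TRIPLES)
)
print(context)
```

<p>The point of the sketch: the context handed to the model is a set of explicit, auditable edges rather than whatever the model happens to remember.</p>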
<p>Challenge. Building these knowledge graphs manually is impossible at enterprise scale.<br />To populate them, we need (semi-)automated extraction pipelines that are:</p>
<ul>
<li>Accurate — 90%+ precision/recall for entity and relation detection,</li>
<li>Performant — capable of processing millions of documents per day,</li>
<li>Ubiquitous — deployable on-prem, in cloud, or hybrid setups,</li>
<li>Portable — running equally well on Windows, Linux, and ARM environments.</li>
</ul>
<p>Current LLMs can’t meet these constraints. They are resource-hungry, unpredictable, and non-deterministic. Enterprise knowledge graphs need precision and reproducibility, not probabilistic outputs.</p>
<p>That’s where Symbolic NLP — combined with efficient ML components — steps in. Rule-based and morphology-aware engines can deterministically extract entities, relations, and attributes, feeding clean data into a knowledge graph layer.</p>
<p><strong>Example:</strong></p>
<blockquote style="border-left: 4px solid #888; padding-left: 15px; margin-left: 0;">
<p>Symbolic NLP can reliably parse <em>“Generalversammlung der Vereinten Nationen”</em> as Organization: United Nations General Assembly, recognizing inflection and structure without hallucination. An LLM might miss that entirely or translate it inconsistently.</p>
</blockquote>
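<p>A deterministic, lexicon-driven extractor can be sketched as longest-match lookup; the lexicon entries and their inflected variants below are assumptions for illustration:</p>

```python
import re

# Sketch of deterministic, lexicon-driven entity extraction.
# The lexicon entries (including inflected variants) are illustrative only.
LEXICON = {
    "Generalversammlung der Vereinten Nationen":
        ("ORGANIZATION", "United Nations General Assembly"),
    "Vereinte Nationen": ("ORGANIZATION", "United Nations"),
    "Vereinten Nationen": ("ORGANIZATION", "United Nations"),  # inflected form
}

def extract_entities(text):
    """Longest-match lookup: the same input always yields the same output."""
    results = []
    # Try longer lexicon keys first so the full name wins over its parts.
    for surface in sorted(LEXICON, key=len, reverse=True):
        for m in re.finditer(re.escape(surface), text):
            span = (m.start(), m.end())
            if not any(s < span[1] and span[0] < e for s, e, *_ in results):
                results.append((span[0], span[1], *LEXICON[surface]))
    return sorted(results)

text = "Die Generalversammlung der Vereinten Nationen tagt in New York."
print(extract_entities(text))
```

<p>Because the extraction is a pure function of the rules and the input, results are reproducible across runs and environments, which is exactly what a knowledge-graph pipeline needs.</p>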
<p>Even Microsoft acknowledges this reality in their internal taxonomy of retrieval architectures. They now distinguish between:</p>
<ul>
<li>Standard GraphRAG — LLM-driven pipelines, flexible but slow and opaque;</li>
<li>FastGraphRAG — deterministic and efficient symbolic/ML pipelines that pre-compute structure for high throughput. <a href="https://microsoft.github.io/graphrag/index/methods/">Microsoft FastGraphRAG reference</a></li>
</ul>
<p>The trend is clear: the future of enterprise AI lies in combining symbolic precision with generative flexibility.</p></div>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_58  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p>Bitext is releasing a new suite of Symbolic NLP engines designed for this hybrid AI architecture:</p>
<ul>
<li>Speed: 3.2 MB of plain text per second on an 8-core CPU — no GPU needed.</li>
<li>Accuracy: Over 90% F1 measured on standard multilingual benchmark corpora.</li>
<li>Compatibility: Runs on Windows, Linux, and ARM; deployable locally or in cloud pipelines.</li>
</ul>
<p>Conclusion. The industry is shifting from “prompting models” to building structured knowledge backbones.<br />Symbolic NLP isn’t old-school anymore — it’s the precision machinery that makes enterprise AI trustworthy, explainable, and scalable.</p>
<p>Now is the moment to pay attention to NLP.</p></div>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_59  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_60 landing-page-list  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_61 landing-page-list  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_62  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner">&nbsp;</p>
<p>&nbsp;</div>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_63  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_image et_pb_image_12">
				
				
				
				
				<span class="et_pb_image_wrap "></span>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_64  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_65  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_image et_pb_image_13">
				
				
				
				
				<span class="et_pb_image_wrap "></span>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_66  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_67  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_image et_pb_image_14">
				
				
				
				
				<span class="et_pb_image_wrap "></span>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_68  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_69  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_70  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_code et_pb_code_4">
				
				
				
				
				
			</div>
			</div>
				
				
				
				
			</div>
				
				
			</div>
<p>The post <a href="https://www.bitext.com/blog/the-moment-to-pay-attention-to-hybrid-nlp-symbolic-ml/">The Moment to Pay Attention to Hybrid NLP (Symbolic + ML)</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Using Public Corpora to Build Your NER systems</title>
		<link>https://www.bitext.com/blog/using-public-corpora-to-build-your-ner-systems-post/</link>
		
		<dc:creator><![CDATA[admin]]></dc:creator>
		<pubDate>Mon, 20 Oct 2025 08:06:57 +0000</pubDate>
				<category><![CDATA[AI]]></category>
		<category><![CDATA[Generative AI]]></category>
		<category><![CDATA[knowledge graph]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[NER]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[Entity extraction]]></category>
		<guid isPermaLink="false">https://www.bitext.com/?p=43998</guid>

					<description><![CDATA[<p>Rationale. NER tools are at the heart of how the scientific community is solving LLM issues using GraphRAG and NodeRAG architectures.</p>
<p>LLMs need knowledge graphs to control hallucinations and make them more solid for enterprise-level use.</p>
<p>And knowledge graphs are built using automatic data extraction tools: not only entity extraction but also concept extraction and relationships among entities or concepts.</p>
<p>The post <a href="https://www.bitext.com/blog/using-public-corpora-to-build-your-ner-systems-post/">Using Public Corpora to Build Your NER systems</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></description>
										<content:encoded><![CDATA[<div class="et_pb_section et_pb_section_6 et_pb_with_background et_section_regular" >
				
				
				
				
				
				
				<div class="et_pb_row et_pb_row_6">
				<div class="et_pb_column et_pb_column_4_4 et_pb_column_6  et_pb_css_mix_blend_mode_passthrough et-last-child">
				
				
				
				
				<div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_71  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><div style="font-family: Verdana, Geneva, sans-serif; font-size: 19px; line-height: 1.7; color: #555; font-weight: 400;">
<p><strong>Rationale.</strong> NER tools are at the heart of how the scientific community is solving LLM issues using GraphRAG and NodeRAG architectures.</p>
<p>LLMs need knowledge graphs to control hallucinations and make them more solid for enterprise-level use.</p>
<p>And knowledge graphs are built using automatic data extraction tools: not only entity extraction but also concept extraction and relationships among entities or concepts.</p>
<p><strong>Open-Source Tools.</strong> When starting an Entity Extraction project, it’s typical to start by leveraging open-source, machine-learning-based tools.</p>
<p>Open-source tools are widespread and adapt to different levels of execution, from POC to production-ready; Hugging Face, Spark NLP and spaCy are typical examples.</p>
<p><strong>Open-Source Data.</strong> These tools rely on third-party datasets for model training and evaluation, typically manually tagged corpora with NER information (Person, Place, Organization, Company…).</p>
<p>Developing new data is expensive and complex, which is why most projects avoid producing their own tagged data.</p>
<p>Therefore, the main alternative to get started is a combination of open-source tools and data. OntoNotes and CoNLL are good examples of this type of dataset for English.</p>
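<p>Working with these corpora typically starts by reading their token-per-line format. A minimal reader for CoNLL-style BIO tags might look like this (the sample sentence is a hand-made illustration in that format):</p>

```python
# Minimal reader for CoNLL-style "token TAB tag" data with BIO labels.
# The sample sentence below is a hand-made illustration.
SAMPLE = """\
U.N.\tB-ORG
official\tO
Ekeus\tB-PER
heads\tO
for\tO
Baghdad\tB-LOC
"""

def read_bio(lines):
    """Turn BIO-tagged tokens into (entity_text, entity_type) spans."""
    spans, current_tokens, current_type = [], [], None
    for line in lines:
        if not line.strip():
            continue
        token, tag = line.split("\t")
        if tag.startswith("B-") or tag == "O":
            if current_tokens:                     # close the open entity
                spans.append((" ".join(current_tokens), current_type))
                current_tokens, current_type = [], None
            if tag.startswith("B-"):
                current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):                 # continue the open entity
            current_tokens.append(token)
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans

print(read_bio(SAMPLE.splitlines()))
# [('U.N.', 'ORG'), ('Ekeus', 'PER'), ('Baghdad', 'LOC')]
```

<p>The same reader works for both training data preparation and for comparing gold spans against system output during evaluation.</p>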
<p><strong>Data is Critical.</strong> These datasets are used for two critical purposes:</p>
<ul style="margin-left: 20px; padding-left: 0;">
<li>for training, i.e. building the core of our NER tool</li>
<li>for evaluation, i.e. determining if our project is a success and can be used in public settings</li>
</ul>
<p><strong>Data as a Black Box?</strong> These datasets are open, meaning anyone can examine the text and the tagging. However, they are often treated as “black boxes”: used to build NER models without much analysis or understanding of their weaknesses and the implications of those weaknesses. (We will not focus on their strengths, which are well known to the community; that is why these datasets are so popular.)</p>
<p>In this series of posts, we are going to try and make those black boxes more transparent, drawing on our experience in using them at Bitext for evaluation purposes.</p>
<p>We will identify areas where the datasets can be improved and will provide some tips on how to avoid these issues, whenever possible with (semi-)automatic techniques.</p>
<p><strong>First, we classify the different types of issues into 3 groups:</strong></p>
<ol style="margin-left: 20px; list-style-position: outside;">
<li><strong>Training issues:</strong> common types of inconsistencies, both in gold (manual) and silver (semi-automatic) datasets — more on this in future posts.</li>
<li><strong>Evaluation:</strong> how misleading it can be to use the same corpus for training and evaluation.</li>
<li><strong>Deployment issues:</strong> licensing has a strong impact when moving from POC to production.</li>
</ol>
</div></div>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_72  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_73  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_74 landing-page-list  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_75 landing-page-list  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p><span data-olk-copy-source="MessageBody">Next Post: <a href="https://www.bitext.com/blog/open-source-data-and-training-issues/">“Open-Source Data and Training Issues”</a></span></p></div>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_76  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner">&nbsp;</p>
<p>&nbsp;</div>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_77  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_image et_pb_image_15">
				
				
				
				
				<span class="et_pb_image_wrap "></span>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_78  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_79  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_image et_pb_image_16">
				
				
				
				
				<span class="et_pb_image_wrap "></span>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_80  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_81  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_image et_pb_image_17">
				
				
				
				
				<span class="et_pb_image_wrap "></span>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_82  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_83  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_84  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_code et_pb_code_5">
				
				
				
				
				
			</div>
			</div>
				
				
				
				
			</div>
				
				
			</div>
<p>The post <a href="https://www.bitext.com/blog/using-public-corpora-to-build-your-ner-systems-post/">Using Public Corpora to Build Your NER systems</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Open-Source Data and Training Issues</title>
		<link>https://www.bitext.com/blog/open-source-data-and-training-issues-post/</link>
		
		<dc:creator><![CDATA[admin]]></dc:creator>
		<pubDate>Mon, 20 Oct 2025 08:04:06 +0000</pubDate>
				<category><![CDATA[AI]]></category>
		<category><![CDATA[Generative AI]]></category>
		<category><![CDATA[knowledge graph]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[NER]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[Entity extraction]]></category>
		<guid isPermaLink="false">https://www.bitext.com/?p=44007</guid>

					<description><![CDATA[<p>As described in our previous post “Using Public Corpora to Build Your NER systems”, we are going to highlight areas where public datasets like OntoNotes or CoNLL can be improved. We will provide some tips on how to avoid these issues, whenever possible, using (semi-)automatic techniques.</p>
<p>Tagging consistency is essential to ensure that training is smooth. Contradictions and inconsistencies not only decrease accuracy but also generate hidden costs in MLOps when trying to debug and fix errors. We often take this consistency for granted, but that is rarely the case, not only in these datasets but also in any other manual tagging work.</p>
<p>Consistency starts with having a solid and clear definition of what an entity is. Typically, if not always, that’s not the case.</p>
<p>And knowledge graphs are built using automatic data extraction tools: not only entity extraction but also concept extraction and relationships among entities or concepts.</p>
<p>The post <a href="https://www.bitext.com/blog/open-source-data-and-training-issues-post/">Open-Source Data and Training Issues</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></description>
										<content:encoded><![CDATA[<div class="et_pb_section et_pb_section_7 et_pb_with_background et_section_regular" >
				
				
				
				
				
				
				<div class="et_pb_row et_pb_row_7">
				<div class="et_pb_column et_pb_column_4_4 et_pb_column_7  et_pb_css_mix_blend_mode_passthrough et-last-child">
				
				
				
				
				<div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_85  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p>As described in our previous post “Using Public Corpora to Build Your NER systems”, we are going to highlight areas where public datasets like <strong>OntoNotes</strong> or <strong>CoNLL</strong> can be improved. We will provide some tips on how to avoid these issues, whenever possible, using (semi-)automatic techniques.</p>
<p>Tagging consistency is essential to ensure that training is smooth. Contradictions and inconsistencies not only decrease accuracy but also generate hidden costs in MLOps when trying to debug and fix errors. We often take this consistency for granted, but that is rarely the case, not only in these datasets but also in any other manual tagging work.</p>
<p>Consistency starts with having a solid and clear definition of what an entity is. Typically, if not always, that’s not the case.</p>
<p><strong>Entities vs Non-Entities.</strong> What’s an entity anyway? The definition of “entity” is a cornerstone for a NER project and should be 100% clear if we are automating the detection of entities, but this is not always the case.</p>
<p>For example, in <strong>WikiNEuRal</strong>, a well-known multilingual set of corpora, entities like “MVP” (Most Valuable Player) or “DJ” (Disc Jockey) are not tagged. In our view, they should be tagged – in this case as <strong>PERSON</strong>:</p>
<p><strong>Example in Spanish tagging:</strong></p>
<blockquote style="border-left: 4px solid #888; padding-left: 15px; margin-left: 0;">
<p><strong>Input Sentence:</strong> “En 1980 y 1983 fue elegido como el MVP en toda Europa” (“In 1980 and 1983 he was chosen as the MVP in all of Europe”)<br />
  <strong>Gold Tagging:</strong> Europa:LOCATION (MVP-missing)</p>
</blockquote>
<p><strong>Example in Portuguese:</strong></p>
<blockquote style="border-left: 4px solid #888; padding-left: 15px; margin-left: 0;">
<p><strong>Input Sentence:</strong> Esse estilo era exclusivamente um fenômeno de Chicago , mas em 1987 virou febre no Reino Unido e na Europa Continental , sendo muito tocado por Djs . (“This style was exclusively a Chicago phenomenon, but in 1987 it became a craze in the United Kingdom and Continental Europe, being played heavily by DJs.”)<br />
  <strong>Gold Tagging:</strong> Chicago:LOCATION Reino Unido:LOCATION Europa Continental:LOCATION (Djs-missing)</p>
</blockquote>
<p>This same problem happens with other corpora, such as the <strong>UNER Swedish PUD</strong> corpus:</p>
<p><strong>Example in Swedish: Entity “Paris Agreement” should be tagged as MISCELLANEOUS</strong></p>
<blockquote style="border-left: 4px solid #888; padding-left: 15px; margin-left: 0;">
<p><strong>Input Sentence:</strong> Det är fantastiskt att de fick Parisavtalet men deras insatser är för tillfället inte i närheten av målet på 1,5 grader. (“It is fantastic that they got the Paris Agreement, but their efforts are currently nowhere near the 1.5-degree target.”)<br />
  <strong>Gold Tagging:</strong> (Parisavtalet-missing)</p>
</blockquote>
<p><strong>Example in Swedish: Entity “Brexit” should be tagged as MISCELLANEOUS</strong></p>
<blockquote style="border-left: 4px solid #888; padding-left: 15px; margin-left: 0;">
<p><strong>Input Sentence:</strong> May har fått stor kritik för att ha undvikit och inte svarat öppet till media efter rättsutlåtandet om Brexit. (“May has received heavy criticism for avoiding and not responding openly to the media after the legal opinion on Brexit.”)<br />
  <strong>Gold Tagging:</strong> May:PERSON (Brexit-missing)</p>
</blockquote>
<p>And similar cases occur across other languages and corpora:</p>
<p><strong>Example in Russian (in WikiNEuRal Russian): “Альмохады” (“Almohads”) not tagged as MISCELLANEOUS</strong></p>
<blockquote style="border-left: 4px solid #888; padding-left: 15px; margin-left: 0;">
<p><strong>Input Sentence:</strong> В 1130-е годы Альмохады расширяли своё влияние в горных областях Марокко , в восточных и южных районах страны . (“In the 1130s the Almohads expanded their influence in the mountainous regions of Morocco, in the eastern and southern parts of the country.”)<br />
  <strong>Gold Tagging:</strong> Марокко:LOCATION (Альмохады-missing)</p>
</blockquote>
<p><strong>Example in Korean (in KLUE): “인권센터는” (“Human Rights Center”) not tagged as ORGANIZATION</strong></p>
<blockquote style="border-left: 4px solid #888; padding-left: 15px; margin-left: 0;">
<p><strong>Input Sentence:</strong> 시 인권센터는 민간조사전문가 1 명을 포함한 사건조사팀을 구성 , 21 일간 신청인과 참고인 , 피신청인 16 명에 대한 진술조사와 현장조사를 한 결과 이같이 판정했다고 30 일 밝혔다 . (“The city Human Rights Center announced on the 30th that it had formed a case investigation team including one private investigation expert and reached this ruling after 21 days of statement and on-site investigations of the applicant, witnesses, and 16 respondents.”)<br />
  <strong>Gold Tagging:</strong> (인권센터는-missing)</p>
</blockquote>
<p>This same problem happens with many other entities, often of type <strong>MISCELLANEOUS</strong>: GDP (Gross Domestic Product), DVD, Blu-ray, VHS… The list is long and not documented in any corpus as far as we know.</p>
<p><strong>A Possible Solution.</strong> For languages that use capitalization (like English, Spanish…), the fix involves a significant amount of work. To detect untagged entities, we need to extract all capitalized strings from the corpus, separate out the ones that carry no label, and check them, either manually (the safest way) or against gazetteers, to shortcut the task. The main complication, though not the only one, is that in many languages the first word of a sentence is always capitalized, even when it is a regular word.</p>
<p>For languages that do not use capital letters (Arabic, Korean, Chinese, Japanese…) the solution is even harder; it would involve checking the corpus without the help of capitalization.</p>
<p>Given that this solution involves significant work, a good shortcut for all languages is to compile a list of the most relevant entities we need to tag and make sure they are tagged in our training corpora. This is not a perfect solution, but at least it ensures that we will not miss the most relevant entities.</p>
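<p>For capitalizing languages, the check described above can be sketched as follows; the sample tokens (adapted from the Spanish example earlier), the gold tags, and the gazetteer are assumptions for illustration:</p>

```python
# Sketch of the check described above: find capitalized tokens that carry
# no gold tag, skipping sentence-initial positions. Sample data is invented.
GAZETTEER = {"MVP", "DJ", "GDP", "Brexit"}

def untagged_capitalized(tokens, gold_tags):
    """Return (index, token) pairs that look like missed entities."""
    suspects = []
    for i, (token, tag) in enumerate(zip(tokens, gold_tags)):
        if i == 0:                      # sentence-initial caps are unreliable
            continue
        if token[0].isupper() and tag == "O":
            suspects.append((i, token))
    return suspects

tokens = ["En", "1983", "fue", "elegido", "como", "el", "MVP", "en", "Europa"]
tags   = ["O",  "O",    "O",   "O",       "O",    "O",  "O",   "O",  "B-LOC"]

for i, tok in untagged_capitalized(tokens, tags):
    flag = "in gazetteer" if tok in GAZETTEER else "needs manual review"
    print(i, tok, "->", flag)          # 6 MVP -> in gazetteer
```

<p>The gazetteer pass resolves the easy cases automatically; whatever remains goes to manual review, which keeps the workload manageable.</p>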
<p>We will review more cases that involve different entity types, ambiguity, lack of criteria…</p></div>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_86  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_87  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_88 landing-page-list  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_89 landing-page-list  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p><span data-olk-copy-source="MessageBody">Previous Post: <a href="https://www.bitext.com/blog/using-public-corpora-to-build-your-ner-systems/">&#8220;Using Public Corpora to Build Your NER systems&#8221;</a></span></p></div>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_90  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner">&nbsp;</p>
<p>&nbsp;</div>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_91  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_image et_pb_image_18">
				
				
				
				
				<span class="et_pb_image_wrap "></span>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_92  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_93  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_image et_pb_image_19">
				
				
				
				
				<span class="et_pb_image_wrap "></span>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_94  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_95  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_image et_pb_image_20">
				
				
				
				
				<span class="et_pb_image_wrap "></span>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_96  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_97  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_98  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_code et_pb_code_6">
				
				
				
				
				
			</div>
			</div>
				
				
				
				
			</div>
				
				
			</div>
<p>The post <a href="https://www.bitext.com/blog/open-source-data-and-training-issues-post/">Open-Source Data and Training Issues</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Why Semantic Intelligence Is the Missing Link in Active Metadata and Data Governance</title>
		<link>https://www.bitext.com/blog/why-semantic-intelligence-is-the-missing-link-in-active-metadata-and-data-governance/</link>
		
		<dc:creator><![CDATA[admin]]></dc:creator>
		<pubDate>Sat, 13 Sep 2025 07:30:33 +0000</pubDate>
				<category><![CDATA[AI]]></category>
		<category><![CDATA[Generative AI]]></category>
		<category><![CDATA[knowledge graph]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[NER]]></category>
		<category><![CDATA[Entity extraction]]></category>
		<guid isPermaLink="false">https://www.bitext.com/?p=43993</guid>

					<description><![CDATA[<p>The new Forrester Wave™: Data Governance Solutions, Q3 2025 makes one thing clear: governance is no longer about static catalogs. Vendors are moving fast into Active Metadata and Agentic AI, with features like lineage, observability, policy enforcement, and marketplaces for data assets.</p>
<p>The post <a href="https://www.bitext.com/blog/why-semantic-intelligence-is-the-missing-link-in-active-metadata-and-data-governance/">Why Semantic Intelligence Is the Missing Link in Active Metadata and Data Governance</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></description>
										<content:encoded><![CDATA[<div class="et_pb_section et_pb_section_8 et_pb_with_background et_section_regular" >
				
				
				
				
				
				
				<div class="et_pb_row et_pb_row_8">
				<div class="et_pb_column et_pb_column_4_4 et_pb_column_8  et_pb_css_mix_blend_mode_passthrough et-last-child">
				
				
				
				
				<div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_99  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p><strong>The Semantic Gap in Today’s Governance Platforms</strong></p>
<p id="ember2207" class="ember-view reader-text-block__paragraph">Forrester’s evaluations show that, despite strong advances in automation and lineage, many platforms underperform on semantic depth.</p>
<ul>
<li>Collibra: strong in workflows and policy management, but AI-driven semantic enforcement is still limited; customers face significant manual work.</li>
<li>Informatica: powerful in technical lineage, but limited in semantic capabilities beyond structured metadata.</li>
<li>Alation: ambitious vision of agentic governance, but still weak in multilingual semantic enrichment and natural-language rule creation.</li>
<li>Atlan and Ataccama: leaders in user experience, quality, and observability, but entity, concept, and relationship extraction from unstructured sources remains immature.</li>
<li><a href="http://data.world/" target="_self" rel="noopener">data.world</a>, Solidatus, Anjana Data: innovative in lineage or collaboration, but their semantic and entity resolution functions require heavy effort from customers.</li>
</ul>
<p id="ember2209" class="ember-view reader-text-block__paragraph"><strong>Without robust semantics, active metadata is not possible. </strong></p>
<p id="ember2210" class="ember-view reader-text-block__paragraph"><strong>Why This Matters: The Unstructured Data Blind Spot</strong></p>
<p id="ember2211" class="ember-view reader-text-block__paragraph">Around 80% of enterprise data is unstructured: reports, contracts, presentations, emails, logs, customer interactions, and knowledge bases.</p>
<ul>
<li>A bank may need to align compliance rules with contracts, call transcripts, and transaction logs.</li>
<li>A global enterprise may need to unify customer records, policy documents, and legal texts across multiple languages.</li>
<li>A technology company may want to automatically tag and classify knowledge bases to create a chatbot for employee support.</li>
</ul>
<p id="ember2213" class="ember-view reader-text-block__paragraph">Without advanced NLP (entity recognition, concept extraction, and relationship mapping), this vast body of information remains invisible to governance platforms or customer support teams.</p>
<p id="ember2214" class="ember-view reader-text-block__paragraph"><strong>The Role of Multilingual Semantics in Active Metadata</strong></p>
<p id="ember2215" class="ember-view reader-text-block__paragraph">Active metadata should not just catalog technical objects; it should understand what data means. For that, governance platforms require a Semantic Enrichment Engine with the following capabilities:</p>
<ul>
<li>Entity and concept extraction: automatically detect business objects such as “customer ID,” “AML regulation,” or “support ticket.”</li>
<li>Relationship discovery: link concepts across unstructured datasets.</li>
<li>Multilingual coverage: enable governance in languages like Chinese, Japanese, Spanish, German, French, Korean, Arabic… ensuring consistency and accuracy.</li>
<li>Unstructured data enrichment: transform PDFs, reports, and communications into governed, discoverable knowledge.</li>
<li>Ontology and taxonomy support: integrate existing business glossaries, identify synonyms and semantic variants, and connect data elements within a broader knowledge graph.</li>
<li>Automation through semantics: trigger workflows, policy enforcement, and recommendations based on semantic signals, not just technical metadata.</li>
</ul>
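<p>As an illustration of the last point (automation through semantics), the sketch below turns extracted concepts into active-metadata records that trigger governance actions. The policy names, concepts, and document IDs are hypothetical examples, not part of any vendor API:</p>

```python
# Illustrative only: map semantic signals (extracted concepts) to
# governance actions. Policies and concepts are invented examples.

POLICIES = {
    "AML regulation": "route-to-compliance",
    "customer ID": "mask-before-sharing",
}

def enrich(doc_id, concepts):
    """Build an active-metadata record and collect triggered actions."""
    record = {"doc": doc_id, "concepts": sorted(concepts), "actions": []}
    for concept in record["concepts"]:
        action = POLICIES.get(concept)
        if action:
            record["actions"].append(action)
    return record

rec = enrich("contract-042", {"customer ID", "payment terms"})
print(rec["actions"])  # ['mask-before-sharing']
```

<p>In a real platform the triggered actions would feed workflow or policy-enforcement engines rather than a simple list.</p>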
<p id="ember2217" class="ember-view reader-text-block__paragraph"><strong>Where <a href="https://bitext.com/" target="_self" rel="noopener">Bitext</a> Helps</strong></p>
<p id="ember2218" class="ember-view reader-text-block__paragraph">At Bitext, we provide an OEM Semantic Enrichment Engine designed to power active metadata and data governance platforms with the semantic depth most vendors still lack.</p>
<p id="ember2219" class="ember-view reader-text-block__paragraph">Key technical advantages of our Semantic Enrichment Engine include:</p>
<ul>
<li>Flexible deployment: available for both on-premises and cloud installations, accessible via REST API or native integration.</li>
<li>Developer-friendly integration: bindings for C, Python, and Java for seamless embedding into existing stacks.</li>
<li>Multiplatform by design: platform-independent C, supporting Windows, Linux, macOS, x64, and ARM.</li>
<li>High-performance NLP pipeline: from language identification to entity/concept extraction, processing over 640,000 words per second (3.2MB/sec) on a single 8-core CPU.</li>
<li>Lightweight footprint: average storage per language pipeline is only 50MB with no external dependencies, and average memory usage 200MB.</li>
<li>Extreme compression: client data sources compressed at ratios up to 1:100 (100MB reduced to 1MB).</li>
<li>Ultra-fast querying: compressed external data accessed at speeds of more than 400 million queries per second on a single 8-core CPU.</li>
</ul>
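<p>The throughput figures above are easy to sanity-check with simple arithmetic: 640,000 words per second at 3.2MB per second implies about 5 bytes per word, plausible for text plus whitespace. A back-of-the-envelope sketch (the 10GB corpus size is an arbitrary example):</p>

```python
# Back-of-the-envelope check of the quoted throughput
# (640,000 words/s, 3.2 MB/s on a single 8-core CPU).

WORDS_PER_SEC = 640_000
BYTES_PER_SEC = 3.2e6  # 3.2 MB/s

bytes_per_word = BYTES_PER_SEC / WORDS_PER_SEC  # 5.0 bytes/word

# Time to push a hypothetical 10 GB corpus through the pipeline:
corpus_bytes = 10e9
minutes = corpus_bytes / BYTES_PER_SEC / 60
print(bytes_per_word, round(minutes))  # 5.0 52
```

<p>So at the quoted rate, a 10GB corpus would take on the order of an hour on a single machine.</p>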
<p id="ember2221" class="ember-view reader-text-block__paragraph">With these capabilities, our Semantic Enrichment Engine allows governance platforms to scale semantic enrichment across massive volumes of unstructured data, in multiple languages, without compromising performance or cost.</p>
<p id="ember2222" class="ember-view reader-text-block__paragraph"><strong>Final Thought</strong></p>
<p id="ember2223" class="ember-view reader-text-block__paragraph">The Forrester Wave highlights the progress of data governance vendors, but also their weakness: semantic depth is not yet where it should be. Active metadata is the future, but without strong semantic intelligence it remains incomplete.</p>
<p id="ember2224" class="ember-view reader-text-block__paragraph">If data governance is to truly drive trust, compliance, and monetization, semantics must evolve from being an optional extra to becoming a core capability.</p>
<p id="ember2225" class="ember-view reader-text-block__paragraph">That is exactly what Bitext delivers with its Semantic Enrichment Engine.</p></div>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_103 landing-page-list  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner">More info about <a href="https://www.bitext.com/namer_entity_recognition/">Bitext NAMER</a></div>
			</div>
			</div>
				
				
				
				
			</div>
				
				
			</div>
<p>The post <a href="https://www.bitext.com/blog/why-semantic-intelligence-is-the-missing-link-in-active-metadata-and-data-governance/">Why Semantic Intelligence Is the Missing Link in Active Metadata and Data Governance</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Bitext NAMER: Slashing Time and Costs in Automated Knowledge Graph Construction</title>
		<link>https://www.bitext.com/blog/bitext-namer-slashing-time-and-costs-in-automated-knowledge-graph-construction/</link>
		
		<dc:creator><![CDATA[admin]]></dc:creator>
		<pubDate>Sun, 16 Mar 2025 14:59:40 +0000</pubDate>
				<category><![CDATA[Generative AI]]></category>
		<category><![CDATA[knowledge graph]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[NER]]></category>
		<category><![CDATA[Entity extraction]]></category>
		<guid isPermaLink="false">https://www.bitext.com/?p=43885</guid>

					<description><![CDATA[<p>The process of building Knowledge Graphs is essential for organizations seeking to organize, structure, and extract actionable insights from their data. However, traditional methods of constructing Knowledge Graphs are often slow, expensive, and complex, requiring significant expertise and manual effort. Bitext NAMER changes the game by automating key steps in the Knowledge Graph creation process, making it faster, more cost-effective, and accessible for businesses of all sizes. </p>
<p>The post <a href="https://www.bitext.com/blog/bitext-namer-slashing-time-and-costs-in-automated-knowledge-graph-construction/">Bitext NAMER: Slashing Time and Costs in Automated Knowledge Graph Construction</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></description>
										<content:encoded><![CDATA[<div class="et_pb_section et_pb_section_9 et_pb_with_background et_section_regular" >
				
				
				
				
				
				
				<div class="et_pb_row et_pb_row_9">
				<div class="et_pb_column et_pb_column_4_4 et_pb_column_9  et_pb_css_mix_blend_mode_passthrough et-last-child">
				
				
				
				
				<div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_113  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
<div class="et_pb_text_inner"><p>The process of building Knowledge Graphs is essential for organizations seeking to organize, structure, and extract actionable insights from their data. However, traditional methods of constructing Knowledge Graphs are often slow, expensive, and complex, requiring significant expertise and manual effort. Bitext NAMER changes the game by automating key steps in the Knowledge Graph creation process, making it faster, more cost-effective, and accessible for businesses of all sizes.</p>
<p><b>The Knowledge Graph Creation Workflow Simplified</b></p>
<p>The process of constructing a knowledge graph involves multiple stages, including ontology or taxonomy creation, entity extraction, relationship mapping, and integration of structured and unstructured data. Traditionally, this process required extensive manual effort from domain experts and data engineers. Bitext NAMER automates key components of this workflow:</p>
<ol>
<li><b>Ontology and Taxonomy Development</b>: While manual ontology creation can take weeks or months, Bitext NAMER simplifies this by providing pre-built dictionaries with over 100,000 entities per language and customizable annotated corpora. These resources serve as the foundation for creating domain-specific ontologies.</li>
<li><b>Entity Extraction</b>: Bitext NAMER identifies 20 types of entities (e.g., people, organizations, locations) with over 95% accuracy across multiple languages. This eliminates the need for manual tagging or annotation while ensuring high-quality data for the KG.</li>
<li><b>Relationship Mapping</b>: The tool detects semantic relationships between entities in real time, enabling the automatic creation of connections within the knowledge graph.</li>
<li><b>Data Integration</b>: By processing both structured and unstructured data from diverse sources, Bitext NAMER ensures seamless integration into existing knowledge frameworks.</li>
</ol>
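<p>The output of steps 2 and 3 can be folded into a graph with a few lines of code; the triples below are invented for illustration and are not actual NAMER output:</p>

```python
# Sketch: fold (subject, relation, object) triples -- as produced by
# entity extraction plus relationship mapping -- into a simple
# adjacency-style knowledge graph. Example triples are hypothetical.

def build_graph(triples):
    """Group outgoing (relation, object) edges by subject."""
    graph = {}
    for subj, rel, obj in triples:
        graph.setdefault(subj, []).append((rel, obj))
    return graph

triples = [
    ("Jane Doe", "works_for", "Acme Corp"),
    ("Acme Corp", "located_in", "Madrid"),
]
kg = build_graph(triples)
print(kg["Jane Doe"])  # [('works_for', 'Acme Corp')]
```

<p>Production systems would load such triples into a graph database rather than an in-memory dictionary, but the structure is the same.</p>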
<p>This automation reduces the time required to construct a knowledge graph from months to days or even hours, depending on the complexity of the data.</p>
<p><b>Time and Cost Efficiency</b></p>
<p>The use of Bitext NAMER significantly reduces the time and cost associated with knowledge graph construction:</p>
<ul>
<li><b>Time Savings</b>: Manual KG construction typically requires 200-300 hours for domain-specific projects. With Bitext NAMER, this can be reduced by up to 90%, allowing completion in as little as 15-25 hours.</li>
<li><b>Cost Reduction</b>: Automating entity extraction and relationship mapping eliminates the need for large teams of annotators or ontology engineers. This translates into cost savings of up to 70%, particularly for organizations processing large volumes of text across multiple languages.</li>
</ul>
<p>For example, a financial services company using Bitext NAMER to build a KG for market intelligence could process thousands of documents daily without incurring the high costs associated with manual efforts.</p></div>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_114  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
<div class="et_pb_text_inner"><p><b>The Challenges of Multilingual NER and Its Importance for Global Knowledge Graphs</b></p>
<p>Global enterprises often operate in multilingual environments, necessitating NER solutions that:</p>
<ul>
<li>Handle linguistic diversity and nuances.</li>
<li>Maintain consistency across languages.</li>
<li>Address region-specific variations, such as named entity formats and cultural context.</li>
</ul>
<p>Failure to address these complexities can lead to fragmented KGs, diminishing their utility and reliability.</p></div>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_115  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p><b><span data-contrast="auto">Technical Performance Highlights</span></b><span data-ccp-props="{&quot;134233117&quot;:true,&quot;134233118&quot;:true,&quot;201341983&quot;:0,&quot;335559740&quot;:240}"> </span></p>
<p><span data-contrast="auto">Bitext NAMER&#8217;s technical capabilities are optimized for enterprise-scale KG construction:</span><span data-ccp-props="{&quot;134233117&quot;:true,&quot;134233118&quot;:true,&quot;201341983&quot;:0,&quot;335559740&quot;:240}"> </span></p>
<ul>
<li data-leveltext="" data-font="Symbol" data-listid="4" data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:&#091;8226&#093;,&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;multilevel&quot;}" aria-setsize="-1" data-aria-posinset="1" data-aria-level="1"><b><span data-contrast="auto">Processing Speed</span></b><span data-contrast="auto">: Up to 100KB of raw text per second per CPU core.</span><span data-ccp-props="{&quot;134233117&quot;:true,&quot;134233118&quot;:true,&quot;201341983&quot;:0,&quot;335559740&quot;:240}"> </span></li>
</ul>
<ul>
<li data-leveltext="" data-font="Symbol" data-listid="4" data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:&#091;8226&#093;,&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;multilevel&quot;}" aria-setsize="-1" data-aria-posinset="2" data-aria-level="1"><b><span data-contrast="auto">Multilingual Support</span></b><span data-contrast="auto">: Covers over 20 languages natively (e.g., English, Spanish, French) with dictionaries available in 77 languages.</span><span data-ccp-props="{&quot;134233117&quot;:true,&quot;134233118&quot;:true,&quot;201341983&quot;:0,&quot;335559740&quot;:240}"> </span></li>
</ul>
<ul>
<li data-leveltext="" data-font="Symbol" data-listid="4" data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:&#091;8226&#093;,&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;multilevel&quot;}" aria-setsize="-1" data-aria-posinset="3" data-aria-level="1"><b><span data-contrast="auto">Entity Coverage</span></b><span data-contrast="auto">: Recognizes diverse entity types such as people, places, companies/brands, account numbers, and phone numbers.</span><span data-ccp-props="{&quot;134233117&quot;:true,&quot;134233118&quot;:true,&quot;201341983&quot;:0,&quot;335559740&quot;:240}"> </span></li>
</ul>
<ul>
<li data-leveltext="" data-font="Symbol" data-listid="4" data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:&#091;8226&#093;,&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;multilevel&quot;}" aria-setsize="-1" data-aria-posinset="4" data-aria-level="1"><b><span data-contrast="auto">Deployment Flexibility</span></b><span data-contrast="auto">: Available as an on-premise SDK or via SaaS API.</span><span data-ccp-props="{&quot;134233117&quot;:true,&quot;134233118&quot;:true,&quot;201341983&quot;:0,&quot;335559740&quot;:240}"> </span></li>
</ul>
<p>These features make it possible to handle complex datasets across industries such as finance, e-commerce, and cybersecurity.</p>
<p><b>Applications in Knowledge Graph Automation</b></p>
<p>The automation enabled by Bitext NAMER has transformative applications in various domains:</p>
<ol>
<li><b>Semantic Systems</b>: Enhances search engines by creating semantic relationships between structured and unstructured data.</li>
<li><b>Financial Intelligence</b>: Identifies key entities like accounts and transactions to build real-time market intelligence systems.</li>
<li><b>E-commerce</b>: Recognizes brands and products to create recommendation systems based on customer behavior.</li>
<li><b>Cybersecurity</b>: Detects suspicious patterns by connecting disparate datasets into unified graphs.</li>
</ol></div>
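<p>One common way to turn extracted entities into graph structure is co-occurrence linking: entities become nodes, and entities mentioned together become connected. The sketch below illustrates that idea in a few lines of Python; the entity tuples and type labels are illustrative placeholders, not actual Bitext NAMER output.</p>

```python
# Minimal sketch: building a small knowledge graph from NER output.
# The entity annotations below are hypothetical examples, not real
# Bitext NAMER results.
from collections import defaultdict
from itertools import combinations

# Extracted entities per sentence: (surface form, entity type)
ner_output = [
    [("Acme Corp", "ORG"), ("account 4711", "ACCOUNT")],
    [("Acme Corp", "ORG"), ("wire transfer", "TRANSACTION")],
]

# Nodes keyed by surface form; edges count co-occurrences.
nodes = {}
edges = defaultdict(int)
for sentence in ner_output:
    for name, etype in sentence:
        nodes[name] = etype
    for (a, _), (b, _) in combinations(sentence, 2):
        edges[tuple(sorted((a, b)))] += 1

print(nodes["Acme Corp"])                     # ORG
print(edges[("Acme Corp", "account 4711")])   # 1
```

<p>In a production pipeline the co-occurrence heuristic would typically be replaced by relation extraction, but the node/edge structure stays the same.</p>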
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_117 landing-page-list  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner">More info about <a href="https://www.bitext.com/namer_entity_recognition/">Bitext NAMER</a></div>
			</div>
			</div>
			</div>
			</div>
<p>The post <a href="https://www.bitext.com/blog/bitext-namer-slashing-time-and-costs-in-automated-knowledge-graph-construction/">Bitext NAMER: Slashing Time and Costs in Automated Knowledge Graph Construction</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
