
Some of your RAG-related issues have an easy & quick solution: lemmatization


Some RAG issues have a simpler fix than people think: better text normalization.

One common culprit is stemming. Stemming is a blunt, error-prone approach: it strips word endings mechanically, without accounting for morphology, part of speech, or context. As a result, it often collapses unrelated words into the same stem just because they look similar on the surface.

The result is noisy normalization.

For example, in English, according to the widely used Porter stemmer:

“organization” is wrongly linked to “organ”

“news” is wrongly associated with “new”

“united” is wrongly connected to “unit”

In languages with richer morphology, such as Spanish, German, French, and Italian, these problems get worse.

Since stemming is performed at the beginning of the text analysis process, these errors affect every task that follows. The noise does not stay contained. It flows downstream into indexing, retrieval, and search, which means some of the “RAG problems” teams run into actually begin much earlier in the pipeline.


Why lemmatization is different

Lemmatization avoids these noisy associations. Instead of chopping words mechanically, lemmatization maps inflected forms to their correct dictionary form, typically using morphological analysis and part-of-speech information.

That makes it much better at normalizing real linguistic variation while avoiding many of the false matches that stemming introduces.

In the above examples:

“organizations” is correctly mapped to the lemma “organization”, not to “organ”

“news” is not associated with “new”; they are independent, unrelated words

“united” is properly connected to “unite”
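Conceptually, a lemmatizer maps a (surface form, part of speech) pair to a dictionary form instead of mechanically stripping suffixes. The sketch below uses a hypothetical hand-built lookup table covering only the examples above; real lemmatizers draw on full morphological dictionaries and POS taggers:

```python
# Toy lemmatizer sketch: a (word, POS) lookup table instead of suffix stripping.
# LEMMA_TABLE is illustrative only; real systems use morphological dictionaries.
LEMMA_TABLE = {
    ("organizations", "NOUN"): "organization",
    ("news", "NOUN"): "news",      # "news" is its own lemma, unrelated to "new"
    ("united", "VERB"): "unite",   # past tense of the verb "to unite"
    ("united", "ADJ"): "united",   # the adjective reading keeps its own lemma
}

def lemmatize(word: str, pos: str) -> str:
    """Return the dictionary form, falling back to the surface form itself."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(lemmatize("organizations", "NOUN"))  # organization
print(lemmatize("news", "NOUN"))           # news
print(lemmatize("united", "VERB"))         # unite
```

Note how part-of-speech information disambiguates “united”: as a verb it lemmatizes to “unite”, while the adjective stays “united” — a distinction a stemmer cannot make.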

Lemmatization is also a fully deterministic, consistent, and reliable process, which translates into:

  • fewer false positives
  • cleaner indexing
  • better retrieval quality
  • more robust multilingual search

And since retrieval quality is critical for RAG, improving normalization upstream can have an outsized impact downstream.
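As a sketch of that downstream effect, the toy inverted indexes below (hypothetical two-document corpus, with the stem and lemma mappings hard-coded from the examples above) show how stem-based indexing lets a query about organs retrieve a document about organizations, while lemma-based indexing keeps them apart:

```python
# Toy inverted indexes: one keyed by Porter-style stems, one by lemmas.
# Documents and normalization maps are hard-coded for illustration only.
docs = {
    1: ["organizations"],  # a document about organizations
    2: ["organ"],          # a document about organs
}

stems  = {"organizations": "organ", "organ": "organ"}
lemmas = {"organizations": "organization", "organ": "organ"}

def build_index(normalize):
    """Map each normalized term to the set of documents containing it."""
    index = {}
    for doc_id, words in docs.items():
        for w in words:
            index.setdefault(normalize[w], set()).add(doc_id)
    return index

stem_index  = build_index(stems)
lemma_index = build_index(lemmas)

# A query normalized to "organ":
print(sorted(stem_index.get("organ", set())))   # [1, 2] <- false positive: doc 1
print(sorted(lemma_index.get("organ", set())))  # [2]    <- only the organ document
```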


The real source of some RAG issues

A lot of teams treat retrieval issues as if they were generation issues.

Often, they are not.

Sometimes the problem starts with stemming.

For a deeper understanding of how normalization impacts search relevance, check out this post on lemmatization vs stemming.

 
