How to Increase Search Relevance with Better Text Normalization

Some retrieval-augmented generation (RAG) issues have a simpler fix than people think: better text normalization.

One common culprit is stemming. Stemming is a blunt, error-prone approach: it strips word endings mechanically, without accounting for morphology, part of speech, or context. As a result, it often collapses unrelated words into the same stem just because they look similar on the surface.

The result is noisy normalization.

For example, the widely used Porter stemmer for English produces these false links:

“organization” is wrongly linked to “organ”

“news” is wrongly associated with “new”

“united” is wrongly connected to “unit”
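
To make the mechanism concrete, here is a minimal sketch of purely mechanical suffix stripping. It is a toy with a hypothetical three-suffix rule list (`SUFFIXES` and `toy_stem` are names invented for this example), not the Porter algorithm itself, which applies many more rules and conditions. It only illustrates how surface-level stripping ends up merging unrelated words.

```python
# Toy suffix stripper: illustrates mechanical stemming. This is NOT the Porter
# algorithm; SUFFIXES and toy_stem are hypothetical names for this sketch.
SUFFIXES = ("ization", "ed", "s")  # checked in order, longest first


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, ignoring meaning, part of speech, and context."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for pair in [("organization", "organ"), ("news", "new"), ("united", "unit")]:
        stems = [toy_stem(w) for w in pair]
        verdict = "collide" if stems[0] == stems[1] else "stay distinct"
        print(pair, "->", stems, verdict)
```

Each pair collapses to a single stem even though the words are unrelated, because the rules only look at the characters at the end of the word.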

In languages with richer morphology, such as Spanish, German, French, and Italian, these problems get worse.

Since stemming is performed at the beginning of the text analysis process, these errors affect every task that follows. The noise does not stay contained. It flows downstream into indexing, retrieval, and search, which means some of the “RAG problems” teams run into actually begin much earlier in the pipeline.
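
The downstream effect can be sketched with a tiny inverted index. Everything here is an assumption made for illustration: the four documents are invented, `build_index` and `search` are hypothetical helpers, and the stems are hard-coded from the Porter examples quoted above so the sketch runs without any external library.

```python
from collections import defaultdict

# Stems taken from the Porter examples in this article, hard-coded so that no
# stemming library is needed. Words not listed are left unchanged.
PORTER_STEMS = {"organization": "organ", "news": "new", "united": "unit"}

# Invented documents, for illustration only.
DOCS = {
    "d1": "organ donation saves lives",
    "d2": "the organization published its report",
    "d3": "new features shipped today",
    "d4": "breaking news from the capital",
}


def stem(token: str) -> str:
    return PORTER_STEMS.get(token, token)


def identity(token: str) -> str:
    return token


def build_index(docs, normalize):
    """Map each normalized token to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return index


def search(index, query, normalize):
    """Return the sorted ids of documents matching a single-term query."""
    return sorted(index.get(normalize(query.lower()), set()))


if __name__ == "__main__":
    stemmed_index = build_index(DOCS, stem)
    exact_index = build_index(DOCS, identity)
    print("stemmed:", search(stemmed_index, "organ", stem))
    print("exact:  ", search(exact_index, "organ", identity))
```

With the stemmed index, a query for "organ" also returns the document about an organization, and a query for "news" also returns the document about new features. The false match is baked into the index itself, so every retrieval step built on top of it inherits the error.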

Why lemmatization is different

Lemmatization avoids these noisy associations. Instead of chopping words mechanically, lemmatization maps inflected forms to their correct dictionary form, typically using morphological analysis and part-of-speech information.

That makes it much better at normalizing real linguistic variation while avoiding many of the false matches that stemming introduces.

Applied to the examples above:

“organization” is correctly linked to “organizations”

“news” is not associated with “new”; they are independent, unrelated words

“united” is properly connected to “unite”
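
As a rough sketch of the idea, a lemmatizer can be thought of as a lookup from a word form plus its part of speech to a dictionary form. The miniature `LEXICON` below and the `lemmatize` helper are hypothetical and hand-written for this example. Production lemmatizers rely on full morphological lexicons and a part-of-speech tagger rather than a ten-entry table.

```python
# Hypothetical miniature lexicon: (surface form, part of speech) -> lemma.
# Real lemmatizers use large morphological dictionaries; this is only a sketch.
LEXICON = {
    ("organizations", "NOUN"): "organization",
    ("organization", "NOUN"): "organization",
    ("organs", "NOUN"): "organ",
    ("organ", "NOUN"): "organ",
    ("news", "NOUN"): "news",
    ("new", "ADJ"): "new",
    ("united", "VERB"): "unite",
    ("unites", "VERB"): "unite",
    ("units", "NOUN"): "unit",
    ("unit", "NOUN"): "unit",
}


def lemmatize(word: str, pos: str) -> str:
    """Return the dictionary form; unknown (word, pos) pairs come back unchanged."""
    key = (word.lower(), pos)
    return LEXICON.get(key, word.lower())


if __name__ == "__main__":
    examples = [
        ("organizations", "NOUN"),
        ("organ", "NOUN"),
        ("news", "NOUN"),
        ("new", "ADJ"),
        ("united", "VERB"),
        ("units", "NOUN"),
    ]
    for word, pos in examples:
        print(f"{word} ({pos}) -> {lemmatize(word, pos)}")
```

Because the lookup is keyed on real word forms instead of trailing characters, "organizations" and "organ" keep separate lemmas, "news" stays "news", and "united" maps to "unite" and not to "unit".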

Lemmatization is also a deterministic, consistent process. In practice, that means:

fewer false positives
cleaner indexing
better retrieval quality
more robust multilingual search

And since retrieval quality is critical for RAG, improving normalization upstream can have an outsized impact downstream.

The real source of some RAG issues

A lot of teams treat retrieval issues as if they were generation issues.

Often, they are not.

Sometimes the problem starts with stemming.

For a deeper understanding of how normalization impacts search relevance, check out this post on lemmatization vs stemming.
