Why decompounding is a non-optional requirement for e-commerce search, vector search, and RAG

Search systems that work well in English, Spanish or French often collapse when they encounter German compounds or Korean eojeols (the space-delimited units of Korean text). The issue is not ranking quality, embedding quality, or a lack of training data. The root cause is more basic: the engine cannot see inside the word. Splitting compounds correctly involves tokenization, morphological analysis or lemmatization, and handling of linking elements (Fugenelemente). When a search or retrieval engine cannot see the internal structure of a word, it cannot align user queries with documents that express exactly the same meaning.

Below are examples where the query and the product documentation contain the same lexemes and the same intent; the only difference is the morphological form. Without decompounding, retrieval still fails.


German — Pure Decompounding Failures

1. Query: Wasch Maschine Filter

Same lexemes and identical meaning, yet invisible without segmentation.

| Type | Value |
| --- | --- |
| Query | Wasch Maschine Filter |
| Product | Waschmaschinenfilter |
| Translation | “washing machine filter” |

2. Query: Staub Sauger Beutel

Users type separated words; systems that do not split the compound fail to match.

| Type | Value |
| --- | --- |
| Query | Staub Sauger Beutel |
| Product | Staubsaugerbeutel |
| Translation | “vacuum cleaner bag” |

3. Query: Kinder Wagen Zubehör

Separated input does not align with the glued compound form.

| Type | Value |
| --- | --- |
| Query | Kinder Wagen Zubehör |
| Product | Kinderwagenzubehör |
| Translation | “stroller accessories” |

4. Query: Tisch Lampe Schirm

Unless the engine identifies Tisch + Lampe(n) + Schirm, it cannot retrieve the item.

| Type | Value |
| --- | --- |
| Query | Tisch Lampe Schirm |
| Product | Tischlampenschirm |
| Translation | “table lamp shade” |

5. Query: Schnee Schuh Herren

Both sides refer to men’s snowshoes; the retrieval failure is purely for morphological reasons.

| Type | Value |
| --- | --- |
| Query | Schnee Schuh Herren |
| Product | Schneeschuhherren |
| Translation | “men’s snowshoes” |

6. Query: Bett Decke Bezug

A common pattern in German catalogues and enterprise documents.

| Type | Value |
| --- | --- |
| Query | Bett Decke Bezug |
| Product | Bettdeckenbezug |
| Translation | “bed duvet cover” / “bed comforter cover” |
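
The German examples above all follow one pattern: known stems glued together, sometimes with a linking element such as the "n" in Waschmaschine(n)filter. As a minimal sketch of how a dictionary-based decompounder could expose those parts, the Python below splits a compound recursively against a tiny lexicon. The names `LEXICON`, `LINKERS` and `split_compound` are hypothetical and exist only for this illustration; a production system would need a full dictionary, frequency-based disambiguation and lemmatization.

```python
from functools import lru_cache

# Hypothetical mini-lexicon of lowercase stems (assumption: a real system loads a full dictionary).
LEXICON = {
    "wasch", "maschine", "filter", "staub", "sauger", "beutel",
    "kinder", "wagen", "zubehör", "tisch", "lampe", "schirm",
    "bett", "decke", "bezug",
}

# A few common German linking elements; the empty string means "no linker".
LINKERS = ("es", "en", "er", "n", "s", "e", "")


def split_compound(word: str) -> list[str]:
    """Return the lexicon parts that cover `word`, or [word] if no full cover exists."""
    word = word.lower()

    @lru_cache(maxsize=None)
    def solve(start: int):
        if start == len(word):
            return ()
        best = None
        # Try longer heads first; keep the cover with the fewest parts.
        for end in range(len(word), start, -1):
            head = word[start:end]
            if head not in LEXICON:
                continue
            for linker in LINKERS:
                if not word.startswith(linker, end):
                    continue
                nxt = end + len(linker)
                if linker and nxt >= len(word):
                    continue  # a linker must be followed by another part
                rest = solve(nxt)
                if rest is not None:
                    candidate = (head,) + rest
                    if best is None or len(candidate) < len(best):
                        best = candidate
        return best

    parts = solve(0)
    return list(parts) if parts else [word]


if __name__ == "__main__":
    for compound in ("Waschmaschinenfilter", "Tischlampenschirm", "Bettdeckenbezug"):
        print(compound, "->", split_compound(compound))
```

The function returns lowercase stems, so the query side needs the same normalization (lowercasing here) before the two token lists can be compared.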

Korean — Pure Eojeol Segmentation Failures

Korean packs multiple morphemes into a single orthographic unit. If the system cannot segment the eojeol, retrieval breaks for both keyword and vector search, even when the meaning is identical.

1. Query: 세탁기 필터

Exact same lexemes; retrieval fails without splitting.

| Type | Value |
| --- | --- |
| Query | 세탁기 필터 |
| Product | 세탁기필터 |
| Translation | “washing machine filter” |

2. Query: 가습기 물통

The terms exist inside the eojeol but remain unreachable.

| Type | Value |
| --- | --- |
| Query | 가습기 물통 |
| Product | 가습기물통 |
| Translation | “humidifier water tank” |

3. Query: 블루투스 헤드폰

Without segmentation, it is treated as a single opaque token.

| Type | Value |
| --- | --- |
| Query | 블루투스 헤드폰 |
| Product | 블루투스헤드폰 |
| Translation | “Bluetooth headphones” |

4. Query: 기차표 가격

Even simple combinations cannot match unless morphemes are exposed.

| Type | Value |
| --- | --- |
| Query | 기차표 가격 |
| Product | 기차표가격 |
| Translation | “train ticket price” |

5. Query: 도어 손잡이

The user’s intention is present but hidden inside the long unit.

| Type | Value |
| --- | --- |
| Query | 도어 손잡이 |
| Product | 도어손잡이 |
| Translation | “door handle” |

6. Query: 휴대폰 케이스

A recurring cause of low recall in Korean e-commerce search.

| Type | Value |
| --- | --- |
| Query | 휴대폰 케이스 |
| Product | 휴대폰케이스 |
| Translation | “mobile phone case” / “cellphone case” |
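
The Korean examples above can be sketched the same way. The Python below applies greedy longest-match segmentation against a small noun list; `NOUNS` and `segment_eojeol` are hypothetical names introduced for this illustration. It is only a sketch under strong assumptions: it covers noun-noun compounds like the ones listed here and ignores particles and verb endings, which a real Korean morphological analyzer has to handle.

```python
import unicodedata

# Hypothetical noun list taken from the examples (assumption: a real system uses a full lexicon).
NOUNS = {
    "세탁기", "필터", "가습기", "물통", "블루투스", "헤드폰",
    "기차표", "가격", "도어", "손잡이", "휴대폰", "케이스",
}
MAX_LEN = max(len(noun) for noun in NOUNS)


def segment_eojeol(eojeol: str) -> list[str]:
    """Greedy longest-match split; returns [eojeol] unchanged if any part is unknown."""
    # NFC keeps each Hangul syllable as a single code point, so slicing is per syllable.
    eojeol = unicodedata.normalize("NFC", eojeol)
    parts = []
    i = 0
    while i < len(eojeol):
        for size in range(min(MAX_LEN, len(eojeol) - i), 0, -1):
            piece = eojeol[i:i + size]
            if piece in NOUNS:
                parts.append(piece)
                i += size
                break
        else:
            return [eojeol]  # no dictionary entry starts here: leave the unit intact
    return parts


if __name__ == "__main__":
    for unit in ("세탁기필터", "블루투스헤드폰", "도어손잡이"):
        print(unit, "->", " ".join(segment_eojeol(unit)))
```

Greedy matching is deterministic but can pick the wrong boundary when dictionary entries overlap, which is one reason production analyzers score alternative segmentations instead of taking the first match.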

Why This Breaks Modern Retrieval Pipelines

Retrieval depends on aligning user input with textual content. Without decompounding, this alignment cannot happen.

  • Keyword Search: Split queries never match unsegmented compounds.
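  • Vector Search / Embeddings: Long compounds are often rare or out-of-vocabulary for the embedding model and may be broken into subword pieces that do not follow morpheme boundaries, which harms embedding quality and weakens semantic alignment.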
  • RAG Pipelines: Relevant chunks are not retrieved, which leads to incomplete context and weaker answers.
  • LLM Interpretation: When the model receives unsegmented tokens, internal semantic structure is lost.
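
The keyword-search case in the list above can be reproduced with a toy inverted index. The Python below is a minimal sketch, not a description of any particular engine: `SPLITS` is a hypothetical lookup table standing in for a real decompounder, and `analyze`, `build_index` and `search` are names chosen for this illustration. With decompounding enabled, the analyzer indexes the original token plus its parts.

```python
from collections import defaultdict

# Hypothetical decompounder output (assumption: a real system computes these splits).
SPLITS = {
    "waschmaschinenfilter": ["wasch", "maschine", "filter"],
    "세탁기필터": ["세탁기", "필터"],
}


def analyze(text: str, decompound: bool) -> list[str]:
    """Lowercase, split on whitespace, and optionally add compound parts."""
    tokens = []
    for token in text.lower().split():
        tokens.append(token)  # always keep the original token
        if decompound and token in SPLITS:
            tokens.extend(SPLITS[token])
    return tokens


def build_index(docs: dict[int, str], decompound: bool) -> dict[str, set[int]]:
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text, decompound):
            index[token].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Return documents containing every query term (AND semantics)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()


DOCS = {1: "Waschmaschinenfilter", 2: "세탁기필터"}
plain = build_index(DOCS, decompound=False)
split = build_index(DOCS, decompound=True)

if __name__ == "__main__":
    print(search(plain, "Wasch Maschine Filter"))  # no match without decompounding
    print(search(split, "Wasch Maschine Filter"))  # matches document 1
    print(search(split, "세탁기 필터"))               # matches document 2
```

Keeping the original token next to its parts is a design choice in this sketch: a query that uses the glued form still matches, and the index only grows by the number of parts.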

Business Impact: In e-commerce, products remain hidden, recall drops, and conversion decreases. In enterprise search and RAG, relevant documents remain undiscovered, reducing accuracy and productivity.

A Practical Note on Decompounding

Any multilingual search or RAG system operating in German or Korean requires deterministic, high-accuracy decompounding. This is not a feature to add later; it is a foundational preprocessing layer. A proper decompounder should reliably segment forms such as:

| Original | Segmented | Translation |
| --- | --- | --- |
| Waschmaschinenfilter | Waschmaschine Filter | Waschmaschine = “washing machine”; Filter = “filter” |
| Staubsaugerbeutel | Staubsauger Beutel | Staubsauger = “vacuum cleaner”; Beutel = “bag” |
| 세탁기필터 | 세탁기 필터 | 세탁기 = “washing machine”; 필터 = “filter” |
| 휴대폰케이스 | 휴대폰 케이스 | 휴대폰 = “mobile phone / cellphone”; 케이스 = “case” |

Segmented text leads to higher recall, more meaningful embeddings, more stable keyword and vector retrieval, and RAG systems that actually surface the right passages.
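
One way to make "deterministic, high-accuracy" measurable is to pin expected segmentations in a small gold set and score any candidate segmenter against it. The Python below is a sketch of such a check. The gold entries are the four rows of the table above; `exact_match_accuracy`, `is_deterministic`, `no_split` and `lookup_split` are hypothetical names, and the two segmenters are stand-ins for a real component.

```python
from typing import Callable

Segmenter = Callable[[str], list[str]]

# Gold segmentations copied from the table above.
GOLD = {
    "Waschmaschinenfilter": ["Waschmaschine", "Filter"],
    "Staubsaugerbeutel": ["Staubsauger", "Beutel"],
    "세탁기필터": ["세탁기", "필터"],
    "휴대폰케이스": ["휴대폰", "케이스"],
}


def exact_match_accuracy(segment: Segmenter, gold: dict[str, list[str]]) -> float:
    """Share of words whose segmentation equals the gold parts exactly."""
    hits = sum(1 for word, parts in gold.items() if segment(word) == parts)
    return hits / len(gold)


def is_deterministic(segment: Segmenter, words: list[str], runs: int = 3) -> bool:
    """True if repeated calls return the same parts for every word."""
    return all(
        len({tuple(segment(word)) for _ in range(runs)}) == 1
        for word in words
    )


def no_split(word: str) -> list[str]:
    """Baseline: treat every word as one opaque token."""
    return [word]


def lookup_split(word: str) -> list[str]:
    """Stand-in segmenter that knows only the gold entries (assumption for the demo)."""
    return GOLD.get(word, [word])


if __name__ == "__main__":
    print(exact_match_accuracy(no_split, GOLD))
    print(exact_match_accuracy(lookup_split, GOLD))
    print(is_deterministic(lookup_split, list(GOLD)))
```

A gold set this small only guards against regressions; a meaningful accuracy figure would need a much larger, independently annotated sample.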

Compounding is also common well beyond German and Korean. Many other languages are affected by compounding or by related phenomena such as agglutination, among them Dutch, Swedish, Norwegian (Bokmål and Nynorsk), Danish, Finnish, Russian, Ukrainian, Hungarian, Turkish, Estonian, Latvian, Lithuanian and Czech.

Conclusion

German and Korean do not break retrieval because they are unusually complex; they break retrieval because most systems still treat complex words as monolithic strings. When compounds and eojeols remain opaque, search engines cannot align queries with documents—even when they contain the same meaning. Any team building multilingual search, vector search or RAG must incorporate reliable decompounding as a foundational step to avoid systematic retrieval failures.

 

 
