Why decompounding is a non-optional requirement for e-commerce search, vector search, and RAG
Search systems that work well in English, Spanish or French often collapse when they encounter German compounds or Korean eojeols (space-delimited word units). The issue is not ranking quality, not embedding quality, and not a lack of training data. The root cause is simpler to state than to solve: the system never splits compound words into their parts, a task that involves tokenization, morphological analysis or lemmatization, and the handling of linking elements (German Fugenelemente). When a search or retrieval engine cannot see the internal structure of a word, it cannot align user queries with documents that express exactly the same meaning.
Below are examples in which the query and the product documentation contain the same lexemes and the same intent; the only difference is the morphological form. Without decompounding, retrieval still fails.
German — Pure Decompounding Failures
1. Query: Wasch Maschine Filter
Same lexemes and identical meaning, yet invisible without segmentation.
| Type | Value |
|---|---|
| Query | Wasch Maschine Filter |
| Product | Waschmaschinenfilter |
| Translation | “washing machine filter” |
2. Query: Staub Sauger Beutel
Users type separated words; systems that do not split the compound fail to match.
| Type | Value |
|---|---|
| Query | Staub Sauger Beutel |
| Product | Staubsaugerbeutel |
| Translation | “vacuum cleaner bag” |
3. Query: Kinder Wagen Zubehör
Separated input does not align with the glued compound form.
| Type | Value |
|---|---|
| Query | Kinder Wagen Zubehör |
| Product | Kinderwagenzubehör |
| Translation | “stroller accessories” |
4. Query: Tisch Lampe Schirm
Unless the engine identifies Tisch + Lampe(n) + Schirm, it cannot retrieve the item.
| Type | Value |
|---|---|
| Query | Tisch Lampe Schirm |
| Product | Tischlampenschirm |
| Translation | “table lamp shade” |
5. Query: Herren Schnee Schuhe
Both sides refer to men’s snowshoes; the retrieval failure is purely morphological.
| Type | Value |
|---|---|
| Query | Herren Schnee Schuhe |
| Product | Herrenschneeschuhe |
| Translation | “men’s snowshoes” |
6. Query: Bett Decke Bezug
A common pattern in German catalogues and enterprise documents.
| Type | Value |
|---|---|
| Query | Bett Decke Bezug |
| Product | Bettdeckenbezug |
| Translation | “bed duvet cover” / “bed comforter cover” |
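
The German cases above share one pattern: known words are joined directly or through a linking element, such as the -n- in Tisch + Lampe(n) + Schirm. As a minimal sketch of how a splitter might expose those parts, the following uses a tiny hand-written lexicon and a short list of linking elements. `LEXICON`, `LINKERS` and the helper name `decompound` are illustrative assumptions, not a production dictionary or an existing library API.

```python
from functools import lru_cache
from typing import Optional

# Hypothetical toy lexicon (lowercased); a real system would load a full dictionary.
LEXICON = {
    "wasch", "maschine", "filter", "staub", "sauger", "beutel",
    "kinder", "wagen", "zubehör", "tisch", "lampe", "schirm",
    "bett", "decke", "bezug",
}

# A few common German linking elements; "" means the parts are joined directly.
LINKERS = ("", "n", "s", "en", "es", "er", "e")


def decompound(word: str) -> Optional[list[str]]:
    """Split a compound into lexicon entries, or return None if no full split exists."""
    word = word.lower()

    @lru_cache(maxsize=None)
    def split(start: int) -> Optional[tuple[str, ...]]:
        if start == len(word):
            return ()
        # Try longer candidate parts first.
        for end in range(len(word), start, -1):
            part = word[start:end]
            if part not in LEXICON:
                continue
            for linker in LINKERS:
                nxt = end + len(linker)
                # A linking element must be followed by another part.
                if linker and nxt >= len(word):
                    continue
                if word.startswith(linker, end):
                    rest = split(nxt)
                    if rest is not None:
                        return (part,) + rest
        return None

    result = split(0)
    return list(result) if result is not None else None


print(decompound("Waschmaschinenfilter"))  # ['wasch', 'maschine', 'filter']
print(decompound("Tischlampenschirm"))     # ['tisch', 'lampe', 'schirm']
```

A purely lexicon-driven splitter like this one can over-split or choose a wrong analysis once the dictionary grows, which is why practical decompounders typically add frequency information or morphological rules on top.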
Korean — Pure Eojeol Segmentation Failures
Korean packs multiple morphemes into a single orthographic unit. If the system cannot segment the eojeol, retrieval breaks for both keyword and vector search, even when the meaning is identical.
1. Query: 세탁기 필터
Exact same lexemes; retrieval fails without splitting.
| Type | Value |
|---|---|
| Query | 세탁기 필터 |
| Product | 세탁기필터 |
| Translation | “washing machine filter” |
2. Query: 가습기 물통
The terms exist inside the eojeol but remain unreachable.
| Type | Value |
|---|---|
| Query | 가습기 물통 |
| Product | 가습기물통 |
| Translation | “humidifier water tank” |
3. Query: 블루투스 헤드폰
Without segmentation, it is treated as a single opaque token.
| Type | Value |
|---|---|
| Query | 블루투스 헤드폰 |
| Product | 블루투스헤드폰 |
| Translation | “Bluetooth headphones” |
4. Query: 기차표 가격
Even simple combinations cannot match unless morphemes are exposed.
| Type | Value |
|---|---|
| Query | 기차표 가격 |
| Product | 기차표가격 |
| Translation | “train ticket price” |
5. Query: 도어 손잡이
The user’s intention is present but hidden inside the long unit.
| Type | Value |
|---|---|
| Query | 도어 손잡이 |
| Product | 도어손잡이 |
| Translation | “door handle” |
6. Query: 휴대폰 케이스
A recurring cause of low recall in Korean e-commerce search.
| Type | Value |
|---|---|
| Query | 휴대폰 케이스 |
| Product | 휴대폰케이스 |
| Translation | “mobile phone case” / “cellphone case” |
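
The Korean examples above are all noun sequences written without a space. As a rough sketch, a dictionary-driven segmenter can recover the parts by searching for the split with the fewest known nouns. The `NOUNS` set and the function name `segment_eojeol` are hypothetical; real Korean text also carries particles and verb endings, which this toy deliberately ignores and which call for a full morphological analyzer.

```python
import unicodedata

# Hypothetical toy noun lexicon taken from the examples above.
NOUNS = {
    "세탁기", "필터", "가습기", "물통", "블루투스", "헤드폰",
    "기차표", "가격", "도어", "손잡이", "휴대폰", "케이스",
}


def segment_eojeol(eojeol: str) -> list[str]:
    """Split an eojeol into the fewest lexicon nouns; return it unchanged if no full split exists."""
    # Normalize to precomposed syllables so that one syllable is one character.
    eojeol = unicodedata.normalize("NFC", eojeol)
    n = len(eojeol)
    # best[i] is the shortest known segmentation of eojeol[:i], or None if unreachable.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = eojeol[start:end]
            if best[start] is not None and piece in NOUNS:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n] if best[n] is not None else [eojeol]


print(segment_eojeol("세탁기필터"))      # ['세탁기', '필터']
print(segment_eojeol("블루투스헤드폰"))  # ['블루투스', '헤드폰']
```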
Why This Breaks Modern Retrieval Pipelines
Retrieval depends on aligning user input with textual content. Without decompounding, this alignment cannot happen.
- Keyword Search: Split queries never match unsegmented compounds.
- Vector Search / Embeddings: Long compounds are handled as opaque units or broken into arbitrary subword pieces that ignore morpheme boundaries, which harms embedding quality and hinders semantic alignment.
- RAG Pipelines: Relevant chunks are not retrieved, which leads to incomplete context and weaker answers.
- LLM Interpretation: When the model receives unsegmented tokens, internal semantic structure is lost.
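
The keyword-search case in the list above can be illustrated in a few lines. This is a simplified sketch that assumes a matcher which lowercases and tokenizes on whitespace only; `shared_terms` is a hypothetical helper, not part of any search engine.

```python
def shared_terms(query: str, document: str) -> list[str]:
    """Terms that a whitespace-tokenizing keyword matcher sees on both sides."""
    return sorted(set(query.lower().split()) & set(document.lower().split()))


# Unsegmented product title: no term in common, so nothing matches.
print(shared_terms("Staub Sauger Beutel", "Staubsaugerbeutel"))    # []
print(shared_terms("세탁기 필터", "세탁기필터"))                      # []

# The same title after segmentation: every query term matches.
print(shared_terms("Staub Sauger Beutel", "Staub Sauger Beutel"))  # ['beutel', 'sauger', 'staub']
```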
Business Impact: In e-commerce, products remain hidden, recall drops, and conversion decreases. In enterprise search and RAG, relevant documents remain undiscovered, reducing accuracy and productivity.
A Practical Note on Decompounding
Any multilingual search or RAG system operating in German or Korean requires deterministic, high-accuracy decompounding. This is not a feature to add later; it is a foundational preprocessing layer. A proper decompounder should reliably segment forms such as:
| Original | Segmented | Translation |
|---|---|---|
| Waschmaschinenfilter | Waschmaschine Filter | Waschmaschine = “washing machine”; Filter = “filter” |
| Staubsaugerbeutel | Staubsauger Beutel | Staubsauger = “vacuum cleaner”; Beutel = “bag” |
| 세탁기필터 | 세탁기 필터 | 세탁기 = “washing machine”; 필터 = “filter” |
| 휴대폰케이스 | 휴대폰 케이스 | 휴대폰 = “mobile phone / cellphone”; 케이스 = “case” |
Segmented text leads to higher recall, more meaningful embeddings, more stable keyword and vector retrieval, and RAG systems that actually surface the right passages.
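
To make the idea of a preprocessing layer concrete, here is a minimal sketch of an inverted index that runs the same analysis step at index time and at query time. The `SEGMENTATIONS` lookup simply copies the table above and stands in for a real decompounder; `analyze` and `TinyIndex` are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Stand-in for a real decompounder: segmentations copied from the table above.
SEGMENTATIONS = {
    "waschmaschinenfilter": ["waschmaschine", "filter"],
    "staubsaugerbeutel": ["staubsauger", "beutel"],
    "세탁기필터": ["세탁기", "필터"],
    "휴대폰케이스": ["휴대폰", "케이스"],
}


def analyze(text: str) -> list[str]:
    """Lowercase, split on whitespace, then expand known compounds into their parts."""
    terms = []
    for token in text.lower().split():
        terms.extend(SEGMENTATIONS.get(token, [token]))
    return terms


class TinyIndex:
    """Inverted index that applies analyze() to documents and queries alike."""

    def __init__(self) -> None:
        self.postings = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        for term in analyze(text):
            self.postings[term].add(doc_id)

    def search(self, query: str) -> set:
        """Return the ids of documents that contain every query term."""
        terms = analyze(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = TinyIndex()
index.add("p1", "Waschmaschinenfilter")
index.add("p2", "세탁기필터")

print(index.search("Waschmaschine Filter"))  # {'p1'}
print(index.search("세탁기 필터"))             # {'p2'}
```

In this sketch the granularity of the segmentation matters: the query "Wasch Maschine Filter" returns nothing, because the lookup stops at Waschmaschine. One possible way to handle that is to index compounds at several levels of granularity, though the right choice depends on the catalogue and the query logs.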
Compounding is also common beyond German and Korean. Many other languages are affected by compounding or by related phenomena such as agglutination, including Dutch, Swedish, Norwegian (Bokmål and Nynorsk), Danish, Finnish, Russian, Ukrainian, Hungarian, Turkish, Estonian, Latvian, Lithuanian and Czech.
Conclusion
German and Korean do not break retrieval because they are unusually complex; they break retrieval because most systems still treat complex words as monolithic strings. When compounds and eojeols remain opaque, search engines cannot align queries with documents, even when both express the same meaning. Any team building multilingual search, vector search or RAG must incorporate reliable decompounding as a foundational step to avoid systematic retrieval failures.