
German & Korean Retrieval Fails Without Proper Decompounding

Why decompounding is a non-optional requirement for e-commerce search, vector search, and RAG

Search systems that work well in English, Spanish, or French often collapse when they encounter German compounds or Korean eojeols. The issue is not ranking quality, embedding quality, or a lack of training data. The root cause is more basic: the engine cannot see inside the word. Splitting a compound correctly involves tokenization, morphological analysis and lemmatization, and the handling of linking elements (Fugenelemente). When a search or retrieval engine cannot see the internal structure of a word, it cannot align user queries with documents that express exactly the same meaning.

Below are examples in which the query and the product documentation contain the same lexemes and the same intent; the only difference is the morphological form. Without decompounding, retrieval still fails.

German — Pure Decompounding Failures

1. Query: Wasch Maschine Filter

Same lexemes and identical meaning, yet invisible without segmentation.

Type	Value
Query	Wasch Maschine Filter
Product	Waschmaschinenfilter
Translation	“washing machine filter”

2. Query: Staub Sauger Beutel

Users type separated words; systems that do not split the compound fail to match.

Type	Value
Query	Staub Sauger Beutel
Product	Staubsaugerbeutel
Translation	“vacuum cleaner bag”

3. Query: Kinder Wagen Zubehör

Separated input does not align with the glued compound form.

Type	Value
Query	Kinder Wagen Zubehör
Product	Kinderwagenzubehör
Translation	“stroller accessories”

4. Query: Tisch Lampe Schirm

Unless the engine identifies Tisch + Lampe(n) + Schirm, it cannot retrieve the item.

Type	Value
Query	Tisch Lampe Schirm
Product	Tischlampenschirm
Translation	“table lamp shade”

5. Query: Schnee Schuh Herren

Both sides refer to men’s snowshoes; the retrieval failure is purely for morphological reasons.

Type	Value
Query	Schnee Schuh Herren
Product	Schneeschuhherren
Translation	“men’s snowshoes”

6. Query: Bett Decke Bezug

A common pattern in German catalogues and enterprise documents.

Type	Value
Query	Bett Decke Bezug
Product	Bettdeckenbezug
Translation	“bed duvet cover” / “bed comforter cover”
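The German cases above can be reproduced with a minimal dictionary-based splitter. The sketch below is an illustration under stated assumptions, not a production decompounder: `LEXICON` and `LINKERS` are hypothetical toy resources, and a real system would rely on a full morphological lexicon and lemmatization. It tries the longest known stem first, allows an optional linking element between parts, and backtracks when a choice leads to a dead end.

```python
from functools import lru_cache

# Hypothetical toy lexicon covering only the examples in this article.
LEXICON = {
    "wasch", "maschine", "filter", "staub", "sauger", "beutel",
    "kinder", "wagen", "zubehör", "tisch", "lampe", "schirm",
    "bett", "decke", "bezug",
}
# Common German linking elements; the empty string means "no linker".
LINKERS = ("", "n", "en", "s", "es", "e")


def decompound(word):
    """Split a compound into lexicon stems, or return [word] if no full split exists."""
    word = word.lower()

    @lru_cache(maxsize=None)
    def split(start):
        if start == len(word):
            return ()
        # Try the longest candidate first, then shorter ones (backtracking).
        for end in range(len(word), start, -1):
            for linker in LINKERS:
                stem_end = end - len(linker)
                if stem_end <= start or word[stem_end:end] != linker:
                    continue
                # A linker may only appear between two parts, never at the end.
                if linker and end == len(word):
                    continue
                if word[start:stem_end] in LEXICON:
                    rest = split(end)
                    if rest is not None:
                        return (word[start:stem_end],) + rest
        return None

    parts = split(0)
    return list(parts) if parts is not None else [word]


print(decompound("Waschmaschinenfilter"))
print(decompound("Staubsaugerbeutel"))
print(decompound("Tischlampenschirm"))
```

Backtracking matters here: in Staubsaugerbeutel the "s" after "Staub" looks like a linking element, but treating it as one leaves "augerbeutel", which cannot be split, so the splitter falls back to Staub + Sauger + Beutel.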

Korean — Pure Eojeol Segmentation Failures

Korean packs multiple morphemes into a single orthographic unit. If the system cannot segment the eojeol, retrieval breaks for both keyword and vector search, even when the meaning is identical.

1. Query: 세탁기 필터

Exact same lexemes; retrieval fails without splitting.

Type	Value
Query	세탁기 필터
Product	세탁기필터
Translation	“washing machine filter”

2. Query: 가습기 물통

The terms exist inside the eojeol but remain unreachable.

Type	Value
Query	가습기 물통
Product	가습기물통
Translation	“humidifier water tank”

3. Query: 블루투스 헤드폰

Without segmentation, it is treated as a single opaque token.

Type	Value
Query	블루투스 헤드폰
Product	블루투스헤드폰
Translation	“Bluetooth headphones”

4. Query: 기차표 가격

Even simple combinations cannot match unless morphemes are exposed.

Type	Value
Query	기차표 가격
Product	기차표가격
Translation	“train ticket price”

5. Query: 도어 손잡이

The user’s intention is present but hidden inside the long unit.

Type	Value
Query	도어 손잡이
Product	도어손잡이
Translation	“door handle”

6. Query: 휴대폰 케이스

A recurring cause of low recall in Korean e-commerce search.

Type	Value
Query	휴대폰 케이스
Product	휴대폰케이스
Translation	“mobile phone case” / “cellphone case”
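For the Korean examples, one simple way to expose the parts of an unspaced eojeol is dynamic programming over a noun dictionary, preferring the segmentation with the fewest pieces. This is only a sketch: `NOUNS` is a hypothetical toy dictionary built from the examples above, the input is assumed to be NFC-normalized so that each Hangul syllable is one character, and particles and verb endings, which a real morphological analyzer must handle, are ignored.

```python
import unicodedata

# Hypothetical toy noun dictionary covering only the examples in this article.
NOUNS = {
    "세탁기", "필터", "가습기", "물통", "블루투스", "헤드폰",
    "기차표", "가격", "도어", "손잡이", "휴대폰", "케이스",
}


def segment_eojeol(eojeol):
    """Return the split with the fewest dictionary nouns, or [eojeol] if none exists."""
    eojeol = unicodedata.normalize("NFC", eojeol)
    n = len(eojeol)
    # best[i] holds the best segmentation of eojeol[:i], or None if unreachable.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = eojeol[start:end]
            if best[start] is None or piece not in NOUNS:
                continue
            candidate = best[start] + [piece]
            if best[end] is None or len(candidate) < len(best[end]):
                best[end] = candidate
    return best[n] if best[n] is not None else [eojeol]


print(segment_eojeol("세탁기필터"))
print(segment_eojeol("블루투스헤드폰"))
```

Preferring fewer pieces is a deliberate design choice in this sketch: it keeps 세탁기 intact as one lexeme instead of breaking it into smaller fragments whenever shorter dictionary entries happen to exist.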

Why This Breaks Modern Retrieval Pipelines

Retrieval depends on aligning user input with textual content. Without decompounding, this alignment cannot happen.

Keyword Search: Split queries never match unsegmented compounds.
Vector Search / Embeddings: Long compounds become single opaque tokens, harming embedding quality and preventing semantic alignment.
RAG Pipelines: Relevant chunks are not retrieved, which leads to incomplete context and weaker answers.
LLM Interpretation: When the model receives unsegmented tokens, internal semantic structure is lost.

Business Impact: In e-commerce, products remain hidden, recall drops, and conversion decreases. In enterprise search and RAG, relevant documents remain undiscovered, reducing accuracy and productivity.
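The keyword-search failure is easy to reproduce with a toy inverted index. In the sketch below, `SPLITS` is a hypothetical lookup table standing in for a real decompounder, and `search` uses plain AND semantics; both are assumptions made for illustration. The same analyzer runs at index time and at query time, and the only thing that changes between the two runs is whether compounds are split.

```python
from collections import defaultdict

# Hypothetical stand-in for a real decompounder: compound -> parts.
SPLITS = {
    "waschmaschinenfilter": ["wasch", "maschine", "filter"],
    "세탁기필터": ["세탁기", "필터"],
}


def analyze(text, decompound):
    """Lowercase, split on whitespace, and optionally expand known compounds."""
    tokens = []
    for token in text.lower().split():
        tokens.extend(SPLITS.get(token, [token]) if decompound else [token])
    return tokens


def build_index(docs, decompound):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text, decompound):
            index[token].add(doc_id)
    return index


def search(index, query, decompound):
    """Return the ids of documents that contain every query term."""
    terms = analyze(query, decompound)
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))


docs = {1: "Waschmaschinenfilter", 2: "세탁기필터"}
plain = build_index(docs, decompound=False)
split = build_index(docs, decompound=True)

print(search(plain, "Wasch Maschine Filter", decompound=False))
print(search(split, "Wasch Maschine Filter", decompound=True))
print(search(split, "세탁기 필터", decompound=True))
```

Without decompounding, the first search returns an empty set, because none of the query terms exists in the index as a token. With it, the German query finds document 1 and the Korean query finds document 2.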

A Practical Note on Decompounding

Any multilingual search or RAG system operating in German or Korean requires deterministic, high-accuracy decompounding. This is not a feature to add later; it is a foundational preprocessing layer. A proper decompounder should reliably segment forms such as:

Original	Segmented	Translation
Waschmaschinenfilter	Waschmaschine Filter	Waschmaschine = “washing machine”; Filter = “filter”
Staubsaugerbeutel	Staubsauger Beutel	Staubsauger = “vacuum cleaner”; Beutel = “bag”
세탁기필터	세탁기 필터	세탁기 = “washing machine”; 필터 = “filter”
휴대폰케이스	휴대폰 케이스	휴대폰 = “mobile phone / cellphone”; 케이스 = “case”

Segmented text leads to higher recall, more meaningful embeddings, more stable keyword and vector retrieval, and RAG systems that actually surface the right passages.
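As a rough sketch of what a "foundational preprocessing layer" might look like in code, the function below rewrites text with any pluggable segmenter before it is indexed, chunked, or embedded. The names `preprocess`, `Segmenter`, and `table_segmenter` are hypothetical, and the lookup table only mirrors the four rows shown above; a real deployment would plug in an actual decompounder behind the same interface.

```python
from typing import Callable, List

# A segmenter takes one token and returns its parts (hypothetical interface).
Segmenter = Callable[[str], List[str]]

# Toy table mirroring the four example rows above.
TABLE = {
    "Waschmaschinenfilter": ["Waschmaschine", "Filter"],
    "Staubsaugerbeutel": ["Staubsauger", "Beutel"],
    "세탁기필터": ["세탁기", "필터"],
    "휴대폰케이스": ["휴대폰", "케이스"],
}


def table_segmenter(token: str) -> List[str]:
    """Return the known parts of a token, or the token itself if unknown."""
    return TABLE.get(token, [token])


def preprocess(text: str, segmenter: Segmenter) -> str:
    """Rewrite text so that every token is replaced by its segmented parts."""
    parts: List[str] = []
    for token in text.split():
        parts.extend(segmenter(token))
    return " ".join(parts)


print(preprocess("Waschmaschinenfilter kaufen", table_segmenter))
print(preprocess("세탁기필터", table_segmenter))
```

Because the function is deterministic, applying it identically to documents and to queries keeps both sides in the same token space, which is what allows keyword and vector retrieval to line up.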

Compounding is not limited to German and Korean. Many other languages are affected by compounding and by related phenomena such as agglutination, including Dutch, Swedish, Norwegian (Bokmål and Nynorsk), Danish, Finnish, Russian, Ukrainian, Hungarian, Turkish, Estonian, Latvian, Lithuanian, and Czech.

Conclusion

German and Korean do not break retrieval because they are unusually complex; they break retrieval because most systems still treat complex words as monolithic strings. When compounds and eojeols remain opaque, search engines cannot align queries with documents—even when they contain the same meaning. Any team building multilingual search, vector search or RAG must incorporate reliable decompounding as a foundational step to avoid systematic retrieval failures.


Published by

admin

Tags: Entity extraction

4 months ago
