German & Korean Retrieval Fails Without Proper Decompounding

Why decompounding is a non-optional requirement for e-commerce search, vector search, and RAG

Search systems that work well in English, Spanish or French often collapse when they encounter German compounds or Korean eojeols. The issue is not ranking quality, not embedding quality, and not a lack of training data. The root cause is more basic: these systems do not decompound. Decompounding is itself a hard problem that combines tokenization, morphological analysis / lemmatization, and the handling of linking elements (German Fugenelemente). When a search or retrieval engine cannot see the internal structure of a word, it cannot align user queries with documents that express exactly the same meaning.

Below are examples where the query and the product documentation contain the same lexemes and the same intent; the only difference is the morphological form. Yet without decompounding, retrieval fails.
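The failure can be reproduced in a few lines. The following is a minimal sketch, assuming a naive analyzer that lowercases and splits on whitespace; the function names `tokenize` and `matched_terms` are hypothetical and not taken from any particular search engine.

```python
def tokenize(text: str) -> set[str]:
    """Lowercase and split on whitespace, as a naive analyzer would."""
    return set(text.lower().split())


def matched_terms(query: str, document: str) -> set[str]:
    """Return the query tokens that also occur as document tokens."""
    return tokenize(query) & tokenize(document)


query = "Wasch Maschine Filter"
product = "Waschmaschinenfilter"

# The compound is a single token, so none of the query tokens match it.
print(matched_terms(query, product))

# Once the document side is split into its parts, every query token matches.
print(matched_terms(query, "Wasch Maschine Filter"))
```

The first call prints an empty set even though the product name contains every query term as a substring; exact token matching never looks inside the compound.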


German — Pure Decompounding Failures

1. Query: Wasch Maschine Filter

Same lexemes and identical meaning, yet invisible without segmentation.

Type Value
Query Wasch Maschine Filter
Product Waschmaschinenfilter
Translation “washing machine filter”

2. Query: Staub Sauger Beutel

Users type separated words; systems that do not split the compound fail to match.

Type Value
Query Staub Sauger Beutel
Product Staubsaugerbeutel
Translation “vacuum cleaner bag”

3. Query: Kinder Wagen Zubehör

Separated input does not align with the glued compound form.

Type Value
Query Kinder Wagen Zubehör
Product Kinderwagenzubehör
Translation “stroller accessories”

4. Query: Tisch Lampe Schirm

Unless the engine identifies Tisch + Lampe(n) + Schirm, it cannot retrieve the item.

Type Value
Query Tisch Lampe Schirm
Product Tischlampenschirm
Translation “table lamp shade”

5. Query: Schnee Schuh Herren

Both sides refer to men’s snowshoes; the retrieval failure is purely morphological.

Type Value
Query Schnee Schuh Herren
Product Schneeschuhherren
Translation “men’s snowshoes”

6. Query: Bett Decke Bezug

A common pattern in German catalogues and enterprise documents.

Type Value
Query Bett Decke Bezug
Product Bettdeckenbezug
Translation “bed duvet cover” / “bed comforter cover”
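One common way to approach German compounds is a dictionary-based split that tolerates linking elements between parts. The sketch below is only an illustration under stated assumptions: the toy `LEXICON`, the `LINKERS` list, and the function name `decompound` are all hypothetical, and the function returns the first full split it finds rather than ranking alternatives.

```python
from typing import Optional

# Toy lexicon of lowercase base forms (an assumption for illustration);
# a real system would rely on a full morphological dictionary.
LEXICON = {
    "wasch", "maschine", "filter", "staub", "sauger", "beutel",
    "kinder", "wagen", "zubehör", "tisch", "lampe", "schirm",
    "bett", "decke", "bezug",
}

# Common German linking elements, tried in this order;
# the empty string means "no linking element".
LINKERS = ("en", "es", "er", "n", "s", "e", "")


def decompound(word: str) -> Optional[list[str]]:
    """Split a compound into lexicon entries, allowing a linking element
    after each non-final part. Returns None if no full split exists."""
    word = word.lower()
    if word in LEXICON:
        return [word]
    for i in range(1, len(word)):
        head = word[:i]
        if head not in LEXICON:
            continue
        for linker in LINKERS:
            if not word.startswith(linker, i):
                continue
            rest_start = i + len(linker)
            if rest_start >= len(word):
                continue  # a linker must be followed by another part
            rest = decompound(word[rest_start:])
            if rest is not None:
                return [head] + rest
    return None


for compound in ("Waschmaschinenfilter", "Tischlampenschirm", "Bettdeckenbezug"):
    print(compound, "->", decompound(compound))
```

Note how the linking "n" in Waschmaschinenfilter, Tischlampenschirm and Bettdeckenbezug is consumed without becoming a token of its own. Real compounds are often ambiguous, so a production decompounder also needs a way to choose between competing splits, which this sketch does not attempt.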

Korean — Pure Eojeol Segmentation Failures

Korean packs multiple morphemes into a single orthographic unit. If the system cannot segment the eojeol, retrieval breaks for both keyword and vector search, even when the meaning is identical.

1. Query: 세탁기 필터

Exact same lexemes; retrieval fails without splitting.

Type Value
Query 세탁기 필터
Product 세탁기필터
Translation “washing machine filter”

2. Query: 가습기 물통

The terms exist inside the eojeol but remain unreachable.

Type Value
Query 가습기 물통
Product 가습기물통
Translation “humidifier water tank”

3. Query: 블루투스 헤드폰

Without segmentation, it is treated as a single opaque token.

Type Value
Query 블루투스 헤드폰
Product 블루투스헤드폰
Translation “Bluetooth headphones”

4. Query: 기차표 가격

Even simple combinations cannot match unless morphemes are exposed.

Type Value
Query 기차표 가격
Product 기차표가격
Translation “train ticket price”

5. Query: 도어 손잡이

The user’s intention is present but hidden inside the long unit.

Type Value
Query 도어 손잡이
Product 도어손잡이
Translation “door handle”

6. Query: 휴대폰 케이스

A recurring cause of low recall in Korean e-commerce search.

Type Value
Query 휴대폰 케이스
Product 휴대폰케이스
Translation “mobile phone case” / “cellphone case”
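For noun sequences like the ones above, a greedy longest-match pass over a lexicon is enough to expose the parts. This is a minimal sketch, assuming a toy `LEXICON` built from the examples in this section; the name `segment` is hypothetical. Real Korean eojeols also carry particles and verb endings, which require full morphological analysis and are out of scope here.

```python
import unicodedata

# Toy lexicon taken from the examples in this section (an assumption for
# illustration); a real system would use a full morphological dictionary.
LEXICON = {
    "세탁기", "필터", "가습기", "물통", "블루투스", "헤드폰",
    "기차표", "가격", "도어", "손잡이", "휴대폰", "케이스",
}
MAX_LEN = max(len(entry) for entry in LEXICON)


def segment(eojeol: str) -> list[str]:
    """Greedy longest-match segmentation against LEXICON. Characters that
    start no known entry are kept as single-character tokens."""
    # Normalize to precomposed syllables so slicing works per syllable.
    eojeol = unicodedata.normalize("NFC", eojeol)
    parts = []
    i = 0
    while i < len(eojeol):
        # Try the longest candidate first, then shorter ones.
        for j in range(min(len(eojeol), i + MAX_LEN), i, -1):
            if eojeol[i:j] in LEXICON:
                parts.append(eojeol[i:j])
                i = j
                break
        else:
            parts.append(eojeol[i])
            i += 1
    return parts


for product in ("세탁기필터", "블루투스헤드폰", "도어손잡이"):
    print(product, "->", segment(product))
```

Greedy matching can pick the wrong boundary when one lexicon entry is a prefix of another reading, so this should be read as an illustration of the idea, not as a substitute for a morphological analyzer.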

Why This Breaks Modern Retrieval Pipelines

Retrieval depends on aligning user input with textual content. Without decompounding, this alignment cannot happen.

  • Keyword Search: Split queries never match unsegmented compounds.
  • Vector Search / Embeddings: Long compounds are handled as opaque units or broken into subword pieces that ignore morpheme boundaries, which harms embedding quality and prevents semantic alignment.
  • RAG Pipelines: Relevant chunks are not retrieved, which leads to incomplete context and weaker answers.
  • LLM Interpretation: When the model receives unsegmented tokens, internal semantic structure is lost.
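The keyword-search case in the list above can be illustrated with a tiny inverted index built twice, once with and once without index-time decompounding. This is a sketch under stated assumptions: the `SEGMENTS` table is a hypothetical stand-in for a real decompounder, and `analyze`, `build_index` and `search` are illustrative names, not the API of any search engine.

```python
from collections import defaultdict

# Hypothetical lookup table standing in for a real decompounder.
SEGMENTS = {
    "waschmaschinenfilter": ["wasch", "maschine", "filter"],
    "세탁기필터": ["세탁기", "필터"],
}


def analyze(text: str, decompound: bool) -> list[str]:
    """Whitespace tokens; with decompound=True, also emit their parts."""
    tokens = []
    for tok in text.lower().split():
        tokens.append(tok)  # keep the original form for exact matches
        if decompound:
            tokens.extend(SEGMENTS.get(tok, []))
    return tokens


def build_index(docs: dict[int, str], decompound: bool) -> dict[str, set[int]]:
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in analyze(text, decompound):
            index[tok].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """AND query: documents that contain every query token."""
    postings = [index.get(tok, set()) for tok in query.lower().split()]
    return set.intersection(*postings) if postings else set()


docs = {1: "Waschmaschinenfilter", 2: "세탁기필터"}
plain = build_index(docs, decompound=False)
split = build_index(docs, decompound=True)

print(search(plain, "Wasch Maschine Filter"))  # nothing found
print(search(split, "Wasch Maschine Filter"))  # document 1 found
print(search(split, "세탁기 필터"))             # document 2 found
```

Keeping the original token next to its parts is a design choice made here so that a user who types the full compound still gets an exact match.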

Business Impact: In e-commerce, products remain hidden, recall drops, and conversion decreases. In enterprise search and RAG, relevant documents remain undiscovered, reducing accuracy and productivity.

A Practical Note on Decompounding

Any multilingual search or RAG system operating in German or Korean requires deterministic, high-accuracy decompounding. This is not a feature to add later; it is a foundational preprocessing layer. A proper decompounder should reliably segment forms such as:

Original | Segmented | Translation
Waschmaschinenfilter | Waschmaschine Filter | Waschmaschine = “washing machine”; Filter = “filter”
Staubsaugerbeutel | Staubsauger Beutel | Staubsauger = “vacuum cleaner”; Beutel = “bag”
세탁기필터 | 세탁기 필터 | 세탁기 = “washing machine”; 필터 = “filter”
휴대폰케이스 | 휴대폰 케이스 | 휴대폰 = “mobile phone / cellphone”; 케이스 = “case”

Segmented text leads to higher recall, more meaningful embeddings, more stable keyword and vector retrieval, and RAG systems that actually surface the right passages.

Compounding is also common well beyond German and Korean. Many other languages are affected by compounding or by related phenomena such as agglutination, among them Dutch, Swedish, Norwegian (Bokmål and Nynorsk), Danish, Finnish, Russian, Ukrainian, Hungarian, Turkish, Estonian, Latvian, Lithuanian and Czech.

Conclusion

German and Korean do not break retrieval because they are unusually complex; they break retrieval because most systems still treat complex words as monolithic strings. When compounds and eojeols remain opaque, search engines cannot align queries with documents, even when both express the same meaning. Any team building multilingual search, vector search or RAG must incorporate reliable decompounding as a foundational step to avoid systematic retrieval failures.

 
