Search systems that work well in English, Spanish or French often collapse when they encounter German compounds or Korean eojeols. The issue is not ranking quality, embedding quality, or a lack of training data. The root cause is more basic: the engine does not decompose compounds, a task that involves tokenization, morphological analysis and lemmatization, and the handling of linking elements (Fugenelemente). When a search or retrieval engine cannot see the internal structure of a word, it cannot align user queries with documents that express exactly the same meaning.
Below are examples in which the query and the product documentation contain the same lexemes and the same intent; the only difference is the morphological form. Without decompounding, retrieval fails.
Same lexemes and identical meaning, yet invisible without segmentation.
| Type | Value |
|---|---|
| Query | Wasch Maschine Filter |
| Product | Waschmaschinenfilter |
| Translation | “washing machine filter” |
Users type separated words; systems that do not split the compound fail to match.
| Type | Value |
|---|---|
| Query | Staub Sauger Beutel |
| Product | Staubsaugerbeutel |
| Translation | “vacuum cleaner bag” |
Separated input does not align with the glued compound form.
| Type | Value |
|---|---|
| Query | Kinder Wagen Zubehör |
| Product | Kinderwagenzubehör |
| Translation | “stroller accessories” |
Unless the engine identifies Tisch + Lampe(n) + Schirm, it cannot retrieve the item.
| Type | Value |
|---|---|
| Query | Tisch Lampe Schirm |
| Product | Tischlampenschirm |
| Translation | “table lamp shade” |
Both sides refer to men’s snowshoes; the retrieval failure is purely morphological.
| Type | Value |
|---|---|
| Query | Schnee Schuh Herren |
| Product | Schneeschuhherren |
| Translation | “men’s snowshoes” |
A common pattern in German catalogues and enterprise documents.
| Type | Value |
|---|---|
| Query | Bett Decke Bezug |
| Product | Bettdeckenbezug |
| Translation | “bed duvet cover” / “bed comforter cover” |
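The German cases above can be reproduced with a small dictionary-based splitter. This is a minimal sketch, not a production decompounder: the `LEXICON` contents, the `LINKERS` list and the names `split_compound` and `matches` are assumptions made for illustration, and a real system would rely on a full morphological lexicon rather than a hand-written word list.

```python
from functools import lru_cache

# Toy lexicon (assumption): only the parts needed for the examples above.
LEXICON = frozenset({
    "wasch", "maschine", "filter", "staub", "sauger", "beutel",
    "kinder", "wagen", "zubehör", "tisch", "lampe", "schirm",
    "schnee", "schuh", "herren", "bett", "decke", "bezug",
})
# Linking elements tried between two parts (assumed list); "" means no linker.
LINKERS = ("", "n", "en", "s", "es", "er", "e")


def split_compound(word: str, lexicon: frozenset = LEXICON) -> list[str]:
    """Return the split with the fewest lexicon parts, or [word] if none exists."""
    text = word.lower()

    @lru_cache(maxsize=None)
    def solve(start: int):
        if start == len(text):
            return ()
        best = None
        for end in range(len(text), start, -1):  # try the longest part first
            if text[start:end] not in lexicon:
                continue
            for linker in LINKERS:
                nxt = end + len(linker)
                if not text.startswith(linker, end):
                    continue
                if linker and nxt == len(text):
                    continue  # a linker must be followed by another part
                rest = solve(nxt)
                if rest is not None and (best is None or len(rest) + 1 < len(best)):
                    best = (text[start:end],) + rest
        return best

    return list(solve(0) or (text,))


def matches(query: str, product: str) -> bool:
    """True when the query words are exactly the parts of the product compound."""
    return {t.lower() for t in query.split()} == set(split_compound(product))


print(split_compound("Waschmaschinenfilter"))  # ['wasch', 'maschine', 'filter']
print(matches("Tisch Lampe Schirm", "Tischlampenschirm"))  # True
```

The linker step is what lets `Lampe` + `n` + `Schirm` line up with the query word `Lampe`; without it, the splitter would find no valid segmentation for that compound.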
Korean packs multiple morphemes into a single orthographic unit. If the system cannot segment the eojeol, retrieval breaks for both keyword and vector search, even when the meaning is identical.
Exact same lexemes; retrieval fails without splitting.
| Type | Value |
|---|---|
| Query | 세탁기 필터 |
| Product | 세탁기필터 |
| Translation | “washing machine filter” |
The terms exist inside the eojeol but remain unreachable.
| Type | Value |
|---|---|
| Query | 가습기 물통 |
| Product | 가습기물통 |
| Translation | “humidifier water tank” |
Without segmentation, it is treated as a single opaque token.
| Type | Value |
|---|---|
| Query | 블루투스 헤드폰 |
| Product | 블루투스헤드폰 |
| Translation | “Bluetooth headphones” |
Even simple combinations cannot match unless morphemes are exposed.
| Type | Value |
|---|---|
| Query | 기차표 가격 |
| Product | 기차표가격 |
| Translation | “train ticket price” |
The user’s intention is present but hidden inside the long unit.
| Type | Value |
|---|---|
| Query | 도어 손잡이 |
| Product | 도어손잡이 |
| Translation | “door handle” |
A recurring cause of low recall in Korean e-commerce search.
| Type | Value |
|---|---|
| Query | 휴대폰 케이스 |
| Product | 휴대폰케이스 |
| Translation | “mobile phone case” / “cellphone case” |
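The effect on the Korean examples can be measured with a simple token-overlap check. The sketch below assumes a toy noun list (`NOUNS`) and hypothetical helpers `segment_eojeol`, `tokens` and `recall`; it only covers noun-plus-noun eojeols like those above, whereas a real Korean analyzer also has to handle particles and verb endings.

```python
# Toy noun lexicon (assumption): only the nouns used in the examples above.
NOUNS = {"세탁기", "필터", "가습기", "물통", "블루투스", "헤드폰",
         "기차표", "가격", "도어", "손잡이", "휴대폰", "케이스"}


def segment_eojeol(eojeol, nouns=NOUNS):
    """Split an eojeol into known nouns; return [eojeol] if no full split exists."""
    n = len(eojeol)
    # best[i] is a fewest-parts segmentation of eojeol[:i], or None if there is none.
    best = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and eojeol[j:i] in nouns:
                candidate = best[j] + [eojeol[j:i]]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[n] if best[n] else [eojeol]


def tokens(text, segment=False):
    """Whitespace tokens, optionally with each eojeol split into nouns."""
    out = set()
    for eojeol in text.split():
        out.update(segment_eojeol(eojeol) if segment else [eojeol])
    return out


def recall(query, product, segment=False):
    """Share of query tokens that also appear among the product tokens."""
    q, p = tokens(query, segment), tokens(product, segment)
    return len(q & p) / len(q)


print(recall("세탁기 필터", "세탁기필터"))                # 0.0
print(recall("세탁기 필터", "세탁기필터", segment=True))  # 1.0
```

With whitespace tokenization alone, the product title is one token that shares nothing with the query; once the eojeol is segmented, both query terms are found.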
Why This Breaks Modern Retrieval Pipelines: Retrieval depends on aligning user input with textual content. Without decompounding, this alignment cannot happen.
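For keyword retrieval, the misalignment is easy to see in a toy inverted index. The following is a sketch under stated assumptions: `SPLITS` is a hard-coded stand-in for the output of a real decompounder, and `analyze`, `build_index` and `search` are hypothetical helper names, not the API of any particular search engine.

```python
from collections import defaultdict

# Stand-in for a real decompounder (assumption): a precomputed split table.
SPLITS = {
    "waschmaschinenfilter": ["wasch", "maschine", "filter"],
    "staubsaugerbeutel": ["staub", "sauger", "beutel"],
}


def analyze(text, decompound=False):
    """Lowercase and split on whitespace; optionally add compound parts as terms."""
    terms = []
    for token in text.lower().split():
        terms.append(token)  # always keep the original surface form
        if decompound:
            terms.extend(SPLITS.get(token, []))
    return terms


def build_index(docs, decompound=False):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text, decompound):
            index[term].add(doc_id)
    return index


def search(index, query):
    """AND query: ids of documents that contain every query term."""
    postings = [index.get(term, set()) for term in analyze(query)]
    return set.intersection(*postings) if postings else set()


docs = {1: "Waschmaschinenfilter", 2: "Staubsaugerbeutel"}
print(search(build_index(docs), "Wasch Maschine Filter"))                   # set()
print(search(build_index(docs, decompound=True), "Wasch Maschine Filter"))  # {1}
```

Indexing the parts alongside the original compound is one possible design: exact compound queries keep working, and separated queries gain postings to match against.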
Business Impact: In e-commerce, products remain hidden, recall drops, and conversion decreases. In enterprise search and RAG, relevant documents remain undiscovered, reducing accuracy and productivity.
Any multilingual search or RAG system operating in German or Korean requires deterministic, high-accuracy decompounding. This is not a feature to add later; it is a foundational preprocessing layer. A proper decompounder should reliably segment forms such as:
| Original | Segmented | Translation |
|---|---|---|
| Waschmaschinenfilter | Waschmaschine Filter | Waschmaschine = “washing machine”; Filter = “filter” |
| Staubsaugerbeutel | Staubsauger Beutel | Staubsauger = “vacuum cleaner”; Beutel = “bag” |
| 세탁기필터 | 세탁기 필터 | 세탁기 = “washing machine”; 필터 = “filter” |
| 휴대폰케이스 | 휴대폰 케이스 | 휴대폰 = “mobile phone / cellphone”; 케이스 = “case” |
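A table like this can double as a small regression set for whichever decompounder is being evaluated. The sketch below assumes the decompounder is exposed as a plain function from a word to a list of parts; `GOLD`, `exact_match_accuracy` and `identity` are illustrative names, and four items are far too few for a real evaluation.

```python
# Gold segmentations taken from the table above.
GOLD = {
    "Waschmaschinenfilter": ["Waschmaschine", "Filter"],
    "Staubsaugerbeutel": ["Staubsauger", "Beutel"],
    "세탁기필터": ["세탁기", "필터"],
    "휴대폰케이스": ["휴대폰", "케이스"],
}


def exact_match_accuracy(decompound, gold=GOLD):
    """Fraction of gold words whose predicted parts equal the expected parts."""
    hits = sum(decompound(word) == parts for word, parts in gold.items())
    return hits / len(gold)


def identity(word):
    """Baseline that never splits, i.e. the opaque-token behaviour."""
    return [word]


print(exact_match_accuracy(identity))  # 0.0
```

Any candidate decompounder can be passed in place of `identity`; note that the expected output here restores the lemma `Waschmaschine` rather than the surface string `Waschmaschinen`, so a splitter that only cuts the string without handling the linking element would score as a miss on that item.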
Segmented text leads to higher recall, more meaningful embeddings, more stable keyword and vector retrieval, and RAG systems that actually surface the right passages.
Compounding is also common beyond German and Korean. Many other languages are affected by compounding or by related phenomena such as agglutination, among them Dutch, Swedish, Norwegian (Bokmål and Nynorsk), Danish, Finnish, Russian, Ukrainian, Hungarian, Turkish, Estonian, Latvian, Lithuanian and Czech.
German and Korean do not break retrieval because they are unusually complex; they break retrieval because most systems still treat complex words as monolithic strings. When compounds and eojeols remain opaque, search engines cannot align queries with documents—even when they contain the same meaning. Any team building multilingual search, vector search or RAG must incorporate reliable decompounding as a foundational step to avoid systematic retrieval failures.