<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Bitext. We help AI understand humans. &#8211; chatbots that work</title>
	<atom:link href="https://www.bitext.com/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.bitext.com/</link>
	<description>We offer you all the tools needed to solve your NLP requests for chatbots and CX Analytics.</description>
	<lastBuildDate>Wed, 15 Apr 2026 18:26:41 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.6.5</generator>

<image>
	<url>https://www.bitext.com/wp-content/uploads/2020/04/favicon-150x150.ico</url>
	<title>Bitext. We help AI understand humans. &#8211; chatbots that work</title>
	<link>https://www.bitext.com/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Some of your RAG-related issues have an easy &#038; quick solution: lemmatization</title>
		<link>https://www.bitext.com/blog/some-of-your-rag-related-issues-have-an-easy-quick-solution-lemmatization/</link>
		
		<dc:creator><![CDATA[admin]]></dc:creator>
		<pubDate>Wed, 15 Apr 2026 14:34:11 +0000</pubDate>
				<category><![CDATA[AI]]></category>
		<category><![CDATA[Lemmatization]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[NER]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[semantic]]></category>
		<category><![CDATA[Stemming]]></category>
		<category><![CDATA[Entity extraction]]></category>
		<guid isPermaLink="false">https://www.bitext.com/?p=44378</guid>

					<description><![CDATA[<p>Some RAG issues have a simpler fix than people think: better text normalization.</p>
<p>One common culprit is stemming. Stemming is a blunt, error-prone approach: it strips word endings mechanically, without accounting for morphology, part of speech, or context. This often collapses unrelated words into the same stem simply because they look similar on the surface. The result is noisy normalization.</p>
<p>The post <a href="https://www.bitext.com/blog/some-of-your-rag-related-issues-have-an-easy-quick-solution-lemmatization/">Some of your RAG-related issues have an easy &#038; quick solution: lemmatization</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></description>
										<content:encoded><![CDATA[<div class="et_pb_section et_pb_section_0 et_pb_with_background et_section_regular" >
				
				
				
				
				
				
				<div class="et_pb_row et_pb_row_0">
				<div class="et_pb_column et_pb_column_4_4 et_pb_column_0  et_pb_css_mix_blend_mode_passthrough et-last-child">
				
				
				
				
				<div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_0  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
<div class="et_pb_text_inner"><style>
  .bitext-example-box {
    background: #fff5f5;
    border-left: 4px solid #b71c1c;
    padding: 14px 16px;
    margin: 14px 0 22px;
    border-radius: 6px;
  }
  .bitext-example-box p {
    margin: 0 0 10px;
    font-size: 16px;
    color: #333333;
    line-height: 1.6;
  }
  .bitext-example-box p:last-child {
    margin-bottom: 0;
  }
  .bitext-highlight {
    display: inline-block;
    background: #fdeaea;
    color: #b71c1c;
    font-weight: 700;
    padding: 2px 6px;
    border-radius: 4px;
  }
  .bitext-benefits {
    background: #fafafa;
    border: 1px solid #e6e6e6;
    padding: 14px 16px;
    margin: 18px 0 22px;
    border-radius: 6px;
  }
  .bitext-benefits ul {
    margin: 0;
    padding-left: 20px;
  }
  .bitext-benefits li {
    margin: 6px 0;
    font-size: 16px;
    color: #333333;
    line-height: 1.6;
  }
</style>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  Some RAG issues have a simpler fix than people think: <strong>better text normalization</strong>.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  One common culprit is <strong>stemming</strong>. Stemming is a blunt, error-prone approach: it strips word endings mechanically, without accounting for morphology, part of speech, or context. This often collapses unrelated words into the same stem simply because they look similar on the surface.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  The result is noisy normalization.
</p>
<div class="bitext-example-box">
<p><strong>For example, in English, according to the widely used Porter stemmer:</strong></p>
<p><span class="bitext-highlight">“organization”</span> is wrongly linked to <span class="bitext-highlight">“organ”</span></p>
<p><span class="bitext-highlight">“news”</span> is wrongly associated with <span class="bitext-highlight">“new”</span></p>
<p><span class="bitext-highlight">“united”</span> is wrongly connected to <span class="bitext-highlight">“unit”</span></p>
</div>
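<p style="font-size: 16px; color: #333333; line-height: 1.6;">To see how blunt this is in practice, here is a deliberately naive suffix-stripping stemmer (a toy sketch, not the actual Porter algorithm) that reproduces the same kind of collisions:</p>

```python
# A toy suffix stripper: it chops the first matching ending mechanically,
# with no morphology or part-of-speech awareness (illustration only).
SUFFIXES = ["ization", "ation", "ed", "s"]

def naive_stem(word):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(naive_stem("organization"))  # -> organ
print(naive_stem("news"))          # -> new
print(naive_stem("united"))        # -> unit
```

<p style="font-size: 16px; color: #333333; line-height: 1.6;">Unrelated words end up sharing a stem purely because of surface similarity.</p>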
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  In languages with richer morphology, such as <strong>Spanish, German, French, Italian</strong> and others, these problems get worse.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  Since stemming is performed at the beginning of the text analysis process, these errors affect every task that follows. The noise does not stay contained. It flows downstream into indexing, retrieval, and search, which means some of the “RAG problems” teams run into actually begin much earlier in the pipeline.
</p>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">
  Why lemmatization is different
</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  <strong>Lemmatization</strong> avoids these noisy associations. Instead of chopping words mechanically, lemmatization maps inflected forms to their correct dictionary form, typically using morphological analysis and part-of-speech information.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  That makes it much better at normalizing real linguistic variation while avoiding many of the false matches that stemming introduces.
</p>
<div class="bitext-example-box">
<p><strong>In the above examples:</strong></p>
<p><span class="bitext-highlight">“organizations”</span> is correctly linked to <span class="bitext-highlight">“organization”</span></p>
<p><span class="bitext-highlight">“news”</span> is <strong>not</strong> associated with <span class="bitext-highlight">“new”</span>; they are independent, unrelated words</p>
<p><span class="bitext-highlight">“united”</span> is properly connected to <span class="bitext-highlight">“unite”</span></p>
</div>
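<p style="font-size: 16px; color: #333333; line-height: 1.6;">Conceptually, lemmatization is a lookup against a morphological dictionary keyed by word form and part of speech. The tiny lexicon below is hypothetical, just to show the behavior:</p>

```python
# Minimal dictionary-based lemmatizer sketch. A real lemmatizer uses a full
# morphological lexicon; this hand-written one only covers the examples above.
LEXICON = {
    ("organizations", "NOUN"): "organization",
    ("news", "NOUN"): "news",   # stays "news": never collapsed into "new"
    ("united", "VERB"): "unite",
}

def lemmatize(word, pos):
    """Look up the lemma for (word, POS); unknown forms pass through unchanged."""
    return LEXICON.get((word, pos), word)

print(lemmatize("organizations", "NOUN"))  # -> organization
print(lemmatize("news", "NOUN"))           # -> news
```

<p style="font-size: 16px; color: #333333; line-height: 1.6;">Because the mapping is a dictionary lookup rather than a rewrite rule, unrelated words can never be merged by accident.</p>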
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  Lemmatization is also a fully deterministic, consistent, and reliable process, with clear benefits:
</p>
<div class="bitext-benefits">
<ul>
<li>fewer false positives</li>
<li>cleaner indexing</li>
<li>better retrieval quality</li>
<li>more robust multilingual search</li>
</ul>
</div>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  And since retrieval quality is critical for RAG, improving normalization upstream can have an outsized impact downstream.
</p>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">
  The real source of some RAG issues
</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  A lot of teams treat retrieval issues as if they were generation issues.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  Often, they are not.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  Sometimes the problem starts with <strong>stemming</strong>.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin-top: 20px;">For a deeper understanding of how normalization impacts search relevance, check out this post on <a href="https://www.bitext.com/blog/lemmatization-vs-stemming/" style="color: #b71c1c; font-weight: 600;">lemmatization vs stemming</a>.</p></div>
			</div>
			</div>
				
				
				
				
			</div>
				
				
			</div>
<p>The post <a href="https://www.bitext.com/blog/some-of-your-rag-related-issues-have-an-easy-quick-solution-lemmatization/">Some of your RAG-related issues have an easy &#038; quick solution: lemmatization</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>The Hidden Signal in Millions of News Articles That Reveals How Global Narratives Form</title>
		<link>https://www.bitext.com/blog/the-hidden-signal-in-millions-of-news-articles-that-reveals-how-global-narratives-form/</link>
		
		<dc:creator><![CDATA[admin]]></dc:creator>
		<pubDate>Thu, 12 Mar 2026 18:37:51 +0000</pubDate>
				<category><![CDATA[AI]]></category>
		<category><![CDATA[Generative AI]]></category>
		<category><![CDATA[knowledge graph]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[NER]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[Entity extraction]]></category>
		<guid isPermaLink="false">https://www.bitext.com/?p=44341</guid>

					<description><![CDATA[<p><strong>The Experiment</strong><br />
We tested this idea using the Leipzig English News corpora from the Wortschatz Project at Leipzig University. We analyzed datasets from 2023, 2024 and 2025.</p>
<p>Across these datasets, the pipeline processed roughly:</p>
<ul>
<li>2 million raw news articles</li>
<li>400K articles after topical filtering</li>
</ul>
<p>From these documents the pipeline extracted:</p>
<ul>
<li>millions of entity mentions</li>
<li>tens of millions of co-mention relationships</li>
</ul>
<p>To focus on economic and technology narratives, documents were filtered using the IPTC Media Topics taxonomy, keeping only:</p>
<ul>
<li>Economy, Business and Finance</li>
<li>Science and Technology</li>
</ul>
<p>The post <a href="https://www.bitext.com/blog/the-hidden-signal-in-millions-of-news-articles-that-reveals-how-global-narratives-form/">The Hidden Signal in Millions of News Articles That Reveals How Global Narratives Form</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></description>
										<content:encoded><![CDATA[<div class="et_pb_section et_pb_section_1 et_pb_with_background et_section_regular" >
				
				
				
				
				
				
				<div class="et_pb_row et_pb_row_1">
				<div class="et_pb_column et_pb_column_4_4 et_pb_column_1  et_pb_css_mix_blend_mode_passthrough et-last-child">
				
				
				
				
				<div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_14  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
<div class="et_pb_text_inner"><style>
  /* Bitext table styling */
  table.bitext-table {
    width:100%;
    border-collapse:collapse;
    font-size:15px;
    margin:10px 0 22px;
  }
  table.bitext-table th {
    background-color:#b71c1c !important;  /* Bitext red */
    color:#ffffff !important;
    padding:8px 10px;
    border:1px solid #9c1515;
    text-align:left;
  }
  table.bitext-table td {
    padding:8px 10px;
    border:1px solid #e0e0e0;
    color:#333333;
  }
  table.bitext-table tr:nth-child(even) td {
    background-color:#fafafa;
  }
</style>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Every day, millions of news articles are published about technology, business and geopolitics.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">But there is a signal hidden inside them that most analytics systems completely miss.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">It isn’t in what the articles say.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">It’s in which entities appear together.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Once you start measuring that signal, you can see how global narratives form.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">This phenomenon is called co-mentions, and it is widely used in knowledge graph construction and large-scale text analysis.</p>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">Why Co-mentions Matter</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Counting mentions tells you which entities are important.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">But co-mentions tell you something far more valuable: how those entities are connected.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">That distinction is crucial.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">For example: AI might appear in thousands of articles.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">But if AI increasingly appears alongside Nvidia, something deeper is happening. It reveals a narrative forming:</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin-left: 20px;"><strong>AI infrastructure → Nvidia</strong></p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Similarly, when AI increasingly appears together with the US or China, the story changes. AI is no longer just a technology topic. It has become a geopolitical one.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Co-mentions allow us to detect these narrative shifts early – before they become obvious.</p>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">The Experiment</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">We tested this idea using the Leipzig English News corpora from the Wortschatz Project at Leipzig University. We analyzed datasets from 2023, 2024 and 2025.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Across these datasets, the pipeline processed roughly:</p>
<ul style="font-size: 16px; color: #333333; line-height: 1.6; padding-left: 20px; margin-top: 10px;">
<li>2 million raw news articles</li>
<li>400K articles after topical filtering</li>
</ul>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">From these documents the pipeline extracted:</p>
<ul style="font-size: 16px; color: #333333; line-height: 1.6; padding-left: 20px; margin-top: 10px;">
<li>millions of entity mentions</li>
<li>tens of millions of co-mention relationships</li>
</ul>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">To focus on economic and technology narratives, documents were filtered using the IPTC Media Topics taxonomy, keeping only:</p>
<ul style="font-size: 16px; color: #333333; line-height: 1.6; padding-left: 20px; margin-top: 10px;">
<li>Economy, Business and Finance</li>
<li>Science and Technology</li>
</ul>
<table class="bitext-table">
<tbody>
<tr>
<th>Dataset Scope</th>
<th>Approximate Volume</th>
</tr>
<tr>
<td>Raw news articles processed</td>
<td>2 million</td>
</tr>
<tr>
<td>Articles after topical filtering</td>
<td>400K</td>
</tr>
<tr>
<td>Entity mentions extracted</td>
<td>Millions</td>
</tr>
<tr>
<td>Co-mention relationships generated</td>
<td>Tens of millions</td>
</tr>
</tbody>
</table>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">How the Analysis Works</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">The pipeline combines entity extraction with graph analysis:</p>
<ol style="font-size: 16px; color: #333333; line-height: 1.6; padding-left: 22px; margin-top: 10px;">
<li>Entity recognition using the Bitext NLP SDK (companies, countries, technologies)</li>
<li>Entity normalization (e.g. “US”, “United States”, “America” → United States)</li>
<li>Extraction of relationships between entities appearing in the same document</li>
<li>Aggregation of co-mentions across the corpus</li>
</ol>
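<p style="font-size: 16px; color: #333333; line-height: 1.6;">The normalization step can be pictured as an alias lookup. The alias table below is hypothetical (the Bitext NLP SDK maintains its own canonical entity inventory); it only illustrates the idea:</p>

```python
# Hypothetical alias table mapping surface forms to canonical entities.
ALIASES = {
    "US": "United States",
    "U.S.": "United States",
    "America": "United States",
}

def normalize(mention):
    """Map a surface form to its canonical entity; unknown forms pass through."""
    return ALIASES.get(mention, mention)

print(normalize("America"))  # -> United States
print(normalize("Nvidia"))   # -> Nvidia
```
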
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Relationships are generated by linking entities that appear in the same document, producing weighted co-mention edges.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">For example, if a document mentions US, China, Nvidia and AI, the system generates relationships such as:</p>
<ul style="font-size: 16px; color: #333333; line-height: 1.6; padding-left: 20px; margin-top: 10px;">
<li>US – China</li>
<li>US – AI</li>
<li>China – AI</li>
<li>Nvidia – AI</li>
</ul>
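<p style="font-size: 16px; color: #333333; line-height: 1.6;">As a rough sketch, pairing and aggregating co-mentions across documents looks like this (the per-document entity sets are hand-written stand-ins for real NER output):</p>

```python
from collections import Counter
from itertools import combinations

# Illustrative per-document entity sets; in the real pipeline these
# come from the entity recognition and normalization steps.
docs_entities = [
    {"US", "China", "Nvidia", "AI"},
    {"Nvidia", "AI"},
    {"US", "AI"},
]

edges = Counter()
for entities in docs_entities:
    # Every unordered pair of entities in the same document is one co-mention.
    for pair in combinations(sorted(entities), 2):
        edges[pair] += 1

print(edges[("AI", "Nvidia")])  # -> 2 (weight of the AI-Nvidia edge)
```

<p style="font-size: 16px; color: #333333; line-height: 1.6;">The resulting weighted edges are exactly what gets loaded into the knowledge graph.</p>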
<table class="bitext-table">
<tbody>
<tr>
<th>Pipeline Step</th>
<th>What It Does</th>
</tr>
<tr>
<td>Entity recognition</td>
<td>Extracts companies, countries, technologies and other entities from text</td>
</tr>
<tr>
<td>Normalization</td>
<td>Maps variants such as “US” and “America” to a canonical entity</td>
</tr>
<tr>
<td>Relationship extraction</td>
<td>Links entities appearing in the same document</td>
</tr>
<tr>
<td>Aggregation</td>
<td>Builds weighted co-mention patterns across the corpus</td>
</tr>
</tbody>
</table>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">From Text to Knowledge Graph</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">When these relationships are aggregated across hundreds of thousands of articles, they form a knowledge graph that reveals patterns in global narratives.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Even a tiny fragment already tells a story:</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin-left: 20px;"><strong>AI → Nvidia → U.S. → China</strong></p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Technology → infrastructure → geopolitics.</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Input</th>
<th>Transformation</th>
<th>Output</th>
</tr>
<tr>
<td>Unstructured news text</td>
<td>Entity extraction + co-mention analysis</td>
<td>Knowledge graph of entities and relationships</td>
</tr>
</tbody>
</table>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">Why This Matters</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Most of the world&#8217;s knowledge still lives in unstructured text. But once entities and relationships are extracted at scale, that text can be transformed into structured knowledge graphs ready for analysis.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">These graphs integrate naturally with platforms such as Neo4j, Stardog, Ontotext and MarkLogic, where the extracted entities and relationships can be explored and analyzed.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">In short: <strong>text → entities → relationships → knowledge graph</strong></p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">And once the graph exists, hidden signals start to appear.</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Stage</th>
<th>Result</th>
</tr>
<tr>
<td>Text</td>
<td>Raw unstructured articles</td>
</tr>
<tr>
<td>Entities</td>
<td>Normalized companies, countries, technologies and other concepts</td>
</tr>
<tr>
<td>Relationships</td>
<td>Weighted co-mentions between entities</td>
</tr>
<tr>
<td>Knowledge graph</td>
<td>Structured narrative map ready for analysis</td>
</tr>
</tbody>
</table>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">In Summary</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Co-mentions are one of the simplest signals you can extract from text.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">But at scale, they reveal how the world connects ideas, companies and countries.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">What other signals do you think could be extracted from large-scale news analysis?</p></div>
			</div>
			</div>
				
				
				
				
			</div>
				
				
			</div>
<p>The post <a href="https://www.bitext.com/blog/the-hidden-signal-in-millions-of-news-articles-that-reveals-how-global-narratives-form/">The Hidden Signal in Millions of News Articles That Reveals How Global Narratives Form</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Why LLMs Are the Wrong Tool for Enterprise-Grade Entity Extraction</title>
		<link>https://www.bitext.com/blog/why-llms-are-the-wrong-tool-for-enterprise-grade-entity-extraction/</link>
		
		<dc:creator><![CDATA[admin]]></dc:creator>
		<pubDate>Thu, 05 Feb 2026 15:18:01 +0000</pubDate>
				<category><![CDATA[AI]]></category>
		<category><![CDATA[Generative AI]]></category>
		<category><![CDATA[knowledge graph]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[NER]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[Entity extraction]]></category>
		<guid isPermaLink="false">https://www.bitext.com/?p=44325</guid>

					<description><![CDATA[<p>Large Language Models are powerful systems for language generation and reasoning.<br />
However, when they are used for entity extraction in enterprise environments, they introduce instability where reliability is required.<br />
Entity extraction is not about creativity or interpretation. It is infrastructure. In production systems, entities must be extracted in a way that is consistent, repeatable, and stable over time.</p>
<p>Tagging consistency is essential for smooth training. Contradictions and inconsistencies not only decrease accuracy but also generate hidden MLOps costs when debugging and fixing errors. Consistency is often taken for granted, yet it is rarely achieved, in these datasets or in any other manual tagging work.</p>
<p>Consistency starts with a solid, clear definition of what an entity is, and typically, if not always, that definition is missing.</p>
<p>And knowledge graphs are built using automatic data extraction tools: not only entity extraction but also concept extraction and relationships among entities or concepts.</p>
<p>The post <a href="https://www.bitext.com/blog/why-llms-are-the-wrong-tool-for-enterprise-grade-entity-extraction/">Why LLMs Are the Wrong Tool for Enterprise-Grade Entity Extraction</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></description>
										<content:encoded><![CDATA[<div class="et_pb_section et_pb_section_2 et_pb_with_background et_section_regular" >
				
				
				
				
				
				
				<div class="et_pb_row et_pb_row_2">
				<div class="et_pb_column et_pb_column_4_4 et_pb_column_2  et_pb_css_mix_blend_mode_passthrough et-last-child">
				
				
				
				
				<div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_28  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><style>
  /* Bitext table styling */
  table.bitext-table {
    width:100%;
    border-collapse:collapse;
    font-size:15px;
    margin:10px 0 22px;
  }
  table.bitext-table th {
    background-color:#b71c1c !important;  /* Bitext red */
    color:#ffffff !important;
    padding:8px 10px;
    border:1px solid #9c1515;
    text-align:left;
  }
  table.bitext-table td {
    padding:8px 10px;
    border:1px solid #e0e0e0;
    color:#333333;
  }
  table.bitext-table tr:nth-child(even) td {
    background-color:#fafafa;
  }
</style>
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">Entity Extraction Is an Infrastructure Task, Not a Generative Task</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  Large Language Models are powerful systems for language generation and reasoning. However, when they are used for entity extraction in enterprise environments, they introduce instability where reliability is required.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  Entity extraction is not about creativity or interpretation. It is infrastructure. In production systems, entities must be extracted in a way that is consistent, repeatable, and stable over time.
</p>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">Why Probabilistic Models Break Deterministic Enterprise Pipelines</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  In enterprise workflows, the same input must always produce the same entities. LLMs are probabilistic by design. Even with temperature set to zero, their outputs can change due to prompt phrasing, surrounding context, or model updates.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  This variability is incompatible with systems that require long-term guarantees, such as search platforms, analytics pipelines, compliance systems, or enterprise RAG architectures.
</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Enterprise Requirement</th>
<th>LLM Behavior</th>
<th>Impact</th>
</tr>
<tr>
<td>Same input → same output</td>
<td>Outputs can vary across runs</td>
<td>Breaks repeatability and auditability</td>
</tr>
<tr>
<td>Long-term guarantees</td>
<td>Model updates can change behavior</td>
<td>Pipeline drift over time</td>
</tr>
<tr>
<td>Stable extraction contracts</td>
<td>Sensitive to prompts/context</td>
<td>Hidden regressions in production</td>
</tr>
</tbody>
</table>
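<p style="font-size: 16px; color: #333333; line-height: 1.6;">The repeatability requirement in the first row can be enforced as an automated gate. Below is a minimal Python sketch; <code>extract_entities</code> here is only a hypothetical stand-in for whichever extractor is under test, and the gazetteer entries are illustrative:</p>

```python
import hashlib
import json

# Hypothetical deterministic extractor standing in for the system under test;
# the gazetteer entries are purely illustrative.
GAZETTEER = {"Bitext": "COMPANY", "GDPR": "REGULATION"}

def extract_entities(text):
    return [{"text": tok, "type": GAZETTEER[tok]}
            for tok in text.split() if tok in GAZETTEER]

def output_fingerprint(text):
    # Canonical serialization, then hash: any change in the output is detectable.
    payload = json.dumps(extract_entities(text), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def is_repeatable(text, runs=5):
    # Same input must yield the same fingerprint on every run.
    return len({output_fingerprint(text) for _ in range(runs)}) == 1

print(is_repeatable("Bitext must comply with GDPR"))  # True
```

<p style="font-size: 16px; color: #333333; line-height: 1.6;">The same fingerprint check, run against stored baselines, also catches the pipeline drift and hidden regressions described in the other two rows.</p>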
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">The Problem with “Interpretation” in Entity Classification</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  Enterprises do not need models that interpret what an entity might be. They need invariant behavior.
</p>
<ul style="font-size: 16px; color: #333333; line-height: 1.6; padding-left: 20px; margin-top: 10px;">
<li>A company name should always be classified as a company.</li>
<li>A regulation reference should never disappear because the model decided it was not important in that context.</li>
</ul>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  LLMs optimize for plausibility. Enterprise systems require strict rules and predictable outcomes.
</p>
<table class="bitext-table">
<tbody>
<tr>
<th>What Enterprises Need</th>
<th>What LLMs Optimize For</th>
</tr>
<tr>
<td>Invariant classification</td>
<td>Plausible interpretation</td>
</tr>
<tr>
<td>Predictable outputs</td>
<td>Context-dependent responses</td>
</tr>
<tr>
<td>Auditable behavior</td>
<td>Emergent, hard-to-verify behavior</td>
</tr>
</tbody>
</table>
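<p style="font-size: 16px; color: #333333; line-height: 1.6;">Invariant classification can be as simple as a lookup that never consults context. A toy sketch (names and labels are illustrative, not a real Bitext rule set):</p>

```python
# Invariant classification: the label depends only on the entry itself,
# never on the surrounding context. Entries here are illustrative.
ENTITY_RULES = {
    "acme corp": "COMPANY",
    "regulation (eu) 2016/679": "REGULATION",
}

def classify(span):
    # Case-insensitive exact lookup; unknown spans are explicitly rejected
    # rather than guessed, so the system fails conservatively.
    return ENTITY_RULES.get(span.lower(), "UNKNOWN")

# The same span gets the same label in any context, on any run.
print(classify("Acme Corp"))                 # COMPANY
print(classify("Regulation (EU) 2016/679"))  # REGULATION
print(classify("something unexpected"))      # UNKNOWN
```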
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">Hallucinated Entities Corrupt Downstream Systems</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  One of the most dangerous failure modes of LLM-based entity extraction is hallucinated structure. LLMs can infer entities that are not explicitly present, normalize them incorrectly, or over-generalize across domains.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  In downstream systems such as search indexes, knowledge graphs, analytics, or RAG pipelines, these hallucinated entities silently corrupt data.
</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Failure Mode</th>
<th>What Happens</th>
<th>Downstream Risk</th>
</tr>
<tr>
<td>Hallucinated entity</td>
<td>Entity appears without textual evidence</td>
<td>Polluted index / KG nodes</td>
</tr>
<tr>
<td>Incorrect normalization</td>
<td>Wrong canonical form or mapping</td>
<td>Broken linking &#038; analytics</td>
</tr>
<tr>
<td>Over-generalization</td>
<td>Entities merged across domains</td>
<td>False positives in retrieval</td>
</tr>
</tbody>
</table>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  Deterministic NLP systems tend to fail conservatively. LLMs fail confidently.
</p>
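<p style="font-size: 16px; color: #333333; line-height: 1.6;">A common defence against the first failure mode in the table is an evidence filter: accept only entities whose surface form literally occurs in the source text. A minimal sketch (the candidate list and field names are illustrative):</p>

```python
def ground_entities(candidates, source_text):
    """Keep only candidate entities whose surface form literally occurs
    in the source text; anything else is treated as hallucinated."""
    grounded, rejected = [], []
    for ent in candidates:
        if ent["text"] in source_text:
            grounded.append(ent)
        else:
            rejected.append(ent)
    return grounded, rejected

source = "Acme Corp must comply with Regulation (EU) 2016/679."
candidates = [
    {"text": "Acme Corp", "type": "COMPANY"},
    {"text": "Acme Corporation GmbH", "type": "COMPANY"},  # invented surface form
]
grounded, rejected = ground_entities(candidates, source)
print([e["text"] for e in grounded])  # ['Acme Corp']
print([e["text"] for e in rejected])  # ['Acme Corporation GmbH']
```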
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">Why LLMs Are a Poor Fit for High-Volume Entity Extraction at Scale</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  Entity extraction workloads are typically high-volume, low-latency, and CPU-friendly. Using LLMs for large-scale extraction introduces GPU dependency, variable latency, and unpredictable operational costs.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  This cost structure does not make sense when deterministic NLP systems can perform the same task faster, cheaper, and with zero variance.
</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Operational Dimension</th>
<th>Deterministic NLP</th>
<th>LLM-Based Extraction</th>
</tr>
<tr>
<td>Latency</td>
<td>Predictable</td>
<td>Variable</td>
</tr>
<tr>
<td>Cost</td>
<td>Stable, CPU-efficient</td>
<td>Unpredictable, often GPU-bound</td>
</tr>
<tr>
<td>Scaling</td>
<td>Linear &#038; controllable</td>
<td>Operationally complex</td>
</tr>
<tr>
<td>Variance</td>
<td>Zero</td>
<td>Non-zero</td>
</tr>
</tbody>
</table>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">When LLMs Do Make Sense in Enterprise Architectures</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  LLMs are extremely effective after entity extraction, not instead of it.
</p>
<ul style="font-size: 16px; color: #333333; line-height: 1.6; padding-left: 20px;">
<li><strong>Search platforms:</strong> deterministic NLP should extract and normalize entities before indexing. LLMs can then generate summaries, explanations, or conversational answers over clean, structured data.</li>
<li><strong>RAG systems:</strong> deterministic extraction ensures stable entities and metadata for retrieval. LLMs can reason over that context without inventing structure.</li>
<li><strong>Compliance and regulatory monitoring:</strong> deterministic NLP guarantees that organizations, legal references, and domain terms are always captured. LLMs can then explain changes or summarize impact.</li>
<li><strong>Analytics and knowledge graphs:</strong> deterministic extraction ensures consistent nodes and relationships. LLMs can sit on top as an insight or exploration layer, not as the source of truth.</li>
</ul>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">The Right Architecture: Deterministic NLP First, LLMs on Top</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  The most robust enterprise architectures separate concerns clearly. Deterministic NLP is responsible for structure, normalization, and linguistic guarantees. LLMs are responsible for reasoning, synthesis, and interaction.
</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Layer</th>
<th>Responsibility</th>
<th>Guarantee</th>
</tr>
<tr>
<td>Deterministic NLP</td>
<td>Structure, normalization, extraction</td>
<td>Stable, repeatable outputs</td>
</tr>
<tr>
<td>LLMs</td>
<td>Reasoning, synthesis, interaction</td>
<td>Helpful language generation</td>
</tr>
<tr>
<td>Rule of thumb</td>
<td>Consume structure</td>
<td>Do not invent structure</td>
</tr>
</tbody>
</table>
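<p style="font-size: 16px; color: #333333; line-height: 1.6;">The table's division of labour can be sketched as a two-layer pipeline. In this sketch, <code>answer_with_llm</code> is only a placeholder for a real generation call; the key property is that it consumes extracted structure and never produces it:</p>

```python
# Illustrative gazetteer; entries are not a real rule set.
GAZETTEER = {"Acme Corp": "COMPANY", "GDPR": "REGULATION"}

def deterministic_extract(text):
    """Layer 1: structure. Stable, repeatable, rule-based extraction."""
    return [{"text": k, "type": v} for k, v in GAZETTEER.items() if k in text]

def answer_with_llm(question, entities):
    """Layer 2: language. Placeholder for a real LLM call; it consumes
    the extracted structure but is never allowed to invent it."""
    labels = ", ".join(f"{e['text']} ({e['type']})" for e in entities)
    return f"Answer to {question!r} grounded in entities: {labels}"

doc = "Acme Corp must comply with GDPR."
entities = deterministic_extract(doc)  # the source of truth
print(answer_with_llm("Who must comply?", entities))
```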
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">Enterprise-Grade Entity Extraction Requires Determinism</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  LLMs are extraordinary tools, but they are not universal ones. If your system must be predictable, auditable, and stable over time, entity extraction should remain deterministic.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  That is how enterprise-grade systems stay reliable as they scale.
</p></div>
</div>
			</div>
				
				
				
				
			</div>
				
				
			</div>
<p>The post <a href="https://www.bitext.com/blog/why-llms-are-the-wrong-tool-for-enterprise-grade-entity-extraction/">Why LLMs Are the Wrong Tool for Enterprise-Grade Entity Extraction</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>German &#038; Korean Retrieval Fails Without Proper Decompounding</title>
		<link>https://www.bitext.com/blog/german-korean-retrieval-fails-without-proper-decompounding/</link>
		
		<dc:creator><![CDATA[admin]]></dc:creator>
		<pubDate>Mon, 08 Dec 2025 15:27:25 +0000</pubDate>
				<category><![CDATA[AI]]></category>
		<category><![CDATA[Generative AI]]></category>
		<category><![CDATA[knowledge graph]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[NER]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[Entity extraction]]></category>
		<guid isPermaLink="false">https://www.bitext.com/?p=44162</guid>

					<description><![CDATA[<p>German and Korean do not break retrieval because they are unusually complex; they break retrieval because most systems still treat complex words as monolithic strings. When compounds and eojeols remain opaque, search engines cannot align queries with documents—even when they contain the same meaning. Any team building multilingual search, vector search or RAG must incorporate reliable decompounding as a foundational step to avoid systematic retrieval failures.</p>
<p>The post <a href="https://www.bitext.com/blog/german-korean-retrieval-fails-without-proper-decompounding/">German &#038; Korean Retrieval Fails Without Proper Decompounding</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></description>
										<content:encoded><![CDATA[<div class="et_pb_section et_pb_section_3 et_pb_with_background et_section_regular" >
				
				
				
				
				
				
				<div class="et_pb_row et_pb_row_3">
				<div class="et_pb_column et_pb_column_4_4 et_pb_column_3  et_pb_css_mix_blend_mode_passthrough et-last-child">
				
				
				
				
				<div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_42  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
<div class="et_pb_text_inner"><style>
  /* Bitext table styling */
  table.bitext-table {
    width:100%;
    border-collapse:collapse;
    font-size:15px;
    margin:10px 0 22px;
  }
  table.bitext-table th {
    background-color:#b71c1c !important;  /* Bitext red */
    color:#ffffff !important;
    padding:8px 10px;
    border:1px solid #9c1515;
    text-align:left;
  }
  table.bitext-table td {
    padding:8px 10px;
    border:1px solid #e0e0e0;
    color:#333333;
  }
  table.bitext-table tr:nth-child(even) td {
    background-color:#fafafa;
  }
</style>
<h2 style="font-size: 26px; color: #333333; margin-bottom: 20px; font-weight: bold;">Why decompounding is a non-optional requirement for e-commerce search, vector search, and RAG</h2>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Search systems that work well in English, Spanish or French often collapse when they encounter German compounds or Korean eojeols. The issue is not ranking quality, not embedding quality, and not a lack of training data. The root cause is more fundamental: these languages pack meaning into compounds, and handling them requires tokenization, morphological analysis / lemmatization, and linking elements (German Fugenelemente). When a search or retrieval engine cannot see the internal structure of a word, it cannot align user queries with documents that contain the exact same meaning.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Below are examples where the query and the product documentation contain the same lexemes and the same intent; the only difference is the morphological form. Yet without decompounding, retrieval fails.</p>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h2 style="font-size: 24px; color: #424242; margin: 24px 0 12px; font-weight: bold;">German — Pure Decompounding Failures</h2>
<p><!-- 1 --></p>
<h3 style="font-size: 18px; color: #333333; font-weight: bold; margin-top: 18px; margin-bottom: 6px;">1. Query: Wasch Maschine Filter</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin: 0 0 6px 0;">Same lexemes and identical meaning, yet invisible without segmentation.</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Type</th>
<th>Value</th>
</tr>
<tr>
<td>Query</td>
<td>Wasch Maschine Filter</td>
</tr>
<tr>
<td>Product</td>
<td>Waschmaschinenfilter</td>
</tr>
<tr>
<td>Translation</td>
<td>“washing machine filter”</td>
</tr>
</tbody>
</table>
<p><!-- 2 --></p>
<h3 style="font-size: 18px; color: #333333; font-weight: bold; margin-top: 18px; margin-bottom: 6px;">2. Query: Staub Sauger Beutel</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin: 0 0 6px 0;">Users type separated words; systems that do not split the compound fail to match.</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Type</th>
<th>Value</th>
</tr>
<tr>
<td>Query</td>
<td>Staub Sauger Beutel</td>
</tr>
<tr>
<td>Product</td>
<td>Staubsaugerbeutel</td>
</tr>
<tr>
<td>Translation</td>
<td>“vacuum cleaner bag”</td>
</tr>
</tbody>
</table>
<p><!-- 3 --></p>
<h3 style="font-size: 18px; color: #333333; font-weight: bold; margin-top: 18px; margin-bottom: 6px;">3. Query: Kinder Wagen Zubehör</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin: 0 0 6px 0;">Separated input does not align with the glued compound form.</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Type</th>
<th>Value</th>
</tr>
<tr>
<td>Query</td>
<td>Kinder Wagen Zubehör</td>
</tr>
<tr>
<td>Product</td>
<td>Kinderwagenzubehör</td>
</tr>
<tr>
<td>Translation</td>
<td>“stroller accessories”</td>
</tr>
</tbody>
</table>
<p><!-- 4 --></p>
<h3 style="font-size: 18px; color: #333333; font-weight: bold; margin-top: 18px; margin-bottom: 6px;">4. Query: Tisch Lampe Schirm</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin: 0 0 6px 0;">Unless the engine identifies Tisch + Lampe(n) + Schirm, it cannot retrieve the item.</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Type</th>
<th>Value</th>
</tr>
<tr>
<td>Query</td>
<td>Tisch Lampe Schirm</td>
</tr>
<tr>
<td>Product</td>
<td>Tischlampenschirm</td>
</tr>
<tr>
<td>Translation</td>
<td>“table lamp shade”</td>
</tr>
</tbody>
</table>
<p><!-- 5 --></p>
<h3 style="font-size: 18px; color: #333333; font-weight: bold; margin-top: 18px; margin-bottom: 6px;">5. Query: Schnee Schuh Herren</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin: 0 0 6px 0;">Both sides refer to men’s snowshoes; the retrieval failure is purely morphological.</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Type</th>
<th>Value</th>
</tr>
<tr>
<td>Query</td>
<td>Schnee Schuh Herren</td>
</tr>
<tr>
<td>Product</td>
<td>Schneeschuhherren</td>
</tr>
<tr>
<td>Translation</td>
<td>“men’s snowshoes”</td>
</tr>
</tbody>
</table>
<p><!-- 6 --></p>
<h3 style="font-size: 18px; color: #333333; font-weight: bold; margin-top: 18px; margin-bottom: 6px;">6. Query: Bett Decke Bezug</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin: 0 0 6px 0;">A common pattern in German catalogues and enterprise documents.</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Type</th>
<th>Value</th>
</tr>
<tr>
<td>Query</td>
<td>Bett Decke Bezug</td>
</tr>
<tr>
<td>Product</td>
<td>Bettdeckenbezug</td>
</tr>
<tr>
<td>Translation</td>
<td>“bed duvet cover” / “bed comforter cover”</td>
</tr>
</tbody>
</table>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h2 style="font-size: 24px; color: #424242; margin: 24px 0 12px; font-weight: bold;">Korean — Pure Eojeol Segmentation Failures</h2>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Korean packs multiple morphemes into a single orthographic unit. If the system cannot segment the eojeol, retrieval breaks for both keyword and vector search, even when the meaning is identical.</p>
<p><!-- KR 1 --></p>
<h3 style="font-size: 18px; color: #333333; font-weight: bold; margin-top: 18px; margin-bottom: 6px;">1. Query: 세탁기 필터</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin: 0 0 6px 0;">Exact same lexemes; retrieval fails without splitting.</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Type</th>
<th>Value</th>
</tr>
<tr>
<td>Query</td>
<td>세탁기 필터</td>
</tr>
<tr>
<td>Product</td>
<td>세탁기필터</td>
</tr>
<tr>
<td>Translation</td>
<td>“washing machine filter”</td>
</tr>
</tbody>
</table>
<p><!-- KR 2 --></p>
<h3 style="font-size: 18px; color: #333333; font-weight: bold; margin-top: 18px; margin-bottom: 6px;">2. Query: 가습기 물통</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin: 0 0 6px 0;">The terms exist inside the eojeol but remain unreachable.</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Type</th>
<th>Value</th>
</tr>
<tr>
<td>Query</td>
<td>가습기 물통</td>
</tr>
<tr>
<td>Product</td>
<td>가습기물통</td>
</tr>
<tr>
<td>Translation</td>
<td>“humidifier water tank”</td>
</tr>
</tbody>
</table>
<p><!-- KR 3 --></p>
<h3 style="font-size: 18px; color: #333333; font-weight: bold; margin-top: 18px; margin-bottom: 6px;">3. Query: 블루투스 헤드폰</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin: 0 0 6px 0;">Without segmentation, it is treated as a single opaque token.</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Type</th>
<th>Value</th>
</tr>
<tr>
<td>Query</td>
<td>블루투스 헤드폰</td>
</tr>
<tr>
<td>Product</td>
<td>블루투스헤드폰</td>
</tr>
<tr>
<td>Translation</td>
<td>“Bluetooth headphones”</td>
</tr>
</tbody>
</table>
<p><!-- KR 4 --></p>
<h3 style="font-size: 18px; color: #333333; font-weight: bold; margin-top: 18px; margin-bottom: 6px;">4. Query: 기차표 가격</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin: 0 0 6px 0;">Even simple combinations cannot match unless morphemes are exposed.</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Type</th>
<th>Value</th>
</tr>
<tr>
<td>Query</td>
<td>기차표 가격</td>
</tr>
<tr>
<td>Product</td>
<td>기차표가격</td>
</tr>
<tr>
<td>Translation</td>
<td>“train ticket price”</td>
</tr>
</tbody>
</table>
<p><!-- KR 5 --></p>
<h3 style="font-size: 18px; color: #333333; font-weight: bold; margin-top: 18px; margin-bottom: 6px;">5. Query: 도어 손잡이</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin: 0 0 6px 0;">The user’s intention is present but hidden inside the long unit.</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Type</th>
<th>Value</th>
</tr>
<tr>
<td>Query</td>
<td>도어 손잡이</td>
</tr>
<tr>
<td>Product</td>
<td>도어손잡이</td>
</tr>
<tr>
<td>Translation</td>
<td>“door handle”</td>
</tr>
</tbody>
</table>
<p><!-- KR 6 --></p>
<h3 style="font-size: 18px; color: #333333; font-weight: bold; margin-top: 18px; margin-bottom: 6px;">6. Query: 휴대폰 케이스</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin: 0 0 6px 0;">A recurring cause of low recall in Korean e-commerce search.</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Type</th>
<th>Value</th>
</tr>
<tr>
<td>Query</td>
<td>휴대폰 케이스</td>
</tr>
<tr>
<td>Product</td>
<td>휴대폰케이스</td>
</tr>
<tr>
<td>Translation</td>
<td>“mobile phone case” / “cellphone case”</td>
</tr>
</tbody>
</table>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h2 style="font-size: 24px; color: #424242; margin: 24px 0 12px; font-weight: bold;">Why This Breaks Modern Retrieval Pipelines</h2>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Retrieval depends on aligning user input with textual content. Without decompounding, this alignment cannot happen.</p>
<ul style="font-size: 16px; color: #333333; line-height: 1.6; padding-left: 20px;">
<li><strong>Keyword Search:</strong> Split queries never match unsegmented compounds.</li>
<li><strong>Vector Search / Embeddings:</strong> Long compounds become single opaque tokens, harming embedding quality and preventing semantic alignment.</li>
<li><strong>RAG Pipelines:</strong> Relevant chunks are not retrieved, which leads to incomplete context and weaker answers.</li>
<li><strong>LLM Interpretation:</strong> When the model receives unsegmented tokens, internal semantic structure is lost.</li>
</ul>
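<p style="font-size: 16px; color: #333333; line-height: 1.6;">The keyword-search failure in the first bullet is easy to reproduce: with plain whitespace tokenization, the split query and the compound share no tokens at all. A small demonstration (note that decompounding the document side already recovers a shared content token; full alignment also needs query-side analysis):</p>

```python
def tokens(text):
    # Plain whitespace tokenization, as in a naive keyword index.
    return set(text.lower().split())

query = "Wasch Maschine Filter"
document = "Waschmaschinenfilter"

# No shared tokens, so the document is never retrieved.
print(tokens(query) & tokens(document))  # set()

# After decompounding the document side, a content token matches:
decompounded = "Waschmaschine Filter"
print(tokens(query) & tokens(decompounded))  # {'filter'}
```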
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Business Impact: In e-commerce, products remain hidden, recall drops, and conversion decreases. In enterprise search and RAG, relevant documents remain undiscovered, reducing accuracy and productivity.</p>
<h2 style="font-size: 24px; color: #424242; margin: 24px 0 12px; font-weight: bold;">A Practical Note on Decompounding</h2>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Any multilingual search or RAG system operating in German or Korean requires deterministic, high-accuracy decompounding. This is not a feature to add later; it is a foundational preprocessing layer. A proper decompounder should reliably segment forms such as:</p>
<table class="bitext-table">
<tbody>
<tr>
<th>Original</th>
<th>Segmented</th>
<th>Translation</th>
</tr>
<tr>
<td>Waschmaschinenfilter</td>
<td>Waschmaschine Filter</td>
<td>Waschmaschine = “washing machine”<br />Filter = “filter”</td>
</tr>
<tr>
<td>Staubsaugerbeutel</td>
<td>Staubsauger Beutel</td>
<td>Staubsauger = “vacuum cleaner” Beutel = “bag”</td>
</tr>
<tr>
<td>세탁기필터</td>
<td>세탁기 필터</td>
<td>세탁기 = “washing machine” 필터 = “filter”</td>
</tr>
<tr>
<td>휴대폰케이스</td>
<td>휴대폰 케이스</td>
<td>휴대폰 = “mobile phone / cellphone” 케이스 = “case”</td>
</tr>
</tbody>
</table>
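<p style="font-size: 16px; color: #333333; line-height: 1.6;">The segmentation behaviour shown in the table can be sketched as a greedy longest-match split against a vocabulary. This toy version handles only the single German linking element <em>-n-</em> and uses a hand-picked vocabulary taken from the examples above; a production decompounder needs a full morphological lexicon:</p>

```python
# Toy decompounder: greedy longest-prefix split against a hand-picked
# vocabulary. Purely illustrative; a production system needs a full
# morphological lexicon and proper handling of all linking elements.
VOCAB = {"waschmaschine", "filter", "staubsauger", "beutel",
         "세탁기", "필터", "휴대폰", "케이스"}

def decompound(word, vocab=VOCAB):
    w = word.lower()

    def split(i):
        if i == len(w):
            return []
        # Try the longest vocabulary prefix first.
        for j in range(len(w), i, -1):
            if w[i:j] in vocab:
                rest = split(j)
                if rest is not None:
                    return [w[i:j]] + rest
                # Absorb a single German linking element -n- (Fugenelement).
                if j < len(w) and w[j] == "n":
                    rest = split(j + 1)
                    if rest is not None:
                        return [w[i:j]] + rest
        return None

    return split(0) or [w]

print(decompound("Waschmaschinenfilter"))  # ['waschmaschine', 'filter']
print(decompound("세탁기필터"))            # ['세탁기', '필터']
```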
<p style="font-size: 16px; color: #333333; line-height: 1.6;">Segmented text leads to higher recall, more meaningful embeddings, more stable keyword and vector retrieval, and RAG systems that actually surface the right passages.</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6; margin-top: 18px;">Additionally, compounding is not limited to German and Korean; many other languages show compounding and related phenomena such as agglutination: Dutch, Swedish, Norwegian Bokmål / Nynorsk, Danish, Finnish, Russian, Ukrainian, Hungarian, Turkish, Estonian, Latvian, Lithuanian and Czech, among others.</p>
<h2 style="font-size: 24px; color: #424242; margin: 24px 0 12px; font-weight: bold;">Conclusion</h2>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">German and Korean do not break retrieval because they are unusually complex; they break retrieval because most systems still treat complex words as monolithic strings. When compounds and eojeols remain opaque, search engines cannot align queries with documents—even when they contain the same meaning. Any team building multilingual search, vector search or RAG must incorporate reliable decompounding as a foundational step to avoid systematic retrieval failures.</p></div>
</div>
			</div>
				
				
				
				
			</div>
				
				
			</div>
<p>The post <a href="https://www.bitext.com/blog/german-korean-retrieval-fails-without-proper-decompounding/">German &#038; Korean Retrieval Fails Without Proper Decompounding</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Lemmatization vs Stemming</title>
		<link>https://www.bitext.com/blog/lemmatization-vs-stemming/</link>
					<comments>https://www.bitext.com/blog/lemmatization-vs-stemming/#respond</comments>
		
		<dc:creator><![CDATA[admin]]></dc:creator>
		<pubDate>Mon, 17 Nov 2025 00:06:30 +0000</pubDate>
				<category><![CDATA[AI]]></category>
		<category><![CDATA[Lemmatization]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[text analysis]]></category>
		<category><![CDATA[artificial training data]]></category>
		<category><![CDATA[synthetic data]]></category>
		<category><![CDATA[synthetic text]]></category>
		<category><![CDATA[synthetic training data]]></category>
		<guid isPermaLink="false">https://www.bitext.com/?p=39051</guid>

					<description><![CDATA[<p>Almost all of us use a search engine in our daily working routine; it has become a key tool to get our tasks done.</p>
<p>The post <a href="https://www.bitext.com/blog/lemmatization-vs-stemming/">Lemmatization vs Stemming</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></description>
										<content:encoded><![CDATA[<div class="et_pb_section et_pb_section_4 et_pb_with_background et_section_regular" >
				
				
				
				
				
				
				<div class="et_pb_row et_pb_row_4">
				<div class="et_pb_column et_pb_column_4_4 et_pb_column_4  et_pb_css_mix_blend_mode_passthrough et-last-child">
				
				
				
				
				<div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_56  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
<div class="et_pb_text_inner"><style>
  .bitext-example-box {
    background: #fff5f5;
    border-left: 4px solid #b71c1c;
    padding: 14px 16px;
    margin: 14px 0 22px;
    border-radius: 6px;
  }
  .bitext-example-box p {
    margin: 0 0 10px;
    font-size: 16px;
    color: #333333;
    line-height: 1.6;
  }
  .bitext-example-box p:last-child {
    margin-bottom: 0;
  }
  .bitext-highlight {
    display: inline-block;
    background: #fdeaea;
    color: #b71c1c;
    font-weight: 700;
    padding: 2px 6px;
    border-radius: 4px;
  }
  .bitext-benefits {
    background: #fafafa;
    border: 1px solid #e6e6e6;
    padding: 14px 16px;
    margin: 18px 0 22px;
    border-radius: 6px;
  }
  .bitext-benefits ul {
    margin: 0;
    padding-left: 20px;
  }
  .bitext-benefits li {
    margin: 6px 0;
    font-size: 16px;
    color: #333333;
    line-height: 1.6;
  }
</style>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  Almost all of us use a search engine in our daily work. It has become a key tool to get things done.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  However, as the amount of data grows exponentially, providing high-quality results that truly match user queries becomes more complex.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  One of the issues that complicates this process is <strong>ambiguous words</strong>.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  These are terms that have different meanings depending on their role in the sentence.
</p>
<div class="bitext-example-box">
<p><strong>Example:</strong></p>
<p><span class="bitext-highlight">“Let’s take a five-minute break in this meeting.”</span></p>
<p><span class="bitext-highlight">“This vase made of glass can break easily.”</span></p>
<p>
    In both sentences we use <span class="bitext-highlight">“break”</span>, but with different meanings:<br />
    as a noun in the first case, and as a verb in the second.
  </p>
</div>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  When working with large datasets, this ambiguity introduces noise. Search results may include documents that match the same word form, but not the intended meaning.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  Some results are relevant, but many are not. This noise slows down the user and reduces search precision.
</p>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">
  Why ambiguity gets worse in multilingual environments<br />
</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  Ambiguity may not be the biggest issue in English, but it becomes much more critical in highly inflected languages such as <strong>French, Spanish or Polish</strong>.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  These languages rely heavily on:
</p>
<ul style="font-size: 16px; color: #333333; line-height: 1.6; padding-left: 20px;">
<li>declensions</li>
<li>adjective and noun inflections</li>
<li>pronoun variations</li>
</ul>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  This makes normalization much more complex and much more important.
</p>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">
  How normalization affects search<br />
</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  When a user enters a query, the system must normalize both the query and the indexed data so they can match correctly.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  There are two main approaches:
</p>
<div class="bitext-example-box">
<p><strong>Lemmatization</strong></p>
<p>Maps a word to its correct dictionary form based on its usage and context.</p>
<p><strong>Stemming</strong></p>
<p>Removes characters from the end of a word using predefined rules, without understanding context.</p>
</div>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  In weakly inflected languages, the choice may not significantly impact results.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  But in highly inflected languages, the normalization method directly determines the accuracy of search results.
</p>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">
  Why lemmatization performs better<br />
</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  The main advantage of lemmatization is that it takes context into account to determine the intended meaning of a word.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  This reduces ambiguity and significantly decreases noise in search results.
</p>
<div class="bitext-benefits">
<ul>
<li>more precise matching</li>
<li>less noise in results</li>
<li>better handling of ambiguity</li>
<li>faster and more efficient user experience</li>
</ul>
</div>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  In practice, when dealing with ambiguous words, stemming often produces the same root for different meanings, while lemmatization preserves the distinction between them.
</p>
<hr style="border: 0; border-top: 1px solid #e0e0e0; margin: 24px 0;" />
<h3 style="font-size: 20px; color: #424242; margin: 18px 0 10px; font-weight: bold;">
  In summary<br />
</h3>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  Ambiguity is a fundamental challenge in search, especially in multilingual and highly inflected environments.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  Choosing the right normalization strategy makes a significant difference in the quality of the results.
</p>
<p style="font-size: 16px; color: #333333; line-height: 1.6;">
  And in many cases, improving normalization upstream is the simplest way to improve search performance overall.
</p></div>
			</div>
			</div>
				
				
				
				
			</div>
				
				
			</div>
<p>The post <a href="https://www.bitext.com/blog/lemmatization-vs-stemming/">Lemmatization vs Stemming</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.bitext.com/blog/lemmatization-vs-stemming/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>The Moment to Pay Attention to Hybrid NLP (Symbolic + ML)</title>
		<link>https://www.bitext.com/blog/the-moment-to-pay-attention-to-hybrid-nlp-symbolic-ml/</link>
		
		<dc:creator><![CDATA[admin]]></dc:creator>
		<pubDate>Fri, 07 Nov 2025 19:29:45 +0000</pubDate>
				<category><![CDATA[AI]]></category>
		<category><![CDATA[Generative AI]]></category>
		<category><![CDATA[knowledge graph]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[NER]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[Entity extraction]]></category>
		<guid isPermaLink="false">https://www.bitext.com/?p=44142</guid>

					<description><![CDATA[<p>Problem. There’s broad consensus today: LLMs are phenomenal personal productivity tools — they draft, summarize, and assist effortlessly.<br />
But there’s also growing recognition that they’re still not ready for enterprise-grade deployment.</p>
<p>The post <a href="https://www.bitext.com/blog/the-moment-to-pay-attention-to-hybrid-nlp-symbolic-ml/">The Moment to Pay Attention to Hybrid NLP (Symbolic + ML)</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></description>
										<content:encoded><![CDATA[<div class="et_pb_section et_pb_section_5 et_pb_with_background et_section_regular" >
				
				
				
				
				
				
				<div class="et_pb_row et_pb_row_5">
				<div class="et_pb_column et_pb_column_4_4 et_pb_column_5  et_pb_css_mix_blend_mode_passthrough et-last-child">
				
				
				
				
				<div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_57  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p>Problem. There’s broad consensus today: LLMs are phenomenal personal productivity tools — they draft, summarize, and assist effortlessly.<br />But there’s also growing recognition that they’re still not ready for enterprise-grade deployment.</p>
<p>Why? Because enterprises need more than good prose. They need structured, reliable, explainable data — not probabilistic text. An LLM that hallucinates a CEO name or mislabels a supplier can break compliance, contracts, and trust.</p>
<p>Solution. The way forward is to extract key data and structure it as Knowledge Graphs (KGs). These graphs become the <em>backbone knowledge</em> that LLMs can safely reason over — grounding their outputs in verified, linked data.</p>
<p>This architectural shift is emerging under the GraphRAG and NodeRAG paradigms:</p>
<ul>
<li>GraphRAG: retrieval-augmented generation where context comes from <em>relationships</em> between entities in a graph (not flat embeddings).</li>
<li>NodeRAG: fine-grained RAG where specific <em>nodes</em> and their properties are retrieved as context for the model.</li>
</ul>
<p><strong>Example:</strong></p>
<blockquote style="border-left: 4px solid #888; padding-left: 15px; margin-left: 0;">
<p>Instead of asking an LLM “Who supplies lithium to Tesla?” and hoping it guesses right, a GraphRAG pipeline retrieves verified entities and relations:</p>
<p>Tesla —[supplier]→ Albemarle Corporation —[product]→ Lithium hydroxide</p>
</blockquote>
<p>The LLM then uses this context to generate a grounded, auditable response.</p>
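<p>This retrieval step can be sketched as a traversal over stored triples; the graph content and the hop logic below are illustrative assumptions, not a real GraphRAG implementation:</p>

```python
# Minimal sketch of graph-based retrieval for RAG. The triples below are
# illustrative sample data, not verified facts.
TRIPLES = [
    ("Tesla", "supplier", "Albemarle Corporation"),
    ("Albemarle Corporation", "product", "Lithium hydroxide"),
    ("Tesla", "headquarters", "Austin"),
]

def retrieve_subgraph(entity, triples, hops=2):
    """Collect triples reachable from `entity` within `hops` steps."""
    frontier, found = {entity}, []
    for _ in range(hops):
        next_frontier = set()
        for s, p, o in triples:
            if s in frontier and (s, p, o) not in found:
                found.append((s, p, o))
                next_frontier.add(o)
        frontier = next_frontier
    return found

# Turn the retrieved structure into grounded context for the LLM prompt.
context = "\n".join(
    f"{s} -[{p}]-> {o}" for s, p, o in retrieve_subgraph("Tesla", TRIPLES)
)
print(context)
```

<p>The point of the sketch: the context handed to the model is a set of explicit, auditable edges rather than whatever the model happens to remember.</p>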
<p>Challenge. Building these knowledge graphs manually is impossible at enterprise scale.<br />To populate them, we need (semi-)automated extraction pipelines that are:</p>
<ul>
<li>Accurate — 90%+ precision/recall for entity and relation detection,</li>
<li>Performant — capable of processing millions of documents per day,</li>
<li>Ubiquitous — deployable on-prem, in cloud, or hybrid setups,</li>
<li>Portable — running equally well on Windows, Linux, and ARM environments.</li>
</ul>
<p>Current LLMs can’t meet these constraints. They are resource-hungry, unpredictable, and non-deterministic. Enterprise knowledge graphs need precision and reproducibility, not probabilistic outputs.</p>
<p>That’s where Symbolic NLP — combined with efficient ML components — steps in. Rule-based and morphology-aware engines can deterministically extract entities, relations, and attributes, feeding clean data into a knowledge graph layer.</p>
<p><strong>Example:</strong></p>
<blockquote style="border-left: 4px solid #888; padding-left: 15px; margin-left: 0;">
<p>Symbolic NLP can reliably parse <em>“Generalversammlung der Vereinten Nationen”</em> as Organization: United Nations General Assembly, recognizing inflection and structure without hallucination. An LLM might miss that entirely or translate it inconsistently.</p>
</blockquote>
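<p>A deterministic, lexicon-driven extractor can be sketched as longest-match lookup; the lexicon entries and their inflected variants below are assumptions for illustration:</p>

```python
import re

# Sketch of deterministic, lexicon-driven entity extraction.
# The lexicon entries (including inflected variants) are illustrative only.
LEXICON = {
    "Generalversammlung der Vereinten Nationen":
        ("ORGANIZATION", "United Nations General Assembly"),
    "Vereinte Nationen": ("ORGANIZATION", "United Nations"),
    "Vereinten Nationen": ("ORGANIZATION", "United Nations"),  # inflected form
}

def extract_entities(text):
    """Longest-match lookup: the same input always yields the same output."""
    results = []
    # Try longer lexicon keys first so the full name wins over its parts.
    for surface in sorted(LEXICON, key=len, reverse=True):
        for m in re.finditer(re.escape(surface), text):
            span = (m.start(), m.end())
            if not any(s < span[1] and span[0] < e for s, e, *_ in results):
                results.append((span[0], span[1], *LEXICON[surface]))
    return sorted(results)

text = "Die Generalversammlung der Vereinten Nationen tagt in New York."
print(extract_entities(text))
```

<p>Because the extraction is a pure function of the rules and the input, results are reproducible across runs and environments, which is exactly what a knowledge-graph pipeline needs.</p>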
<p>Even Microsoft acknowledges this reality in their internal taxonomy of retrieval architectures. They now distinguish between:</p>
<ul>
<li>Standard GraphRAG — LLM-driven pipelines, flexible but slow and opaque;</li>
<li>FastGraphRAG — deterministic and efficient symbolic/ML pipelines that pre-compute structure for high throughput. <a href="https://microsoft.github.io/graphrag/index/methods/">Microsoft FastGraphRAG reference</a></li>
</ul>
<p>The trend is clear: the future of enterprise AI lies in combining symbolic precision with generative flexibility.</p></div>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_58  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p>Bitext is releasing a new suite of Symbolic NLP engines designed for this hybrid AI architecture:</p>
<ul>
<li>Speed: 3.2 MB of plain text per second on an 8-core CPU — no GPU needed.</li>
<li>Accuracy: Over 90% F1 measured on standard multilingual benchmark corpora.</li>
<li>Compatibility: Runs on Windows, Linux, and ARM; deployable locally or in cloud pipelines.</li>
</ul>
<p>Conclusion. The industry is shifting from “prompting models” to building structured knowledge backbones.<br />Symbolic NLP isn’t old-school anymore — it’s the precision machinery that makes enterprise AI trustworthy, explainable, and scalable.</p>
<p>Now is the moment to pay attention to NLP.</p></div>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_59  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_60 landing-page-list  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_61 landing-page-list  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_62  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner">&nbsp;</p>
<p>&nbsp;</div>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_63  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_image et_pb_image_12">
				
				
				
				
				<span class="et_pb_image_wrap "></span>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_64  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_65  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_image et_pb_image_13">
				
				
				
				
				<span class="et_pb_image_wrap "></span>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_66  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_67  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_image et_pb_image_14">
				
				
				
				
				<span class="et_pb_image_wrap "></span>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_68  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_69  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_70  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_code et_pb_code_4">
				
				
				
				
				
			</div>
			</div>
				
				
				
				
			</div>
				
				
			</div>
<p>The post <a href="https://www.bitext.com/blog/the-moment-to-pay-attention-to-hybrid-nlp-symbolic-ml/">The Moment to Pay Attention to Hybrid NLP (Symbolic + ML)</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Using Public Corpora to Build Your NER systems</title>
		<link>https://www.bitext.com/blog/using-public-corpora-to-build-your-ner-systems-post/</link>
		
		<dc:creator><![CDATA[admin]]></dc:creator>
		<pubDate>Mon, 20 Oct 2025 08:06:57 +0000</pubDate>
				<category><![CDATA[AI]]></category>
		<category><![CDATA[Generative AI]]></category>
		<category><![CDATA[knowledge graph]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[NER]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[Entity extraction]]></category>
		<guid isPermaLink="false">https://www.bitext.com/?p=43998</guid>

					<description><![CDATA[<p>Rationale. NER tools are at the heart of how the scientific community is solving LLM issues using GraphRAG and NodeRAG architectures.</p>
<p>LLMs need knowledge graphs to control hallucinations and make them more solid for enterprise-level use.</p>
<p>And knowledge graphs are built using automatic data extraction tools: not only entity extraction but also concept extraction and relationships among entities or concepts.</p>
<p>The post <a href="https://www.bitext.com/blog/using-public-corpora-to-build-your-ner-systems-post/">Using Public Corpora to Build Your NER systems</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></description>
										<content:encoded><![CDATA[<div class="et_pb_section et_pb_section_6 et_pb_with_background et_section_regular" >
				
				
				
				
				
				
				<div class="et_pb_row et_pb_row_6">
				<div class="et_pb_column et_pb_column_4_4 et_pb_column_6  et_pb_css_mix_blend_mode_passthrough et-last-child">
				
				
				
				
				<div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_71  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><div style="font-family: Verdana, Geneva, sans-serif; font-size: 19px; line-height: 1.7; color: #555; font-weight: 400;">
<p><strong>Rationale.</strong> NER tools are at the heart of how the scientific community is solving LLM issues using GraphRAG and NodeRAG architectures.</p>
<p>LLMs need knowledge graphs to control hallucinations and make them more solid for enterprise-level use.</p>
<p>And knowledge graphs are built using automatic data extraction tools: not only entity extraction but also concept extraction and relationships among entities or concepts.</p>
<p><strong>Open-Source Tools.</strong> When starting an Entity Extraction project, it’s typical to start by leveraging open-source, machine-learning-based tools.</p>
<p>Open-source tools are widespread and adapt to different levels of execution, from POC to production-ready; Hugging Face, Spark NLP and spaCy are typical examples.</p>
<p><strong>Open-Source Data.</strong> These tools rely on third-party datasets for model training and evaluation, typically manually tagged corpora with NER information (Person, Place, Organization, Company…).</p>
<p>Developing new data is expensive and complex, which is why most projects avoid producing their own tagged data.</p>
<p>Therefore, the main alternative to get started is a combination of open-source tools and data. OntoNotes and CoNLL are good examples of this type of dataset for English.</p>
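<p>Working with these corpora typically starts by reading their token-per-line format. A minimal reader for CoNLL-style BIO tags might look like this (the sample sentence is a hand-made illustration in that format):</p>

```python
# Minimal reader for CoNLL-style "token TAB tag" data with BIO labels.
# The sample sentence below is a hand-made illustration.
SAMPLE = """\
U.N.\tB-ORG
official\tO
Ekeus\tB-PER
heads\tO
for\tO
Baghdad\tB-LOC
"""

def read_bio(lines):
    """Turn BIO-tagged tokens into (entity_text, entity_type) spans."""
    spans, current_tokens, current_type = [], [], None
    for line in lines:
        if not line.strip():
            continue
        token, tag = line.split("\t")
        if tag.startswith("B-") or tag == "O":
            if current_tokens:                     # close the open entity
                spans.append((" ".join(current_tokens), current_type))
                current_tokens, current_type = [], None
            if tag.startswith("B-"):
                current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):                 # continue the open entity
            current_tokens.append(token)
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans

print(read_bio(SAMPLE.splitlines()))
# [('U.N.', 'ORG'), ('Ekeus', 'PER'), ('Baghdad', 'LOC')]
```

<p>The same reader works for both training data preparation and for comparing gold spans against system output during evaluation.</p>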
<p><strong>Data is Critical.</strong> These datasets are used for two critical purposes:</p>
<ul style="margin-left: 20px; padding-left: 0;">
<li>for training, i.e. building the core of our NER tool</li>
<li>for evaluation, i.e. determining if our project is a success and can be used in public settings</li>
</ul>
<p><strong>Data as a Black Box?</strong> These datasets are open, meaning anyone can examine the text and the tagging. However, they are often treated as “black boxes”: used to build NER models without much analysis or understanding of their weaknesses and the implications of those weaknesses. (We will not focus on their strengths, which are well known to the community; that is why these datasets are so popular.)</p>
<p>In this series of posts, we are going to try and make those black boxes more transparent, drawing on our experience in using them at Bitext for evaluation purposes.</p>
<p>We will identify areas where the datasets can be improved and will provide some tips on how to avoid these issues, whenever possible with (semi-)automatic techniques.</p>
<p><strong>First, we classify the different types of issues into 3 groups:</strong></p>
<ol style="margin-left: 20px; list-style-position: outside;">
<li><strong>Training issues:</strong> common types of inconsistencies, both in gold (manual) and silver (semi-automatic) datasets — more on this in future posts.</li>
<li><strong>Evaluation:</strong> how misleading it can be to use the same corpus for training and evaluation.</li>
<li><strong>Deployment issues:</strong> licensing has a strong impact when moving from POC to production.</li>
</ol>
</div></div>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_72  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_73  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_74 landing-page-list  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_75 landing-page-list  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p><span data-olk-copy-source="MessageBody">Next Post: <a href="https://www.bitext.com/blog/open-source-data-and-training-issues/">“Open-Source Data and Training Issues”</a></span></p></div>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_76  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner">&nbsp;</p>
<p>&nbsp;</div>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_77  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_image et_pb_image_15">
				
				
				
				
				<span class="et_pb_image_wrap "></span>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_78  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_79  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_image et_pb_image_16">
				
				
				
				
				<span class="et_pb_image_wrap "></span>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_80  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_81  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_image et_pb_image_17">
				
				
				
				
				<span class="et_pb_image_wrap "></span>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_82  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_83  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_84  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_code et_pb_code_5">
				
				
				
				
				
			</div>
			</div>
				
				
				
				
			</div>
				
				
			</div>
<p>The post <a href="https://www.bitext.com/blog/using-public-corpora-to-build-your-ner-systems-post/">Using Public Corpora to Build Your NER systems</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Open-Source Data and Training Issues</title>
		<link>https://www.bitext.com/blog/open-source-data-and-training-issues-post/</link>
		
		<dc:creator><![CDATA[admin]]></dc:creator>
		<pubDate>Mon, 20 Oct 2025 08:04:06 +0000</pubDate>
				<category><![CDATA[AI]]></category>
		<category><![CDATA[Generative AI]]></category>
		<category><![CDATA[knowledge graph]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[NER]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[Entity extraction]]></category>
		<guid isPermaLink="false">https://www.bitext.com/?p=44007</guid>

					<description><![CDATA[<p>As described in our previous post “Using Public Corpora to Build Your NER systems”, we are going to highlight areas where public datasets like OntoNotes or CoNLL can be improved. We will provide some tips on how to avoid these issues, whenever possible, using (semi-)automatic techniques.</p>
<p>Tagging consistency is essential to ensure that training is smooth. Contradictions and inconsistencies not only decrease accuracy but also generate hidden costs in MLOps when trying to debug and fix errors. We often take this consistency for granted, but that is rarely the case, not only in these datasets but also in any other manual tagging work.</p>
<p>Consistency starts with having a solid and clear definition of what an entity is. Typically, if not always, that’s not the case.</p>
<p>And knowledge graphs are built using automatic data extraction tools: not only entity extraction but also concept extraction and relationships among entities or concepts.</p>
<p>The post <a href="https://www.bitext.com/blog/open-source-data-and-training-issues-post/">Open-Source Data and Training Issues</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></description>
										<content:encoded><![CDATA[<div class="et_pb_section et_pb_section_7 et_pb_with_background et_section_regular" >
				
				
				
				
				
				
				<div class="et_pb_row et_pb_row_7">
				<div class="et_pb_column et_pb_column_4_4 et_pb_column_7  et_pb_css_mix_blend_mode_passthrough et-last-child">
				
				
				
				
				<div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_85  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p>As described in our previous post “Using Public Corpora to Build Your NER systems”, we are going to highlight areas where public datasets like <strong>OntoNotes</strong> or <strong>CoNLL</strong> can be improved. We will provide some tips on how to avoid these issues, whenever possible, using (semi-)automatic techniques.</p>
<p>Tagging consistency is essential to ensure that training is smooth. Contradictions and inconsistencies not only decrease accuracy but also generate hidden costs in MLOps when trying to debug and fix errors. We often take this consistency for granted, but that is rarely the case, not only in these datasets but also in any other manual tagging work.</p>
<p>Consistency starts with having a solid and clear definition of what an entity is. Typically, if not always, that’s not the case.</p>
<p><strong>Entities vs Non-Entities.</strong> What’s an entity anyway? The definition of “entity” is a cornerstone for a NER project and should be 100% clear if we are automating the detection of entities, but this is not always the case.</p>
<p>For example, in <strong>WikiNEuRal</strong>, a well-known multilingual set of corpora, entities like “MVP” (Most Valuable Player) or “DJ” (Disc Jockey) are not tagged. In our view, they should be tagged – in this case as <strong>PERSON</strong>:</p>
<p><strong>Example in Spanish tagging:</strong></p>
<blockquote style="border-left: 4px solid #888; padding-left: 15px; margin-left: 0;">
<p><strong>Input Sentence:</strong> “En 1980 y 1983 fue elegido como el MVP en toda Europa” (“In 1980 and 1983 he was chosen as the MVP in all of Europe”)<br />
  <strong>Gold Tagging:</strong> Europa:LOCATION (MVP-missing)</p>
</blockquote>
<p><strong>Example in Portuguese:</strong></p>
<blockquote style="border-left: 4px solid #888; padding-left: 15px; margin-left: 0;">
<p><strong>Input Sentence:</strong> Esse estilo era exclusivamente um fenômeno de Chicago , mas em 1987 virou febre no Reino Unido e na Europa Continental , sendo muito tocado por Djs . (“This style was exclusively a Chicago phenomenon, but in 1987 it became a craze in the United Kingdom and Continental Europe, being played heavily by DJs.”)<br />
  <strong>Gold Tagging:</strong> Chicago:LOCATION Reino Unido:LOCATION Europa Continental:LOCATION (Djs-missing)</p>
</blockquote>
<p>This same problem happens with other corpora, such as the <strong>UNER Swedish PUD</strong> corpus:</p>
<p><strong>Example in Swedish: Entity “Paris Agreement” should be tagged as MISCELLANEOUS</strong></p>
<blockquote style="border-left: 4px solid #888; padding-left: 15px; margin-left: 0;">
<p><strong>Input Sentence:</strong> Det är fantastiskt att de fick Parisavtalet men deras insatser är för tillfället inte i närheten av målet på 1,5 grader. (“It is fantastic that they got the Paris Agreement, but their efforts are currently nowhere near the 1.5-degree target.”)<br />
  <strong>Gold Tagging:</strong> (Parisavtalet-missing)</p>
</blockquote>
<p><strong>Example in Swedish: Entity “Brexit” should be tagged as MISCELLANEOUS</strong></p>
<blockquote style="border-left: 4px solid #888; padding-left: 15px; margin-left: 0;">
<p><strong>Input Sentence:</strong> May har fått stor kritik för att ha undvikit och inte svarat öppet till media efter rättsutlåtandet om Brexit. (“May has received heavy criticism for avoiding and not responding openly to the media after the legal opinion on Brexit.”)<br />
  <strong>Gold Tagging:</strong> May:PERSON (Brexit-missing)</p>
</blockquote>
<p>And similar cases occur across other languages and corpora:</p>
<p><strong>Example in Russian (in WikiNEuRal Russian): “Альмохады” (“Almohads”) not tagged as MISCELLANEOUS</strong></p>
<blockquote style="border-left: 4px solid #888; padding-left: 15px; margin-left: 0;">
<p><strong>Input Sentence:</strong> В 1130-е годы Альмохады расширяли своё влияние в горных областях Марокко , в восточных и южных районах страны . (“In the 1130s the Almohads expanded their influence in the mountainous regions of Morocco, in the eastern and southern parts of the country.”)<br />
  <strong>Gold Tagging:</strong> Марокко:LOCATION (Альмохады-missing)</p>
</blockquote>
<p><strong>Example in Korean (in KLUE): “인권센터는” (“Human Rights Center”) not tagged as ORGANIZATION</strong></p>
<blockquote style="border-left: 4px solid #888; padding-left: 15px; margin-left: 0;">
<p><strong>Input Sentence:</strong> 시 인권센터는 민간조사전문가 1 명을 포함한 사건조사팀을 구성 , 21 일간 신청인과 참고인 , 피신청인 16 명에 대한 진술조사와 현장조사를 한 결과 이같이 판정했다고 30 일 밝혔다 . (“The city Human Rights Center announced on the 30th that it had formed a case investigation team including one private investigation expert and reached this ruling after 21 days of statement and on-site investigations of the applicant, witnesses, and 16 respondents.”)<br />
  <strong>Gold Tagging:</strong> (인권센터는-missing)</p>
</blockquote>
<p>This same problem happens with many other entities, often of type <strong>MISCELLANEOUS</strong>: GDP (Gross Domestic Product), DVD, Blu-ray, VHS… The list is long and not documented in any corpus as far as we know.</p>
<p><strong>A Possible Solution.</strong> For languages that use capitalization (like English, Spanish…), the fix involves a significant amount of work. To detect untagged entities, we need to extract all capitalized strings from the corpus, separate out the ones that carry no label, and check them, either manually (the safest way) or against gazetteers, to shortcut the task. The main complication, though not the only one, is that in many languages the first word of a sentence is always capitalized, even when it is a regular word.</p>
<p>For languages that do not use capital letters (Arabic, Korean, Chinese, Japanese…) the solution is even harder; it would involve checking the corpus without the help of capitalization.</p>
<p>Given that this solution involves significant work, a good shortcut for all languages is to compile a list of the most relevant entities we need to tag and make sure they are tagged in our training corpora. This is not a perfect solution, but at least it ensures that we will not miss the most relevant entities.</p>
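<p>For capitalizing languages, the check described above can be sketched as follows; the sample tokens (adapted from the Spanish example earlier), the gold tags, and the gazetteer are assumptions for illustration:</p>

```python
# Sketch of the check described above: find capitalized tokens that carry
# no gold tag, skipping sentence-initial positions. Sample data is invented.
GAZETTEER = {"MVP", "DJ", "GDP", "Brexit"}

def untagged_capitalized(tokens, gold_tags):
    """Return (index, token) pairs that look like missed entities."""
    suspects = []
    for i, (token, tag) in enumerate(zip(tokens, gold_tags)):
        if i == 0:                      # sentence-initial caps are unreliable
            continue
        if token[0].isupper() and tag == "O":
            suspects.append((i, token))
    return suspects

tokens = ["En", "1983", "fue", "elegido", "como", "el", "MVP", "en", "Europa"]
tags   = ["O",  "O",    "O",   "O",       "O",    "O",  "O",   "O",  "B-LOC"]

for i, tok in untagged_capitalized(tokens, tags):
    flag = "in gazetteer" if tok in GAZETTEER else "needs manual review"
    print(i, tok, "->", flag)          # 6 MVP -> in gazetteer
```

<p>The gazetteer pass resolves the easy cases automatically; whatever remains goes to manual review, which keeps the workload manageable.</p>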
<p>We will review more cases that involve different entity types, ambiguity, lack of criteria…</p></div>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_86  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_87  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_88 landing-page-list  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_89 landing-page-list  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p><span data-olk-copy-source="MessageBody">Previous Post: <a href="https://www.bitext.com/blog/using-public-corpora-to-build-your-ner-systems/">&#8220;Using Public Corpora to Build Your NER systems&#8221;</a></span></p></div>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_90  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner">&nbsp;</p>
<p>&nbsp;</div>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_91  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_image et_pb_image_18">
				
				
				
				
				<span class="et_pb_image_wrap "></span>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_92  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_93  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_image et_pb_image_19">
				
				
				
				
				<span class="et_pb_image_wrap "></span>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_94  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_95  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_image et_pb_image_20">
				
				
				
				
				<span class="et_pb_image_wrap "></span>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_96  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_97  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_98  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_code et_pb_code_6">
				
				
				
				
				
			</div>
			</div>
				
				
				
				
			</div>
				
				
			</div>
<p>The post <a href="https://www.bitext.com/blog/open-source-data-and-training-issues-post/">Open-Source Data and Training Issues</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Why Semantic Intelligence Is the Missing Link in Active Metadata and Data Governance</title>
		<link>https://www.bitext.com/blog/why-semantic-intelligence-is-the-missing-link-in-active-metadata-and-data-governance/</link>
		
		<dc:creator><![CDATA[admin]]></dc:creator>
		<pubDate>Sat, 13 Sep 2025 07:30:33 +0000</pubDate>
				<category><![CDATA[AI]]></category>
		<category><![CDATA[Generative AI]]></category>
		<category><![CDATA[knowledge graph]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[NER]]></category>
		<category><![CDATA[Entity extraction]]></category>
		<guid isPermaLink="false">https://www.bitext.com/?p=43993</guid>

					<description><![CDATA[<p>The new Forrester Wave™: Data Governance Solutions, Q3 2025 makes one thing clear: governance is no longer about static catalogs. Vendors are moving fast into Active Metadata and Agentic AI, with features like lineage, observability, policy enforcement, and marketplaces for data assets.</p>
<p>The post <a href="https://www.bitext.com/blog/why-semantic-intelligence-is-the-missing-link-in-active-metadata-and-data-governance/">Why Semantic Intelligence Is the Missing Link in Active Metadata and Data Governance</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></description>
										<content:encoded><![CDATA[<div class="et_pb_section et_pb_section_8 et_pb_with_background et_section_regular" >
				
				
				
				
				
				
				<div class="et_pb_row et_pb_row_8">
				<div class="et_pb_column et_pb_column_4_4 et_pb_column_8  et_pb_css_mix_blend_mode_passthrough et-last-child">
				
				
				
				
				<div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_99  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p><strong>The Semantic Gap in Today’s Governance Platforms</strong></p>
<p id="ember2207" class="ember-view reader-text-block__paragraph">Forrester’s evaluations show that, despite strong advances in automation and lineage, many platforms underperform on semantic depth.</p>
<ul>
<li>Collibra: strong in workflows and policy management, but AI-driven semantic enforcement is still limited; customers face significant manual work.</li>
<li>Informatica: powerful in technical lineage, but limited in semantic capabilities beyond structured metadata.</li>
<li>Alation: ambitious vision of agentic governance, but still weak in multilingual semantic enrichment and natural-language rule creation.</li>
<li>Atlan and Ataccama: leaders in user experience, quality, and observability, but entity, concept, and relationship extraction from unstructured sources remains immature.</li>
<li><a href="http://data.world/" target="_self" rel="noopener">data.world</a>, Solidatus, Anjana Data: innovative in lineage or collaboration, but their semantic and entity resolution functions require heavy effort from customers.</li>
</ul>
<p id="ember2209" class="ember-view reader-text-block__paragraph"><strong>Without robust semantics, active metadata is not possible. </strong></p>
<p id="ember2210" class="ember-view reader-text-block__paragraph"><strong>Why This Matters: The Unstructured Data Blind Spot</strong></p>
<p id="ember2211" class="ember-view reader-text-block__paragraph">Around 80% of enterprise data is unstructured: reports, contracts, presentations, emails, logs, customer interactions, and knowledge bases.</p>
<ul>
<li>A bank may need to align compliance rules with contracts, call transcripts, and transaction logs.</li>
<li>A global enterprise may need to unify customer records, policy documents, and legal texts across multiple languages.</li>
<li>A technology company may want to automatically tag and classify knowledge bases to create a chatbot for employee support.</li>
</ul>
<p id="ember2213" class="ember-view reader-text-block__paragraph">Without advanced NLP (entity recognition, concept extraction, and relationship mapping), this vast body of information remains invisible to governance platforms or customer support teams.</p>
<p id="ember2214" class="ember-view reader-text-block__paragraph"><strong>The Role of Multilingual Semantics in Active Metadata</strong></p>
<p id="ember2215" class="ember-view reader-text-block__paragraph">Active metadata should not just catalog technical objects; it should understand what data means. For that, governance platforms require a Semantic Enrichment Engine with the following capabilities:</p>
<ul>
<li>Entity and concept extraction: automatically detect business objects such as “customer ID,” “AML regulation,” or “support ticket.”</li>
<li>Relationship discovery: link concepts across unstructured datasets.</li>
<li>Multilingual coverage: enable governance in languages like Chinese, Japanese, Spanish, German, French, Korean, Arabic… ensuring consistency and accuracy.</li>
<li>Unstructured data enrichment: transform PDFs, reports, and communications into governed, discoverable knowledge.</li>
<li>Ontology and taxonomy support: integrate existing business glossaries, identify synonyms and semantic variants, and connect data elements within a broader knowledge graph.</li>
<li>Automation through semantics: trigger workflows, policy enforcement, and recommendations based on semantic signals, not just technical metadata.</li>
</ul>
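<p>As an illustration of the last point (automation through semantics), the sketch below turns extracted concepts into active-metadata records that trigger governance actions. The policy names, concepts, and document IDs are hypothetical examples, not part of any vendor API:</p>

```python
# Illustrative only: map semantic signals (extracted concepts) to
# governance actions. Policies and concepts are invented examples.

POLICIES = {
    "AML regulation": "route-to-compliance",
    "customer ID": "mask-before-sharing",
}

def enrich(doc_id, concepts):
    """Build an active-metadata record and collect triggered actions."""
    record = {"doc": doc_id, "concepts": sorted(concepts), "actions": []}
    for concept in record["concepts"]:
        action = POLICIES.get(concept)
        if action:
            record["actions"].append(action)
    return record

rec = enrich("contract-042", {"customer ID", "payment terms"})
print(rec["actions"])  # ['mask-before-sharing']
```

<p>In a real platform the triggered actions would feed workflow or policy-enforcement engines rather than a simple list.</p>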
<p id="ember2217" class="ember-view reader-text-block__paragraph"><strong>Where <a href="https://bitext.com/" target="_self" rel="noopener">Bitext</a> Helps</strong></p>
<p id="ember2218" class="ember-view reader-text-block__paragraph">At Bitext, we provide an OEM Semantic Enrichment Engine designed to power active metadata and data governance platforms with the semantic depth most vendors still lack.</p>
<p id="ember2219" class="ember-view reader-text-block__paragraph">Key technical advantages of our Semantic Enrichment Engine include:</p>
<ul>
<li>Flexible deployment: available for both on-premises and cloud installations, accessible via REST API or native integration.</li>
<li>Developer-friendly integration: bindings for C, Python, and Java for seamless embedding into existing stacks.</li>
<li>Multiplatform by design: platform-independent C, supporting Windows, Linux, macOS, x64, and ARM.</li>
<li>High-performance NLP pipeline: from language identification to entity/concept extraction, processing over 640,000 words per second (3.2MB/sec) on a single 8-core CPU.</li>
<li>Lightweight footprint: average storage per language pipeline is only 50MB with no external dependencies, and average memory usage 200MB.</li>
<li>Extreme compression: client data sources compressed at ratios up to 1:100 (100MB reduced to 1MB).</li>
<li>Ultra-fast querying: compressed external data accessed at speeds of more than 400 million queries per second on a single 8-core CPU.</li>
</ul>
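<p>The throughput figures above are easy to sanity-check with simple arithmetic: 640,000 words per second at 3.2MB per second implies about 5 bytes per word, plausible for text plus whitespace. A back-of-the-envelope sketch (the 10GB corpus size is an arbitrary example):</p>

```python
# Back-of-the-envelope check of the quoted throughput
# (640,000 words/s, 3.2 MB/s on a single 8-core CPU).

WORDS_PER_SEC = 640_000
BYTES_PER_SEC = 3.2e6  # 3.2 MB/s

bytes_per_word = BYTES_PER_SEC / WORDS_PER_SEC  # 5.0 bytes/word

# Time to push a hypothetical 10 GB corpus through the pipeline:
corpus_bytes = 10e9
minutes = corpus_bytes / BYTES_PER_SEC / 60
print(bytes_per_word, round(minutes))  # 5.0 52
```

<p>So at the quoted rate, a 10GB corpus would take on the order of an hour on a single machine.</p>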
<p id="ember2221" class="ember-view reader-text-block__paragraph">With these capabilities, our Semantic Enrichment Engine allows governance platforms to scale semantic enrichment across massive volumes of unstructured data, in multiple languages, without compromising performance or cost.</p>
<p id="ember2222" class="ember-view reader-text-block__paragraph"><strong>Final Thought</strong></p>
<p id="ember2223" class="ember-view reader-text-block__paragraph">The Forrester Wave highlights the progress of data governance vendors, but also their weakness: semantic depth is not yet where it should be. Active metadata is the future, but without strong semantic intelligence it remains incomplete.</p>
<p id="ember2224" class="ember-view reader-text-block__paragraph">If data governance is to truly drive trust, compliance, and monetization, semantics must evolve from being an optional extra to becoming a core capability.</p>
<p id="ember2225" class="ember-view reader-text-block__paragraph">That is exactly what Bitext delivers with its Semantic Enrichment Engine.</p></div>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_103 landing-page-list  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner">More info about <a href="https://www.bitext.com/namer_entity_recognition/">Bitext NAMER</a></div>
			</div>
			</div>
				
				
				
				
			</div>
				
				
			</div>
<p>The post <a href="https://www.bitext.com/blog/why-semantic-intelligence-is-the-missing-link-in-active-metadata-and-data-governance/">Why Semantic Intelligence Is the Missing Link in Active Metadata and Data Governance</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Bitext NAMER: Slashing Time and Costs in Automated Knowledge Graph Construction</title>
		<link>https://www.bitext.com/blog/bitext-namer-slashing-time-and-costs-in-automated-knowledge-graph-construction/</link>
		
		<dc:creator><![CDATA[admin]]></dc:creator>
		<pubDate>Sun, 16 Mar 2025 14:59:40 +0000</pubDate>
				<category><![CDATA[Generative AI]]></category>
		<category><![CDATA[knowledge graph]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[NER]]></category>
		<category><![CDATA[Entity extraction]]></category>
		<guid isPermaLink="false">https://www.bitext.com/?p=43885</guid>

					<description><![CDATA[<p>The process of building Knowledge Graphs is essential for organizations seeking to organize, structure, and extract actionable insights from their data. However, traditional methods of constructing Knowledge Graphs are often slow, expensive, and complex, requiring significant expertise and manual effort. Bitext NAMER changes the game by automating key steps in the Knowledge Graph creation process, making it faster, more cost-effective, and accessible for businesses of all sizes. </p>
<p>The post <a href="https://www.bitext.com/blog/bitext-namer-slashing-time-and-costs-in-automated-knowledge-graph-construction/">Bitext NAMER: Slashing Time and Costs in Automated Knowledge Graph Construction</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></description>
										<content:encoded><![CDATA[<div class="et_pb_section et_pb_section_9 et_pb_with_background et_section_regular" >
				
				
				
				
				
				
				<div class="et_pb_row et_pb_row_9">
				<div class="et_pb_column et_pb_column_4_4 et_pb_column_9  et_pb_css_mix_blend_mode_passthrough et-last-child">
				
				
				
				
				<div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_113  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
<div class="et_pb_text_inner"><p>The process of building Knowledge Graphs is essential for organizations seeking to organize, structure, and extract actionable insights from their data. However, traditional methods of constructing Knowledge Graphs are often slow, expensive, and complex, requiring significant expertise and manual effort. Bitext NAMER changes the game by automating key steps in the Knowledge Graph creation process, making it faster, more cost-effective, and accessible for businesses of all sizes.</p>
<p><b>The Knowledge Graph Creation Workflow Simplified</b></p>
<p>The process of constructing a knowledge graph involves multiple stages, including ontology or taxonomy creation, entity extraction, relationship mapping, and integration of structured and unstructured data. Traditionally, this process required extensive manual effort from domain experts and data engineers. Bitext NAMER automates key components of this workflow:</p>
<ol>
<li><b>Ontology and Taxonomy Development</b>: While manual ontology creation can take weeks or months, Bitext NAMER simplifies this by providing pre-built dictionaries with over 100,000 entities per language and customizable annotated corpora. These resources serve as the foundation for creating domain-specific ontologies.</li>
<li><b>Entity Extraction</b>: Bitext NAMER identifies 20 types of entities (e.g., people, organizations, locations) with over 95% accuracy across multiple languages. This eliminates the need for manual tagging or annotation while ensuring high-quality data for the KG.</li>
<li><b>Relationship Mapping</b>: The tool detects semantic relationships between entities in real time, enabling the automatic creation of connections within the knowledge graph.</li>
<li><b>Data Integration</b>: By processing both structured and unstructured data from diverse sources, Bitext NAMER ensures seamless integration into existing knowledge frameworks.</li>
</ol>
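<p>The output of steps 2 and 3 can be folded into a graph with a few lines of code; the triples below are invented for illustration and are not actual NAMER output:</p>

```python
# Sketch: fold (subject, relation, object) triples -- as produced by
# entity extraction plus relationship mapping -- into a simple
# adjacency-style knowledge graph. Example triples are hypothetical.

def build_graph(triples):
    """Group outgoing (relation, object) edges by subject."""
    graph = {}
    for subj, rel, obj in triples:
        graph.setdefault(subj, []).append((rel, obj))
    return graph

triples = [
    ("Jane Doe", "works_for", "Acme Corp"),
    ("Acme Corp", "located_in", "Madrid"),
]
kg = build_graph(triples)
print(kg["Jane Doe"])  # [('works_for', 'Acme Corp')]
```

<p>Production systems would load such triples into a graph database rather than an in-memory dictionary, but the structure is the same.</p>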
<p>This automation reduces the time required to construct a knowledge graph from months to days or even hours, depending on the complexity of the data.</p>
<p><b>Time and Cost Efficiency</b></p>
<p>The use of Bitext NAMER significantly reduces the time and cost associated with knowledge graph construction:</p>
<ul>
<li><b>Time Savings</b>: Manual KG construction typically requires 200-300 hours for domain-specific projects. With Bitext NAMER, this can be reduced by up to 90%, allowing completion in as little as 15-25 hours.</li>
<li><b>Cost Reduction</b>: Automating entity extraction and relationship mapping eliminates the need for large teams of annotators or ontology engineers. This translates into cost savings of up to 70%, particularly for organizations processing large volumes of text across multiple languages.</li>
</ul>
<p>For example, a financial services company using Bitext NAMER to build a KG for market intelligence could process thousands of documents daily without incurring the high costs associated with manual efforts.</p></div>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_114  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
<div class="et_pb_text_inner"><p><b>The Challenges of Multilingual NER and Its Importance for Global Knowledge Graphs</b></p>
<p>Global enterprises often operate in multilingual environments, necessitating NER solutions that:</p>
<ul>
<li>Handle linguistic diversity and nuances.</li>
<li>Maintain consistency across languages.</li>
<li>Address region-specific variations, such as named entity formats and cultural context.</li>
</ul>
<p>Failure to address these complexities can lead to fragmented KGs, diminishing their utility and reliability.</p></div>
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_115  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p><b><span data-contrast="auto">Technical Performance Highlights</span></b><span data-ccp-props="{&quot;134233117&quot;:true,&quot;134233118&quot;:true,&quot;201341983&quot;:0,&quot;335559740&quot;:240}"> </span></p>
<p><span data-contrast="auto">Bitext NAMER&#8217;s technical capabilities are optimized for enterprise-scale KG construction:</span><span data-ccp-props="{&quot;134233117&quot;:true,&quot;134233118&quot;:true,&quot;201341983&quot;:0,&quot;335559740&quot;:240}"> </span></p>
<ul>
<li data-leveltext="" data-font="Symbol" data-listid="4" data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:&#091;8226&#093;,&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;multilevel&quot;}" aria-setsize="-1" data-aria-posinset="1" data-aria-level="1"><b><span data-contrast="auto">Processing Speed</span></b><span data-contrast="auto">: Up to 100KB of raw text per second per CPU core.</span><span data-ccp-props="{&quot;134233117&quot;:true,&quot;134233118&quot;:true,&quot;201341983&quot;:0,&quot;335559740&quot;:240}"> </span></li>
</ul>
<ul>
<li data-leveltext="" data-font="Symbol" data-listid="4" data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:&#091;8226&#093;,&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;multilevel&quot;}" aria-setsize="-1" data-aria-posinset="2" data-aria-level="1"><b><span data-contrast="auto">Multilingual Support</span></b><span data-contrast="auto">: Covers over 20 languages natively (e.g., English, Spanish, French) with dictionaries available in 77 languages.</span><span data-ccp-props="{&quot;134233117&quot;:true,&quot;134233118&quot;:true,&quot;201341983&quot;:0,&quot;335559740&quot;:240}"> </span></li>
</ul>
<ul>
<li data-leveltext="" data-font="Symbol" data-listid="4" data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:&#091;8226&#093;,&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;multilevel&quot;}" aria-setsize="-1" data-aria-posinset="3" data-aria-level="1"><b><span data-contrast="auto">Entity Coverage</span></b><span data-contrast="auto">: Recognizes diverse entity types such as people, places, companies/brands, account numbers, and phone numbers.</span><span data-ccp-props="{&quot;134233117&quot;:true,&quot;134233118&quot;:true,&quot;201341983&quot;:0,&quot;335559740&quot;:240}"> </span></li>
</ul>
<ul>
<li data-leveltext="" data-font="Symbol" data-listid="4" data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:&#091;8226&#093;,&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;multilevel&quot;}" aria-setsize="-1" data-aria-posinset="4" data-aria-level="1"><b><span data-contrast="auto">Deployment Flexibility</span></b><span data-contrast="auto">: Available as an on-premise SDK or via SaaS API.</span><span data-ccp-props="{&quot;134233117&quot;:true,&quot;134233118&quot;:true,&quot;201341983&quot;:0,&quot;335559740&quot;:240}"> </span></li>
</ul>
<p>These features make it possible to handle complex datasets across industries such as finance, e-commerce, and cybersecurity.</p>
<p><b>Applications in Knowledge Graph Automation</b></p>
<p>The automation enabled by Bitext NAMER has transformative applications in various domains:</p>
<ol>
<li><b>Semantic Systems</b>: Enhances search engines by creating semantic relationships between structured and unstructured data.</li>
<li><b>Financial Intelligence</b>: Identifies key entities like accounts and transactions to build real-time market intelligence systems.</li>
<li><b>E-commerce</b>: Recognizes brands and products to create recommendation systems based on customer behavior.</li>
<li><b>Cybersecurity</b>: Detects suspicious patterns by connecting disparate datasets into unified graphs.</li>
</ol></div>
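<p>One common way to turn extracted entities into graph structure is co-occurrence linking: entities become nodes, and entities mentioned together become connected. The sketch below illustrates that idea in a few lines of Python; the entity tuples and type labels are illustrative placeholders, not actual Bitext NAMER output.</p>

```python
# Minimal sketch: building a small knowledge graph from NER output.
# The entity annotations below are hypothetical examples, not real
# Bitext NAMER results.
from collections import defaultdict
from itertools import combinations

# Extracted entities per sentence: (surface form, entity type)
ner_output = [
    [("Acme Corp", "ORG"), ("account 4711", "ACCOUNT")],
    [("Acme Corp", "ORG"), ("wire transfer", "TRANSACTION")],
]

# Nodes keyed by surface form; edges count co-occurrences.
nodes = {}
edges = defaultdict(int)
for sentence in ner_output:
    for name, etype in sentence:
        nodes[name] = etype
    for (a, _), (b, _) in combinations(sentence, 2):
        edges[tuple(sorted((a, b)))] += 1

print(nodes["Acme Corp"])                     # ORG
print(edges[("Acme Corp", "account 4711")])   # 1
```

<p>In a production pipeline the co-occurrence heuristic would typically be replaced by relation extraction, but the node/edge structure stays the same.</p>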
			</div><div class="et_pb_module none divienhancer-hover-effect et_pb_text et_pb_text_117 landing-page-list  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner">More info about <a href="https://www.bitext.com/namer_entity_recognition/">Bitext NAMER</a></div>
			</div>
			</div>
			</div>
			</div>
<p>The post <a href="https://www.bitext.com/blog/bitext-namer-slashing-time-and-costs-in-automated-knowledge-graph-construction/">Bitext NAMER: Slashing Time and Costs in Automated Knowledge Graph Construction</a> appeared first on <a href="https://www.bitext.com">Bitext. We help AI understand humans. - chatbots that work</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
