NER

Open-Source Data and Training Issues

Following up on our previous post, “Using Public Corpora to Build Your NER systems”, this post highlights areas where public datasets like OntoNotes or CoNLL can be improved. We also provide some tips on how to avoid these issues, whenever possible, using (semi-)automatic techniques.

Tagging consistency is essential to ensure that training is smooth. Contradictions and inconsistencies not only decrease accuracy but also generate hidden costs in MLOps when trying to debug and fix errors. We often take this consistency for granted, but it is rarely guaranteed, either in these datasets or in any other manual tagging work.

Consistency starts with a solid, clear definition of what an entity is. In practice, such a definition is rarely available.

Entities vs Non-Entities. What’s an entity anyway? The definition of “entity” is the cornerstone of a NER project and should be 100% clear if we are automating the detection of entities, yet the corpora below show that it is often left unclear.

For example, in WikiNEuRal, a well-known multilingual set of corpora, entities like “MVP” (Most Valuable Player) or “DJ” (Disc Jockey) are not tagged. In our view, they should be tagged – in this case as PERSON:

Example in Spanish tagging:

Input Sentence: “En 1980 y 1983 fue elegido como el MVP en toda Europa”
Gold Tagging: Europa:LOCATION (MVP-missing)

Example in Portuguese:

Input Sentence: Esse estilo era exclusivamente um fenômeno de Chicago , mas em 1987 virou febre no Reino Unido e na Europa Continental , sendo muito tocado por Djs .
Gold Tagging: Chicago:LOCATION Reino Unido:LOCATION Europa Continental:LOCATION (Djs-missing)

This same problem happens with other corpora, such as the UNER Swedish PUD corpus:

Example in Swedish: Entity “Paris Agreement” should be tagged as MISCELLANEOUS

Input Sentence: Det är fantastiskt att de fick Parisavtalet men deras insatser är för tillfället inte i närheten av målet på 1,5 grader.
Gold Tagging: (Parisavtalet-missing)

Example in Swedish: Entity “Brexit” should be tagged as MISCELLANEOUS

Input Sentence: May har fått stor kritik för att ha undvikit och inte svarat öppet till media efter rättsutlåtandet om Brexit.
Gold Tagging: May:PERSON (Brexit-missing)

And similar cases occur across other languages and corpora:

Example in Russian (in WikiNEuRal Russian): “Альмохады” (“Almohads”) not tagged as MISCELLANEOUS

Input Sentence: В 1130-е годы Альмохады расширяли своё влияние в горных областях Марокко , в восточных и южных районах страны .
Gold Tagging: Марокко:LOCATION (Альмохады-missing)

Example in Korean (in KLUE): “인권센터는” (“Human Rights Center”) not tagged as ORGANIZATION

Input Sentence: 시 인권센터는 민간조사전문가 1 명을 포함한 사건조사팀을 구성 , 21 일간 신청인과 참고인 , 피신청인 16 명에 대한 진술조사와 현장조사를 한 결과 이같이 판정했다고 30 일 밝혔다 .
Gold Tagging: (인권센터는-missing)

This same problem happens with many other entities, often of type MISCELLANEOUS: GDP (Gross Domestic Product), DVD, Blu-ray, VHS… The list is long and not documented in any corpus as far as we know.

A Possible Solution. For languages that use capitalization (like English, Spanish…), the solution involves a significant amount of work. To detect untagged entities, we need to extract all capitalized strings from the corpus, isolate the ones that carry no label, and check them, either manually (the safest way) or against gazetteers to shortcut the task. The main complication, though not the only one, is that in many languages the first word of a sentence is always capitalized, even when it is a regular word.
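
The capitalization check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the corpus is assumed to be loaded as a list of sentences, each a list of (token, tag) pairs where "O" marks an untagged token; the names `tag_tokens` and `untagged_capitalized` and the one-entry gazetteer are hypothetical. Sentence-initial tokens are simply skipped, which avoids false positives at the cost of missing entities that open a sentence.

```python
from collections import Counter


def tag_tokens(text, labels):
    """Toy loader: whitespace-tokenize and attach tags from a dict ('O' by default)."""
    return [(tok, labels.get(tok, "O")) for tok in text.split()]


def untagged_capitalized(sentences, gazetteer=frozenset()):
    """Collect capitalized tokens tagged 'O', skipping sentence-initial tokens.

    Returns two dicts (token -> count): candidates found in the gazetteer,
    and candidates that still need manual review.
    """
    candidates = Counter()
    for sentence in sentences:
        for position, (token, tag) in enumerate(sentence):
            if position == 0:
                continue  # first word is capitalized regardless of its type
            if tag == "O" and token[:1].isupper():
                candidates[token] += 1
    in_gazetteer = {t: n for t, n in candidates.items() if t in gazetteer}
    to_review = {t: n for t, n in candidates.items() if t not in gazetteer}
    return in_gazetteer, to_review


# Two sentences from the examples above, with their gold tags.
corpus = [
    tag_tokens("En 1980 y 1983 fue elegido como el MVP en toda Europa",
               {"Europa": "B-LOC"}),
    tag_tokens("May har fått stor kritik för att ha undvikit och inte svarat "
               "öppet till media efter rättsutlåtandet om Brexit .",
               {"May": "B-PER"}),
]

# Hypothetical gazetteer with a single entry.
known, review = untagged_capitalized(corpus, gazetteer={"MVP"})
print("In gazetteer:", known)
print("Manual review:", review)
```

Candidates found in the gazetteer can be queued for relabelling, while the rest still need a human check. Sorting both lists by frequency helps prioritize the review.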

For languages that do not use capital letters (Arabic, Korean, Chinese, Japanese…) the solution is even harder; it would involve checking the corpus without the help of capitalization.

Given that this solution involves significant work, a good shortcut for all languages is to compile a list of the most relevant entities we need to tag, and make sure they are tagged in our training corpora. This is not a perfect solution, but at least it ensures that we will not miss the most relevant entities.
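
As a rough sketch of this shortcut, the following Python snippet scans a corpus for occurrences of a must-tag list that carry no label. It assumes the same hypothetical corpus layout of (token, tag) pairs with "O" for untagged tokens; `find_untagged_mentions` and the contents of the list are illustrative only. Matching is case-insensitive and exact, so inflected or compounded forms (such as “Parisavtalet”) would need to be listed explicitly or handled with lemmatization.

```python
def tag_tokens(text, labels):
    """Toy loader: whitespace-tokenize and attach tags from a dict ('O' by default)."""
    return [(tok, labels.get(tok, "O")) for tok in text.split()]


def find_untagged_mentions(sentences, must_tag):
    """Return (sentence_index, token_index, phrase) for every must-tag phrase
    that occurs in the corpus with all of its tokens tagged 'O'."""
    # Pre-tokenize each phrase; ignore empty entries.
    phrases = [(p, tuple(p.lower().split())) for p in must_tag if p.strip()]
    hits = []
    for s_idx, sentence in enumerate(sentences):
        tokens = [tok.lower() for tok, _ in sentence]
        tags = [tag for _, tag in sentence]
        for phrase, parts in phrases:
            n = len(parts)
            for start in range(len(tokens) - n + 1):
                window = tuple(tokens[start:start + n])
                if window == parts and all(t == "O" for t in tags[start:start + n]):
                    hits.append((s_idx, start, phrase))
    return hits


# Two sentences from the examples above, with their gold tags.
corpus = [
    tag_tokens("Esse estilo era exclusivamente um fenômeno de Chicago , mas em "
               "1987 virou febre no Reino Unido e na Europa Continental , "
               "sendo muito tocado por Djs .",
               {"Chicago": "B-LOC", "Reino": "B-LOC", "Unido": "I-LOC",
                "Europa": "B-LOC", "Continental": "I-LOC"}),
    tag_tokens("En 1980 y 1983 fue elegido como el MVP en toda Europa",
               {"Europa": "B-LOC"}),
]

# Hypothetical list of entities that must always be tagged.
must_tag = ["DJs", "MVP", "Reino Unido"]

hits = find_untagged_mentions(corpus, must_tag)
for sentence_index, token_index, phrase in hits:
    print(f"sentence {sentence_index}, token {token_index}: '{phrase}' is untagged")
```

Here “Reino Unido” is not reported because it is already labelled, while the untagged “Djs” and “MVP” mentions are. The same check works for languages without capitalization, provided the corpus tokenization matches the entries in the list.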

We will review more cases that involve different entity types, ambiguity, lack of criteria…

 

