
Using Public Corpora to Build Your NER Systems

Rationale. NER tools are at the heart of how the scientific community is addressing LLM reliability issues with GraphRAG and NodeRAG architectures.

LLMs need knowledge graphs to control hallucinations and to make their output solid enough for enterprise use.

And knowledge graphs are built with automatic data extraction tools: not only entity extraction, but also concept extraction and extraction of the relationships among those entities and concepts.
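To make that concrete, here is a minimal sketch of how extracted (entity, relation, entity) triples turn into a graph, using the networkx library. The triples are invented for illustration; this is not any specific extraction pipeline.

    import networkx as nx

    # Hypothetical output of an extraction pipeline: (entity, relation, entity) triples.
    triples = [
        ("Ada Lovelace", "born_in", "London"),
        ("Ada Lovelace", "collaborated_with", "Charles Babbage"),
        ("Charles Babbage", "designed", "Analytical Engine"),
    ]

    # Entities become nodes; relations become labeled, directed edges.
    graph = nx.DiGraph()
    for head, relation, tail in triples:
        graph.add_edge(head, tail, relation=relation)

    for head, tail, data in graph.edges(data=True):
        print(f"{head} --{data['relation']}--> {tail}")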

Open-Source Tools. When beginning an entity extraction project, it is typical to start with open-source, machine-learning-based tools.

Open-source tools such as Hugging Face, Spark NLP, and spaCy are widely adopted and fit different stages of execution, from POC to production.
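As a minimal sketch of what entity extraction looks like with one of these tools, here is spaCy with its off-the-shelf en_core_web_sm model (which must be downloaded first):

    import spacy

    # Requires: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("Tim Cook announced Apple's quarterly results in Cupertino.")

    # Each detected entity carries its surface text and a label
    # such as PERSON, ORG, or GPE (geo-political entity).
    for ent in doc.ents:
        print(ent.text, ent.label_)

The same few lines work for a POC; moving to production mostly changes which model is loaded and how documents are batched.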

Open-Source Data. These tools rely on third-party datasets for model training and evaluation, typically corpora manually tagged with NER information (Person, Place, Organization, Company…).

Developing new data is expensive and complex, which is why most projects avoid producing their own tagged data.

Therefore, the standard way to get started is a combination of open-source tools and open-source data. OntoNotes and CoNLL are good examples of this type of dataset for English.
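For instance, CoNLL-2003 can be pulled with the Hugging Face datasets library. This is a sketch: the exact dataset identifier and loading options depend on the library version.

    from datasets import load_dataset

    # "conll2003" is the long-standing Hub identifier; recent versions of the
    # datasets library may require trust_remote_code=True or a mirrored copy.
    dataset = load_dataset("conll2003")

    # Each example is a list of tokens with parallel integer NER tags
    # encoding BIO labels over PER, ORG, LOC, and MISC.
    example = dataset["train"][0]
    print(example["tokens"])
    print(example["ner_tags"])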

Data is Critical. These datasets are used for two essential purposes:

  • for training, i.e. building the core of our NER tool
  • for evaluation, i.e. determining whether our project is a success and can be deployed publicly (see the evaluation sketch after this list)
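On the evaluation side, a common choice is the seqeval library, which scores predicted tag sequences at the entity level rather than token by token. A minimal sketch; the gold and predicted BIO sequences are invented for illustration:

    from seqeval.metrics import classification_report, f1_score

    # Invented gold vs. predicted BIO sequences for two short sentences.
    gold = [
        ["B-PER", "I-PER", "O", "B-ORG"],
        ["O", "B-LOC", "O"],
    ]
    predicted = [
        ["B-PER", "I-PER", "O", "O"],
        ["O", "B-LOC", "O"],
    ]

    # seqeval counts an entity as correct only if both its span and its type match.
    print(f1_score(gold, predicted))
    print(classification_report(gold, predicted))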

Data as a Black Box? The datasets are open, meaning anyone can examine the text and the tagging. However, these datasets are often treated as “black boxes”: they are used to build NER models without much analysis or understanding of their weaknesses and what those weaknesses imply. (We will not focus on their strengths, which are well known to the community; that is why these datasets are so popular.)

In this series of posts, we are going to try to make those black boxes more transparent, drawing on our experience using them at Bitext for evaluation purposes.

We will identify areas where the datasets can be improved and provide tips on how to avoid these issues, with (semi-)automatic techniques whenever possible.

First, we classify the different types of issues into three groups:

  1. Training issues: common types of inconsistencies, in both gold (manually tagged) and silver (semi-automatically tagged) datasets; more on this in future posts.
  2. Evaluation issues: how misleading it can be to use the same corpus for both training and evaluation.
  3. Deployment issues: licensing has a strong impact when moving from POC to production.

 
