Rationale. NER tools are at the heart of how the scientific community is addressing LLM issues with GraphRAG and NodeRAG architectures.

LLMs need knowledge graphs to control hallucinations and to make them reliable enough for enterprise-level use.

And knowledge graphs are built using automatic data extraction tools: not only entity extraction but also concept extraction and the extraction of relationships among entities or concepts.

Open-Source Tools. When starting an entity extraction project, the typical first step is to leverage open-source, machine-learning-based tools.

Open-source tools such as Hugging Face, Spark NLP, or spaCy are widely used and cover different levels of execution, from POC to production-ready systems.
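
As an illustration, here is a minimal entity extraction sketch with spaCy (assuming the en_core_web_sm model has been installed via python -m spacy download en_core_web_sm; the sample sentence is just a placeholder):

```python
import spacy

# Load a small pretrained English pipeline (trained on OntoNotes-style labels).
nlp = spacy.load("en_core_web_sm")

doc = nlp("Tim Cook announced Apple's new campus in Austin, Texas.")

# Each entity carries its surface text and a label such as PERSON, ORG or GPE.
for ent in doc.ents:
    print(ent.text, ent.label_)
```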

Open-Source Data. These tools rely on third-party datasets for model training and evaluation, typically corpora manually tagged with NER information (Person, Place, Organization, Company…).

Developing new data is expensive and complex, which is why most projects avoid producing their own tagged data.

Therefore, the most common way to get started is a combination of open-source tools and open-source data. OntoNotes and CoNLL are good examples of this type of dataset for English.
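
For instance, one quick way to inspect such a corpus is through the Hugging Face datasets library (a sketch assuming the conll2003 dataset is accessible from the Hub; recent library versions may additionally require trust_remote_code=True for script-based datasets):

```python
from datasets import load_dataset

# CoNLL-2003: English news text tagged with PER, ORG, LOC and MISC entities.
conll = load_dataset("conll2003")

example = conll["train"][0]
labels = conll["train"].features["ner_tags"].feature.names  # BIO tag names

# Print each token with its gold NER tag.
for token, tag_id in zip(example["tokens"], example["ner_tags"]):
    print(f"{token}\t{labels[tag_id]}")
```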

Data is Critical. These datasets are used for two critical purposes:

  • for training, i.e. building the core of our NER tool
  • for evaluation, i.e. determining whether our project is a success and is ready for real-world use (see the evaluation sketch below)
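
A common way to measure that success on a tagged corpus is entity-level precision, recall and F1, for example with the seqeval library (a minimal sketch; the BIO sequences below are illustrative placeholders, not real corpus data):

```python
from seqeval.metrics import classification_report, f1_score

# Gold tags from the annotated corpus and tags predicted by the NER model,
# both in BIO format, one list of tags per sentence.
gold = [["B-PER", "I-PER", "O", "B-ORG"],
        ["B-LOC", "O", "O"]]
pred = [["B-PER", "I-PER", "O", "B-ORG"],
        ["O", "O", "O"]]

print(f1_score(gold, pred))              # micro-averaged entity-level F1
print(classification_report(gold, pred)) # per-entity-type precision/recall/F1
```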

Data is a Black Box? The datasets are open, meaning anyone can examine the text, the tagging… However, these datasets are often treated as “black boxes”, i.e. they are used to build NER models without much analysis or understanding of their weaknesses and the implications of those weaknesses. (We will not focus on their strengths, since these are well known to the community; that is why the datasets are so popular.)

In this series of posts, we are going to try to make those black boxes more transparent, drawing on our experience using them at Bitext for evaluation purposes.

We will identify areas where the datasets can be improved and will provide some tips on how to avoid these issues, whenever possible with (semi-)automatic techniques.

First, we classify the different types of issues into 3 groups:

  1. Training issues: common types of inconsistencies, both in gold (manual) and silver (semi-automatic) datasets — more on this in future posts.
  2. Evaluation: how misleading it can be to use the same corpus for training and evaluation.
  3. Deployment issues: licensing has a strong impact when moving from POC to production.