Rationale. NER tools are at the heart of how the scientific community is addressing LLM limitations with GraphRAG and NodeRAG architectures.
LLMs need knowledge graphs to keep hallucinations in check and to become reliable enough for enterprise-level use.
Knowledge graphs, in turn, are built with automatic data extraction tools, which cover not only entity extraction but also concept extraction and the extraction of relationships among entities or concepts.
Open-Source Tools. Entity extraction projects typically start by leveraging open-source, machine-learning-based tools.
Tools such as Hugging Face, Spark NLP or spaCy are widespread and fit different stages of a project, from POC to production.
Open-Source Data. These tools rely on third-party datasets for model training and evaluation, typically manually tagged corpora with NER information (Person, Place, Organization, Company…).
Developing new data is expensive and complex, which is why most projects avoid producing their own tagged data.
Therefore, the most common way to get started is to combine open-source tools with open-source data. OntoNotes and CoNLL are good examples of this type of dataset for English.
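To make "manually tagged corpora" concrete, here is a minimal sketch of reading such data. It assumes a CoNLL-2003-style layout (one token per line, the NER tag in the last column, a blank line between sentences) with BIO tags. The sample sentences are invented for illustration, and the helper names `read_sentences` and `extract_entities` are hypothetical, not part of any library.

```python
from collections import Counter

# Invented sample in a CoNLL-2003-style layout (an assumption for this sketch):
# one token per line, NER tag in the last column, blank line between sentences.
SAMPLE = """\
Maria B-PER
Lopez I-PER
joined O
Acme B-ORG
Corp I-ORG
in O
Madrid B-LOC
. O

Acme B-ORG
opened O
an O
office O
in O
Lisbon B-LOC
. O
"""


def read_sentences(text):
    """Return a list of sentences, each a list of (token, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence):
    """Group BIO tags into (entity_text, entity_type) spans."""
    entities, tokens, etype = [], [], None
    for token, tag in sentence:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != etype
        )
        if starts_new:
            # Close any open span, then start a new one.
            if tokens:
                entities.append((" ".join(tokens), etype))
            tokens, etype = [token], tag[2:]
        elif tag.startswith("I-"):
            tokens.append(token)
        else:
            # An "O" tag closes any open span.
            if tokens:
                entities.append((" ".join(tokens), etype))
            tokens, etype = [], None
    if tokens:
        entities.append((" ".join(tokens), etype))
    return entities


sentences = read_sentences(SAMPLE)
entities = [e for s in sentences for e in extract_entities(s)]
label_counts = Counter(etype for _, etype in entities)

print(entities)
print(label_counts)
```

Counting entities per label, as `label_counts` does here, is a cheap first look at a corpus before training on it: a heavily skewed distribution is often an early hint of where a model trained on that data will be weak.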
Data is Critical. These datasets are used for two critical purposes: training models and evaluating them.
Data is a Black Box? The datasets are open, meaning anyone can examine the text, the tagging… In practice, however, they are often treated as “black boxes”: they are used to build NER models without much analysis of their weaknesses or of what those weaknesses imply. (We will not focus on their strengths, which are well known to the community and are the reason these datasets are so popular.)
In this series of posts, we will try to make those black boxes more transparent, drawing on our experience using them at Bitext for evaluation purposes.
We will identify areas where the datasets can be improved and will provide some tips on how to avoid these issues, whenever possible with (semi-)automatic techniques.
First, we classify the different types of issues into 3 groups:
Next Post: “Open-Source Data and Training Issues”