Every day, millions of news articles are published about technology, business and geopolitics.
But there is a signal hidden inside them that most analytics systems completely miss.
It isn’t in what the articles say.
It’s in which entities appear together.
Once you start measuring that signal, you can see how global narratives form.
This signal is known as co-mentions, and it is widely used in knowledge graph construction and large-scale text analysis.
Counting mentions tells you which entities are important.
But co-mentions tell you something far more valuable: how those entities are connected.
That distinction is crucial.
For example: AI might appear in thousands of articles.
But if AI increasingly appears alongside Nvidia, something deeper is happening. It reveals a narrative forming:
AI infrastructure → Nvidia
Similarly, when AI increasingly appears together with the US or China, the story changes. AI is no longer just a technology topic. It has become a geopolitical one.
Co-mentions allow us to detect these narrative shifts early – before they become obvious.
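One simple way to surface such shifts is to compare a pair's co-mention weight between two periods and flag pairs whose weight grew fastest. The sketch below uses hypothetical yearly counts purely for illustration; the pair names and thresholds are assumptions, not results from the corpus:

```python
# Hypothetical yearly co-mention counts, for illustration only
counts_2023 = {("AI", "Nvidia"): 1200, ("AI", "China"): 300}
counts_2024 = {("AI", "Nvidia"): 5400, ("AI", "China"): 2100}

def rising_pairs(old, new, min_growth=2.0):
    """Flag entity pairs whose co-mention weight grew by at least min_growth x."""
    flagged = []
    for pair, weight in new.items():
        baseline = old.get(pair, 1)  # treat unseen pairs as baseline 1
        ratio = weight / baseline
        if ratio >= min_growth:
            flagged.append((pair, ratio))
    # Strongest relative growth first
    return sorted(flagged, key=lambda item: -item[1])

shifts = rising_pairs(counts_2023, counts_2024)
# ("AI", "China") grew 7x versus 4.5x for ("AI", "Nvidia"),
# so the geopolitical pairing surfaces first
```

Relative growth rather than absolute volume is what makes a quiet-but-accelerating narrative visible before it dominates the headlines.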
We tested this idea using the Leipzig English News corpora from the Wortschatz Project at Leipzig University. We analyzed datasets from 2023, 2024 and 2025.
To focus on economic and technology narratives, documents were filtered using the IPTC Media Topics taxonomy, keeping only economy- and technology-related categories. Across these datasets, the pipeline processed, filtered and extracted the following approximate volumes:
| Dataset Scope | Approximate Volume |
|---|---|
| Raw news articles processed | 2 million |
| Articles after topical filtering | 400K |
| Entity mentions extracted | Millions |
| Co-mention relationships generated | Tens of millions |
The pipeline combines entity extraction with graph analysis:
Relationships are generated by linking entities that appear in the same document, producing weighted co-mention edges.
For example, if a document mentions US, China, Nvidia and AI, the system generates a co-mention relationship for every pair: US–China, US–Nvidia, US–AI, China–Nvidia, China–AI and Nvidia–AI.
| Pipeline Step | What It Does |
|---|---|
| Entity recognition | Extracts companies, countries, technologies and other entities from text |
| Normalization | Maps variants such as “US” and “America” to a canonical entity |
| Relationship extraction | Links entities appearing in the same document |
| Aggregation | Builds weighted co-mention patterns across the corpus |
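The steps in the table above can be sketched as a minimal Python pipeline. The normalization map and sample documents are illustrative assumptions, not the production lexicon:

```python
from collections import Counter
from itertools import combinations

# Illustrative normalization map (an assumption, not the actual lexicon)
CANONICAL = {"US": "United States", "America": "United States"}

def normalize(entity):
    """Map surface variants to a canonical entity name."""
    return CANONICAL.get(entity, entity)

def co_mention_edges(doc_entities):
    """One undirected edge per unordered pair of entities in a document."""
    unique = sorted({normalize(e) for e in doc_entities})
    return list(combinations(unique, 2))

def aggregate(docs):
    """Count each pair across the corpus to build weighted co-mention edges."""
    weights = Counter()
    for entities in docs:
        weights.update(co_mention_edges(entities))
    return weights

docs = [["US", "China", "Nvidia", "AI"], ["America", "Nvidia", "AI"]]
weights = aggregate(docs)
# ("AI", "Nvidia") appears in both documents, so its weight is 2;
# "US" and "America" collapse into a single "United States" node
```

Sorting each pair before counting means (AI, Nvidia) and (Nvidia, AI) accumulate into the same undirected edge.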
When these relationships are aggregated across hundreds of thousands of articles, they form a knowledge graph that reveals patterns in global narratives.
Even a tiny fragment already tells a story:
AI → Nvidia → US → China
Technology → infrastructure → geopolitics.
| Input | Transformation | Output |
|---|---|---|
| Unstructured news text | Entity extraction + co-mention analysis | Knowledge graph of entities and relationships |
Most of the world’s knowledge still lives in unstructured text. But once entities and relationships are extracted at scale, that text can be transformed into structured knowledge graphs ready for analysis.
These graphs integrate naturally with platforms such as Neo4j, Stardog, Ontotext and MarkLogic, where the extracted entities and relationships can be explored and analyzed.
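As a sketch of how weighted edges could be handed to such a platform, the snippet below renders an edge as a Cypher MERGE statement of the kind Neo4j ingests; the `Entity` label, `CO_MENTIONED` relationship type and `weight` property are assumptions for illustration, not a fixed schema:

```python
def edge_to_cypher(source, target, weight):
    """Render one weighted co-mention edge as a Cypher MERGE statement.

    Co-mentions are symmetric; the arrow direction here is arbitrary
    and can be ignored at query time.
    """
    return (
        f'MERGE (a:Entity {{name: "{source}"}}) '
        f'MERGE (b:Entity {{name: "{target}"}}) '
        f'MERGE (a)-[r:CO_MENTIONED]->(b) '
        f"SET r.weight = {weight}"
    )

stmt = edge_to_cypher("AI", "Nvidia", 5400)
```

Using MERGE rather than CREATE keeps the load idempotent: re-running the export updates edge weights instead of duplicating nodes and relationships.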
In short: text → entities → relationships → knowledge graph
And once the graph exists, hidden signals start to appear.
| Stage | Result |
|---|---|
| Text | Raw unstructured articles |
| Entities | Normalized companies, countries, technologies and other concepts |
| Relationships | Weighted co-mentions between entities |
| Knowledge graph | Structured narrative map ready for analysis |
Co-mentions are one of the simplest signals you can extract from text.
But at scale, they reveal how the world connects ideas, companies and countries.
What other signals do you think could be extracted from large-scale news analysis?