
On the Stanford parser (and Bitext parser)

In some of our recent talks, colleagues have asked us how the Stanford parser compares to Bitext technology (namely at our last workshop on Semantic Analysis of Big Data in San Francisco, and in our presentation at the Semantic Garage, also in San Francisco).

We have revisited this parser and, as expected, the results were impressive. Sentences taken from news sources are parsed elegantly, with a clean dependency tree for their constituents. The parser is based on a probabilistic approach: it must be trained on a hand-tagged corpus annotated with POS (Part of Speech) tags, typically the Wall Street Journal corpus from the Penn Treebank.
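The core probabilistic idea can be illustrated with a toy sketch (illustrative only; real treebank-trained parsers like Stanford's use far richer models). Rule probabilities are estimated from a hand-tagged corpus, and the parser prefers the analysis whose rules are, jointly, most probable:

```python
# Toy sketch of the probabilistic idea behind treebank-trained parsers.
# The tiny "treebank" below is hypothetical; each tree is just the list
# of grammar rules used to build it.
from collections import Counter

treebank = [
    ["S -> NP VP", "NP -> DT NN", "VP -> VBD NP"],
    ["S -> NP VP", "NP -> DT NN", "VP -> VBD PP"],
    ["S -> NP VP", "NP -> NNP", "VP -> VBD NP"],
]

# Estimate rule probabilities by relative frequency, conditioned on the
# rule's left-hand side (the standard PCFG estimate).
rule_counts = Counter(r for tree in treebank for r in tree)
lhs_counts = Counter(r.split(" -> ")[0] for tree in treebank for r in tree)

def rule_prob(rule):
    return rule_counts[rule] / lhs_counts[rule.split(" -> ")[0]]

def parse_prob(rules):
    """Probability of a candidate analysis = product of its rule probabilities."""
    p = 1.0
    for r in rules:
        p *= rule_prob(r)
    return p

# Given two competing analyses of a sentence, pick the more probable one.
candidate_a = ["S -> NP VP", "NP -> DT NN", "VP -> VBD NP"]
candidate_b = ["S -> NP VP", "NP -> DT NN", "VP -> VBD PP"]
best = max([candidate_a, candidate_b], key=parse_prob)
```

The catch, of course, is the first line: those probabilities only exist because someone hand-tagged the corpus first.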

The question is whether this approach remains effective when a different type of text needs to be addressed (such as Social Media content or User Generated Reviews…). In that case, new corpora must be hand-tagged to retrain the parser.

Let’s think about parsing tweets, which often feature ungrammatical sentences, slang, abbreviations, emoticons… If this must also be done in languages other than English, hand-tagging corpora to train a parser can become a hurdle for fully automating tasks like Social Media Analysis.
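Even the very first step, tokenization, shows why newswire-trained tools struggle here. A naive word/punctuation split mangles exactly the tokens that carry meaning in tweets; the regex below (purely illustrative, not Bitext's actual technology) keeps emoticons, hashtags and mentions intact:

```python
# Minimal sketch: a tokenizer that treats emoticons, hashtags and
# mentions as single tokens instead of shredding them into punctuation.
import re

TWEET_TOKEN = re.compile(
    r"""
    [:;=8][-o*']?[)\](\[dDpP/\\|]   # emoticons like :-) ;P =D
    | \#\w+                         # hashtags
    | @\w+                          # user mentions
    | \w+(?:'\w+)?                  # ordinary words, contractions
    | [^\w\s]                       # any remaining punctuation
    """,
    re.VERBOSE,
)

def tokenize(text):
    return TWEET_TOKEN.findall(text)

tokens = tokenize("@user gr8 news!!! #NLP rocks :-)")
```

A parser trained on the Wall Street Journal has simply never seen tokens like `gr8`, `#NLP` or `:-)`, regardless of how well it is tokenized.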

At Bitext, we have had to face this situation, since our customers deal with many different types of data and in multiple languages.

We follow a linguistic approach whereby we can quickly develop grammars that describe the dependency structure of sentences at various levels of detail, depending on whether we are analyzing Social Media content or news texts.

Our linguistic parser is grammar-independent and dictionary-independent: for a new language we use the same parsing engine, changing only the linguistic data sources. Most importantly, we do not need hand-tagged corpora to “train” our parser, since it does not require any training.

That training requirement is the downside of probabilistic parsers; on the upside, the probabilistic information associated with their rules can prove very useful. We are looking into ways to incorporate statistical information into our grammars so that the parser can select the correct analysis in cases where structural ambiguity is pervasive.

A hybrid approach combining linguistic knowledge and statistical information could help resolve such ambiguous sentences, which are more frequent than we usually think.
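A hypothetical sketch of what such a hybrid could look like (this is not Bitext's actual implementation): the hand-written grammar proposes all structurally valid analyses of a classic PP-attachment ambiguity, and corpus-derived attachment counts break the tie.

```python
# Hybrid disambiguation sketch for "saw the man with the telescope".
# The grammar licenses both attachments; statistics pick the likelier one.

# Analyses licensed by the (hypothetical) grammar.
analyses = [
    {"label": "VP attachment (saw ... with the telescope)",
     "attachments": [("with", "saw")]},
    {"label": "NP attachment (the man with the telescope)",
     "attachments": [("with", "man")]},
]

# What the linguistic rules alone cannot provide: how often this
# preposition attaches to each head in a (hypothetical) corpus.
attachment_counts = {("with", "saw"): 120, ("with", "man"): 35}
total = sum(attachment_counts.values())

def score(analysis):
    """Weight an analysis by the relative frequency of its attachments."""
    s = 1.0
    for att in analysis["attachments"]:
        s *= attachment_counts.get(att, 0) / total
    return s

best = max(analyses, key=score)
```

The grammar guarantees that only linguistically valid structures compete; the statistics only choose among them, which is the division of labor the paragraph above argues for.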
