
On the Stanford parser (and Bitext parser)

In some of our recent talks, colleagues have asked us how the Stanford parser compares to Bitext technology (namely at our last workshop on Semantic Analysis of Big Data in San Francisco, and in our presentation at the Semantic Garage, also in San Francisco).

We have revisited this parser and, as expected, the results were impressive. Sentences taken from news sources are parsed elegantly, with a clean dependency tree for their constituents. The parser is based on a probabilistic approach: it must be trained on a hand-tagged corpus annotated with POS (Part of Speech) tags, typically the Wall Street Journal corpus from the Penn Treebank.
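The core probabilistic idea can be illustrated with a toy sketch (illustrative only; real treebank-trained parsers like Stanford's use far richer models). Rule probabilities are estimated from a hand-tagged corpus, and the parser prefers the analysis whose rules are, jointly, most probable:

```python
# Toy sketch of the probabilistic idea behind treebank-trained parsers.
# The tiny "treebank" below is hypothetical; each tree is just the list
# of grammar rules used to build it.
from collections import Counter

treebank = [
    ["S -> NP VP", "NP -> DT NN", "VP -> VBD NP"],
    ["S -> NP VP", "NP -> DT NN", "VP -> VBD PP"],
    ["S -> NP VP", "NP -> NNP", "VP -> VBD NP"],
]

# Estimate rule probabilities by relative frequency, conditioned on the
# rule's left-hand side (the standard PCFG estimate).
rule_counts = Counter(r for tree in treebank for r in tree)
lhs_counts = Counter(r.split(" -> ")[0] for tree in treebank for r in tree)

def rule_prob(rule):
    return rule_counts[rule] / lhs_counts[rule.split(" -> ")[0]]

def parse_prob(rules):
    """Probability of a candidate analysis = product of its rule probabilities."""
    p = 1.0
    for r in rules:
        p *= rule_prob(r)
    return p

# Given two competing analyses of a sentence, pick the more probable one.
candidate_a = ["S -> NP VP", "NP -> DT NN", "VP -> VBD NP"]
candidate_b = ["S -> NP VP", "NP -> DT NN", "VP -> VBD PP"]
best = max([candidate_a, candidate_b], key=parse_prob)
```

The catch, of course, is the first line: those probabilities only exist because someone hand-tagged the corpus first.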

The question is whether this approach remains effective when a different type of text needs to be addressed (such as Social Media content or User Generated Reviews…). In that case, new corpora must be hand-tagged to retrain the parser.

Let’s think about parsing tweets, which often feature ungrammatical sentences, slang, abbreviations, emoticons… If this must also be done in languages other than English, hand-tagging corpora to train a parser can become a hurdle for fully automating tasks like Social Media Analysis.
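Even the very first step, tokenization, shows why newswire-trained tools struggle here. A naive word/punctuation split mangles exactly the tokens that carry meaning in tweets; the regex below (purely illustrative, not Bitext's actual technology) keeps emoticons, hashtags and mentions intact:

```python
# Minimal sketch: a tokenizer that treats emoticons, hashtags and
# mentions as single tokens instead of shredding them into punctuation.
import re

TWEET_TOKEN = re.compile(
    r"""
    [:;=8][-o*']?[)\](\[dDpP/\\|]   # emoticons like :-) ;P =D
    | \#\w+                         # hashtags
    | @\w+                          # user mentions
    | \w+(?:'\w+)?                  # ordinary words, contractions
    | [^\w\s]                       # any remaining punctuation
    """,
    re.VERBOSE,
)

def tokenize(text):
    return TWEET_TOKEN.findall(text)

tokens = tokenize("@user gr8 news!!! #NLP rocks :-)")
```

A parser trained on the Wall Street Journal has simply never seen tokens like `gr8`, `#NLP` or `:-)`, regardless of how well it is tokenized.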

At Bitext, we have had to face this situation, since our customers deal with many different types of data and in multiple languages.

We follow a linguistic approach whereby we can quickly develop grammars that describe the dependency structure of sentences at various levels of detail, depending on whether we are analyzing Social Media content or news texts.

Our linguistic parser is grammar-independent and dictionary-independent: for a new language we use the same parsing engine, changing only the linguistic data sources. Most importantly, we do not need hand-tagged corpora to “train” our parser, since it does not require any training.

That training requirement is the downside of probabilistic parsers; on the upside, the probabilistic information associated with their rules can prove very useful. We are looking into ways to incorporate statistical information into our grammars so that the parser can select the correct analysis in cases where structural ambiguity is pervasive.

A hybrid approach combining linguistic knowledge and statistical information could help resolve such ambiguous sentences, which are more frequent than we usually think.
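A hypothetical sketch of what such a hybrid could look like (this is not Bitext's actual implementation): the hand-written grammar proposes all structurally valid analyses of a classic PP-attachment ambiguity, and corpus-derived attachment counts break the tie.

```python
# Hybrid disambiguation sketch for "saw the man with the telescope".
# The grammar licenses both attachments; statistics pick the likelier one.

# Analyses licensed by the (hypothetical) grammar.
analyses = [
    {"label": "VP attachment (saw ... with the telescope)",
     "attachments": [("with", "saw")]},
    {"label": "NP attachment (the man with the telescope)",
     "attachments": [("with", "man")]},
]

# What the linguistic rules alone cannot provide: how often this
# preposition attaches to each head in a (hypothetical) corpus.
attachment_counts = {("with", "saw"): 120, ("with", "man"): 35}
total = sum(attachment_counts.values())

def score(analysis):
    """Weight an analysis by the relative frequency of its attachments."""
    s = 1.0
    for att in analysis["attachments"]:
        s *= attachment_counts.get(att, 0) / total
    return s

best = max(analyses, key=score)
```

The grammar guarantees that only linguistically valid structures compete; the statistics only choose among them, which is the division of labor the paragraph above argues for.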
