The use of word embeddings has become the standard approach for dealing with text input in AI models.
While extensive research has been carried out in recent years to analyze the theoretical underpinnings of algorithms such as word2vec, fastText and BERT, surprisingly little has been done to solve some of the more complex linguistic issues that arise when these models are put into practice.
In our previous post, ‘Main Challenges for Word Embeddings: Part I’, we described two main challenges posed by linguistic phenomena such as homographs and inflection. In this post, we will discuss additional problems that can be easily overcome thanks to linguistic resources.
While the issues described in our previous post are problematic, the next one is particularly risky. Word embedding algorithms generate vectors based on the contexts in which words appear, and words with opposite meanings tend to appear in very similar contexts. As a result, these algorithms also tend to generate similar vectors for antonyms.
It is not unusual, for instance, for love and adore to have a similarity of 0.72. The problem arises when love and hate show a similarity almost as high, at 0.62.
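Scores like these are usually cosine similarities between word vectors. The sketch below shows that computation, assuming cosine similarity is the measure behind the figures quoted here (the post does not name it). The three-dimensional vectors are made up for illustration and do not come from any trained model.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (hypothetical values, for illustration only).
toy_vectors = {
    "love": [0.9, 0.8, 0.1],
    "adore": [0.8, 0.9, 0.2],
    "hate": [0.9, 0.7, -0.6],
}

for other in ("adore", "hate"):
    score = cosine_similarity(toy_vectors["love"], toy_vectors[other])
    print(f"love vs. {other}: {score:.2f}")
```

With vectors from a real model, the antonym pair can score almost as high as the synonym pair, because both words occur in similar contexts.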
This issue may cause disastrous effects when dealing with text classification in sentiment analysis or conversational agents. On the one hand, a system that cannot properly distinguish between good and bad will not analyze user reviews correctly. On the other hand, a home automation system considering up and down to be the same will have many problems when controlling a thermostat.
Solution: Bitext addresses this problem with lexical knowledge. The solution relies on reliable identification of antonyms and synonyms during the preprocessing stage.
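The post does not describe how this identification is implemented, so the following is only a rough sketch of one possible preprocessing step, not Bitext's method. It assumes a small hand-written antonym lexicon (`ANTONYM_PAIRS`, hypothetical) and lists the antonym pairs found in a piece of text so that a downstream component can keep them apart.

```python
# Hypothetical hand-written lexicon; a real system would rely on a
# curated lexical resource rather than a hard-coded set.
ANTONYM_PAIRS = {
    frozenset(("love", "hate")),
    frozenset(("good", "bad")),
    frozenset(("up", "down")),
}


def are_antonyms(a, b):
    """Return True if the two words form a known antonym pair."""
    return frozenset((a.lower(), b.lower())) in ANTONYM_PAIRS


def antonym_pairs_in(tokens):
    """List the known antonym pairs among the distinct tokens given."""
    vocabulary = sorted({t.lower() for t in tokens})
    return [
        (a, b)
        for i, a in enumerate(vocabulary)
        for b in vocabulary[i + 1:]
        if are_antonyms(a, b)
    ]


print(antonym_pairs_in("turn the heat up not down".split()))
```

A lookup of this kind is independent of the vector space, so it can flag love and hate as opposites even when their vectors are close.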
Phrases, Entities and Expressions
Although some words that are spelled alike can be distinguished by their part of speech, the issue of polysemy (same spelling and POS but different meaning) remains largely unsolved.
A good illustration is the adjective social, whose meaning depends on the context, as in social security and social media.
In token-based word embeddings, social media is not treated as a single token. It is therefore hard for an ML system to compare it to the word Twitter, for instance, even when the vectors for social and media are combined:
- social vs. Twitter: 0.38
- media vs. Twitter: 0.32
- social + media vs. Twitter: 0.42
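The "social + media" row can be reproduced in principle by combining the two word vectors and comparing the result to the vector for Twitter. The sketch below assumes that combining means element-wise addition and that the scores are cosine similarities; neither is stated in the post. The vectors are toy values, not output from a trained model.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def add_vectors(u, v):
    """Combine two word vectors by element-wise addition."""
    return [a + b for a, b in zip(u, v)]


# Toy vectors (hypothetical values, for illustration only).
social = [0.6, 0.1, 0.3]
media = [0.2, 0.7, 0.1]
twitter = [0.5, 0.5, 0.6]

combined = add_vectors(social, media)
print(f"social vs. Twitter: {cosine_similarity(social, twitter):.2f}")
print(f"media vs. Twitter: {cosine_similarity(media, twitter):.2f}")
print(f"social + media vs. Twitter: {cosine_similarity(combined, twitter):.2f}")
```

Because cosine similarity ignores vector length, averaging the two vectors instead of summing them gives the same score.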
Solution: A word embedding model trained on a corpus where all noun phrases are marked as single tokens. In this case, a comparison between social_media_NounPhrase and Twitter_NOUN yields a similarity of 0.68.
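As a rough sketch of what marking noun phrases as single tokens could look like before training, the function below merges known multiword phrases into one token with the `_NounPhrase` suffix used above. The fixed phrase list and the `merge_phrases` helper are hypothetical; in practice the phrases would come from a parser or chunker, and the part-of-speech suffixes on the remaining tokens (such as `_NOUN`) are not shown.

```python
def merge_phrases(tokens, phrases, suffix="NounPhrase"):
    """Merge known multiword phrases into single tokens.

    `phrases` is a set of lowercase token tuples. The longest match at each
    position wins; matching is case-insensitive, original casing is kept.
    """
    max_len = max((len(p) for p in phrases), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        match_len = 0
        # Try the longest candidate span first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(t.lower() for t in tokens[i:i + n]) in phrases:
                match_len = n
                break
        if match_len:
            merged.append("_".join(tokens[i:i + match_len]) + "_" + suffix)
            i += match_len
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical phrase list, for illustration only.
noun_phrases = {("social", "media"), ("social", "security")}

sentence = "She posts on social media and pays social security".split()
print(merge_phrases(sentence, noun_phrases))
```

After this step, social media and social security become two distinct vocabulary items, so each receives its own vector instead of sharing the vector for social.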
Bitext’s approach helps deal not only with polysemy but also with expressions or verb phrases such as you’d better or I’d rather. Bitext’s Entity Extraction tool also helps apply this solution to related entities such as:
- Places: United States, Buenos Aires, New England, New York…
- Companies: Standard & Poor’s, Home Depot…
- Discourse markers: on the one hand, of course, by the way…
- Phrasal verbs: turn on, turn off…
Linguistic knowledge enhances machine learning by applying linguistic solutions to raw data before the data enters the learning scheme.
The study, the evaluation, and the results show that Bitext technology, based on a linguistic approach, can make any vector space model (VSM) perform better in downstream tasks, not only in standardization but also in information extraction and topic detection.