In recent years, word embeddings have become the de facto input layer for virtually all AI-based NLP tasks. While they have undoubtedly allowed text-based AI to advance significantly, not much effort has gone towards examining the limitations of using them in production environments.
In this two-part series, we look at some of the challenges we face with commonly used word embedding models, and how incorporating external linguistic knowledge helps to address those issues.
Word embeddings are numerical representations of text (in this case, vectors). They are needed because AI models must be trained on numbers. These vectors are not randomly assigned; they are generated from the contexts in which each word is typically used.
That is to say, words used in similar contexts are represented by similar vectors, so two words such as red and yellow will have close vector representations.
This property is quite useful for machine learning algorithms because it allows models to generalize much more easily. If a model is trained on a single word, words with similar vectors will be understood by the machine in a similar way.
The so-called ‘quality’ of a word embedding model is often measured according to its performance when dealing with word analogies. An efficient system should recognize that the word queen is related to the word king the same way as the word woman is related to the word man.
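The analogy test can be sketched with simple vector arithmetic. The snippet below is a minimal illustration, not the evaluation procedure of any particular model: the two-dimensional vectors are invented for the example, and the names toy_vectors and solve_analogy are hypothetical.

```python
import math

# Toy 2-D vectors invented for illustration; real embeddings have hundreds
# of dimensions and are learned from text.
toy_vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' using the offset b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words, then pick the nearest remaining word.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

answer = solve_analogy("man", "king", "woman", toy_vectors)
print(answer)  # prints "queen" for these toy vectors
```

With real embeddings the same offset is computed over the whole vocabulary, and the model is scored on how often the nearest word is the expected one.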
While these analogies may reflect the principles of distributional semantics, they are not a good illustration of how word embeddings perform in practical contexts. Some linguistic phenomena pose major challenges for word embeddings. Here we explain two of them: homographs and inflection.
Current word embedding algorithms tend to identify synonyms efficiently. As a result, the vectors for the words house and home have a cosine similarity of 0.63, which means that they are alike to some extent.
By the same logic, the vectors for like and love are expected to be similar too. Nevertheless, they show a cosine similarity of just 0.41, which is surprisingly low. That’s because the word like is not only a verb but also a preposition, an adverb, and even a noun. In other words, these different uses of like are homographs: different words sharing the same spelling.
Since the model has no way to distinguish between these identically spelled words, the vector for like must cover all the contexts in which the word appears, resulting in an average across all of them. That’s why the vector for like is not as close to love as expected.
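The averaging effect can be illustrated with a small sketch. Everything here is an assumption made for the example: the sense vectors are invented, the 50/50 weighting is arbitrary, and a real training algorithm does not literally average per-sense vectors. The point is only that blending a verb-like vector with a preposition-like one pulls the result away from love.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def blend(vectors, weights):
    """Weighted average of several vectors (weights should sum to 1)."""
    return [sum(w * v[i] for v, w in zip(vectors, weights))
            for i in range(len(vectors[0]))]

# Invented toy vectors: the verb sense of "like" sits near "love",
# the preposition sense sits elsewhere.
love = [0.9, 0.1]
like_verb = [0.8, 0.2]
like_prep = [0.1, 0.9]

# A single vector for the spelling "like" has to cover both uses.
like_blended = blend([like_verb, like_prep], [0.5, 0.5])

sim_verb_only = cosine(love, like_verb)
sim_blended = cosine(love, like_blended)
print(round(sim_verb_only, 2), round(sim_blended, 2))
```

In this toy setup the verb-only vector is much closer to love than the blended one, which mirrors the drop described above.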
In practice, this can significantly affect the performance of ML systems, and it poses a potential problem for conversational agents and text classifiers.
Solution: Train word embedding models on text preprocessed with part-of-speech tagging. In the Bitext token+POS model, the two verbs like and love have a cosine similarity of 0.72. Here, a POS-tagging tool distinguishes homographs by separating their different behaviors according to word class.
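As a rough sketch of this kind of preprocessing, each word can be rewritten as a combined token+POS string before training, so that the verb and the preposition become separate vocabulary entries. The token_POS format, the tag names, and the function name to_pos_tokens are assumptions for illustration; the tags are written by hand here, whereas in practice they would come from a POS tagger.

```python
def to_pos_tokens(tagged_sentence):
    """Turn (token, POS) pairs into 'token_POS' strings for training."""
    return [f"{token.lower()}_{pos}" for token, pos in tagged_sentence]

# Hand-tagged example sentences (hypothetical tagger output).
tagged_corpus = [
    [("I", "PRON"), ("like", "VERB"), ("tea", "NOUN")],
    [("It", "PRON"), ("looks", "VERB"), ("like", "ADP"), ("rain", "NOUN")],
]

training_corpus = [to_pos_tokens(sentence) for sentence in tagged_corpus]

for sentence in training_corpus:
    print(" ".join(sentence))
```

The resulting lists of strings can be passed to an embedding trainer in place of the raw words, so like_VERB and like_ADP each get their own vector.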
Another problem that challenges standard word embeddings is word inflection (the alteration of a word to express different grammatical categories). The verbs find and locate, for instance, have a similarity of 0.68, almost as close as expected.
However, if the inflected forms of those verbs (past tense or participle, for example) are compared, an unexpectedly low similarity of 0.42 between found and located comes up. That’s because some inflected forms appear less frequently than others in certain contexts.
As a result, the algorithm has fewer examples of those ‘less common’ words in context to learn from, which leads to ‘less similar’ vectors. A far bigger issue emerges in languages with a greater degree of inflection.
While English verbs typically have at most 5 different forms (e.g., go, goes, went, gone, going), Spanish verbs present over 50 inflections and Finnish over 500. No matter how large the training data is, there will not be enough examples of the ‘less common’ forms to help the algorithm generate useful vectors.
Solution: Train word embedding models on text preprocessed with lemmatization. In the Bitext token+lemma+POS model, found_find_VERB and located_locate_VERB have a cosine similarity of 0.72. Here, the Bitext lemmatizer helps alleviate the shortage of sample data by mapping all the different forms of a word to their canonical lemma (root).
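The token_lemma_POS strings quoted above suggest a preprocessing step along the following lines. This is only a sketch under stated assumptions: the analyzed tuples are written by hand rather than produced by a lemmatizer, the helper name to_training_token is hypothetical, and the post does not describe how the model uses the lemma internally. The counts simply show how grouping by lemma pools evidence that is spread thinly across surface forms.

```python
from collections import Counter

# Hand-written (surface form, lemma, POS) triples standing in for the
# output of a lemmatizer and POS tagger.
analyzed = [
    ("found", "find", "VERB"),
    ("finds", "find", "VERB"),
    ("find", "find", "VERB"),
    ("located", "locate", "VERB"),
    ("locating", "locate", "VERB"),
]

def to_training_token(token, lemma, pos):
    """Build a 'token_lemma_POS' string such as 'found_find_VERB'."""
    return f"{token}_{lemma}_{pos}"

training_tokens = [to_training_token(*triple) for triple in analyzed]

# Each surface form is seen once, but the lemma pools all of its forms.
surface_counts = Counter(token for token, _, _ in analyzed)
lemma_counts = Counter(lemma for _, lemma, _ in analyzed)

print(training_tokens)
print(surface_counts["found"], lemma_counts["find"])  # prints "1 3"
```

For highly inflected languages the gap between the two counts grows quickly, which is where the lemma information matters most.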
These are not the only linguistic challenges this approach must face. Our next post covers more challenges and solutions for word embeddings.