Have you ever wondered how much data you store on your devices every day? We spend our days looking for information, whether for business or personal purposes, and creating documents. How many files do we accumulate in a year?
Now think about it from a company's perspective: hundreds of employees collecting documents every day. Can you imagine how arduous it would be to find all the documents related to a particular topic? It seems nearly impossible.
To extract information from the data and to help organize and classify text documents, we need to rely on specialized tools, such as those used in text mining.
One of those is Topic Modeling, a statistical machine learning technique that extracts the main topics from a large collection of text files.
This technique has many useful applications, such as:
-Discovering hidden topic patterns across a given group of documents.
-Using the extracted topics to classify or group an entire collection of files.
-Summarizing large texts by using the main topics found by the Topic Modeling algorithm.
However, the model on its own has limitations, which stem from the complexity of the analysis and from ignoring linguistics, the science that studies language. It therefore becomes necessary to use complementary techniques such as lemmatization, which reduces each word to its dictionary form. In previous posts, we described how useful lemmatization can be for search, which is why at Bitext we decided to run an experiment to see whether it can also enhance Topic Modeling results.
In this experiment, we analyzed the impact of lemmatization on both Non-negative Matrix Factorization (NMF) and Latent Dirichlet Allocation (LDA) models.
Other details of the setup to keep in mind:
-We ran the analysis on a corpus of documents extracted from the public 20 Newsgroups dataset.
-The analysis was run on three copies of the same training set:
• One without pre-processing
• One pre-processed with stemming
• One pre-processed with lemmatization
-To evaluate the results, we considered:
• Readability of the top word list for each topic
• How well the resulting topics matched the original newsgroups
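To make the three copies of the training set concrete, here is a minimal sketch of how they could be built. The suffix list, the lemma table and the function names are all hypothetical stand-ins invented for this illustration; they are not the stemmer or the lemmatizer used in the experiment.

```python
import re

# Hypothetical toy resources: a real experiment would use a proper stemmer
# and a full lemmatization lexicon instead of these hand-written stand-ins.
TOY_SUFFIXES = ("ing", "ies", "es", "ed", "s")
TOY_LEMMAS = {
    "studies": "study",
    "studied": "study",
    "studying": "study",
    "games": "game",
    "played": "play",
    "playing": "play",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in TOY_SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def toy_lemmatize(token):
    """Look the token up in the lemma table; unknown tokens pass through."""
    return TOY_LEMMAS.get(token, token)


def build_copies(documents):
    """Return the three versions of the corpus described in the list above."""
    tokenized = [tokenize(doc) for doc in documents]
    return {
        "raw": tokenized,
        "stemmed": [[toy_stem(t) for t in doc] for doc in tokenized],
        "lemmatized": [[toy_lemmatize(t) for t in doc] for doc in tokenized],
    }


docs = ["She studies the games he played.", "Studying and playing games."]
copies = build_copies(docs)
```

With these toy rules, the stemmed copy contains "stud" and "gam", while the lemmatized copy contains "study" and "game". Each copy could then be vectorized and passed to an NMF or LDA implementation of your choice.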
Readability of topic terms:
As we can see in the image, both stemming and lemmatization give better results by removing semantic duplicates. This lets us show the user more distinct words related to the topic, giving them a better understanding of it. However, stemming adds noise to the results, because it includes stems that are not real words.
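The effect of semantic duplicates on a top word list can be sketched in a few lines. The topic weights and the lemma table below are made up for the example, and the code merges word forms after the fact only to show the effect; in the experiment, lemmatization is applied as pre-processing, before the model is trained.

```python
from collections import defaultdict

# Hypothetical topic: (word, weight) pairs as a topic model might rank them.
TOPIC_WEIGHTS = [
    ("game", 0.30), ("games", 0.25), ("team", 0.20), ("teams", 0.15),
    ("played", 0.12), ("season", 0.10), ("play", 0.08), ("hockey", 0.05),
]

# Hypothetical lemma table standing in for a real lemmatizer.
TOY_LEMMAS = {"games": "game", "teams": "team", "played": "play"}


def top_words(weights, n):
    """Return the n highest-weighted words."""
    ranked = sorted(weights, key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:n]]


def merge_by_lemma(weights, lemmas):
    """Sum the weights of all word forms that share a lemma."""
    merged = defaultdict(float)
    for word, weight in weights:
        merged[lemmas.get(word, word)] += weight
    return list(merged.items())


before = top_words(TOPIC_WEIGHTS, 4)
after = top_words(merge_by_lemma(TOPIC_WEIGHTS, TOY_LEMMAS), 4)
```

Here `before` holds only two distinct concepts in its four slots ("game", "games", "team", "teams"), while `after` holds four ("game", "team", "play", "season").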
To complete the analysis, we also measured the increase in the number of topics matching newsgroups when lemmatization is used.
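This post does not spell out how a topic is judged to match a newsgroup. One plausible way, assumed here purely for illustration, is to assign each document to its dominant topic and call a topic a match when most of its documents come from a single newsgroup. The function name, the 0.75 threshold and the sample assignments below are all hypothetical.

```python
from collections import Counter, defaultdict


def matching_topics(doc_topics, doc_labels, threshold=0.75):
    """Map each topic to a newsgroup when one newsgroup dominates it.

    doc_topics: dominant topic id of each document (hypothetical input).
    doc_labels: original newsgroup of each document.
    threshold: assumed minimum share of documents from one newsgroup.
    """
    labels_by_topic = defaultdict(list)
    for topic, label in zip(doc_topics, doc_labels):
        labels_by_topic[topic].append(label)

    matches = {}
    for topic, labels in labels_by_topic.items():
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= threshold:
            matches[topic] = label
    return matches


# Made-up assignments for six documents, not results from the experiment.
labels = ["rec.autos", "rec.autos", "rec.autos",
          "sci.space", "sci.space", "sci.space"]
run_a = [0, 0, 1, 1, 0, 1]  # topics mix documents from both newsgroups
run_b = [0, 0, 0, 1, 1, 1]  # each topic lines up with one newsgroup

matches_a = matching_topics(run_a, labels)
matches_b = matching_topics(run_b, labels)
```

Comparing `len(matches_a)` with `len(matches_b)` gives the kind of "number of matching topics" figure that can be tracked across pre-processing variants.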
The conclusion of our experiment, as you can see, is that in both cases lemmatization improves the results obtained with Topic Modeling algorithms. Companies that use this approach to organize a large collection of documents, or to extract its main topics, will see better results if they pre-process the files with lemmatization.
If you want to see more results and take a look at the whole experiment, download our whitepaper.