It is always important to evaluate the quality of your chatbots and conversational agents in order to know their real health, accuracy and efficiency.

Chatbot accuracy can only be improved by constantly evaluating the bot and retraining it with new data that reflects your customers' queries.

Chatbots require large amounts of training data to perform correctly. If you want your chatbot to recognize a specific intent, you need to provide a large number of sentences that express that intent, usually written by hand. This manual process is slow and error-prone, and can lead to poor results.

How can we solve this problem?

With artificially-generated data. Since Dialogflow is one of the most popular chatbot-building platforms, we chose to perform our tests using it.
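
To ground this in code: the sketch below shows how an intent and its hand-written training phrases might be created programmatically with the Dialogflow ES Python client (`google-cloud-dialogflow`). The project ID, intent name and sentences are placeholders, not part of our benchmark.

```python
from google.cloud import dialogflow

def create_intent(project_id: str, display_name: str, sentences: list[str]) -> None:
    """Create a Dialogflow ES intent whose training phrases are the given sentences."""
    client = dialogflow.IntentsClient()
    parent = dialogflow.AgentsClient.agent_path(project_id)

    training_phrases = [
        dialogflow.Intent.TrainingPhrase(
            parts=[dialogflow.Intent.TrainingPhrase.Part(text=sentence)]
        )
        for sentence in sentences
    ]
    intent = dialogflow.Intent(display_name=display_name,
                               training_phrases=training_phrases)
    client.create_intent(request={"parent": parent, "intent": intent})

# Hypothetical example: one lighting intent with a few hand-written variants.
create_intent(
    "my-gcp-project",  # placeholder project ID
    "turn_on_light",   # placeholder intent name
    ["Turn on the light", "Switch the lights on", "Lights on, please"],
)
```

With only a handful of sentences like these, the intent will match near-identical phrasings but miss many natural variants, which is exactly the gap our tests quantify.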

We tested how Dialogflow can benefit from the Artificial Training Data approach, comparing chatbots trained with hand-tagged sentences against chatbots trained with automatically-generated data. Our tests show that if we train bots with only 2 or 3 example sentences per intent in Dialogflow, performance suffers. Even with 10 sentences per intent, there is only minimal improvement.

On the other hand, extending these hand-tagged corpora with additional variants automatically generated by our Artificial Training Data service yields a clear overall improvement in chatbot accuracy.
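
As a rough illustration of that augmentation step, the sketch below appends generated variants to an existing intent's training phrases via the same client library. The intent resource name and the variant list are placeholders, and the call to our service that actually produces the variants is not shown here.

```python
from google.cloud import dialogflow

def extend_intent(intent_name: str, generated_sentences: list[str]) -> None:
    """Append automatically generated variants to an existing intent."""
    client = dialogflow.IntentsClient()
    # Fetch the full intent, including its current training phrases.
    intent = client.get_intent(request={
        "name": intent_name,  # e.g. "projects/<project>/agent/intents/<intent-id>"
        "intent_view": dialogflow.IntentView.INTENT_VIEW_FULL,
    })
    for sentence in generated_sentences:
        intent.training_phrases.append(
            dialogflow.Intent.TrainingPhrase(
                parts=[dialogflow.Intent.TrainingPhrase.Part(text=sentence)]
            )
        )
    client.update_intent(request={"intent": intent})

# Placeholder variants standing in for the service's output.
extend_intent(
    "projects/my-gcp-project/agent/intents/1234-abcd",  # placeholder resource name
    ["Could you switch the light on?", "Please turn the lights on"],
)
```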

We carried out two different tests (A and B), both using the same 5 intents related to house lighting management.

In the first test (A), we trained two different bots:

  • Bot A1 was trained with only 12 hand-tagged sentences (2 to 3 sentences per intent).
  • Using those 12 sentences as input, our Bitext Artificial Training Data service generated 391 additional sentences which, combined with the original 12, were used to train a second bot, A2 (around 80 sentences per intent).

The second test (B) was very similar; the only difference was the size of the training sets:

  • Bot B1 was trained with a hand-tagged training set of 50 sentences (10 per intent).
  • Using those 50 sentences as input, the service generated 798 additional sentences which, combined with the original 50, were used to train a second bot, B2 (around 170 sentences per intent).

In both tests, we used the same evaluation set of 100 sentences; a minimal sketch of how such an evaluation loop can be run is shown below.
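
The following is a hedged sketch, using the Dialogflow ES Python client, of how labeled evaluation sentences can be sent to an agent to measure intent-detection accuracy. The project ID, intent names and evaluation pairs are placeholders; measuring slot-filling accuracy would additionally require comparing `query_result.parameters` against expected values.

```python
import uuid
from google.cloud import dialogflow

def intent_accuracy(project_id: str, eval_set: list[tuple[str, str]]) -> float:
    """Return the fraction of evaluation sentences mapped to their expected intent."""
    client = dialogflow.SessionsClient()
    session = client.session_path(project_id, uuid.uuid4().hex)

    correct = 0
    for sentence, expected_intent in eval_set:
        query_input = dialogflow.QueryInput(
            text=dialogflow.TextInput(text=sentence, language_code="en")
        )
        response = client.detect_intent(
            request={"session": session, "query_input": query_input}
        )
        if response.query_result.intent.display_name == expected_intent:
            correct += 1
    return correct / len(eval_set)

# Placeholder pairs; in our tests the evaluation set held 100 such sentences.
eval_set = [
    ("Turn off the kitchen light", "turn_off_light"),
    ("Dim the living room lamp", "dim_light"),
]
print(f"Intent-detection accuracy: {intent_accuracy('my-gcp-project', eval_set):.0%}")
```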

In both tests, we observed a significant improvement, with chatbot accuracy reaching at least 90% in both intent detection and slot filling. Do you want to see the results for yourself? Download our Dialogflow Full Benchmark Dataset now.

The Bitext Artificial Training Data service lets you create big training sets with no effort. Even if you only write one or two sentences per intent, our service can generate the rest of the variants needed to go from poor results to great chatbot accuracy.

If you would like further details, you can check some additional tools here.
