In this blog we will discuss three approaches to chatbot evaluation, using:

  1. real world evaluation data
  2. synthetic data
  3. “in scope” or “out of scope” queries

You have a chatbot up and running, offering help to your customers. But how do you know whether the help you are providing is correct? Chatbot evaluation can be complex, especially because it is affected by many factors.

We have gathered some ideas based on our experience in helping our clients improve their bots:

  • If you can get your hands on real-world evaluation data (external datasets pertaining to your domain, with test utterances and their corresponding intent), you have everything you need to carry out a proper performance evaluation.
    We usually compute a confusion matrix that allows us to easily measure chatbot accuracy, precision, and recall (more about these terms here).
    Apart from that, we also measure whether there are cases where the model’s prediction is “unclear” (i.e. the difference in confidence score between the first and second candidates is small). This is often an indication of potential overlap between two intents (or their training utterances). Some bot platforms include tools to help you perform these chatbot evaluations, and there are also third-party model evaluators around the web.
  • If real-world evaluation data is not available, we usually use our own data to build evaluation sets by taking utterances that have not been used for the training set.
    Rather than having a single evaluation set, we construct multiple ones using different modules (core, colloquial, polite…) and carry out multiple evaluation iterations, testing how the bot performs with different language registers.
  • In the chatbot evaluation work we’re doing for end clients, another concept we work with is “in scope” vs. “out of scope” queries – including “out of scope” utterances in the evaluation data is key to identifying both true negatives and false positives. For this, what we often do is take data from our datasets for other industries/verticals (e.g. testing a Banking chatbot using utterances from the Travel industry).
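The confusion-matrix metrics mentioned above can be computed in a few lines of Python. Here is a minimal sketch; the intent names and test labels are made up for illustration:

```python
from collections import Counter

def evaluate(gold, predicted):
    """Compute overall accuracy plus per-intent precision and recall
    from parallel lists of gold and predicted intent labels."""
    assert len(gold) == len(predicted)
    # Confusion counts: (gold_intent, predicted_intent) -> count
    confusion = Counter(zip(gold, predicted))
    intents = sorted(set(gold) | set(predicted))
    accuracy = sum(confusion[(i, i)] for i in intents) / len(gold)
    report = {}
    for intent in intents:
        tp = confusion[(intent, intent)]
        fp = sum(confusion[(g, intent)] for g in intents if g != intent)
        fn = sum(confusion[(intent, p)] for p in intents if p != intent)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        report[intent] = {"precision": precision, "recall": recall}
    return accuracy, report

# Toy example: one "balance" utterance misclassified as "transfer"
gold = ["balance", "balance", "transfer", "transfer", "card_lost"]
pred = ["balance", "transfer", "transfer", "transfer", "card_lost"]
acc, report = evaluate(gold, pred)
print(acc)                 # 0.8
print(report["transfer"])  # precision ~0.667, recall 1.0
```

In practice you would feed in the bot's predictions for your full evaluation set rather than toy lists, and read the per-intent numbers to spot which intents drag accuracy down.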
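Detecting “unclear” predictions can be sketched the same way, assuming your bot platform returns a ranked list of (intent, confidence) candidates; the 0.1 margin below is an arbitrary threshold, not a recommendation:

```python
def flag_unclear(candidates, margin=0.1):
    """Return True when the confidence gap between the top two intent
    candidates is below `margin` (an arbitrary illustrative threshold),
    which often signals overlap between those two intents."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if len(ranked) < 2:
        return False
    return (ranked[0][1] - ranked[1][1]) < margin

# Two intents scoring almost the same -> potential training-data overlap:
print(flag_unclear([("check_balance", 0.46),
                    ("transaction_history", 0.43),
                    ("transfer_money", 0.08)]))          # True
print(flag_unclear([("card_lost", 0.91),
                    ("card_blocked", 0.05)]))            # False
```

Running this over a whole evaluation set and grouping the flagged utterances by their top two intents quickly surfaces the intent pairs worth disentangling.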
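The “out of scope” testing above can also be sketched in code, assuming the bot exposes a top confidence score and rejects queries below some threshold (the 0.5 here is a hypothetical value, and the outcome labels are our own naming):

```python
def classify_outcome(is_in_scope, top_confidence, threshold=0.5):
    """Label one test utterance for out-of-scope evaluation.
    `is_in_scope`: whether the utterance belongs to the bot's domain
    (False for, e.g., a Travel utterance sent to a Banking bot).
    `top_confidence`: the model's best intent confidence.
    An out-of-scope utterance the bot rejects is a true negative;
    one the bot answers anyway is a false positive."""
    answered = top_confidence >= threshold
    if is_in_scope:
        # In-scope utterances are scored against their gold intent separately
        return "answered" if answered else "rejected"
    return "false_positive" if answered else "true_negative"

# A Travel utterance sent to a Banking bot should ideally be rejected:
print(classify_outcome(is_in_scope=False, top_confidence=0.21))  # true_negative
print(classify_outcome(is_in_scope=False, top_confidence=0.83))  # false_positive
```

Counting the false positives over a batch of cross-vertical utterances gives a simple measure of how well the bot declines queries it was never trained to handle.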

All these steps help us measure the usefulness of our chatbots or chatbot training datasets.

You can use any of them to evaluate the Free Dataset we offer, created with our Multilingual Synthetic Data technology, centered on Customer Support: feel free to download it here and give us your feedback!

Download Evaluation Dataset


For more information, visit our website and follow Bitext on Twitter or LinkedIn.

Sharing is caring!