Bitext Bot Evaluation Methodology

The Bitext methodology evaluates your bot without requiring historical data or manual tagging of evaluation data. The process relies on natural language generation (NLG) to produce custom evaluation datasets that are pre-tagged with intent information and linguistic features.




Bitext performs evaluation tasks for any NLU engine on the market, testing accuracy along different metrics according to the user profile. The Bitext bot evaluation methodology measures how a bot performs throughout its life span, from deployment through subsequent changes and updates.

Our main advantage: Bitext automates most steps in the evaluation pipeline, including the generation of the evaluation dataset, a critical step when no historical evaluation data is available.

This semi-supervised process is based on standard accuracy metrics, such as the F1-score, which combines precision and recall into a single figure. The analysis of these metrics is then compiled into a report highlighting strengths and weaknesses, both at the bot level and at the intent level.
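As an illustration of the metrics mentioned above, the following is a minimal sketch of how precision, recall and F1 could be computed per intent and then averaged at the bot level. The function names (`per_intent_f1`, `macro_f1`), the intent labels and the sample data are assumptions made for this example; they are not part of any Bitext tool.

```python
from collections import Counter

def per_intent_f1(gold, predicted):
    """Return {intent: (precision, recall, f1)} from two parallel label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1          # correct prediction for intent g
        else:
            fp[p] += 1          # intent p was predicted but is wrong
            fn[g] += 1          # intent g was expected but missed
    scores = {}
    for intent in set(gold) | set(predicted):
        predicted_total = tp[intent] + fp[intent]
        gold_total = tp[intent] + fn[intent]
        precision = tp[intent] / predicted_total if predicted_total else 0.0
        recall = tp[intent] / gold_total if gold_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[intent] = (precision, recall, f1)
    return scores

def macro_f1(scores):
    """Bot-level score: unweighted mean of the per-intent F1 values."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)

# Hypothetical sample: intent labels from the evaluation data vs. bot output.
gold = ["cancel_order", "cancel_order", "track_order", "track_order"]
predicted = ["cancel_order", "track_order", "track_order", "track_order"]

scores = per_intent_f1(gold, predicted)
for intent, (p, r, f1) in sorted(scores.items()):
    print(f"{intent}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"macro F1: {macro_f1(scores):.2f}")
```

In this sample, the per-intent figures show where the bot is weak (one `cancel_order` utterance is missed), while the macro average gives a single bot-level number.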

The process combines software tools, evaluation data and expert insights in a single methodology. This methodology is transparent and easy to explain to end users.

The Evaluation Dataset

Data & Flags

The key to this process is a rich proprietary dataset designed for evaluation that contains thousands of utterances per intent. These utterances are tagged with intent information, so there is no need to manually tag them.

These utterances are also categorized with flags according to their linguistic features:

  • Language register: colloquial, formal…
  • Regional variant: UK/US English; Spain/Mexico Spanish; Canada/France French…
  • And more: offensive language, spelling errors, punctuation errors…

These flags are key to automatically evaluating the accuracy of the chatbot in different usage environments: how will the chatbot perform when users are younger, or when they come from different geographic areas or locales?
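To make the role of flags concrete, here is a minimal sketch of breaking accuracy down by linguistic flag. The record layout (`utterance`, `intent`, `flags`), the flag names and the keyword-based `toy_predict` stand-in for a real NLU engine are all assumptions for this example, not the actual dataset schema.

```python
from collections import defaultdict

def accuracy_by_flag(examples, predict):
    """Return {flag: accuracy} over the examples carrying each flag."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for ex in examples:
        hit = predict(ex["utterance"]) == ex["intent"]
        for flag in ex["flags"]:
            totals[flag] += 1
            correct[flag] += int(hit)
    return {flag: correct[flag] / totals[flag] for flag in totals}

def toy_predict(utterance):
    """Hypothetical stand-in for a bot: naive keyword matching."""
    text = utterance.lower()
    if "cancel" in text and "order" in text:
        return "cancel_order"
    return "track_order"

# Hypothetical pre-tagged evaluation records.
examples = [
    {"utterance": "I want to cancel my order", "intent": "cancel_order",
     "flags": ["formal"]},
    {"utterance": "pls cancel my ordr", "intent": "cancel_order",
     "flags": ["colloquial", "spelling_errors"]},
    {"utterance": "cancel my order mate", "intent": "cancel_order",
     "flags": ["colloquial"]},
    {"utterance": "where is my parcel", "intent": "track_order",
     "flags": ["formal"]},
]

report = accuracy_by_flag(examples, toy_predict)
for flag, acc in sorted(report.items()):
    print(f"{flag}: {acc:.0%}")
```

With this toy data the bot handles formal utterances well but fails on the misspelled one, which is the kind of gap a per-flag breakdown is meant to expose.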


The Evaluation Methodology

The evaluation methodology is built on an iterative process: train the bot, evaluate performance, retrain, and measure performance again. This iterative process provides systematic performance improvements, typically starting at 60% accuracy out of the box and rising to 90% within a few months.


The evaluation system is designed as a continuous improvement process implemented in cycles:


  • define evaluation target: 60% to 90%
  • select training dataset
  • train bot
  • select evaluation dataset
  • evaluate trained bot
  • identify accuracy gaps
  • identify problems and fixes
  • re-train with new fixes and
  • re-evaluate to measure improvements
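The cycle above can be sketched as a simple loop. This is only an outline under stated assumptions: `train`, `evaluate` and `apply_fixes` are hypothetical callables supplied by the user, and the toy versions below (where a "bot" is just the set of intents it was trained on) exist solely to make the sketch runnable.

```python
def improvement_cycle(train, evaluate, apply_fixes,
                      training_data, evaluation_data,
                      target=0.90, max_iterations=10):
    """Train, evaluate, fix and repeat until the target accuracy is reached."""
    history = []
    for _ in range(max_iterations):
        bot = train(training_data)                        # train bot
        accuracy, gaps = evaluate(bot, evaluation_data)   # evaluate, find gaps
        history.append(accuracy)
        if accuracy >= target or not gaps:
            break                                         # target met or nothing to fix
        training_data = apply_fixes(training_data, gaps)  # re-train input
    return history

# Toy stand-ins (assumptions): a "bot" is the set of intents it has seen.
def toy_train(training_data):
    return set(training_data)

def toy_evaluate(bot, evaluation_data):
    hits = [intent for intent in evaluation_data if intent in bot]
    gaps = sorted(set(evaluation_data) - bot)
    return len(hits) / len(evaluation_data), gaps

def toy_apply_fixes(training_data, gaps):
    # Fix one accuracy gap per cycle by adding training data for that intent.
    return training_data + [gaps[0]]

history = improvement_cycle(
    toy_train, toy_evaluate, toy_apply_fixes,
    training_data=["a", "b", "c"],
    evaluation_data=["a", "b", "c", "d", "e"],
    target=0.90,
)
print(history)  # [0.6, 0.8, 1.0]
```

The returned history records accuracy after each cycle, which makes it easy to check whether each re-training round actually improved on the previous one.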

Bitext Methodology Overview


This process makes accuracy grow systematically with each iteration. Typically, a brand-new Bitext bot starts at 60% accuracy out of the box and, through our evaluation and retraining iterations, reaches 90% accuracy within a few months.

Access to Our Repositories

You can access our GitHub repository and Hugging Face dataset.

