Bitext LLMs Evaluation Methodology for Conversational AI

The Bitext methodology evaluates your conversational AI without requiring historical data or manual tagging of evaluation data. The process is based on natural language generation (NLG) of custom evaluation datasets, pre-tagged with intent information and linguistic features.




Bitext performs evaluation tasks for any NLU engine on the market, testing accuracy along different metrics according to the user profile. The Bitext LLM evaluation methodology measures how a conversational AI performs throughout its life span, from deployment through subsequent changes and updates.

Our main advantage: Bitext automates most steps in the evaluation pipeline, including the generation of the evaluation dataset, a critical step when no historical evaluation data is available.

This semi-supervised process is based on standard accuracy metrics, such as the F1-score, which combines precision and recall in a single figure. The analysis of these metrics is then compiled into a report highlighting strengths and weaknesses, both at the bot level and at the intent level.
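To make the intent-level and bot-level figures concrete, here is a minimal sketch of how per-intent precision, recall and F1 could be computed from gold and predicted intent labels. The function names, the intent labels and the choice of a macro average for the bot-level score are illustrative assumptions, not Bitext's actual tooling.

```python
from collections import Counter


def per_intent_scores(gold, predicted):
    """Return {intent: (precision, recall, f1)} for two parallel label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted intent was wrong
            fn[g] += 1  # the gold intent was missed
    scores = {}
    for intent in set(gold) | set(predicted):
        predicted_count = tp[intent] + fp[intent]
        gold_count = tp[intent] + fn[intent]
        precision = tp[intent] / predicted_count if predicted_count else 0.0
        recall = tp[intent] / gold_count if gold_count else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[intent] = (precision, recall, f1)
    return scores


def macro_f1(scores):
    """Bot-level score: unweighted mean of the per-intent F1 values (assumed)."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Hypothetical labels, for illustration only.
gold = ["cancel_order", "cancel_order", "track_order", "track_order"]
predicted = ["cancel_order", "track_order", "track_order", "track_order"]

scores = per_intent_scores(gold, predicted)
for intent, (p, r, f1) in sorted(scores.items()):
    print(f"{intent}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"bot level (macro F1): {macro_f1(scores):.2f}")
```

In this toy run, the intent-level view shows that cancel_order suffers from low recall while track_order suffers from lower precision, which a single bot-level number would hide.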

The process combines software tools, evaluation data and expert insights in a single methodology that is transparent and easy to explain to end users.

The Evaluation Dataset for Fine-tuning LLMs

Data & Flags

The key to this process is a rich proprietary dataset designed for evaluation that contains thousands of utterances per intent. These utterances are tagged with intent information, so there is no need to manually tag them.

These utterances are also categorized with flags according to their linguistic features:

  • Language register: colloquial, formal…
  • Regional variant: UK/US English; Spain/Mexico Spanish; Canada/France French …
  • And more: offensive language, spelling errors, punctuation errors…

These flags are key to automatically evaluating the accuracy of the chatbot in different usage contexts: how will the chatbot perform when users are younger, or when they come from different geographic areas or locales?
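As an illustration of how flags allow accuracy to be broken down by usage context, the sketch below groups evaluation results by flag. The record layout (utterance, intent, flags), the flag names and the keyword-based predictor are hypothetical stand-ins, not the actual Bitext dataset schema or any real NLU engine.

```python
from collections import defaultdict


def accuracy_by_flag(examples, predict):
    """Accuracy of `predict` on the subset of examples carrying each flag."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for example in examples:
        hit = predict(example["utterance"]) == example["intent"]
        for flag in example["flags"]:
            total[flag] += 1
            correct[flag] += int(hit)
    return {flag: correct[flag] / total[flag] for flag in total}


def toy_predict(utterance):
    """Stand-in for an NLU engine: a naive keyword rule."""
    return "cancel_order" if "cancel" in utterance.lower() else "track_order"


# Hypothetical pre-tagged evaluation records.
examples = [
    {"utterance": "I would like to cancel my order",
     "intent": "cancel_order", "flags": {"formal"}},
    {"utterance": "gotta cancel my order asap",
     "intent": "cancel_order", "flags": {"colloquial"}},
    {"utterance": "wanna cancle my ordr",
     "intent": "cancel_order", "flags": {"colloquial", "spelling_errors"}},
    {"utterance": "Where is my parcel?",
     "intent": "track_order", "flags": {"formal"}},
]

results = accuracy_by_flag(examples, toy_predict)
for flag, accuracy in sorted(results.items()):
    print(f"{flag}: {accuracy:.0%}")
```

Here the toy predictor handles formal utterances well but fails on the misspelled one, which is the kind of gap a per-flag breakdown is meant to surface.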


The Evaluation Methodology

The evaluation methodology is built on an iterative process: train the conversational AI model, evaluate its performance, retrain, and measure performance again. This process delivers systematic performance improvements, typically starting at around 60% accuracy out of the box and rising to 90% within a few months.


The evaluation system is designed as a continuous improvement process implemented in cycles:


  • define evaluation target: 60% to 90%
  • select training dataset
  • train conversational AI model
  • select evaluation dataset
  • evaluate trained conversational AI model
  • identify accuracy gaps
  • identify problems and fixes
  • re-train with the new fixes
  • re-evaluate to measure improvements
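The cycle above can be sketched as a simple loop. This is only an outline under stated assumptions: `train`, `evaluate` and `collect_fixes` are hypothetical callables supplied by the user, the overall score is taken as the mean of per-intent accuracies, and the toy word-lookup "model" exists solely so the sketch runs end to end.

```python
def run_cycles(train, evaluate, collect_fixes, training_data, eval_data,
               target=0.90, max_cycles=5):
    """Repeat train -> evaluate -> fix until the target is reached."""
    history = []
    for cycle in range(1, max_cycles + 1):
        model = train(training_data)
        per_intent = evaluate(model, eval_data)  # {intent: accuracy}
        overall = sum(per_intent.values()) / len(per_intent)
        # Accuracy gaps: intents still below the target.
        gaps = sorted(i for i, acc in per_intent.items() if acc < target)
        history.append((cycle, overall, gaps))
        if overall >= target:
            break
        # Re-train in the next cycle with fixes for the weak intents.
        training_data = training_data + collect_fixes(gaps)
    return history


# Toy stand-ins (assumptions): the "model" maps each training word to an intent.
def toy_train(data):
    return {word: intent for utterance, intent in data
            for word in utterance.split()}


def toy_predict(model, utterance):
    for word in utterance.split():
        if word in model:
            return model[word]
    return None


def toy_evaluate(model, eval_data):
    hits, totals = {}, {}
    for utterance, intent in eval_data:
        totals[intent] = totals.get(intent, 0) + 1
        hits[intent] = hits.get(intent, 0) + int(toy_predict(model, utterance) == intent)
    return {intent: hits[intent] / totals[intent] for intent in totals}


EXTRA_TRAINING = {"track_order": [("track my parcel", "track_order")]}


def toy_fixes(gaps):
    return [pair for intent in gaps for pair in EXTRA_TRAINING.get(intent, [])]


training = [("cancel my order", "cancel_order")]
evaluation = [("please cancel it", "cancel_order"),
              ("where is my parcel", "track_order")]

history = run_cycles(toy_train, toy_evaluate, toy_fixes, training, evaluation)
for cycle, overall, gaps in history:
    print(f"cycle {cycle}: overall={overall:.0%} gaps={gaps}")
```

Keeping a history of each cycle's score and remaining gaps makes it easy to report how performance moved from the starting point toward the target.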

Access to Our Repositories

You can access our GitHub repository and our Hugging Face dataset.