Bitext’s LLM Evaluation Methodology for Conversational AI

Bitext’s methodology evaluates your conversational AI without requiring historical data or manual tagging of evaluation data. The process is based on the generation (NLG) of custom evaluation datasets, pre-tagged with intent information and linguistic features.




Bitext performs evaluation tasks for any NLU engine on the market, testing accuracy across different metrics according to the user profile. Bitext’s LLM evaluation methodology measures how a conversational AI performs throughout its lifespan, from deployment to retirement, across all changes and updates.

Our main advantage is that Bitext automates most steps in the evaluation pipeline, including the generation of an evaluation dataset, which is a critical step in the absence of historical evaluation data.

This semi-supervised process is based on standard accuracy metrics (such as the F1-score, which combines precision and recall). The analysis of these metrics is then compiled in a report highlighting strengths and weaknesses, both at the bot level and at the intent level.
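As a point of reference, the F1-score mentioned above can be computed per intent from the counts of correct and incorrect predictions. The sketch below is illustrative only; the function name and sample labels are assumptions, not part of Bitext's tooling.

```python
# Minimal sketch: per-intent precision, recall, and F1,
# given gold (pre-tagged) vs. predicted intent labels.
def f1_for_intent(gold, predicted, intent):
    tp = sum(1 for g, p in zip(gold, predicted) if g == intent and p == intent)
    fp = sum(1 for g, p in zip(gold, predicted) if g != intent and p == intent)
    fn = sum(1 for g, p in zip(gold, predicted) if g == intent and p != intent)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = ["refund", "refund", "cancel", "refund"]
pred = ["refund", "cancel", "cancel", "refund"]
print(round(f1_for_intent(gold, pred, "refund"), 2))  # prints 0.8
```

In practice a library implementation (e.g. scikit-learn's `f1_score`) would be used, but the arithmetic is exactly this.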

The process combines software tools, evaluation data, and expert insights into a single methodology that is transparent and easy to explain to end users.

The Evaluation Dataset for Fine-tuning LLMs

Data & Flags

The key to this process is a rich proprietary dataset designed for evaluation that contains thousands of utterances per intent. These utterances are tagged with intent information, so there is no need to manually tag them.

These utterances are also categorized with flags according to their linguistic features:

  • Language register: colloquial, formal…
  • Regional variant: UK/US English; Spain/Mexico Spanish; Canada/France French …
  • And more: offensive language, spelling errors, punctuation errors…

These flags are key to automatically evaluating the chatbot's accuracy in different use environments; they help the chatbot perform seamlessly with users of virtually any demographic.
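A flag-filtered evaluation might look like the sketch below. The record fields and flag values are illustrative assumptions, not Bitext's actual schema; the point is that pre-tagged flags let you slice the evaluation set by user profile without any manual tagging.

```python
# Illustrative representation of pre-tagged evaluation utterances.
from dataclasses import dataclass, field

@dataclass
class EvalUtterance:
    text: str
    intent: str                               # pre-tagged intent label
    flags: set = field(default_factory=set)   # linguistic-feature flags

dataset = [
    EvalUtterance("i wanna get my money back", "refund", {"colloquial", "US"}),
    EvalUtterance("I would like to request a refund.", "refund", {"formal", "UK"}),
    EvalUtterance("giv me a refund", "refund", {"colloquial", "spelling_error"}),
]

# Select only colloquial utterances to measure accuracy for informal users.
colloquial = [u for u in dataset if "colloquial" in u.flags]
print(len(colloquial))  # prints 2
```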


The Evaluation Methodology

The evaluation methodology is built on an iterative process: train the conversational AI model, evaluate performance, retrain, and re-measure. This iterative process provides systematic performance improvements, typically starting at 60% understanding out-of-the-box and reaching up to 90% in a few months.


The evaluation system is designed as a continuous improvement process that is implemented in cycles:


  • Define evaluation target: 60% to 90% understanding
  • Select training dataset
  • Train conversational AI model
  • Select evaluation dataset
  • Evaluate trained conversational AI model
  • Identify accuracy gaps
  • Identify problems and fixes
  • Re-train with new fixes
  • Re-evaluate to measure improvements
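The cycle above can be sketched as a simple loop. The function names (`train`, `evaluate`, `apply_fixes`) are hypothetical stand-ins, and the accuracy figures below are simulated purely to illustrate the 60%-to-90% trajectory described above.

```python
# Schematic of the continuous-improvement cycle: train, evaluate,
# identify gaps, apply fixes, and repeat until the target is reached.
TARGET = 0.90  # upper end of the 60%-90% understanding range

def run_cycles(train, evaluate, apply_fixes, dataset, max_cycles=5):
    history = []
    for _ in range(max_cycles):
        model = train(dataset)                 # train conversational AI model
        accuracy, gaps = evaluate(model)       # evaluate against the eval dataset
        history.append(accuracy)
        if accuracy >= TARGET:                 # evaluation target met
            break
        dataset = apply_fixes(dataset, gaps)   # re-train with new fixes
    return history

# Simulated run: each retraining closes some gaps, raising accuracy.
accuracies = iter([0.60, 0.72, 0.83, 0.91])
history = run_cycles(
    train=lambda data: "model",
    evaluate=lambda model: (next(accuracies), ["gap"]),
    apply_fixes=lambda data, gaps: data,
    dataset=[],
)
print(history)  # [0.6, 0.72, 0.83, 0.91]
```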

Access to Our Repositories

You can access our GitHub repository and Hugging Face dataset.
