Bitext’s LLM Evaluation Methodology for Conversational AI

Bitext’s methodology evaluates your conversational AI without requiring historical data or manual tagging of evaluation data. The process relies on natural language generation (NLG) to produce custom evaluation datasets that come pre-tagged with intent information and linguistic features.

Overview

Bitext performs evaluation tasks for any NLU engine on the market, testing accuracy with different metrics according to the user profile. Bitext’s LLM evaluation methodology measures how a conversational AI performs over its lifespan, from deployment to retirement, across all changes and updates.

Our main advantage is that Bitext automates most steps in the evaluation pipeline, including the generation of an evaluation dataset, which is a critical step in the absence of historical evaluation data.

This semi-supervised process is based on standard accuracy metrics such as the F1 score, which combines precision and recall into a single figure. The analysis of these metrics is then compiled into a report that highlights strengths and weaknesses at both the bot level and the intent level.
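As an illustration of the metric mentioned above, the following is a minimal sketch of per-intent precision, recall and F1 computed from parallel lists of gold and predicted intents. The function names and the intent labels (cancel_order, track_order) are hypothetical and chosen only for this example; this is not Bitext's own tooling.

```python
from collections import Counter

def per_intent_f1(gold, predicted):
    """Return {intent: (precision, recall, f1)} from parallel label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted intent receives a false positive
            fn[g] += 1  # the gold intent receives a false negative
    scores = {}
    for intent in sorted(set(gold) | set(predicted)):
        pred_total = tp[intent] + fp[intent]
        gold_total = tp[intent] + fn[intent]
        precision = tp[intent] / pred_total if pred_total else 0.0
        recall = tp[intent] / gold_total if gold_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[intent] = (precision, recall, f1)
    return scores

def macro_f1(scores):
    """Unweighted mean of the per-intent F1 values (bot-level summary)."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)

# Hypothetical labels, for illustration only.
gold = ["cancel_order", "cancel_order", "track_order", "track_order"]
predicted = ["cancel_order", "track_order", "track_order", "track_order"]

scores = per_intent_f1(gold, predicted)
for intent, (p, r, f1) in scores.items():
    print(f"{intent}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"macro F1: {macro_f1(scores):.2f}")
```

The per-intent values correspond to an intent-level view and the macro average to a bot-level view. A weighted average could be used instead if intents differ greatly in frequency.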

The process combines software tools, evaluation data and expert insights in a single methodology. This methodology is transparent and easy to explain to end users.

The Evaluation Dataset for Fine-tuning LLMs

Data & Flags

The key to this process is a rich proprietary dataset designed for evaluation that contains thousands of utterances per intent. These utterances are tagged with intent information, so there is no need to manually tag them.

Also, these utterances are categorized with flags according to their linguistic features:

Language register: colloquial, formal…
Regional variant: UK/US English; Spain/Mexico Spanish; Canada/France French …
And more: offensive language, spelling errors, punctuation errors…

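These flags make it possible to automatically evaluate the accuracy of the chatbot in different usage environments, for example by measuring accuracy separately for colloquial and formal utterances. This helps the chatbot perform consistently with users across registers, regional variants and writing habits.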
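One way flag-based evaluation could work is sketched below: accuracy is computed separately for each linguistic flag, so weak segments become visible. The record fields (utterance, intent, flags), the flag names and the keyword-based toy_predict function are all assumptions made for this example and do not reflect the actual schema of the Bitext dataset.

```python
from collections import defaultdict

def accuracy_by_flag(records, predict):
    """Return {flag: accuracy} over all records carrying that flag."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        hit = predict(rec["utterance"]) == rec["intent"]
        for flag in rec["flags"]:
            totals[flag] += 1
            correct[flag] += int(hit)
    return {flag: correct[flag] / totals[flag] for flag in totals}

def toy_predict(utterance):
    """Stand-in for a real NLU engine: a naive keyword rule."""
    return "cancel_order" if "cancel" in utterance.lower() else "track_order"

# Hypothetical pre-tagged evaluation records.
records = [
    {"utterance": "I would like to cancel my order",
     "intent": "cancel_order", "flags": ["formal"]},
    {"utterance": "wanna cancel my order",
     "intent": "cancel_order", "flags": ["colloquial"]},
    {"utterance": "cancle my ordr pls",
     "intent": "cancel_order", "flags": ["colloquial", "spelling_errors"]},
    {"utterance": "where is my order",
     "intent": "track_order", "flags": ["formal"]},
]

by_flag = accuracy_by_flag(records, toy_predict)
for flag, acc in sorted(by_flag.items()):
    print(f"{flag}: {acc:.2f}")
```

In this toy run the keyword rule handles formal utterances but fails on the misspelled one, which is the kind of gap a per-flag breakdown is meant to surface.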

The Evaluation Methodology

The evaluation methodology is built on an iterative process: train the conversational AI model, evaluate its performance, retrain and measure performance again. This iterative process provides systematic performance improvements, typically starting at 60% understanding out of the box and reaching up to 90% in a few months.

The evaluation system is designed as a continuous improvement process that is implemented in cycles:

Define evaluation target: 60% to 90% understanding
Select training dataset
Train conversational AI model
Select evaluation dataset
Evaluate trained conversational AI model
Identify accuracy gaps
Identify problems and fixes
Re-train with new fixes
Re-evaluate to measure improvements
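The cycle above can be sketched as a simple loop. This is only an outline under stated assumptions: train, evaluate and find_fixes are hypothetical callables supplied by the user, and the toy versions below (a "model" that is just the set of intents seen in training) exist only to make the loop runnable. They are not part of any Bitext tool.

```python
def improvement_cycle(train, evaluate, find_fixes,
                      training_data, eval_data,
                      target=0.90, max_cycles=10):
    """Train, evaluate and retrain until the target score is reached."""
    history = []
    for _ in range(max_cycles):
        model = train(training_data)          # train / re-train
        score = evaluate(model, eval_data)    # evaluate / re-evaluate
        history.append(score)
        if score >= target:                   # evaluation target met
            break
        # Identify gaps and add fixes to the training data.
        training_data = training_data + find_fixes(model, eval_data)
    return history

# --- Toy stand-ins, for illustration only ---
def train(training_data):
    # The "model" is simply the set of intents it has seen.
    return {intent for _, intent in training_data}

def evaluate(model, eval_data):
    # Fraction of evaluation items whose intent the model covers.
    return sum(intent in model for _, intent in eval_data) / len(eval_data)

def find_fixes(model, eval_data):
    # Add one new training example for the first uncovered intent.
    for _, intent in eval_data:
        if intent not in model:
            return [(f"new training utterance for {intent}", intent)]
    return []

training_data = [("cancel my order", "cancel_order")]
eval_data = [
    ("please cancel it", "cancel_order"),
    ("where is my parcel", "track_order"),
    ("I want my money back", "get_refund"),
    ("stop my order", "cancel_order"),
]

history = improvement_cycle(train, evaluate, find_fixes,
                            training_data, eval_data)
print(history)
```

Note that the toy find_fixes uses the evaluation set only to identify which intents are missing and does not copy evaluation utterances into the training data, so the evaluation set stays independent of training.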

Access to Our Repositories

You can access our GitHub Repository and Hugging Face Dataset.


MADRID, SPAIN

Camino de las Huertas, 20, 28223 Pozuelo
Madrid, Spain

SAN FRANCISCO, USA

541 Jefferson Ave Ste 100, Redwood City
CA 94063, USA