For Consistent LLM Answers, Fine-tune with Examples. LOTS of Examples

LLMs tend to be creative, introducing diversity and variability into their answers.

That’s good for certain types of questions like:

  • What can you tell me about La Cibeles?
  • What gothic buildings should I visit in Madrid?

Questions that do not have a single obvious answer, and that knowledgeable people might answer very differently, are a great fit for a search-based approach like RAG.

For some other questions, the right answer is consistent and precise. In these cases, creativity can be flat-out wrong. Some good examples of these types of questions are:

  • What time does the Metropolitan Museum open?
  • Do you need tickets to visit The Cathedral? Can I buy the tickets online?
  • Who is the architect of Reina Sofia Museum? Does it have paintings by Picasso?
  • Is there underground service from Atocha to Barajas airport?

For these questions, excessive creativity causes significant problems: a creative answer is far less likely to be the correct one. In a real-life application, getting these questions wrong seriously undermines user confidence.

Does the museum open at 9am or at 10am? Variability in this answer is risky.

A single answer, consistent and precise, is required.

To achieve this consistency in an LLM-based application, like a chatbot, a training dataset with hundreds of variations of this type of question can help. The dataset should contain (see the sketch after this list):

  • Variations of the factual question, like:

      • What time does the Metropolitan Museum open?
      • What’s the schedule for the Metropolitan Museum?
      • Is the Metropolitan Museum open on Mondays?

  • Example answers to be fed to the LLM
  • Optionally, some tagging describing the linguistic rationale behind each variant: colloquial vs. formal language, etc.
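
A minimal sketch of what such records could look like in JSONL, a common fine-tuning input format. The field names and the answer texts below are illustrative placeholders, not a prescribed schema:

```python
import json

# Illustrative records: the field names ("question", "answer", "tags") and
# the answer texts are placeholders, not a prescribed schema. Replace the
# answers with your verified, official text.
records = [
    {
        "question": "What time does the Metropolitan Museum open?",
        "answer": "The museum opens at 10am.",
        "tags": {"register": "neutral"},
    },
    {
        "question": "What's the schedule for the Metropolitan Museum?",
        "answer": "The museum opens at 10am.",
        "tags": {"register": "colloquial"},
    },
    {
        "question": "Is the Metropolitan Museum open on Mondays?",
        "answer": "Yes, the museum is open on Mondays from 10am.",
        "tags": {"register": "neutral", "form": "yes-no"},
    },
]

# One JSON object per line (JSONL) is a common fine-tuning input format.
with open("faq_variants.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```

Note how every variant maps to the same canonical answer: that repetition is exactly what teaches the model to stop improvising on factual questions.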

How many variants of a question are required to safely fine-tune the LLM and be sure that the question will be properly understood and answered? Our experimental trial, which you can see here, suggests that the number is a little under 1,000.

Bitext provides an example of this type of dataset for Customer Support, with 3M tokens and 27,000 question-answer pairs, which you can find here.
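
If the dataset is published on the Hugging Face Hub, a quick way to inspect it is with the datasets library. The dataset ID and split name below are assumptions; check Bitext's Hub page for the exact identifier:

```python
from datasets import load_dataset  # pip install datasets

# The dataset ID is an assumption; verify it on Bitext's Hugging Face page.
ds = load_dataset("bitext/Bitext-customer-support-llm-chatbot-training-dataset")

print(ds)              # splits and column names
print(ds["train"][0])  # one question-answer pair (assumes a "train" split)
```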

The dataset is freely available, including for commercial use. It can be used in real-life applications to check how effective additional training data is at preventing both hallucinations and excessively creative answers for factual questions.
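
One simple way to run that check is to sample the same question several times and measure how often the model returns the same answer. The sketch below assumes a Hugging Face text-generation pipeline; the model name is a hypothetical placeholder for your fine-tuned model:

```python
from collections import Counter

from transformers import pipeline  # pip install transformers

# "your-org/your-finetuned-model" is a hypothetical placeholder.
generator = pipeline("text-generation", model="your-org/your-finetuned-model")

VARIANTS = [
    "What time does the Metropolitan Museum open?",
    "What's the schedule for the Metropolitan Museum?",
    "Is the Metropolitan Museum open on Mondays?",
]

def sample_answer(question: str) -> str:
    # Sample with temperature > 0 so variability, if any, shows up.
    out = generator(
        question,
        max_new_tokens=40,
        do_sample=True,
        temperature=0.7,
        return_full_text=False,  # keep only the generated continuation
    )
    return out[0]["generated_text"].strip()

def consistency(question: str, n: int = 10) -> float:
    """Fraction of n samples that agree with the most common answer."""
    counts = Counter(sample_answer(question) for _ in range(n))
    return counts.most_common(1)[0][1] / n

for q in VARIANTS:
    print(f"{consistency(q):.0%}  {q}")
```

A well fine-tuned model should score at or near 100% on factual questions like these, even with sampling enabled.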

Sharing is caring!