LLMs tend to be creative and to introduce diversity into their answers.

That’s good for certain types of questions like:

  • Tell me about La Cibeles
  • What Gothic buildings should I visit in Madrid?

These are questions without a single clear answer: even two people who are knowledgeable about the topic may answer them differently and still both be correct.

For these questions, a search-based approach like RAG can provide a very good solution.

For some other questions, the right answer is of a different type; you need to have consistent and precise, rather than creative, answers. This is typical for factual questions:

  • What time does the Metropolitan Museum open?
  • Do you need tickets to visit The Cathedral? Can I buy the tickets online?
  • Who is the architect of the Reina Sofia Museum? Does it have paintings by Picasso?
  • Is there underground service from Atocha to Barajas airport?

For these questions, excessive creativity can cause significant problems if it alters the correct answer. In a real-life application, getting these questions wrong seriously undermines user confidence.

Does the Museum open at 9am or at 10am? Variability in this answer is risky.

A unique, consistent and precise answer is required.
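One rough way to quantify this variability is to ask the same question several times and check how often the most common answer comes back. The snippet below is a minimal sketch under stated assumptions: the `sampled` strings are made-up stand-ins for repeated chatbot responses, the function names are hypothetical, and exact matching after normalization is a deliberately crude notion of "same answer".

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(answer.lower().split())


def consistency(answers: list[str]) -> tuple[str, float]:
    """Return the most frequent normalized answer and the share of samples that match it."""
    counts = Counter(normalize(a) for a in answers)
    top, n = counts.most_common(1)[0]
    return top, n / len(answers)


# Made-up samples standing in for repeated responses to the same question.
sampled = [
    "The museum opens at 10am.",
    "the museum opens  at 10am.",
    "The museum opens at 9am.",
]
top_answer, agreement = consistency(sampled)
```

An agreement share well below 1.0 on a factual question is a sign that the answer is not yet stable enough for users to rely on.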

To achieve this consistency in an LLM-based application, such as a chatbot, a training dataset with hundreds of variations of these types of questions can help. The dataset should contain:

  • Variations of the factual questions, like:
      • What time does the Metropolitan Museum open?
      • What’s the schedule for the Metropolitan Museum?
      • Is the Metropolitan Museum open on Mondays?

  • A few example answers to be fed to the LLM
  • Optionally, tags describing the linguistic rationale behind each variant: colloquial vs. formal language, etc.
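The three ingredients listed above could be organized as shown in this minimal sketch. Everything in it is an assumption made for illustration: the `FactualIntent` class, the field names, the style tags and the JSONL layout are hypothetical and are not the format of any particular dataset, and the answer is a placeholder to be filled from your own verified data.

```python
import json
from dataclasses import dataclass, field


@dataclass
class FactualIntent:
    """One factual question, its paraphrases, and the single answer to train on."""
    intent: str   # hypothetical label for the underlying question
    answer: str   # placeholder; the real answer must come from verified data
    variants: list[tuple[str, str]] = field(default_factory=list)  # (question, style tag)


def to_jsonl(intents: list[FactualIntent]) -> str:
    """Expand every variant into one question-answer record, one JSON object per line."""
    lines = []
    for item in intents:
        for question, tag in item.variants:
            record = {
                "intent": item.intent,
                "tag": tag,
                "question": question,
                "answer": item.answer,
            }
            lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


example = FactualIntent(
    intent="museum_schedule",
    answer="<schedule taken from the museum's official information>",
    variants=[
        ("What time does the Metropolitan Museum open?", "formal"),
        ("What's the schedule for the Metropolitan Museum?", "colloquial"),
        ("Is the Metropolitan Museum open on Mondays?", "formal"),
    ],
)

jsonl = to_jsonl([example])
```

Keeping one answer per intent and pairing it with every variant is what pushes the model toward the same response regardless of how the question is phrased.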

How many variants of a question are required to fine-tune the LLM safely and be confident that the question will be properly understood? Our experimental trials suggest a little under 1,000.

Bitext provides an example of this type of dataset for Customer Support, with 3M tokens and 27,000 question-answer pairs; it can be found here.

The dataset is freely available, including for commercial use, so it can be used in real-life applications to check how far additional training data can prevent hallucinations or excessively creative answers to factual questions.
