How Synthetic Text Can Solve Training and Evaluation Problems for Your Virtual Assistants and Chatbots
When shopping online, customers frequently need to modify their order: exchanging an item in the basket, deleting something already added, and so on.
Customers ask for these kinds of changes in many different ways, like “how do I change my order?” or “I need to delete a product from my basket”.
Customers may use a formal register (“can you please help me…”), or an informal one (“can u help me…”), use only keywords (“delete item”) or add spelling or grammar errors (“need change my baskt”), among other phenomena.
To illustrate this variety in practice, with this post we release a tagged dataset containing 10,000 ways of asking for an order modification, this time in English.
Our first reaction to this number may be: are there really 10,000 ways to ask for a change to a customer's order?
Indeed, there are 10,000 ways, and 100,000, and 1,000,000. This is a feature of all natural languages.
Natural language is built to produce a literally infinite number of ways to express the same content.
This expressive power serves many purposes: for one, it allows for the expression of subjectivity, something essential to humans, and it keeps language from being as dull as formal languages.
That's why customers may choose to be polite and formal, or colloquial and informal; may include offensive language if they are angry; or may signal their geographical origin, like Canadian French speakers vs. speakers of French from France.
Language has the power to express these and many other variations.
The dataset we are releasing is tagged with these variants and many more; see here for a comprehensive list.
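To make the idea of a tagged utterance concrete, here is a minimal sketch of what one record in such a dataset might look like. The field names and tag values are illustrative assumptions, not the actual schema of the released dataset:

```python
# Hypothetical tagged-utterance record; field names and tag values are
# illustrative, not the released dataset's actual schema.
record = {
    "text": "need change my baskt",
    "intent": "modify_order",
    "tags": {
        "register": "informal",    # formal vs. informal
        "keywords_only": False,    # keyword-style query vs. full sentence
        "spelling_errors": True,   # contains typos ("baskt")
        "region": "neutral",       # neutral vs. regional variant
    },
}

def has_tag(rec, key, value):
    """Return True if the record carries the given linguistic tag."""
    return rec["tags"].get(key) == value

print(has_tag(record, "register", "informal"))  # True
```

Tags like these let you filter or slice the data, for instance to train only on formal utterances or to measure robustness to spelling errors.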
Now, the first question is: where do you get enough data to cover all these variations, for every intent your virtual assistant needs to handle, when training and evaluating your chatbot?
If you don't have historical data to leverage, or if you just want to avoid privacy issues, the typical answer is to generate and tag this data by hand.
As chatbots grow in scope, crowdsourcing text generation and tagging becomes more challenging. As in other fields, the trend is toward automating data generation.
As NLG (Natural Language Generation) matures, synthetic text is becoming a solid alternative for generating and labeling textual data for question/answer systems.
The main advantages are:
- this technology generates very large amounts of text, in the range of tens of thousands to hundreds of thousands of examples
- the text is generated with linguistic tags: colloquial vs. formal; neutral vs. regional; spelling and grammar errors; and so on
- datasets can be regenerated as the data or chatbot specs evolve or change
- multilingual data can be generated in a consistent manner across languages
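The first two points above can be sketched with a toy template expander: each slot lists variants that carry a linguistic tag, and the cross product of the slots yields tagged utterances. Real NLG systems are far richer; the templates, tags, and the `generate` helper here are illustrative assumptions:

```python
from itertools import product

# Toy template-based generation: every variant carries a tag, and the
# cross product of slots yields tagged utterances automatically.
OPENERS = [("Can you please help me", "formal"),
           ("can u help me", "informal")]
ACTIONS = [("delete an item from my basket", "standard"),
           ("delete an item from my baskt", "spelling_error")]

def generate():
    """Yield one tagged record per combination of slot variants."""
    for (opener, register), (action, quality) in product(OPENERS, ACTIONS):
        yield {
            "text": f"{opener} {action}?",
            "intent": "modify_order",
            "tags": {"register": register, "quality": quality},
        }

dataset = list(generate())
print(len(dataset))  # 4 utterances from 2 x 2 variants
```

With a few dozen variants per slot and a handful of slots, this combinatorial growth is what makes tens of thousands of tagged examples per intent cheap to produce, and regenerating the dataset after a spec change is just a re-run.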
These large datasets can of course be used for training, the first need in the chatbot development cycle. But they can be used for evaluation too, particularly in the absence of real data.
See this post on evaluation.
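One advantage of evaluating on tagged data is that you can break scores down by linguistic tag, for example to see whether accuracy drops on informal or misspelled inputs. A minimal sketch, where `predict` is a hypothetical stand-in for your real intent classifier:

```python
from collections import defaultdict

def predict(text):
    """Hypothetical stand-in for a real intent classifier."""
    return "modify_order" if "order" in text or "basket" in text else "other"

examples = [
    {"text": "how do I change my order?", "intent": "modify_order",
     "tags": {"register": "formal"}},
    {"text": "need change my baskt", "intent": "modify_order",
     "tags": {"register": "informal"}},
]

def accuracy_by_tag(examples, tag_key):
    """Compute per-tag accuracy, e.g. formal vs. informal."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        tag = ex["tags"][tag_key]
        total[tag] += 1
        correct[tag] += predict(ex["text"]) == ex["intent"]
    return {t: correct[t] / total[t] for t in total}

print(accuracy_by_tag(examples, "register"))
# {'formal': 1.0, 'informal': 0.0} -- the misspelled input is missed
```

A tag-sliced report like this points directly at the phenomena (register, typos, regional variants) where the model needs more training data.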
The sample dataset we have released is just an example of what current technology can achieve.
Download it here and let us know your thoughts: does it work for you?
This is just the beginning. We will soon publish another 20+ intents to complete a full chatbot for customer support.