Multilingual Synthetic Training Data
Industrialize training data production for any voice-controlled device, chatbot or IVR using artificial training data.
- Recognize a user's intent in any chatbot platform: Dialogflow, MS-LUIS, RASA…
- Enjoy 90% accuracy, guaranteed by SLA
Machine Learning is one of the most common use cases for Synthetic Data today, mainly for images and videos. We offer text training data in any language you need, so you can scale up the amount of data quickly and flexibly.
Working with 3 of the 5 largest companies on NASDAQ
“Any bot works as long as it has the right data. No bot platform works with the wrong data”
What is Training Data?
Training data is the data that is used to train an NLU engine. An NLU engine allows chatbots to understand the intent of user queries. The training data is enriched by data labeling or data annotation, with information about entities, slots…
This training process provides the bot with the ability to hold a meaningful conversation with real people.
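As an illustration, a single labeled training example could be represented as shown below. This is only a minimal sketch: the field names (`text`, `intent`, `entities`), the intent name `activate_sim` and the entity label `product` are hypothetical and do not reflect the format of any particular bot platform.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entity:
    """A labeled span inside an utterance (character offsets, end exclusive)."""
    start: int
    end: int
    label: str


@dataclass
class Utterance:
    """One training example: the user text, its intent, and its entities."""
    text: str
    intent: str
    entities: List[Entity] = field(default_factory=list)

    def entity_text(self, entity: Entity) -> str:
        # Recover the surface string that the annotation points at.
        return self.text[entity.start:entity.end]


# Hypothetical example: the intent and entity labels are illustrative only.
example = Utterance(
    text="activate my SIM card",
    intent="activate_sim",
    entities=[Entity(start=12, end=20, label="product")],
)
```

Storing entities as character offsets keeps the original text untouched, so the same utterance can be exported to whichever format a given platform expects.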
After the training process, the bot is evaluated to measure the accuracy of the NLU engine. Evaluation identifies errors in the bot's behavior, and these errors are then fixed by improving the training data. This cycle is repeated.
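The evaluation step can be sketched as a comparison between the expected intents of a test set and the intents the NLU engine predicts. The code below is a minimal sketch under stated assumptions: the function name `accuracy_report`, the intent names and the predictions are all hypothetical, and a real evaluation would use predictions returned by the bot platform.

```python
from collections import defaultdict


def accuracy_report(gold, predicted):
    """Return overall accuracy and per-intent accuracy.

    `gold` and `predicted` are equal-length lists of intent names.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot evaluate an empty test set")

    correct = defaultdict(int)
    total = defaultdict(int)
    for g, p in zip(gold, predicted):
        total[g] += 1
        if g == p:
            correct[g] += 1

    overall = sum(correct.values()) / len(gold)
    per_intent = {intent: correct[intent] / total[intent] for intent in total}
    return overall, per_intent


# Hypothetical intents and predictions, for illustration only.
gold = ["activate_sim", "activate_sim", "check_billing", "talk_to_agent"]
predicted = ["activate_sim", "check_billing", "check_billing", "talk_to_agent"]

overall, per_intent = accuracy_report(gold, predicted)
```

Intents with low per-intent accuracy are the natural candidates for additional or improved training data in the next iteration of the cycle.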
Bitext Synthetic Data solves the three main problems of AI data:
- scarcity of data: tens of thousands of utterances per intent
- privacy: no GDPR issues, no anonymization needed
- scalability: a scalable process for different bots and different languages
Multilingual Training datasets for intent detection
We help you understand your customers, whether:
- you do not have any existing training data and are getting started with your chatbot
- you need to increase the accuracy of your existing bot
- you need to expand your bot to other languages and want to keep the same accuracy across languages
Our Solution, for your current bot and for your new bot
If you have existing training data
- If you want to increase the accuracy or expand the scope of your current assistant/chatbot with more intents and utterances, we automate the process and generate the training data you need in any language.
- Our Quality Assurance and Improvement service lets you retrain the model regularly to increase accuracy up to 90%, guaranteed by SLA.
If you don’t have existing training data
We offer different options according to your needs, from our pre-built vertical templates (bootstrapping), which cover the most common intents for each vertical, to custom datasets for customer-specific requests.
Access to Our Repositories
You can access our GitHub repository and Hugging Face dataset.
Linguistic features included in our datasets
The dataset contains annotations for all relevant linguistic phenomena that can be customized to adapt bot training to different user language profiles. Some of the most relevant annotations are:
- Morphological variation: inflectional and derivational
“is my SIM card active”
“is my SIM card activated”
- Semantic variations: synonyms, use of hyphens, compounding…
“what’s my billing date”
“what’s my anniversary date”
Syntactic structure variation:
- Basic syntactic structure:
“activate my SIM card”
“I need to activate my SIM card”
- Interrogative structure
“can you activate my SIM card”
“how do I activate my SIM card”
- Coordinated syntactic structure
“I have a new SIM card, what do I need to do to activate it?”
- Indirect speech
“ask my agent to activate my SIM card”
Language register variations:
- Politeness variation
“could you help me activate my SIM card, please?”
- Colloquial variation
“can u activ8 my SIM?”
- Respect structures – Language-dependent variations
English: may vs. can…
French: tu vs. vous…
Spanish: tú vs. usted…
- Offensive language
“I want to talk to a f*cking agent”
- Keyword mode
- Use of abbreviations:
“I’m / I am interested in getting a new SIM”
- Errors and Typos: spelling issues, wrong punctuation…
“how can i activaet my card”
- Regional variations
US English vs UK English: truck vs. lorry
France French vs Canadian French: tchatter vs. clavarder
“activer ma SIM card”
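One way to picture how such annotations allow training data to be adapted to a user language profile is to filter utterances by their tags. The snippet below is a hedged sketch: the tag names (`basic`, `polite`, `colloquial`, `typo`, `interrogative`), the `select` helper and the record layout are hypothetical and are not the dataset's actual annotation scheme.

```python
# Hypothetical tagged utterances; tag names are illustrative only.
DATASET = [
    {"text": "activate my SIM card",
     "intent": "activate_sim", "tags": {"basic"}},
    {"text": "could you help me activate my SIM card, please?",
     "intent": "activate_sim", "tags": {"polite", "interrogative"}},
    {"text": "can u activ8 my SIM?",
     "intent": "activate_sim", "tags": {"colloquial", "interrogative"}},
    {"text": "how can i activaet my card",
     "intent": "activate_sim", "tags": {"typo", "interrogative"}},
]


def select(dataset, include=None, exclude=frozenset()):
    """Keep rows with at least one `include` tag (if given) and no `exclude` tag."""
    selected = []
    for row in dataset:
        tags = row["tags"]
        if include is not None and not (tags & set(include)):
            continue  # none of the requested tags is present
        if tags & set(exclude):
            continue  # at least one unwanted tag is present
        selected.append(row)
    return selected


# A "formal" profile: drop colloquial utterances and utterances with typos.
formal_profile = select(DATASET, exclude={"colloquial", "typo"})
```

A bot aimed at a younger audience could instead be trained with `include={"colloquial"}`, using the same source data.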
Each Prebuilt Chatbot contains the 20 to 40 most frequent intents for the corresponding vertical, designed to give you the best performance out-of-the-box.
Our Prebuilt Chatbots are trained to deal with language register variations including polite/formal, colloquial and offensive language. We have profiled the language register use in user queries from a wide range of vertical bots, and we use this information to generate training data with a similar profile, ensuring maximum linguistic coverage.
We also introduce noise into the training data, including spelling mistakes, run-on words and missing punctuation. This makes the data even more realistic, which makes our Prebuilt Chatbots more robust to the type of “noisy” input that is common in real life.
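The three kinds of noise mentioned above can be sketched with a few string operations. This is a minimal illustration, not the actual generation pipeline: all function names are hypothetical, and real noise would follow the error distribution observed in user queries rather than a uniform random choice.

```python
import random
import string


def swap_adjacent(text: str, i: int) -> str:
    """Simulate a spelling mistake by swapping the characters at i and i+1."""
    if not 0 <= i < len(text) - 1:
        return text
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def join_words(text: str, k: int) -> str:
    """Simulate a run-on word by removing the k-th space (0-based)."""
    words = text.split(" ")
    if not 0 <= k < len(words) - 1:
        return text
    return " ".join(words[:k] + [words[k] + words[k + 1]] + words[k + 2:])


def strip_punctuation(text: str) -> str:
    """Simulate missing punctuation by dropping all ASCII punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))


def add_noise(text: str, rng: random.Random) -> str:
    """Apply one randomly chosen noise operation to the utterance."""
    choice = rng.choice(["typo", "run_on", "no_punct"])
    if choice == "typo" and len(text) > 1:
        return swap_adjacent(text, rng.randrange(len(text) - 1))
    if choice == "run_on" and " " in text:
        return join_words(text, rng.randrange(text.count(" ")))
    return strip_punctuation(text)


# Swapping positions 16 and 17 turns "activate" into "activaet".
typo = swap_adjacent("how can i activate my card", 16)
run_on = join_words("activate my SIM card", 1)
no_punct = strip_punctuation("could you help me activate my SIM card, please?")
```

Passing a seeded `random.Random` instance to `add_noise` keeps the generated noise reproducible between runs.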
Retail Case Study
Deploying a bot that is able to engage in successful conversations with customers worldwide for one of the largest fashion retailers.
A benchmark based on Dialogflow shows a +40% increase in standard accuracy.
See how automatic training improves on manual training.
Get the full dataset used to generate the benchmark results. Check out how easy it is to integrate the training data into Dialogflow and get a +40% increase in accuracy.