Building effective conversational agents requires large amounts of training data. Producing this data manually is an expensive, time-consuming and error-prone process that does not scale. Platform providers usually lack the infrastructure required to tackle the wide range of verticals, languages and locales that their large clients need to handle, while clients rarely have the expertise necessary to collect and annotate their data; outsourcing the task is further complicated by the fact that the data often contains sensitive information that cannot be exposed to third parties.
Bitext offers an easy solution to bootstrap new bots or boost existing ones in minutes, providing a high level of accuracy out-of-the-box and replacing the need for weeks or months of manual bot development.
Each Instant Chatbot contains the 20 to 40 most frequent intents for the corresponding vertical, designed to give you the best performance out-of-the-box.
Our Instant Chatbots are trained to deal with language register variations, including polite/formal, colloquial and offensive language. We have profiled language register usage in user queries across a wide range of vertical bots, and we use this information to generate training data with a similar profile, ensuring maximum linguistic coverage.
We also introduce noise into the training data, including spelling mistakes, run-on words and missing punctuation. This makes the data even more realistic, which makes our Instant Chatbots more robust to the type of “noisy” input that is common in real life.
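As an illustration of this kind of noise injection (a minimal sketch, not Bitext's actual implementation), the snippet below applies three hypothetical perturbations to a clean utterance: swapping adjacent characters to simulate a typo, deleting a space to create a run-on word, and stripping final punctuation.

```python
import random

def add_typo(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters to simulate a spelling mistake."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def make_run_on(text: str, rng: random.Random) -> str:
    """Delete one space so two words merge into a run-on word."""
    spaces = [i for i, ch in enumerate(text) if ch == " "]
    if not spaces:
        return text
    i = rng.choice(spaces)
    return text[:i] + text[i + 1:]

def drop_punctuation(text: str) -> str:
    """Strip sentence-final punctuation."""
    return text.rstrip(".?!,")

def noisify(text: str, rng: random.Random) -> str:
    """Apply one randomly chosen noise operation to the utterance."""
    ops = [lambda t: add_typo(t, rng),
           lambda t: make_run_on(t, rng),
           drop_punctuation]
    return rng.choice(ops)(text)

rng = random.Random(42)
clean = "I want to cancel my order, please."
for _ in range(3):
    print(noisify(clean, rng))
```

In a real pipeline each operation would be applied with a probability tuned to match the noise profile observed in live user queries, so the synthetic data mirrors production traffic.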
Bitext Instant Chatbots are currently available in English and Spanish. Our NLG technology is already available for German, French, Italian, Danish, Portuguese, Dutch and Swedish. Support for Turkish, Polish, Chinese, Japanese and Korean will be available in Q3 of 2020.
We employ a scalable and data-driven linguist-in-the-loop methodology. We begin by collecting large volumes of text from domain-specific public data sources such as FAQs, knowledge bases and technical documentation. We then apply our Deep Parsing technology to automatically extract the most frequent actions and objects that appear in those texts. This results in a knowledge graph that captures the semantic structure of the vertical, which is then curated by computational linguists to identify synonyms and to ensure consistency and completeness. Actions are grouped into categories and intents, and the intent structure is then validated against FAQs and with domain experts.
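The aggregation step can be sketched as follows. This toy example (not the actual Deep Parsing output format, which is proprietary) assumes the parser has already reduced domain texts to (action, object) pairs; frequency counting plus a linguist-curated synonym map then surfaces the intent candidates.

```python
from collections import Counter

# Toy stand-in for parser output: (action, object) pairs extracted
# from domain texts such as FAQs and knowledge bases.
parses = [
    ("cancel", "order"), ("track", "order"), ("cancel", "order"),
    ("change", "address"), ("track", "package"), ("cancel", "subscription"),
    ("track", "order"), ("change", "address"),
]

# Hand-curated synonym map, the kind of normalization computational
# linguists add when curating the knowledge graph.
object_synonyms = {"package": "order"}

normalized = [(a, object_synonyms.get(o, o)) for a, o in parses]
freq = Counter(normalized)

# The most frequent action-object pairs become intent candidates.
for (action, obj), count in freq.most_common(3):
    print(f"{action}_{obj}: {count}")
```

Grouping the resulting pairs by action category, and validating the list against FAQs and domain experts, yields the final intent structure for the vertical.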
Finally, the linguistic structure of each intent is defined, together with the applicable frame types, which allow our Natural Language Generation (NLG) technology to generate utterances that are predictable and consistent semantic variations of each intent request. This approach provides a measurable improvement to NLU performance: benchmarks comparing a manual baseline with our synthetic data show a >30% increase in intent detection and slot filling accuracy across multiple platforms.
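To make the frame idea concrete, here is a deliberately simplified sketch (the frame, slot names and template are illustrative, not Bitext's actual representation): each slot in a hypothetical "cancel_order" frame lists lexical variants, and enumerating their combinations yields predictable, semantically consistent utterance variants for the intent.

```python
from itertools import product

# Hypothetical frame for a "cancel_order" intent: each slot maps
# to its admissible lexical variants.
frame = {
    "REQUEST": ["I want to", "I need to", "could you help me"],
    "ACTION": ["cancel", "call off"],
    "OBJECT": ["my order", "my purchase"],
}

template = "{REQUEST} {ACTION} {OBJECT}"

# Enumerate every slot-value combination into a training utterance.
utterances = [
    template.format(REQUEST=r, ACTION=a, OBJECT=o)
    for r, a, o in product(frame["REQUEST"], frame["ACTION"], frame["OBJECT"])
]

print(len(utterances))   # 3 * 2 * 2 = 12 variants
print(utterances[0])     # "I want to cancel my order"
```

Because every generated utterance is derived from a known frame, each one carries its intent and slot annotations for free, which is what makes the data directly usable as labeled NLU training material.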
Our methodology and tools allow us to easily customize and adapt the datasets to changing needs, including new intents, corporate terminology, language registers, new regions, markets and languages. With each change, the data is automatically regenerated, allowing for continuous improvement in a scalable fashion.