multilingual synthetic training data

get your customer service chatbot
up and running without effort

Introducing a new way to speed up the deployment of new domains and languages on any bot platform using artificial training data.

artificial data for reliable artificial intelligence

Pre-packaged synthetic data for a wide range of verticals and languages.

You can have your multilingual bot up and running in a few hours.


Save Time

Deliver effortless customer support by avoiding manual chatbot training.

Customized for different domains

Easy to start entering new verticals with minimal investment.

Easy Integration

Customized data that integrates with any platform (Dialogflow, LUIS, Lex...), with up to 30% accuracy improvement.


Scalability & Modularity

No need to understand the technical details to train optimal AI models.

Improve Customer Experience

Grow sales and drive loyalty by providing great customer service 24/7.


Available in 9 languages: English, Spanish, French, Italian, German, Dutch, Danish, Swedish and Portuguese (summer 2019).

A benchmark based on Dialogflow shows accuracy increases of up to 40%.

See how automatic training improves on manual training.

Get the full dataset used to generate the benchmark results. See how easy it is to integrate the training data into Dialogflow and achieve a 40% accuracy increase.
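As an illustration of what that integration involves, the sketch below converts one annotated utterance into a Dialogflow-style training phrase. The `entityType`/`alias` field names follow Dialogflow's v2 REST shape as an assumption for illustration; the slot tuple format is hypothetical.

```python
def to_dialogflow_phrase(text, slots):
    """Convert an annotated utterance into a Dialogflow-style
    training phrase (v2 REST shape, assumed here for illustration).

    `slots` is a list of (start, end, entity_type, alias) tuples
    with character offsets into `text`.
    """
    parts, cursor = [], 0
    for start, end, entity_type, alias in sorted(slots):
        if start > cursor:                       # plain text before the slot
            parts.append({"text": text[cursor:start]})
        parts.append({                           # annotated slot span
            "text": text[start:end],
            "entityType": entity_type,
            "alias": alias,
        })
        cursor = end
    if cursor < len(text):                       # trailing plain text
        parts.append({"text": text[cursor:]})
    return {"type": "EXAMPLE", "parts": parts}

phrase = to_dialogflow_phrase(
    "I want to return my headphones",
    [(20, 30, "@product", "product")],
)
```

Because every utterance in the dataset already carries slot offsets, this kind of conversion can be run in bulk over a whole vertical template.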

Do you train your bot manually?

The most-cited issue in poor chatbot implementations is manual training. Training data is critical to the successful operation of a chatbot.

Building effective conversational agents requires large amounts of training data. Producing this data manually is an expensive, time-consuming and error-prone process which does not scale.

Platform providers usually lack the infrastructure required to tackle the wide range of verticals, languages and locales that their large clients need to handle. Clients, in turn, rarely have the expertise necessary to collect and annotate their data in a way that avoids both language ambiguity and intent overlap when models are trained.



We employ a scalable and data-driven linguist-in-the-loop methodology.

This approach provides a measurable improvement to NLU performance: benchmarks comparing a manual baseline with our synthetic data show >30% increase in intent detection and slot filling accuracy across multiple platforms.

Our methodology and tools allow us to easily customize and adapt datasets to changing needs, including new intents, corporate terminology, language registers, new regions, markets and languages. With each change, the data is automatically regenerated, allowing for continuous improvement in a scalable fashion.

We begin by collecting large volumes of text from domain-specific public data sources such as FAQs, knowledge bases and technical documentation.

We then apply our Deep Parsing technology to automatically extract the most frequent actions and objects that appear in those texts. This results in a knowledge graph that captures the semantic structure of the vertical, which is then curated by computational linguists to identify synonyms and to ensure consistency and completeness.

Actions are grouped into categories and intents, and the intent structure is then validated against FAQs and with domain experts.

Finally, the linguistic structure of each intent is defined, together with the applicable frame types which allow our Natural Language Generation (NLG) technology to generate utterances which are predictable and consistent semantic variations of each intent request.
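The generation step can be pictured as template expansion: each intent defines ordered synonym sets, and the cross product yields predictable, consistent semantic variations. This is a minimal toy sketch of that idea, not Bitext's actual NLG technology.

```python
from itertools import product

# Illustrative intent template for a hypothetical "cancel_order"
# intent: each position lists interchangeable phrasings, and the
# cross product enumerates every utterance variant.
template = [
    ["I want to", "I would like to", "please"],
    ["cancel", "call off"],
    ["my order", "my purchase"],
]

def generate(template):
    """Yield every utterance variant from the synonym template."""
    for combo in product(*template):
        yield " ".join(combo)

utterances = list(generate(template))
# 3 * 2 * 2 = 12 variants, all labelled with the same intent
```

Because the variants are generated rather than collected, regenerating the data after a template change (new synonyms, new terminology) is automatic, which is what makes the continuous-improvement loop described above scalable.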


Easy Implementation

Based on Data

Typically up & running

in days not months

Dedicated Client Success Managers

highly skilled in NLP

Platform agnostic

Ready for Dialogflow, LUIS, Lex, Watson, Rasa and many more

Recycle and own your data

for different platforms and projects

API On Premise

Flexible delivery options

our pre-trained bot offering

Each vertical template contains:

  • From 20 to 100 common intents per vertical.
  • From 500 to 5,000 example utterances per intent.
  • Entity/slot annotations for each utterance.
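A single training record might look like the following sketch. The field names (`intent`, `utterance`, `slots`, character offsets) are illustrative assumptions, not Bitext's actual delivery schema.

```python
import json

# Hypothetical shape of one training record: the utterance text,
# its intent label, and character-offset slot annotations.
record = {
    "intent": "track_order",
    "utterance": "where is my order 4521",
    "slots": [
        {"entity": "order_id", "value": "4521", "start": 18, "end": 22}
    ],
}

serialized = json.dumps(record)
```

Character-offset annotations like these are platform-neutral, which is what allows the same dataset to be converted for Dialogflow, LUIS, Lex and others.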

The data is organized into one core dataset and several optional advanced module datasets:

Core Dataset

  • Initial definition based on common linguistic phenomena, including lexical variation, syntax, morphology and coordination.
  • Structured according to specific language use (vertical terminology and audience).

Advanced Module Datasets

  • Politeness: could you... please?
  • Expanded abbreviations: I'd like... I would like.
  • Small talk or colloquial: i wanna...
  • Indirect: ask an agent to...
  • Offensive: exchange a f***ing product.
  • Regional: basket/cart.
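One way to picture the advanced modules is as transformations layered over a core utterance, each producing a register-specific variant. The rewrites below are toy stand-ins for illustration only, not Bitext's actual generation rules.

```python
# Each optional module rewrites a core utterance into a
# register-specific variant (toy transforms, purely illustrative).
modules = {
    "politeness":  lambda u: u.replace("I want to", "could you") + ", please?",
    "abbreviated": lambda u: u.replace("I want to", "I'd like to"),
    "colloquial":  lambda u: u.replace("I want to", "i wanna"),
    "indirect":    lambda u: "ask an agent to " + u.replace("I want to ", ""),
}

core = "I want to exchange a product"
variants = {name: fn(core) for name, fn in modules.items()}
```

Keeping the modules separate from the core dataset means a client can enable only the registers its audience actually uses.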

verticals available

Retail - Ecommerce

Retail Banking

Media Streaming

Field Service

Mortgages & Loans

Wealth Management

Real Estate / Construction

Restaurant & Bar Chains

Moving & Storage

Events & Ticketing

Legal Services

Got questions?

"Bitext can improve the performance of almost any conversational engine and project".

"End users frustrated with the performance or complexity of their chatbot developments will be interested in how Bitext can improve intent matching confidence and reduce development time".

"Synthetic data can act as a democratizer for smaller players as they try to compete with data-laden tech heavyweights. Privacy restrictions are an additional major driver of this technology".

Anthony Mullen, Gartner

2018 Cool Vendors in AI Core Technologies Report & Hype Cycle for Enterprise Information Management, 2019

Bitext is recognized in up to 20 Gartner Reports

We have received this recognition because of our relentless focus on product innovation: Synthetic Data and NLP Middleware are some of our cutting-edge technologies.

empower your contact center with our api

Our NLP API Platform offers a wide variety of leading multilingual NLP tools and solutions to help you create the best customer experience. Here are some examples:

Sentiment Analysis

Identify the topics of conversation and evaluate, extract and quantify the emotional state or attitude of your customers towards those topics with a polarity score value.
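To make the idea of a polarity score concrete, here is a toy lexicon-based scorer: positive words push the score towards +1, negative words towards -1. This is purely illustrative and not the model behind the actual Sentiment Analysis service.

```python
# Toy lexicon-based polarity scorer, purely illustrative of what a
# polarity score represents (not the production sentiment model).
LEXICON = {"great": 1.0, "love": 1.0, "slow": -0.5, "terrible": -1.0}

def polarity(text):
    """Average the scores of known words; 0.0 means neutral."""
    hits = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

score = polarity("the delivery was terrible but support was great")
```

A production system would also weigh negation, intensifiers and topic association, which is why the service pairs each polarity score with the topic it applies to.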

Entity Extraction

Extract the relevant multi-word noun, verb, adjective or adverbial phrases using morphological and syntactic analysis.


Anonymization

A data processing technique that removes or replaces personally identifiable information with special tokens. The result is anonymized data that cannot be associated with any single individual.


Categorization

Classify your texts into groups according to your customized categories.


Lemmatization

Identify all the potential roots (lemmas) of each word in a sentence, using morphological analysis and carefully curated lexicons.


541 Jefferson Ave., Ste. 100

Redwood City

CA 94063


José Echegaray 8, Building 3, Office 4

Parque Empresarial Las Rozas

28232 Las Rozas