
We build the Copilot for your Customers, Finetuning GPT with our Hybrid Data
Copilots For Customer Support and Onboarding in any vertical: Banking, Travel, HHRR…
Use Case: Fine-tuning for Customer Service
Make any LLM work for your needs with our Hybrid Data
Bitext Hybrid Datasets Combine the scale of Synthetic Data with the Quality of Manual Curation
Usecase: fine-tuning for Customer Service

Working with 3 of the Top 5 largest companies in NASDAQ

Working with 3 of the Top 5 largest companies in NASDAQ

NLG technology to generate hybrid datasets for LLM finetuning
We call them hybrid datasets because they combine the scale and volume provided by synthetic text generation & with the quality provided expert curation. These datasets are tagged with the linguistic properties that motivate the variation: colloquial/formal language, spelling errors, different syntactic structures, etc.
The datasets are designed for fine tuning Large Language Models (LLMs) for Conversational Applications and in particular Customer Support. The datasets solve the typical problems of text produced with generative AI technology: hallucination, bias and PII, because it’s generated using a hybrid methodology that merges synthetic techniques with linguist supervision.
Bitext Open Source Dataset
We have shared a sample Hybrid Dataset to enable the AI community to evaluate and leverage it. Here are the main features of this sample dataset:
- Primary Objective:
The dataset is chiefly designed for training Large Language Models (LLMs) aimed at enhancing the efficiency of Conversational Applications, particularly in Customer Support. It addresses common issues associated with data produced through generative AI, such as hallucination, bias, and PII. This is achieved by utilizing a hybrid methodology that combines synthetic techniques with linguist supervision, facilitating the creation of smaller, easier-to-operate LLMs with higher accuracy. Given both AWS (Amazon Web Services) and Apple policies for PII and data sharing, these datasets are immensely valuable, offering substantial benefits for platforms like AWS and applications like Siri or models like Lex. - Language Coverage:
Currently, the dataset covers English and Spanish, with some data generated in German. The technology is also ready for another 9 languages. The full list includes: English, Spanish, German, French, Italian, Dutch, Portuguese, Swedish, Polish, and Korean. Additionally, Danish, Turkish, Chinese, and Japanese are on the roadmap. - Content Characteristics:
The dataset encompasses questions and answers typical for Customer Support within the ecommerce domain. These questions are enriched with extensive linguistic tagging (formal, colloquial, noisy, etc.). The primary facets of these datasets are categorized under ‘intent,’ ‘instruction,’ and ‘response,’ with the option to include additional fields such as ‘context’ or ‘system prompt.’ - Volume Metrics:
In terms of volume, the dataset comprises 3.5 million tokens and 27,000 question-answer pairs.
Bitext Dataset Language Tagging
The Bitext Datasets are enriched with a large set of language tags, capturing the diverse ways in which language can generate variants.
The corpus contains more than 12 different tag types (a detailed description of the tagging is provided below). Here are some examples:
- Handling LLMs with Multiple Registers:
In natural language, the same content can be expressed in different language registers, typically formal and colloquial. For example:- Tag “COLLOQUIAL”: Indicates the utterance contains informal expressions.
Ex: “can u close my account” - Tag “FORMAL”: Indicates the utterance contains formal language.
Ex: “could you please help me close my account”
- Tag “COLLOQUIAL”: Indicates the utterance contains informal expressions.
With the information about variants captured in tagging, it’s possible to fine-tune the model for different types of speakers and their language preferences: informal for younger audiences and more formal for senior audiences.
Detecting Non-Desired Language:
Variants of texts with biased and offensive language have been included and tagged. This tagging facilitates the training and evaluation of models to detect undesired biased or offensive language. The texts are 100% free of untagged biased or offensive language.
- Tag “OFFENSIVE”: Indicates the utterance contains offensive expressions.
Ex: “open my f*&%* account”
- Tag “OFFENSIVE”: Indicates the utterance contains offensive expressions.
- Errors/Typos in Texting:
To better replicate actual texts from users querying LLMs, classical errors like typos have been included.- Tag “NOISE”: Indicates the utterance contains typos.
Ex: “how can i activaet my card”
- Tag “NOISE”: Indicates the utterance contains typos.
Would you like to get more info?
At Bitext, we provide a clear emphasis on linguistic-based abstraction language automation to deliver innovative customer experiences. If you want to test our solutions or learn more, we recommend you schedule a personalized demo from one of our experts.
Would you like to get more info?
At Bitext, we provide a clear emphasis on linguistic-based abstraction language automation to deliver innovative customer experiences. If you want to test our solutions or learn more, we recommend you to get a personalized demo from one of our experts.

MADRID, SPAIN
Camino de las Huertas, 20, 28223 Pozuelo
Madrid, Spain

SAN FRANCISCO, USA
541 Jefferson Ave Ste 100, Redwood City
CA 94063, USA