Bitext Offers Consulting and Data Services for LLM Finetuning

Copilots for Customer Support and Onboarding in Any
Vertical: Banking, Travel, HR…
Language: English, Spanish, German,…
LLM:  GPT, Llama2, Mistral, DBRX…
TravelGPT demo for cellular device

Working with 3 of the Top 5 Largest Companies in NASDAQ

Bitext Copilots

Why do we create Copilots?

We create Copilots because they represent the future of User Experience. They will be the new way for users to book airline tickets, open a bank account, or reserve a table at a restaurant, all through an intelligent system that can answer user queries while executing the desired transaction. This will support both internal company processes, such as the human resources department, where employees can inquire about their vacation days, and sales processes by guiding users through a much more effective funnel.

NLG Technology to Generate Hybrid Datasets for LLM Fine-tuning

What does Bitext Copilot do?

Bitext Copilot is an intelligent system that guides users through any of the processes mentioned above. The unique feature of Bitext Copilot is its proactive intelligence, as it remains with the user until the process is completed.

How does Bitext create customized Copilots for each company?

According to OpenAI, the most effective method to deploy Copilots for companies in various industries is through a process called “Finetuning.” This method requires a training dataset with clean and structured data to train the system effectively, ensuring it learns everything about your company without generating hallucinations. Bitext is one of the few companies worldwide with a Natural Language Generation (NLG) system that can automatically, rapidly, and effectively generate the required clean and structured data tailored to the specific industry or vertical needed. These generated data do not have legal issues regarding rights, etc.

NLG Technology to Generate Hybrid Datasets for LLM Fine-tuning

NLG Technology to Generate Hybrid Datasets for LLM Fine-tuning

Our datasets are hybrid datasets because they combine the scale and volume of synthetic text generation with the quality of expert curation. These datasets are tagged with linguistic properties that motivate variation: colloquial/formal language, intentional spelling errors, different syntactic structures, etc.

The datasets are designed to fine-tune Large Language Models (LLMs) for conversational applications and, in particular, for customer support. Our datasets use a hybrid methodology that merges synthetic techniques and linguistic supervision to solve problems that are typical of text produced with generative AI like hallucination, bias, and PII.

Bitext Open-Source Dataset

We have shared a sample Hybrid Dataset to enable the AI community to evaluate and leverage it. Here are the main features of this sample dataset:

  • Primary Objective:
    The dataset is chiefly designed for training Large Language Models (LLMs) aimed at enhancing the efficiency of conversational applications, particularly in customer support. It addresses common issues associated with data produced through generative AI, such as hallucination, bias, and PII. This is achieved by utilizing a hybrid methodology that combines synthetic techniques with linguist supervision, facilitating the creation of smaller, easier-to-operate LLMs with higher accuracy. Importantly, our datasets comply with AWS (Amazon Web Services) and Apple policies for PII and data sharing, which make them ideal platforms, applications, and models (like AWS, Siri, or Lex).
  • Language Coverage:
    Currently, the dataset covers English and Spanish, with some data generated in German. The technology is also ready for another 8 languages, including German, French, Italian, Dutch, Portuguese, Swedish, Polish, and Korean. In the future we plan to expand to cover Danish, Turkish, Chinese, and Japanese for a total of 14 languages.
  • Content Characteristics:
    The dataset encompasses questions and answers typical for Customer Support within the e-commerce domain. These questions are enriched with extensive linguistic tagging (formal, colloquial, noisy, etc.).  The primary facets of these datasets are categorized under ‘intent’, ‘instruction’, and ‘response’, with the option to include additional fields such as ‘context’ or ‘system prompt’.
  • Volume Metrics:
    In terms of volume, the dataset comprises 3.5 million tokens and 27,000 question-answer pairs.

    Bitext Dataset Language Tagging

    The Bitext Datasets are enriched with a large set of language tags, capturing the diverse ways in which language can generate variants.

    The corpus contains more than 12 different tag types (a detailed description of the tagging is provided below). Here are some examples:

    • Handling LLMs with Multiple Registers:
      In natural language, the same content can be expressed in different language registers, typically formal and colloquial. For example:

      • Tag “COLLOQUIAL”: Indicates the utterance contains informal expressions.
        Ex: “can u close my account”
      • Tag “FORMAL”: Indicates the utterance contains formal language.
        Ex: “Could you please help me close my account?”
      Bitext Dataset Language Tagging

      With the information about variants captured in tagging, it’s possible to fine-tune the model for different types of speakers and their language preferences: informal for younger audiences and more formal for senior audiences.

      Detecting Non-Desired Language:
      Variants of texts with biased and offensive language have been included and tagged. This tagging facilitates the training and evaluation of models to detect undesired biased or offensive language. These texts guarantee that all biased or offensive language is tagged.

        • Tag “OFFENSIVE”: Indicates the utterance contains offensive expressions.
          Ex: “open my f*&%* account”
      • Errors/Typos in Texting:
        To better replicate actual texts from users querying LLMs, classical errors like typos have been included.
        • Tag “NOISE”: Indicates the utterance contains typos.
          Ex: “how can i activaet my card”

      Taming the GPT Beast for Customer Service

      GPT and other generative models tend to provide disparate answers for the same question. Having control is called Fine-tuning.

      Can You Use GPT for CX Purposes? Yes, You Can

      ChatGPT has major flaws that prevent it from becoming a useful tool in industries like Customer Experience

      Synthetic Text Datasets: Teaching Business Strategy to LLMs

      Immerse yourself in the dynamic world of AI innovations, where businesses hustle to create standout applications – all powered by Large Language Models (LLMs)

      Boosting Customer Experience with Empathetic AI Chatbots

      In a groundbreaking move, OpenAI has introduced ChatGPT Enterprise reshaping the realm of AI and its profound impact on businesses.

      Transforming the Business Landscape with ChatGPT Enterprise: A Detailed Look

      In a groundbreaking move, OpenAI has introduced ChatGPT Enterprise reshaping the realm of AI and its profound impact on businesses.

      Harnessing Large Language Models (LLMs) and Artificial Intelligence

      In today’s fast-paced business landscape, the fusion of Artificial Intelligence (AI) and Large Language Models (LLMs) is redefining how industries operate.

      Enhancing Intent Recognition in Chatbots: Fine-tuning GPT-3 with Generative Large Language Models

      Generative Large Language Models (LLMs) have demonstrated remarkable performance in tackling various business challenges, from generating well-written articles to sentiment classification of sentences. Moreover, customizing these models with tailored data presents an intriguing avenue to leverage their vast knowledge for specific real-life use-cases.


      Chatbot? Where is the Chat?

      A couple of weeks ago, Facebook introduced an upgrade for its Messenger platform. The upgrade of Messenger was aimed to improve the user experience. According to the announcement, they have taken into consideration user’s feedback to create new features. It sounds like good news, right? Well, you need to keep reading.

      Worldwide Language Coverage

      Worldwide Language Coverage

      Need More Info?

      At Bitext, we focus on linguistic-based language automation to deliver innovative customer experiences. If you want to test our solutions or learn more, we recommend you schedule a personalized demo from one of our experts.

      Request a Demo


      Camino de las Huertas, 20, 28223 Pozuelo
      Madrid, Spain


      541 Jefferson Ave Ste 100, Redwood City
      CA 94063, USA