Bitext Automates Text Data Services for Multilingual GenAI

  • Automation of Data Annotation and Labelling (DAL) tasks
  • Generation of Synthetic Text with proprietary NLG tech
  • Verticalization of LLMs (GPT, Mistral…) in 20 domains (CS, Banking…)
  • Training and Evaluation of LLMs (GPT, Mistral…) for Conversational AI
TravelGPT demo for mobile devices

Working with 3 of the Top 5 Largest Companies on NASDAQ

NLG Technology to Generate Hybrid Datasets for LLM Fine-tuning

Our datasets are hybrid datasets because they combine the scale and volume of synthetic text generation with the quality of expert curation. These datasets are tagged with linguistic properties that motivate variation: colloquial/formal language, intentional spelling errors, different syntactic structures, etc.

The datasets are designed to fine-tune Large Language Models (LLMs) for conversational applications and, in particular, for customer support. They are built with a hybrid methodology that merges synthetic generation techniques with linguistic supervision to address problems typical of text produced with generative AI, such as hallucination, bias, and exposure of personally identifiable information (PII).

Bitext Open-Source Dataset

We have shared a sample Hybrid Dataset to enable the AI community to evaluate and leverage it. Here are the main features of this sample dataset:

  • Primary Objective:
    The dataset is chiefly designed for training Large Language Models (LLMs) aimed at enhancing the efficiency of conversational applications, particularly in customer support. It addresses common issues associated with data produced through generative AI, such as hallucination, bias, and PII exposure. This is achieved through a hybrid methodology that combines synthetic techniques with linguistic supervision, facilitating the creation of smaller, easier-to-operate LLMs with higher accuracy. Importantly, our datasets comply with AWS (Amazon Web Services) and Apple policies for PII and data sharing, which makes them well suited for platforms, applications, and models like AWS, Siri, or Lex.
  • Language Coverage:
    Currently, the dataset covers English and Spanish, with some data also generated in German. The technology is ready for another 8 languages: German, French, Italian, Dutch, Portuguese, Swedish, Polish, and Korean. In the future, we plan to expand to Danish, Turkish, Chinese, and Japanese, for a total of 14 languages.
  • Content Characteristics:
    The dataset encompasses questions and answers typical of Customer Support in the e-commerce domain. These questions are enriched with extensive linguistic tagging (formal, colloquial, noisy, etc.). The primary facets of these datasets are categorized under ‘intent’, ‘instruction’, and ‘response’, with the option to include additional fields such as ‘context’ or ‘system prompt’ (see the record sketch after this list).
  • Volume Metrics:
    In terms of volume, the dataset comprises 3.5 million tokens and 27,000 question-answer pairs.
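
As a rough illustration, a single record in such a dataset might look like the sketch below. The field names follow the description above; the tag values, file name, and exact schema are illustrative assumptions, not the published format.

    # Illustrative example of one hybrid-dataset record, assuming the fields
    # described above ('intent', 'instruction', 'response') plus a list of
    # linguistic tags; the exact schema of the published dataset may differ.
    import json

    record = {
        "intent": "cancel_order",
        "instruction": "I do not want this item, where to cancel my order?",
        "response": "I'm sorry to hear that. You can cancel your order from the Orders page ...",
        "tags": ["INTERROGATIVE", "NEGATION", "COLLOQUIAL"],
    }

    # Records like this are typically stored one per line in a JSON Lines file.
    with open("hybrid_dataset_sample.jsonl", "w", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")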

    Bitext Dataset Language Tagging

    The Bitext Datasets are enriched with a large set of language tags, capturing the diverse ways in which language can generate variants.

    The corpus contains more than 12 different tag types (a detailed description of the tagging is provided below).

    Tags for Lexical variation

    M – Morphological variation (inflectional and derivational): “is my SIM card active”, “is my SIM card activated”

    L – Semantic variation (synonyms, use of hyphens, compounding…): “what’s my billing date”, “what’s my anniversary date”

    Tags for Syntactic structure variation

    B – Basic syntactic structure: “activate my SIM card”, “I need to activate my SIM card”

    I – Interrogative structure: “can you activate my SIM card?”, “how do I activate my SIM card?”

    C – Coordinated syntactic structure: “I have a new SIM card, what do I need to do to activate it?”

    N – Negation: “I do not want this item, where to cancel my order?”

    Tags for language register variations

    P – Politeness variation: “could you help me activate my SIM card, please?”

    Q – Colloquial variation: “can u activ8 my SIM?”

    W – Offensive language: “I want to talk to a f*&%*g agent”

    Tags for stylistic variations

    K – Keyword mode: “activate SIM”, “new SIM”

    E – Use of abbreviations: “I’m / I am interested in getting a new SIM”

    Z – Errors and Typos: spelling issues, wrong punctuation… “how can i activaet my card”

    Other tags not in use in this Dataset

    D – Indirect speech: “ask my agent to activate my SIM card”

    G – Regional variations: US English vs. UK English (“truck” vs. “lorry”); France French vs. Canadian French (“tchatter” vs. “clavarder”)

    R – Respect structures (language-dependent variations): English “may” vs. “can”; French “tu” vs. “vous”; Spanish “tú” vs. “usted”

    Y – Code switching: “activer ma SIM card”
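
    Collected in one place, the tag codes above can be kept as a simple lookup table; a minimal sketch in Python, with the codes and names taken from the list above:

        # Tag codes used in the Bitext datasets, as listed above.
        TAG_TYPES = {
            "M": "Morphological variation",
            "L": "Semantic variation",
            "B": "Basic syntactic structure",
            "I": "Interrogative structure",
            "C": "Coordinated syntactic structure",
            "N": "Negation",
            "P": "Politeness variation",
            "Q": "Colloquial variation",
            "W": "Offensive language",
            "K": "Keyword mode",
            "E": "Use of abbreviations",
            "Z": "Errors and typos",
            # Defined but not used in this dataset:
            "D": "Indirect speech",
            "G": "Regional variation",
            "R": "Respect structures",
            "Y": "Code switching",
        }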

    Here are some examples:

    • Handling LLMs with Multiple Registers:
      In natural language, the same content can be expressed in different language registers, typically formal and colloquial. For example:
      • Tag “COLLOQUIAL”: Indicates the utterance contains informal expressions.
        Ex: “can u close my account”
      • Tag “FORMAL”: Indicates the utterance contains formal language.
        Ex: “Could you please help me close my account?”

    With the information about variants captured in tagging, it’s possible to fine-tune the model for different types of speakers and their language preferences: informal for younger audiences and more formal for senior audiences.
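
    For instance, a fine-tuning subset targeting a specific register could be selected by filtering on these tags; a minimal sketch, assuming the illustrative JSONL file and tag names used earlier:

        import json

        def load_records(path):
            """Read one JSON record per line from a JSON Lines file."""
            with open(path, encoding="utf-8") as f:
                return [json.loads(line) for line in f if line.strip()]

        # Keep only colloquial utterances, e.g. to tune a model for a younger audience.
        records = load_records("hybrid_dataset_sample.jsonl")
        colloquial_subset = [r for r in records if "COLLOQUIAL" in r.get("tags", [])]
        formal_subset = [r for r in records if "FORMAL" in r.get("tags", [])]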

    • Detecting Non-Desired Language:
      Variants of texts with biased and offensive language have been included and tagged. This tagging facilitates the training and evaluation of models to detect undesired biased or offensive language; in these texts, all biased or offensive language is guaranteed to be tagged.

      • Tag “OFFENSIVE”: Indicates the utterance contains offensive expressions.
        Ex: “open my f*&%* account”
    • Errors/Typos in Texting:
      To better replicate actual texts from users querying LLMs, classical errors like typos have been included.
      • Tag “NOISE”: Indicates the utterance contains typos.
        Ex: “how can i activaet my card”
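
    Because every offensive or noisy utterance carries the corresponding tag, the same records can be turned into labelled examples for training or evaluating a detector; again a sketch under the same assumed schema:

        import json

        def to_classification_pairs(records, positive_tag="OFFENSIVE"):
            """Return (text, label) pairs; label is 1 when the utterance carries the tag."""
            return [
                (r["instruction"], int(positive_tag in r.get("tags", [])))
                for r in records
            ]

        with open("hybrid_dataset_sample.jsonl", encoding="utf-8") as f:
            records = [json.loads(line) for line in f if line.strip()]

        offensive_pairs = to_classification_pairs(records, positive_tag="OFFENSIVE")
        noisy_pairs = to_classification_pairs(records, positive_tag="NOISE")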

    Bitext Copilots

    Why do we create Copilots?

    We create Copilots because they represent the future of User Experience. They will be the new way for users to book airline tickets, open a bank account, or reserve a table at a restaurant, all through an intelligent system that can answer user queries while executing the desired transaction. Copilots will support both internal company processes, such as a human resources department where employees can ask about their vacation days, and sales processes, guiding users through a much more effective funnel.

    What does Bitext Copilot do?

    Bitext Copilot is an intelligent system that guides users through any of the processes mentioned above. The unique feature of Bitext Copilot is its proactive intelligence, as it remains with the user until the process is completed.

    How does Bitext create customized Copilots for each company?

    According to OpenAI, the most effective way to deploy Copilots for companies in various industries is through a process called fine-tuning. This method requires a training dataset with clean and structured data, so the system learns everything about your company without generating hallucinations. Bitext is one of the few companies worldwide with a Natural Language Generation (NLG) system that can automatically, rapidly, and effectively generate the required clean, structured data, tailored to the specific industry or vertical. The generated data are also free of legal issues regarding rights or licensing.
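
    As a rough sketch of what this fine-tuning step can look like in practice, the snippet below uploads a prepared JSONL dataset and starts a job with the OpenAI Python client; the file name and base model are assumptions, and the actual pipeline and provider may differ:

        from openai import OpenAI

        client = OpenAI()  # reads OPENAI_API_KEY from the environment

        # Upload the prepared training data (one chat-formatted example per line).
        training_file = client.files.create(
            file=open("customer_support_finetune.jsonl", "rb"),
            purpose="fine-tune",
        )

        # Start a fine-tuning job on a base chat model.
        job = client.fine_tuning.jobs.create(
            training_file=training_file.id,
            model="gpt-3.5-turbo",
        )
        print(job.id, job.status)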

    Deploying Successful GenAI-based Chatbots with Less Data and More Peace of Mind

    Customizing Large Language Models in 2 steps via fine-tuning is a very efficient way to reduce data needs, as well as training and evaluation efforts, when building customized Conversational Assistants. Bitext provides these Pre-Built Datasets and Models in 20 verticals.

    Any Solutions to the Endless Data Needs of GenAI?

    Discover the advantages of using symbolic approaches over traditional data generation techniques in GenAI. Learn how 100% reliable, bias-free, and PII-free data can be achieved through rule-based generation, ensuring semantic integrity and accuracy. Explore the unique benefits of this method for generating variations from seed sentences with predictable outcomes.

    From General-Purpose LLMs to Verticalized Enterprise Models

    In the blog “General Purpose Models vs. Verticalized Enterprise GenAI,” the focus is on the advantages of verticalizing AI models for specific enterprise domains. Verticalized models can disambiguate context-specific terms and speak in industry-specific tones. There are two approaches: building models from scratch, which is costly, or fine-tuning general-purpose models with domain-specific data. Bitext proposes a faster two-step method: first verticalize the model, then customize it with enterprise data. This approach saves time and resources and avoids common AI issues like hallucinations and bias.
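
    Schematically, the two-step approach can be read as two successive fine-tuning runs. The helper below is purely hypothetical and stands in for whatever training stack is used (OpenAI fine-tuning jobs, Hugging Face Trainer, etc.); the dataset names are illustrative:

        def fine_tune(base_model: str, dataset_path: str) -> str:
            """Hypothetical placeholder for a standard fine-tuning routine;
            returns an identifier for the resulting model."""
            print(f"fine-tuning {base_model} on {dataset_path}")
            return f"{base_model}-tuned-on-{dataset_path}"

        # Step 1: verticalize a general-purpose model with a pre-built domain dataset.
        vertical_model = fine_tune("general-purpose-llm", "bitext_banking_dataset.jsonl")

        # Step 2: customize the verticalized model with the enterprise's own data.
        enterprise_model = fine_tune(vertical_model, "enterprise_internal_faq.jsonl")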

    Case Study: Finequities & Bitext Copilot – Redefining the New User Journey in Social Finance

    Bitext introduced the Copilot, a natural language interface that replaces static forms with a conversational, proactive, and highly personalized user experience. This change not only simplified the onboarding process but also made it more interactive and capable of resolving queries in real time, offering significant advantages over traditional methods.

    Automating Online Sales with Proactive Copilots

    The next generation of GenAI Copilots moves from passively answering customer questions to actively executing online sales. These new Copilots are proactive: they can start and drive an interaction with a potential customer. They are also context-aware: they know the different steps in the sales process, where they are in it, and how to move to the next step.

    Taming the GPT Beast for Customer Service

    GPT and other generative models tend to provide disparate answers to the same question. Gaining control over that behavior is called fine-tuning.

    Can You Use GPT for CX Purposes? Yes, You Can

    ChatGPT has major flaws that prevent it from becoming a useful tool in industries like Customer Experience.

    Worldwide Language Coverage

    Need More Info?

    At Bitext, we focus on linguistic-based language automation to deliver innovative customer experiences. If you want to test our solutions or learn more, we recommend scheduling a personalized demo with one of our experts.

    MADRID, SPAIN

    Camino de las Huertas, 20, 28223 Pozuelo
    Madrid, Spain

    SAN FRANCISCO, USA

    541 Jefferson Ave Ste 100, Redwood City
    CA 94063, USA