
LLMs Cannot Find Any More Data; What Are They Going to Do Now?

Finite Data: Where to Go When the Data Runs Out

Data is often called the oil of the AI industry. The metaphor almost works, except that the world has a surplus of oil while it is running out of data.


What’s the Problem in the AI Market?

Businesses are investing heavily in building LLM-based applications on models such as GPT, LLaMA, MPT, and Falcon. Since all of these models rely on very similar datasets and architectures, they are hard to tell apart in practice, and the applications built on them end up offering undifferentiated experiences: there is little room to stand out when everyone uses similar models trained on similar data. A16z elaborated quite eloquently on this differentiation dilemma in their article, “Who Owns the Generative AI Platform?”

What Solutions Are Available?

Since LLM architectures are mostly made public via open source, data offers one potential path out of this maze of uniformity. At Bitext we have been producing data for NLP/NLU/AI applications for years. To help overcome the mimicry sickness infecting LLMs, we have produced “Hybrid Datasets” (as in “hybrid cars”). We call them hybrid because they combine manual and synthetic data, created with a methodology that pairs NLG technology with curation by linguists and vertical experts.

Hybrid Datasets Have Two Advantages

First, despite being synthetic, hybrid datasets still avoid the typical problems of the generative approach because they are:

  • Hallucination free. The corpus is 100% hallucination free. This makes it particularly suitable for high-quality LLM fine-tuning.
  • Bias free. The corpus includes tagging for offensive language generated from human-curated dictionaries.
  • PII free. The corpus is 100% free of Personally Identifiable Information; instead of actual names and other personal data there are placeholders, or slots (see the sketch after this list).
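As an illustration, here is a minimal sketch of what one record in such a dataset might look like. The field names, the intent label, and the {{EMAIL}} slot are assumptions for the sake of the example, not Bitext's actual schema:

```python
# Illustrative only: a possible shape for one hybrid-dataset record.
# Field names, the intent label, and the {{EMAIL}} slot are assumptions,
# not Bitext's actual schema.
record = {
    "intent": "recover_password",
    "utterance": "can you send a new password to {{EMAIL}}?",
    "slots": ["EMAIL"],      # placeholders stand in for personal data
    "register": "polite",    # one of the linguistic tags described below
}
```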

Second, our hybrid datasets carry extensive synthetic tagging. All of the data in our datasets is enriched with information about formality level, language register, and other idiomatic features (see the sketch after these examples). For example:

  • a request like, “can u send me a new pw?” will be tagged as “colloquial”
  • another request like, “just cancel right now the f***g order” will be tagged as “offensive”
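To make the idea concrete, here is a hedged sketch of how such register tags could be attached to the two requests above; the field names are illustrative, and Bitext's full tag set is not reproduced here:

```python
# Illustrative only: pairing the two utterances above with register tags.
tagged_examples = [
    {"utterance": "can u send me a new pw?", "register": "colloquial"},
    {"utterance": "just cancel right now the f***g order", "register": "offensive"},
]
```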

Bitext uses 12 different tags to mark the data in its datasets. Our unique tagging strategy allows us to easily create different vertical datasets for different customers. For example, customers who speak in colloquial dialects can easily be directed to a model that responds in the style they are most comfortable with. We will discuss this in more detail in Part 2.
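As a rough sketch of how such tags could be used in practice, the snippet below filters records by register to carve out a subset for fine-tuning a vertical model. It reuses the illustrative tagged_examples list from the previous sketch and does not reflect Bitext's internal tooling:

```python
def select_by_register(records, register):
    """Return only the records whose register tag matches, e.g. 'colloquial'."""
    return [r for r in records if r.get("register") == register]

# Build a fine-tuning subset for customers who prefer colloquial language.
colloquial_subset = select_by_register(tagged_examples, "colloquial")
```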
