Finite Data: Where to Go When the Data Runs Out

Data is often called the oil of the AI industry. The metaphor almost works, except that the world has a surplus of oil, while it is running out of data.


What’s the Problem in the AI Market?

Businesses are investing heavily in building LLM-based applications on models such as GPT, LLaMA, MPT, and Falcon. Because these models rely on very similar datasets and architectures, they are often indistinguishable in practice, and the applications built on them offer equally undifferentiated experiences. There isn’t much room to stand out when every application combines a similar model, similar data, and a similar architecture. A16z elaborated quite eloquently on this differentiation dilemma in their article, “Who Owns the Generative AI Platform?”

What Solutions Are Available?

Since LLM architectures are mostly made public via open source, data offers one potential path out of this maze of uniformity. At Bitext we have produced data for NLP/NLU/AI applications for several years. To help overcome the mimicry sickness infecting LLMs, we have produced “Hybrid Datasets” (as in “hybrid cars”). We call them hybrid because they combine manual and synthetic data, created with a methodology that pairs NLG technology with curation by linguists and vertical experts.

Hybrid Datasets Have Two Advantages

First, despite being synthetic, hybrid datasets still avoid the typical problems of the generative approach because they are:

  • Hallucination free. The corpus is 100% hallucination free. This makes it particularly suitable for high-quality LLM fine-tuning.
  • Bias free. The corpus includes tagging for offensive language generated from human-curated dictionaries.
  • PII free. The corpus is 100% free of Personally Identifiable Information; instead of actual names there are placeholders or slots.
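The slot mechanism in the last bullet can be sketched in a few lines. The `{{SLOT}}` syntax, field names, and helper below are illustrative assumptions, not Bitext's actual format; the point is that training examples carry named placeholders rather than real personal data, and concrete values are only substituted downstream if needed.

```python
import re

# A PII-free training example: real names and emails are replaced by
# named slots, so no personal data ever enters the corpus.
record = {
    "utterance": "hi, I'm {{NAME}}, can you resend the invoice to {{EMAIL}}?",
    "intent": "resend_invoice",
}

def fill_slots(utterance: str, values: dict) -> str:
    """Substitute {{SLOT}} placeholders with concrete values (e.g. at test time)."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: values[m.group(1)], utterance)

print(fill_slots(record["utterance"], {"NAME": "Alice", "EMAIL": "alice@example.com"}))
# -> hi, I'm Alice, can you resend the invoice to alice@example.com?
```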

Second, our hybrid datasets have extensive synthetic tagging. All of the data in our datasets is enriched with information about formality level, language register, and other idiomatic features. For example:

  • a request like, “can u send me a new pw?” will be tagged as “colloquial”
  • another request like, “just cancel right now the f***g order” will be tagged as “offensive”

Bitext uses 12 different tags to mark the data in its datasets. This tagging strategy lets us easily create different vertical datasets for different customers. For example, customers who write colloquially can be routed to a model that responds in the register they are most comfortable with. We will discuss this in more detail in Part 2.
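Once every example carries register tags, building a vertical dataset reduces to a filter over the tag set. The tag names and the `select` helper below are illustrative assumptions (the actual 12-tag scheme is not public in this post); the sketch shows how one slice, say colloquial-but-not-offensive, falls out of the tagging.

```python
# Each example carries one or more register tags (tag names are illustrative).
dataset = [
    {"text": "can u send me a new pw?", "tags": ["colloquial"]},
    {"text": "Could you please issue a new password?", "tags": ["formal"]},
    {"text": "just cancel right now the f***g order", "tags": ["offensive", "colloquial"]},
]

def select(examples, include, exclude=()):
    """Keep examples with at least one `include` tag and no `exclude` tag."""
    return [
        ex for ex in examples
        if any(t in ex["tags"] for t in include)
        and not any(t in ex["tags"] for t in exclude)
    ]

# A vertical slice for a casual-register assistant, with offensive data held out.
colloquial_clean = select(dataset, include=["colloquial"], exclude=["offensive"])
print([ex["text"] for ex in colloquial_clean])
# -> ['can u send me a new pw?']
```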
