If data is the oil of the AI industry, we are running out of data faster than we are running out of oil. Clearly, we have a problem.
What’s the Problem in the AI Market?
Businesses are investing heavily in creating LLM-based applications built on models such as GPT, LLaMA, MPT, and Falcon. Since all these models rely on very similar datasets and architectures, they tend to be indistinguishable from each other in practice. This lack of differentiation leads to AI applications that offer undifferentiated experiences, since they are based on similar models trained on similar data with similar architectures. A16z framed this question in “Who Owns the Generative AI Platform?”
What Solutions are Available?
Since architectures are mostly public via open source, data seems to offer one potential path to differentiation. At Bitext we have produced data for NLP/NLU/AI applications for several years. To address this challenge, we have produced “Hybrid Datasets” (as in “hybrid cars”). We call them hybrid because they combine manual and synthetic data, created with a methodology that pairs NLG technology with curation by linguists and vertical experts.
What are the advantages of “Hybrid Datasets”? Two.
First, they are synthetic, yet they avoid the typical problems of the generative approach:
- Hallucination free. The corpus is 100% hallucination free, which makes it particularly suitable for high-quality LLM fine-tuning.
- Bias free. The corpus includes tagging for offensive language generated from human-curated dictionaries.
- PII free. The corpus is 100% free of Personally Identifiable Information: actual names are replaced with placeholders or slots.
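As a rough sketch of how slot-based, PII-free data works (the `{{...}}` placeholder syntax and field names here are our illustrative assumptions, not Bitext’s actual format), a record carries no real names and slots are filled only at generation time:

```python
import re

# Hypothetical PII-free training record: placeholders instead of real data.
record = {
    "utterance": "I want to cancel order {{ORDER_NUMBER}} placed by {{CUSTOMER_NAME}}",
    "intent": "cancel_order",
}

def fill_slots(utterance: str, values: dict) -> str:
    """Replace {{SLOT}} placeholders with concrete values at generation time."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: values[m.group(1)], utterance)

filled = fill_slots(record["utterance"],
                    {"ORDER_NUMBER": "A-1234", "CUSTOMER_NAME": "Jane Doe"})
print(filled)  # → I want to cancel order A-1234 placed by Jane Doe
```

Because real values only ever enter at fill time, the dataset itself can be shared and used for fine-tuning without exposing any personal information.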
Second, the datasets carry extensive synthetic tagging: the text is not just raw text, but text enriched with information on the type of language variation it expresses. For example:
- a request like “can u send me a new pw?” will be tagged as “colloquial”
- another request like “just cancel right now the f***g order” will be tagged as “offensive”
There are 12 different tags of this kind. Different vertical datasets can be created from this tagging, for example, a training dataset for younger users based on colloquial texts. We will come back with more on this in Part 2.
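To illustrate how such tags could be used downstream, here is a minimal sketch of filtering a tagged dataset by register (the `tags` field and the specific tag names are assumptions for illustration, not the actual schema):

```python
# Hypothetical tagged records mirroring the examples above.
dataset = [
    {"text": "can u send me a new pw?", "tags": ["colloquial"]},
    {"text": "Could you please reset my password?", "tags": ["polite"]},
    {"text": "just cancel right now the f***g order", "tags": ["offensive", "colloquial"]},
]

def filter_by_tag(records, tag):
    """Select only records carrying the given language-variation tag."""
    return [r for r in records if tag in r["tags"]]

# Build a colloquial-only subset, e.g. for a younger audience,
# while excluding anything tagged offensive.
colloquial = [r for r in filter_by_tag(dataset, "colloquial")
              if "offensive" not in r["tags"]]
print(len(colloquial))  # → 1
```

The same one-line filter generalizes to any of the tags, which is what makes building vertical or audience-specific training subsets cheap once the tagging exists.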