The Small(er) Data Era: How Fine-Tuning and Data Quality are Defining the AI Arms Race
Solutions Review’s Premium Content Series is a collection of contributed articles written by industry experts in enterprise software categories. In this feature, Tola Capital Vice President Jake Nibley, Partner Akshay Bhushan, and Founder Sinan Ozdemir offer a commentary on how fine-tuning and data quality are defining the AI arms race.
We’re only a few months into 2023, and the artificial intelligence foundation model arms race is heating up. OpenAI introduced GPT-4 in March with new image-to-text capabilities that have the tech world buzzing, and Google (finally) soft-launched its LaMDA-powered Bard chatbot. The industry is changing so fast that tech leaders are now calling for a six-month pause on training AI models more powerful than GPT-4 out of safety concerns.
With the media spotlight shining on these proprietary models trained on massive amounts of ‘big’ data, we’re ignoring the equally valuable open-source foundation models with far fewer parameters that can deliver OpenAI-quality results for specific use cases. The biggest foundation models won’t exclusively define this AI era. Just as much value can be derived from smaller, fine-tuned, open-source AI models. It’s up to founders and practitioners to balance the benefits and risks and find the right mix of models for their business.
Fine-Tuning and Data Quality
Model Use Case > Size
In conversations with developers, we repeatedly see open-source models outperform larger proprietary ones on discrete tasks. It’s not uncommon for us to hear from entrepreneurs that a well-tuned BERT or BLOOM model trained on their specific data outperforms the latest, largest model from OpenAI. The most famous public example came when DeepMind showed that a model with a lower parameter count (Chinchilla, at 70B parameters) could outperform models several times its size, such as Gopher at 280B parameters and GPT-3 at 175B parameters, on similar tasks. They found that current large language models are far too large for their compute budgets and are not being trained on enough data (or the right data) for their size.
Let’s say we work on the engineering team at a SaaS company and want to create a bot that routes text-based customer chats to the right team based on the semantics of a customer question or complaint. Engineers could use a large model with billions of parameters, like GPT-3, to tag or route conversations to the right teams. Or, as we are increasingly seeing, they can use a much narrower open-source model trained exclusively on written transcripts from customer support calls or chats. Because it was trained on contextually relevant data, this smaller model would likely be a better fit for the task at a lower cost.
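For illustration, here is a minimal sketch of what such a fine-tune might look like with Hugging Face Transformers, assuming a labeled CSV of historical support chats exists. The file name, routing labels, and hyperparameters are placeholders rather than any particular company’s pipeline.

```python
# Minimal sketch: fine-tune a small open-source model (BERT) to route support
# chats to teams. "support_chats.csv", the TEAMS list, and all hyperparameters
# are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

TEAMS = ["billing", "technical_support", "account_management"]  # assumed routing targets

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(TEAMS))

# Historical transcripts: one row per message ("text"), labeled with the team
# that ultimately handled it ("team").
data = load_dataset("csv", data_files="support_chats.csv")["train"]

def preprocess(batch):
    enc = tokenizer(batch["text"], truncation=True, max_length=256)
    enc["label"] = [TEAMS.index(t) for t in batch["team"]]
    return enc

tokenized = data.map(preprocess, batched=True, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="chat-router", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=tokenized,
    tokenizer=tokenizer,  # lets the Trainer pad batches dynamically
)
trainer.train()
```

A few hundred megabytes of fine-tuned BERT can then run on modest hardware, which is exactly the cost argument made above.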
Even if one model has incredible new state-of-the-art features (for now, that’s OpenAI’s GPT-4), open-source alternatives follow six to nine months behind with similar functionality, or even outperform GPT-4 on specific tasks at a lower cost. Even in the last few weeks, we’ve seen this happen with Databricks’ Dolly LLM. Another example is Alpaca, a language model created by Stanford Ph.D. students at the Center for Research on Foundation Models (CRFM) in March of this year. Alpaca is fine-tuned from a LLaMA 7B model using supervised learning on 52K instruction-following demonstrations generated in the style of Self-Instruct with text-davinci-003.
The data generation process cost the team less than $500 using the OpenAI API. In their initial run, fine-tuning the model took three hours on eight 80GB A100 GPUs and delivered performance similar to text-davinci-003 for only about $100 in cloud compute. For less than $1,000, the team created a language model that won one more comparison than text-davinci-003 in a blind pairwise evaluation (90 wins to 89). This shows us that it isn’t just possible for open-source models to catch up quickly; it’s inevitable.
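For readers curious what this kind of supervised instruction tuning looks like in practice, here is a simplified sketch using the Hugging Face Trainer. The checkpoint name, data file, prompt template, and hyperparameters are illustrative assumptions, not the exact Stanford recipe.

```python
# Simplified sketch of Alpaca-style supervised instruction fine-tuning.
# "huggyllama/llama-7b" and "alpaca_data.json" are assumed placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE_MODEL = "huggyllama/llama-7b"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Each record has "instruction", "input", and "output" fields, as in the
# 52K self-instruct-style demonstrations described above.
data = load_dataset("json", data_files="alpaca_data.json")["train"]

def to_prompt(example):
    text = (f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}{tokenizer.eos_token}")
    return tokenizer(text, truncation=True, max_length=512)

tokenized = data.map(to_prompt, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="alpaca-sft", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-5),
    train_dataset=tokenized,
    # mlm=False => standard causal (next-token) language-modeling objective
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The expensive part, as the Alpaca team found, is less the training loop than assembling a high-quality instruction dataset in the first place.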
Less is More: Small(er) Data Compared to Big Data
DeepMind’s research and Stanford’s Alpaca model remind us that what really constrains the reasoning and output of these models is high-quality, curated training data. Founders and CTOs will likely balance cost, privacy, security, and performance alongside the usual headline improvements in accuracy and reasoning. There are benefits and risks to both approaches, and it’s up to enterprises to weigh the cost-benefit tradeoff for themselves.
To understand the tradeoffs practitioners often evaluate, we summarized the pros and cons of open-source and proprietary models below.
Open-source models have many benefits. Because these models are self-hosted, you own the model, the data, and the entire ecosystem around it. They’re also much less expensive to run, because a smaller model needs far less compute to get the desired results.
At the same time, the risk of open-source foundation models is the amount of high-quality labeled data needed for fine-tuning, and getting the desired results may take time. Large proprietary models, like OpenAI’s GPT, can give you good-enough output with contextualized prompting, and the time to value is much faster than with open-source models. Yet integrating these models, and the computing power needed to run them, can be expensive. As these models get more complex, latency and costs increase.
Finally (and potentially most importantly), a risk of proprietary models is that you’re at the mercy of the foundation model provider’s terms of use, which could lead to data privacy concerns.
Foundation Model Innovation is Not Binary
We’re heading into a world where everyone has access to these models, and enterprises and individuals alike will get tons of value from them. It’s up to enterprises to decide not only how they’ll create the next industry-disrupting technology using proprietary or open-source models, but also how they’ll tune those models to deliver better results for their specific use case. This kind of innovation isn’t a binary choice: open-source and proprietary models can work harmoniously.
For example, Klarity uses a combination of optical character recognition (OCR) tools, BERT, and other models to turn contracts into machine-readable corpora of data. They then use GPT-4 to extract and understand contract terms through zero-shot and k-shot learning. The space is evolving at a blistering pace, and we’re excited to see what developers, entrepreneurs, and machine learning engineers flock to in the coming months.
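As a rough illustration of the k-shot half of that pattern, here is a minimal sketch of few-shot contract-term extraction with GPT-4 via the (pre-1.0) openai Python client. The example clauses and JSON schema are invented for illustration and are not Klarity’s actual pipeline.

```python
# Minimal sketch: k-shot ("few-shot") extraction with GPT-4. The SHOTS,
# field names, and API key are placeholders, not a real product pipeline.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# A couple of in-context examples ("shots") showing the desired output format.
SHOTS = [
    ("This Agreement commences on January 1, 2023 and renews annually "
     "unless terminated with 30 days' notice.",
     '{"start_date": "2023-01-01", "renewal": "annual", "notice_days": 30}'),
    ("The term of this Agreement is two (2) years beginning March 15, 2022.",
     '{"start_date": "2022-03-15", "renewal": "none", "notice_days": null}'),
]

def extract_terms(contract_text: str) -> str:
    messages = [{"role": "system",
                 "content": "Extract contract terms as JSON with keys "
                            "start_date, renewal, notice_days."}]
    for clause, answer in SHOTS:
        messages.append({"role": "user", "content": clause})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": contract_text})

    response = openai.ChatCompletion.create(model="gpt-4", messages=messages,
                                            temperature=0)
    return response["choices"][0]["message"]["content"]

print(extract_terms("This Agreement is effective July 1, 2023 and renews "
                    "monthly unless either party gives 14 days' written notice."))
```

Dropping the SHOTS list entirely turns the same call into a zero-shot prompt, which is often good enough for simpler fields.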
Despite the media hype cycle around large proprietary models, they aren’t always the panacea they’re made out to be. For many entrepreneurs building the next great company, the technical reality is often a mix of models of different sizes. These two realities can coexist; it’s just a matter of how companies use them.