
The Rise of Small Language Models: The Future of On-Device and Cloud AI

Predibase’s Dev Rishi offers his take on the rise of small language models and the future of on-device and cloud AI. This article originally appeared on Solutions Review’s Insight Jam, an enterprise IT community enabling the human conversation on AI.

Apple recently announced “Apple Intelligence” for iOS devices, marking a major shift in how generative AI models will be integrated into our everyday lives and tasks, both on personal devices and in the cloud. As part of the announcement, Apple laid out a new reference architecture for GenAI and for the teams building with generative models. This news is a game changer.

First, all of Apple’s models run on-device, requiring high performance in a compute-constrained environment. Second, Apple took an innovative approach to creating these models. Instead of using large monolithic AI models, they adopted a technique called LoRA (Low-Rank Adaptation) to fine-tune “adapters” on top of small language models. These small, specialized adapters can perform individual tasks—like proofreading, summarizing action items, and generating messages—with high accuracy (oftentimes outperforming GPT-4). What’s really novel is that these adapters can be served dynamically on top of a single deployed model.
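
To make the idea concrete, here is a minimal sketch of what attaching a LoRA adapter to a small open-source model looks like using the Hugging Face PEFT library. This is an illustration, not Apple’s stack: the base model name and hyperparameters below are placeholders.

```python
# Minimal LoRA sketch with Hugging Face PEFT (illustrative; not Apple's internal stack).
# Only the low-rank adapter matrices are trained; the base model weights stay frozen.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # placeholder small base model
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor applied to the update
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()         # typically well under 1% of the base model
```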

The implications of this approach are profound. A three billion parameter foundation model — which is about 500x smaller than GPT-4 — can be augmented with hundreds of capabilities while remaining efficient on mobile chips or cost-effective in the cloud. Switching between different tasks is as simple as loading the relevant adapter weights, minimizing impact on storage, memory footprint, latency and concurrency. This eliminates bloat and laggy response times when swiping for the next action on your iPhone.
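
To illustrate what that switch looks like in code, here is a hedged sketch using PEFT’s multi-adapter API (again, not Apple’s on-device runtime). The adapter paths and names are hypothetical.

```python
# Hedged sketch: one frozen base model, multiple task adapters swapped on demand
# via Hugging Face PEFT's multi-adapter API. Adapter paths/names are hypothetical.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Load a first adapter and give it a name.
model = PeftModel.from_pretrained(base, "./adapters/proofreading", adapter_name="proofreading")

# Additional adapters add only a few MB each; the base weights are never duplicated.
model.load_adapter("./adapters/summarization", adapter_name="summarization")

# Switching tasks is just pointing at a different set of adapter weights.
model.set_adapter("summarization")
```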

This modular approach will redefine on-device AI and cloud applications. Services no longer need to rely on massive, general-purpose LLMs. A single small base model can be deployed and shared across many even smaller adapters to handle a virtually unlimited range of tasks. Updating capabilities is as simple as pushing out new adapter weights.
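
As a rough illustration of that last point, saving a trained adapter with PEFT produces a small artifact; the paths below are hypothetical.

```python
# Hedged sketch: an "update" to a capability is just new adapter weights.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
model = PeftModel.from_pretrained(base, "./adapters/proofreading")  # hypothetical local adapter

# save_pretrained writes only adapter_config.json plus the adapter weights
# (typically a few MB), which is the entire artifact that needs to be shipped
# to devices or servers already running the frozen base model.
model.save_pretrained("./adapters/proofreading-v2")
```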

As Hamel Husain, an AI thought leader, mentioned on X, Apple’s announcement is perhaps “the best ad for fine-tuning that probably exists.” While my team at Predibase has been championing fine-tuning SLMs for a while, it’s hard to argue with Apple’s scale and reach in bringing this game-changing AI architecture into the mainstream.

So, if you are wondering how to get started fine-tuning LoRA adapters for your own applications, the good news is this revolutionary approach isn’t unique to Apple. You can do this today.
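
As a starting point, here is a hedged, end-to-end sketch using the open-source Hugging Face stack (transformers, peft, datasets). The base model, hyperparameters, and toy examples are placeholders; a real run would use your own task dataset.

```python
# Hedged end-to-end sketch: fine-tune a LoRA adapter for a single task
# (a toy "summarize action items" task) on a small open-source model.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model

base_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.pad_token = tokenizer.eos_token
model = get_peft_model(
    AutoModelForCausalLM.from_pretrained(base_id),
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
               task_type="CAUSAL_LM"),
)

# Toy training rows; replace with your own instruction/response pairs.
rows = [
    {"text": "Summarize the action items: ... ### Summary: 1) Send the deck by Friday."},
    {"text": "Summarize the action items: ... ### Summary: 1) Book the venue. 2) Email the agenda."},
]
ds = Dataset.from_list(rows).map(lambda r: tokenizer(r["text"], truncation=True, max_length=512))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", num_train_epochs=1,
                           per_device_train_batch_size=1, learning_rate=2e-4),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("lora-out/adapter")   # only the adapter weights are written
```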

We believe the future of generative AI lies in smaller, faster, cheaper open-source models that are fine-tuned for specific tasks. We’re already seeing a major trend of developers looking to switch from monolithic closed-source LLMs or APIs to fine-tuned open-source models. The narrative that fine-tuning can be hard and expensive is quickly being rewritten. We often get asked questions like: What model should I fine-tune for which task? How can I reduce serving costs and maximize resource utilization while serving adapters?

To address these questions more broadly, we recently conducted over 500 fine-tuning experiments and shared our findings in the Fine-Tuning Index. It demonstrates how LoRA-tuned adapters match or exceed the performance of state-of-the-art commercial offerings like GPT-4 across a broad range of tasks, and at a fraction of the cost – around $8 per adapted model.

Open-source tools such as LoRAX (LoRA eXchange) further accelerate adoption by enabling hundreds of LoRA adapters to be hosted on a single GPU, dramatically reducing the cost of serving fine-tuned models at scale while still delivering low latency and high concurrency.
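
Once a LoRAX deployment is running with a shared base model, each request can simply name the adapter it wants. The sketch below is hedged: the server URL and adapter identifiers are placeholders, and the exact request fields should be checked against the LoRAX documentation.

```python
# Hedged sketch: selecting a LoRA adapter per request against a LoRAX deployment.
# The URL and adapter IDs are placeholders.
from typing import Optional
import requests

LORAX_URL = "http://localhost:8080/generate"   # placeholder deployment URL

def generate(prompt: str, adapter_id: Optional[str] = None) -> str:
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 128}}
    if adapter_id:
        # LoRAX applies the named adapter on top of the shared base model for this request.
        payload["parameters"]["adapter_id"] = adapter_id
    resp = requests.post(LORAX_URL, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["generated_text"]

# Same deployment, different specialized behaviors:
print(generate("Proofread: Their going to the the meeting tomorow.",
               adapter_id="my-org/proofreading-adapter"))
print(generate("Summarize the action items from this thread: ...",
               adapter_id="my-org/summarization-adapter"))
```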

The key innovations powering this shift to efficient, specialized AI in both on-device and cloud scenarios are accessible today for all. They include:

  1. Small open-source foundation models that rival large commercial offerings.

  2. LoRA fine-tuning to create small, task-specific adapters that leverage the base foundation model.

  3. Frameworks (like LoRAX) that dynamically serve many adapters efficiently with high throughput.

By combining these elements, it’s now possible to deliver intelligent experiences that were previously confined to the largest models and most well-resourced organizations. Interactions can be faster, more natural, and better aligned with each user’s individual needs and context, whether on their personal device or in their favorite cloud applications.

We’re entering an age where every service and device can be a personalized intelligent assistant. With the power of small language models and the efficiency of LoRA adapters, this vision is now within reach for all developers and organizations.

As Apple has shown, the key lies in a modular approach leveraging foundation models and efficient fine-tuning techniques. By making these building blocks widely accessible and easy to use, we can democratize AI and put its benefits in the hands of every developer and user, ushering in a new era of ambient intelligence powered by small language models.
