Building Intuition: Retrieval Augmented Generation vs Fine Tuning

TL;DR: RAG for Facts, Fine Tuning for Style

Yujian Tang
Dev Genius

--

If you’ve been working with LLMs over the last few months, you’ve probably heard that RAG is best for fact retrieval and fine tuning is best for output styling. If you haven’t, well, you have now. Let’s explore how these two methods work so we can build an intuition for their pros and cons. We’ll start with the basics of how large language models work, and then build up to how fine tuning and RAG work.

How do LLMs Work?

At their heart, LLMs are neural networks, and the same factors that affect neural networks affect LLMs. What matters most is the data — its quality and quantity — and the architecture of the network.

When building retrieval applications, it’s mostly the data that we are concerned about. Unless you built a special-purpose LLM, most likely you are using a publicly available LLM such as GPT or LLaMa. The knowledge base of these models comes from the data they were trained on.

In all likelihood, these models do not have your data. That is the goal of RAG: to inject your data on top of an LLM, letting your data serve as the knowledge base while the LLM serves as the interpreter and processor.

Since these are general models, they also respond generally, only molded by the prompt. Fine tuning is another way to guide the output and change the model for specific tasks.

How does Fine Tuning Work?

There are many ways to fine tune a network. We’ll take a look at three popular methods: full fine tuning, transfer learning, and parameter efficient tuning.

Underneath all of these fine-tuning methods rests the same concept. Much like training a neural network, fine tuning is essentially giving the model new data. Adding or changing the last few layers is great for tuning the style of the output; imagine having an extra instruction prepended to each of your prompts.

However, you need a lot of data to get the model to output the right responses. Imagine how much data you have, then compare it with how much data GPT-3 was trained on (roughly 570 GB of text).

The three types of fine-tuning

Full fine tuning is essentially continuing training on new data: the entire model’s weights and biases are updated. This is the most expensive type of fine tuning, but if you have enough data it is also the most effective.
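To make the distinction concrete, here is a toy sketch of full fine tuning, shrunk down to a one-parameter-pair “model” y = w·x + b. The model, the data, and the hyperparameters are all made up for illustration; the point is only that *every* parameter receives gradient updates from the new data.

```python
# Toy illustration of full fine tuning: every parameter of a tiny
# linear model y = w*x + b is updated on the new task data.
# All values here are invented for the sketch.

def full_fine_tune(w, b, data, lr=0.1, epochs=200):
    """Update *all* parameters (w and b) with gradient descent on new data."""
    for _ in range(epochs):
        for x, y in data:
            pred = w * x + b
            err = pred - y
            # Gradients of the squared error flow into every parameter.
            w -= lr * err * x
            b -= lr * err
    return w, b

# "Pretrained" parameters start at 0; the new task data follows y = 2x + 1.
w, b = full_fine_tune(0.0, 0.0, [(0, 1), (1, 3), (2, 5)])
```

After tuning, both w and b have moved to fit the new data — nothing in the model is frozen, which is exactly why this approach is the most expensive at LLM scale.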

Transfer learning involves updating just a few layers. You can choose to update the last layers of the existing model, or add layers, depending on the task at hand. This type of fine tuning is best suited for getting a model to perform a specific task, such as telling a story.
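The same toy model can illustrate transfer learning. Here the “base” parameter plays the role of the pretrained layers, which stay frozen, and only the new “head” parameter is trained — again, a hypothetical sketch, not a real training setup.

```python
# Toy illustration of transfer learning: the pretrained "base" parameter
# is frozen, and only the newly added "head" parameter is trained.

def transfer_tune(w_base, b_head, data, lr=0.1, epochs=200):
    """Freeze w_base; gradient descent only touches the head parameter."""
    for _ in range(epochs):
        for x, y in data:
            pred = w_base * x + b_head
            err = pred - y
            # Only the head receives updates; the base stays fixed.
            b_head -= lr * err
    return w_base, b_head

# The pretrained slope (2.0) is frozen; we adapt only the offset
# to new data that follows y = 2x + 5.
w, b = transfer_tune(2.0, 0.0, [(0, 5), (1, 7), (2, 9)])
```

Because most parameters never change, far fewer gradients need to be computed and stored — which is why this is so much cheaper than full fine tuning.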

Parameter efficient tuning is like a “hack” that tunes small layers inserted into the model. LoRA is a recent, popular type of parameter efficient tuning. It works by injecting small low-rank layers into each transformer block and tuning only those layers. Because the tuned weights are tiny compared to the full model, this type of fine tuning is best for making your models easily deployable and shareable.
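The core LoRA idea can be shown in a few lines. This is an assumption-level sketch (plain-Python matrices, not the real library): the frozen weight matrix W is augmented by a low-rank product B @ A, and only the small A and B matrices are trained and shipped.

```python
# Minimal sketch of the LoRA idea: effective weight = W + (B @ A),
# where W is frozen and only the low-rank factors A and B are trainable.

def matmul(X, Y):
    """Plain-Python matrix multiply."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def lora_effective_weight(W, A, B, alpha=1.0):
    """Combine the frozen W with the scaled low-rank update; W itself never changes."""
    delta = matmul(B, A)
    return [[W[i][j] + alpha * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# Frozen 2x2 weight, rank-1 adapters: B is 2x1, A is 1x2.
W = [[1.0, 0.0], [0.0, 1.0]]  # frozen pretrained weight
B = [[0.5], [1.0]]            # trainable
A = [[2.0, 3.0]]              # trainable
W_eff = lora_effective_weight(W, A, B)
```

Note the arithmetic: the rank-1 update B @ A touches all four entries of W, yet only three numbers (the entries of A and B) need to be stored and shared — that ratio is what makes LoRA adapters so small at LLM scale.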

How does RAG work?

Retrieval augmented generation uses your data as the knowledge store and the LLM to process the query and response. There are many possible database setups for a RAG application. A critical piece is the usage of semantic similarity, best facilitated by a vector database like Milvus.

The first step to building a RAG app is to vectorize your data and insert it into a vector database. At retrieval time, send the user’s natural language request to an LLM for routing. Then, route the query to the right vector indices and query against those vectors. Before the response goes back to the user, the returned value is sent to the LLM for formatting.

It’s simple to start building a RAG app: all you need is one document to query on. That also translates to just one vector index and no need for routing logic. However, as you start adding more documents or folders, you’ll also need to route your queries through different indices or graphs.
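The single-document case above can be sketched end to end. In a real app you would use a learned embedding model and a vector database such as Milvus; here a toy bag-of-words “embedding” and an in-memory cosine search stand in for both, purely to show the flow: vectorize, index, then retrieve the top-k matches for a query.

```python
# Hedged sketch of the retrieval step in a RAG app. The embed() function
# is a toy stand-in for a real embedding model, and the list-based index
# stands in for a vector database.
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: bag-of-words token counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Step 1: vectorize the documents and "insert" them into the index.
docs = [
    "milvus is a vector database built for similarity search",
    "fine tuning changes the weights of a model",
    "rag injects your data on top of an llm",
]
index = [(doc, embed(doc)) for doc in docs]

# Step 2: at query time, embed the question and retrieve the top-k matches.
def retrieve(query, k=1):
    qv = embed(query)
    ranked = sorted(index, key=lambda item: cosine(qv, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

top = retrieve("which database is built for similarity search?")
```

In a full pipeline, the retrieved text would then be handed to the LLM along with the original question so the model can format the final answer — and once you have multiple indices, a routing step decides which one to call `retrieve` against.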
