4. Fine-tuning

Even more than usual, I need to caveat this article. It’s hard to draw strong conclusions about this stuff. There are just too many unknowns.

OK. With that out of the way, let's draw some strong conclusions!

This developer series is about getting an LLM to engage with data that was not part of its training set. In parts 2 and 3 I spent a little time exploring semantic search and retrieval (i.e. embed your data in a vector space, query that space for the relevant context, and pass that context to an LLM like ChatGPT). While this is the simplest way to work with new data, there are other approaches to assess.

We know that what really made LLMs useful and captured the public imagination was fine-tuning. Fine-tuning requires fewer resources (both data and compute) than training a model from scratch. So, our next stop is obviously to fine-tune a model.

Since OpenAI’s models are not open source, we are not able to fine-tune ChatGPT or GPT-4 ourselves. Not to worry, HuggingFace has a leaderboard of open source Large Language Models. When I started this exercise, top of the charts was the Falcon model. It offered a bunch of nice features. In particular, it was one of a very small number of LLMs that could be used commercially. It came in a few flavours - a small version consisting of ‘only’ 7 billion parameters, a 40 billion parameter base model, and “Instruct”, a fine-tuned version of the base model that could be used for chat and question answering (note, I don’t think you can use the Instruct version commercially because it has been fine-tuned on ChatGPT data and that has some commercial implications). It also came with a bunch of information that other models don’t always come with. While I don’t want to spend too much time in the weeds, I hope you don’t mind a short interlude to present a few mind-bending statistics for the 40 billion parameter version of this model.

  • The model was trained for two months non-stop on 384 A100 GPUs. At about $15,000.00 per GPU, building a new model from scratch is not something for the casual observer!
  • Source data for training consisted of about 1,500 billion tokens (a Fermi-style approximation, sanity-checked in the snippet after this list, shows the average person reads about 233 million tokens in their lifetime: 80 years, 30 minutes of reading per day, 200 words per minute, with 4 tokens being about 3 words)
  • The Instruct version was further fine-tuned on 64 A100 GPUs using about 265 million tokens (I don’t know how long it took but I assume a week or two)
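
Here is the back-of-the-envelope check behind that lifetime-reading figure, using exactly the assumptions above:

```python
# Back-of-the-envelope check of the lifetime-reading figure above.
years, days, minutes_per_day, words_per_minute = 80, 365, 30, 200
tokens_per_word = 4 / 3                        # 4 tokens is roughly 3 words

lifetime_tokens = years * days * minutes_per_day * words_per_minute * tokens_per_word
print(f"{lifetime_tokens:,.0f} tokens")        # ~233,600,000

# Falcon's 1,500 billion training tokens are roughly 6,400 reading lifetimes.
print(f"{1_500e9 / lifetime_tokens:,.0f} lifetimes")
```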

Let’s just take a moment to reflect on those numbers. That is a lot of compute. That is a lot of data. Let’s also note:

1) According to this document (behind a paywall), the model behind GPT-4 consists of 16 experts, each with about 110 billion parameters (so about 1.8 trillion parameters in total). It was trained on 13 trillion tokens. OpenAI used about 25,000 A100 GPUs to train GPT-4, and training lasted between 90 and 100 days at an estimated cost of USD 63 million.

2) In about four weeks, Falcon has gone from topping the HuggingFace leaderboard of open source models to sitting at about number 22 on that leaderboard as of the time of writing.

If that doesn’t make you pause and think, there is probably nothing that will. You should just carry on as you were; don't let me interrupt you and your engrossing episode of the Kardashians!

If you did, however, manage to pause the Kardashians, perhaps you are wondering what the use is of open sourcing something so big that only very large institutions can afford to spin it up in the first place. Well, I’m going to tell you something your parents probably told you when you were at school. Maths! It's important, kids. Pay attention in class (if you can’t stand the maths, you can skip one, and only one, paragraph but, like your parents, I’m very disappointed in you).

At its core, an LLM is just a matrix, let's call it \(M\). Chatting with one is just matrix multiplication. You take an input vector \(I\), you multiply it by \(M\) and the result is an output vector \(O\): Input x Model = Output, \(I \times M = O\). The fine-tuning that was used in ChatGPT and in the Instruct version of Falcon is just one version of fine-tuning. In that version you make changes to the giant matrix \(M\) to make a new giant matrix \(M’\), your fine-tuned model. It works, but it is still very resource intensive because keeping track of billions of changes to floating point numbers requires a lot of memory. What if there was another way? A way that involved updating fewer parameters very efficiently. What if we had something like Parameter Efficient Fine-Tuning? PEFT, what would this even involve? One way to do this (there are many, and your attention span is too short to focus on more than one) is to leave \(M\) unchanged, create a second matrix, which I will call \(\Delta\), and calculate \(I \times (M + \Delta) = O’\). If you remember anything about linear algebra, you may look at this and shake your head. You can only add matrices that have the same dimensions, so if \(M\) has billions of parameters, so must \(\Delta\), right? Right, but I’ve not finished yet. What if I replaced \(\Delta\) with the product of two matrices which each have far fewer parameters? What if I could write \(\Delta = B \cdot A\)? With me? If the inner dimension shared by \(B\) and \(A\) is small, both \(A\) and \(B\) are small, and now we are cooking with gas. Fine-tuning \(M + \Delta\) now involves only keeping track of changes to \(B\) and \(A\). If those are small, our memory requirement is also small. Welcome to Low-Rank Adaptation, or LoRA for short.
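
To make that concrete, here is a minimal sketch of the LoRA idea in plain NumPy. The sizes and rank are made up for illustration, and a real model applies this trick to individual weight matrices inside each layer rather than to one giant \(M\):

```python
# A minimal sketch of the LoRA idea in plain NumPy. The sizes are made up for
# illustration; a real model applies this per weight matrix inside each layer.
import numpy as np

d_in, d_out, r = 1024, 1024, 8            # r is the small inner dimension (the rank)

M = np.random.randn(d_in, d_out)          # the frozen, pretrained weights (never updated)
B = np.zeros((d_in, r))                   # trainable, starts at zero so Delta starts at zero
A = np.random.randn(r, d_out) * 0.01      # trainable

I = np.random.randn(d_in)                 # an input vector

Delta = B @ A                             # Delta = B . A has the same shape as M
O_prime = I @ (M + Delta)                 # I x (M + Delta) = O'

# The pay-off: M holds ~1 million numbers, but B and A together hold only 16,384.
print(M.size, B.size + A.size)
```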

LoRA on its own is good, but if you really want to make fine-tuning available to the masses, you need one more step: quantization. Just as I played fast and loose with the maths, let me play fast and loose with quantization (but because I know you don’t want to disappoint me any more, you can read this for detail). Floating point numbers are stored as bits. Depending on the model, one floating point number can use 16 or 32 bits. What the computer scientists amongst us have discovered is that you don’t need all this precision all the time. Yip, time for more magic. There are certain points in the model update calculation where you can be less precise but still get an answer that works. If it feels to you like we rely too much on magic in this process, you would be correct. I think it’s part of the reason very serious people are quite worried. Things work, but we just don’t really understand why.
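
If you want to see the basic trick without any of the real machinery, here is a toy version: store the weights as 8-bit integers plus a single scale factor, and only convert back to floating point when you need to compute. (QLoRA actually uses a fancier 4-bit scheme called NF4, but the principle is the same.)

```python
# Toy illustration of weight quantization: simple absmax 8-bit, not the 4-bit
# NF4 scheme QLoRA really uses, but the idea (fewer bits per weight) is the same.
import numpy as np

w = np.random.randn(1000).astype(np.float32)       # pretend these are 32-bit model weights

scale = np.abs(w).max() / 127                      # map the range of w onto signed 8-bit integers
w_int8 = np.round(w / scale).astype(np.int8)       # 4x less memory than float32
w_restored = w_int8.astype(np.float32) * scale     # dequantize only when we need to compute

print("max error:", np.abs(w - w_restored).max())  # small, and the model mostly doesn't care
```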

Now, if you are like most quantum physicists and are happy to just use a calculation that works, even if nobody knows why, we can put this all together. Here is a repo with two Google Colab notebooks that allow you to fine-tune and then run inference on the fine-tuned Falcon models. You can run this on the 7 billion parameter model in the free tier. If you want to fine-tune the 40 billion parameter model, you need the premium tier in order to access one A100 GPU. That’s right: with one A100 GPU, QLoRA (quantized LoRA) and about half a day (about $10 worth of compute time), you too can own your own fine-tuned version of a very good large language model. Go maths!
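
To give you a feel for what this looks like in code, here is a rough sketch of how QLoRA is typically wired up with the Hugging Face transformers, peft and bitsandbytes libraries. The hyperparameter values below are illustrative rather than a recipe, and they won't necessarily match the notebooks:

```python
# Rough sketch: load Falcon-7B in 4-bit and attach LoRA adapters to it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # QLoRA: base weights stored in 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in 16-bit for stability
)

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

lora_config = LoraConfig(
    r=16,                                   # the small inner dimension from above
    lora_alpha=32,
    target_modules=["query_key_value"],     # Falcon's fused attention projection
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)  # base weights frozen, only B and A train
model.print_trainable_parameters()          # typically well under 1% of the total

# From here, hand `model` plus a tokenized dataset to a trainer as usual.
```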

I trained my very own model. This is what I took from the process:

  1. There is more to owning a model than meets the eye. It’s one thing going through the motions and getting something to produce a result, but it is another thing to run this in production. I’m normally very cavalier about production. If it works in dev, it’s basically job done. In this case, however, I can already tell that production is going to be hard. And expensive if you want reasonable performance.
  2. Like it or not, users are going to compare your experience to that of OpenAI and it is clear to me that OpenAI has done a lot around the model (much, much more than just building and fine-tuning it) to make the user experience good. They are fast, pretty cheap and are adding capabilities daily. If you and your model don’t want to become the butt of many jokes, you are going to need a team of people constantly working on eking out as much performance as humanly possible. And remember, you are competing against some of the best in the world.
  3. Data quality really (and I really mean that it REALLY) matters. I started this project with some technical documentation. Realising that this was not good enough, I used GPT to generate additional related documentation (questions and answers based on the technical documentation, plus summaries of various parts of it), but the results I was generating were still underwhelming. If you don’t have great documentation or text for training, this may not be for you.

While this was an interesting exercise, my conclusion is that unless you absolutely have to run your own model for some obscure reason, it is probably going to save you a lot of time and frustration to focus on vector embeddings.

