5. What is the answer already?

Like most problems I have worked on, once you get past the jargon and theory there are only a few points that really matter. I started this developer series with the intention of answering this question: if an LLM has not been trained on your proprietary data, what is the best way to get it to answer questions about your data?

Fundamentally, there are two approaches to solving this problem:

  1. Retrieval-augmented generation (RAG): When you pass the model the question, you also pass it the relevant resources necessary to answer that question. I covered this here.
  2. Build or change a model: This involves actually changing the model’s parameters (or augmenting them). I wrote about one method for doing that, Parameter-Efficient Fine-Tuning (PEFT), in this article.

Since writing those articles I have been playing with these and other approaches to question answering, and here are the few important or valuable points you need to understand:

  1. If you are going to change or augment model parameters, you need a very, very good reason for doing it. To date, I have found only two reasons that qualify:
    • You use a very specialized vocabulary (e.g. you want to answer research-style questions about cancer treatment, so your model needs to be able to process academic papers). The starting point for any Large Language Model is, well, the language - or more precisely, the tokens, or word parts. Each model begins with a vocabulary. These tokens are chosen from the text the model is trained on, but not every word in that text occurs frequently enough, or in enough combinations, to become a token. If a trained model later encounters a novel word, it breaks it into small, generic fragments (or effectively ignores it) and does its best with the remaining tokens. If the core of your problem is novel to the model, you are likely to be underwhelmed by its responses (see the tokenizer sketch after this list).
    • You need the output of the model to be formatted in a particular way or to use a certain style. I’m not talking about fonts and spacing; I am talking about tone and style - for example, you may want to respond to customer interactions in the particular way your legal or compliance department prefers. In this case we are changing the model weights to get the response template right, not to introduce new facts - the facts will still come in the RAG step.
  2. In almost all cases, you will need to use RAG. Here are the points you need to understand:
    • Hallucination as a model phenomenon is just another way of saying creativity (but for grumpy people who believe that LLMs should be factually correct at all times, under all conditions). Most models offer parameters you can fiddle with to reduce creativity but, fundamentally, the model’s job is to select the next most likely word from a statistical distribution of likely words. Playing with those parameters only changes the range of the distribution from which the model can select. The model built that distribution from its training data, i.e. text from the internet, and we all know that a random piece of text on the internet is not guaranteed to be factually accurate! Since there is no guarantee that the training data was accurate, narrowing the distribution from which to sample the next word is not the same thing as making the model more accurate, even if it does make the model more predictable. RAG is about asking the model to extract an answer from the supplied reference text and only the supplied reference text (a minimal prompt sketch follows this list). Provided you supply the correct reference text, RAG is far more likely to give you the ‘correct’ response. For question answering, this is what matters.
    • Currently, the main constraint on RAG is the token limit: the total number of tokens (word parts) available for the user question, plus the historical chat context, plus the resource from which to answer the question, plus the answer itself. While we wait for people far smarter than us to “solve the token limit problem”, getting RAG to work is all about identifying the part of your proprietary documentation that contains the answer, and nothing else, and passing that to the model along with the user question. So, for now, the main task in RAG is retrieving the correct paragraph, clause, page or sentence from all your input documents. This is achieved by embedding the source text and then measuring vector similarity. In my earlier post on this I was struggling with the fact that my embedding models often got stuck in the wrong part of my input documentation and retrieved extracts that were not relevant to the question. After some time, I stumbled on the very simple solution that almost always exists (ok, simple solutions do not almost always exist - just ask Andrew Wiles - but any problem you or I are going to solve will, almost by definition, have to have one). The idea is not to embed the source text, as almost every embedding example you are likely to see does, but to summarize the source extracts and embed the summary (while still retrieving the original source text). This way, if there is a word in the summary that is doing too much work in the embedding space, you can simply change that word and the new embedding will sit closer to where it should be in semantic space (a retrieval sketch follows this list).
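
A quick way to check whether the first reason (specialized vocabulary) applies to you is to look at how an off-the-shelf tokenizer splits your key terms. Here is a minimal sketch using the Hugging Face transformers library; the GPT-2 tokenizer and the example terms are just illustrative choices, not a recommendation.

```python
# Inspect how a base tokenizer handles domain-specific terms.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example base tokenizer

for term in ["cat", "pembrolizumab", "immunohistochemistry"]:
    pieces = tokenizer.tokenize(term)
    print(f"{term!r} -> {len(pieces)} token(s): {pieces}")

# A common English word is typically a single token, while specialist terms
# fragment into several generic sub-word pieces - one sign that the base
# vocabulary may struggle with your domain.
```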
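
On the second point, the practical difference RAG makes lies in the instruction, not the sampling parameters: the model is told to answer only from the reference text you supply. Below is a minimal sketch of that kind of prompt, here using the OpenAI Python client; the model name is just an example, and any chat-style API would do.

```python
# Ask the model to answer ONLY from the supplied reference text.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_from_context(question: str, context: str) -> str:
    system = (
        "Answer the user's question using ONLY the reference text provided. "
        "If the answer is not in the reference text, say that you don't know."
    )
    user = f"Reference text:\n{context}\n\nQuestion: {question}"
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # example model name
        temperature=0,         # narrows the sampling distribution; the instruction above does the real work
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return response.choices[0].message.content
```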

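And on the last point, here is a sketch of the summarize-then-embed idea: each source extract is paired with a short summary, only the summaries are embedded, and the original text of the best-matching extract is what gets passed to the model. The sentence-transformers model and the example snippets are placeholders; in practice the summaries would be generated by an LLM or written by hand, and crucially they can be edited if an embedding lands in the wrong place.

```python
# Embed the summaries, retrieve the original text.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

# Each chunk of source text carries a short, editable summary.
chunks = [
    {
        "summary": "Termination notice periods required from each party.",
        "text": "Either party may terminate this agreement by giving ninety (90) days written notice...",
    },
    {
        "summary": "Limitation of liability and excluded damages.",
        "text": "In no event shall either party be liable for indirect or consequential damages...",
    },
]

# Only the summaries are embedded.
summary_embeddings = model.encode([c["summary"] for c in chunks], convert_to_tensor=True)

def retrieve(question: str) -> str:
    """Return the ORIGINAL text of the chunk whose SUMMARY best matches the question."""
    question_embedding = model.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(question_embedding, summary_embeddings)[0]
    return chunks[int(scores.argmax())]["text"]

print(retrieve("How much notice do we have to give to cancel the contract?"))
```

If a question keeps matching the wrong extract, you edit the offending summary rather than the source document.
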
So that is it for this developer series. I hope this saves you some time and effort. Don’t get distracted by wanting to build or fine-tune your own model. Focus on RAG and hope that your particular problem does, at its core, have a simple solution! Good luck out there.
