In the previous articles I discussed how to turn your document into data and why your first project should opt for commercial embedding and language models rather than self-hosting them. In this article, I want to start addressing some of the areas where you should be spending your time if you really want to get a better feeling for what the models do. We will start this adventure with Re-Ranking.
Re-Ranking is independent of most other steps in RAG, which makes it a good place to warm up before the main event.
RAG starts by embedding the user question and comparing this vector against all of the document’s embedding vectors, which have been pre-calculated and saved. Every embedding has to be compared so that the chunks can be ranked from closest to furthest. In the naive implementation of RAG, you would select the best 3 or 5 results from this search and use those to enhance the user prompt.
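To make the naive step concrete, here is a minimal sketch, assuming the question has already been embedded with the same model used for the chunks and that the chunk vectors sit in a NumPy array. The function and variable names are mine, not from any particular library.

```python
import numpy as np

def retrieve_top_k(question_vector: np.ndarray, chunk_texts: list[str],
                   chunk_vectors: np.ndarray, k: int = 5) -> list[str]:
    """Naive RAG retrieval: rank every pre-computed chunk vector by cosine
    similarity to the embedded question and return the k closest chunks."""
    sims = chunk_vectors @ question_vector / (
        np.linalg.norm(chunk_vectors, axis=1)
        * np.linalg.norm(question_vector) + 1e-10
    )
    best = np.argsort(sims)[::-1][:k]   # indices, most similar first
    return [chunk_texts[i] for i in best]
```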
After implementing the basic approach and reviewing the results, you will find many examples where the chunk that contains the answer sits just outside the top 3 or 5 results you selected. What can we do besides increasing the number of chunks selected to, say, 10? Well, when you do this and review the results, you find that including too many irrelevant chunks causes the model to insert misleading or irrelevant (but always well-written) information in its response.
You need to keep the number of chunks down to a minimum but also ensure that this minimum does include the correct chunks.
Enter stage left - Re-Ranking.
The idea is to take the top, say, 15 results from the vector comparison and re-rank them so that, post re-ranking, the new top 3 or 5 are more likely to contain the correct chunk. In other words, you add a step that looks at a short list of candidates and selects the most promising ones.
While the embedding step encodes each piece of text independently, Re-Ranking actively compares the question with each candidate chunk. Good re-ranking models should produce better rank-orderings of chunks because they see the question and the chunk together, something the embedding model never did. The downside is that you need to run raw text through a neural network to get the new rank-order, and these networks can be slow, especially if you have to make many calls. Re-Ranking therefore has to be limited to a ‘small’ number of candidate chunks so it does not unnecessarily delay the system.
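The shape of the pipeline is simple: a deliberately wide vector search feeds a narrow, re-ranked shortlist. A sketch of that shape, with the actual scoring function left as a parameter because it could be a dedicated re-ranking model or an LLM prompt, both discussed below. Here retrieve_top_k is the naive search from the earlier sketch and my_reranker is a placeholder.

```python
from typing import Callable

def rerank(question: str, candidates: list[str],
           score_pair: Callable[[str, str], float], keep: int = 3) -> list[str]:
    """Second-stage ranking: score each (question, chunk) pair with a model
    that sees both texts together, then keep the highest-scoring chunks."""
    return sorted(candidates,
                  key=lambda chunk: score_pair(question, chunk),
                  reverse=True)[:keep]

# Wide search, narrow shortlist:
# candidates = retrieve_top_k(question_vector, chunk_texts, chunk_vectors, k=15)
# context    = rerank(question, candidates, score_pair=my_reranker, keep=3)
```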
So how do you perform re-ranking? Well, it’s just another language model. You can get dedicated re-ranking models (Cohere has a good one you can use via API for free for testing) or you can use the LLM that you will use in the Question-Answering step with a bespoke prompt. I quite like reusing the Question-Answering LLM. You already have a model. If it is good (and remember, the commercial ones are good, that's why we use them), it seems reasonable to use it rather than introduce yet another model.
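If you go the reuse-the-QA-LLM route, one simple pattern is to prompt it for a relevance grade per chunk and sort on that grade. A sketch, assuming a call_llm(prompt) helper that wraps whatever commercial chat API you are already using; the prompt wording is illustrative, not a fixed recipe.

```python
# call_llm(prompt: str) -> str is assumed to wrap your chat-model API.

RERANK_PROMPT = """You are grading search results.

Question: {question}

Passage:
{chunk}

On a scale of 0 to 10, how useful is this passage for answering the
question? Reply with a single integer and nothing else."""

def llm_score(question: str, chunk: str) -> float:
    """Ask the Question-Answering LLM for a 0-10 relevance grade; one call
    per candidate chunk, so keep the candidate list short."""
    reply = call_llm(RERANK_PROMPT.format(question=question, chunk=chunk))
    try:
        return float(reply.strip())
    except ValueError:
        return 0.0  # unparseable reply: treat the chunk as irrelevant
```

The scores are coarse, but for picking 3 chunks out of 15 they only need to separate promising from irrelevant, and the function plugs straight into the rerank sketch above as score_pair.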
Introducing this re-ranking step generally improves your ability to get the correct chunk into the final prompt used for Question-Answering. It's not perfect. Sometimes it down-weights the correct chunk, but, on balance, it improves the rank-order of the retrieved chunks. Here, yet again, it's important to have some tests so you can benchmark the step and see whether a dedicated re-ranking model outperforms reusing your Question-Answering LLM as the re-ranker.
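The tests do not need to be elaborate. Here is one sketch of the kind of benchmark I mean, assuming you have a small, hand-written set of (question, id-of-the-chunk-that-answers-it) pairs; the pipeline names in the comments are placeholders.

```python
from typing import Callable

def recall_at_k(test_cases: list[tuple[str, str]],
                retrieve: Callable[[str], list[str]], k: int = 3) -> float:
    """Fraction of test questions whose known correct chunk id appears in
    the top-k ids returned by the given retrieval pipeline."""
    hits = sum(correct_id in retrieve(question)[:k]
               for question, correct_id in test_cases)
    return hits / len(test_cases)

# Run the same hand-written test set through both pipelines and compare, e.g.
# recall_at_k(cases, retrieve=vector_search_only)
# recall_at_k(cases, retrieve=vector_search_then_rerank)
```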
In my experience, both approaches, whether a dedicated Re-Ranking model or multiple calls to the base LLM, improve the overall performance of a base RAG implementation. The improvement, however, is not uniform: some things get worse. I have not found a good way to compare the two approaches with each other. What I have found, perhaps unsurprisingly given how often I have made the point, is that you can significantly outperform both by making sure that the data you embed are of high quality. What do I mean? Well, spend time with your data. Write summaries using the language your users will use. Write questions the way a user would write questions. If you do this, your search results will be much better than those from a “search and re-rank” process over the original chunks of a document.
My final observation is that if you have done as I suggested in the second article and turned your document into data rather than just chunking it, then it is not clear that an additional Re-Ranking step actually makes things better. Here, Re-Ranking changes things. Sometimes for the better, sometimes for the worse. On balance, it's difficult to conclude that it is a necessary step if you have good data to start with. I expect that there are cases where it will improve things, even if you have good data. Perhaps your use-case is the one that really does benefit from Re-Ranking. So, since you are not wasting time fiddling with self-hosted models, spend some time evaluating Re-Ranking. The heart of the RAG problem is ensuring that you have the correct facts before you begin answering the user’s question. Don’t skip this step.