The first two articles in this series dealt with converting a document into data, focusing on two fundamental algorithmic issues: the transformer architecture, which demands short prompts, and embedding algorithms that are designed for Retrieval but also perform very well at Semantic Textual Similarity.
At this stage, therefore, my assumptions are:
- You have a document that you want to answer questions from (making this even more explicit: you have a document that contains answers, not vague principles).
- You have chunked this into sections that make sense. While I have not yet mentioned this explicitly, you should create references for these chunks, or some other way of pointing the user to the chunk that was used to answer the question. This will be an invaluable tool when you are trying to assess or improve the system.
- If your document is overly formal, you may want to summarize each chunk in the kind of language you are more likely to encounter in user prompts; it makes for more accurate retrieval. You may also want to change the language from the third person (An Authorized Dealer must …) to the first person (You must …) for the same reason.
- For each chunk or summary (which we embed for Retrieval), you have created one or more questions for which the chunk or summary is the answer. This lets us leverage Semantic Textual Similarity in addition to Retrieval (there is a sketch of this record structure just after this list).
- You have a way to update all of these “sensibly”. If the underlying document changes, or as we observe what users ask and how the RAG model performs, we are going to have to change the chunks, the summaries, and the questions to improve the RAG model. This needs to be as painless as possible because, if you are successful, you are going to be spending most of your time performing this step.
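To make the assumptions above concrete, here is a minimal sketch of the kind of record I would keep per chunk. The field names and the `embed` stub are my own inventions for illustration, not part of any particular library; plug in whichever embedding provider you already use.

```python
from dataclasses import dataclass, field

def embed(text: str) -> list[float]:
    """Placeholder: swap in your embedding provider's API call here."""
    raise NotImplementedError("plug in your embedding provider")

@dataclass
class ChunkRecord:
    chunk_id: str             # the reference shown to the user with each answer
    chunk_text: str           # the original section of the document
    summary: str              # informal, first-person rewrite of the chunk
    questions: list[str]      # questions this chunk answers (for STS matching)
    embeddings: dict[str, list[float]] = field(default_factory=dict)

    def refresh_embeddings(self) -> None:
        """Re-embed everything. Call this whenever the chunk, summary,
        or questions change, so updates stay painless."""
        self.embeddings = {
            "summary": embed(self.summary),
            **{f"question_{i}": embed(q) for i, q in enumerate(self.questions)},
        }
```

The `chunk_id` is what lets you point the user back at the section an answer came from, and `refresh_embeddings` is the "update sensibly" hook: change the text, re-run one method, and the model's data is current again.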
With the first draft of the data in place, we can turn our attention to technology. I’m going to keep this short. You really only need to understand one point:
"Commercial vendors are better than open source models."
I subscribe to Medium. Every day I see a variant of this headline: “New XXXX model. Better than ChatGPT.” Here's the thing. If XXXX is the commercial offering from the likes of Google or Anthropic (or similar), you may want to read the article. If XXXX is a model from Hugging Face, the article is click-bait.
“But Steven…” I hear you about to pipe up. Stop. Sit back down and think. Many large companies (Google, Mistral, Facebook, etc.) have commercial and open source versions of their models. Do you honestly, really believe that their open source model is as good as the one they charge for? If so, I have a great investment opportunity for you. Send me a DM.
Now, that’s not to say that open source models are not ‘good enough’ - many of them are very good - it's just that the best commercial models are better than the best open source models. Period. End of discussion. Zip it.
So, when can you use an open source model? To answer this question, let's first run through some common arguments that are misleading.
Cost: Open Source is free. Ok, sure, download the model. But you are going to have to put it somewhere, and if you want inference to be fast, you are going to need good hardware and probably a quantization strategy (do you really know what quantization does to your model's accuracy?). I'm also assuming you want someone else, and probably more than one someone, to be able to run the final product, so it can't end up on your fancy gaming rig that is so cool, you are literally beating off all the hot dates lining up just to get into your pants! No, your model is going into the cloud, and there you will pay for storage and compute. Now, at some point, your killer app will be logging millions of hits a day. At that point, it may save you some money to host your own model, but if you are getting fewer than one hundred thousand hits a day, it costs less to use someone else's API.
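If you want to sanity-check that break-even claim, the arithmetic is a one-liner. All the prices below are placeholder assumptions, not quotes from any vendor; substitute your vendor's actual per-request rate and your cloud's actual GPU pricing.

```python
# Back-of-envelope break-even: API calls vs. a self-hosted GPU instance.
# Every number here is an illustrative assumption -- plug in real prices.
api_cost_per_request = 0.0005   # assumed $ per request (prompt + completion tokens)
gpu_instance_per_hour = 2.00    # assumed $ per hour for a cloud GPU instance
hours_per_day = 24              # the instance bills whether or not anyone calls it

self_host_per_day = gpu_instance_per_hour * hours_per_day       # $48.00/day
break_even_requests = self_host_per_day / api_cost_per_request  # 96,000/day

print(f"Self-hosting: ${self_host_per_day:.2f}/day")
print(f"Break-even at roughly {break_even_requests:,.0f} requests/day")
# Below that volume the API is cheaper -- and this ignores storage,
# ops time, and whatever quantization did to your accuracy.
```

With these made-up numbers, the break-even lands near the hundred-thousand-hits mark; with your real numbers it will land somewhere else, but the shape of the argument holds.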
Data Privacy: If you use your own model, nobody else is harvesting your data. Commercial vendors understand that Data Privacy is an issue. They cater to it. They have retention windows that can almost always be set to zero days, and they do not use the data in your API calls to train their models. Despite what you may think, it is almost always faster to tackle Data Privacy issues directly than to try to avoid them by hosting your own model. You may think that it's just a few clicks of a button to get your model running in your cloud subscription. That may well be the case, but remember, I can build a fully functional RAG system in 20 lines of code, and yet here I am, article three in a series with no end in sight. Doing it, and doing it well, are quite different things.
Fine-Tuning: You can fine-tune your own models, but you can't touch a commercial vendor's model. Have you fine-tuned a model yet? The reason we are doing RAG is that we want to control the facts in the model. Fine-tuning facts into a model is a bad idea because they will still be competing with facts like "the earth is flat" that were probably in the base training data and cannot be removed. Fine-tuning is good for one thing and one thing only: formatting the response. You want your model to sound like the head girl rather than a crazed 4chan user? Fine-tune to your heart's content. Just remember, the commercial guys have already done this, so you are wasting your time.
Having looked at the misleading arguments for self-hosting, let's examine the ones that actually hold up:
Academic curiosity: I like mental masturbation as much as anyone. Good on you for showing an interest and working through it. But be honest, at least with yourself, about what you are doing.
Internal Bureaucracy: I've worked at large organisations before. They are slow and lethargic. Few people ever get fired for saying 'No' to something new, so that is often the default stance. But one word of caution: Make sure you understand what the real concerns are. If it's a data privacy issue, you should work through the bureaucracy - trust me, it's easier than hosting a model well. If, as I have heard in some places, the internal argument is that they cannot be sure that all the training data that went into a commercial model was obtained with the necessary consent, that also applies to your Hugging Face model.
Jargon: If you read my developer series, you will know that words that occur infrequently in training are effectively treated as 'unknown'. If the core of your problem revolves around understanding jargon, you may have to build your own models. If this is you, you should think about starting with a simpler problem. Building models from scratch is for expert teams with budgets and time.
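A quick way to gauge whether jargon will hurt you is to run it through a tokenizer and watch how badly it fragments. The sketch below uses the tiktoken library purely as an illustration, and the example words are arbitrary; any sub-word tokenizer shows the same effect.

```python
# pip install tiktoken
import tiktoken

# Tokenizer used by several recent OpenAI models; chosen only for illustration.
enc = tiktoken.get_encoding("cl100k_base")

for word in ["dealer", "hypofractionated", "denosumab"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{word!r} -> {len(token_ids)} token(s): {pieces}")

# Everyday words map to one or two tokens; domain jargon shatters into
# sub-word fragments the model has rarely seen together, which is why
# jargon-heavy domains can defeat off-the-shelf embeddings.
```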
Creating a good RAG model is difficult. Using the best model you can at each step at least allows you to eliminate some of the wild problems associated with running your own models. These can be difficult to predict, replicate, and control. Take the wins you can and spend your time where you can make a real difference. APIs are cheap and better than your self-hosted model. Use them - at least until you have some experience with RAG.
Being clear about the difference between building a product and “life-long learning” makes the tech decisions quite easy. This is a relief because, when building a RAG system, almost nothing else is.
Finally, despite my lack of experience and interest, I will make one vague, handwavy remark about front-ends. I have used a framework for this - Streamlit. There are others. Streamlit is ok. Like everything, it has its issues, but it was simple enough to get a working example up and running in their cloud so that other people could use it. If you want more tech advice than that, you need to go somewhere else. I can be of no more assistance here.
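For what little it is worth, here is roughly what a Streamlit front-end for a RAG system boils down to. The `answer_question` function is a hypothetical stand-in for your own pipeline; the Streamlit calls themselves are standard.

```python
# pip install streamlit, then run with: streamlit run app.py
import streamlit as st

def answer_question(question: str) -> tuple[str, list[str]]:
    """Hypothetical placeholder for your RAG pipeline: returns an answer
    plus the chunk references it was built from."""
    return "Not wired up yet.", ["chunk-000"]

st.title("Document Q&A")

question = st.text_input("Ask a question about the document:")
if question:
    answer, references = answer_question(question)
    st.write(answer)
    st.caption("Sources: " + ", ".join(references))  # the chunk references from earlier
```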
At this stage, you should have enough information to put together something that will outperform automated solutions. There is, however, one more problem I want to address, which I will do in the next article. When I score all my embeddings against the user question, using something like cosine distance, every chunk, summary, and question gets a score. How do I select the 'best' chunks from all of these? Spoiler - there are better solutions than just taking the 'n' lowest cosine distances.
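To make the selection problem concrete, here is a minimal sketch of that scoring step, assuming NumPy and a dictionary of pre-computed embeddings like the record structure sketched earlier. The naive lowest-n selection at the end is exactly the baseline the next article improves on.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance: 0.0 means same direction, larger means less similar."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def naive_top_n(question_vec: np.ndarray,
                embeddings: dict[str, np.ndarray],
                n: int = 3) -> list[str]:
    """Score every chunk, summary, and question embedding against the
    user's question and keep the n closest -- the naive baseline."""
    distances = {key: cosine_distance(question_vec, vec)
                 for key, vec in embeddings.items()}
    return sorted(distances, key=distances.get)[:n]
```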
Thanks for hanging in with me this far. See you next time for more art and less science.