My previous developer series concluded with the opinion that if you want to do something for more than just yourself in generative AI, you should start with Retrieval Augmented Generation (RAG). The rationale for this can be summarised as:
- RAG is manageable. You don’t need specialist teams, specialist hardware or tons of compute credit. You can build something quite good on your own.
- RAG reduces model hallucination. During training, Large Language Models learn the structure and flow of a language, but they also learn facts. They learn ‘facts’ by reading many articles from the internet. Some of these articles contain ‘facts’ like “the earth is flat”. If you are in a position to ask a model to answer a question from a curated set of facts, rather than relying on its internal knowledge base, you are more likely to get a sensible answer.
To get started, let’s look at how easy it is to create a RAG project. If we use a framework (LangChain is a good one), we can get a full RAG example running in about 20 lines of code. If you want to get into the detail or see the imports, you can check out the explanation here. But, just to hammer home the point about how simple it really is, here is the complete code to answer questions from the facts contained on a specific web page:
# Load, chunk and index the contents of the blog.
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

# Retrieve and generate using the relevant snippets of the blog.
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
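(If you don’t want to click through for the imports, the snippet above relies on roughly the following. Treat this as a sketch based on the standard LangChain tutorial; the exact module paths move around between LangChain releases.)

# Imports assumed by the snippet above; module paths vary across LangChain versions.
import bs4
from langchain import hub
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter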
With that Python code and an OpenAI API key, you have a RAG system that augments GPT-3.5 with the information on the page "https://lilianweng.github.io/posts/2023-06-23-agent/".
To use this, you execute: rag_chain.invoke("What is Task Decomposition?")
and get a response built from the facts on the web page, not from the model's training data, composed in a readable format and tailored to your question.
You can swap the web page for other document types, like PDF or Word (a sketch follows below). I don’t know about you, but in my humble opinion, solutions to interesting problems don’t get more manageable than that!
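For example, loading a local PDF is mostly a matter of changing the loader; everything downstream of docs stays the same. A minimal sketch, assuming the PyPDFLoader from langchain_community and a made-up file name:

from langchain_community.document_loaders import PyPDFLoader

# Hypothetical file name; substitute your own document.
loader = PyPDFLoader("my_document.pdf")
docs = loader.load()  # one Document per page, ready for the same splitter and vector store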
But if that were the entirety of the challenge, I wouldn’t be here typing away; I’d be surfing. The fact that I am still banging away at my keyboard and that I still look like a bulldog in a ballet class on my surfboard means that there is more to it than 20 lines of code.
As soon as you get this example up and running and ask, “Now what?”, you encounter the two main challenges with RAG: Quantity and Quality.
Quantity here refers to the number of documents, or the number of words or facts in those documents. If you believe my bullet point about RAG reducing hallucination, then surely it would be better to augment the base LLM with as many facts as possible? I hope it is a statement of the blindingly obvious that the larger your repository of facts, the more difficult it is to manage.
Starting out, chasing Quantity is akin to wanting to agree on the names of your quadruplets on your first date. Not only are you setting yourself up for failure, you are going to miss out on loads of fun because, while you may be smart, you are actually quite dumb. Don’t plan the victory lap until you have made the team.
But for those who believe I underestimate your towering intellect and think Quantity is precisely the type of challenge worthy of your immense talent, don't just focus on web pages or PDFs. Consider Wikidata instead.
The rest of us can focus on the Quality issue. The beauty of focusing on Quality is that it transforms your Tech project into a Data project. This transformation is significant because, with the current Transformer-based approach to LLMs, it's unlikely that any tech giant or innovator will release something new that could render your efforts obsolete.
The trick to getting a good RAG system is ensuring you can retrieve the correct facts from the database. You don’t cook a gourmet meal with whatever you pull out of your fridge: if you don’t start with the right ingredients, your ambitious ‘Michelin-star’ dish will flop harder than a soufflé in a thunderstorm.
The sample code from LangChain works by taking a source document (the web page), which it splits into sections or chunks. This is necessary because RAG works by augmenting the user’s question ("What is Task Decomposition?" in the example) with some text from the web page. It does all of this inside the prompt, and there is a limit on the amount of text you can include in a prompt.
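To make that concrete, the assembled prompt ends up looking something like the sketch below. This is a paraphrase of a typical RAG prompt, not the exact wording of the rlm/rag-prompt pulled from the hub:

# Illustrative only: roughly what gets sent to the LLM after retrieval.
# {context} is filled with the retrieved chunks, {question} with the user's question,
# and the whole thing has to fit within the model's prompt limit.
RAG_PROMPT_SKETCH = """\
You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.

Question: {question}

Context: {context}

Answer:"""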
A brief interlude to discuss prompt length. Prompt length seems to me to be a fundamental constraint of the Transformer architecture. In Big O notation, running a prompt through a transformer model is an O(n^2) problem, where n is the prompt length, which means the work grows quadratically as the prompt gets longer: double the prompt and you quadruple the computation. There are large language models that claim to process very long prompts. They work, in a hand-wavy sense that is good enough for now, by breaking the prompt into subsections and processing the smaller subsections. In other words, these models implement an ‘automatic chunking’ strategy, and my hypothesis is that any automated chunking strategy will underperform a well-designed manual chunking strategy. Ergo, if you want good retrieval, don’t get sidetracked by the promise of very long prompts. At least not while we are using transformers, or anything else that is O(n^2).
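A back-of-the-envelope illustration of that O(n^2) growth; the token counts are arbitrary, chosen only to show the shape of the curve:

# Self-attention compares every token with every other token, so the work
# grows with the square of the prompt length.
for n_tokens in (1000, 2000, 4000, 8000):
    print(n_tokens, "tokens ->", n_tokens ** 2, "token-pair comparisons")

# 1000 tokens -> 1000000 token-pair comparisons
# 2000 tokens -> 4000000 token-pair comparisons
# 4000 tokens -> 16000000 token-pair comparisons
# 8000 tokens -> 64000000 token-pair comparisons
# Doubling the prompt quadruples the work: quadratic, not exponential, but still punishing.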
The sections or chunks in the example are determined automatically by the line text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200). Here, a chunk is up to 1,000 characters long (the splitter counts characters by default, not tokens), and each new chunk repeats the last 200 characters of the previous one, so consecutive chunks overlap; there is a small sketch below if you want to see this in action. The numbers 1000 and 200 are parameters you can adjust to affect the effectiveness of document retrieval. However, this is where I draw the line. Such a strategy would be necessary if we were processing a large number of documents, but that’s not our goal. We’re keeping the number of documents small to outperform generic implementations. To achieve this, we will invest time in understanding the document. So, pull up a chair and get comfortable. This is the point where the Tech project becomes a Data project, and Data projects, as I keep emphasizing, are 60% to 80% about the data. We’ll delve into the fancy algorithms, but only much later.
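Before we draw that line, a quick look at what those two parameters actually do: a sketch with deliberately small numbers so the overlap is easy to see, assuming the langchain_text_splitters package.

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Toy settings so the behaviour is visible; the example above uses 1000 / 200.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
chunks = text_splitter.split_text("some long document text " * 50)

print(len(chunks), "chunks of up to 100 characters each")
print(repr(chunks[0][-20:]))  # the tail of the first chunk...
print(repr(chunks[1][:20]))   # ...is repeated (approximately) at the head of the second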