This series is about supplementing a large language model with new or additional facts from a pdf document for Question-Answering (I use ‘pdf’ as shorthand for source documents that contain formatting and other non-text items, e.g. charts or tables, which are integral to understanding the document). To manage expectations, it is worth noting that ‘Question-Answering’ assumes the user should get an answer without having to invoke complex prompt engineering. Ideally, we would like a model we can simply engage with, either because the new facts have been trained in, or because, in the backend, we do some Retrieval-Augmented Generation (RAG) using the techniques described in the previous article, so that you as a developer create a prompt that looks something like this:
### System Instruction
Answer the user question from the provided documentation only. If the provided documentation does not contain the answer, your response should be “I don’t know”.
### User Question
----- User question goes here -----
### Documentation
----- The appropriate extract(s) from the source document go here -----
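As a rough sketch of what that backend call might look like in code (assuming the OpenAI Python client; `retrieve_extracts` is a hypothetical placeholder for whatever retrieval step you use):

```python
# Sketch of the backend prompt assembly, assuming the OpenAI Python client.
# retrieve_extracts() is a hypothetical placeholder for your RAG retrieval step.
from openai import OpenAI

client = OpenAI()

SYSTEM_INSTRUCTION = (
    "Answer the user question from the provided documentation only. "
    "If the provided documentation does not contain the answer, "
    'your response should be "I don\'t know".'
)

def retrieve_extracts(question: str) -> str:
    """Hypothetical: return the relevant extract(s) from the source document."""
    ...

def answer(question: str) -> str:
    documentation = retrieve_extracts(question)
    user_content = (
        f"### User Question\n{question}\n\n"
        f"### Documentation\n{documentation}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat-completion model will do for the sketch
        messages=[
            {"role": "system", "content": SYSTEM_INSTRUCTION},
            {"role": "user", "content": user_content},
        ],
    )
    return response.choices[0].message.content
```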
In order to understand what is theoretically achievable, let's consider some “simple” questions. By “simple”, I mean something like “Given Glencore’s latest results, what is their most profitable commodity?” or “Using only the version of the Banks Act that I provide, can you tell me what approaches there are to calculate credit risk exposure?”.
Simple questions like this allow us to think through the fundamentals of the model. In particular, we need to understand two key points:
- Large Language Models are based on an architecture that only has one pass at a problem. In geek speak, it is a Feed Forward Network. Put slightly differently, everything the model needs must be available at the start of a run, either in the model itself or in the (relatively short) input prompt, which includes the chat history for context (see the sketch after this list).
- Language tools (like LLMs or the embedding models described here) are trained on data covering a broad range of topics. They are generalists. If you have a lot of nuanced data for one specialist subject, the open source tools may not have the depth or vocabulary necessary to distinguish between closely related concepts, and you are going to have to do additional data work to help them.
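To make the first point concrete, here is a minimal sketch of the budget check that a single pass implies: the instruction, the question, the chat history and the retrieved extracts all have to fit into the prompt before the run starts. It assumes the tiktoken tokenizer, and the 8,000-token budget is purely illustrative.

```python
# Minimal sketch: the model only gets one pass, so everything it needs
# must fit into the prompt up front. Assumes the tiktoken tokenizer;
# the 8,000-token budget is illustrative, not tied to any specific model.
import tiktoken

CONTEXT_BUDGET = 8_000  # illustrative context window, in tokens

def fits_in_context(system: str, question: str, extracts: list[str]) -> bool:
    enc = tiktoken.get_encoding("cl100k_base")
    total = sum(len(enc.encode(text)) for text in [system, question, *extracts])
    return total <= CONTEXT_BUDGET
```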
Let's start with the first point. When looking for a problem to solve using an LLM, it helps to use the heuristic that these models work by selecting the next most likely word from a distribution of candidate words that was built during training. This means that if we want the model to answer a question like “What is Glencore’s most profitable commodity?”, somewhere in the input prompt or in the model training data, there must be a sentence similar to “Glencore’s most profitable commodity is …” or there needs to be a list of profitability per commodity. If Glencore’s latest financial statements do not explicitly contain that text or table, multiple runs of the LLM may be required to curate the data that can then be made available to answer the question.
Perhaps, then, when you want to engage in question answering, what you really want is not the Glencore financial document(s) themselves. Perhaps what you really want to start with is a financial analysis of the results (or to start by getting an LLM to produce a summary of the results). Analysis documents, rather than the source documents themselves, are more likely to contain the text necessary to answer most questions.
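One way to make this concrete is a two-pass pipeline: a first LLM run curates the source into explicit statements (the analysis or summary), and the actual question answering then runs against that derived text rather than the raw report. A rough sketch, where `llm()` is a placeholder for any chat-completion call (like the one sketched earlier) and the prompts are purely illustrative:

```python
# Rough two-pass sketch: curate the source first, then answer from the curated text.
# llm() is a placeholder for any chat-completion call; the prompts are illustrative.

def llm(prompt: str) -> str:
    """Placeholder for a call to your chat-completion model of choice."""
    ...

def answer_from_financials(question: str, financial_report: str) -> str:
    # Pass 1: curate the report into explicit, answerable statements,
    # e.g. profitability per commodity, one line per commodity.
    analysis = llm(
        "Summarise the profitability of each commodity in the report below, "
        "one line per commodity.\n\n" + financial_report
    )
    # Pass 2: answer the user question from the curated analysis only.
    return llm(
        "Answer the question from the analysis below only. If it does not "
        'contain the answer, say "I don\'t know".\n\n'
        f"Question: {question}\n\nAnalysis:\n{analysis}"
    )
```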
In much the same way, a model like ChatGPT is able to answer questions about the Basel regulatory framework not because it has read and understood the Basel documents themselves, but because it is likely responding based on Wikipedia-type documents that contain a lot of information about the regulations.
This brings me to point 2, differentiating between closely related topics. As this post is all about heuristics, it helps to focus on the embedding models from the previous post. The key concept here is that once we move from the language that makes up the prompt into vector space, where neural networks live, we stay in the neighbourhood of ‘nearby’ vectors rather than looking to vectors that are ‘further’ away. Embedding models are built to construct vectors that are close together if the input text is semantically similar, but we have very little theory or intuition about how they work.
While building the embedding model for the Banks Act, I noticed the following: if I asked the question “What methods are there to measure credit risk exposure?”, a search of the embedding space returned the correct section of the Act (i.e. section 3, Alternative methodologies to measure exposure to credit risk). However, when I asked what I believed was the same question, “What methods are there to calculate credit risk exposure?” (i.e. changing the word ‘measure’ to ‘calculate’), the model got lost in an incorrect section of the regulations (it got stuck in section 5, Calculation of credit risk exposure: standardised approach).
It turns out that the word ‘calculate’ was doing way too much work in the embedded vector space. The incorrect sections of the Act which tripped up the model (at least in my view) contained multiple uses of the word ‘calculate’, and because I used ‘calculate’ in the question, I landed up in the wrong part of the vector space and could not get out of it to the correct part.
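This kind of diagnostic is easy to sketch in code. The snippet below (assuming the sentence-transformers library; the model name is illustrative and may differ from the embedding model behind the results above) embeds both phrasings of the question and the two candidate section headings, then checks which heading each phrasing lands nearest to:

```python
# Sketch of the diagnostic: compare two phrasings of the same question against
# candidate section headings. Assumes sentence-transformers; the model name is
# illustrative and may differ from the embedding model behind the results above.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

headings = [
    "Alternative methodologies to measure exposure to credit risk",  # section 3
    "Calculation of credit risk exposure: standardised approach",    # section 5
]
questions = [
    "What methods are there to measure credit risk exposure?",
    "What methods are there to calculate credit risk exposure?",
]

question_embeddings = model.encode(questions, convert_to_tensor=True)
heading_embeddings = model.encode(headings, convert_to_tensor=True)
scores = util.cos_sim(question_embeddings, heading_embeddings)  # questions x headings

for question, row in zip(questions, scores):
    nearest = headings[int(row.argmax())]
    print(f"{question!r} -> {nearest!r}")
```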
This led me to my first insight: if you are particularly worried about getting your facts correct, it may not be a good idea to embed the source text directly. Of course you want to retrieve the source text, but the text that we turn into a vector, the text that does the work of making sure the source data sits in the correct neighbourhood of vector space, does not have to be the source text itself. In almost every example I have seen, people embed the source data directly. I suspect this is the only practical approach if you are processing a lot of data, but if you need to prioritise accuracy over volume, you should not do it. Instead, what you really want is a summary or headline for your text, and that is what you embed. Later, if you hit problems that put you in the wrong part of the vector space, you can do some analysis like I did to see if there are individual words or phrases in that summary that can be changed.
In my example, I wanted my embedding model to return section 3 of the Act in answer to both versions of the question I posed (i.e. the one using the word ‘measure’ and the one using ‘calculate’).
Rather than embedding section 3 itself, I embedded just the summary “Alternative methodologies to measure exposure to credit risk”. When I realised the “measure / calculate” issue, I just changed the summary to “Alternative methodologies to measure or calculate exposure to credit risk” (leaving the original text from the act unchanged) and everything worked.
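In code, the idea is simply to embed the headline while keeping the full section text as the payload you return. A minimal sketch (again assuming sentence-transformers, with a plain in-memory list standing in for a real vector store):

```python
# Sketch: embed a short summary/headline per section, but return the unchanged
# source text on retrieval. Assumes sentence-transformers; the in-memory list
# stands in for whatever vector store you actually use.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sections = [
    {
        "summary": "Alternative methodologies to measure or calculate "
                   "exposure to credit risk",
        "source": "...the full, unchanged text of section 3 of the Act...",
    },
    # ... one entry per section of the Act ...
]

summary_embeddings = model.encode(
    [section["summary"] for section in sections], convert_to_tensor=True
)

def retrieve(question: str) -> str:
    question_embedding = model.encode(question, convert_to_tensor=True)
    best = int(util.cos_sim(question_embedding, summary_embeddings).argmax())
    # The summary only does the routing; the prompt gets the source text itself.
    return sections[best]["source"]
```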
If you have followed me, there should be a little feeling of dread in the pit of your stomach about now. If not, let me help you out. Finding this “measure / calculate” issue is subtle. There is no way you can possibly do this in advance for every user and every section of your source data. Also, should you change the summary for section 3 or for section 5? People are going to ask questions in so many different ways; there is no way to capture all of that. You may also find problems that seem to have conflicting solutions.
I do have a lot to say about this, but time is short, so here is my flippant response: you may be a developer, but this is a data problem. 60%-80% of your time in a data problem goes into fixing the data. Get it now? Having ‘good’ data matters. Your data is crappy. Be thankful that this is an axiom, because it guarantees you a job for life.