Step 1 - Turning a document into text

Have you ever converted a document into pure text, removing all the formatting? It's an interesting exercise. Try it with a few different documents, and you soon realise that some formatting does more than just make a document look good; it genuinely helps convey meaning. I have spent quite a bit of time with the South African Banking Act. Like most legal documents, it has a multi-level index that facilitates navigation and signals the relationships between different parts of the document. The Banking Act uses a hierarchical index like 2a(iv)(C)(aa)(ii)(a). Now, imagine you are on some random page in the middle of the document and you see a paragraph that starts with the index “(i)”. How do you know whether this is the Roman numeral one or the single letter, lowercase ‘i’? If it is a Roman numeral, to which of the two levels in the hierarchy does it refer? If you know you are in section 2a(iv)(C)(hh) and have just read point (i), is the next paragraph, labelled only as (ii), point 2a(iv)(C)(ii) or 2a(iv)(C)(hh)(ii)? Formatting matters. When reading the document, the number of tabs or spaces can mean something. The first step on the road to improving retrieval is to determine which formatting conveys information, so that this can be passed to the LLM in plain text.

Any formatting that can be written in Markdown (tables, formulas, code, etc.) is easy to preserve. Graphs and charts remain challenging. I guess in time, multimodal LLMs will be able to make statements about a chart or graph that they receive as an image. But, for now, if there is important information contained in the document that you cannot preserve in text, you should plan to direct the user to the appropriate place in the original document rather than plan to solve it in the LLM.

Converting a document to text requires manual intervention. You could be lucky and have an easy document to work with, but luck here may be a double-edged sword. The easier it is for you to convert, the easier it is for someone else to convert, and this entire series is about finding interesting tasks that will not be made redundant with the next set of LLM upgrades.

We will get to it a little later, but it is worth acknowledging here: many of the documents you will work with will evolve over time (internal policy and procedure documents, user manuals, regulations). Once you have completed the step of turning the document into text, you should also have a way to update it that involves less work than the original conversion.
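One way to keep that update cheap is to track which sections changed between versions and only re-process those. A minimal sketch, assuming you store the converted text section by section (the function names and structure are mine, not a standard):

    import hashlib

    def section_hashes(sections):
        """sections: {section_id: text}. Returns a fingerprint per section."""
        return {sid: hashlib.sha256(text.encode("utf-8")).hexdigest()
                for sid, text in sections.items()}

    def changed_sections(old_hashes, new_sections):
        """Return the ids of sections whose text differs from the stored version."""
        new_hashes = section_hashes(new_sections)
        return [sid for sid, digest in new_hashes.items()
                if old_hashes.get(sid) != digest]

    # Only the sections returned here need to be re-chunked and re-indexed.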

A Note on PDFs. The PDF format is ubiquitous for documents. PDF is an "output" format intended to be read by a human. PDF documents were never intended to be "inputs" into automated tasks. There are many different approaches to turning PDFs into text, but all of them have shortcomings or require real effort and in-depth knowledge of the extraction library you choose. If your source document is a PDF, plan to spend quite a bit of time turning it into text.
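As a starting point, a bare-bones extraction pass might look like the sketch below (assuming the pypdf library and a hypothetical file name). Expect to spend most of your time cleaning up what it produces; headers, footers and multi-column layouts rarely survive intact.

    from pypdf import PdfReader

    reader = PdfReader("banking_act.pdf")  # hypothetical file name
    # Extract page by page; extract_text() can return None for image-only pages.
    pages = [page.extract_text() or "" for page in reader.pages]
    raw_text = "\n".join(pages)

    print(raw_text[:500])  # inspect the output before trusting it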

Step 2 - Chunks

Once we have the document as text, Step 2 is chunking. If you were presented with a random chunk of text from a document, with no other context, do you think you would be able to answer questions about it? The simplest RAG implementation (the one in the example in the previous article) creates chunks of 1000 words (or tokens if you prefer; now is not the time to nitpick) with a 200-word overlap. What happens if a chunk starts midway through a code block, or ends in the middle of a table, or part of the way through a sentence?
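For reference, that naive splitter is only a few lines of code. A sketch: it counts words rather than tokens and will happily cut a sentence, table or code block in half.

    def naive_chunks(text, size=1000, overlap=200):
        """Fixed-size chunks with an overlap, ignoring the document's structure."""
        words = text.split()
        step = size - overlap
        return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

    chunks = naive_chunks(raw_text)  # 'raw_text' from the conversion step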

Since we are processing language, would it not make more sense to create logically coherent chunks? Have a look at how you have identified paragraphs and how you have preserved formatting. I like to introduce my own identifiers to group items of interest (like a table with its caption, or a formula and its explanation), which I use when chunking to ensure I am not inserting gibberish into my prompt at a later step. The key here is to keep the chunks small enough that several of them can be inserted into the prompt. Try a few different approaches and see what makes sense for you and your document.
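A structure-aware splitter, by contrast, leans on the markers you introduced during conversion. A sketch, assuming a made-up "<<SECTION>>" marker placed at logical boundaries (your own identifiers and grouping rules will differ):

    def logical_chunks(text, max_words=400):
        """Split on section markers; fall back to paragraphs for oversized sections."""
        chunks = []
        for section in text.split("<<SECTION>>"):
            section = section.strip()
            if not section:
                continue
            if len(section.split()) <= max_words:
                chunks.append(section)
            else:
                chunks.extend(p.strip() for p in section.split("\n\n") if p.strip())
        return chunks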

Step 3 - Index

The final step is the most important—and the most time-consuming. In the simple examples, each chunk (section or paragraph) performs two different functions: it is both the 'data' and the 'index'. If you want good retrieval, it is vital that you understand these functions, so I am going to spend some time on it.

The easiest function to understand is that of the data. Behind the scenes in RAG, we create a prompt that combines the user question with relevant sections from the document. It may look something like this:

### System Instruction: Answer the User Question using only information contained in the Reference Material. If the Reference Material does not contain information that can be used to answer the question, respond with "I do not have the information to answer that question."

### User Question: What is Task Decomposition?

### Reference Material

Chunk X

Chunk A

Chunk Q

Here, Chunk X, Chunk A, and Chunk Q are specific paragraphs from the source document. In this function, each chunk acts as data to create the prompt. In this case, we want the data to be an exact extract from the document we are using as the reference.
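In code, assembling that prompt is straightforward. A sketch (the template mirrors the example above, and the function name is mine):

    def build_prompt(question, retrieved_chunks):
        """Combine the user question with the retrieved chunks into one prompt."""
        reference = "\n\n".join(retrieved_chunks)
        return (
            "### System Instruction: Answer the User Question using only "
            "information contained in the Reference Material. If the Reference "
            "Material does not contain information that can be used to answer "
            'the question, respond with "I do not have the information to '
            'answer that question."\n\n'
            f"### User Question: {question}\n\n"
            f"### Reference Material\n\n{reference}"
        )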

If we take a step back and ask ourselves why we chose Chunk X over Chunk Y, A over G, or Q over N, we are talking about retrieval, and retrieval is all about indexing. In a previous article where I discussed embedding, I used this example: What is the closest match for the question "What is a rai stone?" If we are measuring Semantic Textual Similarity, the closest match to this question is the question "What is a rai stone?" itself. However, if we are doing retrieval, you would prefer a paragraph like the following to have a closer match to the question (closer than the match of the question to itself):

"The Yap islands group is part of Micronesia and has a very peculiar currency: stone. Stone money known as 'Rai' are large stone disks, sometimes measuring up to 4 metres, with a hole in the middle that was used for carrying them. Rai was and is still used as a trading currency there."

I want to postpone discussing embedding models for a later article. It's too important and too much of a distraction to delve into here. What I do want to highlight is that not all embedding models operate in the same manner. Some models, like those from OpenAI, offer a unified method to embed all text, meaning the same model is used for embedding both questions and answers. If there is only one model for embedding, the closest match to any question will always be that question itself. Other models adopt a different strategy for embedding questions versus answers, aiming to prioritise retrieval over Semantic Textual Similarity (i.e., ensuring an actual answer is the closest match to a question). Either way - and this is the key point, kids, so pay close attention - I am here to tell you that Semantic Textual Similarity is an effective method for building a retrieval algorithm. Instead of merely embedding a chunk of text and hoping for accurate retrieval from a model, we create a question (or perhaps multiple questions) for which the chunk is the answer. We then perform a semantic search over the questions. In most cases, semantic search over the questions outperforms a search over the chunks themselves.
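In code, the idea is small. A sketch, assuming the sentence-transformers package (any embedding API would work the same way) and function names of my own invention:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def build_question_index(index_questions, chunks):
        """index_questions: list of (question, chunk_id); chunks: {chunk_id: text}."""
        questions = [q for q, _ in index_questions]
        vectors = model.encode(questions, normalize_embeddings=True)
        chunk_ids = [cid for _, cid in index_questions]
        return vectors, chunk_ids, chunks

    def retrieve(user_question, index, top_k=3):
        """Search over the index questions, return the chunks they point to."""
        vectors, chunk_ids, chunks = index
        query = model.encode([user_question], normalize_embeddings=True)[0]
        scores = vectors @ query  # cosine similarity on normalised vectors
        best = np.argsort(scores)[::-1][:top_k]
        # Several questions may point to the same chunk; deduplicate in practice.
        return [chunks[chunk_ids[i]] for i in best]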

At this stage, the jargon may be starting to get to you. It's ok. Breathe. Let's look at an example. Let's assume the document we want to search is the Currency And Exchange Control Manual for Authorized Dealers. All you really need to know about this document is that it is 300 riveting pages of very prescriptive legalese (wouldn’t it be nice if you didn’t need to worry about this type of document ever again?).

Here is an extract.

B.14 Miscellaneous transfers

    (V) Employment contracts involving non-residents

        (i) Where South African entities are required to remit funds abroad in respect of employment contracts involving non-residents who are employed in South Africa, Authorised Dealers may allow such transfers provided that the payments are commensurate with the work undertaken. In this regard the provisions of section B.5(A)(i) of the Authorised Dealer Manual should be adhered to by the individual contract workers.

A question answered by section B.14(V)(i) might read: "Can South African entities remit funds abroad for employment contracts with non-residents employed in South Africa?" However, it's unlikely a user would phrase their query in this manner. They are more likely to ask, "Can I pay non-resident contractors?" Our goal is to create an "index question": a version that closely mirrors what a user might ask, yet is detailed enough to capture the nuances of the chunk in question. The question "Can I pay non-resident contractors?" could yield very different answers depending on who "I" refers to and where the contractors are located. To respond to it with enough nuance to flag that ambiguity, the model must consider several different clauses in the Manual, so your index question should not overshadow other relevant responses. An effective index question combines the most literal query with the more ambiguous user phrasing, perhaps as "Can a South African company pay non-resident contractors for work performed locally?" For thoroughness, we could also explore alternative phrasings to cover a wider range of searches, such as substituting "South African company" with "resident entity" or "non-residents" with "foreigners." Multiple index questions may need to be created.

Note that nobody needs to ever see your index questions except you as you fine-tune them. They do not need to be perfect - or even accurate. They need to be effective at identifying the correct chunk, but not so effective that they drown out everything else.

Now, instead of only embedding the text from B.14(V)(i), we embed each "index question". When a user asks, "Can I pay non-resident contractors?", our search should return the semantically similar "Can a South African company pay non-resident contractors for work performed locally?" (along with other questions which, for example, deal with non-resident companies using CFC accounts to pay offshore contractors). That match points us to section B.14(V)(i), which we then pass to the prompt.
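Continuing the sketch from earlier (reusing build_question_index, retrieve and build_prompt; the index entries below are invented for illustration):

    chunks = {
        "B.14(V)(i)": "Where South African entities are required to remit funds "
                      "abroad in respect of employment contracts involving "
                      "non-residents who are employed in South Africa, ...",
    }
    index_questions = [
        ("Can a South African company pay non-resident contractors for work "
         "performed locally?", "B.14(V)(i)"),
        ("Can a South African business pay foreigners working in South Africa?",
         "B.14(V)(i)"),
    ]

    index = build_question_index(index_questions, chunks)
    retrieved = retrieve("Can I pay non-resident contractors?", index, top_k=2)
    prompt = build_prompt("Can I pay non-resident contractors?", retrieved)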

Creating a good index significantly improves the chances that your search will retrieve the appropriate chunk. Of course, you can also embed the raw chunk. I do this. I also embed summaries of the chunks, especially if the original document uses jargon or legalese. I just make sure the summary uses plain English. You can provide many indexes to the same chunk; just be sure that you don’t crowd out other sections, as they may also be relevant.
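In practice the index is just a list of (text to embed, chunk it points to) pairs, so adding the raw chunk or a summary alongside the questions is trivial. A sketch, with the plain-English summary invented for illustration:

    index_entries = [
        # raw chunk text
        ("Where South African entities are required to remit funds abroad ...",
         "B.14(V)(i)"),
        # plain-English summary
        ("A South African business may pay a foreign worker based in South Africa, "
         "as long as the payment matches the work done.",
         "B.14(V)(i)"),
        # index question
        ("Can a South African company pay non-resident contractors for work "
         "performed locally?", "B.14(V)(i)"),
    ]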

Just in case you are wondering, I create the index using an LLM and a lot of manual intervention. This is a data project. Most of my time is spent on this step and on the subsequent fine-tuning as we observe how users interact with the model.
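For what it's worth, the drafting step can look something like the sketch below (assuming the openai Python package, v1 or later; the model name is an assumption, and every draft still gets reviewed and edited by hand before it goes anywhere near the index):

    from openai import OpenAI

    client = OpenAI()

    def draft_index_questions(chunk_text, n=3, model="gpt-4o-mini"):
        """Ask an LLM for candidate index questions; a human edits them afterwards."""
        prompt = (
            f"Here is an extract from a regulatory document:\n\n{chunk_text}\n\n"
            f"Write {n} questions, phrased the way an ordinary user would ask them, "
            "that this extract answers. One question per line."
        )
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return [line.strip() for line in
                response.choices[0].message.content.splitlines() if line.strip()]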

If you have got this far, well done. Let's summarise the journey from an AI problem to a Data Problem.

Transformer models have finite prompt lengths. Not only are they finite, but they are also relatively short, even if the models are advertised as having long context windows. If we need to work with a document that is longer than the context window, taking the time to chunk the document sensibly and then building an index for each chunk will outperform automated approaches to parsing the long document. All you need to do now is spend time with the document, and this is where it can really help to bring in a subject matter expert.

In the next section, I will finally talk about some technology.  

2. Turning a document into data