tl;dr

  • Learn how to script in Python. LLMs in their current state will not displace people. People who can use LLMs, however, will displace people who do not.
  • This article shows how to use ChatGPT to quickly parse a long document and extract nuanced information.
  • This toy example clearly illustrates the type of information that can quickly be extracted from a long body of text. My conclusion is that as long as your task does not require absolute perfection, using an LLM should make the process much faster.

Today we take a short break from the series on LLMs for users and for developers to examine a stand-alone example in which we use an LLM to extract information from documentation that is not part of its training data.

When I started this exercise, I was overly ambitious. I thought a good working example would involve extracting insights from financial reporting data and, basking in the optimism and enthusiasm only the ignorant can feel, I raced over to Glencore’s financial reporting documentation. After a week, all I had managed to do was relearn Rule No 1: Your data sucks. This is not a bug. It is a feature.

So I gave up!

OK, that is not exactly true. I gave up trying to extract information, as text, from a document that relies heavily on non-text content (charts, tables, diagrams). Fortunately, thanks to Amazon, there is a ready supply of electronic documents in which most of the content is text: the e-book.

Today we are going to extract information from André de Ruyter’s book Truth to Power, which Amazon describes as “André de Ruyter’s explosive account of his three years as CEO of Eskom, where he dealt with corruption, sabotage, political interference and a poisoning attempt.”

(If you want to know how ChatGPT describes the book, check the appendix to this document.)

I have a GitHub repo which you can look at. Virtually all the code in the repo was generated by ChatGPT and GitHub Copilot. I wrote English and got an LLM to be the developer.

Most examples that illustrate the use of an LLM to gain insight into a PDF document involve document summarization (see the appendix) or question answering (I am exploring this in my evolving developer series). Here I want to look at something else – sentiment analysis. Since the book is a chronicle of why Eskom is in its current precarious state, it is full of names of people and organisations. Would it be possible to get a list of them along with a summary of de Ruyter’s view of the role they played? Could we create a list of the goodies and the baddies?

To give you a flavour of the exercise, here are a few of the approximately 280 people or organisations analysed. The results are unedited, i.e. they come straight from the LLM using only the insight from the book. I have deliberately chosen people or organisations that appear frequently in the text, as these results require multiple passes through the LLM (for extraction and for amalgamation).

  • The overall sentiment towards the ANC is negative, with the text criticising the party for being stuck in the past, lacking business acumen leading to counterproductive legislation, corruption, and being politically sensitive towards dealing with issues. The sentiment is also negatively impacted due to the party's strict procurement and labour laws pushing investors away, and the general negative opinion about politicians' unwillingness to hear the truth.
  • Cyril Ramaphosa is generally viewed positively as a leader who takes decisive action when necessary to address the electricity crisis in South Africa. He has shown support for green energy and Eskom's role in it, remained focused on solving the crisis despite disappointment, inspired the author through his "Thuma Mina" campaign, challenged proposals to raise threshold to 100 MW, and demonstrated a willingness to ask tough questions. However, there are critiques of his leadership style prioritizing managing all opinions rather than making decisive decisions, and attributing the loadshedding to sabotage instead of admitting to incompetence suggests a lack of transparency and effective communication.
  • The sentiment towards Jacob Zuma is overwhelmingly negative due to his destructive impact on state-owned companies and involvement in corruption, overshadowing any slight positive sentiment towards him.
  • Pravin Gordhan is praised for his integrity, bravery, and positive reputation as a main character. However, tensions, criticisms, and suspicions of knowledge or lack of action towards corruption allegations generate a slightly negative sentiment. Overall, his work to address issues and willingness to challenge existing ANC orthodoxy contribute to a positive sentiment. Neutral or negative sentiments without compelling reasons are discounted.
  • Jan Oberholzer is portrayed as an experienced and hardworking operator who is proactive in restoring control over power stations. While he may express disappointment or distrust towards certain individuals or situations, he is also grateful and appreciative towards those who remind him of the importance of his work. Jan's behavior shows a positive sentiment towards ethical behavior and accountability. Though he may be caught off-guard and angry, he actively protects the company and himself from unethical behavior.
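The extraction-then-amalgamation flow behind results like these can be sketched roughly as follows. This is a minimal illustration, not the code from the repo: `call_llm` is a hypothetical stand-in for whatever function sends a prompt to the model and returns its reply, and the prompt wording is an assumption.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


def amalgamate(
    chunk_results: List[List[Tuple[str, str]]],
    call_llm: Callable[[str], str],  # hypothetical: prompt in, reply text out
) -> Dict[str, str]:
    """Group per-chunk (name, sentiment) notes by name, then merge each group."""
    notes: Dict[str, List[str]] = defaultdict(list)
    for chunk in chunk_results:
        for name, sentiment in chunk:
            notes[name.strip()].append(sentiment.strip())

    merged: Dict[str, str] = {}
    for name, items in notes.items():
        if len(items) == 1:
            # Mentioned in one chunk only, so there is nothing to merge.
            merged[name] = items[0]
        else:
            # Mentioned in several chunks: ask the model for one overall view.
            prompt = (
                f"Combine these notes on the author's view of {name} "
                "into one short summary:\n"
                + "\n".join(f"- {item}" for item in items)
            )
            merged[name] = call_llm(prompt)
    return merged
```

Names that appear in only one chunk cost no extra model call; frequently mentioned ones, like those above, need the second pass.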

And now, the reason you are all here: not to get a summary of de Ruyter’s sentiment, but to understand the insights I gained about these models while working through this example. I present these in order from general to specific.

  1. You need a human in the loop. While it may be possible to completely automate and generalise the end-to-end generation of sentiment, the effort would be great and I don’t think the results justify that effort. Coming away from this exercise I have reinforced my (Bayesian) prior assumption that LLMs will not make jobs redundant but people who use LLMs will replace people who do not. Would you employ someone who performed arithmetic with only a slide rule and log book or someone who uses Excel?
  2. LLMs are unpredictable and you cannot exercise fine-grained control of their output. Just because you ask nicely, pedantically, succinctly, verbosely or in painstaking detail does not mean the output will always be in the format you need for further automation. If, for example, you need output formatted as a table with specific column headings, you can get this most of the time, but every now and then the model will do something different. You need to check the result.
  3. The exercise of generating text is different from the exercise of checking whether the generated text conforms to the input request (prompt). This means that, depending on the importance of the problem, you can always use a two-step process. Step 1 is to perform a task. Step 2 is to use the same model to check that the task in Step 1 was performed correctly and, if not, to fix it. Step 2 will pick up many of the issues that may have been created in Step 1.
  4. You need to be able to create and execute Python scripts. Performing meaningful tasks requires interacting with the API. LLMs are really useful when there is a lot of input information, even if they can only process small chunks at a time. If you want to use these (and remember, if you don’t, sooner or later you will be made redundant by someone who does), you should learn some basic Python. The LLMs create great code themselves, so you don’t need to become a programmer, but you do need to be able to create and run scripts yourself.
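Points 2 and 3 can be combined in a few lines of Python. The sketch below is an assumption-laden illustration rather than the repo's code: `call_llm` is a hypothetical function that sends a prompt and returns the reply, and the column headings and prompt text are made up for the example. It checks the reply in code and, if the format is off, hands it back to the model to fix.

```python
from typing import Callable, List, Optional

EXPECTED_HEADER = ["Name", "Sentiment", "Reason"]  # hypothetical column headings


def parse_table(text: str) -> Optional[List[List[str]]]:
    """Return the body rows of a pipe-delimited table, or None if malformed."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    # Ignore markdown-style separator rows such as |---|---|---|
    lines = [ln for ln in lines if not set(ln) <= set("|-: ")]
    if not lines:
        return None
    rows = [[cell.strip() for cell in ln.strip("|").split("|")] for ln in lines]
    if rows[0] != EXPECTED_HEADER:
        return None
    body = rows[1:]
    if any(len(row) != len(EXPECTED_HEADER) for row in body):
        return None
    return body


def extract_with_check(
    chunk: str,
    call_llm: Callable[[str], str],  # hypothetical: prompt in, reply text out
    max_repairs: int = 2,
) -> List[List[str]]:
    """Step 1: ask for a table. Step 2: if it is malformed, ask the model to fix it."""
    header = " | ".join(EXPECTED_HEADER)
    answer = call_llm(
        "List every person or organisation in the text below as a "
        f"pipe-delimited table with the header {header}.\n\n{chunk}"
    )
    rows = parse_table(answer)
    repairs = 0
    while rows is None and repairs < max_repairs:
        answer = call_llm(
            f"The text below should be a pipe-delimited table with the header "
            f"{header}. Fix it and return only the table.\n\n{answer}"
        )
        rows = parse_table(answer)
        repairs += 1
    if rows is None:
        raise ValueError("Model output never matched the expected table format")
    return rows
```

The format check is ordinary code, so it is cheap and deterministic; the model is only called again when that check fails. Anything that still fails after the repairs is raised for the human in the loop to look at.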

The code repository, along with more detailed comments, can be found here, but note that it requires the user to be actively involved in the process. It is, however, what I think a user of an LLM should be able to do if they are to use one to improve their productivity.

Appendix: Using an LLM for summarisation

Summarisation is an established use case for an LLM. Starting with the final result, GPT describes de Ruyter’s book as:

"Truth to Power" is a book that addresses the challenges and corruption faced by South Africa's state-owned power company, Eskom. The author, a former CEO of Eskom, discusses their personal experience and the importance of selfless leadership. The book touches on issues such as corruption, procurement regulations, and operational culture that have led to Eskom's struggles. It also highlights the importance of strong leadership and the efforts of key individuals in uncovering corruption and fraud. In addition, "Truth to Power" explores the challenges associated with South Africa's transition to green energy, including policy integration, funding, and government investment. The book concludes by proposing private sector involvement and a clear path to overcome the country's energy crisis.

While this is not quite as punchy as the Amazon one-liner, I hope you agree with me that it is pretty good. I don’t think it is a stretch to conclude that it is better than I could do, and it took mere minutes to parse the entire book and combine the results to produce this. It is worth noting that I chose the length of this summary; I could have asked for a summary of almost any length.

I was using OpenAI’s GPT-3.5 model because I still don’t have API access to version 4. GPT-4 is much better than 3.5, so I think I could even ask it to create a “click-baity” one-liner which may even capture the assassination attempt!

The process to generate the summary is pretty clunky so you need to be able to script. First read in the PDF, then break it into “sensible” chunks that are not too long and summarise each chunk. Combine the summaries of each chunk by chapter and create chapter level summaries. Combine chapter summaries into strings of a sensible length and keep going.
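As a rough sketch of that loop, assuming the PDF has already been read into one string per chapter: `call_llm` is a hypothetical function that sends a prompt and returns the reply, the prompts are placeholders, and the character limit is an assumed stand-in for the model's real token limit. It is not the code from the repo.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_chars: int = 6000) -> List[str]:
    """Split on blank lines, packing paragraphs into chunks of about max_chars."""
    chunks: List[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


def reduce_summaries(
    summaries: List[str], call_llm: Callable[[str], str], max_chars: int = 6000
) -> str:
    """Keep combining summaries in sensibly sized groups until one is left."""
    if not summaries:
        raise ValueError("Nothing to summarise")
    while len(summaries) > 1:
        groups = split_into_chunks("\n\n".join(summaries), max_chars)
        combined = [
            call_llm("Combine these summaries into one:\n\n" + g) for g in groups
        ]
        if len(combined) >= len(summaries):
            # Summaries are too long to group, so the loop would never finish.
            raise ValueError("Summaries are not shrinking; raise max_chars")
        summaries = combined
    return summaries[0]


def summarise_book(
    chapters: List[str], call_llm: Callable[[str], str], max_chars: int = 6000
) -> str:
    """Chunk summaries -> chapter summaries -> one summary for the whole book."""
    chapter_summaries = []
    for chapter in chapters:
        chunk_summaries = [
            call_llm("Summarise this text:\n\n" + chunk)
            for chunk in split_into_chunks(chapter, max_chars)
        ]
        chapter_summaries.append(
            reduce_summaries(chunk_summaries, call_llm, max_chars)
        )
    return reduce_summaries(chapter_summaries, call_llm, max_chars)
```

Every level is the same operation applied to shorter text, which is why the whole thing comes down to a loop around a single model call.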

For those of you still left at this stage, we can follow the path of the text in chapter 1 as it bubbles up to the final summary.

Chapter 1, chunk 1 summary

Former Nampak CEO André de Ruyter and former governor of the South African Reserve Bank, Tito Mboweni, had a good rapport despite their differing backgrounds. The two regularly engaged in conversations about the state of South Africa and often spoke about the country's electricity crisis, with Mboweni seeking de Ruyter's insights on the subject. Mboweni, who was critical of Nampak's move to a cheaper, smaller office, provided valuable insight into the opaque world of the ANC's internal politics during the state capture years. However, Mboweni was ostracised by many of his peers for speaking out against abuses of state capture, which he viewed as a serious threat to the country.

Chapter 1, chunk 2 summary

The author and Mboweni regularly discussed solutions for South Africa's problems. Mboweni suggested the author write down their thoughts about an ideal Eskom CEO and what was needed to fix the struggling utility. The author complied with Mboweni's request, outlining a plan to restructure Eskom by splitting its different units. This suggestion had also been floated during Mboweni's time in government but was "buried." Later, the author learned that Mboweni's request may have been more calculating than initially thought, potentially collaborating with Colin Coleman from Goldman Sachs to recruit the author to run Eskom.

These are summarised into the Chapter 1 summary

The former Nampak CEO, André de Ruyter, and former South African Reserve Bank governor, Tito Mboweni, frequently discussed South Africa's electricity crisis, with Mboweni seeking de Ruyter's insights on the subject. Despite his critical comments on Nampak's business decisions, Mboweni provided valuable insights into the ANC's internal politics. Mboweni and the author also discussed solutions to the country's problems, with Mboweni suggesting the author record their plan to restructure Eskom by separating its different units and potentially collaborating with Goldman Sachs to recruit the author to run the company.

This is combined with other chapters in the first half of the book and summarised to get:

The book "Truth to Power" features several chapters that shed light on the challenges and corruption faced by South Africa's state-owned power company, Eskom. The author, a former CEO of Eskom, describes their personal experience applying and working in the role and the importance of having selfless leadership. The chapters touch on issues such as operational culture, corruption, and procurement regulations that have contributed to Eskom's struggles. Furthermore, the book highlights the efforts of key individuals in uncovering corruption and fraud at Eskom and the need for strong, decisive leadership to address the country's challenges.

This is finally combined with the summary of the second half of the book to get the summary I started with. You see, you need to know how to write a loop!

Can an LLM extract information from a document?