After perusing my user guide, you've given ChatGPT and GPT-4 a spin. Your CV is professional, your crucial emails are devoid of spelling and grammar mistakes, and their tone is perfect. You've even dipped your toes into its low-carb meal plan, complete with shopping list. This AI stuff is interesting, but the deeper you delve, the more issues you uncover. It's not up-to-date, its context window is too short, and its reliability leaves much to be desired. But don't fret! You just need to hop off the hype train momentarily and ponder Amara's Law, or maybe Gates' Law, or maybe someone else's law (attributing quotes is harder than you'd think): “We overestimate the impact of technology in the short term but underestimate it in the long term.”

You're a developer. This is just software. With software, anything is possible. Right? Absolutely right!

I hope.

Before we tackle these issues, let's lay out the agenda - and the caveats.

In this series, we'll examine the two major deficiencies in LLMs as they stand (mid 2023):

ensuring they can access your proprietary knowledge base while keeping it, well, proprietary, and
managing their sometimes outlandish but always well written hallucinations.

We won't dive too deep into any specific weeds - that's a journey for another series. Here, we're aiming for a panoramic view of current approaches to see when we can charge forward with existing tools, and when we should pause and ponder before we hastily ‘pip install antigravity’ and take flight.

There are plenty of end-to-end examples out there. In this series, I'll attempt to distinguish between pre-packaged examples and those that might actually withstand the wild outdoors. While there will be some GitHub repos showcased, my goal isn't to provide fully functioning examples. Instead, for each potential use case, I aim to spotlight key issues and identify where they could be deal-breakers or mere trifles.

How many elephants can you squeeze into a room?

Max Planck famously said, “A new scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die and a new generation grows up that is familiar with it.” Or, for the clickbait aficionados, “Science advances one funeral at a time.” What I need from this quote is that scientific progress is generally slow and sequential, punctuated by the occasional leap. If I haven't entirely misinterpreted Planck (which wouldn't be a first), I'm compelled to conclude that Machine Learning isn't quite a science yet. Observing progress in Machine Learning feels more like watching a work of art come to life. Start with a complex idea. Distil it. Refine it. Extract its essence. Prune unnecessary complexity. But above all, enrich it.

The evolution of DeepMind's Go playing algorithm is a perfect case in point. While the early algorithms are intelligible only to machine learning specialists, by the time DeepMind outwitted the best players in the world, its algorithm was comprehensible to any first-year maths student. If you haven't read DeepMind’s article on this, I highly recommend it. Your time will be well spent.

But here's my quandary. I want to create “evergreen” content, but a single tweak in a neural network configuration could render everything we discuss here obsolete. It's May 2023 now. If you're reading this later, I can only hope it hasn't gathered dust.

Your data sucks.

Given the target audience of this series, you might be blissfully ignorant of a not-so-little secret. If you're dealing with data, you're going to spend a lion's share of your time cleaning it - think 60% - 80% of your time will be spent preparing data. You enter a field like statistics or data science, spend your youth studying nuanced approaches to intricate problems, and then, in the real world, you find yourself wrestling with date formats, joining on missing keys, or trying to decipher the enigmatic output from unreadable regex expressions. Data sucks.

But here's the kicker. It's not a bug. It's a feature. Picture this: you have access to pristine data. More likely than not, every question you could possibly want to explore has already been answered by someone smarter or more dedicated than you. It's just the way the world works. Valuable information is often buried in messy data. Clean up the mess, gain the insight, and then move on to tackle the next messy data set. You need to make peace with the fact that the bulk of your work in this field will be the monotonous scrubbing of data. If you can't stomach that, or can't outsource it (microworkers or amazon mechanical turk), perhaps consider a career change. You're going to be working with data, and data, my friend, sucks.

You need to test.

Is a simplistic approach adequate for a specific example? Should you invest time in managing token lengths or simply opt for a different sentence transformer? Should you use a vector database or train a full encoder network?

Upfront, no one can predict if it's better to swap this black box for that one. You just have to benchmark what you've done, make a change, and re-benchmark.

If you were banking on GitHub Co-pilot and Amazon Code Whisperer to forever abolish your need to test, I hate to be the bearer of bad news. Machine Learning isn't quite like code development. You're not verifying if a piece of code executes correctly under certain conditions; you're comparing black box algorithms that are essentially a mystery to begin with, and you're aiming to do it well (hopefully!). You need to have a slew of objective tests in place before you start, to ascertain if you're making progress or just spinning your wheels.

Here, we're not going to lose sleep over unit, functional, and regression tests; instead, we're going to fret over logic and semantic tests. And we're going to need to stay one step ahead of users discovering the latest form of SQL-Injection, DAN if you don’t want to be financially crippled by some North Korean bot network.

Alright, that's it for the caveats. Now, let's get this show on the road!

1. It's just software. Or is it?