A Simple Vision Model Application

I collect receipts. Specifically, I collect building and renovation receipts that I need to offset the capital gains I would otherwise be taxed on when I sell my primary residence. I owned my last house for 18 years and managed a number of renovations. Over the years, I had a lot of small receipts that added up to something meaningful. During a tax audit, I was asked to provide a full reconciliation of my investment in the property and then to produce evidence of a few of the line items selected at random. Fortunately, I had the data because, otherwise, I would have been unnecessarily sponsoring fire pools or skinny jeans for those who feel entitled to the SA tax purse.
But here's the thing: collecting, organising, and reconciling these invoices is a tedious, menial exercise that I despise. All I really need to do is scan each receipt, save the file with a sensible name, and record the date, amount, and file name in a spreadsheet, and yet I can't explain how much this frustrates me. I put it off for weeks or sometimes months, and by the time I get around to it, I'm convinced that I have misplaced or lost evidence of some expenses.
If I'm totally honest with myself, the reason I procrastinate so much is that I feel it's meaningless work, and my view on meaningless work is that it should be done by a machine. I've tried to automate it and failed. If you have ever tried to use a PDF as an 'input' to an automation task, you won't be surprised. PDFs are designed as an output format (intended only for human consumption). Trying to get a computer to read the full range of invoice formats is just not feasible. So in the end, I begrudgingly do the work manually.
Until last week.
Last Wednesday, I just needed a break from Retrieval-Augmented Generation, so I found this excellent paper from Microsoft and created a small project to test Large Multimodal Models (LMMs).
Automating my record-keeping turned out to be much easier than I imagined. I start with an image of the invoice, either a photo from my camera or a scan. Then I feed the image to an LMM with one simple instruction:
"Please read the text in this image and return the information in the following JSON format: (note XXX is a placeholder, if the information is not available in the image, return 'N/A' instead). {'invoice_date': xxx, 'total_amount': xxx, 'company_name': xxx, 'items_purchased': xxx}. Use the following format for the invoice date: YYYYMMDD. items_purchased should be a short string summary of the items purchased."
Given my experience with PDF documents, my expectations were low. But as I successfully iterated over my backlog of invoices, I remembered Arthur C. Clarke's famous observation that “Any sufficiently advanced technology is indistinguishable from magic.” This felt like magic. It just worked. It read invoices and extracted information from them: printed invoices, handwritten ones, clear ones, crappy ones. Not always 100% accurate, but often enough that I felt I was at a Penn & Teller show. Confused, amazed, and left doubting myself, but really wanting to understand what I had just seen.
I know that not everyone loves a magic show, but if you do, you should play with a vision model or read Microsoft's paper.