Automating Invoice Parsing with AI

A practical, low-cost workflow for extracting structured data from Google Workspace and Google Cloud PDF invoices. The project combines PDF text normalization with an OpenAI model that returns strict JSON, then exports the results to CSV.

Handling invoices is tedious: download, scan, copy, paste, verify, repeat. It gets worse when PDFs contain broken text like “.I.n.v.o.i.c.e.” or numbers split into pieces such as “5 0 1 1 6 7 1 7 4 5”. Traditional regex parsers tend to fail across layouts, and one new template can break the entire pipeline.

To make this reliable and scalable, I built a small toolchain that:

  • Extracts text from PDFs with pdfplumber,
  • Normalizes messy text (removes dotted leaders, rejoins fragmented characters and digits),
  • Uses a low-cost OpenAI model to produce strict JSON matching a schema,
  • Exports the result to a clean CSV for your bookkeeping software.
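As a rough illustration of the first step, the sketch below reads every page of a PDF with pdfplumber and joins the page text into one string. The function names `extract_pdf_text` and `join_pages` are made up for this example and are not taken from the repository; the only pdfplumber calls used are `pdfplumber.open`, `pdf.pages`, and `page.extract_text()`.

```python
from typing import Iterable, Optional


def join_pages(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page text, skipping pages that yielded no text (None or empty)."""
    return "\n".join(text for text in page_texts if text)


def extract_pdf_text(pdf_path: str) -> str:
    """Return the text of all pages in one string (hypothetical helper)."""
    # Imported lazily so the pure helper above works without the dependency.
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can return None for pages without a text layer.
        return join_pages(page.extract_text() for page in pdf.pages)
```

Scanned invoices without a text layer would come back empty here; handling them would need OCR, which is outside this sketch.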

Why Open Source?

This pain is common across freelancers, startups, and finance teams. By publishing the scripts as open source, the community gets a transparent and extensible reference implementation. You can adapt the schema, extend patterns, or plug in different providers while reusing the same normalization and AI-structuring approach.

What the Tool Extracts

  • Invoice number
  • Billing ID
  • Domain name
  • Invoice period (start & end)
  • Subtotal, VAT (rate and amount), Total in EUR
  • Supplier name and VAT number (where present)
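One way to pin these fields down is a JSON Schema written as a Python dict, plus a small check on whatever the model returns. This is only a sketch: the key names (`invoice_number`, `subtotal_eur`, and so on) and the `check_record` helper are assumptions for illustration, and the repository's actual schema may differ.

```python
import json

# Hypothetical schema: field names are illustrative, not copied from the repo.
INVOICE_SCHEMA = {
    "type": "object",
    "additionalProperties": False,  # no invented fields
    "required": [
        "invoice_number", "billing_id", "domain",
        "period_start", "period_end",
        "subtotal_eur", "vat_rate", "vat_amount_eur", "total_eur",
        "supplier_name", "supplier_vat_number",
    ],
    "properties": {
        "invoice_number": {"type": "string"},
        "billing_id": {"type": ["string", "null"]},
        "domain": {"type": ["string", "null"]},
        "period_start": {"type": ["string", "null"]},
        "period_end": {"type": ["string", "null"]},
        "subtotal_eur": {"type": ["number", "null"]},
        "vat_rate": {"type": ["number", "null"]},
        "vat_amount_eur": {"type": ["number", "null"]},
        "total_eur": {"type": ["number", "null"]},
        "supplier_name": {"type": ["string", "null"]},
        "supplier_vat_number": {"type": ["string", "null"]},
    },
}


def check_record(raw_json: str) -> dict:
    """Parse model output and reject missing or unexpected keys."""
    record = json.loads(raw_json)
    expected = set(INVOICE_SCHEMA["required"])
    missing = expected - set(record)
    extra = set(record) - expected
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    return record
```

The check only compares key names; type validation would need a JSON Schema library. Allowing `null` for optional fields lets the model say "not present" instead of guessing.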

How It Works

  1. PDF text extraction using pdfplumber.
  2. Normalization removes leader dots and rejoins fragmented tokens without breaking URLs or decimals.
  3. Structured extraction via the OpenAI API with a strict JSON schema (no invented fields).
  4. CSV export for easy import into accounting systems.
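The normalization step can be approximated with a few regular expressions. The sketch below is an assumption about how such a step might look, not the repository's code: it handles dot-separated letters, runs of leader dots, and single digits split by spaces, and leaves URLs and decimals alone because their dots sit between longer tokens or digits.

```python
import re

# ".I.n.v.o.i.c.e." : three or more single letters, each preceded by a dot.
# The lookbehind keeps dots inside URLs (preceded by a word char) untouched.
_DOTTED_LETTERS = re.compile(r"(?<![\w.])(?:\.[A-Za-z]){3,}\.?")
# "Total ........ 12.50" : a run of three or more dots used as a leader.
_LEADER_DOTS = re.compile(r"\.{3,}")
# "5 0 1 1 6 7" : four or more single digits separated by single spaces.
_SPACED_DIGITS = re.compile(r"(?<![\d.,])\d(?: \d){3,}(?![\d.,])")


def normalize_text(text: str) -> str:
    """Rejoin fragmented tokens and drop leader dots (illustrative sketch)."""
    text = _DOTTED_LETTERS.sub(lambda m: m.group(0).replace(".", ""), text)
    text = _LEADER_DOTS.sub(" ", text)
    text = _SPACED_DIGITS.sub(lambda m: m.group(0).replace(" ", ""), text)
    # Collapse leftover runs of spaces and tabs, keeping line breaks.
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(lines)
```

For instance, `normalize_text(".I.n.v.o.i.c.e. number: 5 0 1 1 6 7 1 7 4 5")` returns `"Invoice number: 5011671745"`. The digit rule is a heuristic: a column of unrelated single digits separated by single spaces would be merged too, so the threshold may need tuning per layout.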

Run it on a folder of invoices:

python ai_invoice_extract.py -i "invoices/*.pdf" -o output/invoices.csv
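For the export side, a minimal sketch with the standard `csv` module could look like this. The `write_csv` name and the choice to take column order from the first record are assumptions for illustration; the script's real export code is not shown here.

```python
import csv
from pathlib import Path


def write_csv(records: list[dict], out_path: str) -> None:
    """Write one row per invoice record to out_path (illustrative sketch)."""
    if not records:
        raise ValueError("no records to write")
    path = Path(out_path)
    # Create e.g. "output/" if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    fieldnames = list(records[0].keys())  # column order from the first record
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records)
```

Because every record is expected to follow the same schema, a fixed header taken from the first record is enough; keys missing from later records would simply be written as empty cells.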

Why a Small Model?

Invoice parsing doesn’t require heavyweight models. A compact, affordable model keeps costs predictable while still following a strict schema reliably. The script also trims irrelevant lines before sending text to the API, further reducing token usage.
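The trimming idea can be sketched as a simple line filter: keep lines that mention a relevant keyword or contain a digit, and drop the rest. The keyword list, the `trim_lines` name, and the character cap below are assumptions chosen for the example, not values from the script.

```python
import re

# Hypothetical keyword list; tune it to the invoices you actually receive.
KEYWORDS = ("invoice", "billing", "domain", "period", "subtotal", "vat", "total", "eur")


def trim_lines(text: str, max_chars: int = 6000) -> str:
    """Keep only lines likely to carry invoice data (illustrative sketch)."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        lowered = stripped.lower()
        # A line is kept if it mentions a keyword or contains any digit.
        if any(word in lowered for word in KEYWORDS) or re.search(r"\d", stripped):
            kept.append(stripped)
    # Hard cap as a last resort against unusually long documents.
    return "\n".join(kept)[:max_chars]
```

The filter deliberately errs on the side of keeping lines: dropping a line that holds the total is worse than sending a few extra tokens.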

Get Started

git clone https://github.com/OnlineSolutionsGroupBV/parse_invoices/
cd parse_invoices

python3 -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate
pip install -r requirements.txt

# Create .env from the template, then add your OpenAI API key to it
cp .env.example .env
# .env should contain: OPENAI_API_KEY=sk-...

python ai_invoice_extract.py -i "invoices/*.pdf" -o output/invoices.csv

Repository: https://github.com/OnlineSolutionsGroupBV/parse_invoices/

Extending the Project

  • Add more providers (AWS, Azure, telecom) by updating the JSON schema and prompt hints.
  • Introduce a local regex fallback for when the API isn’t available.
  • Export to Excel or push results straight into your bookkeeping system’s API.
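A local regex fallback could start as small as the sketch below. The label texts ("Invoice number", "Total in EUR") and the field names are assumptions about what a normalized invoice might contain; real templates would need their own patterns.

```python
import re

# Hypothetical label patterns; adjust to the wording on your invoices.
FALLBACK_PATTERNS = {
    "invoice_number": re.compile(r"\bInvoice number\s*:?\s*(\d+)", re.IGNORECASE),
    # \b stops "Subtotal in EUR" from being read as the total.
    "total_eur": re.compile(
        r"\bTotal in EUR\s*:?\s*€?\s*(\d+(?:[.,]\d+)*)", re.IGNORECASE
    ),
}


def regex_fallback(text: str) -> dict:
    """Extract a few fields without the API; unmatched fields stay None."""
    result = {}
    for field, pattern in FALLBACK_PATTERNS.items():
        match = pattern.search(text)
        result[field] = match.group(1) if match else None
    return result
```

Returning `None` for unmatched fields keeps the output shape the same as the API path, so the CSV export does not need to know which route produced a record.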

Final Thoughts

Automating repetitive admin work pays off quickly. With this open-source workflow, invoice parsing becomes a background task: robust, auditable, and inexpensive. If you work with Google invoices—or want to adapt the approach to other sources—try the repo, open issues, and contribute improvements.

About Online Solutions Group

This open-source project is maintained by Online Solutions Group. We specialize in business process automation, AI-driven workflows, and digital platforms that help organizations save time and operate more efficiently. Visit our website to learn more about our projects and services.
