Automating Invoice Parsing with AI
A practical, low-cost workflow to extract structured data from Google Workspace & Google Cloud PDF invoices. The project combines smart PDF text normalization with an OpenAI model that returns strict JSON, and exports results to CSV.
Handling invoices is tedious: download, scan, copy, paste, verify, repeat. It gets worse when PDFs contain broken text like “.I.n.v.o.i.c.e.” or numbers split into pieces such as “5 0 1 1 6 7 1 7 4 5”. Traditional regex parsers tend to fail across layouts, and one new template can break the entire pipeline.
To make this reliable and scalable, I built a small toolchain that:
- Extracts text from PDFs with pdfplumber,
- Normalizes messy text (removes dotted leaders, rejoins fragmented characters and digits),
- Uses a low-cost OpenAI model to produce strict JSON matching a schema,
- Exports the result to a clean CSV for your bookkeeping software.
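The normalization step is the least obvious part of the toolchain, so here is a minimal sketch of the idea using only the standard library. It is not the repository's actual code: the function name `normalize_line` and all three regular expressions are assumptions made for illustration, and real invoices will likely need more patterns.

```python
import re

# Illustrative patterns only (assumptions, not the repository's own regexes).
# Runs of three or more dots used as visual leaders, e.g. "Subtotal ........ 12.50".
_LEADER_DOTS = re.compile(r"\s*\.{3,}\s*")
# Single letters each followed by a dot, at least four in a row: ".I.n.v.o.i.c.e."
# Requiring four keeps short abbreviations such as "e.g." intact.
_DOTTED_LETTERS = re.compile(r"(?<!\w)\.?(?:[A-Za-z]\.){4,}")
# Single digits separated by single spaces, at least four digits: "5 0 1 1 6 7".
# The look-arounds skip digits that belong to a decimal such as "12.50".
_SPACED_DIGITS = re.compile(r"(?<![\d.,])\d(?: \d){3,}(?![\d.,])")


def normalize_line(line: str) -> str:
    """Remove leader dots and rejoin fragmented letters and digits."""
    line = _LEADER_DOTS.sub(" ", line)
    line = _DOTTED_LETTERS.sub(lambda m: m.group(0).replace(".", ""), line)
    line = _SPACED_DIGITS.sub(lambda m: m.group(0).replace(" ", ""), line)
    return line.strip()


if __name__ == "__main__":
    for sample in (
        ".I.n.v.o.i.c.e.",
        "Billing ID: 5 0 1 1 6 7 1 7 4 5",
        "Subtotal ........ 12.50",
        "https://workspace.google.com",
    ):
        print(normalize_line(sample))
```

With these patterns, the URL and the decimal amount pass through unchanged because neither consists of single characters separated by dots or spaces.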
Why Open Source?
This pain is common across freelancers, startups, and finance teams. By publishing the scripts as open source, the community gets a transparent and extensible reference implementation. You can adapt the schema, extend patterns, or plug in different providers while reusing the same normalization and AI-structuring approach.
What the Tool Extracts
- Invoice number
- Billing ID
- Domain name
- Invoice period (start & end)
- Subtotal, VAT (rate and amount), Total in EUR
- Supplier name and VAT number (where present)
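To make "strict" concrete, the fields above could be described with a JSON Schema that forbids extra properties, and the model's reply could be checked against it before anything is written to disk. The sketch below is an assumption about how that might look: the field names, the `INVOICE_SCHEMA` dictionary, and the `parse_strict` helper are hypothetical and may differ from what the repository uses.

```python
import json

# Hypothetical field names; the repository's actual schema may differ.
INVOICE_SCHEMA = {
    "type": "object",
    "additionalProperties": False,  # no invented fields
    "properties": {
        "invoice_number": {"type": "string"},
        "billing_id": {"type": "string"},
        "domain": {"type": "string"},
        "period_start": {"type": "string"},
        "period_end": {"type": "string"},
        "subtotal_eur": {"type": "number"},
        "vat_rate": {"type": "number"},
        "vat_amount_eur": {"type": "number"},
        "total_eur": {"type": "number"},
        "supplier_name": {"type": ["string", "null"]},
        "supplier_vat_number": {"type": ["string", "null"]},
    },
}
# Every property is required; optional values are expressed as null.
INVOICE_SCHEMA["required"] = list(INVOICE_SCHEMA["properties"])


def parse_strict(raw_json: str) -> dict:
    """Parse a model reply and reject missing or unexpected keys.

    This checks key names only; it does not validate value types.
    """
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    expected = set(INVOICE_SCHEMA["properties"])
    found = set(data)
    missing, extra = expected - found, found - expected
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    return data
```

Checking the keys locally is a cheap safety net: even if the provider enforces the schema, a second check catches configuration mistakes early.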
How It Works
- PDF text extraction using pdfplumber.
- Normalization removes leader dots and rejoins fragmented tokens without breaking URLs or decimals.
- Structured extraction via the OpenAI API with a strict JSON schema (no invented fields).
- CSV export for easy import into accounting systems.
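The final step in the list, CSV export, needs nothing beyond Python's built-in `csv` module. The following is a rough sketch under stated assumptions: the column names in `FIELDS` and the helper `rows_to_csv` are hypothetical, and the real script may order or name its columns differently.

```python
import csv
import io

# Hypothetical subset of columns; extend to match your own schema.
FIELDS = ["invoice_number", "billing_id", "total_eur"]


def rows_to_csv(rows, fieldnames=FIELDS) -> str:
    """Serialize a list of invoice dicts to CSV text with a header row."""
    buf = io.StringIO(newline="")
    # extrasaction="raise" makes an unexpected key fail loudly
    # instead of being silently dropped.
    writer = csv.DictWriter(buf, fieldnames=fieldnames, extrasaction="raise")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buf.getvalue()


if __name__ == "__main__":
    print(rows_to_csv([
        {"invoice_number": "123", "billing_id": "5011671745", "total_eur": 12.5},
    ]))
```

To write to disk instead, open the target with `open(path, "w", newline="", encoding="utf-8")` and pass that file object to `csv.DictWriter` in place of the in-memory buffer.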
Run it on a folder of invoices:
python ai_invoice_extract.py -i "invoices/*.pdf" -o output/invoices.csv
Why a Small Model?
Invoice parsing doesn’t require heavyweight models. A compact, affordable model keeps costs predictable while still following a strict schema reliably. The script also trims irrelevant lines before sending text to the API, further reducing token usage.
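Trimming can be as simple as keeping only the lines that are likely to carry invoice data. The sketch below shows one possible heuristic, not the script's actual filter: the keyword list and the function `trim_for_prompt` are assumptions, and the rule "keep any line with a keyword or a digit" is deliberately crude.

```python
# Hypothetical keywords; tune them for the invoices you actually receive.
KEYWORDS = ("invoice", "billing", "domain", "subtotal", "vat", "total", "period")


def trim_for_prompt(text: str, keywords=KEYWORDS) -> str:
    """Keep lines that mention a keyword or contain a digit; drop the rest."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        lowered = stripped.lower()
        has_keyword = any(word in lowered for word in keywords)
        has_digit = any(ch.isdigit() for ch in stripped)
        if has_keyword or has_digit:
            kept.append(stripped)
    return "\n".join(kept)


if __name__ == "__main__":
    page = "Invoice number: 123\n\nThank you for your business\nTotal in EUR 12.50\n"
    print(trim_for_prompt(page))
```

A filter like this trades a small risk of dropping a useful line for fewer tokens per request, so it is worth spot-checking the trimmed text on a few real invoices before relying on it.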
Get Started
git clone https://github.com/OnlineSolutionsGroupBV/parse_invoices/
cd parse_invoices
python3 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
# Add your OpenAI API key to .env
cp .env.example .env
# OPENAI_API_KEY=sk-...
python ai_invoice_extract.py -i "invoices/*.pdf" -o output/invoices.csv
Repository: https://github.com/OnlineSolutionsGroupBV/parse_invoices/
Extending the Project
- Add more providers (AWS, Azure, telecom) by updating the JSON schema and prompt hints.
- Introduce a local-regex fallback when the API isn’t available.
- Export to Excel or push results straight into your bookkeeping system’s API.
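As an illustration of the regex fallback idea, here is a minimal sketch that tries an API-based extractor first and falls back to local patterns if the call fails. Everything in it is hypothetical: the label strings in `_PATTERNS` are guesses about invoice wording, and `api_extract` stands for whatever function wraps the model call in your own code.

```python
import re

# Assumed label wording; adjust to the text your invoices really contain.
_PATTERNS = {
    "invoice_number": re.compile(r"\bInvoice number[:\s]+(\d+)", re.IGNORECASE),
    "total_eur": re.compile(r"\bTotal in EUR[:\s]+€?\s*([\d.,]+)", re.IGNORECASE),
}


def regex_fallback(text: str) -> dict:
    """Extract a few fields with local regexes; unmatched fields are None."""
    result = {}
    for field, pattern in _PATTERNS.items():
        match = pattern.search(text)
        result[field] = match.group(1) if match else None
    return result


def extract_with_fallback(text: str, api_extract) -> dict:
    """Call api_extract(text); if it raises, use the regex fallback instead."""
    try:
        return api_extract(text)
    except Exception:
        return regex_fallback(text)
```

The fallback returns strings rather than parsed numbers, so downstream code should treat its output as lower-confidence data that may need manual review.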
Final Thoughts
Automating repetitive admin work pays off quickly. With this open-source workflow, invoice parsing becomes a background task: robust, auditable, and inexpensive. If you work with Google invoices—or want to adapt the approach to other sources—try the repo, open issues, and contribute improvements.
About Online Solutions Group
This open-source project is maintained by Online Solutions Group. We specialize in business process automation, AI-driven workflows, and digital platforms that help organizations save time and operate more efficiently. Visit our website to learn more about our projects and services.