How to Build a B2B Document Extractor: Rule-Based vs. LLM Approaches

Introduction

Extracting structured data from B2B documents—such as purchase orders, invoices, or delivery notes—is a common challenge. Two primary approaches exist: a traditional rule-based method using pytesseract for OCR and regex for parsing, and a modern LLM-based method using Ollama with LLaMA 3. This guide walks you through building both versions of the same document extractor, comparing their strengths and tradeoffs using a realistic B2B order scenario. By the end, you'll be able to choose the right approach for your own projects.

What You Need

  • Python 3.8+ installed on your machine
  • pytesseract – Python wrapper for Tesseract OCR engine
  • Tesseract OCR engine installed separately (see Tesseract OCR documentation)
  • Ollama – local LLM server (download from ollama.com)
  • LLaMA 3 model (run ollama pull llama3 after installing Ollama)
  • Python libraries: pdf2image, Pillow, and requests (the re module used for parsing ships with Python's standard library)
  • A sample B2B PDF (e.g., a purchase order with fields: company name, date, line items, totals)

Step-by-Step Instructions

Step 1: Set Up the Environment

Create a new Python virtual environment and install all required packages:

pip install pytesseract pdf2image Pillow requests
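
If you need the commands for the environment itself, the standard venv module covers it (create and activate it before running the pip install above; ".venv" is just a common directory name):

python -m venv .venv                  # create an isolated environment
source .venv/bin/activate             # activate it (on Windows: .venv\Scripts\activate)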

Ensure the Tesseract OCR engine is installed system-wide (sudo apt install tesseract-ocr on Debian/Ubuntu, or download the Windows installer). Also install and start Ollama, then pull the LLaMA 3 model:

ollama pull llama3

Step 2: Convert PDF to Images

B2B documents are often scanned PDFs. Use pdf2image to render each page as an image. Write a function (sketched after this list) that:

  • Takes the PDF path as input
  • Converts pages to images using convert_from_path
  • Returns a list of PIL Image objects
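
A minimal sketch of that function (note that pdf2image depends on the poppler utilities being installed on your system):

from pdf2image import convert_from_path

def pdf_to_images(pdf_path, dpi=300):
    # Each page becomes a PIL Image; a higher DPI generally improves OCR accuracy.
    return convert_from_path(pdf_path, dpi=dpi)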

Step 3: Perform OCR with pytesseract

For each image, call pytesseract.image_to_string() to extract raw text. This step is identical for both rule-based and LLM approaches, as they both need the text first. Store the extracted text per page.
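
Building on the sketch from Step 2, this can be a one-liner per page:

import pytesseract

def ocr_pages(images):
    # Run Tesseract on each page image and keep one text string per page.
    return [pytesseract.image_to_string(img) for img in images]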

Step 4: Build the Rule-Based Extractor

Use regular expressions and string logic to locate fields like Order Number, Date, Client Name, and Line Items. For example:

  • Search for patterns like r'Order\s*#:\s*(\S+)'
  • Use a list of known product names for line items
  • Parse multi-line blocks for tables

This method is fast and predictable, but fragile if the document format changes.
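
Here is a sketch of the regex approach, using the order-number pattern above (the "Date:" label and its pattern are assumptions; adjust them to your documents' actual format):

import re

def extract_rule_based(text):
    # Patterns are illustrative; real documents need tuning per template.
    order = re.search(r'Order\s*#:\s*(\S+)', text)
    date = re.search(r'Date:\s*([\d/.-]+)', text)  # assumed "Date:" label
    return {
        'order_number': order.group(1) if order else None,
        'date': date.group(1) if date else None,
    }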

Step 5: Build the LLM-Based Extractor

Instead of writing rules, send the extracted text to LLaMA 3 via Ollama’s API, using a structured prompt that asks the model to return specific fields as JSON:

prompt = f"""
Extract the following information from this purchase order:
- order_number
- date
- client_name
- line_items (array of objects with 'item', 'quantity', 'price')
Return only valid JSON.

Text:
{text}
"""

Use the requests library to call Ollama:

response = requests.post(
    'http://localhost:11434/api/generate',
    json={'model': 'llama3', 'prompt': prompt, 'stream': False},
)

Parse the JSON from the response.
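
With stream set to False, Ollama returns a single JSON envelope whose response field holds the model’s raw output, so parsing takes two steps (the second can still fail if the model wraps its JSON in extra prose, which is worth guarding against in production):

import json

data = response.json()                 # Ollama's envelope around the output
fields = json.loads(data['response'])  # the model's text, expected to be JSON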

Step 6: Compare Outputs

Run both extractors on the same set of PDFs and compare:

  • Accuracy: Which fields are correct?
  • Robustness: How does each handle missing data or typos?
  • Speed: Rule-based usually finishes in seconds; LLM may take 10–30 seconds per page.

The original experiment showed that the rule-based approach failed on a slightly different document format, while the LLM gracefully adapted—but hallucinated one item.
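
A tiny harness for this comparison might look like the following, reusing the earlier sketches and assuming you have wrapped the Step 5 call in a function named extract_llm (pdf_paths is a placeholder for your sample set):

import time

for pdf_path in pdf_paths:
    text = "\n".join(ocr_pages(pdf_to_images(pdf_path)))
    for name, extract in [('rules', extract_rule_based), ('llm', extract_llm)]:
        start = time.perf_counter()
        result = extract(text)
        print(f"{pdf_path} [{name}] {time.perf_counter() - start:.1f}s -> {result}")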

Tips for Success

  • Preprocess images: For rule-based OCR, apply thresholding or deskewing to improve accuracy (see the snippet after these tips).
  • Optimize LLM prompts: Include example outputs and specify format clearly to reduce hallucinations.
  • Fallback strategy: Use rule-based extraction for well-known templates and LLM as a fallback for unknown documents.
  • Test with diverse samples: Don’t rely on a single document; vary fonts, layouts, and printing quality.
  • Monitor costs: Local LLMs are free to run but need capable hardware (ideally a GPU); cloud LLMs charge per token.
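
For the preprocessing tip, a simple Pillow threshold is one place to start (the cutoff of 128 is an arbitrary default to tune per document set):

def binarize(img, cutoff=128):
    # Grayscale the PIL image from Step 2, then force each pixel to black or white.
    return img.convert('L').point(lambda p: 255 if p > cutoff else 0)

Pass binarize(img) instead of the raw image into pytesseract.image_to_string.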

By following these steps, you can build your own B2B document extractor and decide which approach best fits your needs. For a deep dive into the original comparison, see the full article on towardsdatascience.com.
