
Comparing Parsee Document Loader vs. LangChain Document Loaders for PDFs

March 18, 2024 - 5 min
In the following, we compare the results of the Parsee Document Loader with the LangChain PyPDF Document Loader across various datasets. All datasets used here can be found on Hugging Face (links below), so the results are fully reproducible.

With these datasets we want to test how the results of an LLM for structured data extraction differ depending on which document loader is used.
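
To make the setup concrete, here is a minimal sketch of the LangChain side of the comparison. The PyPDFLoader call reflects its actual API; the file name is a placeholder, and the Parsee side is left as a comment rather than guessing at the parsee-core entry point (see its README for the real one).

```python
# Minimal sketch: extracting text from one sample PDF with the LangChain
# PyPDF loader. The file name is a placeholder, not part of the datasets.
from langchain_community.document_loaders import PyPDFLoader

pdf_path = "sample_invoice.pdf"  # placeholder path

# PyPDFLoader returns one Document per page with plain-text content
pages = PyPDFLoader(pdf_path).load()
text_for_llm = "\n".join(page.page_content for page in pages)

# The Parsee Document Loader would slot in here; consult the parsee-core
# README for its actual entry point (not reproduced here to avoid
# guessing the API).
print(text_for_llm[:300])
```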

Both datasets have their own READMEs with more information about the methodology, notebooks for creating the datasets, and evaluation results:


1. Invoice Dataset - LangChain Loader

parsee-core version used: 0.1.3.11

This dataset was created on the basis of 15 sample invoices (PDF files).

All PDF files are publicly accessible on parsee.ai. To access one, copy the "source_identifier" (first column) and paste it into this URL, replacing '{SOURCE_IDENTIFIER}' with the actual identifier:

https://app.parsee.ai/documents/view/{SOURCE_IDENTIFIER}

So for example:

https://app.parsee.ai/documents/view/1fd7fdbd88d78aa6e80737b8757290b78570679fbb926995db362f38a0d161ea
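In code, building the viewer URL from a dataset row is a one-liner (the identifier below is the example from above):

```python
# Build the viewer URL from a row's source_identifier (example value from above)
source_identifier = "1fd7fdbd88d78aa6e80737b8757290b78570679fbb926995db362f38a0d161ea"
url = f"https://app.parsee.ai/documents/view/{source_identifier}"
print(url)
```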

The invoices were selected randomly and are in either German or English.

The following code was used to create the dataset: jupyter notebook

The correct answers for each row were loaded from Parsee Cloud, where they were checked by a human and corrected prior to running this code.

1.1 LLM Evaluation

For the evaluation we are using the mistralai/mixtral-8x7b-instruct-v0.1 model via Replicate.
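
As a rough sketch, querying this model through Replicate's Python client looks like the following; the prompt is a stand-in, the actual prompts are in the linked notebook.

```python
# Sketch of querying mixtral-8x7b-instruct via Replicate's Python client.
# The prompt is a stand-in; the real prompts live in the evaluation notebook.
import replicate

output = replicate.run(
    "mistralai/mixtral-8x7b-instruct-v0.1",
    input={"prompt": "Extract the invoice total as JSON: ..."},  # placeholder prompt
)
# For language models, replicate.run yields the response piece by piece
answer = "".join(output)
print(answer)
```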

The results of the evaluation can be found here: jupyter notebook

1.2 Result

Even though the Parsee PDF Reader was not initially designed for invoices (which often contain fragmented text and tables that are difficult to structure properly), it still outperforms the LangChain PyPDF reader, with a total accuracy of 88% vs. 82% for the LangChain reader.
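
The exact scoring lives in the evaluation notebook; conceptually, the total accuracy reported here boils down to something like this sketch:

```python
# Conceptual sketch of the total-accuracy score: the share of extracted
# answers that exactly match the human-verified answers from Parsee Cloud.
def total_accuracy(predicted: list, expected: list) -> float:
    assert len(predicted) == len(expected)
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)

# e.g. total_accuracy(model_answers, verified_answers) -> 0.88
```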

Figure: Parsee PDF Reader compared with LangChain PyPDF

2. Revenues Dataset - Parsing Tables

This dataset consists of 15 pages from annual and quarterly reports of German companies (PDF files); the filings themselves are in English.

The goal is to evaluate two things:

  1. How well can a state-of-the-art LLM retrieve complex structured information from the documents?

  2. How does the Parsee.ai document loader fare against the LangChain PyPDF loader for this document type?

We are using the Claude 3 Opus model for all runs here, as it was the most capable model in our prior experiments (beating GPT-4).
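
For reference, a minimal sketch of querying Claude 3 Opus with Anthropic's Python SDK; the prompt is illustrative, and the actual run configuration is in the dataset's notebooks.

```python
# Sketch of querying Claude 3 Opus via Anthropic's Python SDK.
# The prompt is illustrative; the real prompts are in the linked notebooks.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
message = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Extract all revenue figures as JSON: ..."}],
)
print(message.content[0].text)
```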


2.1 Result

Figure: Comparison of extraction results for revenue tables

Explanation of the results (a sketch of how these metrics could be computed follows the list):

  • Completeness: This measures how often the model returned the expected number of answers. For example, this file contains 5 columns with a "Revenue" figure, so we expect the model to return 5 different "answers", each with one of the revenue figures (you can see these in the "Extracted Data" tab on Parsee Cloud).

  • Revenues Correct: How many times the model extracted a valid "Revenues" figure. Completely missing answers are counted here as well, so this accounts for both wrong and missing answers.

  • Revenues Correct (excluding missing answers): This disregards the cases where the model did not extract a figure at all. In other words: if the model extracted a figure (matched based on the meta information), was it the correct number?

  • Meta Items Correct: How many times the model extracted all the expected meta information (time periods, currencies, etc.). Missing answers are counted here as well.

  • Meta Items Correct (excluding missing answers): If the model found a valid revenues number, how many times was all the meta information attached to it correct? (Cases where the answer was missing entirely are not counted here.)
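
As referenced above, here is a rough sketch of how these metrics could be computed. The field names ("value", "meta") and the meta-based matching are assumptions for illustration, not the exact evaluation code from the notebooks.

```python
# Illustrative sketch of the metrics above. Field names ("value", "meta")
# and meta-based matching are assumptions, not the exact evaluation code.
def completeness(expected: list[dict], extracted: list[dict]) -> float:
    # Share of the expected number of answers the model actually returned
    return min(len(extracted), len(expected)) / len(expected)

def revenues_correct(expected, extracted, exclude_missing=False):
    # Match extracted figures to expected ones via their meta information
    by_meta = {tuple(sorted(e["meta"].items())): e["value"] for e in extracted}
    hits, considered = 0, 0
    for item in expected:
        pred = by_meta.get(tuple(sorted(item["meta"].items())))
        if pred is None and exclude_missing:
            continue  # the "excluding missing answers" variant skips these
        considered += 1
        hits += int(pred == item["value"])
    return hits / considered if considered else 0.0
```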
