The MLnotes Newsletter

How to Extract Tables from PDF Files🗂️

Mehdi Allahyari and Angelina Yang
Jan 15, 2024


Have you ever wondered how to extract tables from PDF files?

  • Maybe we can use computer vision?

  • Or the latest and the greatest LLMs?

Today, we’ll introduce ways to achieve this using LLMs and RAG systems.

Use case

One prominent use case for extracting tables from PDFs is in the realm of data analysis and reporting. Consider a scenario where an organization receives regular financial reports or business data in PDF format. These reports contain tables with essential financial data such as revenue, expenses, and profitability. To efficiently analyze and manipulate this data, extracting tables becomes crucial.

Challenges:

  1. Manual Data Entry: Without table extraction, analysts would need to manually transcribe data from the PDFs into a spreadsheet or database, which is time-consuming and error-prone.

  2. Data Accuracy: The risk of errors during manual entry increases, leading to inaccuracies in financial analysis and decision-making.

This is where an AI solution can come in!

But this is far from being a simple task.

🤔 Why is it so tricky?

It's a challenging problem that stems from the complexity and variability of table structures, as well as the lack of semantic structure in PDF files. Tables in PDFs are not always formatted consistently. They can have merged cells, nested rows, different column spans, and more.

Additionally, PDF files are designed for human reading, not machine reading. This means they often lack the semantic cues that would make data extraction straightforward.
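
To make this concrete, here is a minimal sketch of what a program often has to work with: positioned text fragments with no notion of rows, columns, or cells. The fragment data, the `rebuild_rows` helper, and the `y_tolerance` value are all hypothetical and only for illustration; this is not the approach covered later in the post.

```python
# Hypothetical positioned text fragments, roughly what a PDF text layer exposes:
# (text, x, y). Nothing here says "this is a table" or "this is a cell".
fragments = [
    ("Revenue", 50, 100), ("1200", 200, 100), ("1350", 300, 100),
    ("Item", 50, 80), ("Q1", 200, 80), ("Q2", 300, 80),
    ("Expenses", 50, 120), ("800", 200, 121), ("900", 300, 119),
]


def rebuild_rows(fragments, y_tolerance=3):
    """Group fragments into rows by similar y, then order each row by x.

    A simple heuristic (an assumption for this sketch): a fragment joins
    the current row if its y is within y_tolerance of that row's first
    fragment; otherwise it starts a new row.
    """
    rows = []  # list of (anchor_y, [(x, text), ...])
    for text, x, y in sorted(fragments, key=lambda f: f[2]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Sort cells left to right and keep only the text
    return [[text for _, text in sorted(cells)] for _, cells in rows]


table = rebuild_rows(fragments)
for row in table:
    print(row)
```

Even on this tidy toy input, the row grouping depends on a hand-picked tolerance. A merged cell or a nested row would yield a row with fewer fragments and no indication of which columns it spans, which is where simple heuristics like this one start to break down.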

Solution?

To successfully extract tables from PDFs, sophisticated techniques are required to identify and correctly interpret the data within these tables. These techniques need to be able to handle the wide range of variability in table structures, and interpret the data in a way that maintains its original meaning and context.

There are several approaches for extracting tables from PDF files, and several libraries exist to facilitate this task. Nevertheless, none of them works 100% of the time. A recently released solution has shown promising performance on this task, and we’ll discuss it in the rest of the post.

Let’s preview some results:

Before👇

After👇

Table_0 (extracted table)

Here’s the solution:
