PyMuPDF, LLM & RAG

Integrating PyMuPDF into your Large Language Model (LLM) framework and overall Retrieval-Augmented Generation (RAG) solution provides a fast and reliable way to deliver document data.

Several well-known LLM solutions have their own interfaces to PyMuPDF. It is a fast-growing area, so please let us know if you discover any more!

Integration with LangChain

You can integrate directly with LangChain by using its dedicated loader:

from langchain_community.document_loaders import PyMuPDFLoader
loader = PyMuPDFLoader("example.pdf")
data = loader.load()

See LangChain Using PyMuPDF for full details.

Integration with LlamaIndex

Use the dedicated PyMuPDFReader from LlamaIndex 🦙 to manage your document loading.

from llama_index.readers.file import PyMuPDFReader
loader = PyMuPDFReader()
documents = loader.load(file_path="example.pdf")

See Building RAG from Scratch for more.

Preparing Data for Chunking

Chunking (or splitting) data is essential for giving your LLM the right context. Because PyMuPDF now supports Markdown output, Level 3 chunking is supported, that is, splitting that follows the document's own structure (see 5 Levels of Text Splitting).
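
To illustrate what structure-aware chunking of Markdown can look like, here is a minimal sketch that starts a new chunk at every Markdown heading. It uses only the Python standard library. The function name split_by_headings and the sample text are hypothetical, made up for this illustration; this is not how PyMuPDF or any LLM framework implements chunking, and it deliberately ignores edge cases such as "#" lines inside fenced code blocks.

```python
import re

# Matches ATX-style Markdown headings such as "# Title" or "## Section".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(md_text: str) -> list[dict]:
    """Split Markdown text into chunks, starting a new chunk at each heading."""
    chunks = []
    heading, lines = None, []

    def flush():
        # Store the chunk collected so far, unless it is completely empty.
        text = "\n".join(lines).strip()
        if heading is not None or text:
            chunks.append({"heading": heading, "text": text})

    for line in md_text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks


# Hypothetical Markdown, similar in shape to what a PDF-to-Markdown step might produce.
sample = (
    "# Intro\n\nPyMuPDF reads PDFs.\n\n"
    "## Tables\n\n| a | b |\n|---|---|\n| 1 | 2 |\n"
)

chunks = split_by_headings(sample)
for chunk in chunks:
    print(chunk["heading"], "->", len(chunk["text"]), "characters")
```

Each chunk keeps its heading next to its body text, so a table or paragraph stays associated with the section it belongs to. A production splitter would normally also enforce a maximum chunk size.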

Outputting as Markdown

To export your document in Markdown format you need a separate helper package, pdf4llm. It is a high-level wrapper around PyMuPDF functions that converts the standard text and table text of each page to Markdown and returns one integrated Markdown-formatted string covering all document pages:

import pdf4llm

md_text = pdf4llm.to_markdown("input.pdf")

# Write the text to a file in UTF-8 encoding
import pathlib
pathlib.Path("output.md").write_bytes(md_text.encode())

How to use Markdown output

Once your data is in Markdown format, you are ready to chunk (split) it and supply it to your LLM. For example, with LangChain:

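import pdf4llm
# Newer LangChain releases also provide this class in the langchain_text_splitters package
from langchain.text_splitter import MarkdownTextSplitter

# Get the Markdown text for all pages
md_text = pdf4llm.to_markdown("input.pdf")

# chunk_size and chunk_overlap are measured in characters by default
splitter = MarkdownTextSplitter(chunk_size=40, chunk_overlap=0)

# Split the text into a list of LangChain Document objects
documents = splitter.create_documents([md_text])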

For more, see 5 Levels of Text Splitting.
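
To make the chunk_size and chunk_overlap arguments more concrete, the following sketch cuts a string into fixed-size character windows that share a given number of characters. The helper split_with_overlap is hypothetical and written only for this illustration. It is not LangChain's algorithm: MarkdownTextSplitter prefers to break at Markdown-specific separators, so its output will generally differ from this plain sliding window.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters sharing chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


pieces = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(pieces)  # ['abcd', 'defg', 'ghij']
```

With chunk_overlap=0 the windows do not share any characters. A non-zero overlap repeats the tail of one chunk at the head of the next, which can help preserve context across chunk boundaries at the cost of some duplicated text.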