PyMuPDF, LLM & RAG#
Integrating PyMuPDF into your Large Language Model (LLM) framework and overall Retrieval-Augmented Generation (RAG) solution provides a fast and reliable way to deliver document data.
There are a few well-known LLM solutions that have their own interfaces with PyMuPDF. It is a fast-growing area, so please let us know if you discover any more!
If you need to export to Markdown or obtain a LlamaIndex Document from a file, try the PyMuPDF4LLM package.
Integration with LangChain#
It is simple to integrate directly with LangChain by using their dedicated loader as follows:
from langchain_community.document_loaders import PyMuPDFLoader
loader = PyMuPDFLoader("example.pdf")
data = loader.load()
See LangChain Using PyMuPDF for full details.
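As a rough sketch of what to do with the loaded result: LangChain `Document` objects expose `page_content` and `metadata` attributes, so the list returned by `loader.load()` can be inspected with a small helper. `preview_documents` is a hypothetical name, not part of LangChain or PyMuPDF, and the `source` metadata key is an assumption (the helper falls back to `?` if it is missing). The stand-in objects only mimic those two attributes so the snippet runs without a PDF.

```python
from types import SimpleNamespace


def preview_documents(docs, width=60):
    """Return one summary line per document (hypothetical helper).

    Relies only on the `page_content` and `metadata` attributes.
    """
    lines = []
    for index, doc in enumerate(docs):
        # Collapse whitespace so each preview stays on one line.
        text = " ".join(doc.page_content.split())
        # "source" is an assumed metadata key; fall back if absent.
        source = doc.metadata.get("source", "?")
        lines.append(f"[{index}] {source}: {text[:width]}")
    return lines


# Stand-ins for loader output; with a real PDF use: data = loader.load()
data = [
    SimpleNamespace(page_content="First page\ntext", metadata={"source": "example.pdf"}),
    SimpleNamespace(page_content="Second page", metadata={"source": "example.pdf"}),
]
preview = preview_documents(data)
for line in preview:
    print(line)
```

A quick preview like this helps confirm that text was actually extracted before the documents are passed on to a splitter or vector store.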
Integration with LlamaIndex#
Use the dedicated PyMuPDFReader from LlamaIndex 🦙 to manage your document loading.
from llama_index.readers.file import PyMuPDFReader
loader = PyMuPDFReader()
documents = loader.load(file_path="example.pdf")
See Building RAG from Scratch for more.
Preparing Data for Chunking#
Chunking (or splitting) data is essential to give context to your LLM data. Because PyMuPDF now supports Markdown output, Level 3 chunking (splitting along the document's own structure, such as headers) is supported.
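To make the idea of structure-based chunking concrete, here is a minimal standard-library sketch that splits a Markdown string into one section per header. It is not how PyMuPDF or LangChain implement splitting; `split_by_headers` is a hypothetical helper, and it only recognises `#`-style headers outside fenced code blocks.

```python
import re

# One to six "#" characters, whitespace, then the header text.
HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headers(md_text):
    """Split Markdown into sections, one per header (illustrative only)."""
    sections = []
    header, lines = None, []
    in_fence = False
    for line in md_text.splitlines():
        # Track fenced code so "# comment" lines inside it are not headers.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADER.match(line)
        if match:
            # Close the previous section before starting a new one.
            if header is not None or lines:
                sections.append((header, lines))
            header, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    if header is not None or lines:
        sections.append((header, lines))
    return [{"header": h, "text": "\n".join(ls).strip()} for h, ls in sections]


sample = "# Title\nIntro text.\n\n## Usage\nRun it.\n"
chunks = split_by_headers(sample)
for chunk in chunks:
    print(chunk["header"], "->", chunk["text"])
```

Each chunk keeps its header alongside its text, which is the context that plain fixed-size character splitting tends to lose.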
Outputting as Markdown#
To export your document in Markdown format you need a separate helper package. PyMuPDF4LLM is a high-level wrapper of PyMuPDF functions that extracts the standard text and table text of each page and combines them into a single Markdown-formatted string covering all document pages:
# convert the document to markdown
import pymupdf4llm
md_text = pymupdf4llm.to_markdown("input.pdf")
# write the text to a file in UTF-8 encoding
import pathlib
pathlib.Path("output.md").write_bytes(md_text.encode())
For further information please refer to: PyMuPDF4LLM.
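The `md_text.encode()` call above uses Python's default of UTF-8. An alternative, sketched here with a stand-in string and a temporary directory so it runs without a PDF, is `Path.write_text` with an explicit encoding. Note one difference: `write_text` may translate line endings on some platforms, whereas `write_bytes` writes the bytes unchanged.

```python
import pathlib
import tempfile

# Stand-in for the string returned by pymupdf4llm.to_markdown().
md_text = "# Résumé\n\n| a | b |\n|---|---|\n| 1 | 2 |\n"

with tempfile.TemporaryDirectory() as tmp:
    out_path = pathlib.Path(tmp) / "output.md"
    # An explicit encoding avoids depending on the platform default.
    out_path.write_text(md_text, encoding="utf-8")
    # Reading back with the same encoding recovers the original string.
    round_trip = out_path.read_text(encoding="utf-8")

print(round_trip == md_text)
```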
How to use Markdown output#
Once you have your data in Markdown format you are ready to chunk/split it and supply it to your LLM. For example, with LangChain:
import pymupdf4llm
# recent LangChain releases ship the splitters in the langchain-text-splitters package
from langchain_text_splitters import MarkdownTextSplitter
# Get the MD text
md_text = pymupdf4llm.to_markdown("input.pdf") # get markdown for all pages
splitter = MarkdownTextSplitter(chunk_size=40, chunk_overlap=0)
# Split the Markdown into a list of LangChain Document chunks
documents = splitter.create_documents([md_text])
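The `chunk_size` and `chunk_overlap` arguments above set the maximum chunk length (by default measured in characters) and how much text adjacent chunks share. The sketch below illustrates only those two parameters with a plain sliding window. It is not LangChain's algorithm, which prefers to break at Markdown-aware separators, and `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of at most chunk_size characters (illustrative only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


no_overlap = sliding_chunks("abcdefghij", chunk_size=4)
with_overlap = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(no_overlap)    # ['abcd', 'efgh', 'ij']
print(with_overlap)  # ['abcd', 'defg', 'ghij']
```

With an overlap of 1, the last character of each chunk reappears at the start of the next, which can help when a sentence straddles a chunk boundary.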
For more, see 5 Levels of Text Splitting.