Feature Comparison

Feature Comparison Table

The following table shows how PyMuPDF compares with other common solutions.

Feature PyMuPDF pikepdf PyPDF2 pdfrw pdfplumber / pdfminer
Supports Multiple Document Formats PDF XPS EPUB MOBI FB2 CBZ SVG TXT Image DOCX XLSX PPTX HWPX (see note) PDF PDF PDF PDF
Implementation Python and C Python and C++ Python Python Python
Render Document Pages All document types No rendering No rendering No rendering No rendering
Write Text to PDF Page See: Page.insert_htmlbox, Page.insert_textbox, or TextWriter
Supports CJK characters
Extract Text All document types PDF only PDF only
Extract Text as Markdown (.md) All document types
Extract Tables All document types PDF only
Extract Vector Graphics All document types Limited
Draw Vector Graphics (PDF)
Based on Existing, Mature Library MuPDF QPDF
Automatic Repair of Damaged PDFs
Encrypted PDFs Limited Limited
Linearized PDFs
Incremental Updates
Integrates with Jupyter and IPython Notebooks
Joining / Merging PDF with other Document Types All document types PDF only PDF only PDF only PDF only
OCR API for Seamless Integration with Tesseract All document types
Integrated Checkpoint / Restart Feature (PDF)
PDF Optional Content
PDF Embedded Files Limited Limited
PDF Redactions
PDF Annotations Full Limited
PDF Form Fields Create, read, update Limited, no creation
PDF Page Labels Read-only
Support Font Sub-Setting


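Note

A note on Office document types (DOCX, XLSX, PPTX) and Hangul documents (HWPX): these documents can be loaded into PyMuPDF, and you will receive a Document object.

The following caveats apply:

  • The input is converted to HTML in order to lay out the content.

  • As a result, the original page breaks are lost.

When saving the result, an exact representation of the original layout cannot be expected.

The input files are therefore mainly useful for text extraction.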


PyMuPDF Product Suite

PyMuPDF is the standard version of the library; however, there is a family of additional products, each with different features and functionality.

Additional products in the PyMuPDF product suite are:

  • PyMuPDF Pro adds support for Office document formats.

  • PyMuPDF4LLM is optimized for large language model (LLM) applications, providing enhanced text extraction and processing capabilities.

It focuses on layout analysis and semantic understanding, ideal for document conversion and formatting tasks with enhanced results.

Note

All of the products above depend on the same core product, PyMuPDF, and therefore have full access to all of its features. These additional products can be seen as optional extras that enhance the core PyMuPDF library.

PyMuPDF Products Comparison

The following table illustrates what features the products offer:

Input Documents

  • PyMuPDF: PDF, XPS, EPUB, CBZ, MOBI, FB2, SVG, TXT, Images (standard document types)

  • PyMuPDF Pro: as PyMuPDF, plus DOC/DOCX, XLS/XLSX, PPT/PPTX, HWP/HWPX

  • PyMuPDF4LLM: as PyMuPDF

Output Documents

  • PyMuPDF: can convert any input document to PDF, SVG or Image

  • PyMuPDF Pro: as PyMuPDF

  • PyMuPDF4LLM: as PyMuPDF, plus Markdown (MD), JSON or TXT

Page Analysis

  • PyMuPDF: basic page analysis to return document structure

  • PyMuPDF Pro: as PyMuPDF

  • PyMuPDF4LLM: advanced page analysis with trained data for enhanced results

Data extraction

  • PyMuPDF: basic data extraction with structured layout information and bounding box data

  • PyMuPDF Pro: as PyMuPDF

  • PyMuPDF4LLM: advanced data extraction including layout analysis with semantic understanding and enhanced bounding box data

Table extraction

  • PyMuPDF: basic table extraction as part of text extraction

  • PyMuPDF Pro: as PyMuPDF

  • PyMuPDF4LLM: advanced table extraction with cell structure, including support for merged cells and complex layouts

Image extraction

  • PyMuPDF: basic image extraction

  • PyMuPDF Pro: as PyMuPDF

  • PyMuPDF4LLM: advanced detection and rendering of image areas on the page, saving them to disk or embedding them in MD output

Vector extraction

  • PyMuPDF: vector extraction and clustering

  • PyMuPDF Pro: as PyMuPDF

  • PyMuPDF4LLM: superior detection of “picture” areas

Popular RAG Integrations

  • PyMuPDF: LangChain, LlamaIndex

  • PyMuPDF Pro: as PyMuPDF

  • PyMuPDF4LLM: as PyMuPDF, with some additional helper methods for RAG workflows

OCR

  • PyMuPDF: on-demand invocation of built-in Tesseract for text detection on pages or images

  • PyMuPDF Pro: as PyMuPDF

  • PyMuPDF4LLM: automatic OCR based on page content analysis; OCR adapters for popular OCR engines are available


Performance

To benchmark PyMuPDF performance across a range of tasks, we measure timings with a fixed test suite of 8 PDF files containing text and images, with a total of 7,031 pages.

Here are the current results, grouped by task:


Copying

This refers to opening a document and then saving it to a new file. This test measures the speed of reading a PDF and re-writing as a new PDF. This process is also at the core of functions like merging / joining multiple documents. The numbers below therefore apply to PDF joining and merging.

The results for all 7,031 pages are:

Time in seconds, fastest to slowest:

  • PyMuPDF: 3.05 (fastest)

  • PDFrw: 10.54

  • PikePDF: 33.57

  • PyPDF2: 494.04 (slowest)

Text Extraction

This refers to extracting simple, plain text from every page of the document and storing it in a text file.

The results for all 7,031 pages are:

Time in seconds, fastest to slowest:

  • PyMuPDF: 8.01 (fastest)

  • XPDF: 27.42

  • PyPDF2: 101.64

  • PDFMiner: 227.27 (slowest)
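
To read these timings in relative terms, the short sketch below derives two figures from the reported numbers: how many times slower each tool is than PyMuPDF, and the implied pages per second. The variable names are illustrative; only the timings and the page count come from the results above.

```python
# Text extraction timings in seconds for all 7,031 pages, as reported above.
timings = {"PyMuPDF": 8.01, "XPDF": 27.42, "PyPDF2": 101.64, "PDFMiner": 227.27}
total_pages = 7031

baseline = timings["PyMuPDF"]

# Ratio of each tool's time to PyMuPDF's time (1.0 for PyMuPDF itself).
ratios = {tool: seconds / baseline for tool, seconds in timings.items()}

# Average throughput implied by each timing.
throughput = {tool: total_pages / seconds for tool, seconds in timings.items()}

for tool in timings:
    print(f"{tool}: {ratios[tool]:.1f}x, {throughput[tool]:.0f} pages/s")
```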

Rendering

This refers to making an image (like PNG) from every page of a document at a given DPI resolution. This feature is the basis for displaying a document in a GUI window.

The results for all 7,031 pages are:

Time in seconds, fastest to slowest:

  • PyMuPDF: 367.04 (fastest)

  • XPDF: 646

  • PDF2JPG: 851.52 (slowest)

Note

For more details on the methodology behind these performance timings, see Performance Comparison Methodology.