Features Comparison

Feature Matrix

The following table shows how PyMuPDF compares with other typical solutions.

Feature PyMuPDF pikepdf PyPDF2 pdfrw pdfplumber / pdfminer
Supports Multiple Document Formats PDF XPS EPUB MOBI FB2 CBZ SVG TXT Image DOCX XLSX PPTX HWPX (see note) PDF PDF PDF PDF
Implementation Python and C Python and C++ Python Python Python
Render Document Pages All document types No rendering No rendering No rendering No rendering
Write Text to PDF Page See: Page.insert_htmlbox, Page.insert_textbox, or TextWriter
Supports CJK characters
Extract Text All document types PDF only PDF only
Extract Text as Markdown (.md) All document types
Extract Tables All document types PDF only
Extract Vector Graphics All document types Limited
Draw Vector Graphics (PDF)
Based on Existing, Mature Library MuPDF QPDF
Automatic Repair of Damaged PDFs
Encrypted PDFs Limited Limited
Linearized PDFs
Incremental Updates
Integrates with Jupyter and IPython Notebooks
Joining / Merging PDF with other Document Types All document types PDF only PDF only PDF only PDF only
OCR API for Seamless Integration with Tesseract All document types
Integrated Checkpoint / Restart Feature (PDF)
PDF Optional Content
PDF Embedded Files Limited Limited
PDF Redactions
PDF Annotations Full Limited
PDF Form Fields Create, read, update Limited, no creation
PDF Page Labels
Support Font Sub-Setting



Note

This note concerns Office document types (DOCX, XLSX, PPTX) and Hangul documents (HWPX). These documents can be loaded into PyMuPDF, and you will receive a Document object.

There are some caveats:

  • We convert the input to HTML to lay out the content.

  • Because of this, the original page separation is lost.

When the result is saved, a faithful representation of the original layout cannot be expected.

These input files are therefore mostly useful for text extraction.


Performance

To benchmark PyMuPDF performance on a range of tasks, we use a test suite with a fixed set of 8 PDF files (7,031 pages in total) containing text and images.
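
A timing run of this kind could be sketched as follows. This is an illustrative harness only, not the project's actual benchmark code: the function name `time_task` and the idea of passing one callable per task are assumptions.

```python
import time


def time_task(task, paths):
    """Run task(path) for every path and return the total elapsed seconds."""
    start = time.perf_counter()
    for path in paths:
        task(path)
    return time.perf_counter() - start
```

Here `task` would be a function that takes one file path and performs a single job (copying, text extraction, or rendering), and `paths` would be the fixed set of test files.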

Here are the current results, grouped by task:


Copying

This refers to opening a document and then saving it to a new file. This test measures the speed of reading a PDF and re-writing as a new PDF. This process is also at the core of functions like merging / joining multiple documents. The numbers below therefore apply to PDF joining and merging.

The results for all 7,031 pages are:

  • PyMuPDF: 3.05 seconds (fastest)

  • PDFrw: 10.54 seconds

  • PikePDF: 33.57 seconds

  • PyPDF2: 494.04 seconds (slowest)

Text Extraction

This refers to extracting simple, plain text from every page of the document and storing it in a text file.

The results for all 7,031 pages are:

  • PyMuPDF: 8.01 seconds (fastest)

  • XPDF: 27.42 seconds

  • PyPDF2: 101.64 seconds

  • PDFMiner: 227.27 seconds (slowest)

Rendering

This refers to making an image (like PNG) from every page of a document at a given DPI resolution. This feature is the basis for displaying a document in a GUI window.

The results for all 7,031 pages are:

  • PyMuPDF: 367.04 seconds (fastest)

  • XPDF: 646 seconds

  • PDF2JPG: 851.52 seconds

Note

For more detail on the methodology behind these performance timings, see Performance Comparison Methodology.