.. include:: ../header.rst .. _pymupdf-pro: .. raw:: html .. _tesseract-language-packs: Tesseract Language Packs ======================== .. meta:: :description: How to install additional Tesseract language packs on macOS, Linux, and Windows. Overview -------- Tesseract identifies languages using three-letter `ISO 639-2 `_ codes. English (``eng``) is installed by default on most platforms. For any other language, you need to install the corresponding language pack before pymupdf4llm can use it for OCR. A full list of supported language codes is available on the `Tesseract tessdata repository `_. .. tip:: To see which languages are already installed on your system, run ``tesseract --list-langs`` in your terminal. ---- Linux ----- Language pack installation varies slightly by distribution. **Ubuntu / Debian** .. code-block:: bash # List all available language packs apt-cache search tesseract-ocr # Install a specific language (e.g. German) sudo apt install tesseract-ocr-deu # Install all available languages at once sudo apt install tesseract-ocr-all Language packages follow the naming pattern ``tesseract-ocr-``, for example ``tesseract-ocr-fra`` for French or ``tesseract-ocr-chi-sim`` for Simplified Chinese. **Fedora / RHEL** .. code-block:: bash # Search for available language packs dnf search tesseract # Install a specific language (e.g. German) sudo dnf install tesseract-langpack-deu # Install all language packs sudo dnf install tesseract-langpack-* On Fedora, packages are named ``tesseract-langpack-``. **Arch Linux** .. code-block:: bash # Search for available language packs pacman -Ss tesseract-data # Install a specific language (e.g. German) sudo pacman -S tesseract-data-deu On Arch, packages are named ``tesseract-data-``. Manual Installation (All Distros) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If a language pack is not available through your package manager, download the ``.traineddata`` file directly from GitHub and copy it to your Tesseract data directory: .. code-block:: bash # Download language pack (e.g. French) curl -L https://github.com/tesseract-ocr/tessdata/raw/main/fra.traineddata \ -o fra.traineddata # Copy to tessdata directory (path varies by distro) sudo cp fra.traineddata /usr/share/tesseract-ocr/4.00/tessdata/ # or sudo cp fra.traineddata /usr/share/tessdata/ Common tessdata locations on Linux: .. list-table:: :header-rows: 1 :widths: 40 60 * - Distribution - Path * - Ubuntu / Debian - ``/usr/share/tesseract-ocr/4.00/tessdata/`` * - Fedora / RHEL - ``/usr/share/tesseract/tessdata/`` * - Arch Linux - ``/usr/share/tessdata/`` ---- Windows ------- During Installation (Recommended) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The Tesseract Windows installer from `UB Mannheim `_ lets you select additional language packs during setup. When you reach the **Choose Components** screen, expand **Additional language data** and tick the languages you need. After Installation (Manual) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If Tesseract is already installed, download language packs manually: 1. Go to `github.com/tesseract-ocr/tessdata `_ 2. Download the ``.traineddata`` file for your language (e.g. ``fra.traineddata`` for French) 3. Copy the file into your Tesseract ``tessdata`` folder, typically: .. code-block:: text C:\Program Files\Tesseract-OCR\tessdata\ .. note:: The Chocolatey (``choco install tesseract``) package only includes English. All additional languages must be added manually using the steps above. Verify the Install ~~~~~~~~~~~~~~~~~~ Open Command Prompt or PowerShell and run: .. code-block:: powershell tesseract --list-langs Your newly installed language should appear in the output. ---- macOS ----- The recommended approach on macOS is `Homebrew `_. There are two options depending on how much disk space you want to use. Install All Languages at Once ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The ``tesseract-lang`` formula bundles Tesseract with every available language pack: .. code-block:: bash brew install tesseract-lang Install Specific Languages ~~~~~~~~~~~~~~~~~~~~~~~~~~ If you only need a few languages, install ``tesseract`` first and then manually download the ``.traineddata`` files you need: .. code-block:: bash # Install Tesseract engine only brew install tesseract # Find the tessdata directory brew info tesseract # Look for a line like: /opt/homebrew/share/tessdata # Download a specific language pack (e.g. French) curl -L https://github.com/tesseract-ocr/tessdata/raw/main/fra.traineddata \ -o /opt/homebrew/share/tessdata/fra.traineddata Replace ``fra`` with your target language code and adjust the tessdata path to match what ``brew info tesseract`` reports on your machine. .. note:: If you installed Tesseract via MacPorts instead of Homebrew, use ``port install tesseract-``, for example ``sudo port install tesseract-fra``. ---- Using a Language with pymupdf4llm ---------------------------------- Once a language pack is installed, pass its code to ``to_markdown()`` via the ``ocr_language`` parameter: .. code-block:: python import pymupdf4llm # Single language md = pymupdf4llm.to_markdown("document.pdf", ocr_language="fra") # Multiple languages md = pymupdf4llm.to_markdown("document.pdf", ocr_language="eng+fra+deu") ---- Common Language Codes --------------------- .. list-table:: :header-rows: 1 :widths: 50 50 * - Language - Code * - English - ``eng`` * - French - ``fra`` * - German - ``deu`` * - Spanish - ``spa`` * - Italian - ``ita`` * - Portuguese - ``por`` * - Simplified Chinese - ``chi_sim`` * - Traditional Chinese - ``chi_tra`` * - Japanese - ``jpn`` * - Korean - ``kor`` * - Arabic - ``ara`` * - Russian - ``rus`` * - Hindi - ``hin`` For the full list of supported languages and their codes, see the `Tesseract tessdata repository `_. .. include:: ../footer.rst