How to Extract all Document Text
This script will take a document filename and generate a text file from all of its text.
The document can be any supported type like PDF, XPS, etc.
The script works as a command line tool which expects the document filename supplied as a parameter. It generates one text file named “filename.txt” in the script directory. Text of pages is separated by a form feed character:
    import sys, fitz
    fname = sys.argv[1]  # get document filename
    doc = fitz.open(fname)  # open document
    out = open(fname + ".txt", "wb")  # open text output
    for page in doc:  # iterate the document pages
        text = page.get_text().encode("utf8")  # get plain text (is in UTF-8)
        out.write(text)  # write text of page
        out.write(bytes((12,)))  # write page delimiter (form feed 0x0C)
    out.close()
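Because every page's text is followed by a form feed, a file written this way can later be split back into per-page strings. The following is a minimal sketch under that assumption; the helper name `split_pages` is hypothetical and not part of PyMuPDF.

```python
def split_pages(data: bytes) -> list:
    """Split the script's output back into one string per page."""
    text = data.decode("utf8")
    pages = text.split("\f")  # "\f" is the form feed character 0x0C
    # every page is *followed* by a form feed, so the last piece is empty
    if pages and pages[-1] == "":
        pages.pop()
    return pages


# two pages, each terminated by a form feed as in the script above
sample = b"first page\n" + bytes((12,)) + b"second page\n" + bytes((12,))
pages = split_pages(sample)
```

This only works cleanly if the page text itself contains no form feed characters, which is not guaranteed for every document.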
The output will be plain text as it is coded in the document. No effort is made to prettify in any way. Specifically for PDF, this may mean output not in usual reading order, unexpected line breaks and so forth.
You have many options to rectify this – see chapter Appendix 2: Considerations on Embedded Files. Among them are:
Extract text in HTML format and store it as a HTML document, so it can be viewed in any browser.
Extract text as a list of text blocks via Page.get_text("blocks"). Each item of this list contains position information for its text, which can be used to establish a convenient reading order.
Extract a list of single words via Page.get_text("words"). Its items are words with position information. Use it to determine text contained in a given rectangle – see next section.
See the following two sections for examples and further explanations.
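As an illustration of using block positions to establish a reading order, here is a small sketch. It assumes each block is a tuple that starts with `(x0, y0, x1, y1, text, ...)`, as returned by the "blocks" option, and it simply sorts by the top-left corner. The function name `sort_blocks` is hypothetical, and the approach is only adequate for simple single-column pages.

```python
def sort_blocks(blocks):
    """Order block tuples top-to-bottom, then left-to-right."""
    # b[1] is the top coordinate (y0), b[0] the left coordinate (x0)
    return sorted(blocks, key=lambda b: (b[1], b[0]))


# sample data shaped like block tuples; the header was stored last
blocks = [
    (50, 100, 300, 120, "First paragraph", 0, 0),
    (50, 130, 300, 150, "Second paragraph", 1, 0),
    (50, 40, 300, 55, "Header", 2, 0),
]
ordered = [b[4] for b in sort_blocks(blocks)]
```

With a real page, the input would come from `page.get_text("blocks")`.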
How to Extract Text from within a Rectangle
There is now (v1.18.0) more than one way to achieve this. We therefore have created a folder in the PyMuPDF-Utilities repository specifically dealing with this topic.
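One simple approach is to filter the word list by position. The sketch below assumes word tuples of the form `(x0, y0, x1, y1, word, block_no, line_no, word_no)`, as produced by the "words" option, and keeps only words lying fully inside a rectangle given as `(x0, y0, x1, y1)`. The helper name `words_in_rect` is hypothetical; it is not one of the scripts in the repository mentioned above.

```python
def words_in_rect(words, rect):
    """Return the text of all words whose bbox lies fully inside rect."""
    rx0, ry0, rx1, ry1 = rect
    inside = [
        w for w in words
        if w[0] >= rx0 and w[1] >= ry0 and w[2] <= rx1 and w[3] <= ry1
    ]
    # keep the order in which the words occur: block, line, word number
    inside.sort(key=lambda w: (w[5], w[6], w[7]))
    return " ".join(w[4] for w in inside)


# sample data shaped like word tuples
words = [
    (50, 100, 80, 112, "Hello", 0, 0, 0),
    (85, 100, 120, 112, "world", 0, 0, 1),
    (50, 300, 90, 312, "elsewhere", 1, 0, 0),
]
selected = words_in_rect(words, (40, 90, 200, 120))
```

Whether partially overlapping words should count as "inside" is a design choice; this sketch excludes them.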
How to Extract Text in Natural Reading Order
One of the common issues with PDF text extraction is that text may not appear in any particular reading order.
This is the responsibility of the PDF creator (software or a human). For example, page headers may have been inserted in a separate step – after the document had been produced. In such a case, the header text will appear at the end of a page text extraction (although it will be correctly shown by PDF viewer software). For example, the following snippet will add some header and footer lines to an existing PDF:
    import fitz
    doc = fitz.open("some.pdf")
    header = "Header"  # text in header
    footer = "Page %i of %i"  # text in footer
    for page in doc:
        page.insert_text((50, 50), header)  # insert header
        page.insert_text(  # insert footer 50 points above page bottom
            (50, page.rect.height - 50),
            footer % (page.number + 1, doc.page_count),
        )
The text sequence extracted from a page modified in this way will look like this:
PyMuPDF has several means to re-establish some reading sequence or even to re-generate a layout close to the original:
Use the sort parameter of Page.get_text(). It will sort the output from top-left to bottom-right (ignored for XHTML, HTML and XML output).
Use the fitz module in CLI: python -m fitz gettext ..., which produces a text file where text has been re-arranged in layout-preserving mode. Many options are available to control the output.
You can also use the above mentioned script with your modifications.
How to Extract Tables from Documents
If you see a table in a document, you are not normally looking at something like an embedded Excel or other identifiable object. It usually is just text, formatted to appear as appropriate.
Extracting tabular data from such a page area therefore means that you must find a way to (1) graphically indicate table and column borders, and (2) then extract text based on this information.
The wxPython GUI script extract.py strives to exactly do that. You may want to have a look at it and adjust it to your liking.
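To make step (2) more concrete, here is a rough sketch of how words could be assigned to table cells once the column borders are known. It is not taken from extract.py. It assumes word tuples starting with `(x0, y0, x1, y1, word, ...)`, column borders given as a sorted list of x-coordinates, and rows recognized by similar top coordinates. The function name `table_from_words` and the tolerance value are hypothetical.

```python
def table_from_words(words, col_edges, row_tol=3.0):
    """Group word tuples into rows and columns of cell strings."""
    n_cols = len(col_edges) - 1
    rows = []  # list of (y0 of the row's first word, list of cells)
    for w in sorted(words, key=lambda w: (w[1], w[0])):
        x0, y0, text = w[0], w[1], w[4]
        # find the column whose borders enclose the word's left edge
        col = None
        for i in range(n_cols):
            if col_edges[i] <= x0 < col_edges[i + 1]:
                col = i
                break
        if col is None:
            continue  # word lies outside the table area
        # start a new row unless the word is vertically close to the last one
        if rows and abs(rows[-1][0] - y0) <= row_tol:
            cells = rows[-1][1]
        else:
            cells = [[] for _ in range(n_cols)]
            rows.append((y0, cells))
        cells[col].append((x0, text))
    # inside each cell, order the words from left to right
    return [[" ".join(t for _, t in sorted(c)) for c in cells] for _, cells in rows]


words = [
    (50, 100, 80, 112, "Name", 0, 0, 0),
    (150, 100, 180, 112, "Qty", 0, 0, 1),
    (50, 120, 80, 132, "Green", 0, 1, 0),
    (85, 120.5, 110, 132, "tea", 0, 1, 1),
    (150, 120, 160, 132, "2", 0, 1, 2),
]
table = table_from_words(words, col_edges=[40, 140, 200])
```

Cells spanning several text lines, or rows with very different font sizes, would need a more careful row detection than this single tolerance.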
How to Mark Extracted Text
There is a standard search function to search for arbitrary text on a page: Page.search_for(). It returns a list of Rect objects which surround a found occurrence. These rectangles can for example be used to automatically insert annotations which visibly mark the found text.
This method has advantages and drawbacks. Pros are:
The search string can contain blanks and wrap across lines
Upper or lower case characters are treated equal
Word hyphenation at line ends is detected and resolved
The return value may also be a list of Quad objects to precisely locate text that is not parallel to either axis – using Quad output is also recommended when page rotation is not zero
But you also have other options:
    import sys
    import fitz

    def mark_word(page, text):
        """Underline each word that contains 'text'."""
        found = 0
        wlist = page.get_text("words")  # make the word list
        for w in wlist:  # scan through all words on page
            if text in w[4]:  # w[4] is the word's string
                found += 1  # count
                r = fitz.Rect(w[:4])  # make rect from word bbox
                page.add_underline_annot(r)  # underline
        return found

    fname = sys.argv[1]  # filename
    text = sys.argv[2]  # search string
    doc = fitz.open(fname)

    print("underlining words containing '%s' in document '%s'" % (text, doc.name))

    new_doc = False  # indicator if anything found at all

    for page in doc:  # scan through the pages
        found = mark_word(page, text)  # mark the page's words
        if found:  # if anything found ...
            new_doc = True
            print("found '%s' %i times on page %i" % (text, found, page.number + 1))

    if new_doc:
        doc.save("marked-" + doc.name)
This script uses Page.get_text("words") to look for a string, handed in via CLI parameter. This method separates a page’s text into “words” using spaces and line breaks as delimiters. Further remarks:
If found, the complete word containing the string is marked (underlined) – not only the search string.
The search string may not contain spaces or other white space.
As shown here, upper / lower cases are respected. But this can be changed by using the string method lower() (or even regular expressions) in function mark_word.
There is no upper limit: all occurrences will be detected.
You can use anything to mark the word: ‘Underline’, ‘Highlight’, ‘StrikeThrough’ or ‘Square’ annotations, etc.
Here is an example snippet of a page of this manual, where “MuPDF” has been used as the search string. Note that all strings containing “MuPDF” have been completely underlined (not just the search string).
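The remark about upper and lower case can be illustrated with a small variation of the matching logic. The sketch below works on plain word tuples of the form `(x0, y0, x1, y1, word, ...)` and returns the bounding boxes of matching words; the function name `find_words` and its `ignore_case` parameter are hypothetical.

```python
def find_words(words, needle, ignore_case=True):
    """Return the bboxes of all words that contain needle."""
    if ignore_case:
        needle = needle.lower()
    hits = []
    for w in words:
        word = w[4].lower() if ignore_case else w[4]
        if needle in word:
            hits.append(tuple(w[:4]))  # bbox of the complete word
    return hits


words = [
    (50, 100, 90, 112, "MuPDF", 0, 0, 0),
    (95, 100, 150, 112, "PyMuPDF,", 0, 0, 1),
    (155, 100, 190, 112, "mupdf", 0, 0, 2),
    (195, 100, 230, 112, "other", 0, 0, 3),
]
hits_any_case = find_words(words, "MuPDF")
hits_exact = find_words(words, "MuPDF", ignore_case=False)
```

Each returned bbox could then be turned into a Rect and passed to an annotation method, as done in mark_word.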
How to Mark Searched Text
This script searches for text and marks it:
    # -*- coding: utf-8 -*-
    import fitz

    # the document to annotate
    doc = fitz.open("tilted-text.pdf")

    # the text to be marked
    t = "¡La práctica hace el campeón!"

    # work with first page only
    page = doc[0]

    # get list of text locations
    # we use "quads", not rectangles because text may be tilted!
    rl = page.search_for(t, quads=True)

    # mark all found quads with one annotation
    page.add_squiggly_annot(rl)

    # save to a new PDF
    doc.save("a-squiggly.pdf")
The result looks like this:
How to Mark Non-horizontal Text
The previous section already shows an example for marking non-horizontal text that was detected by text searching.
But text extraction with the “dict” / “rawdict” options of Page.get_text() may also return text with a non-zero angle to the x-axis. This is indicated by the value of the line dictionary’s "dir" key: it is the tuple (cosine, sine) for that angle. If line["dir"] != (1, 0), then the text of all its spans is rotated by (the same) angle != 0.
The “bboxes” returned by the method however are rectangles only – not quads. So, to mark span text correctly, its quad must be recovered from the data contained in the line and span dictionary. Do this with the following utility function (new in v1.18.9):
    span_quad = fitz.recover_quad(line["dir"], span)
    annot = page.add_highlight_annot(span_quad)  # this will mark the complete span text
If you want to mark the complete line or a subset of its spans in one go, use the following snippet (works for v1.18.10 or later):
    line_quad = fitz.recover_line_quad(line, spans=line["spans"][1:-1])
    page.add_highlight_annot(line_quad)
The spans argument above may specify any sub-list of line["spans"]. In the example above, the second to second-to-last span are marked. If omitted, the complete line is taken.
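The "dir" tuple described at the start of this section can be converted back to an angle with basic trigonometry. This is a small sketch using only the standard library; the helper names `line_angle` and `is_rotated` are hypothetical and not part of PyMuPDF.

```python
import math


def line_angle(direction):
    """Angle in degrees for a (cosine, sine) tuple as found in line["dir"]."""
    cosine, sine = direction
    return math.degrees(math.atan2(sine, cosine))


def is_rotated(direction, eps=1e-6):
    """True if the direction differs from (1, 0) by more than eps."""
    return abs(direction[0] - 1) > eps or abs(direction[1]) > eps


angle_flat = line_angle((1, 0))
angle_diag = line_angle((math.sqrt(0.5), math.sqrt(0.5)))
```

Comparing with a small tolerance instead of testing `!= (1, 0)` directly is a precaution against floating point noise. Whether a positive angle appears clockwise or counter-clockwise on the page depends on the direction of the y-axis in the coordinate system, which this sketch does not address.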
How to Analyze Font Characteristics
To analyze the characteristics of text in a PDF use this elementary script as a starting point:
    import sys
    import fitz

    def flags_decomposer(flags):
        """Make font flags human readable."""
        l = []
        if flags & 2 ** 0:
            l.append("superscript")
        if flags & 2 ** 1:
            l.append("italic")
        if flags & 2 ** 2:
            l.append("serifed")
        else:
            l.append("sans")
        if flags & 2 ** 3:
            l.append("monospaced")
        else:
            l.append("proportional")
        if flags & 2 ** 4:
            l.append("bold")
        return ", ".join(l)

    doc = fitz.open(sys.argv[1])
    page = doc[0]

    # read page text as a dictionary, suppressing extra spaces in CJK fonts
    blocks = page.get_text("dict", flags=11)["blocks"]
    for b in blocks:  # iterate through the text blocks
        for l in b["lines"]:  # iterate through the text lines
            for s in l["spans"]:  # iterate through the text spans
                print("")
                font_properties = "Font: '%s' (%s), size %g, color #%06x" % (
                    s["font"],  # font name
                    flags_decomposer(s["flags"]),  # readable font flags
                    s["size"],  # font size
                    s["color"],  # font color
                )
                print("Text: '%s'" % s["text"])  # simple print of text
                print(font_properties)
Here is the PDF page and the script output:
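The same dictionary structure can also be summarized instead of printed span by span. The sketch below counts how many characters use each font and size combination. It assumes the blocks / lines / spans layout used in the script above; the function name `font_usage` is hypothetical.

```python
from collections import Counter


def font_usage(page_dict):
    """Count characters per (font name, font size) combination."""
    counts = Counter()
    for block in page_dict["blocks"]:
        # be defensive: blocks without text lines are skipped
        for line in block.get("lines", []):
            for span in line["spans"]:
                counts[(span["font"], span["size"])] += len(span["text"])
    return counts


# sample data shaped like the output of the "dict" option
sample = {
    "blocks": [
        {"lines": [{"spans": [
            {"font": "Helvetica", "size": 11, "text": "Hello "},
            {"font": "Helvetica-Bold", "size": 11, "text": "world"},
        ]}]},
        {"lines": [{"spans": [
            {"font": "Helvetica", "size": 11, "text": "again"},
        ]}]},
    ]
}
usage = font_usage(sample)
```

With a real page, `page.get_text("dict")` would supply the input, and `usage.most_common(1)` would point to the dominant body font.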
How to Insert Text
PyMuPDF provides ways to insert text on new or existing PDF pages with the following features:
choose the font, including built-in fonts and fonts that are available as files
choose text characteristics like bold, italic, font size, font color, etc.
position the text in multiple ways:
either as simple line-oriented output starting at a certain point,
or fitting text in a box provided as a rectangle, in which case text alignment choices are also available,
choose whether text should be put in foreground (overlay existing content),
all text can be arbitrarily “morphed”, i.e. its appearance can be changed via a Matrix, to achieve effects like scaling, shearing or mirroring,
independently from morphing and in addition to that, text can be rotated by integer multiples of 90 degrees.
All of the above is provided by three basic Page, resp. Shape methods:
Page.insert_font() – install a font for the page for later reference. The result is reflected in the output of Document.get_page_fonts(). The font can be:
provided as a file,
via Font (then use
already present somewhere in this or another PDF, or
be a built-in font.
Page.insert_text() – write some lines of text. Internally, this uses Shape.insert_text().
Page.insert_textbox() – fit text in a given rectangle. Here you can choose text alignment features (left, right, centered, justified) and you keep control as to whether text actually fits. Internally, this uses Shape.insert_textbox().
Both text insertion methods automatically install the font as necessary.
How to Write Text Lines
Output some text lines on a page:
    import fitz
    doc = fitz.open(...)  # new or existing PDF
    page = doc.new_page()  # new or existing page via doc[n]
    p = fitz.Point(50, 72)  # start point of 1st line

    text = "Some text,\nspread across\nseveral lines."
    # the same result is achievable by
    # text = ["Some text", "spread across", "several lines."]

    rc = page.insert_text(p,  # bottom-left of 1st char
                          text,  # the text (honors '\n')
                          fontname = "helv",  # the default font
                          fontsize = 11,  # the default font size
                          rotate = 0,  # also available: 90, 180, 270
                          )
    print("%i lines printed on page %i." % (rc, page.number))

    doc.save("text.pdf")
With this method, only the number of lines will be controlled to not go beyond page height. Surplus lines will not be written and the number of actual lines will be returned. The calculation uses a line height calculated from the fontsize and 36 points (0.5 inches) as bottom margin.
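To get a feeling for how many lines may fit, the calculation can be approximated as follows. This is only an estimate under stated assumptions: the line height is taken as 1.2 times the font size (an assumed factor, the library's actual value may differ), the first line sits at the insertion point, and the 36 point bottom margin is applied as described above. The function name `max_lines` is hypothetical.

```python
import math


def max_lines(page_height, start_y, fontsize=11, line_factor=1.2, bottom_margin=36):
    """Estimate how many text lines fit between start_y and the bottom margin."""
    line_height = fontsize * line_factor  # assumed line height
    lowest = page_height - bottom_margin  # lowest allowed line position
    if start_y > lowest:
        return 0
    # the first line sits at start_y, each further line one line height lower
    return math.floor((lowest - start_y) / line_height) + 1


# page height 842 points, text starting at y = 72
estimate = max_lines(842, 72)
```

The value returned by the insertion method itself remains the authoritative count of lines actually written.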
Line width is ignored. The surplus part of a line will simply be invisible.
However, for built-in fonts there are ways to calculate the line width beforehand - see
Here is another example. It inserts 4 text strings using the four different rotation options, and thereby shows how the text insertion point must be chosen to achieve the desired result:
    import fitz
    doc = fitz.open()
    page = doc.new_page()
    # the text strings, each having 3 lines
    text1 = "rotate=0\nLine 2\nLine 3"
    text2 = "rotate=90\nLine 2\nLine 3"
    text3 = "rotate=-90\nLine 2\nLine 3"
    text4 = "rotate=180\nLine 2\nLine 3"
    red = (1, 0, 0)  # the color for the red dots
    # the insertion points, each with a 25 pix distance from the corners
    p1 = fitz.Point(25, 25)
    p2 = fitz.Point(page.rect.width - 25, 25)
    p3 = fitz.Point(25, page.rect.height - 25)
    p4 = fitz.Point(page.rect.width - 25, page.rect.height - 25)
    # create a Shape to draw on
    shape = page.new_shape()

    # draw the insertion points as red, filled dots
    shape.draw_circle(p1, 1)
    shape.draw_circle(p2, 1)
    shape.draw_circle(p3, 1)
    shape.draw_circle(p4, 1)
    shape.finish(width=0.3, color=red, fill=red)

    # insert the text strings
    shape.insert_text(p1, text1)
    shape.insert_text(p3, text2, rotate=90)
    shape.insert_text(p2, text3, rotate=-90)
    shape.insert_text(p4, text4, rotate=180)

    # store our work to the page
    shape.commit()
    doc.save(...)
This is the result:
How to Fill a Text Box
This script fills 4 different rectangles with text, each time choosing a different rotation value:
    import fitz
    doc = fitz.open(...)  # new or existing PDF
    page = doc.new_page()  # new page, or choose doc[n]
    r1 = fitz.Rect(50, 100, 100, 150)  # a 50x50 rectangle
    disp = fitz.Rect(55, 0, 55, 0)  # add this to get more rects
    r2 = r1 + disp  # 2nd rect
    r3 = r1 + disp * 2  # 3rd rect
    r4 = r1 + disp * 3  # 4th rect
    t1 = "text with rotate = 0."  # the texts we will put in
    t2 = "text with rotate = 90."
    t3 = "text with rotate = -90."
    t4 = "text with rotate = 180."
    red = (1, 0, 0)  # some colors
    gold = (1, 1, 0)
    blue = (0, 0, 1)
    """We use a Shape object (something like a canvas) to output the text and
    the rectangles surrounding it for demonstration.
    """
    shape = page.new_shape()  # create Shape
    shape.draw_rect(r1)  # draw rectangles
    shape.draw_rect(r2)  # giving them
    shape.draw_rect(r3)  # a yellow background
    shape.draw_rect(r4)  # and a red border
    shape.finish(width = 0.3, color = red, fill = gold)
    # Now insert text in the rectangles. Font "Helvetica" will be used
    # by default. A return code rc < 0 indicates insufficient space (not checked here).
    rc = shape.insert_textbox(r1, t1, color = blue)
    rc = shape.insert_textbox(r2, t2, color = blue, rotate = 90)
    rc = shape.insert_textbox(r3, t3, color = blue, rotate = -90)
    rc = shape.insert_textbox(r4, t4, color = blue, rotate = 180)
    shape.commit()  # write all stuff to page /Contents
    doc.save("...")
Several default values were used above: font “Helvetica”, font size 11 and text alignment “left”. The result will look like this:
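The script above does not check the return code. One possible way to react to a negative value is to retry with smaller font sizes, sketched below. The sketch assumes that nothing is written when the return code is negative, so retrying does not produce duplicate text. The helper `insert_fitting` and the stand-in class `FakeShape` are hypothetical and exist only so the example runs without a PDF; with PyMuPDF, a Shape or Page object would be passed instead.

```python
def insert_fitting(target, rect, text, sizes=(11, 10, 9, 8, 7, 6), **kwargs):
    """Try descending font sizes until insert_textbox reports a fit."""
    for fontsize in sizes:
        rc = target.insert_textbox(rect, text, fontsize=fontsize, **kwargs)
        if rc >= 0:  # non-negative: the text fitted
            return fontsize
    return None  # no font size in the list was small enough


class FakeShape:
    """Stand-in for demonstration: pretends text fits at size 8 or smaller."""

    def __init__(self):
        self.tried = []

    def insert_textbox(self, rect, text, fontsize=11, **kwargs):
        self.tried.append(fontsize)
        return 8 - fontsize  # negative while the font is too large


shape = FakeShape()
used = insert_fitting(shape, (50, 100, 100, 150), "some longer text")
```

Trying a fixed list of sizes keeps the sketch simple; a search over fractional sizes would be a possible refinement.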
How to Use Non-Standard Encoding
Since v1.14, MuPDF allows Greek and Russian encoding variants for the Base14_Fonts. In PyMuPDF this is supported via an additional encoding argument. Effectively, this is relevant for Helvetica, Times-Roman and Courier (and their bold / italic forms) and characters outside the ASCII code range only. Elsewhere, the argument is ignored. Here is how to request Russian encoding with the standard font Helvetica:
page.insert_text(point, russian_text, encoding=fitz.TEXT_ENCODING_CYRILLIC)
The valid encoding values are TEXT_ENCODING_LATIN (0), TEXT_ENCODING_GREEK (1), and TEXT_ENCODING_CYRILLIC (2, Russian) with Latin being the default. Encoding can be specified by all relevant font and text insertion methods.
By the above statement, the fontname helv is automatically connected to the Russian font variant of Helvetica. Any subsequent text insertion with this fontname will use the Russian Helvetica encoding.
If you change the fontname just slightly, you can also achieve an encoding “mixture” for the same base font on the same page:
    import fitz
    doc = fitz.open()
    page = doc.new_page()
    shape = page.new_shape()
    t = "Sômé tèxt wìth nöñ-Lâtîn characterß."
    shape.insert_text((50, 70), t, fontname="helv", encoding=fitz.TEXT_ENCODING_LATIN)
    shape.insert_text((50, 90), t, fontname="HElv", encoding=fitz.TEXT_ENCODING_GREEK)
    shape.insert_text((50, 110), t, fontname="HELV", encoding=fitz.TEXT_ENCODING_CYRILLIC)
    shape.commit()
    doc.save("t.pdf")
The snippet above indeed leads to three different copies of the Helvetica font in the PDF. Each copy is uniquely identified (and referenceable) by using the correct upper-lower case spelling of the reserved word “helv”:
    for f in doc.get_page_fonts(0): print(f)

    [6, 'n/a', 'Type1', 'Helvetica', 'helv', 'WinAnsiEncoding']
    [7, 'n/a', 'Type1', 'Helvetica', 'HElv', 'WinAnsiEncoding']
    [8, 'n/a', 'Type1', 'Helvetica', 'HELV', 'WinAnsiEncoding']
This software is provided AS-IS with no warranty, either express or implied. This software is distributed under license and may not be copied, modified or distributed except as expressly authorized under the terms of that license. Refer to licensing information at artifex.com or contact Artifex Software, Inc., 1305 Grant Avenue - Suite 200, Novato, CA 94945, U.S.A., +1(415)492-9861, for further information.