PDFs in Python
A quick-start guide for working with PyMuPDF
- Installation
- Import (fitz) & Version Info
- Working with Documents
- Working with Pages
- More Features...
This notebook is primarily intended as a quick reference for working with PDFs in Python, to be expanded over time. The structure and much of the content are based on this tutorial in the PyMuPDF docs.
PyMuPDF:
- GitHub
- Docs
- Recipes:
  - Docs - Recipes
  - Wiki - Recipes (e.g. working with SVGs, extracting fonts, extracting text from a rectangle)
  - GitHub - Utilities (e.g. demo.py, a Python script similar to this notebook)
- Supported formats:
  - PDF, XPS, OpenXPS, CBZ, CBR, FB2, EPUB
# !pip install PyMuPDF
import fitz
print(fitz.__doc__)
First, download a document to work with. Note the use of joblib to cache the download, which saves time when re-running the notebook and avoids hitting the server again:
from joblib import Memory
from pathlib import Path
# !pip install requests
import requests
path = Path('.')
CACHE_DIR = path / '.jupyter_cache'
memory = Memory(CACHE_DIR, verbose=0)
@memory.cache
def download(url, dst):
    response = requests.get(url, allow_redirects=True)
    with open(dst, 'wb') as f:
        f.write(response.content)
url = 'https://ai2-website.s3.amazonaws.com/publications/Siegel16eccv.pdf'
fn = path / 'example.pdf'
download(url, fn)
fn
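The joblib cache keys on the function arguments rather than on the file itself, so a plain file-existence check is another common way to avoid repeated downloads. The following is a minimal sketch of that alternative using only the standard library; download_if_missing, fetch_bytes and the fetch parameter are hypothetical names introduced here for illustration, not part of joblib or PyMuPDF.

```python
from pathlib import Path
from urllib.request import urlopen


def fetch_bytes(url):
    """Return the raw bytes behind a URL (standard library only)."""
    with urlopen(url) as response:
        return response.read()


def download_if_missing(url, dst, fetch=fetch_bytes):
    """Write the content of url to dst unless dst already exists.

    Returns True if a download happened, False if the existing file was reused.
    The fetch argument is a hypothetical hook that makes the function easy to
    try out without network access.
    """
    dst = Path(dst)
    if dst.exists():
        return False
    dst.write_bytes(fetch(url))
    return True
```

Calling download_if_missing(url, fn) with the url and fn defined above would fetch the file once and skip the request on later runs, as long as example.pdf stays on disk.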
doc = fitz.open(fn)
# doc.close()
doc.pageCount, doc.metadata, doc.getToC()
# index by page number
page_no = 0
page = doc[page_no]
page
# iterate over pages
for page in doc:
    pass
# slice over pages
for page in doc.pages(2,6):
    pass
# all links in one page
links = page.getLinks()
# iterator over links
for link in page.links():
    pass
links
# iterate over annotations
for annot in page.annots():
    print(annot)
# iterate over form fields
for field in page.widgets():
    print(field)
pix is a Pixmap object which (in this case) contains an RGB image of the page, ready to be used for many purposes. The method Page.getPixmap() offers many options for controlling the image: resolution, colorspace (e.g. to produce a grayscale image or an image with a subtractive color scheme), transparency, rotation, mirroring, shifting, shearing, etc. For example, to create an RGBA image (i.e. one containing an alpha channel), specify pix = page.getPixmap(alpha=True).
# default (poor resolution causes text in example.pdf to be barely readable)
# file size: 120 kB
pix = page.getPixmap()
# 2x default resolution (text is clear, image text still hard to read)
# file size: 328 kB
zoom_xy = (2., 2.)
mat = fitz.Matrix(*zoom_xy)
pix = page.getPixmap(matrix=mat) # use 'mat' instead of the identity matrix
# 4x default resolution (image text is barely readable)
# file size: 691 kB
zoom_xy = (4., 4.)
mat = fitz.Matrix(*zoom_xy)
pix = page.getPixmap(matrix=mat) # use 'mat' instead of the identity matrix
# 8x default resolution (image text is pretty clear but still not perfect)
# file size: 1.4 MB
zoom_xy = (8., 8.)
mat = fitz.Matrix(*zoom_xy)
pix = page.getPixmap(matrix=mat) # use 'mat' instead of the identity matrix
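The zoom factors above relate to resolution through the PDF base unit of 72 points per inch: a zoom of 1 renders at roughly 72 dpi, a zoom of 2 at 144 dpi, and so on. The helpers below are a small sketch of that arithmetic. zoom_for_dpi and rendered_size are hypothetical names, the rounding is an approximation, and the 612 x 792 point page (US Letter) is an assumed example rather than the measured size of example.pdf.

```python
import math

BASE_DPI = 72  # PDF user space: 1 point = 1/72 inch


def zoom_for_dpi(dpi):
    """Zoom factor that renders a page at approximately the given dpi."""
    return dpi / BASE_DPI


def rendered_size(width_pt, height_pt, zoom):
    """Approximate pixel size of a page given in points, rounding up partial pixels."""
    return math.ceil(width_pt * zoom), math.ceil(height_pt * zoom)


# Assumed example: a US Letter page (612 x 792 points) at the zoom levels used above
for zoom in (1.0, 2.0, 4.0, 8.0):
    print(zoom, rendered_size(612, 792, zoom))
```

A value such as zoom_for_dpi(300) could then be passed to fitz.Matrix for both axes, in the same way as the zoom_xy tuples above.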
dst = fn.parent / f'{fn.stem}_page-{page.number}.png'
dst
pix.writeImage(str(dst))
from PIL import Image
mode = "RGBA" if pix.alpha else "RGB"
img = Image.frombytes(mode, [pix.width, pix.height], pix.samples)
import matplotlib.pyplot as plt
plt.figure(figsize=(10,20))
plt.imshow(img);
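Image.frombytes works here because pix.samples is a flat bytes buffer of pixels stored row by row, with one byte per color component (3 for RGB, 4 for RGBA). As a rough sketch of that layout, the hypothetical helper pixel_at below reads one pixel from such a buffer. It assumes rows are tightly packed (one row occupies width times components bytes) and uses a hand-made 2 x 2 buffer rather than a real Pixmap.

```python
def pixel_at(samples, width, n, x, y):
    """Return the n color components of pixel (x, y) from a row-major byte buffer."""
    stride = width * n           # bytes per image row (assumed tightly packed)
    start = y * stride + x * n   # offset of the first component of the pixel
    return tuple(samples[start:start + n])


# Hand-made 2 x 2 RGB image: red, green on the top row; blue, white on the bottom row
samples = bytes([
    255, 0, 0,     0, 255, 0,
    0, 0, 255,     255, 255, 255,
])

top_right = pixel_at(samples, width=2, n=3, x=1, y=0)
bottom_left = pixel_at(samples, width=2, n=3, x=0, y=1)
print(top_right, bottom_left)
```

With a real Pixmap, the corresponding inputs would be pix.samples, pix.width and pix.n.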
Use one of the following strings for opt to obtain different formats [2]:
“text”: (default) plain text with line breaks. No formatting, no text position details, no images.
“blocks”: generate a list of text blocks (= paragraphs).
“words”: generate a list of words (strings not containing spaces).
“html”: creates a full visual version of the page including any images. This can be displayed with your internet browser.
“dict” / “json”: same information level as HTML, but provided as a Python dictionary or a JSON string, respectively. See TextPage.extractDICT() and TextPage.extractJSON() for details of the structure.
“rawdict”: a super-set of TextPage.extractDICT(). It additionally provides character-level detail, like the XML output. See TextPage.extractRAWDICT() for details of its structure.
“xhtml”: same text information level as the “text” version, but includes images. Can also be displayed by internet browsers.
“xml”: contains no images, but full position and font information down to each single text character. Use an XML module to interpret.
To get an idea about the output of these alternatives, see Appendix 2: Details on Text Extraction.
text_options = {
    'text', 'blocks', 'words', 'html',
    'dict', 'json', 'rawdict', 'xhtml', 'xml'}
opt = 'text'
text = page.getText(opt)
text
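The “words” option is convenient for position-aware processing. The PyMuPDF docs describe each item as a tuple (x0, y0, x1, y1, word, block_no, line_no, word_no). Assuming that layout, the sketch below regroups words into lines of text. lines_from_words is a hypothetical helper, and the sample tuples are made up rather than taken from example.pdf.

```python
from collections import defaultdict


def lines_from_words(words):
    """Join word tuples into text lines, ordered by block, line and word number."""
    lines = defaultdict(list)
    for x0, y0, x1, y1, word, block_no, line_no, word_no in words:
        lines[(block_no, line_no)].append((word_no, word))
    return [
        " ".join(word for _, word in sorted(items))
        for _, items in sorted(lines.items())
    ]


# Made-up sample in the assumed (x0, y0, x1, y1, word, block_no, line_no, word_no) layout
sample_words = [
    (72.0, 70.0, 100.0, 82.0, "Hello", 0, 0, 0),
    (104.0, 70.0, 140.0, 82.0, "world", 0, 0, 1),
    (72.0, 90.0, 95.0, 102.0, "Next", 0, 1, 0),
    (99.0, 90.0, 120.0, 102.0, "line", 0, 1, 1),
]

text_lines = lines_from_words(sample_words)
print(text_lines)
```

With a real page, the input would be the list returned by page.getText('words').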
rectangles = page.searchFor('We', hit_max = 16)
rectangles
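Each search hit is a rectangle described by the coordinates (x0, y0, x1, y1). A common next step is to compute one bounding box that covers several hits, for example to crop or highlight a region. The sketch below shows that computation on plain tuples. union_box is a hypothetical helper, the coordinates are made up, and it is an assumption here that the objects in rectangles can be converted with tuple() in the same way.

```python
def union_box(rects):
    """Smallest (x0, y0, x1, y1) box containing all given rectangles, or None if empty."""
    if not rects:
        return None
    x0s, y0s, x1s, y1s = zip(*(tuple(r) for r in rects))
    return (min(x0s), min(y0s), max(x1s), max(y1s))


# Made-up hits in (x0, y0, x1, y1) form
hits = [(10, 20, 30, 40), (5, 25, 50, 35)]
box = union_box(hits)
print(box)
```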
- PDF Maintenance: documents can only be modified in PDF format. First convert to PDF using doc.convertToPDF(); after modifying, save to disk with doc.save().
- Join & Split PDF documents
- Modify, Create, Re-arrange & Delete PDF pages
- Embed arbitrary data (similar to ZIP files)
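Splitting a document usually comes down to choosing page ranges and copying each range into a new PDF. The sketch below covers only the range arithmetic, producing 0-based, inclusive (first, last) page pairs. chunk_ranges is a hypothetical helper introduced for illustration, not a PyMuPDF function.

```python
def chunk_ranges(page_count, chunk_size):
    """Split page indices 0..page_count-1 into inclusive (first, last) ranges."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, page_count) - 1)
        for start in range(0, page_count, chunk_size)
    ]


# A 10-page document split into parts of at most 4 pages
ranges = chunk_ranges(10, 4)
print(ranges)
```

Each pair could then serve as the from_page and to_page arguments when copying pages from the source document into a new, empty one.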