Everyone is familiar with PDF files. Portable Document Format (PDF) is a file format developed by Adobe Systems. It was introduced in the early 1990s and became an open standard in 2008, when it was published by the ISO as ISO 32000.

Each PDF file is a complete representation of a document. It can include images, text, tables, styles, links to web pages, and other multimedia elements. It preserves the exact layout of the document, which makes PDF one of the most popular formats for printing. The PDF specification also provides for encryption, digital signatures, file attachments, and metadata. To view a PDF, we need a PDF reader like Adobe Reader® or a web browser.

PDF eased the transition from paper to digital documents. We can create documents and send their electronic versions anywhere. In this topic, we'll learn how to read an existing document and create .pdf files in Python.

Reading PDF using Python

PDF documents are binary files; they are more complex than plain text files because they may contain different fonts, colors, tables, and so on. In this section, we will use the pypdf library to work with this type of file. First, let's install it from the Python Package Index (PyPI) with the command below. If the installation succeeds, pip reports the installed package version:

pip install pypdf

# Installing collected packages: pypdf
# Successfully installed pypdf-6.0.0

Now, let's use the pypdf library to extract the metadata from dummy.pdf. You can download it here. Don't forget to put it in the same folder as your Python script!

from pypdf import PdfReader

# PdfReader accepts a file path and reads the file itself
pdf = PdfReader("dummy.pdf")
information = pdf.metadata
number_of_pages = len(pdf.pages)

The PdfReader class provides all necessary methods and attributes to access data in a PDF file. In the example above, the .metadata property extracts the PDF metadata that is embedded when a PDF is created:

metadata = f"""
    Author: {information.author}
    Creator: {information.creator}
    Producer: {information.producer}
    Subject: {information.subject}
    Title: {information.title}
    Number of pages: {number_of_pages}
    """

print(metadata)

Let's take a look at the dummy.pdf metadata:

    Author: Evangelos Vlachogiannis
    Creator: Writer
    Producer: OpenOffice.org 2.1
    Subject: None
    Title: None
    Number of pages: 1

Writing PDF files

The PdfWriter class creates new PDF files. Before you can save a file, you need to add some pages to it. You can leave them blank. Let's write a new file that reuses the only page of the dummy file.

from pypdf import PdfReader, PdfWriter

new_pdf = PdfWriter()
original_pdf = PdfReader("dummy.pdf")

# Add a copy of the page to the writer and rotate it 90 degrees to the right
page_0 = new_pdf.add_page(original_pdf.pages[0])
page_0.rotate(90)

# Add a second copy and rotate it 90 degrees to the left
page_1 = new_pdf.add_page(original_pdf.pages[0])
page_1.rotate(-90)

with open("rotate_page.pdf", "wb") as new_file:
    new_pdf.write(new_file)

In the code snippet above, we use pages[0] to get the first page. Since the dummy.pdf file has only one page, we add it to the writer twice. Each call to .add_page() returns the copy of the page that now belongs to the writer, so page_0 and page_1 are separate pages. We then use .rotate(90) and .rotate(-90) to turn them to the right and to the left. The order matters: .rotate() changes the page it is called on, so rotating the reader's page by 90 and then by -90 degrees would bring it back to its original orientation. Finally, we create a new PDF using .write().

PdfWriter objects can create new PDF files, but apart from blank pages, they can't produce new content from scratch. To get around this limitation, we will next look at another way to generate PDF files.

Creating a PDF from scratch

Let's continue with fpdf2. This is a minimalist library that allows us to generate PDF documents. It is a Python port of FPDF, a library written in PHP. Let's install it from the Python Package Index (PyPI) as we've done with the previous library:

pip install fpdf2

# Installing collected packages: fpdf2
# Successfully installed fpdf2-2.4.3

The fpdf2 library is small and simple to use, yet versatile and well-maintained. Have a look at the following example:

from fpdf import FPDF, XPos, YPos

pdf = FPDF()
pdf.add_page()
pdf.set_font(family='times', style='I', size=14)
pdf.cell(w=1, txt="hello world!")
pdf.output("hello_world.pdf")  # a new name, so the downloaded dummy.pdf is not overwritten

As a result, we can get the following PDF:

PDF output screenshot

We've created an FPDF object with the default values: the pages are in A4 portrait mode, and the unit of measurement is the millimeter. It is possible to set the PDF in landscape mode (L) and use other page formats (such as Letter or Legal), as well as different units of measurement (pt, cm, in).

pdf = FPDF(orientation="L", format="Letter", unit="cm")

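As you've seen, we use the .cell() method to add a line of text to our file. A cell is a rectangular area that contains text and can be framed by a border. It is rendered at the current position. We can specify its width (w) and height (h) in the unit defined in the FPDF object, as well as its border and text alignment. The alignment applies within the cell's width, and a width of 0 makes the cell extend to the right margin. Depending on the fpdf2 version, w may be a required argument, so it is safest to pass it explicitly.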

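# w=0 stretches each cell to the right margin, so align positions the text across the page;
# h=1 assumes the centimeter unit set above
pdf.cell(w=0, h=1, txt="I am the first line at the left", border=0, new_x=XPos.LMARGIN, new_y=YPos.NEXT, align="L")
pdf.cell(w=0, h=1, txt="I am the centered line", border=0, new_x=XPos.LMARGIN, new_y=YPos.NEXT, align="C")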

The fpdf2 library has many other useful features, such as changing the font color and adding multiple cells, headers, footers, and images. You can check out the official documentation.

Finally, the PDF document is closed and saved under the given file name using the .output() method.

Conclusion

PDF is the standard for secure sharing and distribution of electronic documents around the globe, both on corporate intranets and on the Web. In this topic, you've learned how to use the pypdf library to read PDF files with the help of the PdfReader class and write new ones using the PdfWriter class. You've also learned how to create custom PDF files with the fpdf2 library.
