
In this topic, we will discuss how beautifulsoup can help us with XML. Beautifulsoup is a Python library for processing HTML and XML files. It parses a document into a tree that you can navigate, search, and extract information from, which makes it a common tool for web scraping.

Imagine your boss gave you a pile of XML files that all share the same structure. You need to extract the text from them for further analysis. There are far too many files to handle manually. That's where beautifulsoup comes in handy. It can parse the files and get the information from them.

Installation

You can install beautifulsoup with pip. Note that the official package name ends with 4:

pip install beautifulsoup4

You also need the lxml library to enable the parser that we'll discuss below. You can install it like this:

pip install lxml

Do not forget to import the library before you start:

from bs4 import BeautifulSoup

You don't need to import lxml explicitly to parse your files; beautifulsoup loads it automatically once we set the parser type.

First example

Below is an example of an XML file:

<?xml version="1.0" encoding="UTF-8"?>
<movie_library>
  <movie>
    <title year="1994">Pulp Fiction</title>
    <director>Quentin Tarantino</director>
  </movie>
  <movie>
    <title year="2001">Mulholland Dr.</title>
    <director>David Lynch</director>
  </movie>
</movie_library>

To start processing it, we need to read this XML file just like any other file. You can rewrite the code later to automatically read the files one by one:

with open("my_file_1.xml", "r", encoding="utf-8") as f:
    file = f.read()  # the with block closes the file for us

And create a BeautifulSoup object:

soup = BeautifulSoup(file, "xml")  

The first argument is the contents of the document, and the second one is the type of parser we'd like to use.

Keep in mind that xml is the name of the parser, while lxml is the name of the library we need to install to be able to use that parser when creating the soup variable. Without lxml installed, beautifulsoup cannot find the xml parser and raises an error instead of creating the soup.
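To make the dependency on lxml concrete, here is a minimal sketch of how a missing parser could be handled. It relies on the FeatureNotFound exception that bs4 raises when the requested parser is unavailable; the inline xml string and the parser_available flag are illustrative assumptions, not part of the example file.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Illustrative one-movie document (an assumption, not my_file_1.xml).
xml = "<movie><title>Pulp Fiction</title></movie>"

try:
    soup = BeautifulSoup(xml, "xml")
    parser_available = True
except FeatureNotFound:
    # Raised when bs4 cannot find a parser for "xml", i.e. lxml is missing.
    parser_available = False
    print("The xml parser is unavailable. Install it with: pip install lxml")
else:
    print(soup.find("title").text)
```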

Now soup contains the parsed document in the form of a tree. You can use prettify() to make it more readable:

print(soup.prettify())

# <?xml version="1.0" encoding="utf-8"?>
# <movie_library>
#  <movie>
#   <title year="1994">
#    Pulp Fiction
#   </title>
#   <director>
#    Quentin Tarantino
#   </director>
#  </movie>
#  <movie>
#   <title year="2001">
#    Mulholland Dr.
#   </title>
#   <director>
#    David Lynch
#   </director>
#  </movie>
# </movie_library>
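The remark about reading the files one by one could be sketched as follows. This is only one possible approach: the function name parse_xml_folder and the folder name xml_files are hypothetical, and the sketch assumes that all the files sit in a single folder and have the .xml extension.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def parse_xml_folder(folder):
    """Parse every .xml file in `folder` and return {file name: soup}."""
    soups = {}
    for path in sorted(Path(folder).glob("*.xml")):
        # Binary mode lets the parser use the encoding declared in the XML header.
        with open(path, "rb") as xml_file:
            soups[path.name] = BeautifulSoup(xml_file, "xml")
    return soups


if __name__ == "__main__":
    # "xml_files" is a hypothetical folder name; replace it with your own.
    for name, soup in parse_xml_folder("xml_files").items():
        print(name, soup.find("title"))
```

BeautifulSoup accepts an open file object as well as a string, so each file is handed to the parser directly and is closed as soon as its with block ends.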

Finding information

Even though the output above looks decent with prettify(), it is somewhat hard to follow. It'll get even more confusing with larger documents. If you're interested in something particular, look for tags. XML has no predefined set of tags; they differ from document to document, so you'll have to identify the ones you need yourself. Once you know which tags you need, you can use the following methods to find them in the tree:

  • find() returns the first occurrence of the tag in the tree:

    tag1 = soup.find("title")  
    print(tag1)  # <title year="1994">Pulp Fiction</title>
  • find_all() returns a list that contains all occurrences of the tag you are searching for:

    tag2 = soup.find_all("director")
    print(tag2)  
    # [<director>Quentin Tarantino</director>, <director>David Lynch</director>]

If the specified tags are not found, find() returns None; find_all() returns an empty list.
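Because find() may return None, it is usually worth checking the result before using it. Below is a small sketch of that check; the producer tag is deliberately absent from the illustrative document, and the fallback value "unknown" is an arbitrary choice.

```python
from bs4 import BeautifulSoup

xml = """<movie_library>
  <movie>
    <title year="1994">Pulp Fiction</title>
    <director>Quentin Tarantino</director>
  </movie>
</movie_library>"""

soup = BeautifulSoup(xml, "xml")

producer = soup.find("producer")       # there is no such tag in this document
producers = soup.find_all("producer")

# producer.text would raise AttributeError on None, so we check first.
producer_name = producer.text if producer is not None else "unknown"

print(producer_name)   # unknown
print(len(producers))  # 0
```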

If a tag has an attribute, you can include it in the search:

tag3 = soup.find("title", {"year": "1994"})
print(tag3)  # <title year="1994">Pulp Fiction</title>

This query differs from the previous ones: its second parameter is a dictionary that maps an attribute name (key) to the value the attribute must have.
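The same dictionary also works with find_all(), where it can be passed by name as the attrs argument. The sketch below inlines the example document as a string, which is an assumption made to keep the code self-contained. Using True as the value matches any tag that has the attribute, whatever its value.

```python
from bs4 import BeautifulSoup

xml = """<movie_library>
  <movie>
    <title year="1994">Pulp Fiction</title>
    <director>Quentin Tarantino</director>
  </movie>
  <movie>
    <title year="2001">Mulholland Dr.</title>
    <director>David Lynch</director>
  </movie>
</movie_library>"""

soup = BeautifulSoup(xml, "xml")

# All <title> tags whose year attribute equals "2001".
titles_2001 = soup.find_all("title", attrs={"year": "2001"})
print([t.text for t in titles_2001])  # ['Mulholland Dr.']

# All <title> tags that have a year attribute at all.
titles_with_year = soup.find_all("title", attrs={"year": True})
print(len(titles_with_year))  # 2
```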

Alternative way

Another way to search for tags is soup.<tag>, where <tag> is the name of the tag you're searching for. This returns the tag itself together with everything inside it. If the document contains several such tags, only the first occurrence is returned:

print(soup.director) # <director>Quentin Tarantino</director>
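The dotted form can also be chained to step down the tree one tag at a time. Here is a brief sketch on an inlined copy of the example document; like find(), the dotted form gives None when the tag does not exist.

```python
from bs4 import BeautifulSoup

xml = """<movie_library>
  <movie>
    <title year="1994">Pulp Fiction</title>
    <director>Quentin Tarantino</director>
  </movie>
  <movie>
    <title year="2001">Mulholland Dr.</title>
    <director>David Lynch</director>
  </movie>
</movie_library>"""

soup = BeautifulSoup(xml, "xml")

first_movie = soup.movie               # the first <movie> tag
print(first_movie.title.text)          # Pulp Fiction
print(soup.movie.director.text)        # Quentin Tarantino
print(soup.producer)                   # None
```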

You can also move from a tag to its relatives using the main relationship types in an XML tree:

  • parent shows the tag inside which the one you're searching for is placed:
    print(tag3.parent)
    
    # <movie>
    # <title year="1994">Pulp Fiction</title>
    # <director>Quentin Tarantino</director>
    # </movie>
  • children shows the tag(s) that are placed in the searched tag:

    tag4 = soup.find("movie")
    print(list(tag4.children)) 
    
    # ['\n', <title year="1994">Pulp Fiction</title>, '\n', <director>Quentin Tarantino</director>, '\n']
    tag4.children returns an iterator, so we need to convert it to a list to see the contents. The contents attribute is similar to children but returns a list right away:
    print(tag4.contents)
    
    # ['\n', <title year="1994">Pulp Fiction</title>, '\n', <director>Quentin Tarantino</director>, '\n']
  • siblings shows the tag(s) that are placed on the same level as the searched tag. Siblings may precede it (previous_sibling and previous_siblings) or follow it (next_sibling and next_siblings). Both previous_siblings and next_siblings return iterators, so we convert them to lists as well:

    tag5 = soup.find("director")
    print(tag5.previous_sibling) # \n
    print(list(tag5.previous_siblings)) # ['\n', <title year="1994">Pulp Fiction</title>, '\n']
    
    tag3 = soup.find("title", {"year": "1994"})
    print(tag3.next_sibling) # \n
    print(list(tag3.next_siblings)) # ['\n', <director>Quentin Tarantino</director>, '\n']
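As the outputs above show, the whitespace between tags is part of the tree too, so children and siblings often turn out to be plain strings rather than tags. A minimal sketch of two ways to skip them is given below: filtering by the Tag type, and asking for a sibling by its tag name with find_next_sibling(). The inlined document and the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

xml = """<movie_library>
  <movie>
    <title year="1994">Pulp Fiction</title>
    <director>Quentin Tarantino</director>
  </movie>
</movie_library>"""

soup = BeautifulSoup(xml, "xml")
movie = soup.find("movie")

# Keep only the children that are tags, dropping whitespace strings.
child_tags = [child for child in movie.children if isinstance(child, Tag)]
print([tag.name for tag in child_tags])  # ['title', 'director']

# Jump straight to the next sibling that is a <director> tag.
title = soup.find("title")
next_tag = title.find_next_sibling("director")
print(next_tag)  # <director>Quentin Tarantino</director>
```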
    

Extracting information

So far, our searches have returned whole tags. Let's learn how to extract the data itself: the text contained in tags and the values of their attributes.

Earlier, we created a variable with a list of the <director> tags. To print their text, you can use a for loop and the text attribute:

tag2 = soup.find_all("director")

for t in tag2:
    print(t.text)

# Quentin Tarantino
# David Lynch

Each t.text returns the text stored inside the corresponding tag.
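When text is taken from a tag that contains other tags, the result also includes the whitespace between them. The get_text() method offers a way to tidy this up; the sketch below assumes that a single space is a suitable separator for the pieces.

```python
from bs4 import BeautifulSoup

xml = """<movie_library>
  <movie>
    <title year="1994">Pulp Fiction</title>
    <director>Quentin Tarantino</director>
  </movie>
</movie_library>"""

soup = BeautifulSoup(xml, "xml")
movie = soup.find("movie")

# repr() makes the line breaks and indentation inside movie.text visible.
print(repr(movie.text))

# strip=True trims every piece of text and drops the empty ones;
# separator is placed between the remaining pieces.
clean_text = movie.get_text(separator=" ", strip=True)
print(clean_text)  # Pulp Fiction Quentin Tarantino
```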

Another helpful method is get(), which returns the value of a tag attribute. Pass the attribute name as a quoted string in the parentheses:

tag1 = soup.find("title")
print(tag1.get("year"))  # 1994
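Putting text and get() together, the data for each movie can be collected into a dictionary. The sketch below is one possible layout, not the only one: the records list and the country attribute are hypothetical, and country is included only to show that get() accepts a default value for a missing attribute.

```python
from bs4 import BeautifulSoup

xml = """<movie_library>
  <movie>
    <title year="1994">Pulp Fiction</title>
    <director>Quentin Tarantino</director>
  </movie>
  <movie>
    <title year="2001">Mulholland Dr.</title>
    <director>David Lynch</director>
  </movie>
</movie_library>"""

soup = BeautifulSoup(xml, "xml")

records = []
for movie in soup.find_all("movie"):
    title = movie.find("title")
    director = movie.find("director")
    records.append({
        "title": title.text,
        "year": title.get("year"),
        # No country attribute exists, so the default value is used.
        "country": title.get("country", "unknown"),
        "director": director.text,
    })

for record in records:
    print(record)
```

Attribute values are returned as strings, so year would need int() before any arithmetic.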

Summary

In this topic, we have covered the main features of beautifulsoup. If you work with XML, it can help you with:

  • creating a parse tree;
  • searching for tags by their names and relations;
  • data extraction.

If you need more information on beautifulsoup, take a look at the official Beautiful Soup Documentation.
