In this topic, we will discuss how beautifulsoup can help us with XML. Beautifulsoup is a library for processing HTML and XML files. It provides parsing, information extraction, web-scraping, and a lot of other useful features.
Imagine that your boss has given you a pile of XML files that share the same structure, and you need to extract the text from them for further analysis. There are far too many files to handle manually. That's where beautifulsoup comes in handy: it can parse the files and extract the information from them.
Installation
You can install beautifulsoup with pip. Note that the official package name includes the 4:
pip install beautifulsoup4
You also need the lxml library to enable the parser that we'll discuss below. You can install it like this:
pip install lxml
Do not forget to import the library before you start:
from bs4 import BeautifulSoup
You don't need to import lxml explicitly to parse your files; it is loaded automatically once we set the parser type.
First example
Below is an example of an XML file:
<?xml version="1.0" encoding="UTF-8"?>
<movie_library>
<movie>
<title year="1994">Pulp Fiction</title>
<director>Quentin Tarantino</director>
</movie>
<movie>
<title year="2001">Mulholland Dr.</title>
<director>David Lynch</director>
</movie>
</movie_library>
To start processing it, we need to read this XML file just like any other file. You can rewrite the code later to automatically read the files one by one:
# Read the whole file; the with block closes it automatically.
with open("my_file_1.xml", "r", encoding="utf-8") as f:
    file = f.read()
And create a BeautifulSoup object:
soup = BeautifulSoup(file, "xml")
The first argument is the content of the document we have read, and the second one is the type of parser we'd like to use.
Keep in mind that xml is the name of the parser, while lxml is the name of the library we need to install to be able to use that parser when creating the soup variable. The xml parser is not available unless lxml is installed.
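The remark above about reading the files one by one can be sketched as follows. This is only a minimal illustration under stated assumptions: the helper name parse_xml_files and the idea that all files sit in one folder with the .xml extension are made up for the example. The files are opened in binary mode so that the parser can use the encoding declared in each file.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def parse_xml_files(folder):
    """Parse every .xml file in folder and return {file name: soup}.

    Hypothetical helper: the name and the folder layout are assumptions.
    """
    soups = {}
    for path in sorted(Path(folder).glob("*.xml")):
        # Binary mode lets the parser honour the encoding from the XML prolog.
        with open(path, "rb") as f:
            soups[path.name] = BeautifulSoup(f, "xml")
    return soups
```

Each value in the returned dictionary is a regular BeautifulSoup object, so everything shown for the single soup variable in the rest of this topic applies to it as well.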
Now soup contains the parsed document as a tree. You can use prettify() to print it in a more readable form:
print(soup.prettify())
# <?xml version="1.0" encoding="utf-8"?>
# <movie_library>
#  <movie>
#   <title year="1994">
#    Pulp Fiction
#   </title>
#   <director>
#    Quentin Tarantino
#   </director>
#  </movie>
#  <movie>
#   <title year="2001">
#    Mulholland Dr.
#   </title>
#   <director>
#    David Lynch
#   </director>
#  </movie>
# </movie_library>
Finding information
Even though the output above looks decent with prettify(), it is still hard to follow, and it gets more confusing with larger documents. If you're interested in something in particular, look for tags. XML has no predefined set of tags; they may differ from document to document, so you'll have to identify the ones you need yourself. Once you know which tags you need, you can use the following methods to find them in the tree:
- find() returns the first occurrence of the tag in the tree:
tag1 = soup.find("title")
print(tag1) # <title year="1994">Pulp Fiction</title>
- find_all() returns a list that contains all occurrences of the tag you are searching for:
tag2 = soup.find_all("director")
print(tag2) # [<director>Quentin Tarantino</director>, <director>David Lynch</director>]
If the specified tags are not found, find() returns None; find_all() returns an empty list.
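Because find() may return None, reading the text of a missing tag fails with an AttributeError. The sketch below shows one way to guard against that. The helper name first_text and the <producer> tag are hypothetical and exist only for this example; the sample document is inlined so that the code runs on its own.

```python
from bs4 import BeautifulSoup

# Sample document from this topic, inlined so the sketch is self-contained.
xml = """<movie_library>
<movie>
<title year="1994">Pulp Fiction</title>
<director>Quentin Tarantino</director>
</movie>
<movie>
<title year="2001">Mulholland Dr.</title>
<director>David Lynch</director>
</movie>
</movie_library>"""
soup = BeautifulSoup(xml, "xml")


def first_text(soup, tag_name, default=None):
    """Return the text of the first tag_name tag, or default if it is absent."""
    tag = soup.find(tag_name)
    if tag is None:  # find() found nothing
        return default
    return tag.text


print(first_text(soup, "title"))                        # Pulp Fiction
print(first_text(soup, "producer", default="unknown"))  # unknown
print(len(soup.find_all("producer")))                   # 0
```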
If a tag has an attribute, you can include it in the search:
tag3 = soup.find("title", {"year": "1994"})
print(tag3) # <title year="1994">Pulp Fiction</title>
This query is different: the second argument is a dictionary that specifies an attribute name (key) and the value it stores.
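As a short sketch of the attribute search described above, the dictionary can also be passed through the attrs keyword argument, which makes its purpose explicit. The sample document is inlined here so that the code runs on its own, and the year 1960 is just an assumed value that does not occur in the file.

```python
from bs4 import BeautifulSoup

# Sample document from this topic, inlined so the sketch is self-contained.
xml = """<movie_library>
<movie>
<title year="1994">Pulp Fiction</title>
<director>Quentin Tarantino</director>
</movie>
<movie>
<title year="2001">Mulholland Dr.</title>
<director>David Lynch</director>
</movie>
</movie_library>"""
soup = BeautifulSoup(xml, "xml")

# Positional dictionary and the attrs keyword are two spellings of one query.
title_a = soup.find("title", {"year": "2001"})
title_b = soup.find("title", attrs={"year": "2001"})
print(title_a)             # <title year="2001">Mulholland Dr.</title>
print(title_a == title_b)  # True

# No title has this year, so find() returns None.
missing = soup.find("title", {"year": "1960"})
print(missing)             # None
```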
Alternative way
Another way to search for tags is soup.<tag>, where <tag> is the name of the tag you're searching for. This returns the tag itself together with its contents. If there are several matching tags, only the first occurrence is returned:
print(soup.director) # <director>Quentin Tarantino</director>
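The dotted notation can also be chained to step down the tree, as the following sketch illustrates. The sample document is inlined so the code runs on its own, and <producer> is a hypothetical tag that is absent from the file.

```python
from bs4 import BeautifulSoup

# Sample document from this topic, inlined so the sketch is self-contained.
xml = """<movie_library>
<movie>
<title year="1994">Pulp Fiction</title>
<director>Quentin Tarantino</director>
</movie>
<movie>
<title year="2001">Mulholland Dr.</title>
<director>David Lynch</director>
</movie>
</movie_library>"""
soup = BeautifulSoup(xml, "xml")

# Each step returns the first matching tag inside the previous one.
first_title = soup.movie.title
print(first_title.text)  # Pulp Fiction

# A tag that does not exist yields None, just like find().
print(soup.producer)     # None
```

Since a missing tag yields None, a chain such as soup.producer.text raises an AttributeError. The notation also cannot reach tags whose names coincide with existing attributes such as name or text; find() is the safer choice in those cases.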
You can also explore how tags relate to each other using the main relationship types in XML files:
- parent shows the tag inside which the one you're searching for is placed:
print(tag3.parent)
# <movie>
# <title year="1994">Pulp Fiction</title>
# <director>Quentin Tarantino</director>
# </movie>
- children shows the tag(s) that are placed in the searched tag:
tag4 = soup.find("movie")
print(list(tag4.children))
# ['\n', <title year="1994">Pulp Fiction</title>, '\n', <director>Quentin Tarantino</director>, '\n']
tag4.children returns an iterator, so we need to convert it to a list to see the contents. The contents attribute is similar to children but returns a list instead:
print(tag4.contents)
# ['\n', <title year="1994">Pulp Fiction</title>, '\n', <director>Quentin Tarantino</director>, '\n']
- siblings are the tag(s) that are placed on the same level as the searched tag. Siblings may precede (previous_sibling and previous_siblings) or follow (next_sibling and next_siblings) it. previous_siblings and next_siblings both return generators:
tag5 = soup.find("director")
print(tag5.previous_sibling) # \n
print(list(tag5.previous_siblings)) # ['\n', <title year="1994">Pulp Fiction</title>, '\n']
tag3 = soup.find("title", {"year": "1994"})
print(tag3.next_sibling) # \n
print(list(tag3.next_siblings)) # ['\n', <director>Quentin Tarantino</director>, '\n']
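As the outputs above show, the line breaks between tags appear among children and siblings as '\n' strings. The sketch below shows one possible way to skip them: filtering children by type and using find_next_sibling(), which looks only at tags. The sample document is inlined so that the code runs on its own.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Sample document from this topic, inlined so the sketch is self-contained.
xml = """<movie_library>
<movie>
<title year="1994">Pulp Fiction</title>
<director>Quentin Tarantino</director>
</movie>
<movie>
<title year="2001">Mulholland Dr.</title>
<director>David Lynch</director>
</movie>
</movie_library>"""
soup = BeautifulSoup(xml, "xml")

# Keep only real tags among the children, dropping whitespace strings.
movie = soup.find("movie")
child_tags = [child for child in movie.children if isinstance(child, Tag)]
print([child.name for child in child_tags])  # ['title', 'director']

# Jump from a title straight to the director on the same level.
title = soup.find("title", {"year": "2001"})
director = title.find_next_sibling("director")
print(director.text)      # David Lynch
print(title.parent.name)  # movie
```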
Extracting information
So far, our results still include the tags themselves. Let's learn how to extract the data: the text contained in tags and the values of attributes.
Earlier, we created a variable with a list of the <director> tags. To print their contents, you can use a for loop and the text attribute to get the text data:
tag2 = soup.find_all("director")
for t in tag2:
    print(t.text)
# Quentin Tarantino
# David Lynch
Each t.text returns the text contained in the corresponding tag.
Another helpful method is get(), which returns the value of a tag attribute. Pass the attribute name as a quoted string in the parentheses:
tag1 = soup.find("title")
print(tag1.get("year")) # 1994
Summary
In this topic, we have covered the main features of beautifulsoup. If you work with XML, it can help you with:
- creating a parse tree;
- searching for tags by their names and relations;
- data extraction.
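The three steps listed above can be combined into one short sketch. It is an illustration rather than a definitive implementation: the function name extract_movies and the dictionary keys are assumptions made for this example, and the sample document is inlined so that the code runs on its own.

```python
from bs4 import BeautifulSoup

# Sample document from this topic, inlined so the sketch is self-contained.
xml = """<movie_library>
<movie>
<title year="1994">Pulp Fiction</title>
<director>Quentin Tarantino</director>
</movie>
<movie>
<title year="2001">Mulholland Dr.</title>
<director>David Lynch</director>
</movie>
</movie_library>"""


def extract_movies(xml_text):
    """Return one dictionary per <movie> tag (hypothetical helper)."""
    soup = BeautifulSoup(xml_text, "xml")    # create the parse tree
    movies = []
    for movie in soup.find_all("movie"):     # search for tags by name
        title = movie.find("title")
        director = movie.find("director")
        movies.append({                      # extract text and attributes
            "title": title.text if title is not None else None,
            "year": title.get("year") if title is not None else None,
            "director": director.text if director is not None else None,
        })
    return movies


movies = extract_movies(xml)
for entry in movies:
    print(entry["year"], entry["title"], entry["director"])
# 1994 Pulp Fiction Quentin Tarantino
# 2001 Mulholland Dr. David Lynch
```

Note that get() returns the attribute value as a string, so the year needs an explicit int() conversion if it is going to be used as a number.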
If you need more information on beautifulsoup, take a look at the official Beautiful Soup Documentation.