Need to gather information from the web? Maybe you're a market researcher analyzing trends, a business owner tracking competitor prices, or someone with a project that requires data from websites. Manually copying and pasting from web pages is slow and inefficient. Web scraping automates this process: it lets you extract specific data points from websites and transform them into a structured format you can actually use. Web scraping has its own complexities, but it handles large amounts of web data far more efficiently than manual methods.
In this topic, we will explore the essentials of web scraping using BeautifulSoup, discuss best practices, navigate common challenges, and uncover how generative AI, like ChatGPT, can enhance the process. Whether you're new to web scraping or looking to refine your skills, this topic will provide valuable insights and techniques.
Web Scraping
Generally, web scraping tools, regardless of their complexity, follow a similar process:
Sending a Request: The first step is to send a request to the website you want to scrape. This is similar to how your web browser requests a webpage when you type in a URL.
Receiving and Parsing HTML: Once the website receives your request, it sends back the requested webpage's data, primarily in HTML format. Web scraping tools then parse this HTML, which means they structure the messy code into a format that's easier to work with.
Locating and Extracting Data: With the HTML parsed and organized, the tools then pinpoint the specific data you need. This could be anything from text content and links to images and product prices.
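The three steps above do not depend on any particular library. As a rough illustration, here is a minimal sketch that uses only Python's standard library. The names `fetch` and `LinkCollector` and the sample HTML are made up for this example, and a hard-coded string stands in for the network response so the sketch runs offline.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url):
    """Step 1: send a request and return the response body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


class LinkCollector(HTMLParser):
    """Steps 2 and 3: walk the HTML and keep the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


# A hard-coded page stands in for fetch(url) so the sketch runs offline.
sample_html = '<p><a href="/a.html">A</a> and <a href="/b.html">B</a></p>'

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)  # ['/a.html', '/b.html']
```

Libraries such as BeautifulSoup take over the parsing and extraction steps, so you rarely need to write a parser class like this by hand.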
While these steps might seem simple, real-world web scraping often presents a few more hurdles. What if the website doesn't respond to your request correctly? What about poorly structured websites or content that is dynamically loaded with JavaScript? And then there's the fact that some website owners might not want you scraping their data at all. These are all complexities we need to consider, and we'll discuss ways to handle them later in this guide.
BeautifulSoup
BeautifulSoup is a commonly used Python library for parsing HTML documents. In this topic, we're going to use BeautifulSoup to scrape a website called http://books.toscrape.com. It has lots of book listings with links, categories, and other details, making it a good initial resource for learning practical web scraping skills.
Let's go through the web scraping process described above using BeautifulSoup.
Sending a Request
While BeautifulSoup itself doesn't send requests, it often works alongside the 'requests' library, which handles fetching the web page for you.
Installing: Let's first install BeautifulSoup and the requests library using pip:
pip install beautifulsoup4 requests
Fetching a Web Page: Import the necessary libraries and fetch the web page content:
from bs4 import BeautifulSoup  # bs4 is the import name of the beautifulsoup4 package
import requests
url = 'http://books.toscrape.com'
response = requests.get(url)
html_content = response.text
You can print html_content to make sure you are indeed receiving the HTML of the site.
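The fetch above assumes the request succeeds. Here is a minimal sketch of a more defensive version. The helper name `fetch_html` and the 10-second timeout are assumptions made for this example, not part of the original walkthrough. It relies on `raise_for_status()` and the `requests.exceptions.RequestException` base class from the requests library.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Raise an HTTPError for 4xx/5xx responses instead of parsing an error page.
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        print(f"Request failed: {error}")
        return None
    return response.text


# A malformed URL fails before any network traffic is sent, so this prints None.
print(fetch_html("not-a-url"))
```

With this helper, `html_content = fetch_html('http://books.toscrape.com')` could replace the two lines that call `requests.get` and read `response.text`, provided you check for `None` before parsing.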
Parsing HTML
This is where BeautifulSoup excels. It converts the raw HTML code of a web page into a structured format called a parse tree. This makes searching for specific items much easier and faster. Think of it as transforming the string HTML response into an organized, folder-like structure where each subfolder represents a specific tag in the HTML.
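To make the folder analogy concrete, here is a small sketch that prints one line per tag, indented by nesting depth. The HTML string and the helper name `outline` are made up for this example. It uses Python's built-in parser, which is introduced just below.

```python
from bs4 import BeautifulSoup, Tag

html = (
    "<html><head><title>Demo</title></head>"
    "<body><div><p>Hello</p><p>World</p></div></body></html>"
)
soup = BeautifulSoup(html, "html.parser")


def outline(node, depth=0):
    """Return one indented line per tag, mirroring the nesting of the parse tree."""
    lines = []
    for child in node.children:
        if isinstance(child, Tag):  # skip plain text nodes
            lines.append("  " * depth + child.name)
            lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(soup)))
```

Each level of indentation in the output corresponds to one level of nesting in the HTML, much like subfolders inside folders.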
To achieve this, BeautifulSoup uses parsing engines. These engines are libraries or tools that read the HTML or XML and build the parse tree. Let's take a look at the most common ones and how to apply them to our html_content:
html.parser
Description: Python's built-in HTML parser.
When to Use: It's reasonably fast and doesn't require any additional installation. Ideal for simple tasks.
Usage:
soup = BeautifulSoup(html_content, 'html.parser')
lxml
Description: A third-party library known for its speed and ability to parse both HTML and XML.
When to Use: Best for larger documents or when parsing speed is critical.
Usage:
soup = BeautifulSoup(html_content, 'lxml')
html5lib
Description: A pure-Python library that parses HTML the way a web browser does, following the HTML5 specification.
When to Use: When you need the most robust parsing and can tolerate slower performance.
Usage:
soup = BeautifulSoup(html_content, 'html5lib')
In this topic we will be using html.parser, but go ahead and try printing the parsed content (print(soup.prettify())) using all three parsers (lxml and html5lib must be installed separately, for example with pip). You will notice differences in how they handle and format the output.
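The differences between parsers are easiest to see on invalid HTML. The sketch below feeds the same broken fragment to each parser and records what comes back. The fragment and the `results` dictionary are made up for this example. A parser that is not installed is reported rather than causing a crash, using the `FeatureNotFound` exception that BeautifulSoup raises in that case.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken_html = "<a></p>"  # a stray closing tag with no matching opener

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken_html, parser))
    except FeatureNotFound:
        # Raised when the requested parser library is not installed.
        results[parser] = None

for parser, output in results.items():
    print(f"{parser:12} {output if output is not None else '(not installed)'}")
```

If all three parsers are available, you should see that each one repairs the fragment differently, which is why it is worth choosing a parser explicitly instead of relying on whichever one happens to be installed.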
Locating and Extracting Data
Now that BeautifulSoup has organized the HTML into a neat parse tree, finding the specific data you want becomes much easier. Remember, you're essentially searching through a well-organized filing cabinet now! However, every website is built differently, using different tags and structures. That's why it's always a good idea to print the parsed soup and examine it carefully. This will help you understand how the data you need is organized within the HTML structure of that particular website.
Once you know what you're looking for, BeautifulSoup provides several methods to pinpoint and extract data:
find(): This method finds the first occurrence of a specific tag. For example, you can use it to find the first <title> tag on the page.
find_all(): If you need to find all occurrences of a particular tag, this method returns a list of all matching elements.
Other Navigation Methods: BeautifulSoup offers additional ways to move through the parse tree, like the parent, children, next_sibling, and previous_sibling attributes, which can be helpful for navigating complex structures. A comprehensive list of these attributes and methods, along with their use cases, can be found in the BeautifulSoup documentation.
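Before applying these to the real page, here is a small sketch of the methods and attributes listed above, run against a tiny hand-written page. The HTML, the `book` class, and the variable names are made up for this example and are not taken from books.toscrape.com.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page (not the real site), with no whitespace between
# tags so that sibling navigation lands on tags rather than on text nodes.
html = (
    "<html><head><title>Demo shop</title></head><body>"
    "<ul id='books'>"
    "<li class='book'>First</li>"
    "<li class='book'>Second</li>"
    "<li class='book'>Third</li>"
    "</ul></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.find("title")              # first matching tag, or None
items = soup.find_all("li", class_="book")  # every matching tag

print(title_tag.get_text())                 # Demo shop
print([li.get_text() for li in items])      # ['First', 'Second', 'Third']

second = items[1]
print(second.parent.name)                   # ul
print(second.previous_sibling.get_text())   # First
print(second.next_sibling.get_text())       # Third
print([child.get_text() for child in second.parent.children])

print(soup.find("nav"))                     # None: there is no <nav> tag
```

On real pages, the whitespace between tags is parsed as text nodes, so next_sibling often returns a newline string rather than a tag. In that case find_next_sibling() is usually the more convenient choice, since it skips ahead to the next tag.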
Let's now apply this to our parsed soup.
soup = BeautifulSoup(html_content, 'html.parser')
print(soup.prettify())
Full Output (it's very long!)
<!DOCTYPE html>
<!--[if lt IE 7]> <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]> <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]> <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
<!--<![endif]-->
<head>
<title>
All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
<script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
<link href="static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="static/oscar/js/bootstrap-datetimepicker/bootstrap-datetimepicker.css" rel="stylesheet"/>
<link href="static/oscar/css/datetimepicker.css" rel="stylesheet" type="text/css"/>
</head>
<body class="default" id="default">
<header class="header container-fluid">
<div class="page_inner">
<div class="row">
<div class="col-sm-8 h1">
<a href="index.html">
Books to Scrape
</a>
<small>
We love being scraped!
</small>
</div>
</div>
</div>
</header>
<div class="container-fluid page">
<div class="page_inner">
<ul class="breadcrumb">
<li>
<a href="index.html">
Home
</a>
</li>
<li class="active">
All products
</li>
</ul>
<div class="row">
<aside class="sidebar col-sm-4 col-md-3">
<div id="promotions_left">
</div>
<div class="side_categories">
<ul class="nav nav-list">
<li>
<a href="catalogue/category/books_1/index.html">
Books
</a>
<ul>
<li>
<a href="catalogue/category/books/travel_2/index.html">
Travel
</a>
</li>
<li>
<a href="catalogue/category/books/mystery_3/index.html">
Mystery
</a>
</li>
<li>
<a href="catalogue/category/books/historical-fiction_4/index.html">
Historical Fiction
</a>
</li>
<li>
<a href="catalogue/category/books/sequential-art_5/index.html">
Sequential Art
</a>
</li>
<li>
<a href="catalogue/category/books/classics_6/index.html">
Classics
</a>
</li>
<li>
<a href="catalogue/category/books/philosophy_7/index.html">
Philosophy
</a>
</li>
<li>
<a href="catalogue/category/books/romance_8/index.html">
Romance
</a>
</li>
<li>
<a href="catalogue/category/books/womens-fiction_9/index.html">
Womens Fiction
</a>
</li>
<li>
<a href="catalogue/category/books/fiction_10/index.html">
Fiction
</a>
</li>
<li>
<a href="catalogue/category/books/childrens_11/index.html">
Childrens
</a>
</li>
<li>
<a href="catalogue/category/books/religion_12/index.html">
Religion
</a>
</li>
<li>
<a href="catalogue/category/books/nonfiction_13/index.html">
Nonfiction
</a>
</li>
<li>
<a href="catalogue/category/books/music_14/index.html">
Music
</a>
</li>
<li>
<a href="catalogue/category/books/default_15/index.html">
Default
</a>
</li>
<li>
<a href="catalogue/category/books/science-fiction_16/index.html">
Science Fiction
</a>
</li>
<li>
<a href="catalogue/category/books/sports-and-games_17/index.html">
Sports and Games
</a>
</li>
<li>
<a href="catalogue/category/books/add-a-comment_18/index.html">
Add a comment
</a>
</li>
<li>
<a href="catalogue/category/books/fantasy_19/index.html">
Fantasy
</a>
</li>
<li>
<a href="catalogue/category/books/new-adult_20/index.html">
New Adult
</a>
</li>
<li>
<a href="catalogue/category/books/young-adult_21/index.html">
Young Adult
</a>
</li>
<li>
<a href="catalogue/category/books/science_22/index.html">
Science
</a>
</li>
<li>
<a href="catalogue/category/books/poetry_23/index.html">
Poetry
</a>
</li>
<li>
<a href="catalogue/category/books/paranormal_24/index.html">
Paranormal
</a>
</li>
<li>
<a href="catalogue/category/books/art_25/index.html">
Art
</a>
</li>
<li>
<a href="catalogue/category/books/psychology_26/index.html">
Psychology
</a>
</li>
<li>
<a href="catalogue/category/books/autobiography_27/index.html">
Autobiography
</a>
</li>
<li>
<a href="catalogue/category/books/parenting_28/index.html">
Parenting
</a>
</li>
<li>
<a href="catalogue/category/books/adult-fiction_29/index.html">
Adult Fiction
</a>
</li>
<li>
<a href="catalogue/category/books/humor_30/index.html">
Humor
</a>
</li>
<li>
<a href="catalogue/category/books/horror_31/index.html">
Horror
</a>
</li>
<li>
<a href="catalogue/category/books/history_32/index.html">
History
</a>
</li>
<li>
<a href="catalogue/category/books/food-and-drink_33/index.html">
Food and Drink
</a>
</li>
<li>
<a href="catalogue/category/books/christian-fiction_34/index.html">
Christian Fiction
</a>
</li>
<li>
<a href="catalogue/category/books/business_35/index.html">
Business
</a>
</li>
<li>
<a href="catalogue/category/books/biography_36/index.html">
Biography
</a>
</li>
<li>
<a href="catalogue/category/books/thriller_37/index.html">
Thriller
</a>
</li>
<li>
<a href="catalogue/category/books/contemporary_38/index.html">
Contemporary
</a>
</li>
<li>
<a href="catalogue/category/books/spirituality_39/index.html">
Spirituality
</a>
</li>
<li>
<a href="catalogue/category/books/academic_40/index.html">
Academic
</a>
</li>
<li>
<a href="catalogue/category/books/self-help_41/index.html">
Self Help
</a>
</li>
<li>
<a href="catalogue/category/books/historical_42/index.html">
Historical
</a>
</li>
<li>
<a href="catalogue/category/books/christian_43/index.html">
Christian
</a>
</li>
<li>
<a href="catalogue/category/books/suspense_44/index.html">
Suspense
</a>
</li>
<li>
<a href="catalogue/category/books/short-stories_45/index.html">
Short Stories
</a>
</li>
<li>
<a href="catalogue/category/books/novels_46/index.html">
Novels
</a>
</li>
<li>
<a href="catalogue/category/books/health_47/index.html">
Health
</a>
</li>
<li>
<a href="catalogue/category/books/politics_48/index.html">
Politics
</a>
</li>
<li>
<a href="catalogue/category/books/cultural_49/index.html">
Cultural
</a>
</li>
<li>
<a href="catalogue/category/books/erotica_50/index.html">
Erotica
</a>
</li>
<li>
<a href="catalogue/category/books/crime_51/index.html">
Crime
</a>
</li>
</ul>
</li>
</ul>
</div>
</aside>
<div class="col-sm-8 col-md-9">
<div class="page-header action">
<h1>
All products
</h1>
</div>
<div id="messages">
</div>
<div id="promotions">
</div>
<form class="form-horizontal" method="get">
<div style="display:none">
</div>
<strong>
1000
</strong>
results - showing
<strong>
1
</strong>
to
<strong>
20
</strong>
.
</form>
<section>
<div class="alert alert-warning" role="alert">
<strong>
Warning!
</strong>
This is a demo website for web scraping purposes. Prices and ratings here were randomly assigned and have no real meaning.
</div>
<div>
<ol class="row">
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
<article class="product_pod">
<div class="image_container">
<a href="catalogue/a-light-in-the-attic_1000/index.html">
<img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/>
</a>
</div>
<p class="star-rating Three">
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
</p>
<h3>
<a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">
A Light in the ...
</a>
</h3>
<div class="product_price">
<p class="price_color">
£51.77
</p>
<p class="instock availability">
<i class="icon-ok">
</i>
In stock
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">
Add to basket
</button>
</form>
</div>
</article>
</li>
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
<article class="product_pod">
<div class="image_container">
<a href="catalogue/tipping-the-velvet_999/index.html">
<img alt="Tipping the Velvet" class="thumbnail" src="media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg"/>
</a>
</div>
<p class="star-rating One">
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
</p>
<h3>
<a href="catalogue/tipping-the-velvet_999/index.html" title="Tipping the Velvet">
Tipping the Velvet
</a>
</h3>
<div class="product_price">
<p class="price_color">
£53.74
</p>
<p class="instock availability">
<i class="icon-ok">
</i>
In stock
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">
Add to basket
</button>
</form>
</div>
</article>
</li>
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
<article class="product_pod">
<div class="image_container">
<a href="catalogue/soumission_998/index.html">
<img alt="Soumission" class="thumbnail" src="media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg"/>
</a>
</div>
<p class="star-rating One">
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
</p>
<h3>
<a href="catalogue/soumission_998/index.html" title="Soumission">
Soumission
</a>
</h3>
<div class="product_price">
<p class="price_color">
£50.10
</p>
<p class="instock availability">
<i class="icon-ok">
</i>
In stock
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">
Add to basket
</button>
</form>
</div>
</article>
</li>
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
<article class="product_pod">
<div class="image_container">
<a href="catalogue/sharp-objects_997/index.html">
<img alt="Sharp Objects" class="thumbnail" src="media/cache/32/51/3251cf3a3412f53f339e42cac2134093.jpg"/>
</a>
</div>
<p class="star-rating Four">
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
</p>
<h3>
<a href="catalogue/sharp-objects_997/index.html" title="Sharp Objects">
Sharp Objects
</a>
</h3>
<div class="product_price">
<p class="price_color">
£47.82
</p>
<p class="instock availability">
<i class="icon-ok">
</i>
In stock
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">
Add to basket
</button>
</form>
</div>
</article>
</li>
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
<article class="product_pod">
<div class="image_container">
<a href="catalogue/sapiens-a-brief-history-of-humankind_996/index.html">
<img alt="Sapiens: A Brief History of Humankind" class="thumbnail" src="media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c12a6.jpg"/>
</a>
</div>
<p class="star-rating Five">
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
</p>
<h3>
<a href="catalogue/sapiens-a-brief-history-of-humankind_996/index.html" title="Sapiens: A Brief History of Humankind">
Sapiens: A Brief History ...
</a>
</h3>
<div class="product_price">
<p class="price_color">
£54.23
</p>
<p class="instock availability">
<i class="icon-ok">
</i>
In stock
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">
Add to basket
</button>
</form>
</div>
</article>
</li>
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
<article class="product_pod">
<div class="image_container">
<a href="catalogue/the-requiem-red_995/index.html">
<img alt="The Requiem Red" class="thumbnail" src="media/cache/68/33/68339b4c9bc034267e1da611ab3b34f8.jpg"/>
</a>
</div>
<p class="star-rating One">
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
</p>
<h3>
<a href="catalogue/the-requiem-red_995/index.html" title="The Requiem Red">
The Requiem Red
</a>
</h3>
<div class="product_price">
<p class="price_color">
£22.65
</p>
<p class="instock availability">
<i class="icon-ok">
</i>
In stock
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">
Add to basket
</button>
</form>
</div>
</article>
</li>
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
<article class="product_pod">
<div class="image_container">
<a href="catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html">
<img alt="The Dirty Little Secrets of Getting Your Dream Job" class="thumbnail" src="media/cache/92/27/92274a95b7c251fea59a2b8a78275ab4.jpg"/>
</a>
</div>
<p class="star-rating Four">
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
</p>
<h3>
<a href="catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html" title="The Dirty Little Secrets of Getting Your Dream Job">
The Dirty Little Secrets ...
</a>
</h3>
<div class="product_price">
<p class="price_color">
£33.34
</p>
<p class="instock availability">
<i class="icon-ok">
</i>
In stock
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">
Add to basket
</button>
</form>
</div>
</article>
</li>
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
<article class="product_pod">
<div class="image_container">
<a href="catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html">
<img alt="The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull" class="thumbnail" src="media/cache/3d/54/3d54940e57e662c4dd1f3ff00c78cc64.jpg"/>
</a>
</div>
<p class="star-rating Three">
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
</p>
<h3>
<a href="catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html" title="The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull">
The Coming Woman: A ...
</a>
</h3>
<div class="product_price">
<p class="price_color">
£17.93
</p>
<p class="instock availability">
<i class="icon-ok">
</i>
In stock
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">
Add to basket
</button>
</form>
</div>
</article>
</li>
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
<article class="product_pod">
<div class="image_container">
<a href="catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html">
<img alt="The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics" class="thumbnail" src="media/cache/66/88/66883b91f6804b2323c8369331cb7dd1.jpg"/>
</a>
</div>
<p class="star-rating Four">
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
</p>
<h3>
<a href="catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html" title="The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics">
The Boys in the ...
</a>
</h3>
<div class="product_price">
<p class="price_color">
£22.60
</p>
<p class="instock availability">
<i class="icon-ok">
</i>
In stock
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">
Add to basket
</button>
</form>
</div>
</article>
</li>
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
<article class="product_pod">
<div class="image_container">
<a href="catalogue/the-black-maria_991/index.html">
<img alt="The Black Maria" class="thumbnail" src="media/cache/58/46/5846057e28022268153beff6d352b06c.jpg"/>
</a>
</div>
<p class="star-rating One">
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
</p>
<h3>
<a href="catalogue/the-black-maria_991/index.html" title="The Black Maria">
The Black Maria
</a>
</h3>
<div class="product_price">
<p class="price_color">
£52.15
</p>
<p class="instock availability">
<i class="icon-ok">
</i>
In stock
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">
Add to basket
</button>
</form>
</div>
</article>
</li>
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
<article class="product_pod">
<div class="image_container">
<a href="catalogue/starving-hearts-triangular-trade-trilogy-1_990/index.html">
<img alt="Starving Hearts (Triangular Trade Trilogy, #1)" class="thumbnail" src="media/cache/be/f4/bef44da28c98f905a3ebec0b87be8530.jpg"/>
</a>
</div>
<p class="star-rating Two">
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
</p>
<h3>
<a href="catalogue/starving-hearts-triangular-trade-trilogy-1_990/index.html" title="Starving Hearts (Triangular Trade Trilogy, #1)">
Starving Hearts (Triangular Trade ...
</a>
</h3>
<div class="product_price">
<p class="price_color">
£13.99
</p>
<p class="instock availability">
<i class="icon-ok">
</i>
In stock
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">
Add to basket
</button>
</form>
</div>
</article>
</li>
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
<article class="product_pod">
<div class="image_container">
<a href="catalogue/shakespeares-sonnets_989/index.html">
<img alt="Shakespeare's Sonnets" class="thumbnail" src="media/cache/10/48/1048f63d3b5061cd2f424d20b3f9b666.jpg"/>
</a>
</div>
<p class="star-rating Four">
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
</p>
<h3>
<a href="catalogue/shakespeares-sonnets_989/index.html" title="Shakespeare's Sonnets">
Shakespeare's Sonnets
</a>
</h3>
<div class="product_price">
<p class="price_color">
£20.66
</p>
<p class="instock availability">
<i class="icon-ok">
</i>
In stock
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">
Add to basket
</button>
</form>
</div>
</article>
</li>
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
<article class="product_pod">
<div class="image_container">
<a href="catalogue/set-me-free_988/index.html">
<img alt="Set Me Free" class="thumbnail" src="media/cache/5b/88/5b88c52633f53cacf162c15f4f823153.jpg"/>
</a>
</div>
<p class="star-rating Five">
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
</p>
<h3>
<a href="catalogue/set-me-free_988/index.html" title="Set Me Free">
Set Me Free
</a>
</h3>
<div class="product_price">
<p class="price_color">
£17.46
</p>
<p class="instock availability">
<i class="icon-ok">
</i>
In stock
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">
Add to basket
</button>
</form>
</div>
</article>
</li>
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
<article class="product_pod">
<div class="image_container">
<a href="catalogue/scott-pilgrims-precious-little-life-scott-pilgrim-1_987/index.html">
<img alt="Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)" class="thumbnail" src="media/cache/94/b1/94b1b8b244bce9677c2f29ccc890d4d2.jpg"/>
</a>
</div>
<p class="star-rating Five">
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
</p>
<h3>
<a href="catalogue/scott-pilgrims-precious-little-life-scott-pilgrim-1_987/index.html" title="Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)">
Scott Pilgrim's Precious Little ...
</a>
</h3>
<div class="product_price">
<p class="price_color">
£52.29
</p>
<p class="instock availability">
<i class="icon-ok">
</i>
In stock
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">
Add to basket
</button>
</form>
</div>
</article>
</li>
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
<article class="product_pod">
<div class="image_container">
<a href="catalogue/rip-it-up-and-start-again_986/index.html">
<img alt="Rip it Up and Start Again" class="thumbnail" src="media/cache/81/c4/81c4a973364e17d01f217e1188253d5e.jpg"/>
</a>
</div>
<p class="star-rating Five">
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
</p>
<h3>
<a href="catalogue/rip-it-up-and-start-again_986/index.html" title="Rip it Up and Start Again">
Rip it Up and ...
</a>
</h3>
<div class="product_price">
<p class="price_color">
£35.02
</p>
<p class="instock availability">
<i class="icon-ok">
</i>
In stock
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">
Add to basket
</button>
</form>
</div>
</article>
</li>
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
<article class="product_pod">
<div class="image_container">
<a href="catalogue/our-band-could-be-your-life-scenes-from-the-american-indie-underground-1981-1991_985/index.html">
<img alt="Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991" class="thumbnail" src="media/cache/54/60/54607fe8945897cdcced0044103b10b6.jpg"/>
</a>
</div>
<p class="star-rating Three">
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
</p>
<h3>
<a href="catalogue/our-band-could-be-your-life-scenes-from-the-american-indie-underground-1981-1991_985/index.html" title="Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991">
Our Band Could Be ...
</a>
</h3>
<div class="product_price">
<p class="price_color">
£57.25
</p>
<p class="instock availability">
<i class="icon-ok">
</i>
In stock
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">
Add to basket
</button>
</form>
</div>
</article>
</li>
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
<article class="product_pod">
<div class="image_container">
<a href="catalogue/olio_984/index.html">
<img alt="Olio" class="thumbnail" src="media/cache/55/33/553310a7162dfbc2c6d19a84da0df9e1.jpg"/>
</a>
</div>
<p class="star-rating One">
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
</p>
<h3>
<a href="catalogue/olio_984/index.html" title="Olio">
Olio
</a>
</h3>
<div class="product_price">
<p class="price_color">
£23.88
</p>
<p class="instock availability">
<i class="icon-ok">
</i>
In stock
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">
Add to basket
</button>
</form>
</div>
</article>
</li>
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
<article class="product_pod">
<div class="image_container">
<a href="catalogue/mesaerion-the-best-science-fiction-stories-1800-1849_983/index.html">
<img alt="Mesaerion: The Best Science Fiction Stories 1800-1849" class="thumbnail" src="media/cache/09/a3/09a3aef48557576e1a85ba7efea8ecb7.jpg"/>
</a>
</div>
<p class="star-rating One">
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
</p>
<h3>
<a href="catalogue/mesaerion-the-best-science-fiction-stories-1800-1849_983/index.html" title="Mesaerion: The Best Science Fiction Stories 1800-1849">
Mesaerion: The Best Science ...
</a>
</h3>
<div class="product_price">
<p class="price_color">
£37.59
</p>
<p class="instock availability">
<i class="icon-ok">
</i>
In stock
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">
Add to basket
</button>
</form>
</div>
</article>
</li>
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
<article class="product_pod">
<div class="image_container">
<a href="catalogue/libertarianism-for-beginners_982/index.html">
<img alt="Libertarianism for Beginners" class="thumbnail" src="media/cache/0b/bc/0bbcd0a6f4bcd81ccb1049a52736406e.jpg"/>
</a>
</div>
<p class="star-rating Two">
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
</p>
<h3>
<a href="catalogue/libertarianism-for-beginners_982/index.html" title="Libertarianism for Beginners">
Libertarianism for Beginners
</a>
</h3>
<div class="product_price">
<p class="price_color">
£51.33
</p>
<p class="instock availability">
<i class="icon-ok">
</i>
In stock
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">
Add to basket
</button>
</form>
</div>
</article>
</li>
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
<article class="product_pod">
<div class="image_container">
<a href="catalogue/its-only-the-himalayas_981/index.html">
<img alt="It's Only the Himalayas" class="thumbnail" src="media/cache/27/a5/27a53d0bb95bdd88288eaf66c9230d7e.jpg"/>
</a>
</div>
<p class="star-rating Two">
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
</p>
<h3>
<a href="catalogue/its-only-the-himalayas_981/index.html" title="It's Only the Himalayas">
It's Only the Himalayas
</a>
</h3>
<div class="product_price">
<p class="price_color">
£45.17
</p>
<p class="instock availability">
<i class="icon-ok">
</i>
In stock
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">
Add to basket
</button>
</form>
</div>
</article>
</li>
</ol>
<div>
<ul class="pager">
<li class="current">
Page 1 of 50
</li>
<li class="next">
<a href="catalogue/page-2.html">
next
</a>
</li>
</ul>
</div>
</div>
</section>
</div>
</div>
<!-- /row -->
</div>
<!-- /page_inner -->
</div>
<!-- /container-fluid -->
<footer class="footer container-fluid">
</footer>
<!-- jQuery -->
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js">
</script>
<script>
window.jQuery || document.write('<script src="static/oscar/js/jquery/jquery-1.9.1.min.js"><\/script>')
</script>
<!-- Twitter Bootstrap -->
<script src="static/oscar/js/bootstrap3/bootstrap.min.js" type="text/javascript">
</script>
<!-- Oscar -->
<script charset="utf-8" src="static/oscar/js/oscar/ui.js" type="text/javascript">
</script>
<script charset="utf-8" src="static/oscar/js/bootstrap-datetimepicker/bootstrap-datetimepicker.js" type="text/javascript">
</script>
<script charset="utf-8" src="static/oscar/js/bootstrap-datetimepicker/locales/bootstrap-datetimepicker.all.js" type="text/javascript">
</script>
<script type="text/javascript">
$(function() {
oscar.init();
oscar.search.init();
});
</script>
<!-- Version: N/A -->
</body>
</html>
After scrolling down a bit, you'll notice a pattern in how the book information is laid out. We can use this pattern to our advantage to automatically grab various information about every book listed. Here's a closer look at how the HTML for the first book is structured:
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
<article class="product_pod">
<div class="image_container">
<a href="catalogue/a-light-in-the-attic_1000/index.html">
<img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/>
</a>
</div>
<p class="star-rating Three">
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
<i class="icon-star">
</i>
</p>
<h3>
<a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">
A Light in the ...
</a>
</h3>
<div class="product_price">
<p class="price_color">
£51.77
</p>
<p class="instock availability">
<i class="icon-ok">
</i>
In stock
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">
Add to basket
</button>
</form>
</div>
</article>
</li>
All book items seem to share the above structure, with similar class names. Now we can use the methods mentioned above to find the title and price of every book.
book_list = []
# Find all 'li' elements that have this specific class; find_all returns a list
book_listings = soup.find_all('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3')
for book in book_listings:
    book_data = {}
    book_data['title'] = book.h3.a.get('title')  # h3 -> a -> title attribute
    book_data['price'] = book.find('p', class_='price_color').text
    book_list.append(book_data)
print(book_list)
Output
[{'title': 'A Light in the Attic', 'price': '£51.77'},
{'title': 'Tipping the Velvet', 'price': '£53.74'},
{'title': 'Soumission', 'price': '£50.10'},
{'title': 'Sharp Objects', 'price': '£47.82'},
{'title': 'Sapiens: A Brief History of Humankind', 'price': '£54.23'},
{'title': 'The Requiem Red', 'price': '£22.65'},
{'title': 'The Dirty Little Secrets of Getting Your Dream Job',
'price': '£33.34'},
{'title': 'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull',
'price': '£17.93'},
{'title': 'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics',
'price': '£22.60'},
{'title': 'The Black Maria', 'price': '£52.15'},
{'title': 'Starving Hearts (Triangular Trade Trilogy, #1)',
'price': '£13.99'},
{'title': "Shakespeare's Sonnets", 'price': '£20.66'},
{'title': 'Set Me Free', 'price': '£17.46'},
{'title': "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)",
'price': '£52.29'},
{'title': 'Rip it Up and Start Again', 'price': '£35.02'},
{'title': 'Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991',
'price': '£57.25'},
{'title': 'Olio', 'price': '£23.88'},
{'title': 'Mesaerion: The Best Science Fiction Stories 1800-1849',
'price': '£37.59'},
{'title': 'Libertarianism for Beginners', 'price': '£51.33'},
{'title': "It's Only the Himalayas", 'price': '£45.17'}]
See how straightforward that was? You just need to spot the patterns in the parsed HTML structure and then use BeautifulSoup's tools to target and extract the data you want.
Using Generative AI
But what if the HTML has no consistent pattern like the one above? Websites aren't always perfectly structured, and similar pieces of information might be tagged differently across a site. Instead of writing a separate search for each case, you can use ChatGPT to extract the information you want.
To do this, give ChatGPT or any other LLM a prompt that contains the HTML along with a few examples of the data you want to extract. Let's take a look at an example:
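As a minimal sketch, a prompt of this kind could be assembled as shown here. The function name build_extraction_prompt, the wording of the instructions, and the sample record are illustrative assumptions. Sending the prompt to a model is left out because it depends on the client library you use.

```python
import json

def build_extraction_prompt(html_snippet, example_records):
    """Assemble a plain-text prompt asking an LLM to extract records as JSON."""
    # Show the model the exact output shape we expect (hypothetical examples).
    examples = json.dumps(example_records, ensure_ascii=False, indent=2)
    return (
        "Extract every book from the HTML below.\n"
        "Return only a JSON list of objects with the same keys as these examples:\n"
        f"{examples}\n\n"
        "HTML:\n"
        f"{html_snippet}"
    )

# A shortened piece of the book listing HTML shown earlier.
SAMPLE_HTML = """<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<p class="price_color">£51.77</p>"""

EXAMPLE_RECORDS = [{"title": "A Light in the Attic", "price": "£51.77"}]

prompt = build_extraction_prompt(SAMPLE_HTML, EXAMPLE_RECORDS)
print(prompt)
```

Asking for JSON with fixed keys makes the model's reply easier to load back into Python, although the reply should still be validated before it is trusted.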
Once you have that initial data, feed it to a Generative AI model to:
Clean up inconsistencies in the data.
Extract additional insights from text-heavy content (sentiment analysis, keyword extraction).
Generate summaries or categorize the data.
Best Practices for Web Scraping
When engaging in web scraping, it's crucial to follow best practices to ensure your activities are effective, ethical, and respectful of website policies. Here are some key points to consider:
Be Aware of Anti-Scraping Measures
Websites may implement measures to detect and block scraping activities. It's important to be mindful of this and to scrape responsibly to avoid getting blocked.
Respect Robots.txt
Before scraping a site, always check its robots.txt file. This file indicates which parts of the site can be accessed by web crawlers. Respecting the rules laid out in robots.txt helps avoid unintentional overloading of servers. You can find this file by adding /robots.txt to the end of the website's domain name (e.g., https://www.example.com/robots.txt).
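Python's standard library can read these rules for you. The sketch below parses a made-up robots.txt so that it runs offline; the file contents, the user agent name my-scraper, and the example.com URLs are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def build_parser(robots_txt):
    """Return a RobotFileParser loaded with the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(SAMPLE_ROBOTS_TXT)

# Check whether our (hypothetical) user agent may fetch each URL.
print(parser.can_fetch("my-scraper", "https://www.example.com/catalogue/page-2.html"))
print(parser.can_fetch("my-scraper", "https://www.example.com/private/data.html"))

# Crawl-delay, if the site declares one, suggests a pause between requests.
print(parser.crawl_delay("my-scraper"))
```

Against a live site you would instead call parser.set_url() with the robots.txt address and then parser.read() to download and parse the file.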
Implement Rate Limiting
To prevent overloading the target server, introduce delays between your requests. This practice, known as rate limiting, keeps your request rate closer to that of a human visitor. A simple way to do this is to call the time.sleep() function from the time module between requests.
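One way to apply this is sketched below. The helper name fetch_politely and its parameters are hypothetical; the function that actually downloads a page is passed in as fetch, so the same loop works with whichever HTTP library you use.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first.
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results

# Stand-in fetch function so the sketch runs without network access.
pages = fetch_politely(
    ["catalogue/page-1.html", "catalogue/page-2.html"],
    fetch=lambda url: f"<html>{url}</html>",
    delay_seconds=0.1,
)
print(len(pages))
```

A fixed delay is the simplest choice. If the site's robots.txt declares a crawl delay, using that value for delay_seconds is a reasonable starting point.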
Handle Errors Gracefully
Web scraping can run into various issues, such as network errors, missing pages, or server blocks. Implement proper error handling with try...except blocks so that your scraper can manage these situations without crashing.
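The sketch below shows the general shape of such handling. It uses urllib from the standard library as a stand-in so that it has no third-party dependencies; the function name fetch_html and its opener parameter are assumptions for illustration, and the same pattern applies to other HTTP libraries with their own exception classes.

```python
import urllib.error
import urllib.request

def fetch_html(url, opener=urllib.request.urlopen, timeout=10):
    """Return the page body as text, or None if the request fails."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 503.
        print(f"HTTP error {err.code} for {url}")
    except urllib.error.URLError as err:
        # The server could not be reached (DNS failure, refused connection, ...).
        print(f"Network problem for {url}: {err.reason}")
    except OSError as err:
        # Lower-level failures, including socket timeouts while reading.
        print(f"I/O problem for {url}: {err}")
    return None
```

Because HTTPError is a subclass of URLError, it has to be caught first. Returning None lets the calling loop skip a failed page and continue with the next one.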
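If a website uses JavaScript to load content dynamically, that content may be missing from the HTML your scraper receives, because BeautifulSoup only parses HTML and does not run scripts. In that case, you may need a tool like Selenium or Playwright to render the page fully before scraping.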
Conclusion
Web scraping, especially with a tool like BeautifulSoup, opens up a world of possibilities for gathering and making sense of information on the internet. As you become more comfortable with the techniques, remember that responsible scraping involves respecting website policies and using these tools ethically. And as AI technology advances, we can expect even more powerful ways to automate and gain insights from the vast amount of data available online.