What is Data Scraping and How to Do It Right
Data scraping is a technique in which a computer program extracts data from the human-readable output of another program (for example, email addresses from a web page). In modern development, we most often reach for it when a website does not provide an API. Before we do that, however, we need to make sure the website owner allows scraping and hasn't forbidden it.
Building a website scraper
In this short article, I will explain how to implement such a solution with JavaScript, the common language of the web. I created a website that we will target with our simple web scraper. This website can be found here.
It is just a shopping cart with some content. All the resulting data and scripts described below are available in the article's GitHub repository.
Our ultimate goal is to get the information about these products in a format that works for us and is easy to reuse later (a Google Sheet, for example).
As you can see, every shopping cart item consists of information that we can describe with four properties: name, description, quantity, and price. It would be convenient to have, for example, a .json file with this data that we could later import into a database. For quick data extraction like this, we often use frontend scrapers. For our purposes, plain JavaScript executed directly in the browser console is enough. Here is an example code snippet; you can also find it in the article repository under the name data-scraper.user.js.
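A minimal sketch of such a console script is shown below. The card selector div.card.mb-3 comes from the page described later in the article; the child selectors and the download-via-Blob trick are assumptions, so adjust them to the real markup (the full version is data-scraper.user.js in the repository):

```javascript
// Sketch of a browser-console scraper; child selectors are assumptions about the markup.
const items = Array.from(document.querySelectorAll('div.card.mb-3')).map((card) => ({
  name: card.querySelector('.card-title')?.textContent.trim(),        // hypothetical selector
  description: card.querySelector('.card-text')?.textContent.trim(),  // hypothetical selector
  quantity: card.querySelector('.quantity')?.textContent.trim(),      // hypothetical selector
  price: card.querySelector('.price')?.textContent.trim(),            // hypothetical selector
}));

// Offer the result as a downloadable JSON file (one common browser-side approach).
const blob = new Blob([JSON.stringify(items, null, 2)], { type: 'application/json' });
const link = document.createElement('a');
link.href = URL.createObjectURL(blob);
link.download = 'data.json';
link.click();
```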
You can also watch a demonstration of how it works live in this TikTok video.
With document.querySelectorAll('div.card.mb-3') we receive a NodeList of all these items.
To use Array methods, we need to convert this NodeList into an array. We achieve this with Array.from, which creates a new, shallow-copied Array instance from an iterable or array-like object. After this, we can create a new array of plain objects. In JavaScript, a plain object is a set of key-value pairs created with the {} object literal notation; data in web development is usually passed around as an array of such plain objects. In the map callback, for each of these DOM elements we use the querySelector method to extract information from its child elements, those below it in the hierarchy. The objects returned from the map callback become the items of the new array. In the end, we get an array of plain objects with the following structure (see the data.json file):
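Each item in that array looks roughly like this (the concrete values here are illustrative, not taken from the real data.json):

```json
[
  {
    "name": "Example product",
    "description": "Short product description",
    "quantity": "1",
    "price": "9.99"
  }
]
```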
If you paste the script into the browser console and press Enter, the result file will be downloaded to your PC, and you can open it directly. Such an approach works, but it has a drawback: it cannot be started automatically, for example from cron jobs or CI pipelines.
Using the Puppeteer library for scraping
To do it the right way, we can write an npm-based script that looks almost the same as the frontend one. For this purpose, we will use Puppeteer, a Node.js library that provides a high-level API to control Google Chrome/Chromium over the DevTools Protocol. Puppeteer allows us to execute JavaScript at page level, which makes it very useful for scraping tasks. You can watch a demonstration of the whole approach in this TikTok video.
To install Puppeteer from npm, we first need an npm project, which we initialize with the command npm init. This command asks several questions and creates a package.json file. Once you have a package.json file, you can install Puppeteer with npm i puppeteer. This will add a record to the package.json file like this:
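The dependency entry looks roughly like the following; the exact version depends on when you install Puppeteer, so treat the number here as a placeholder:

```json
{
  "dependencies": {
    "puppeteer": "^21.0.0"
  }
}
```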
When the project needs to be started on another PC, running npm install or npm i (the short form) is enough to fetch all the necessary libraries. This is how our Node.js-based version looks (see the node-scrapper.js file):
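The original file is in the repository; a minimal sketch of the same approach is shown below. The output file name result-data.json and the card selector come from the article, while the target URL and the child selectors inside evaluate are placeholders and assumptions:

```javascript
// node-scrapper.js (sketch): scrape the demo shop with Puppeteer and save the data as JSON.
const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  // Launch a headless Chromium instance and open a new page.
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigate to the target website (placeholder URL, replace with the real one).
  await page.goto('https://example.com/shop');

  // Wait until the product cards are rendered.
  await page.waitForSelector('div.card.mb-3');

  // Run the same extraction logic as in the browser console, inside the page context.
  const dataItems = await page.evaluate(() =>
    Array.from(document.querySelectorAll('div.card.mb-3')).map((card) => ({
      name: card.querySelector('.card-title')?.textContent.trim(),       // hypothetical selector
      description: card.querySelector('.card-text')?.textContent.trim(), // hypothetical selector
      quantity: card.querySelector('.quantity')?.textContent.trim(),     // hypothetical selector
      price: card.querySelector('.price')?.textContent.trim(),           // hypothetical selector
    }))
  );

  // Persist the result and clean up.
  fs.writeFileSync('result-data.json', JSON.stringify(dataItems, null, 2));
  await page.close();
  await browser.close();
})();
```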
To be able to run it with npm, you have to add such a scripts section to the package.json file:
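Assuming the scraper lives in node-scrapper.js (as mentioned above), an entry like this wires it up to npm run scrape:

```json
{
  "scripts": {
    "scrape": "node node-scrapper.js"
  }
}
```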
This way, you will be able to run it with the command npm run scrape and ensure that it uses the binaries installed inside node_modules.
With this snippet, we import the needed libraries:
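Assuming CommonJS modules, the imports are Puppeteer itself and the built-in fs module used later to write the JSON file:

```javascript
// Puppeteer drives the browser; fs writes the resulting JSON file to disk.
const puppeteer = require('puppeteer');
const fs = require('fs');
```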
This part is an IIFE (Immediately Invoked Function Expression), a JavaScript function that runs as soon as it is defined. By using it, we avoid polluting the global namespace. By marking it async, we say that inside the function we will await promises. Promises are the approach JavaScript uses to handle asynchronous logic. You can think of a promise as something like an egg: it can hatch into a chicken, it can break, or it can still be waiting to hatch. After waiting, we end up with either a chicken or an exception. So we start the browser:
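A sketch of the async IIFE wrapper with the launch call (the remaining steps go inside it):

```javascript
(async () => {
  // Start a headless browser instance and wait for it to be ready.
  const browser = await puppeteer.launch();
  // ... the remaining steps go here ...
})();
```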
And wait for it to launch.
Then we will go to the website:
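Assuming a placeholder URL for the demo shop (replace it with the real address of the target page):

```javascript
// Open a new tab and navigate to the target page.
const page = await browser.newPage();
await page.goto('https://example.com/shop');
```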
And wait until the goods appear:
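Using the same card selector as in the frontend version:

```javascript
// Block until at least one product card is present in the DOM.
await page.waitForSelector('div.card.mb-3');
```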
The interesting part is that Puppeteer can execute JavaScript at page level and return the result:
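This is the same mapping logic as in the console script, run through page.evaluate (the child selectors are again assumptions about the markup):

```javascript
// Run the extraction inside the page context and get a serializable result back.
const dataItems = await page.evaluate(() =>
  Array.from(document.querySelectorAll('div.card.mb-3')).map((card) => ({
    name: card.querySelector('.card-title')?.textContent.trim(),
    description: card.querySelector('.card-text')?.textContent.trim(),
    quantity: card.querySelector('.quantity')?.textContent.trim(),
    price: card.querySelector('.price')?.textContent.trim(),
  }))
);
```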
Inside the dataItems constant, we will have the same data structure as in the pure frontend version. Then we just write the data to result-data.json, close the page, and close the browser. The resulting file has the same structure as the data.json shown earlier.
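A sketch of that final step, reusing the fs module imported earlier:

```javascript
// Save the scraped data as pretty-printed JSON and shut everything down.
fs.writeFileSync('result-data.json', JSON.stringify(dataItems, null, 2));
await page.close();
await browser.close();
```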
Conclusion
Data scraping tools are powerful resources that allow users to extract valuable information from various websites and sources. This article walked through a practical example using Puppeteer, a popular tool for web scraping.
The code snippet demonstrated how to scrape data from a website, specifically retrieving product details such as name, description, quantity, and price. It highlighted the use of JavaScript functions like querySelector and evaluate to navigate the page and extract relevant information.
Overall, data scraping tools offer a wide range of possibilities for collecting and analyzing data from various sources. They can be used for tasks such as email address extraction, screen scraping, content aggregation, and competitor analysis. These tools leverage technologies like optical character recognition, pattern recognition engines, and application programming interfaces to gather data efficiently.
However, it is crucial to consider ethical concerns and adhere to privacy policies while using data scraping tools. It is important to respect the terms of service of the websites being scraped and ensure that the process does not infringe upon user privacy or violate any legal regulations.