How to Build a Webscraper
A Step-by-Step Guide to Scraping the Web with Node.js and Playwright
Sometimes you need data from a webpage that changes continuously, but the website does not provide a good API to get it. In such a case, it can be a good option to scrape that data directly from the website's HTML. Before you start, however, you need to know that many sites prohibit scraping in their terms of service. If the site allows scraping, you can continue and start building a web scraper.
What is a webscraper?
A webscraper is an automated tool that reads the HTML of a website to gather the information within the webpage. That can be very useful if no REST API is available but the data is needed for specific use cases. Since a webscraper needs to load a complete webpage, it is much slower than a REST API, for example, and should only be used if there is no better way to get the data.
Introduction
In this guide I will show you how to write a scraper using Node.js and Playwright. To be clear, this is not the only way to write a webscraper, but it is very easy with Playwright, because all you need to know are the basics of JavaScript and how the HTML DOM works.
The example that we build will open Google search and scrape the first page of results for a query. The code for it is shared in this GitHub repository. Feel free to take a look at it and try it yourself.
Setup
First, before we can start to code, we need to set up our project folder and install everything we need. For that, create a new directory with mkdir webscraper and go into it with cd webscraper. After that, you can initialize the npm project with npm init -y, which will create a package.json that will contain the needed dependencies. In our case, the only needed dependency is Playwright, which can be installed with npm install --save playwright. That will install the Playwright Node package that we will use. To make sure that the browsers Playwright will start are also installed, run npx playwright install. That should install all browsers that can be used by Playwright.
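In summary, the setup boils down to these shell commands (they simply mirror the steps above):
mkdir webscraper
cd webscraper
npm init -y
npm install --save playwright
npx playwright install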
Open the Browser
Now we have everything set up that we need to use Playwright in Node.js. As a first test, we will start with a script that opens the browser and closes it again after 10 seconds. For that, create an index.js file, fill it with the following code, and run it with node index.js:
const { chromium } = require("playwright");
async function sleep(ms) {
await new Promise((resolve) => setTimeout(resolve, ms));
}
(async () => {
// open browser tab
const browser = await chromium.launch({ headless: false });
const context = await browser.newContext({
locale: "en-GB",
});
const page = await context.newPage();
// wait and close browser
await sleep(10000);
await browser.close();
})()
.then(() => {
process.exit(0);
})
.catch((e) => {
console.error(e);
process.exit(1);
});
The sleep function at the top is only a helper so that we can add some waiting to our code. In theory, we should be able to remove all sleep calls at the end to speed up the scraping process, but while writing the scraper it is helpful to see what is happening in the browser.
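As a side note, Playwright also ships a built-in pause on the page object, so the custom sleep helper could be replaced once a page exists; a minimal sketch:
// equivalent pause using Playwright's built-in helper (only available on a page or frame)
await page.waitForTimeout(2000); // pause for 2 seconds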
The main function then contains the actual logic: it opens the browser in a non-headless (visible) configuration with an English locale, waits for 10 seconds, and closes the browser again.
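Once the script works, you could switch to a headless configuration so that no browser window is shown while scraping; a small sketch of that change:
// run the browser without a visible window once debugging is done
const browser = await chromium.launch({ headless: true });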
Open a page and interact with accessibility selectors
Now, let's extend our script and open Google. Here we first have to accept or reject the cookie modal and, after that, we are able to interact with the Google search bar:
const page = await context.newPage();
await sleep(2000);
// open url
await page.goto("https://google.com");
await sleep(2000);
await page.getByRole("button", { name: "Reject all" }).click();
await sleep(2000);
await page.getByRole("combobox", { name: "Search" }).fill("What is a wombat?");
await sleep(2000);
await page.getByRole("combobox", { name: "Search" }).press("Enter");
await sleep(2000);
As you see here, we use the goto method to open a specific web page. After that, we use the getByRole selector to click the reject button of the cookie modal. This selector is one of a few accessibility selectors, inspired by the Testing Library and built into the Playwright API.
Accessibility selectors are often the easiest way to access elements in the DOM if the web page is set up properly, since it is more natural to access elements by their role or by a describing label. To check the accessibility properties of an element, you can select it in the inspector of the Chrome DevTools and open the Accessibility tab. There you can see properties such as Role and Name, which let you easily get the element with a selector like getByRole.
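To illustrate, here are a few of Playwright's accessibility-based locators; the roles, names, and labels below are hypothetical and depend on the page you are scraping:
// hypothetical elements -- adjust role, name, and label to the actual page
await page.getByRole("button", { name: "Submit" }).click();
await page.getByLabel("Email address").fill("jane@example.com");
await page.getByRole("heading", { name: "Results" }).innerText();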
After we have rejected the cookie modal, we can fill out our Google search with the same getByRole method and press Enter to go to the results.
Access data and interact with general locators
Now we are on the page that we want to gather the data from, so we can start to scrape it right away. For that, we first need to get the correct elements.
await page.getByRole("combobox", { name: "Search" }).press("Enter");
await sleep(2000);
// get all search result from page 1
const resultLocators = await page.locator("#search").locator("#rso").locator("div.K7khPe").all();
console.log("result count:", resultLocators.length);
const results = [];
for (let result of resultLocators) {
// read title and link
const title = await result.getByRole("heading").innerText();
const link = await result.getByRole("link").first().getAttribute("href");
results.push({ title, link });
}
// output all links
console.log(results);
This time we used the locator method, which allows us to access elements with CSS-like selectors. You can also test these selectors in the browser's console by using the document.querySelector method. The translation to querySelector would be document.querySelector("#search").querySelector("#rso").querySelectorAll("div.K7khPe"), which results in the same elements as Playwright's locator chain.
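If you want to verify such a selector chain before putting it into the script, you can paste roughly the following into the browser console on the results page (keep in mind that class names like K7khPe are generated and may change over time):
// run in the browser console on the Google results page
const container = document.querySelector("#search").querySelector("#rso");
const resultNodes = container.querySelectorAll("div.K7khPe");
console.log(resultNodes.length);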
After loading all result elements, we loop over them and combine each result locator with a getByRole selector to read the title and the link of that search result. We push these into a results array so that we can output everything to the console.
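If you want to keep the scraped data instead of only logging it, one option (not part of the example above) is to write the results array to a JSON file with Node's fs module:
const fs = require("fs");
// save the scraped results next to the script; the file name is arbitrary
fs.writeFileSync("results.json", JSON.stringify(results, null, 2));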
Full Example Code
Now let's take a look at the full example code. You can run it again with node index.js, and you should see the results in the opened browser and on the console. You can also find the full example code in this GitHub repository.
const { chromium } = require("playwright");
async function sleep(ms) {
await new Promise((resolve) => setTimeout(resolve, ms));
}
(async () => {
// open browser tab
const browser = await chromium.launch({ headless: false });
const context = await browser.newContext({
locale: "en-GB",
});
const page = await context.newPage();
await sleep(2000);
// open url
await page.goto("https://google.com");
await sleep(2000);
await page.getByRole("button", { name: "Reject all" }).click();
await sleep(2000);
await page.getByRole("combobox", { name: "Search" }).fill("What is a wombat?");
await sleep(2000);
await page.getByRole("combobox", { name: "Search" }).press("Enter");
await sleep(2000);
// get all search result from page 1
const resultLocators = await page
.locator("#search")
.locator("#rso")
.locator("div.K7khPe")
.all();
console.log("result count:", resultLocators.length);
const results = [];
for (let result of resultLocators) {
// read title and link
const title = await result.getByRole("heading").innerText();
const link = await result.getByRole("link").first().getAttribute("href");
results.push({ title, link });
}
// output all links
console.log(results);
// wait and close browser
await sleep(10000);
await browser.close();
})()
.then(() => {
process.exit(0);
})
.catch((e) => {
console.error(e);
process.exit(1);
});
Conclusion
Now we have all the knowledge we need for basic web scraping. In the end, it's nothing more than this. Sometimes it gets more complicated because accessibility selectors do not work well or because of other obstacles such as iframes, but all of these are just additional small challenges on top of normal web scraping.
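As one example of such a challenge, Playwright can reach into iframes with frameLocator; the selector below is hypothetical and depends on the page:
// hypothetical iframe selector -- adjust it to the page you scrape
const consentFrame = page.frameLocator("iframe#consent");
await consentFrame.getByRole("button", { name: "Accept all" }).click();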
A tool like Playwright can not only be used to scrape data from a web page but also to automate processes within a web browser, for things like bots or other automations. But always be aware of the terms of service of the sites where you want to use something like this, since it could be prohibited there.
And now it's time for you to build your first webscraper for whatever you need it for. If you need a starting point, you can take a look at this GitHub repository, which contains the example from this guide. Have fun and, if you want, reach out to me to tell me more about your project. I am always happy to talk about interesting projects.