Proxyrack - July 19, 2023

Getting Started With Puppeteer and Web Scraping


Not all websites and web applications expose their data through APIs; you may need to scrape a website for additional information. 

Following a traditional approach—which may involve writing complex queries or copying and pasting—can take time and effort. However, solutions like Puppeteer allow you to automate your web scraping. 

This post will be your Puppeteer tutorial, where we’ll explore the fundamentals of web scraping with Puppeteer. We’ll cover topics such as setting up Puppeteer, navigating web pages, interacting with elements, handling dynamic content, and extracting data. 

By the end of this tutorial, you'll have a solid foundation and the confidence to tackle your web scraping projects using Puppeteer. 

What Is Web Scraping?

Web scraping is the automated extraction of information from websites. It entails writing code that visits websites programmatically, parses their HTML structure, and extracts the specific information of interest. 

Web scraping allows you to collect vast volumes of data efficiently. You may use this data for various reasons, including market research, competitive analysis, data analysis, or content aggregation. For example, you may scrape travel websites to extract flight details, hotel prices, reviews, and availability for travel planning or price comparison. 

However, it's crucial to be mindful of legal and ethical considerations while scraping websites, respecting website terms of service, and ensuring compliance with data privacy regulations. 

What Is Puppeteer?

Puppeteer is an open-source Node.js library developed by the Chrome team at Google that provides a high-level API for controlling Chrome or Chromium. Its headless mode lets you run the browser in the background without a visible graphical interface, enabling efficient, automated web scraping. 

Puppeteer simplifies the web scraping process by abstracting away the complexities of rendering web pages, executing JavaScript, and navigating websites. It allows you to perform actions on web pages—such as clicking buttons, filling out forms, scrolling, and capturing screenshots—just like you’d do using any browser. 

With Puppeteer, you can scrape data from both static and dynamic websites. It supports modern JavaScript syntax and provides a user-friendly API, making it easier to write concise, readable code. 
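
As a quick taste of the API before we dive into the full setup, here’s a minimal sketch (assuming Puppeteer is already installed and your project is configured for ES modules, both covered below) that opens a page and logs its title:

import puppeteer from "puppeteer";

(async () => {
  const browser = await puppeteer.launch(); // headless Chrome by default
  const page = await browser.newPage();
  await page.goto("https://books.toscrape.com/"); // the sandbox site used later in this tutorial
  console.log(await page.title()); // prints the page's <title> text
  await browser.close();
})();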

Puppeteer Tutorial

This tutorial will help you get started with web scraping using Puppeteer, with the bookstore from ToScrape’s web scraping sandbox as the target site. 

To follow along, you’ll need: 

  • basic HTML, CSS, and JavaScript knowledge;

  • Node.js installed; and

  • a code editor.

Initialization

Create a folder to hold your project code. In your terminal, run the following to create the folder and move into it: 

mkdir puppeteer-tutorial && cd puppeteer-tutorial

We'll name this folder puppeteer-tutorial, but you can name it anything you want. 

Since you’ll need to use the Puppeteer library, initialize npm, generating a package.json file that allows you to manage your project’s dependencies. 

npm init -y

You can pass the -y flag to skip answering the prompts, and it’ll create the package.json file with default values. 

The package.json file allows you to define custom scripts that automate common tasks like running your web scraping code, starting a development server, running tests, or building your project for deployment. 

Next, you’ll install Puppeteer using the following command: 

npm install puppeteer --save

When you install Puppeteer, it also downloads a version of Chromium for your operating system that’s guaranteed to work with that Puppeteer release. 

To run your web scraper, you can use node index.js directly or add a start script to the package.json file. To do the latter, find the scripts section and, just below the test script, add "start": "node index.js".

Now you can run npm start. Don’t forget to include a comma after the test script. 

{
  "name": "puppeteer-tutorial",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1",
    "start": "node index.js"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "dependencies": {
    "puppeteer": "^20.7.4"
  }
}

 

This is optional, but if you want to use ES module syntax (the import/export style used in the examples below), modify the package.json by adding "type": "module". 

{
  "name": "puppeteer-tutorial",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "type": "module",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1",
    "start": "node index.js"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "dependencies": {
    "puppeteer": "^20.7.4"
  }
}


Setting Up Your Browser

Now, consider how you normally use a browser to access a website. We’ll replicate those steps, but programmatically. 

Create an index.js file in your puppeteer-tutorial folder. Then, inside, add the code that follows. 

Import Puppeteer to get access to its functions, classes, and methods necessary to control a headless Chrome or Chromium browser and perform web scraping tasks. 

import puppeteer from "puppeteer";

(async () => {
  // Launch a headless browser and open a new tab.
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setViewport({ width: 1920, height: 1080 });

  // Navigate and wait until the DOM content has loaded.
  await page.goto("https://www.proxyrack.com/", {
    waitUntil: "domcontentloaded",
  });

  // Save a screenshot to confirm the setup works, then close the browser.
  await page.screenshot({ path: "puppeteer-tutorial-proxyrack.png" });
  await browser.close();
})();

You’re using an asynchronous function because of the asynchronous nature of tasks such as navigating to pages, interacting with elements, and waiting for responses. 

The puppeteer.launch() method creates a new browser instance and returns a promise that resolves to a Browser object. By default, it launches Chrome in headless mode. You can pass an object with various launch options to customize the behavior. 

For example, if you need the user interface, you can set the headless option to false: 

const browser = await puppeteer.launch({
    headless: false,
});
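
A few other launch options are handy while developing, such as slowMo, which delays each Puppeteer operation so you can watch what’s happening, and defaultViewport, which you can set to null to use the browser window’s size. This is an optional sketch; none of it is required for the tutorial. 

const browser = await puppeteer.launch({
  headless: false, // show the browser window
  slowMo: 250, // pause 250 ms between Puppeteer operations so you can follow along
  defaultViewport: null, // use the window's size instead of the default 800x600 viewport
});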

Next, use the browser to create a new page and set a viewport size for the page using the .setViewport() method. You’ll then navigate to that page using the .goto() method. 

Pass the URL of the website you need to scrape, and for this example, you’ll pass the Proxyrack URL and a waitUntil property that allows for DOM content to load. 

To test if your browser initialization works, take and save a screenshot of your landing page. 

Finally, close the browser using the .close() method.
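
As an optional variation on the snippet above, you can wait until network activity settles before taking the screenshot and capture the whole scrollable page rather than just the viewport: 

await page.goto("https://www.proxyrack.com/", {
  waitUntil: "networkidle0", // resolve once there have been no network connections for 500 ms
});
await page.screenshot({
  path: "puppeteer-tutorial-proxyrack-full.png",
  fullPage: true, // capture the entire scrollable page
});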

Scraping Data

After successfully setting up the browser, it's time to extract data from the bookstore. First, you’ll update the URL to match that of the bookstore. 

We want to extract these details for each book: title, image, price, and the number available. Then we’ll store them. We’ll try to scrape 200 books. 

You’ll need to inspect the web page’s source code using dev tools to identify where these details live. The link to each book is in an a tag inside each article element, either under the image_container class or wrapped around the h3 tag. 

Using CSS selectors, you’ll loop through each element with the product_pod class.

This turns out to be a two-step process: first you extract the link to each book from the listing pages, then you navigate to each link and extract that book’s details. 

Below is the logic for extracting the links. You’ll call the .$$eval() method on the page: it finds all elements that match the selector, passes them to your callback as an array, and returns whatever the callback returns. 

Under the hood, it’s similar to running Array.from(document.querySelectorAll(selector)) and handing the result to your function.

import puppeteer from "puppeteer";
(async () => {
  const browser = await puppeteer.launch({
    headless: false,
  });
  const page = await browser.newPage();
  await page.setViewport({ width: 1920, height: 1080 });
  await page.goto("https://books.toscrape.com/", {
    waitUntil: "domcontentloaded",
  });
  const booklinks = await page.$$eval(".product_pod", (elements) =>
    elements.map((e) => e.querySelector(".image_container a").href)
  );
  console.log(booklinks);
  await browser.close();
})();

On your terminal, you'll see an array of 20 links. 

Scraping Data From Multiple Pages

You should note that you’re scraping for 200 books, but the homepage only displays 20 books, so you’ll need to simulate clicking the Next button until you reach your target. 

You’ll need to visit 10 pages. To do this, you’ll track the current page and the target number of pages and use a while loop. Additionally, you’ll need an array to store the scraped links. 

Before implementing the next-page logic, inspect the page for the Next button’s CSS selector; it’s an anchor tag inside an element with the next class. 

Call the waitForSelector method on the page to ensure the button has loaded in the DOM, then call the .click() method to go to the next page. Because clicking triggers a navigation, you’ll also wait for that navigation to finish before scraping the next page, repeating until you reach the desired page. 

const desiredPages = 11;
let currentPage = 1;
const booklinks = [];

while (currentPage < desiredPages) {
  // Collect the 20 book links on the current page.
  const pageLinks = await page.$$eval(".product_pod", (elements) =>
    elements.map((e) => e.querySelector(".image_container a").href)
  );
  booklinks.push(...pageLinks);

  // Click the Next button and wait for the resulting navigation to finish.
  await page.waitForSelector(".next a");
  await Promise.all([
    page.waitForNavigation({ waitUntil: "domcontentloaded" }),
    page.click(".next a"),
  ]);

  currentPage++;
}

So, now that you have the 200 links, you’ll need to visit each link by opening a new page and extracting the required book data. 

You’ll create a bookPromise function that encapsulates the scraping logic for a single book. We’re using the .$eval() method, which runs your callback against the first element that matches the selector. 

let bookPromise = (link) =>
  new Promise(async (resolve, reject) => {
    try {
      // Open each book's page in a new tab and pull out its details.
      let bookpage = await browser.newPage();
      await bookpage.goto(link);
      const bookdata = await bookpage.$eval(".product_page", (e) => ({
        title: e.querySelector(".product_main h1").innerText,
        image: e.querySelector("#product_gallery img").src,
        price: e.querySelector(".price_color").innerText,
        available_number: e
          .querySelector(".instock.availability")
          .innerText.match(/\d+/)[0],
      }));
      await bookpage.close();
      resolve(bookdata);
    } catch (error) {
      reject(error);
    }
  });

const bookPromises = booklinks.map((link) => bookPromise(link));
const allBookData = await Promise.all(bookPromises);
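
Opening a new tab for all 200 links at once can strain your machine. As an optional variation on the single Promise.all() call above (not part of the original flow), you could process the links in small batches instead: 

// Optional: visit the links in batches of 10 to limit how many tabs are open at once.
const allBookData = [];
const batchSize = 10;
for (let i = 0; i < booklinks.length; i += batchSize) {
  const batch = booklinks.slice(i, i + batchSize);
  // Scrape this batch concurrently, then move on to the next one.
  const results = await Promise.all(batch.map((link) => bookPromise(link)));
  allBookData.push(...results);
}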

 

Once you’ve scraped this data, you may want to store it in a database or a file. Below is how to store it in a file using the fs module, where you pass the file name and the data you want to store. 

import fs from "fs";
fs.writeFile("booksData.json", JSON.stringify(allBookData), (err) => {
    if (err) {
      throw err;
    }
    console.log("File Saved Successfully!");
});

Here’s the complete code for scraping 200 books from the fictional bookstore: 

import puppeteer from "puppeteer";
import fs from "fs";
(async () => {
  const browser = await puppeteer.launch({
    headless: false,
  });
  const page = await browser.newPage();
  await page.setViewport({ width: 1920, height: 1080 });
  await page.goto("https://books.toscrape.com/", {
    waitUntil: "domcontentloaded",
  });
  const desiredPages = 11;
  let currentPage = 1;
  const booklinks = [];
  while (currentPage < desiredPages) {
    // Collect the 20 book links on the current page.
    const pageLinks = await page.$$eval(".product_pod", (elements) =>
      elements.map((e) => e.querySelector(".image_container a").href)
    );
    booklinks.push(...pageLinks);
    // Click the Next button and wait for the resulting navigation to finish.
    await page.waitForSelector(".next a");
    await Promise.all([
      page.waitForNavigation({ waitUntil: "domcontentloaded" }),
      page.click(".next a"),
    ]);
    currentPage++;
  }
  let bookPromise = (link) =>
    new Promise(async (resolve, reject) => {
      try {
        // Open each book's page in a new tab and pull out its details.
        let bookpage = await browser.newPage();
        await bookpage.goto(link);
        const bookdata = await bookpage.$eval(".product_page", (e) => ({
          title: e.querySelector(".product_main h1").innerText,
          image: e.querySelector("#product_gallery img").src,
          price: e.querySelector(".price_color").innerText,
          available_number: e
            .querySelector(".instock.availability")
            .innerText.match(/\d+/)[0],
        }));
        await bookpage.close();
        resolve(bookdata);
      } catch (error) {
        reject(error);
      }
    });
  const bookPromises = booklinks.map((link) => bookPromise(link));
  const allBookData = await Promise.all(bookPromises);
  fs.writeFile("booksData.json", JSON.stringify(allBookData), (err) => {
    if (err) {
      throw err;
    }
    console.log("File Saved Successfully!");
  });
  await browser.close();
})();

Puppeteer Tutorial Conclusion

You can create robust and efficient web scraping scripts by leveraging Puppeteer's comprehensive API and asynchronous programming with async/await. 

With the code examples and explanation from this Puppeteer tutorial, you have the basics to explore and build your web scraping solutions using Puppeteer. 

Always be mindful of the legal and ethical implications of web scraping, and ensure that you respect the policies and guidelines of any website. 

Also, try Proxyrack today to see how to route your requests through proxy servers to avoid IP restrictions. 
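
For reference, here’s a minimal sketch of how a proxy can be plugged into Puppeteer using Chromium’s --proxy-server launch argument and page.authenticate(). The host, port, and credentials below are placeholders, not real Proxyrack values; check your provider’s dashboard for the actual connection details. 

import puppeteer from "puppeteer";

(async () => {
  const browser = await puppeteer.launch({
    // Placeholder proxy address -- replace with your provider's host and port.
    args: ["--proxy-server=http://proxy.example.com:8080"],
  });
  const page = await browser.newPage();
  // Placeholder credentials for proxies that require authentication.
  await page.authenticate({ username: "your-username", password: "your-password" });
  await page.goto("https://books.toscrape.com/");
  console.log(await page.title());
  await browser.close();
})();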

This post was written by Mercy Kibet. Mercy is a full-stack developer with a knack for learning and writing about new and intriguing tech stacks.

