As a developer, when you need to get data from an external website, you usually look for an API exposed by that website. However, many websites don't maintain such an API. In those cases, you'll need a different method to get that data.
Web scraping is a technique that lets you extract meaningful data from external websites. You can write code that navigates to different web pages, locates the right HTML elements, and extracts the information from within them. This isn't always straightforward, though: sometimes a website's HTML changes dynamically, and developers need additional tools to navigate its pages and pull out the relevant information. In this guide, we'll explore the fundamentals of web scraping with JavaScript and see how you can scrape data using JavaScript and relevant frameworks.
First, let's understand what web scraping really means. Web scraping is a common technique used by developers to extract data from websites. This data might be present in different formats, including text, images, links, or tables. Web scraping can be extremely helpful in gathering and aggregating information from multiple websites and then analyzing it.
For instance, if you track the trending products on an e-commerce website over the course of a few weeks, you can figure out which products to sell on your own website to gain traction. Use cases like this have made web scraping a popular technique in today's world.
With web scraping, you locate a specific DOM element on the page and then, using tools or code, extract the data inside it.
If you're familiar with HTML and CSS and have a decent knowledge of JavaScript, you can use common JavaScript logic to scrape data from websites.
Here are two common techniques that can help you scrape useful data from websites using JavaScript:
Inspecting and Identifying Target Data: You can scrape data effectively by identifying the HTML elements that contain the information you're interested in. To do this, go to the website whose data you want to scrape and open the browser's developer tools. These let you inspect the source code and locate the desired elements. You can then use a CSS selector along with the correct HTML element to narrow down the exact DOM element (see the console snippet after this list).
Handling Dynamic Content: Some websites rely heavily on JavaScript to load or update content. In this case, the static HTML alone may not contain the data you need, and you'll require additional techniques. You can use popular scraping libraries like Puppeteer to render and interact with JavaScript-rendered pages. This allows you to scrape data from dynamically generated content or single-page applications (SPAs).
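Before writing any scraper code, it helps to test a selector directly in the browser's developer tools console. Here's a minimal sketch; span.article-author is the selector we'll use in the Dzone example later in this guide, so substitute whatever selector matches the data you're after:
// Run this in the DevTools console on the target page
const matches = document.querySelectorAll('span.article-author');
console.log(matches.length); // how many elements the selector hits
matches.forEach((el) => console.log(el.textContent.trim()));
If the selector prints the data you expect, it's a good candidate to use in your scraper.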
We'll use a Node.js environment to run our JavaScript scraping code. Ensure you have:
Node.js installed on your device
A text editor or integrated development environment (IDE) of your choice
After that, create a new directory:
mkdir web-scraping-javascript
Then initialize a new NPM project inside this directory:
npm init -y
After that, we'll install the following libraries:
npm i cheerio axios
We'll use Cheerio to parse the HTML and select DOM elements, and Axios to make HTTP requests.
Let's start with a simple example of scraping data from a website using JavaScript. In this example, we'll get all the author names from the Dzone website.
By inspecting the author name elements, you can see that the author names live in span tags with the class name article-author.
We can make an HTTP request to the Dzone website using Axios. Then we can take the response data, which is the page's HTML, and load it into Cheerio. After that, we can select all the span elements with the class name article-author and push their text into an array. Finally, we can output this array.
// Import required libraries
const axios = require('axios');
const cheerio = require('cheerio');

// Fetch the HTML content of the target webpage
axios.get('https://dzone.com/')
  .then((response) => {
    const html = response.data;

    // Load the HTML content into Cheerio
    const $ = cheerio.load(html);

    // Extract the author names
    const authorNames = [];
    $('span.article-author').each((index, element) => {
      const name = $(element).text();
      authorNames.push(name);
    });

    // Output the scraped data
    console.log(authorNames);
  })
  .catch((error) => {
    console.log(`Error: ${error}`);
  });
When you run the above code, you should see a list of all the author names printed to the console.
Great! You can try this with other DOM elements as well, as long as they have a well-defined HTML structure and a class name or another CSS selector that helps you locate them. And you're not limited to text, as the next sketch shows.
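Besides text, you can also pull attributes out of elements with Cheerio. Here's a minimal sketch that collects link URLs from a page; https://example.com/ is a stand-in URL, so swap in the site and selector you actually want to scrape:
const axios = require('axios');
const cheerio = require('cheerio');

axios.get('https://example.com/')
  .then((response) => {
    const $ = cheerio.load(response.data);

    const links = [];
    $('a').each((index, element) => {
      // .attr() reads an attribute instead of the element's text content
      const href = $(element).attr('href');
      if (href) links.push(href);
    });

    console.log(links);
  })
  .catch((error) => {
    console.log(`Error: ${error}`);
  });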
Here's another example where we can scrape the news titles from a news website like BBC.
As you can see, the news titles are h3 elements with the class name gs-c-promo-heading__title.
We can write the following code:
const axios = require('axios');
const cheerio = require('cheerio');

axios.get('https://www.bbc.com/news')
  .then((response) => {
    const html = response.data;
    const $ = cheerio.load(html);

    const newsHeadlines = [];
    $('h3.gs-c-promo-heading__title').each((index, element) => {
      const headline = $(element).text();
      newsHeadlines.push(headline);
    });

    console.log(newsHeadlines);
  })
  .catch((error) => {
    console.log(`Error: ${error}`);
  });
In the above code, we select all the headline elements, extract their text content, and push it into an array.
If you run the above code, you should see all the news titles printed to the console. Awesome!
We've seen how we can use JavaScript to scrape websites using DOM manipulation. However, some websites change their content dynamically. In that case, the HTML elements and CSS selectors that identify the content won't always be the same. To scrape such websites, we'll use a library called Puppeteer. First, let's install Puppeteer in our project by running:
npm i puppeteer
Great. Now let's see two common use cases in which you can scrape data from a website through Puppeteer.
One of the trickiest tasks in web scraping is scraping data from a table. This is because tables have a nested markup structure (rows inside the table, cells inside rows), and navigating through the relevant HTML tags can be difficult. Let's walk through a practical example of scraping a table from a website.
Suppose we want to extract data from a table displaying cryptocurrency prices.
We can use the CoinMarketCap website for this.
As you can see, the details about the cryptocurrencies and their prices are displayed in the form of a table.
Let's look at how we can use Puppeteer to extract this data:
const puppeteer = require('puppeteer');

async function scrapeCryptoTable() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://coinmarketcap.com/');

  // Wait for the table to load (you may need to adjust the selector based on website changes)
  await page.waitForSelector('table.cmc-table');

  const tableData = await page.evaluate(() => {
    const data = [];
    const tableRows = document.querySelectorAll('table.cmc-table tr');

    tableRows.forEach((row) => {
      const columns = row.querySelectorAll('td');

      // Only data rows have a full set of columns; this skips headers
      if (columns.length >= 10) {
        const name = columns[2].querySelector('p').textContent.trim();
        const price = columns[3].textContent.trim();
        const marketCap = columns[7].textContent.trim();
        data.push({ name, price, marketCap });
      }
    });

    return data;
  });

  console.log(tableData);
  await browser.close();
}

scrapeCryptoTable();
In the above example, we use Puppeteer to navigate to the "https://coinmarketcap.com/" website.
Once we navigate to the website, we wait for the table to load using page.waitForSelector. This is necessary because the content is loaded dynamically and asynchronously. We then use document.querySelectorAll to select all rows (<tr>) within the table and loop through each row, extracting data from its columns (<td> tags). From the third column, we extract the cryptocurrency name; the price comes from the fourth column and the market cap from the eighth.
We only consider rows with at least 10 columns to ensure we are selecting the correct data rows and not including table headers or other elements.
Let's now run this code. You should see the name, price, and market cap of each cryptocurrency printed to the console.
Another common use case for web scraping is extracting product information from e-commerce websites. Let's say we need to scrape some product data from a popular e-commerce website like Amazon. Here's the code to scrape this data using Puppeteer:
const puppeteer = require('puppeteer');

const scrapeAmazonProducts = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://www.amazon.in/");

  // Type the search query, submit it, and wait for the results page to load
  await page.type("#twotabsearchtextbox", "iphone 14");
  await Promise.all([
    page.waitForNavigation(),
    page.click("#nav-search-submit-text"),
  ]);

  const products = await page.evaluate(() => {
    const results = [];
    const items = document.querySelectorAll(".s-result-item");

    for (const item of items) {
      const title = item.querySelector("h2 > a > span");
      const price = item.querySelector(".a-price-whole");

      // Skip result slots without a title or price (e.g., ads or banners)
      if (!title || !price) continue;

      results.push({
        title: title.innerText,
        price: price.innerText,
      });
    }

    return results;
  });

  console.log(products);
  await browser.close();
};

scrapeAmazonProducts();
We navigate to the Amazon website and search for the product in the text box. Then we get all the result items using the CSS selector .s-result-item.
Finally, we get the title and price by navigating through the DOM nodes for each result item.
When you run this code, you should see an array of product titles and prices printed to the console.
That's how you can scrape dynamic content from an E-commerce site like Amazon.
We've learned various web scraping techniques through examples. Now let's explore some best practices to adopt when scraping:
Ensure that you respect the website's scraping guidelines and comply with its terms of service
Whenever possible, cache data locally to avoid making a large number of requests
Make sure you have retry mechanisms implemented in your scraping code
Use appropriate delay mechanisms to avoid overwhelming the target website with requests (a simple retry-with-delay helper is sketched after this list)
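To make the last two points concrete, here's a minimal sketch of a retry helper with a delay between attempts. The fetchWithRetry name and the default attempt count and delay are illustrative assumptions, not part of any library, so tune them to the website you're scraping:
const axios = require('axios');

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: retries a failed request, pausing between attempts
async function fetchWithRetry(url, retries = 3, delayMs = 2000) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      const response = await axios.get(url);
      return response.data;
    } catch (error) {
      if (attempt === retries) throw error;
      // Wait before retrying so we don't hammer the target website
      await sleep(delayMs);
    }
  }
}
You could then call fetchWithRetry instead of axios.get in the earlier Cheerio examples, and cache the returned HTML locally if you plan to parse it more than once.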
Web scraping with JavaScript can be a powerful tool for front-end developers to automate data extraction and perform various tasks efficiently. In this guide, we covered the fundamentals of web scraping, including environment setup, prerequisites, libraries, and techniques, and provided examples of common scraping use cases. Remember to use web scraping responsibly and ethically, respecting website terms of service and legal boundaries. Happy scraping!
This post was written by Siddhant Varma. Siddhant is a full-stack JavaScript developer with expertise in frontend engineering. He's helped scale multiple startups in India and has experience building products in the ed-tech and healthcare industries. Siddhant has a passion for teaching and a knack for writing. He's also taught programming to many graduates, helping them become better future developers.