Web scraping with Puppeteer
Today, I decided to bring a somewhat different topic to the blog: web scraping.
Sometimes, I find myself wanting to analyse or visualize some data that is not available through an API or is not really structured, so I turn to web scraping. This is a data extraction method where the user or an automated program copies specific data from a website. I used to use Python for that, but recently I came across Puppeteer, and that is the Node library we're going to use today.
According to the docs, Puppeteer is "a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol." This means we can do things like crawl simple web pages or even SPAs (single-page applications), automate form submissions, run UI tests, generate screenshots and PDFs, and more.
Assuming you have Node.js installed on your computer, you can run this snippet as a .js file and quickly see how it works.
const puppeteer = require('puppeteer');

(async () => {
  // Launch the headless browser
  const browser = await puppeteer.launch();
  // Open a new page
  const page = await browser.newPage();
  // Go to the specified URL
  await page.goto('https://example.com');
  // Take a screenshot
  await page.screenshot({ path: 'screenshot.png' });
  // Close the browser
  await browser.close();
})();
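The same page object can also render the page as a PDF. As a quick variant of the script above, you could swap the screenshot line for a call to page.pdf (note that PDF generation only works in headless mode):

```javascript
// Save the page as an A4 PDF instead of a screenshot
await page.pdf({ path: 'example.pdf', format: 'A4' });
```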
If you want to see what the browser is doing instead of just waiting for the results, you can change the headless mode:
const browser = await puppeteer.launch({
  headless: false
});
This will open a browser window as soon as the script starts running, and you will see each step as it happens.
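When watching the browser, the steps can fly by too fast to follow. As a sketch of the launch options, Puppeteer's slowMo setting delays each operation by the given number of milliseconds, which makes debugging much easier:

```javascript
const browser = await puppeteer.launch({
  headless: false,
  // Slow each Puppeteer operation down by 250 ms so you can follow along
  slowMo: 250
});
```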
Now, the cool thing about using Puppeteer for web scraping is that you can use "vanilla" JavaScript syntax to find elements and data on the page. You will have access to the document interface inside the evaluate method. Let's try it on the Google News page:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://news.google.com/news/');

  const data = await page.evaluate(() => {
    // Find all anchor tags whose parent is an h3
    const headlineNodes = document.querySelectorAll('h3 > a');
    // Turn the NodeList into an array and map it to the textContent of each anchor
    return {
      headlines: Array.from(headlineNodes).map(a => a.textContent)
    };
  });

  console.log(data);
  await browser.close();
})();
As I write this post, these are the headlines it returned:
{
  headlines: [
    "POLITICO's Election Forecast: Trump, Senate GOP in trouble",
    'Trump campaign "strongly" encourages face masks at outdoor rally in New Hampshire.',
    'Facebook, WhatsApp Suspending Review of Hong Kong Requests for User Data',
    'Trump celebrates Fourth of July by stoking division over pandemic and race',
    '7-Year-Old Among 13 Killed in Weekend Shootings in Chicago',
    'LA County New COVID-19 Cases Shatter Daily Record In First Report Since Data Processing Changes'
  ]
}
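Note that the function you pass to evaluate is ordinary JavaScript, so the mapping logic can be written and tested outside the browser. As a sketch with made-up sample data, here is the same transformation extended to grab each headline's link as well (href is a standard property of anchor elements):

```javascript
// Turn a list of anchor-like nodes into { headline, url } objects.
// In the real script, nodes would come from document.querySelectorAll('h3 > a').
function extractHeadlines(nodes) {
  return Array.from(nodes).map(a => ({
    headline: a.textContent,
    url: a.href
  }));
}

// Made-up sample data standing in for real anchor elements
const sample = [
  { textContent: 'First headline', href: 'https://news.example.com/1' },
  { textContent: 'Second headline', href: 'https://news.example.com/2' }
];

console.log(extractHeadlines(sample));
```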
Google News has an infinite scroll behaviour. This means that when you first load the page, only a few articles are shown, and the others are loaded as you scroll down. You can mimic this behaviour to get more data by using a combination of window.scrollBy and setInterval inside the evaluate method. Something like this (beware that this can cause an infinite loop, so make sure to create an exit strategy that meets your requirements):
const data = await page.evaluate(async () => {
  await new Promise((resolve) => {
    let totalHeight = 0;
    const distance = 100;
    const timer = setInterval(() => {
      const scrollHeight = document.body.scrollHeight;
      window.scrollBy(0, distance);
      totalHeight += distance;
      // Stop once we have scrolled past the full height of the page
      if (totalHeight >= scrollHeight) {
        clearInterval(timer);
        resolve();
      }
    }, 100);
  });

  const headlineNodes = document.querySelectorAll('h3 > a');
  return {
    headlines: Array.from(headlineNodes).map(a => a.textContent)
  };
});
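One simple exit strategy is to cap the number of scroll steps, so the loop ends even if the page keeps growing. The stopping rule itself is plain arithmetic and can be sketched and tested on its own; maxSteps is a made-up parameter name for this sketch:

```javascript
// Decide whether to keep scrolling: stop when we have covered the page
// height or hit the maximum number of steps, whichever comes first.
function shouldKeepScrolling(totalHeight, scrollHeight, steps, maxSteps) {
  return totalHeight < scrollHeight && steps < maxSteps;
}

// Inside setInterval you would increment a step counter and call:
//   if (!shouldKeepScrolling(totalHeight, scrollHeight, steps, maxSteps)) {
//     clearInterval(timer);
//     resolve();
//   }

console.log(shouldKeepScrolling(500, 10000, 5, 50));  // true: keep going
console.log(shouldKeepScrolling(500, 10000, 50, 50)); // false: step cap reached
```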
And that's it. Hopefully you can see how this technique can be useful to automate boring tasks and maybe create an API where there isn't one. On a final note, be respectful of the website you're scraping. You should follow the rules stated in each website's /robots.txt, make sure you agree with the terms of service, and check whether scraping is actually legal wherever you are doing it. And try not to DDoS them when running this in loops :)