Web scraping with Puppeteer
Today, I decided to bring a somewhat different topic to the blog: web scraping.
Sometimes, I find myself wanting to analyse or visualize some data that is not available through an API or is not really structured, so I turn to web scraping. This is a data extraction method where the user or an automated program copies specific data from a website. I used to use Python for that, but recently I came across Puppeteer, and that is the Node library we're going to use today.
According to the docs, Puppeteer is "a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol." This means we can do things like crawl simple web pages or even SPAs (single-page applications), automate form submissions, run UI tests, generate screenshots and PDFs, and more.
Assuming you have Node.js installed on your computer, you can run this snippet as a .js file and quickly see how it works.
const puppeteer = require('puppeteer');

(async () => {
  // Launch the headless browser
  const browser = await puppeteer.launch();
  // Open a new page
  const page = await browser.newPage();
  // Go to the specified URL
  await page.goto('https://example.com');
  // Take a screenshot
  await page.screenshot({ path: 'screenshot.png' });
  // Close the browser
  await browser.close();
})();
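The same page object can also render the page as a PDF. As a quick variant of the script above, you could swap the screenshot line for a call to page.pdf (note that PDF generation only works in headless mode):

```javascript
// Save the page as an A4 PDF instead of a screenshot
await page.pdf({ path: 'example.pdf', format: 'A4' });
```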
If you want to see what the browser is doing instead of just waiting for the results, you can change the headless mode:
const browser = await puppeteer.launch({
  headless: false
});
This will open a browser window as soon as the script starts running, and you will see each step as it happens.
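When watching the browser, the steps can fly by too fast to follow. As a sketch of the launch options, Puppeteer's slowMo setting delays each operation by the given number of milliseconds, which makes debugging much easier:

```javascript
const browser = await puppeteer.launch({
  headless: false,
  // Slow each Puppeteer operation down by 250 ms so you can follow along
  slowMo: 250
});
```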
Now, the cool thing about using Puppeteer for web scraping is that you can use "vanilla" JavaScript syntax to find elements and data on the page. You will have access to the document interface inside the evaluate method. Let's try it on the Google News page:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://news.google.com/news/');

  const data = await page.evaluate(() => {
    // Find all anchor tags whose parent is an h3
    const headlineNodes = document.querySelectorAll('h3 > a');
    // Turn the NodeList into an array and map it to the textContent of each anchor
    return {
      headlines: Array.from(headlineNodes).map(a => a.textContent)
    };
  });

  console.log(data);
  await browser.close();
})();
As I write this post, these are the headlines it returned:
{
  headlines: [
    "POLITICO's Election Forecast: Trump, Senate GOP in trouble",
    'Trump campaign "strongly" encourages face masks at outdoor rally in New Hampshire.',
    'Facebook, WhatsApp Suspending Review of Hong Kong Requests for User Data',
    'Trump celebrates Fourth of July by stoking division over pandemic and race',
    '7-Year-Old Among 13 Killed in Weekend Shootings in Chicago',
    'LA County New COVID-19 Cases Shatter Daily Record In First Report Since Data Processing Changes'
  ]
}
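Note that the function you pass to evaluate is ordinary JavaScript, so the mapping logic can be written and tested outside the browser. As a sketch with made-up sample data, here is the same transformation extended to grab each headline's link as well (href is a standard property of anchor elements):

```javascript
// Turn a list of anchor-like nodes into { headline, url } objects.
// In the real script, nodes would come from document.querySelectorAll('h3 > a').
function extractHeadlines(nodes) {
  return Array.from(nodes).map(a => ({
    headline: a.textContent,
    url: a.href
  }));
}

// Made-up sample data standing in for real anchor elements
const sample = [
  { textContent: 'First headline', href: 'https://news.example.com/1' },
  { textContent: 'Second headline', href: 'https://news.example.com/2' }
];

console.log(extractHeadlines(sample));
```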
Google News has an infinite scroll behaviour. This means that when you first load the page, only a few articles are shown, and the others are loaded as you scroll down. You can mimic this behaviour to get more data by using a combination of window.scrollBy and setInterval inside the evaluate method. Something like this (beware that this can cause an infinite loop, so make sure to create an exit strategy that meets your requirements):
const data = await page.evaluate(async () => {
  await new Promise((resolve) => {
    let totalHeight = 0;
    const distance = 100;
    const timer = setInterval(() => {
      const scrollHeight = document.body.scrollHeight;
      window.scrollBy(0, distance);
      totalHeight += distance;
      // Stop once we have scrolled past the full height of the page
      if (totalHeight >= scrollHeight) {
        clearInterval(timer);
        resolve();
      }
    }, 100);
  });

  const headlineNodes = document.querySelectorAll('h3 > a');
  return {
    headlines: Array.from(headlineNodes).map(a => a.textContent)
  };
});
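One simple exit strategy is to cap the number of scroll steps, so the loop ends even if the page keeps growing. The stopping rule itself is plain arithmetic and can be sketched and tested on its own; maxSteps is a made-up parameter name for this sketch:

```javascript
// Decide whether to keep scrolling: stop when we have covered the page
// height or hit the maximum number of steps, whichever comes first.
function shouldKeepScrolling(totalHeight, scrollHeight, steps, maxSteps) {
  return totalHeight < scrollHeight && steps < maxSteps;
}

// Inside setInterval you would increment a step counter and call:
//   if (!shouldKeepScrolling(totalHeight, scrollHeight, steps, maxSteps)) {
//     clearInterval(timer);
//     resolve();
//   }

console.log(shouldKeepScrolling(500, 10000, 5, 50));  // true: keep going
console.log(shouldKeepScrolling(500, 10000, 50, 50)); // false: step cap reached
```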
And that's it. Hopefully you can see how this technique can be useful to automate boring tasks and maybe create an API where there isn't one. On a final note, be respectful of the website you're scraping. You should follow the rules stated in each website's /robots.txt, make sure you agree with the terms of service, and check whether scraping is actually legal wherever you are doing it. And try not to DDoS them when running this in loops :)