Introduction to Puppeteer
What is it used for?
Simply put: Automate the browser.
Puppeteer[^1] is a Node.js library created by Google that controls Chrome or Chromium, by default as a headless browser.
A headless browser is like a regular browser, but instead of using it via the user interface, you use it via a programming interface.
How to get started
With Node.js installed and a project folder set up, we can add Puppeteer to the project by running:
npm i puppeteer
Once that is done, we can create a new JavaScript file like this:
puppeteer.js
const puppeteer = require("puppeteer");
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com");
  await browser.close();
})();
First, we import Puppeteer into the project. Then we wrap everything in an async function so that we can use await. We need this because almost everything we want to do on the web requires the script to wait until whatever we need has finished loading. This part is crucial: normally the human in front of the machine takes care of it by checking that everything is visible and working.
Run the file
If you have finished everything so far, you can run the file using:
node puppeteer.js
Make sure that your terminal's current directory is the folder where you created the puppeteer.js file.
When you run the file, it looks as if nothing happens. That is not quite true, though. What happens is that Chrome opens in headless mode, goes to "example.com", and closes again once the site has loaded.
We can watch this happen if we add a parameter to the launch function and modify the code to:
const browser = await puppeteer.launch({
  headless: false,
});
const page = await browser.newPage();
await page.goto("https://example.com");
await browser.close();
Now if you run the file again, you can see what's going on. In the beginning, this helped me understand what the browser is doing and why it might throw errors.
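Once errors do show up, a visible browser window tends to stay open after the script has crashed. Here is a minimal sketch of one way around that, assuming you are fine with a small helper: `withPage` is a name I made up for this post, not part of Puppeteer. It launches the browser, hands a fresh page to your function, and closes the browser in a `finally` block no matter what happened.

```javascript
// Hypothetical helper (not part of Puppeteer): run `task(page)` in a fresh
// page and always close the browser afterwards, even if `task` throws.
// `puppeteer` is passed in, e.g. the result of require("puppeteer").
async function withPage(puppeteer, launchOptions, task) {
  const browser = await puppeteer.launch(launchOptions);
  try {
    const page = await browser.newPage();
    // hand the page to the caller and pass its result back
    return await task(page);
  } finally {
    // runs on success and on error, so no browser window is left behind
    await browser.close();
  }
}
```

You could call it like `withPage(require("puppeteer"), { headless: false }, (page) => page.goto("https://example.com"))`.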
🎉 Congratulations! You started a tool made for human interactions via a machine. The first level of automation is done.
What's next?
From here, the possibilities are whatever you can imagine: saving data that you are interested in to a database, writing your stock reports, doing home automation, you name it.
So how do we get to a different page?
Define variables that hold the links to the sites you need to visit and use the goto function to get there. Of course, you can use other JavaScript features such as arrays to store your URLs if you need more than a handful.
const url1 = "https://example.com";
const url2 = "https://google.com";
const browser = await puppeteer.launch({
  headless: false,
});
const page = await browser.newPage();
await page.goto(url1);
// do stuff on the page
await page.goto(url2);
// do stuff on the page
await browser.close();
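If the list of URLs grows, the array idea mentioned above could look roughly like this. It is only a sketch: `visitAll` is a name I picked for illustration, and reading the page title stands in for whatever you actually want to do on each page.

```javascript
// Hypothetical helper: visit each URL in order on one shared page and
// collect the page titles. `page` is assumed to be a Puppeteer Page.
async function visitAll(page, urls) {
  const results = [];
  for (const url of urls) {
    // wait for the navigation to finish before touching the page
    await page.goto(url);
    // do stuff on the page; here we only read its title
    results.push({ url, title: await page.title() });
  }
  return results;
}
```

Inside the async function from puppeteer.js, you would call `await visitAll(page, [url1, url2])` before `browser.close()`. The loop uses `for...of` with `await` on purpose, so the pages are visited one after another instead of all at once.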
Doing stuff on a page
As soon as the browser has finished loading the site, it is ready to do stuff. Now you can do what you are interested in on this page. To do so, you can use all the document functions that Chrome provides, as long as the code runs inside the page, for example within page.evaluate. The following example selects all elements on the page where the class contains 'item--property':
document.querySelectorAll("[class*='item--property']");
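The selector above only works inside the browser, because `document` does not exist in your Node script. A minimal sketch of how the two sides could be connected with `page.evaluate` follows; `getTextsByClass` is a made-up name, and returning the trimmed text of each element is just my assumption about what you might want from the page.

```javascript
// Hypothetical helper: return the trimmed text of every element whose class
// contains `fragment`. `page` is assumed to be a Puppeteer Page that has
// already finished loading a site.
async function getTextsByClass(page, fragment) {
  // The callback runs inside the browser, so `document` is available there.
  // `fragment` is passed as an argument because the callback cannot see
  // variables from the Node side.
  return page.evaluate((cls) => {
    const nodes = document.querySelectorAll(`[class*='${cls}']`);
    // return plain strings; DOM nodes themselves cannot be sent back to Node
    return Array.from(nodes, (node) => node.textContent.trim());
  }, fragment);
}
```

After `page.goto(...)`, you would call it as `await getTextsByClass(page, "item--property")`.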
Waiting
A big part of using Puppeteer is waiting for sites to finish loading and only trying to get the data once they have. I found these to be the most helpful waiting solutions.
// page.evaluate runs your code inside the page and waits for the promise it returns,
// so you can use it to wait until the content you're looking for is there
await page.evaluate(() => {
  // your code goes here
});
// waitForSelector waits until a certain selector is available. In this case it waits
// until an image is visible, which can be helpful if you try to take screenshots and the site is not yet loaded completely
await page.waitForSelector("img", { visible: true });
// in some cases you cannot avoid pausing the script for a certain amount of time. page.waitForTimeout
// was removed in newer Puppeteer releases, so a plain timer does the job
await new Promise((resolve) => setTimeout(resolve, 300));
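One thing to keep in mind is that `waitForSelector` rejects when the element never shows up. Below is a small sketch of how that could be handled. `countWhenVisible` is a hypothetical helper name, the 5000 ms default is an arbitrary choice of mine, and the check relies on my understanding that Puppeteer's timeout errors carry the name "TimeoutError".

```javascript
// Hypothetical helper: wait until `selector` is visible, then return how many
// elements match it. Returns 0 if nothing becomes visible within `timeout` ms.
// `page` is assumed to be a Puppeteer Page.
async function countWhenVisible(page, selector, timeout = 5000) {
  try {
    await page.waitForSelector(selector, { visible: true, timeout });
  } catch (err) {
    // Assumption: a timeout is reported as an error named "TimeoutError".
    // Anything else is a real problem and is passed on to the caller.
    if (err.name !== "TimeoutError") throw err;
    return 0;
  }
  // $$eval runs the callback in the page with all matching elements
  return page.$$eval(selector, (nodes) => nodes.length);
}
```

For the image example, that would be `await countWhenVisible(page, "img")`, and a result of 0 tells you it is not worth taking the screenshot yet.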
I hope this gives you a little head start. Let me know if you get stuck somewhere, so I know where to add more content to make it easier for you to use :)
[^1]: read more about it here