Web scraping can be an excellent source of the data that feeds decision-making in a business, which makes it a core part of many data analysis workflows. But because the amount of online content available to be scraped keeps growing, scraping each page manually quickly becomes impractical. This calls for automation.
While there are many tools tailored to different automated scraping projects, many of them are paid products and can be expensive. This is where Puppeteer, Chrome, and Node.js come in. This tutorial walks you through the process so that you can scrape websites automatically with ease.
How does the setup work?
After creating a new project, create a JavaScript file (.js). On the first line, load the Puppeteer dependency that you installed earlier. This is followed by a primary function, "getPic()", which will hold all of the automation code. The last statement in the file invokes "getPic()" so that it actually runs. Because getPic() is an "async" function, we can use the await expression inside it: await pauses the function until a promise resolves, and only then moves on to the next line of code.
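The layout described above can be sketched as follows. This is only an illustration of the structure and of how await pauses the function. To keep it runnable before any browser code exists, the Puppeteer import is shown as a comment, and the "wait" helper and "steps" array are hypothetical stand-ins, not part of Puppeteer.

```javascript
// In the real file, line 1 loads the dependency installed earlier:
// const puppeteer = require('puppeteer');

// Hypothetical stand-in for a Puppeteer call: a promise that resolves after `ms` milliseconds.
const wait = (ms) => new Promise((resolve) => setTimeout(() => resolve(ms), ms));

// Records the order in which things happen (illustration only).
const steps = [];

// Primary function that will hold all of the automation code.
async function getPic() {
  steps.push('start');
  // Execution of getPic() pauses here until the promise resolves.
  const waited = await wait(50);
  steps.push(`resumed after ${waited} ms`);
  return steps;
}

// Invoke the function so that it runs; an async function returns a promise.
const run = getPic();
run.then((result) => console.log(result));
```

Real Puppeteer calls return promises in the same way, so each one is typically prefixed with await inside getPic().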
How to call up headless Chrome
The next line of code, "const browser = await puppeteer.launch();", launches a Chrome instance through Puppeteer and assigns it to our newly created "browser" variable. Next, create a page, which will then be used to navigate to the URL that you want to scrape.
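A minimal sketch of these steps, under a few assumptions: the library is passed in as a parameter (named "puppeteerLib" here) only so the function can also be tried with a stub, the page title is read simply to confirm that the page loaded, and the URL in the commented invocation is a placeholder.

```javascript
// Sketch only: `puppeteerLib` is the object returned by require('puppeteer').
async function getPic(puppeteerLib, url) {
  // Launch Chrome (headless by default) and keep a handle to it.
  const browser = await puppeteerLib.launch();
  try {
    // Create a page (a browser tab) and navigate to the target URL.
    const page = await browser.newPage();
    await page.goto(url);
    // Read the page title as a simple check that navigation worked.
    return await page.title();
  } finally {
    // Close the browser even if one of the steps above throws.
    await browser.close();
  }
}

// Typical invocation (assumes `npm install puppeteer` and network access):
// getPic(require('puppeteer'), 'https://example.com').then(console.log);
```

Closing the browser in a finally block is a design choice in this sketch; it avoids leaving a Chrome process running when navigation fails.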
How to scrape data
The Puppeteer API lets you interact with a website in different ways, such as clicking, filling in forms, and reading data. Refer to its documentation for a closer look at how to automate those actions. The "scrape()" function will hold our scraping code. Run the command "node scrape.js" to start the scraping process. The script should then begin outputting the required content. Remember to go through your code and check that everything works as designed, to avoid running into errors along the way.
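As a rough illustration of what scrape() might contain, the sketch below reads a heading and the link targets from a page. The selectors ("h1" and "a"), the returned object shape, and the placeholder URL are assumptions made for this example; a real site needs selectors that match its own markup. As before, the library is passed in as a parameter so the function can be exercised with a stub.

```javascript
// Sketch only: `puppeteerLib` is the object returned by require('puppeteer').
async function scrape(puppeteerLib, url) {
  const browser = await puppeteerLib.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);

    // $eval runs the callback in the browser on the first element matching the selector.
    const heading = await page.$eval('h1', (el) => el.textContent.trim());

    // $$eval does the same for all matching elements, passed in as an array.
    const links = await page.$$eval('a', (anchors) => anchors.map((a) => a.href));

    return { heading, links };
  } finally {
    await browser.close();
  }
}

// Typical invocation at the bottom of scrape.js (assumes `npm install puppeteer`):
// scrape(require('puppeteer'), 'https://example.com').then((data) => console.log(data));
```

Interactions follow the same pattern: for example, "await page.click(selector)" clicks an element and "await page.type(selector, text)" fills in a form field before the data is read.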