On New Year’s Eve 2018, I published an article which instructed how to scrape pages of a site and write the results into Google BigQuery. I considered it to be a cool way to build your own web scraper, as it utilized the power and scale of the Google Cloud platform combined with the flexibility of a headless crawler built on top of Puppeteer. In today’s article, I’m revisiting this solution in order to share with you its latest version, which includes a feature that you might find extremely useful when auditing the cookies that are dropped on your site.
Last updated 11 September 2020: Added important note about how the custom domain should be mapped with A/AAAA DNS records rather than a CNAME record. Ah, Safari’s Intelligent Tracking Prevention - the gift that keeps on giving. Having almost milked this cow for all it’s worth, I was sure there would be little need to revisit the topic. Maybe, I thought, it would be better to just sit back and watch the world burn.
In my intense love affair with the Google Cloud Platform, I’ve never felt more inspired to write content and try things out. After starting with a Snowplow Analytics setup guide, and continuing with a Lighthouse audit automation tutorial, I’m going to show you yet another cool thing you can do with GCP. In this guide, I’ll show you how to use an open-source web crawler running in a Google Compute Engine virtual machine (VM) instance to scrape all the internal and external links of a given domain, and write the results into a BigQuery table.
Google Cloud Platform is very, very cool. It’s a fully capable, enterprise-grade, scalable cloud ecosystem which lets even total novices get started with building their first cloud applications. I wrote a long guide for installing Snowplow on the GCP, and you might want to read that if you want to see how you can build your own analytics tool using some nifty open-source modules. But this guide will not be about Snowplow.