Crawling Not Crawling

Having been in the SEO space for about 15 years, I am surprised at how many long-standing tools are still being used to help crawl and index sites.

Robots.txt was originally discussed back in 1994 on a mailing list. Many sites, but not all, have a robots.txt file. Almost 2 years ago, on July 1, 2019, Google announced that it was working to make the robots.txt protocol an internet standard. It just took 25 years!

Sitemaps were launched in June 2005 by the Big G and many of the crawlers have leveraged them.

Sitemaps really come in 2 flavors: a list of URLs or a list of sitemaps. The flavor can easily be detected by parsing the XML: the root element is urlset for a list of URLs and sitemapindex for a list of sitemaps.
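
As a rough sketch of that detection, the snippet below parses the XML and looks at the root tag. The function name sitemap_flavor is made up for illustration, and it assumes you already have the sitemap body as bytes.

```python
import xml.etree.ElementTree as ET


def sitemap_flavor(xml_bytes: bytes) -> str:
    """Return 'urlset', 'sitemapindex', or 'unknown' based on the root tag."""
    root = ET.fromstring(xml_bytes)
    # Root tags usually look like
    # '{http://www.sitemaps.org/schemas/sitemap/0.9}urlset'.
    # Strip the namespace so files with or without it are handled.
    tag = root.tag.rsplit("}", 1)[-1].lower()
    return tag if tag in ("urlset", "sitemapindex") else "unknown"
```

A urlset means you can read page URLs right away; a sitemapindex means each entry is another sitemap to fetch.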

By crawling the sitemaps, you can also get a better idea of the structure of the website: its folders (in many cases, its categories) as well as keyword data, specifically on products. I even posted a quick code snippet on ways to extract keywords from URLs, although I recommend David Sottimano's implementation as it’s a nice refactor.
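
To give a feel for the idea, here is a minimal sketch of pulling keyword-like tokens out of a URL path. This is not the snippet mentioned above nor David Sottimano's version; url_keywords is a hypothetical helper, and the tokenizing rules (split on non-alphanumerics, drop pure numbers and a trailing file extension) are my own assumptions.

```python
import re
from urllib.parse import unquote, urlparse


def url_keywords(url: str) -> list:
    """Split a URL path into lowercase word tokens, skipping numeric IDs."""
    path = unquote(urlparse(url).path)
    # Drop a trailing file extension such as .html or .aspx.
    path = re.sub(r"\.[A-Za-z0-9]{2,5}$", "", path)
    # Slashes, hyphens, and underscores all act as separators.
    tokens = re.split(r"[^A-Za-z0-9]+", path)
    return [t.lower() for t in tokens if t and not t.isdigit()]
```

For a made-up URL like https://example.com/mens-shoes/running/blue-trail-runner-12345.html, this yields the category and product words and drops the numeric ID.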

Over the past year I have been crawling a lot of ecommerce sites and have been thinking of ways to optimize a crawl. It takes a lot of time and resources to crawl a large site (1MM+ URLs). There are ways to limit wasted resources, such as blocking images or stylesheets, especially if you are only interested in the HTML. One thing that I got excited about while doing some crawling was how many ecommerce stores have sitemaps and even put them in their robots.txt! The sitemaps protocol lets robots.txt point to sitemaps with a Sitemap directive, and many sites also default to the conventional domain.com/sitemap.xml route. A few of them put their sitemaps at similar or different routes, but either way, it’s a great way to collect URLs.
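
A small sketch of that lookup: read the Sitemap lines out of a robots.txt body and fall back to /sitemap.xml when there are none. The helper name sitemaps_from_robots is hypothetical, and the fallback is just the convention described above, not a guarantee that the file exists. (Python 3.8+ also ships urllib.robotparser.RobotFileParser.site_maps() if you would rather not parse by hand.)

```python
from urllib.parse import urljoin


def sitemaps_from_robots(robots_txt: str, base_url: str) -> list:
    """Return sitemap URLs listed in robots.txt, or a /sitemap.xml guess."""
    found = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        # The directive name is matched case-insensitively.
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    # Assumption: when nothing is declared, try the conventional location.
    return found or [urljoin(base_url, "/sitemap.xml")]
```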

So these sitemaps (or in some cases sitemaps of sitemaps) have a ton of information: the expected URL, but also how often that URL is updated (changefreq) and/or when it was last modified (lastmod). Now keep in mind, many of these sitemaps are auto-generated and not always 100% accurate, so take the data with a grain of salt.
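
Here is a minimal sketch of reading those fields out of a URL sitemap. It assumes the file uses the standard sitemaps.org namespace; sitemap_entries is an illustrative name, and the optional tags come back as None when a site leaves them out.

```python
import xml.etree.ElementTree as ET

# Standard namespace from the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_entries(xml_bytes: bytes) -> list:
    """Return one dict per <url> with loc, lastmod, and changefreq."""
    root = ET.fromstring(xml_bytes)
    entries = []
    for url in root.iter(NS + "url"):
        entries.append({
            "loc": (url.findtext(NS + "loc") or "").strip(),
            # Optional tags: findtext returns None when they are absent.
            "lastmod": url.findtext(NS + "lastmod"),
            "changefreq": url.findtext(NS + "changefreq"),
        })
    return entries
```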

The kicker to me, and why I am so excited about sitemaps, is that they finally provide a route (as well as a pattern) for generating an initial list of URLs at scale. So rather than kick off a crawl with [insert seo tool here] and go depth first, hoping you do not run out of RAM or forget to turn it off while running it in the cloud, I can take a domain, fetch the sitemaps via robots.txt (or auto-detect /sitemap.xml), and build out an entire list of URLs without actually crawling the website directly, just the sitemaps!
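
The whole flow can be sketched as a short recursive walk: start from a sitemap URL, follow any sitemap index down to its child sitemaps, and collect every page URL. This is only an outline under assumptions, not my production script: collect_urls is an illustrative name, and fetch is a callable you supply that takes a URL and returns the (already decompressed) XML bytes.

```python
import xml.etree.ElementTree as ET


def _local(tag: str) -> str:
    """Strip the XML namespace from a tag name."""
    return tag.rsplit("}", 1)[-1]


def _locs(root) -> list:
    """Collect the <loc> directly under each <url> or <sitemap> entry."""
    out = []
    for entry in root:
        for child in entry:
            if _local(child.tag) == "loc" and child.text:
                out.append(child.text.strip())
    return out


def collect_urls(sitemap_url: str, fetch, seen=None) -> list:
    """Walk sitemap indexes recursively and return all page URLs found."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against sitemaps that reference each other
        return []
    seen.add(sitemap_url)
    root = ET.fromstring(fetch(sitemap_url))
    locs = _locs(root)
    if _local(root.tag) == "sitemapindex":
        urls = []
        for child_sitemap in locs:
            urls.extend(collect_urls(child_sitemap, fetch, seen))
        return urls
    return locs
```

In real use, fetch could wrap something like urllib.request.urlopen(url).read() plus whatever politeness delays and error handling you need; passing it in keeps the walk itself easy to test with canned responses.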

Hence, crawling not crawling.

In the past, I was leveraging Archive.org and a few other data sources to do this, to help give an idea of the size of a site and any patterns in the URLs for products, categories, etc.

While building out a script to do this, I also realized a lot of sites leverage gzip to optimize performance and save on bandwidth and storage cost, which is awesome!

This also means my script needs to be able to handle uncompressing the data before processing it. Also worth mentioning: not every .gz file is actually gzip, so check the content type rather than trusting the extension (cough cough Target.com cough helped me find this edge case).
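
One way to cover that edge case is sketched below. Instead of the content type header, it checks the first two bytes of the body for the gzip magic number (0x1f 0x8b), which is a swapped-in technique on my part and works even when the headers are misleading. maybe_gunzip is a hypothetical helper name.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(body: bytes) -> bytes:
    """Decompress the body only if it really is gzip data."""
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body)
    # A .gz URL that served plain XML: pass it through untouched.
    return body
```

Running every sitemap response through a check like this means the parser always gets plain XML, whatever the URL's extension claims.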

More to come, but wanted to start the discussion!

📣 Software Development And Life

Also be sure to 👋 and follow me on 🐦 @johnmurch