Day job, night job. Food between. Nothing special happened today. But night job lately... Oh boy.
I've been tasked with building something to let people see when there have been changes made to a website. A new page, a change to a page, whatever. That sounds easy, right? Just scrape some pages, save the text, and the next time you scrape the page, compare the new text with the last bit of text we saved, and if there are differences, save those. Easy peasy, right? Fucking wrong!!
First off, scraping a site - by which I mean using some code to "look at" the homepage, grab the content and save it, extract any links from that content, figure out whether we should follow any of those links, then follow them and repeat the whole "extract content, find links, follow" process until there's nothing left to "look at" - is no easy task. It sounds simple, but it really isn't. One way to do this is to grab a page and try to extract links with regular expressions. That's a bad idea for a number of reasons that need no explanation. A better option is to use a library, like Cheerio, that does all that hard work for us and lets us do something like $('a[href]') to pull all the links out of a page. But we still need to figure out whether they're links we should follow. And if humans had anything to do with the composition of a page, there's a good chance you'll encounter some dumb shit that makes no sense and needs a special exception in your code. Broken HTML? Guaranteed. Links with href values that make no sense, but a browser can somehow understand? Sure. Links with no href at all? Yep.
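To give you an idea of the "should we follow this?" part: here's a rough sketch using Node's built-in WHATWG URL parser. shouldFollow and the rules inside it are mine, made up for illustration - a real crawler needs way more special cases than this.

```javascript
// Rough sketch of a "should we follow this link?" check, using Node's
// built-in WHATWG URL parser. shouldFollow and its rules are made up
// for illustration; real pages will force more exceptions than this.
function shouldFollow(href, baseUrl) {
  let url;
  try {
    // Resolves relative hrefs ("/about", "../x") against the current
    // page, and throws on the truly hopeless ones.
    url = new URL(href, baseUrl);
  } catch {
    return null; // nonsense href - skip it
  }
  // Only crawl http(s) - skips mailto:, javascript:, tel:, etc.
  if (url.protocol !== 'http:' && url.protocol !== 'https:') return null;
  // Stay on the same site.
  if (url.hostname !== new URL(baseUrl).hostname) return null;
  url.hash = ''; // a #fragment points at the same document
  return url.href;
}

console.log(shouldFollow('/about', 'https://example.com/'));
console.log(shouldFollow('mailto:hi@example.com', 'https://example.com/'));
```

The nice part is that the URL constructor eats a lot of the weirdness browsers tolerate, and throws on the stuff that's genuinely broken.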
Okay, so "scraping" a site is out of the question. Well, you know what most sites (though, not the bigger ones, strangely enough) have? Sitemaps! Boom!! We'll just grab the sitemaps, which tell us exactly which pages exist on the site, and grab the content on those pages. Easy enough. Oh, except some sitemaps contain URLs that aren't valid.
https://somedomain.com /a-page.html is not a valid URL, unless I missed the announcement from ICANN that "dot-com-space" is a new TLD. I guess technically https://somedomain.com/./a-page.html is a valid URL, but like, what? Slash-dot-slash? Was this website built by Fatboy Slim? Oh, and these sites will also block us if we scrape pages too quickly. OH! Some sites have thousands - or hundreds of thousands - of pages. So even with a one-second delay between scraping pages, it'll still take a stupid amount of time to scrape an entire site.
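For what it's worth, Node's URL parser handles both of those specific horrors: a space in the host makes the constructor throw, and it normalizes away /./ segments on its own. Something like this is roughly the cleanup step I mean (cleanSitemapUrl is a name I made up, not a real API):

```javascript
// Sketch of cleaning up the garbage found in real sitemaps.
// cleanSitemapUrl is a made-up helper, not part of any library.
function cleanSitemapUrl(loc) {
  try {
    // The WHATWG parser rejects "https://somedomain.com /a-page.html"
    // (space in the host) and normalizes "/./" path segments for free.
    return new URL(loc.trim()).href;
  } catch {
    return null; // not salvageable - log it and move on
  }
}

console.log(cleanSitemapUrl('https://somedomain.com/./a-page.html'));
// -> https://somedomain.com/a-page.html
console.log(cleanSitemapUrl('https://somedomain.com /a-page.html'));
// -> null
```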
Oh, you know what?! robots.txt tells us which pages we're not allowed to scrape, and also (I learned during all of this) how long we should wait between scraping pages. Let's look at the robots.txt and make sure (a) we're not scraping stuff we shouldn't and (b) we're not scraping a site too quickly. Of course, any jackass would realize that robots.txt probably isn't telling them what not to scrape or how long to wait 100% of the time, but I'm a special kind of jackass, and that did not occur to me. Not every robots.txt tells you how long to wait, or what you're not allowed to scan. Cool, cool, cool, cool.
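The actual reading-robots.txt part is the easy bit, at least for the naive version. Here's a minimal sketch that pulls Disallow rules and Crawl-delay for the "*" user-agent - parseRobots is a made-up helper, and real files (wildcards, multiple agents, Allow lines) need a proper parser:

```javascript
// Minimal sketch of reading Disallow rules and Crawl-delay out of a
// robots.txt for the "*" user-agent. parseRobots is a made-up helper;
// real robots.txt files need a proper parser.
function parseRobots(text) {
  const rules = { disallow: [], crawlDelayMs: null };
  let appliesToUs = false;
  for (const rawLine of text.split('\n')) {
    const line = rawLine.split('#')[0].trim(); // strip comments
    const m = line.match(/^([A-Za-z-]+)\s*:\s*(.*)$/);
    if (!m) continue;
    const field = m[1].toLowerCase();
    const value = m[2].trim();
    if (field === 'user-agent') {
      appliesToUs = value === '*';
    } else if (appliesToUs && field === 'disallow' && value) {
      rules.disallow.push(value);
    } else if (appliesToUs && field === 'crawl-delay') {
      rules.crawlDelayMs = Number(value) * 1000; // seconds -> ms
    }
  }
  return rules;
}

const robots = parseRobots('User-agent: *\nDisallow: /admin/\nCrawl-delay: 10\n');
// robots.disallow has the paths to skip; robots.crawlDelayMs, if set,
// is how long to sleep between requests.
```

And of course, when crawlDelayMs comes back null - which it does constantly - you're back to guessing.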
So here I am. Waiting sometimes 4 days for a site to finish being scraped (I'm not sure how long it actually took, because I killed the process after 4 days) and learning about all the ways people can supremely fuck up something as simple as a sitemap. Seriously?! SLASH-DOT-SLASH?! WHAT THE FUCKING FUCK?! I'm pretty sure the only way to get this thing to run in any sort of timely fashion is to completely rewrite it so it can be distributed across a fuckton of servers and processes... and a lot more infrastructure.