The Joy of Screen Scraping

January 18, 2015

I was tickled pink the other day to see that over 1,000 people have downloaded the downcache module I maintain on NPM. The way it works is simple: You request a web page, just like you would with any other HTTP module in Node, and in addition to returning the page, it saves a copy of the response to your machine. If you request the same page again, it loads the cached version instead of making another trip to the server.
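
Here's roughly what that looks like in practice. This is a minimal sketch, assuming downcache follows the request-style callback signature of (err, response, body); the URL is just a placeholder:

```js
var downcache = require("downcache");

// First call: fetches the page over HTTP and writes a copy of the
// response to disk. (Assumes the request-style callback signature.)
downcache("http://www.example.com", function (err, response, body) {
    if (err) return console.error(err);
    console.log("retrieved " + body.length + " bytes");
});

// Any later call for the same URL is served from the local cache,
// with no second trip to the server.
```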

The idea for downcache comes from a utility function we use at the @UnitedStates project for efficiently crawling government sites. Whenever I embark on a new scraping effort, I inevitably need to request the HTML of a sample page many times to work out how the markup is structured. With downcache, that process is both much faster and much kinder to the target's server.

Screen scraping tends to be a menial and inefficient task assigned to junior coders, particularly as more and more sites offer APIs. But I still do it every chance I get at work, because I think it represents one of the core pleasures of programming: Working through the puzzle of a complicated structure to extract what's meaningful. It also gives one an intimate understanding of how others have structured their pages: How they build their markup, how much of the content is present on page load, and which parts the client requests afterward. I consider it reconnaissance work, and the more challenging the better. At least until I encounter some horrific .NET framework.
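
To make that concrete, here's the kind of thing a scraping session boils down to. This is a hypothetical sketch: the URL and selectors are invented for illustration, and cheerio is just one common choice of parser for this sort of work:

```js
var downcache = require("downcache");
var cheerio = require("cheerio");

// Hypothetical example: the URL and the markup structure it assumes
// are invented for illustration.
downcache("http://www.example.gov/press-releases", function (err, response, body) {
    if (err) return console.error(err);

    var $ = cheerio.load(body);

    // Pull the title and date out of each release, assuming each one
    // is wrapped in a predictable <div class="release"> block.
    $("div.release").each(function () {
        var title = $(this).find("h2 a").text().trim();
        var date  = $(this).find("span.date").text().trim();
        console.log(date + ": " + title);
    });
});
```

Most of the work is in discovering those selectors in the first place, which is exactly the puzzle described above.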

In the past few months I've seen pitches from companies like Priceonomics that promise to turn pages into APIs and handle all the parsing under the hood. I'm sure they're fine services, but they end up doing work for me that I genuinely like doing myself. Scraping pages is like rewinding the entire process of web development, from the final product back to the data. Though if you run a site, you really ought to offer a simple API all the same.