Recreating the Atlantic's Netflix database with Node
The Atlantic published a delightful exploration last week of Netflix's surgically precise category descriptions, which include gems like "Critically-acclaimed Irreverent Crime Movies" or "Dramas starring Charlotte Rampling" (whoever she is). Because Netflix identifies the categories with integers in the URLs, Alexis Madrigal was able to scrape all 76,897 unique category names. It's data mining at its best.
Madrigal's description of how he did it, however, made me wince.
I'd been playing with an expensive piece of software called UBot Studio that lets you easily write scripts for automating things on the web. Mostly, it seems to be deployed by low-level spammers and scammers, but I decided to use it to incrementally go through each of the Netflix genres and copy them to a file.
After some troubleshooting and help from Bogost, the bot got up and running and simply copied and pasted from URL after URL, essentially replicating a human doing the work. It took nearly a day of constantly running a little Asus laptop in the corner of our kitchen to grab it all.
The cheapest version of UBot goes for $245, and the 20 hours it took Madrigal to run the script is far, far too long for 100,000-some URL calls. This struck me as a good job for Node, so last weekend I decided to see how quickly I could rebuild the data.
(Node is the immensely popular server-side JavaScript runtime. I converted from Python land about four months ago out of sheer frustration with mixing up syntaxes, and I'm never going back.)
Because Netflix shut down its API last spring—bad Netflix!—this has to be accomplished by recreating the login process using POST requests. This was easy enough to figure out by watching all the network activity in Chrome while logging in to Netflix. As it turns out, you need to make two requests in order to get a token from the login page.
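Here's roughly what that handshake looks like in Node. This is a sketch, not my actual script: I'm using the well-known request and cheerio libraries as stand-ins, and the login URL and form field names (authURL, email, password) are my reading of that network traffic, so verify them against the real login page before trusting them.

// A sketch of the two-request login: GET the login page to pick up
// a hidden token, then POST it back along with the credentials.
// The URL and form field names here are assumptions.
var request = require('request'),
    cheerio = require('cheerio');

var jar = request.jar(); // hold on to Netflix's cookies between requests

function login(email, password, callback) {
    // Request 1: fetch the login form and scrape the token out of it
    request({ url: 'https://signup.netflix.com/Login', jar: jar }, function (err, res, body) {
        if (err) return callback(err);
        var token = cheerio.load(body)('input[name="authURL"]').val();

        // Request 2: POST the credentials, token included
        request.post({
            url: 'https://signup.netflix.com/Login',
            jar: jar,
            form: { email: email, password: password, authURL: token }
        }, function (err) {
            callback(err); // on success, the jar now holds a session
        });
    });
}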
After logging in programmatically, and circumventing an annoying cookie, I was able to start pulling down the category URLs in order. The whole horribly unoptimized script took about 90 minutes to run. I ended up with 76,268 unique categories. It cost me $0.
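Continuing the sketch above, the scraping itself is just a loop over the integer IDs, recording each page's title in a database as it goes. The WiGenre URL pattern and the title cleanup below are guesses on my part; the actual script may go about it differently.

// Walk the category IDs in order, saving each page title to SQLite.
// The URL pattern and the <title> parsing are assumptions.
var sqlite3 = require('sqlite3');
var db = new sqlite3.Database('netflix.sqlite');
db.run('CREATE TABLE IF NOT EXISTS categories (id INTEGER, title TEXT)');

function scrape(id) {
    if (id > 100000) return db.close(); // somewhere past the last known ID
    request({ url: 'http://movies.netflix.com/WiGenre?agid=' + id, jar: jar }, function (err, res, body) {
        if (!err && res.statusCode === 200) {
            // The category name lives in the page title, e.g.
            // "Critically-acclaimed Irreverent Crime Movies - Netflix"
            var title = cheerio.load(body)('title').text().replace(' - Netflix', '').trim();
            db.run('INSERT INTO categories VALUES (?, ?)', id, title);
        }
        scrape(id + 1); // strictly one at a time: simple, slow, unoptimized
    });
}
scrape(1);

Firing off a few dozen requests at a time instead of one would cut the runtime down considerably, but serial is the simplest thing that works.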
(I should state here that this is probably a violation of the Netflix terms of service.)
I point all of this out not to show off—mostly not to show off—but because it's useful to see what sort of data mining is possible without any extraordinary programming ability, and because it's important that any great piece of data journalism be replicated independently when possible.
The script is here. To run it yourself, you need to install Node and Git, which you really ought to do anyway. Then download my script. From the command line, that looks like this:
git clone https://gist.github.com/8312213.git
cd 8312213
This pulls down two files: the script itself and one called package.json, with some information about what the script needs to work.
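For the curious, a package.json for a scraper like this is only a few lines. The one below is illustrative rather than the actual file, and the dependency list is my guess at what a script like this needs:

{
  "name": "netflix-categories",
  "version": "0.0.1",
  "dependencies": {
    "request": "2.x",
    "cheerio": "0.x",
    "sqlite3": "2.x"
  }
}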
To install the dependencies, run:
npm install
A whole bunch of stuff will fly by. These are libraries the script uses to help download web pages and extract information from the HTML.
Once this is done, you can get started:
node scrape.js <YOUR NETFLIX LOGIN> <YOUR NETFLIX PASSWORD>
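There's no magic to those arguments. Node hands everything on the command line to the script in process.argv, so inside the script the credentials presumably land in variables something like this:

// process.argv is [node, path-to-script, arg1, arg2, ...],
// so the login and password are the third and fourth entries
var email = process.argv[2],
    password = process.argv[3];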
The data is stored in a SQLite database, netflix.sqlite. You can easily extract it once the script is done by installing sqlite3 and then running the following commands:
sqlite3 netflix.sqlite
.mode csv
.header on
.output categories.csv
SELECT * FROM categories WHERE title != '' GROUP BY title;
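The WHERE clause skips IDs that came back without a title, and the GROUP BY collapses duplicates, leaving categories.csv with one row per unique category name.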
The entire script is below, heavily commented.