May 20, 2013

Last week, the White House released 100 pages of printed emails documenting the intelligence community's public response to the Sept. 11, 2012 attacks on the American diplomatic compound in Libya. There are 91 unique messages in the documents and a high level of redundancy due to long reply-chains being printed multiple times.

At Yahoo News, we decided to arrange this information as an interactive inbox, in which readers could view the messages in a basic approximation of a standard email client.

The emails arrived as paper printouts, with the identities of many CIA and State Department officials redacted, so the first task was to translate them into machine-readable text. On short notice, the easiest way to do this was to store the metadata about each message (to, from, cc, subject, date) in one JSON object and to store the content of the messages in individual text files. I used Pagedown, the Javascript Markdown processor, to convert the raw text of the messages into spartan HTML.

The interactive inbox then became a straightforward matter of displaying the JSON file as a table element with a row for each message and making an AJAX call to the text file containing the body of the email when the user clicked on it, mimicking the functionality of an Outlook preview pane. (Very short messages were stored directly in the JSON file.) I wouldn't do it this way again, but such is the nature of deadline-driven development.

This means that when a user clicks on the table row representing a given message as a line in an inbox, the code has to somehow access the original data object used to create that DOM element. It then populates the preview pane with the To and Cc fields and makes an AJAX call to the text file whose name is also stored in the object.
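
The click behavior described above could be sketched roughly as follows. This is a minimal illustration rather than the code behind the inbox: the property names `body` and `file`, the capitalized `To` and `Cc` keys, and the `fetchText` helper (a stand-in for whatever AJAX wrapper is in use) are all assumptions.

```javascript
// Collect the header fields shown in the preview pane.
// `To` and `Cc` are assumed property names on the message object.
function previewHeader(message) {
    return { to: message.To, cc: message.Cc };
}

// Decide where a message body comes from and hand it to `done`.
// `body` and `file` are hypothetical property names; `fetchText(name, callback)`
// stands in for an AJAX helper that loads a text file.
function loadBody(message, fetchText, done) {
    if (typeof message.body === "string") {
        // Very short messages are stored inline, so no request is needed.
        done(message.body);
    } else {
        // Otherwise request the text file named in the metadata.
        fetchText(message.file, done);
    }
}
```

Passing the AJAX helper in as an argument keeps the decision logic separate from the network call, which makes it easy to exercise with a stub.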

Until recently, my solution would have been to assign each row of this table a unique id, store the metadata about the messages in a dictionary-like object with those ids as the keys, and use that id to get the object back. There is nothing terribly wrong with this strategy, but it's tedious and prone to error. After a few false starts, I realized that this issue of projecting data onto the DOM is quite literally the philosophy behind Mike Bostock's D3 library.
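
For comparison, the id-keyed strategy described above might look something like the following sketch. The sample records and the helper names `indexById` and `messageForRow` are hypothetical; only the `id`, `From` and `Subject` fields come from the demo in this post.

```javascript
// Hypothetical sample records standing in for the message metadata.
var messages = [
    { id: "msg-1", From: "Sender A", Subject: "First draft" },
    { id: "msg-2", From: "Sender B", Subject: "Re: First draft" }
];

// Build a dictionary-like object keyed by each message's id.
function indexById(list) {
    var byId = {};
    list.forEach(function(m) { byId[m.id] = m; });
    return byId;
}

var lookup = indexById(messages);

// A click handler would read the clicked row's id attribute and pass it
// here to recover the original object.
function messageForRow(rowId) {
    return lookup[rowId];
}
```

The tedium comes from keeping the ids in the markup and the keys in the dictionary in sync by hand.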

Nearly every example of D3 in action on Mike's Github page uses the library's abstraction of Scalable Vector Graphics (SVG) to visualize a dataset. This is the intended use of D3, I think. But it is just as valid to use it for building traditional HTML elements. Here's a simplified demo:

<table id="inbox" class="inbox">
    <thead>
        <tr class="field">
            <td>From</td>
            <td>Subject</td>
            <td>Date</td>
        </tr>
    </thead>
    <tbody id="messages"></tbody>           
</table>

var row = d3.select("#messages").selectAll(".message")
    .data(messages)
    .enter()
    .append("tr")
    .attr("id", function(d) { return d.id; })
    .attr("class", "message")
    .on("click", function(d) {
        // we now have access to all of the properties of the individual message
    });

row.append("td").text(function(d) { return d.From; });
row.append("td").append("div").text(function(d) { return d.Subject; });
row.append("td").text(function(d) { return d.Date + " " + d.Time; });

It is not quite sufficient to point out that D3 is fully capable of creating and manipulating tables, divs, spans and paragraph tags. We ought to recognize that even the simplest markup is a fully qualified data visualization, working off this horseback definition:

Data visualization is a process whereby some portion of a high-entropy message is represented on one or more spatial dimensions.

It naturally follows that text is data, just like integers and RGB values. By this definition, the browser itself is the greatest engine of data visualization ever invented. And it won that distinction long before it was capable of drawing shapes on a screen.

May 2, 2013
This tutorial demonstrates how you can take a raw series of coordinates and end up with a binned hexagonal map rendered in the browser using d3js and topojson.

Most Americans prefer to huddle together around urban areas, which raises all sorts of problems for map-based visualizations. Coloring regions according to a data value, known as a choropleth map, leaves the map maker beholden to arbitrary political boundaries and, at the county level, pixel-wide polygons in parts of the Northeast. Many publications prefer to place dots proportional in area to the data values over the center of each county, which inevitably produces overlapping circles in these same congested regions. Here's a particularly atrocious example of that strategy I once made at Slate:

[Image: proportional-circle map made at Slate]

Two weeks ago, Kevin Schaul released an exciting new command-line tool called binify that offers a brilliant alternative. Schaul's tool takes a series of points and clusters them (or "bins" them) into hexagonal tiles. Check out the introductory blog post on his site.

Binify operates on .shp files, which can be a bit difficult to work with for those of us who aren't GIS pros. I put together this tutorial to demonstrate how you can take a raw series of coordinates and end up with a binned hexagonal map rendered in the browser using d3js and topojson, both courtesy of the beautiful mind of Mike Bostock. All the source files we'll need are on Github.

Sample data

I downloaded about 2,000 addresses from a Craigslist-like website and converted them to coordinates with geopy.

Setup

We're going to use one small Python script to create our .shp file. It's recommended you first create and activate a virtualenv with:

virtualenv virt
source virt/bin/activate

Whether or not you use virtualenv:

pip install -r requirements.txt

You also need to install ogr2ogr and topojson for working with the shapefiles.

Conversions

CSV -> SHP

Binify takes as input a .shp file, a format developed by ESRI for geospatial data. Specifically, it needs a "point shapefile" that contains a layer of individual coordinates. (Most .shp files you're likely to encounter consist of a lot of polygons marking territorial boundaries and so forth.) We can make a .shp file from a list of raw coordinates with the pyshp library. The shpify.py script in the Github repo for this demo will take care of this:

./script/shpify.py

If you look at the source, you'll see this is a very simple process of loading the coordinates from coordinates.csv and writing them to a shapefile, much as you might when creating a new .csv file in Python.

This script should place a file called output.shp in the shapefiles directory. Pyshp also creates the companion files output.dbf and output.shx. We also need a projection file, output.prj, so this script manually creates one.

Load these files into a GIS program such as Quantum GIS (QGIS) and you'll see a nice collection of points:

[Image: the points displayed in Quantum GIS]

SHP -> Binned SHP

Here is where Binify comes in. Per the documentation, we simply feed it our point shapefile with a few arguments.

First, we want to give it enough hexagons to achieve the granularity we want. 120 hexagons across sounds like a good starting target.

Because these sample coordinates span the United States, we can expect many of the hexagons to encompass zero points. We can greatly reduce the file size by including the -e argument, which prevents binify from writing empty polygons.

binify -n=120 -e shapefiles/original.shp shapefiles/binned.shp

This may take a few minutes to run. When finished, you'll have a new set of files named binned.shp and so forth.

Load those files into QGIS and, like magic, we've got hexagons:

[Image: the hexagonal bins displayed in QGIS]

Binned SHP -> GeoJSON -> topoJSON

The mechanics of how to build GeoJSON and topoJSON files are well-documented--see this Stack Overflow question of mine and the generous answer from Bostock, for example--so we'll skip to the CLI commands:

ogr2ogr -f GeoJSON binned.json shapefiles/binned.shp

Make sure to use the -p flag with the next line to preserve the COUNT property:

topojson -s 7e-9 -p -o coordinates.json -- binned.json

This reduces the 1.9MB .shp file to an 88KB .json file.

Mapping

We can reuse 90 percent of the code in the d3 choropleth map example, which serves as a nice introduction to topoJSON mapping.

As Schaul notes in his introductory blog post, how you divide your data into color bins is critically important to how viewers interpret the information. In this case, I was lazy and simply colored all the hexagons red and then dimmed them according to the COUNT value (specifically, the square root of the ratio of the value to the maximum value on the map).
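
The dimming rule described above can be written as a small function. This is a sketch under stated assumptions, not the code from the demo: the name `hexOpacity` and the sample counts are made up for illustration.

```javascript
// Opacity for one hexagon: the square root of its share of the maximum count.
function hexOpacity(count, maxCount) {
    if (maxCount <= 0) { return 0; }
    return Math.sqrt(count / maxCount);
}

// Hypothetical COUNT values standing in for the binned features.
var counts = [1, 4, 16];
var maxCount = Math.max.apply(null, counts);
var opacities = counts.map(function(c) { return hexOpacity(c, maxCount); });
// opacities is [0.25, 0.5, 1]
```

Taking the square root lifts the low end of the range, so sparsely populated hexagons stay visible next to the densest ones. In a D3 map, a function like this could be used as the value of a fill-opacity style, reading each feature's COUNT property.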

And there you have it. If the hexagons look a little too big, just rerun the binify command with a larger value for n. The following map has been rendered live in your browser:

You can see the map with the code on bl.ocks.org.

April 25, 2013
Try out this simple Javascript and SVG-powered metronome.

I've recently been playing around with Soundslice, an ambitious new project from Django co-founder Adrian Holovaty that creates interactive guitar tabs. The site has huge potential for moving guitar tabs beyond the domain of ASCII files and 14 variations on the same song.

Soundslice pegs each transcription to a specific recording of the song, which is embedded as a video alongside the tabs. Each chord or tablature is pegged to a specific timestamp in the video.

My favorite feature is the ability to tap out measures on your laptop's keyboard as you listen to the song, marking out blank chords that you can then fill in. At the moment, however, there's no margin of error for users who cannot tap out each measure with precisely the same duration. I'm interested in how a computer could be trained to correct for human error to create regular measures. (After all, tabs are always slightly idealized versions of what's happening in a recording.)

To play around with this task, I needed a metronome that plays in the browser. So I made one using Raphael, my favorite Javascript wrapper for SVG graphics. The math and philosophy behind tempo error correction will come in a future post, but for now, here's a demo and the source for the metronome.

You initialize the metronome with a few parameters to set the size and angle of the animation. You can also attach custom functions to the tick event and the event that fires on the final tick. In the demo, I attach two functions that write updates to the screen.

function tick(t) {
    $("<div />").html(t%2 === 1 ? "Tick":"Tock").appendTo(".status");
    $("#count").html(t);    
}

function done() {
    $("<div />").html("Done!").appendTo(".status");
    $("#startstop").html("start");
}

var m = metronome({
    len: 200,
    angle: 20,
    paper: "metronome_container",
    tick: tick,
    complete: done,
    path: ""
});

These are not actual Javascript events, though they probably should be.

The metronome has two functions, .start() and .stop(). The first takes two arguments, a tempo (expressed as beats per minute, like your piano teacher taught you) and a number of ticks:

m.start(120, 50);
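
To make the arithmetic concrete, here is a small sketch of how a tempo in beats per minute translates into timing. It assumes one tick per beat and is not taken from the metronome's source; `tickInterval` and `totalDuration` are hypothetical helper names.

```javascript
// Milliseconds between ticks at a given tempo in beats per minute,
// assuming one tick per beat.
function tickInterval(bpm) {
    return 60000 / bpm;
}

// Total running time, in milliseconds, for a number of ticks at that tempo.
function totalDuration(bpm, ticks) {
    return tickInterval(bpm) * ticks;
}

tickInterval(120);       // 500
totalDuration(120, 50);  // 25000
```

Under that assumption, the call above would tick every half second and run for 25 seconds.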

You can interrupt the execution with:

m.stop();

At fast tempos, the weight occasionally gets disconnected from the metronome's arm. I asked about this issue, which involves Raphael's .animateWith() function, on Stack Overflow, but I'm not convinced the accepted answer is complete.

April 17, 2013

See the source code for this project on GitHub.

In scrambling to find something to do around the bombings at the Boston Marathon, I came across a searchable database of all 26,000 participants. Fortunately, it's possible to search everyone by hitting "Enter" with no search terms.

Each runner's page lists up to 10 timestamps marking his or her progress: one every five kilometers and two more at the halfway point and the finish line. It took about an hour to scrape every page and extract this information. The result was about 15MB of data.

Even if it were realistic to load this much information in a browser all at once, the human eye cannot make sense of 26,000 simultaneous animations. (The browser can't handle it either.) So I split the race into 72 five-minute intervals and estimated, for each contestant, which kilometer marker he or she would be closest to in each interval.

For markers not divisible by five, this involved a simple linear interpolation. While a runner's pace almost always slows considerably from the start to the finish, the error in assuming constant velocity for five-kilometer spans is probably negligible.
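
The interpolation can be sketched as follows. This is an illustration under assumptions, not the project's code: the `{ km, seconds }` shape of each checkpoint and the names `kmAt` and `nearestMarker` are hypothetical, and times are treated as elapsed seconds from the runner's own start.

```javascript
// Estimate how far along the course a runner is at `t` seconds of elapsed time.
// `splits` is a hypothetical list of { km, seconds } checkpoints sorted by
// distance; the start line (0 km at 0 seconds) is added implicitly.
function kmAt(splits, t) {
    var prev = { km: 0, seconds: 0 };
    for (var i = 0; i < splits.length; i++) {
        var next = splits[i];
        if (t <= next.seconds) {
            // Assume constant pace between the two surrounding checkpoints.
            var share = (t - prev.seconds) / (next.seconds - prev.seconds);
            return prev.km + share * (next.km - prev.km);
        }
        prev = next;
    }
    return null; // past the last recorded checkpoint
}

// Nearest whole kilometer marker, or null once the runner has no more splits.
function nearestMarker(splits, t) {
    var km = kmAt(splits, t);
    return km === null ? null : Math.round(km);
}

var sample = [{ km: 5, seconds: 1500 }, { km: 10, seconds: 3000 }];
kmAt(sample, 2250); // 7.5, halfway between the 5 km and 10 km checkpoints
```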

The result, a matrix of 42 kilometers times 72 intervals, is enormously easier to handle.
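
One way to assemble such a matrix is sketched below. It assumes each cell holds the number of runners nearest a given marker during a given interval, which the post does not spell out; the input shape and the name `buildMatrix` are likewise hypothetical.

```javascript
var KILOMETERS = 42;
var INTERVALS = 72;

// Count runners nearest each kilometer marker in each five-minute interval.
// `markersByRunner` is a hypothetical input: one array per runner holding
// the nearest marker (1 to 42) for each interval, or null when not running.
function buildMatrix(markersByRunner) {
    var matrix = [];
    for (var km = 0; km < KILOMETERS; km++) {
        var row = [];
        for (var i = 0; i < INTERVALS; i++) { row.push(0); }
        matrix.push(row);
    }
    markersByRunner.forEach(function(markers) {
        markers.forEach(function(km, interval) {
            // Skip gaps and anything outside the 42 x 72 grid.
            if (km !== null && km >= 1 && km <= KILOMETERS && interval < INTERVALS) {
                matrix[km - 1][interval] += 1;
            }
        });
    });
    return matrix;
}
```

However many runners go in, the output stays at 42 x 72 = 3,024 cells, which is what makes it so much lighter than the per-runner data.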

I wish I had had more time to play with filtering the data by age and gender. Alas, the news moves on.
