Natural Jay Carney processing

July 1, 2013

Two weeks ago, my colleague Rachel Hartman and I published a story on Yahoo News with the headline "The top 9,486 ways Jay Carney won’t answer your questions." I'd like to briefly discuss how we did this.

As of June 21, 2013, there were 444 briefing transcripts on the White House website that included an appearance by Jay Carney, the White House press secretary. Those transcripts encompass 21,556 lines of dialogue from reporters and 19,764 responses from Carney. (The actual number of questions asked, a subjective measure, is lower; questions often involve some rapid back-and-forth.)

After cleaning up the data a bit, I used the Python Natural Language Toolkit (NLTK) to generate trigrams (three-word phrases) from every sentence that Carney spoke.
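For readers who want to try this, here is a minimal sketch of the trigram step. It is not the code we ran: it uses only the Python standard library in place of NLTK (whose `nltk.util.ngrams` helper does the same job), the tokenizer is a crude assumption, and the sample sentences are invented for illustration.

```python
import re
from collections import Counter


def tokenize(sentence):
    # Crude tokenizer (an assumption): keep apostrophes so "don't" stays one token.
    return re.findall(r"[A-Za-z0-9']+", sentence)


def trigrams(tokens):
    # Slide a three-word window across the token list.
    return zip(tokens, tokens[1:], tokens[2:])


def count_trigrams(sentences):
    # Count trigrams per sentence so no trigram spans a sentence boundary.
    counts = Counter()
    for sentence in sentences:
        counts.update(trigrams(tokenize(sentence)))
    return counts


# Invented sample sentences, not lines from the real transcripts.
sample = [
    "I don't have anything new on that.",
    "I don't have a readout for you.",
    "I think that the President has been clear.",
]

counts = count_trigrams(sample)
for trigram, frequency in counts.most_common(3):
    print(" ".join(trigram), frequency)
```

Case is preserved on purpose here, so "And I think" and "and I think" would be counted separately.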

The most frequent trigram that appears in the Carney corpus, including all "stop words" like "the" and "and", is "I don't have."

Here are the top ten:

Trigram               Frequency
I don't have          1,974
the American People   1,811
the President has     1,669
the United States     1,575
we need to            1,428
I think that          1,413
that the President    1,347
As you know           1,291
And I think           1,248
Not going to          1,144

In fact, Carney speaks with such precision that he resembles a computer program attempting to mimic human speech patterns. From here, we were easily able to construct archetypal evasions by grouping similar trigrams that met our arbitrary standard of what sounded like a variation on "no comment." (See the text.similar() example in the first chapter of O'Reilly's NLTK book.) For example, "would refer you," "would direct you," and "would point you" all appear in nearly identical contexts.
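To make the grouping idea concrete, here is a simplified stand-in for what text.similar() does: it treats two words as similar when they show up between the same neighboring words. This is a rough sketch, not NLTK's implementation and not our production code; the function names and the sample sentences are made up for illustration.

```python
from collections import defaultdict


def context_index(sentences):
    """Map each word to the set of (previous word, next word) pairs around it."""
    contexts = defaultdict(set)
    for tokens in sentences:
        # Pad with boundary markers so first and last words get a context too.
        padded = ["<s>"] + tokens + ["</s>"]
        for prev, word, nxt in zip(padded, padded[1:], padded[2:]):
            contexts[word].add((prev, nxt))
    return contexts


def similar_words(target, contexts):
    """Rank other words by how many contexts they share with the target."""
    target_contexts = contexts.get(target, set())
    shared = {
        word: len(ctx & target_contexts)
        for word, ctx in contexts.items()
        if word != target
    }
    # Most shared contexts first; ties broken alphabetically.
    ranked = sorted(shared.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, overlap in ranked if overlap > 0]


# Invented sample sentences, already split into tokens.
sample = [
    "I would refer you to the Department of Justice".split(),
    "I would direct you to the State Department".split(),
    "I would point you to his remarks".split(),
    "I think that he would agree".split(),
]

index = context_index(sample)
print(similar_words("refer", index))
```

On this toy input, "direct" and "point" come back as similar to "refer" because all three sit between "would" and "you." Deciding which of those groups counts as an evasion was still a judgment call made by hand.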

From there, it was a simple matter of matching trigrams back to the original corpus to produce a widget with every single instance.
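The matching step might look something like the sketch below. It assumes the transcript is a list of strings, one per line of dialogue, and the `find_instances` helper and sample lines are hypothetical; the actual widget code is not shown here.

```python
import re


def find_instances(phrase, transcript):
    """Return (line number, line) for every line that contains the phrase."""
    # Word boundaries keep "we need to" from matching inside longer words;
    # matching ignores case so sentence-initial variants are caught too.
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    return [
        (number, line)
        for number, line in enumerate(transcript, start=1)
        if pattern.search(line)
    ]


# Invented sample lines, not real transcript text.
transcript = [
    "I don't have anything new for you on that.",
    "The President has been clear about this.",
    "Well, I don't have a readout of that call.",
]

hits = find_instances("I don't have", transcript)
for number, line in hits:
    print(number, line)
```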