# Natural Jay Carney processing

July 1, 2013

Two weeks ago, my colleague Rachel Hartman and I published a story on Yahoo News with the headline "The top 9,486 ways Jay Carney won’t answer your questions." I'd like to briefly discuss how we did this.

As of June 21, 2013, there were 444 briefing transcripts on the White House website that included an appearance by Jay Carney, the White House press secretary. Those transcripts encompass 21,556 lines of dialogue from reporters and 19,764 responses from Carney. (The actual number of questions asked, a subjective measure, is lower; questions often involve some rapid back-and-forth.)

After cleaning up the data a bit, I used the Python Natural Language Toolkit (NLTK) to generate trigrams (three-word phrases) from every sentence that Carney spoke.
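For readers who want to see the idea in code, here is a minimal sketch of trigram counting. It is not the script behind the story: it uses only the standard library instead of NLTK (whose `nltk.trigrams` helper yields the same kind of three-token windows), the tokenizer is a deliberately crude regular expression, and the function names and sample sentences are made up for illustration.

```python
import re
from collections import Counter


def sentence_trigrams(sentence):
    """Return every three-word window in one sentence, in order."""
    # Crude tokenizer (an assumption): runs of letters and straight
    # apostrophes, so a contraction like "don't" stays a single token.
    tokens = re.findall(r"[A-Za-z']+", sentence)
    return list(zip(tokens, tokens[1:], tokens[2:]))


def count_trigrams(sentences):
    """Tally trigrams across sentences without crossing sentence boundaries."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence_trigrams(sentence))
    return counts


# Hypothetical sample sentences, not quotes from the transcripts.
sample = [
    "I don't have anything new for you on that.",
    "I don't have a readout of that call.",
]
counts = count_trigrams(sample)
print(counts.most_common(1))  # [(('I', "don't", 'have'), 2)]
```

Counting per sentence, rather than over one long token stream, keeps a trigram from straddling the end of one sentence and the start of the next.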

The most frequent trigram that appears in the Carney corpus, including all "stop words" like "the" and "and", is "I don't have."

Here are the top ten:

| Trigram | Frequency |
| --- | --- |
| I don't have | 1,974 |
| the American People | 1,811 |
| the President has | 1,669 |
| the United States | 1,575 |
| we need to | 1,428 |
| I think that | 1,413 |
| that the President | 1,347 |
| As you know | 1,291 |
| And I think | 1,248 |
| Not going to | 1,144 |

In fact, Carney speaks with such precision that he resembles a computer program attempting to mimic human speech patterns. From here, it was easy to construct archetypal evasions by grouping similar trigrams that met our arbitrary standard of what sounded like a variation on "no comment." (See the `text.similar()` example in the first chapter of O'Reilly's NLTK book.) For example, "would refer you," "would direct you," and "would point you" all appear in nearly identical contexts.
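One rough way to make "nearly identical contexts" concrete is to record the word before and the word after each trigram, then look for trigrams that share those neighbors. The sketch below is only in the spirit of `text.similar()`, not a reimplementation of it and not the code used for the story. The function names and the sample sentence are hypothetical.

```python
from collections import defaultdict


def trigram_contexts(tokens):
    """Map each trigram to the set of (word before, word after) pairs around it."""
    contexts = defaultdict(set)
    # Start at 1 and stop early so both neighbors always exist.
    for i in range(1, len(tokens) - 3):
        trigram = tuple(tokens[i:i + 3])
        contexts[trigram].add((tokens[i - 1], tokens[i + 3]))
    return contexts


def similar_trigrams(target, contexts):
    """Return other trigrams sharing a context with target, most shared first."""
    target_contexts = contexts.get(target, set())
    shared = {}
    for other, seen in contexts.items():
        overlap = len(seen & target_contexts)
        if other != target and overlap:
            shared[other] = overlap
    # Sort by overlap (descending), then alphabetically for a stable order.
    return sorted(shared, key=lambda t: (-shared[t], t))


# Hypothetical token stream, not taken from the transcripts.
tokens = (
    "I would refer you to the department . "
    "I would direct you to the department . "
    "I would point you to the statement"
).split()

contexts = trigram_contexts(tokens)
matches = similar_trigrams(("would", "refer", "you"), contexts)
print(matches)
```

On a real corpus you would likely want a minimum overlap threshold, since a single shared neighbor pair is weak evidence that two phrases are interchangeable.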

From there, it was a simple matter of matching trigrams back to the original corpus to produce a widget with every single instance.
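The matching step can be sketched as a case-insensitive phrase search over each response. This is an illustration under assumptions, not the code behind the widget: `find_instances`, the sample responses, and the phrase list are all hypothetical.

```python
import re


def find_instances(responses, phrases):
    """Return (response index, phrase, response text) for every phrase occurrence."""
    # Word boundaries keep "would refer you" from matching inside longer words.
    patterns = {
        phrase: re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        for phrase in phrases
    }
    hits = []
    for index, text in enumerate(responses):
        for phrase, pattern in patterns.items():
            # One hit per occurrence, so a repeated phrase is counted each time.
            for _ in pattern.finditer(text):
                hits.append((index, phrase, text))
    return hits


# Hypothetical responses, not quotes from the transcripts.
responses = [
    "I would refer you to the Treasury Department.",
    "The President has been clear.",
    "I would point you to his remarks, and I would refer you to State.",
]
phrases = ["would refer you", "would point you"]

hits = find_instances(responses, phrases)
for index, phrase, text in hits:
    print(index, phrase)
```

Keeping the response index alongside each hit makes it possible to link every instance back to its transcript.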