What's Behind The Data-Mining Curtain?

Via Josh Marshall (short post, so that's the whole thing):

As you can see, we now have the first hint of what was at the center of the Ashcroft hospital room showdown. According to the New York Times, what the White House calls the 'terrorist surveillance [i.e., warrantless wiretap] program' originally included some sort of large-scale data mining.

I don't doubt that this is true as far as it goes. But this must only scratch the surface because, frankly, at least as presented, this just doesn't account for the depth of the controversy or the fact that so many law-and-order DOJ types were willing to resign over what was happening. Something's missing.

Of course, 'data mining' can mean virtually anything. What kind of data and whose you're looking at makes all the difference in the world. Suggestively, the Times article includes this cryptic passage: "Some of the officials said the 2004 dispute involved other issues in addition to the data mining, but would not provide details. They would not say whether the differences were over how the databases were searched or how the resulting information was used."

To put this into perspective, remember that the White House has been willing to go to the public and make a positive argument for certain surveillance procedures (notably evasion of the FISA Court strictures) which appear to be illegal on their face. This must be much more serious and apparently something all but the most ravenous Bush authoritarians would never accept. It is supposedly no longer even happening and hasn't been for a few years. So disclosing it could not jeopardize a program. The only reason that suggests itself is that the political and legal consequences of disclosure are too grave to allow.

Late Update: The Post has a follow story on the data mining issue. It covers most of the same ground but hints a little more directly about possible interception of emails and phone calls. The article suggests that examination of "metadata" was the issue here. But, again, it doesn't fit. The intensity of the covering up doesn't match the alleged secret.

So what does match the intensity of the covering up?

First, a word about data mining. That Wiki article will give you all the info you need, I expect, but here's what I'd like to emphasize. You have a vast amount of data, too much for one person or even a dedicated group of people to pore over. Therefore, you use a computer to search out key phrases or data sets to help focus your search. The computer can cut through the chatter and find the key phrases, giving the investigators a much smaller pile of data to search more closely, or simply to start reading with human eyes.

Say you're in the FBI, and you want to hunt terrorists. You know that the terrorists are using the Internet, for example. So you find a way to allow your data mining program to intercept email messages and scan them as they travel down the tubes. The ones with the key phrases get flagged, and the ones that don't are flushed.
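To make the flag-or-flush step concrete, here is a minimal sketch of what such a filter might look like. The phrase list, the sample messages, and the function name flag_messages are all invented for illustration; nothing here reflects how any real surveillance system works.

```python
# Hypothetical key phrases and messages, made up for this example.
KEY_PHRASES = ["the package", "wedding date"]


def flag_messages(messages, key_phrases):
    """Split messages into (flagged, flushed) by case-insensitive phrase match."""
    lowered = [phrase.lower() for phrase in key_phrases]
    flagged, flushed = [], []
    for message in messages:
        text = message.lower()
        # A message is flagged if it contains any one of the key phrases.
        if any(phrase in text for phrase in lowered):
            flagged.append(message)
        else:
            flushed.append(message)
    return flagged, flushed


messages = [
    "The package arrives Tuesday.",
    "Lunch at noon?",
    "Have you picked a WEDDING DATE yet?",
]
flagged, flushed = flag_messages(messages, KEY_PHRASES)
```

The filter is trivial. Everything depends on what goes into KEY_PHRASES.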

This is my question: where do the key phrases come from?

I rather doubt that the terrorists are talking freely about wanton destruction in any way. There's got to be some kind of coded message, and it needs to be rather innocuous sounding small talk to escape detection. "I got oranges at the market today, but I couldn't lay my hands on any rhubarb." And so bin Laden knows that the cell leader talked to one crucial member of the cell but couldn't find another.

How do you data mine a conversation like that? How do you even know what they are talking about?

This is off the top of my head, but tell me why it wouldn't work. You do some initial surveillance. You do the monitoring of telephone calls to known al-Qaeda members outside the country. You come up with a pared-down list of people in the United States that may or may not be talking to the terrorists about terrorism.

Then you track down their email accounts. You feed the telephone conversations into a text analyzer.

And then you crack those emails and telephone conversations open. You use text analysis to identify common phrases and words, you put human eyes on this mass of data to see if humans think these conversations sound fishy, and then you boil it all down to key phrases to focus the larger task of data mining.
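As a rough sketch of the "identify common phrases" step, assume the text analysis is nothing fancier than counting two-word phrases that turn up in more than one conversation. The sample conversations and the function name common_phrases are hypothetical, and real text analysis would be far more involved.

```python
import re
from collections import Counter


def common_phrases(conversations, n=2, min_conversations=2):
    """Return n-word phrases that appear in at least min_conversations texts."""
    counts = Counter()
    for text in conversations:
        words = re.findall(r"[a-z']+", text.lower())
        # Count each phrase once per conversation, so one chatty
        # message cannot dominate the tally.
        phrases = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        counts.update(phrases)
    return sorted(p for p, c in counts.items() if c >= min_conversations)


# Made-up conversations for illustration.
conversations = [
    "I got oranges at the market today",
    "No oranges at the stall, sorry",
    "Did you get oranges at all?",
]
candidates = common_phrases(conversations)
```

On this toy input the candidates are "at the" and "oranges at". The first is plain filler, which is why the human review step still matters: the counter cannot tell a code phrase from ordinary English.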

And then you do it all over again. You identify people who are using the same key phrases in their emails and telephone conversations. You boil them down, crack open their conversations, and work the process again. Hits begin to accumulate, maybe. Misses get refined out of the process, maybe. But since the key phrases are likely innocuous sounding, you get hundreds of thousands of misses.
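Here is a toy version of that loop, under the loud assumption that "key phrases" just means two-word phrases shared by at least two people already on the watch list. The names, the messages, and the expand function are all invented for this sketch.

```python
import re
from collections import Counter


def bigrams(text):
    """Return the set of two-word phrases in a piece of text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {" ".join(words[i:i + 2]) for i in range(len(words) - 1)}


def expand(messages_by_person, seed_people, rounds=2):
    """Repeatedly derive key phrases from watched people, then sweep everyone."""
    watched = set(seed_people)
    phrases = set()
    for _ in range(rounds):
        # Derive key phrases: bigrams used by at least two watched people.
        counts = Counter()
        for person in watched:
            seen = set()
            for message in messages_by_person[person]:
                seen |= bigrams(message)
            counts.update(seen)
        phrases = {p for p, c in counts.items() if c >= 2}
        # Sweep everyone: whoever uses any key phrase joins the watch list.
        for person, msgs in messages_by_person.items():
            if any(bigrams(m) & phrases for m in msgs):
                watched.add(person)
    return watched, phrases


# Hypothetical people and messages.
messages_by_person = {
    "suspect_a": ["I got oranges at the market today"],
    "suspect_b": ["No oranges at the market this week"],
    "shopper": ["Prices at the market went up again"],
    "bystander": ["See you on Friday"],
}
watched, phrases = expand(messages_by_person, {"suspect_a", "suspect_b"})
```

Even in this four-person example, the "shopper" ends up on the watch list for writing "at the market". Scale that up to everyone's email and you can see where the misses come from.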

It's the chicken or the egg problem: which comes first, the data mining or the key phrases?

What if, to get a huge jump start on the process, the Bush Administration violated the privacy rights of thousands of Americans, again and again and again? What if career prosecutors were unwilling to accept any of the tainted results because they would never be admissible in a court of law if the original and continued process of validating key phrases became known? What if millions of dollars were thrown into this kind of a program, all to get the vast amount of evidence thrown out and the guilty set free?

Okay, don't listen to that - it's all about this:

Well, well. As we wrote over a year ago, after combining careful examination of how Republicans parse their statements with network engineering knowledge available through open sources:

Long story short: (1) Internet surveillance is Bush’s goal, not voice calls; (2) the Republican “wiretap” talking point is a diversion, to voice, away from Internet surveillance; (3) Bush’s domestic surveillance system would pose no engineering challenges whatever to NSA. No rocket science—or tinfoil hats—required.

Can we please stop talking about “wiretaps” now? It’s not your voice communications Bush wants. It’s your mail.

Because email and all other Internet traffic travels as packets that can be routed anywhere on the Internet, some of the packets could go outside the United States. That means the entire message could be forfeit. For all we know, they intentionally direct email out of the country so that they can then grab it.
