Bruce Schneier has a fascinating essay on how easy it is to discern someone’s identity from anonymous data. Researchers at the University of Texas discovered that by comparing movie ratings in the data set for the Netflix challenge to movie ratings from IMDb users, you can figure out who rated the movies in the anonymous Netflix data. And as it turns out, you don’t need all that many ratings to do it:
With only eight movie ratings (of which two may be completely wrong), and dates that may be up to two weeks in error, they can uniquely identify 99 percent of the records in the dataset. After that, all they need is a little bit of identifiable data: from the IMDb, from your blog, from anywhere. The moral is that it takes only a small named database for someone to pry the anonymity off a much larger anonymous database.
What interests me about this is how little data it takes to uniquely identify a person. Schneier provides a number of other examples in this vein as well. I imagine you could do the same thing with records of a person’s doctor visits or even dental visits, and I expect that you could pretty easily identify me among all Amazon.com customers based only on the purchases I made in 2007. We really do live in the age of data mining.
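To make the matching idea from the quote concrete, here is a rough Python sketch. It is not the researchers' actual algorithm, just a simplified illustration of tolerant matching: a handful of known (rating, date) pairs is compared against each anonymous record, allowing a couple of ratings to be wrong and dates to be off by up to two weeks. The function names, the record layout, and the toy data are all assumptions made up for this example.

```python
from datetime import date

# Tolerances borrowed from the figures in the quote; everything else is hypothetical.
MAX_DATE_ERROR_DAYS = 14
MAX_WRONG_RATINGS = 2


def matches(known, record, max_wrong=MAX_WRONG_RATINGS, max_days=MAX_DATE_ERROR_DAYS):
    """Return True if `record` agrees with `known` on all but `max_wrong` movies.

    Both arguments map a movie title to a (stars, date) tuple.
    """
    wrong = 0
    for movie, (stars, when) in known.items():
        entry = record.get(movie)
        if entry is None:
            wrong += 1  # movie missing from the record counts as a miss
            continue
        rec_stars, rec_when = entry
        if rec_stars != stars or abs((rec_when - when).days) > max_days:
            wrong += 1
    return wrong <= max_wrong


def candidates(known, anonymous_db, **tolerances):
    """Return the IDs of all anonymous records consistent with `known`."""
    return [rid for rid, record in anonymous_db.items()
            if matches(known, record, **tolerances)]


# Toy "anonymous" data set: record ID -> {movie: (stars, date rated)}.
anonymous_db = {
    "user_17": {"A": (5, date(2005, 3, 1)), "B": (2, date(2005, 3, 9)),
                "C": (4, date(2005, 4, 2))},
    "user_42": {"A": (1, date(2005, 3, 1)), "B": (5, date(2005, 6, 1)),
                "D": (3, date(2005, 4, 2))},
}

# What an outsider might know from a public profile (slightly noisy).
known = {"A": (5, date(2005, 3, 10)), "B": (2, date(2005, 3, 5)),
         "C": (3, date(2005, 4, 1))}

print(candidates(known, anonymous_db, max_wrong=1))  # prints ['user_17']
```

If the list that comes back contains exactly one ID, the "anonymous" record has been tied to the named profile. With real data the comparison would need to weight obscure movies more heavily than popular ones, which this sketch does not attempt.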