Privacy. Everybody talks about it. Grandstanding politicians make plenty of loud noises in the general direction of the internet, disparaging it for turning your perusal of Kim Kardashian-related articles into targeted ads for breast enhancement surgery and Kanye West tickets. Of course, while these politicians are making all this noise about your privacy, they're quietly signing off on efforts allowing them to sneak in the backdoor and raid your browser history.
Putting the government in charge of your privacy has never been a great idea. When HIPAA was enacted, its privacy requirements greatly affected the medical community. Like many regulatory acts, HIPAA both raised costs (additional paperwork and other compliance factors) and lowered quality (negatively affecting retrospective research and curtailing proactive follow up care).
The true cost of all this additional paperwork, regulation and privacy is now coming to light. Via The Volokh Conspiracy comes the news that HIPAA's privacy requirements may have hampered research efforts that could have prevented an estimated 90,000 unnecessary heart attacks and 25,000 deaths.
Vioxx, the non-steroidal anti-inflammatory drug once prescribed for arthritis, was on the market for over five years before it was withdrawn from the market in 2004. Though a group of small-scale studies had found a correlation between Vioxx and increased risk of heart attack, the FDA did not have convincing evidence until it completed its own analysis of 1.4 million Kaiser Permanente HMO members. By the time Vioxx was pulled, it had caused between 88,000 and 139,000 unnecessary heart attacks, and 27,000-55,000 avoidable deaths.
Even the government's own regulators were stymied by HIPAA's privacy requirements, as was pointed out by Dr. Richard Platt, a drug risk researcher for the FDA:
The Vioxx debacle is a haunting illustration of the importance of large-scale data research. If researchers had had access to 7 million longitudinal patient record, a statistically significant relationship between Vioxx and heart attack would have been revealed in under three years. If researchers had had access to 100 million longitudinal patient records, the relationship would have been discovered in just three months. Of course, if public health researchers did post-market studies that looked for everything all the time, many of the results that look significant would be the product of random noise. But even if it took six months or one year to become confident in the results from a nation-wide health research database, tens of thousands of deaths may have been averted.
At least as troubling as the fact that several thousand deaths could have been prevented if HIPAA's restrictions and terms had not been so limiting is the fact that the privacy stipulations were put into place based on a faulty premise and the Dept. of Health and Human Services' misplaced confidence in the erroneous results.
The premise, as demonstrated by Massachusetts graduate student Latayna Sweeney, was that patient reidentification was possible using only voter registration records and Massachusetts Group Insurance Commission's (GIC) anonymized records. Sweeney was able to reidentify Governor Weld using voter record information, including birth date, name, address, zip code and sex and cross-referencing it with GIC's data. But, as Info/Law points out, Sweeney made a couple of errors, not the least of which was conflating two different terms:
Latanya Sweeney used census data to estimate that 87% of the population has a unique combination of 5-digit zip code, birthdate, and gender, and implied that the same sort of attack, using voter registration records or other public files. Phillip Golle's replication corrected the figure to 63%, though that's hardly comforting. But these uniqueness statistics are rather misleading. There is an important difference between distinguishability and identifiability. Distinguishability is a necessary condition to conduct the sort of matching attack that Ohm describes, but it is not sufficient. Latanya Sweeney conflated the two when she suggested that a unique individual can be identified by linking the unique combination of attributes to public records-voter registration records, e.g.. But public records are never complete. We know, for example, that a significant portion of the population is not registered to vote. How was Sweeney so sure that there was not another man who shared Gov. Weld's birth date and zip code who was not registered to vote?
Not only was the data set incomplete, but it was overly simplistic and off by a large margin:
Daniel Barth-Jones has recently uploaded a fascinating new article that revisits the famous Gov. Weld reidentification. To start with, Sweeney's estimate of the Cambridge population is way off. There were nearly 100,000 people living in Cambridge at the time of the William Weld attack. This should have been the first hint that Sweeney's methodology was overly simple. She reported a population of 54,000 because that is the number of Cambridge residents who were registered to vote. Sweeney used these records as if they described the entire population.
By comparing Sweeney's count of Cambridge voter registrants with U.S. Census records, Barth-Jones confirmed that many voting-age adults in Cambridge (about 35%) were not registered to vote. In William Weld's case, the census data show that approximately 174 men living in Weld's zip code were Weld's age. We don't know their precise birth dates, but we can calculate that the chance another man living in Weld's zip code shared his birthdate was about 35%. This is quite important all on its own to illustrate the difference between identifiability and distinguishability. Most of those 174 men had a unique combination of birth date, gender, and zip code, but each one of them was quite likely-35% likely-to be non-unique.
Sweeney presumably used the voter registration records to rule out the possibility that some of these 174 Cambridge men shared Gov. Weld's birth date. But even if Sweeney did indeed confirm that no other registered voter shared Weld's gender, zip, and birth date, she could not have been sure about the 50 or so Cambridge residents who were Weld's age and were not registered to vote. Thus, at best, Weld's chance of having a unique birth date, zip code, and gender combination is 87%. Put differently, the chance that Latanya Sweeney's matching attack would have been wrong using these three variables alone was 13%- much worse than traditional 5% statistical confidence.
Despite these erroneous assumptions based on incomplete data, the Dept. of Health and Human Services stated the study had shown that "97 percent of the individuals in Cambridge whose data appeared in a database which contained only their nine digit ZIP code and birth date could be identified with certainty." This completely ignores the fact that over a third of the population wouldn't even show up on the list.
But bad data and faulty research have never stopped governmental "progress." The threat of reidentification is low and any attacks remain purely speculative. But while bad regulations have a tendency to be able to weather even the toughest criticism without making the slightest concessions, HIPAA has one thing most bad regulations don't, as Info/Law points out: "a body count."