Math Privacy Encryption

Improperly Anonymized Logs Reveal Details of NYC Cab Trips

Posted by Unknown Lamer
from the check-your-proof dept.
mpicpp (3454017) writes with news that a dump of fare logs from NYC cabs resulted in trip details being leaked thanks to using an MD5 hash on input data with a very small key space and regular format. From the article: City officials released the data in response to a public records request and specifically obscured the drivers' hack license numbers and medallion numbers. ... Presumably, officials used the hashes to preserve the privacy of individual drivers since the records provide a detailed view of their locations and work performance over an extended period of time.

It turns out there's a significant flaw in the approach. Because both the medallion and hack numbers are structured in predictable patterns, it was trivial to run all possible iterations through the same MD5 algorithm and then compare the output to the data contained in the 20GB file. Software developer Vijay Pandurangan did just that, and in less than two hours he had completely de-anonymized all 173 million entries.
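
The attack described above can be sketched in a few lines. This is a minimal illustration, not Pandurangan's actual code: the ID format used here (one digit, one letter, two digits) is an assumption chosen for the example, and `candidate_medallions` and `build_lookup` are hypothetical helper names.

```python
import hashlib
import itertools
import string


def md5_hex(text: str) -> str:
    """Return the hex MD5 digest of an ASCII string."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


def candidate_medallions():
    """Yield every ID in the ASSUMED format: digit, letter, digit, digit."""
    digits = string.digits
    letters = string.ascii_uppercase
    for d1, letter, d2, d3 in itertools.product(digits, letters, digits, digits):
        yield f"{d1}{letter}{d2}{d3}"


def build_lookup() -> dict:
    """Hash the whole keyspace once and map digest -> original ID."""
    return {md5_hex(m): m for m in candidate_medallions()}


lookup = build_lookup()          # 10 * 26 * 10 * 10 = 26,000 entries
hashed = md5_hex("5X55")         # what an "anonymized" record would contain
recovered = lookup.get(hashed)   # reverse it with a dictionary lookup
print(recovered)                 # prints 5X55
```

Once the table exists, every hashed value in the dump is reversed by a single lookup, which is why a small, regular keyspace defeats an unsalted hash regardless of which hash function is used.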

  • by FlyHelicopters (1540845) on Monday June 23, 2014 @08:05PM (#47301949)

    Too many governments and corporations still fail to understand that data security requires putting experts who actually know what they are doing in charge.

    This doesn't mean you contract it out to the lowest bidder or hire the cheapest CS degree you can find.

    It means you hire knowledge and experience, you hire expert skills, and those cost money.

  • This (Score:0, Insightful)

    by Anonymous Coward on Monday June 23, 2014 @08:09PM (#47301975)

    This is why we can't have nice things.....

  • by rsborg (111459) on Monday June 23, 2014 @08:12PM (#47301983) Homepage

    Large organizations will consistently fail to hire/staff competent people for data security related issues, and will push back on fines or punitive findings by criminalizing publicizing their incompetence.

    Thus sending all such talent straight to criminals who'll be happy to reward them with hard cash.

    It's like these guys _want_ a dystopian future.

  • by fuzzyfuzzyfungus (1223518) on Monday June 23, 2014 @08:21PM (#47302037) Journal
    In this case, it sounds like whoever got handed the job either couldn't think like an attacker, didn't care to, or was overruled when they tried.

    There are probably subtler methods of de-anonymizing the data that would require nontrivial skill to think of and counter. Still, it's a bit surprising to see somebody who knows enough about manipulating data to pull 20GB of records and hash a single field in each one without hurting himself or munging the result, yet doesn't think "Medallion numbers are written on cabs. Somebody could grab dozens of them while waiting by the curb at the airport and just MD5 them in milliseconds", much less "Medallion numbers are quite short, someone could traverse the whole damn keyspace in a few days at most".

    Either this person thinks that MD5 is magic, or his thought process marched in a nice straight line from request to solution, without ever thinking about attack: "We need all medallion numbers replaced with internally consistent but unrelated UIDs." "Umm, OK. Hey, a hash function is deterministic and non-reversible, it's perfect!"
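
For contrast, one way to get "internally consistent but unrelated UIDs" without deriving them from the input is a random lookup table. This is a sketch under stated assumptions, not what the city did: `Pseudonymizer` is a hypothetical class, and the table it builds would have to be kept secret or destroyed after the export.

```python
import secrets


class Pseudonymizer:
    """Map each real ID to a random token that is not derived from the ID."""

    def __init__(self):
        self._table = {}    # real ID -> token; keep secret or discard after export
        self._used = set()  # tokens already handed out

    def pseudonym(self, real_id: str) -> str:
        """Return the same random token every time the same ID is seen."""
        if real_id not in self._table:
            token = secrets.token_hex(8)
            while token in self._used:  # regenerate on the (unlikely) collision
                token = secrets.token_hex(8)
            self._used.add(token)
            self._table[real_id] = token
        return self._table[real_id]


p = Pseudonymizer()
first = p.pseudonym("5X55")
again = p.pseudonym("5X55")   # same cab, same token
other = p.pseudonym("7B42")   # different cab, different token
```

Because the tokens come from a random source rather than from the medallion number, enumerating the keyspace tells an attacker nothing. It does not address re-identification by correlating trips with other data.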
  • by WaffleMonster (969671) on Monday June 23, 2014 @08:54PM (#47302281)

    I've always assumed that wherever the term "anonymized data" is used, it is more likely than not companies and governments paying lip service to their customers, where the data could easily be reversed into identifiable form by exploiting insufficient entropy or cross-referencing datasets.

    There is, after all, no cost for violating privacy or for the unnecessary risk exposure associated with disclosure.

    One of my favorite examples of the dangers of insufficient entropy stems from a PCI DSS requirement written by "experts" who should know better.

    3.4 Render PAN unreadable anywhere it is stored (including on portable digital media, backup media, and in logs) by using any of the following approaches:

    One-way hashes based on strong cryptography, (hash must be of the entire PAN) ...

    The search space of a typical 16-digit card number is no match for a modern CPU once you have taken check digit, card type, issuer and issuer-specific numbering into account... "strong cryptography" can't fix stupid.
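
The arithmetic behind that claim can be sketched as follows. The assumptions are labeled in the code: a 16-digit number whose first six digits are a known issuer prefix and whose last digit is a Luhn check digit. The prefix `411111` is only a placeholder, and `candidate_pans` is a hypothetical helper.

```python
from itertools import islice


def luhn_check_digit(payload: str) -> int:
    """Compute the Luhn check digit for a number that lacks its final digit."""
    total = 0
    # Walk right to left; the digit next to the check digit is doubled first.
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10


def luhn_valid(pan: str) -> bool:
    """True if the last digit matches the Luhn check digit of the rest."""
    return luhn_check_digit(pan[:-1]) == int(pan[-1])


def candidate_pans(bin_prefix: str, length: int = 16):
    """Yield every Luhn-valid number of the given length under one prefix."""
    free = length - len(bin_prefix) - 1  # digits not fixed by prefix or check digit
    for n in range(10 ** free):
        payload = f"{bin_prefix}{n:0{free}d}"
        yield payload + str(luhn_check_digit(payload))


BIN = "411111"                        # ASSUMED issuer prefix, placeholder only
free_digits = 16 - len(BIN) - 1       # 9 digits actually vary
search_space = 10 ** free_digits      # 1,000,000,000 candidates per prefix
first_three = list(islice(candidate_pans(BIN), 3))
```

Sixteen digits look like 10^16 possibilities, but with the prefix and check digit fixed only 10^9 remain per issuer prefix, which is small enough to enumerate and hash exhaustively.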

  • by gweihir (88907) on Monday June 23, 2014 @09:32PM (#47302529)

    You are naive. The problem starts to crop up when you start correlating things. Then you can find all sorts of things, like patterns of visiting a mistress, people meeting in secret (which is perfectly legal, but the government fears it), etc.

  • by Opportunist (166417) on Monday June 23, 2014 @10:07PM (#47302747)

    Actually the movement of a cab is a wealth of information. Not by itself, but it's very good at connecting dots. If you want to follow someone around, these things tend to be invaluable. You can, essentially, follow someone around without following them around, even retroactively. People rarely go from place to place randomly. They have destinations. If someone takes a cab from the airport and doesn't live in the area where he landed, it is likely that his destination is the place that he will stay in. After a flight, especially a long one, people want to get rid of their heavy baggage, take a shower, put on new clothing. So you can easily find out where someone stayed. Which becomes twice as interesting if the destination is not a hotel, because now you got another person to screen.

    This information by itself is not much. But as part of a bigger network it is something we'd have killed for back when I was still doing profiling.

  • Cue the DMCA. (Score:2, Insightful)

    by Anonymous Coward on Monday June 23, 2014 @10:42PM (#47302949)

    In other news, the credentials for their plug-n-play coffee machine are 'admin' 'admin', and their gym locker combo is 1234. Someone made a half-assed attempt to obfuscate some data that nobody cares about (unless your husband's a cheating cabbie, I guess) and someone cracked it. News?

  • by Anonymous Coward on Monday June 23, 2014 @11:38PM (#47303217)

    Change all appearances of the first number in the list to "1".

    You have described something most definitely NOT a one time pad. In an OTP scheme, every *instance* of any particular value maps with equal probability to every potential output value. What you described is a basic substitution cipher--trivial to crack by frequency analysis. Every input value has a definite output value to which it maps with 100% probability. Once you find the first correlation between input/output, you can replace all the others. Not so for an OTP. Frequency analysis won't do squat if your OTP was generated in a truly random fashion and applied correctly.
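
The difference drawn above between a substitution table and a one-time pad can be sketched as follows. Both functions are hypothetical illustrations: `substitution_encode` follows the "first number becomes 1" idea, while `otp_encode` draws a fresh random pad for every instance.

```python
import secrets


def substitution_encode(values):
    """Replace each distinct value with a fixed number in order of first appearance."""
    table = {}
    out = []
    for v in values:
        if v not in table:
            table[v] = len(table) + 1
        out.append(table[v])
    return out


def otp_encode(values):
    """XOR each instance with its own fresh random pad; return (ciphertext, pad) pairs."""
    out = []
    for v in values:
        pad = secrets.token_bytes(len(v))  # never reused, as long as the message
        cipher = bytes(a ^ b for a, b in zip(v, pad))
        out.append((cipher, pad))
    return out


def otp_decode(cipher: bytes, pad: bytes) -> bytes:
    """Recover the plaintext by XORing the ciphertext with the same pad."""
    return bytes(a ^ b for a, b in zip(cipher, pad))


trips = ["5X55", "7B42", "5X55"]
substituted = substitution_encode(trips)             # repeats stay visible
padded = otp_encode([t.encode("ascii") for t in trips])
decoded = [otp_decode(c, p).decode("ascii") for c, p in padded]
```

In the substitution output the first and third entries are identical, which is exactly what frequency analysis exploits. The pad-based output does not share that property, but it also cannot serve as a consistent identifier across records.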

    And this, folks, is why you shouldn't trust advice from strangers about crypto or homebrew crypto schemes. Play with them, learn about the principles, but please, for the love of FSM, do not trust them.

  • by chriscappuccio (80696) on Monday June 23, 2014 @11:44PM (#47303249) Homepage

    Sorry but unless you define "GOOD ITSEC company audit the shit out of it" in tangible terms that can actually hold someone liable for failure in a real way, this is just baloney. And if you define it with teeth, the price will increase. Basically, to define it properly, you'd be able to do it yourself. Oops.

  • by chriscappuccio (80696) on Tuesday June 24, 2014 @12:43AM (#47303477) Homepage

    The government has the info already, they handed it out!

  • Re:Oops. (Score:5, Insightful)

    by philip.paradis (2580427) on Tuesday June 24, 2014 @04:35AM (#47304281)

    The United States dollar [wikipedia.org] is the currency preferred by drug dealers, whose trade is in fact made more profitable by the failed "War on Drugs" [wikipedia.org].

  • by philip.paradis (2580427) on Tuesday June 24, 2014 @04:55AM (#47304319)

    I'm appalled that your post has been modded "informative." Please do us all a favor and abstain from any future posts on cryptography. Instead, I recommend you spend your time with resources like Applied Cryptography [schneier.com]. Seriously, please put down the shovel, and if you're doing anything involving crypto for a living, please do the world a favor and resign today.

  • Re:Oops. (Score:4, Insightful)

    by philip.paradis (2580427) on Tuesday June 24, 2014 @07:37AM (#47304727)

    The War on Drugs is a massively successful enterprise if your definition of success is the ability to extract billions of USD worth of funding from taxpayers, with a disproportionate amount of said funding going to the overt militarization of police forces in the USA at the expense of civil liberties and human rights. However, if your indicators of success are tied to social, medical, or economic improvement for the citizens of the United States of America, the entire affair is indeed a massive failure.

    For reference, this is coming from someone who consumes nothing more than nicotine (vaping these days, gave up cigarettes after 20 years) and whiskey, and once wore an actual military uniform for a living.
