Math Privacy Encryption

Improperly Anonymized Logs Reveal Details of NYC Cab Trips 192

Posted by Unknown Lamer
from the check-your-proof dept.
mpicpp (3454017) writes with news that a dump of fare logs from NYC cabs leaked trip details because the identifiers were obscured with an MD5 hash of input data that has a very small key space and a regular format. From the article: City officials released the data in response to a public records request and specifically obscured the drivers' hack license numbers and medallion numbers. ... Presumably, officials used the hashes to preserve the privacy of individual drivers since the records provide a detailed view of their locations and work performance over an extended period of time.

It turns out there's a significant flaw in the approach. Because both the medallion and hack numbers are structured in predictable patterns, it was trivial to run all possible iterations through the same MD5 algorithm and then compare the output to the data contained in the 20GB file. Software developer Vijay Pandurangan did just that, and in less than two hours he had completely de-anonymized all 173 million entries.
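
The attack described in the summary can be illustrated with a minimal sketch. The identifier format used here (one digit, one letter, two digits) is an assumption chosen purely for illustration, since the summary does not spell out the real medallion or hack license patterns. The helper names `md5_hex`, `candidate_ids` and `build_reverse_table` are hypothetical. The point is only that when the input space is small and regular, hashing every possible input yields a complete reverse lookup table.

```python
import hashlib
import string
from itertools import product


def md5_hex(text):
    """Return the hex MD5 digest of an ASCII string."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


def candidate_ids():
    """Yield every identifier of the ASSUMED form: digit, letter, digit, digit."""
    for d1, letter, d2, d3 in product(
        string.digits, string.ascii_uppercase, string.digits, string.digits
    ):
        yield f"{d1}{letter}{d2}{d3}"


def build_reverse_table():
    """Hash every candidate once and map digest -> original identifier."""
    return {md5_hex(candidate): candidate for candidate in candidate_ids()}


table = build_reverse_table()

# Pretend this digest came from the released file (hypothetical identifier).
leaked_digest = md5_hex("7B42")

print(len(table))            # 26000 candidates in the assumed format
print(table[leaked_digest])  # 7B42
```

With only 10 × 26 × 10 × 10 = 26,000 candidates in this assumed format, the whole table is built almost instantly. Larger but still structured key spaces only change how long the enumeration takes, not whether it works.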

  • by msauve (701917) on Monday June 23, 2014 @10:28PM (#47302889)
    Sure. I'm assuming there's a requirement to have a unique transformation of medallion numbers (otherwise, you wouldn't have to include even a hashed version)...

    Instead of applying some hash to the medallion number, just do something like:
    Change all appearances of the first number in the list to "1". Change all appearances of the next unique medallion number in the list to "2." Etc.

    The result is, in essence, a one-time pad (OTP). Unless records of the process are kept, it's irreversible, barring external info (for example, someone knows that medallion number x picked up a fare at location y at time z and can find the matching trip in the released data).
  • by sexybomber (740588) <boccilinoNO@SPAMgmail.com> on Monday June 23, 2014 @11:34PM (#47303197)

    Your State may be different, but New York's Freedom of Information Law (or FOIL, we like to be different) works like this:

    The agency has to respond within five business days, but that response can read something like:

    Dear Sexybomber:

    We have received your request for public records pursuant to FOIL. Due to the complexity of the records you have requested, it may not be possible to produce them within the standard 20-day statutory period. We anticipate that we will be able to produce the records you have requested within 40 days. If you have questions or concerns, please direct them in writing to the address above.

    If they run into a snag, they have to inform you of this and produce the records within a "reasonable period".

    So it's not like NYC was under a five-day time crunch here. They could easily have responded and said it would take 40 or 60 days, being as there were several million records requested. That's definitely long enough to bring in a consultant (or even one of the more technically-literate staff members) to properly secure the data.

  • by Anonymous Coward on Tuesday June 24, 2014 @01:18AM (#47303621)

    > Target's breach cost them 50% of their revenue for a year.

    No, it did not. Not even close. [cbsnews.com] At worst, their profits for the subsequent quarter were down 50%; in terms of revenue, that's less than a 6% drop compared to a year earlier.

  • by Anonymous Coward on Tuesday June 24, 2014 @04:22AM (#47304255)

    A naive use of salt would mean that you might as well omit the data. The aim of including the values in hashed form is to be able to say: this is the same driver as that one. So identical numbers have to hash to identical values, which means you can't hash individual lines with different salts, or you lose that information. To keep that information, you have to hash the same number with the same salt each time. That basically gives you a random number with which to replace each number. So that works, but it removes the reason for using a hash, which is to have a local operation that creates a global, irreversible one-to-one mapping. If you have to create one salt per unique number, you might as well use the salt itself as the irreversible identifier.
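
Two of the ideas raised in the comments above can be sketched briefly: renumbering identifiers in order of first appearance, and the observation that a fresh salt per row destroys the ability to link rows. This is only an illustration under stated assumptions. The function names `renumber` and `hash_with_fresh_salt`, the sample identifiers, and the choice of SHA-256 for the salted variant are all hypothetical and not taken from the released data or from the commenters.

```python
import hashlib
import secrets


def hash_with_fresh_salt(identifier):
    """Hash with a new random salt on every call (the 'naive' salting)."""
    salt = secrets.token_bytes(16)  # different for every row
    return hashlib.sha256(salt + identifier.encode("utf-8")).hexdigest()


def renumber(identifiers):
    """Replace each distinct identifier with 1, 2, 3, ... by first appearance.

    The mapping lives only inside this function; once it returns, nothing
    in the output encodes the original values.
    """
    mapping = {}
    pseudonyms = []
    for ident in identifiers:
        if ident not in mapping:
            mapping[ident] = len(mapping) + 1
        pseudonyms.append(mapping[ident])
    return pseudonyms


# Hypothetical identifiers, one per trip record.
rows = ["7B42", "3C10", "7B42", "9Z99", "3C10"]

# Same driver -> same pseudonym, so trips can still be grouped.
print(renumber(rows))  # [1, 2, 1, 3, 2]

# Per-row salts: the same identifier no longer maps to the same value.
print(hash_with_fresh_salt("7B42") == hash_with_fresh_salt("7B42"))  # False
```

One design caveat: sequential numbers reveal the order in which identifiers first appear in the file, so assigning random tokens or shuffling the rows first may be preferable. As the first commenter notes, neither variant protects against someone who can match a known trip to a record using outside information.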

