GZipping Life Forms: Deflate Reveals Bare-Bones
An anonymous reader writes "To distinguish images derived from living vs. non-living sources, USC and NASA JPL researchers report today using the standard gzip compression utility. As a measure of overall pattern complexity, they find that the inherent pixel content of biologically generated fossils produces higher image compression ratios [more data redundancy] than that of their non-biological counterparts. The more the file shrinks, the more likely it is that a living process was involved. A test is live online here. This extends the simple but powerful uses of gzip - spam filtering, DNA sequence comparison, digital camera image crunching, etc. - to biogenic fossil detection. In nine months, the two Mars rovers will send back the first microscopic-scale images of Mars rocks, which should be amenable to some of these same techniques: thus gzipping is apparently pretty zippy."
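The core trick is easy to reproduce with Python's standard gzip module. A minimal sketch with synthetic pixel data standing in for real fossil images - the "banded" and "noise" arrays below are illustrative assumptions, not the researchers' data:

```python
import gzip
import random

def compression_ratio(data: bytes) -> float:
    """Original size over gzipped size; higher means more internal redundancy."""
    return len(data) / len(gzip.compress(data))

random.seed(0)

# "Biogenic" stand-in: a banded, repetitive 256x256 texture (think stromatolite layering).
banded = bytes((x + 3 * y) % 7 * 36 for y in range(256) for x in range(256))

# "Abiotic" stand-in: uncorrelated per-pixel noise, like a rough mineral surface.
noise = bytes(random.randrange(256) for _ in range(256 * 256))

print(compression_ratio(banded), compression_ratio(noise))
```

The patterned data shrinks dramatically while the noise barely compresses at all, which is the entire signal the technique relies on.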
I'd assume (Score:3, Interesting)
(Mods: the last line was a joke, intended to point out a particularly simple example of a problem - not a troll)
horsefeathers. (Score:1, Interesting)
uhhh.. huh? (Score:2, Interesting)
bzip2? (Score:3, Interesting)
After all, they have quite different compression characteristics: on one hand, a megabyte of zeroes compresses much better under bzip2; on the other, concatenating a file with itself and then compressing adds much less to the compressed size with gzip than with bzip2 - tested with
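Both differences the parent describes can be checked with Python's gzip and bz2 modules (exact byte counts vary by library version; the orderings are what matter):

```python
import bz2
import gzip
import random

# A megabyte of zeroes: bzip2's large-block sorting wins by a wide margin.
zeros = b"\x00" * (1 << 20)
print(len(gzip.compress(zeros)), len(bz2.compress(zeros)))

# Duplicate a random chunk that fits in gzip's 32 KB window: gzip encodes the
# second copy as cheap back-references; bzip2 gains far less from the repeat.
random.seed(1)
chunk = bytes(random.randrange(256) for _ in range(16 * 1024))
gz_extra = len(gzip.compress(chunk + chunk)) - len(gzip.compress(chunk))
bz_extra = len(bz2.compress(chunk + chunk)) - len(bz2.compress(chunk))
print(gz_extra, bz_extra)
```

So which tool "measures complexity" better depends entirely on which kind of redundancy you care about.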
The fractal geometry of nature? (Score:5, Interesting)
Then again, what do I know? Maybe someone more immersed in this field can tell us whether there's a seed of truth to my ramblings.
Greetings
--> R
Some companies are using model-based mathematical t (Score:1, Interesting)
http://www.image-metrics.com/pages/technology.a
Information vs. Meaning (Score:2, Interesting)
Kolmogorov Complexity (Score:5, Interesting)
Roughly, Kolmogorov complexity is a measure of randomness: it asks how long a computer program must be to reproduce the data (pardon the oversimplification).
-Mark
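Kolmogorov complexity itself is uncomputable, but compressed length gives a computable upper bound on it, which is exactly the stand-in the gzip trick relies on. A toy sketch:

```python
import gzip
import random

def complexity_estimate(data: bytes) -> int:
    """Upper bound on Kolmogorov complexity: length of a gzip 'program' for the data."""
    return len(gzip.compress(data))

simple = b"ab" * 5000  # generated by a tiny program, so low complexity

random.seed(0)
messy = bytes(random.randrange(256) for _ in range(10000))  # near-incompressible

print(complexity_estimate(simple), complexity_estimate(messy))
```

Note this is only an upper bound - gzip failing to compress something never proves it is truly random, only that gzip found no structure.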
Re:The Mars fossil IS made by life; my wife is not (Score:4, Interesting)
Biological clocks in unicorns... (Score:5, Interesting)
This is the loopiest thing I've heard of since Rosenblatt reported that his Perceptrons could distinguish between music composed by Bach and music composed in imitation of Bach.
Good heavens, any picture that's slightly out of focus will now be declared to be evidence of "biological processes."
I'm guessing that the researchers are not as nutty as they sound and that they've done more than is being reported, but still...
Reminds me of the researchers in the sixties who published analyses of data that supposedly showed "biological clocks." It turned out they were using smoothing algorithms that were, in effect, filters with a 24-hour peak in the frequency domain - so their analysis was creating the very patterns they claimed to be detecting. A debunking article was published in Science in which another researcher ran the same analysis on data from a random number table (the "unicorn" data) and showed that, by those techniques, the unicorn had a biological clock too.
Slightly Dodgy (Score:5, Interesting)
The big problem is the use of JPEG source images. Unless you've turned the quality setting all the way up, the JPEG artifacting (which in effect repeats blocks of image data after transitions) will probably mask any hidden level of complexity in the images - the human brain is a far better pattern-recognition tool than most computer algorithms (especially algorithms not designed for the task!).
Throw high-resolution bitmap files at it, and I'd be more persuaded that there is a genuine effect. Until then, I suspect it's more of a happy coincidence that the files they've thrown at it give results they are excited about.
Jolyon
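The parent's worry is easy to simulate without real JPEGs. Flattening noise into repeating 8-byte blocks - a crude, assumed stand-in for coarse JPEG quantisation - makes gzip report high "redundancy" no matter what the original scene contained:

```python
import gzip
import random

random.seed(0)
raw = bytes(random.randrange(256) for _ in range(64 * 1024))

# Crude stand-in for heavy JPEG artifacting: flatten each 8-byte block to
# copies of its first byte, the way coarse quantisation flattens detail.
blocky = b"".join(raw[i:i + 1] * 8 for i in range(0, len(raw), 8))

def ratio(data: bytes) -> float:
    return len(data) / len(gzip.compress(data))

print(ratio(raw), ratio(blocky))
```

The artifacted version compresses far better, so lossy preprocessing alone could push an "abiotic" image toward a "biogenic" verdict.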
Re:why no bzip2 ? (Score:5, Interesting)
gzip might be preferable because it works more locally. It only keeps track of the last 32 KB of data and does substitutions based on patterns seen in that window.
bzip2 instead block-sorts (the Burrows-Wheeler transform) over blocks of 100-900 KB, far larger than gzip's window, so its view of the data is much less local. That's great if you're going for compression, but for this work it might be misleading.
That said, gzip doesn't know about image formats, so I wonder if these guys are getting some false positives on scanline wraps and other non-image data.
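DEFLATE's 32 KB sliding window makes that locality easy to demonstrate: a repeated motif is nearly free inside the window and full price beyond it. A sketch with synthetic data:

```python
import gzip
import random

random.seed(0)

def rand_bytes(n: int) -> bytes:
    return bytes(random.randrange(256) for _ in range(n))

motif = rand_bytes(4 * 1024)
near_gap = rand_bytes(16 * 1024)  # second copy lands inside the 32 KB window
far_gap = rand_bytes(64 * 1024)   # second copy lands outside it

# Extra bytes needed to encode the repeated motif in each layout.
near_extra = len(gzip.compress(motif + near_gap + motif)) - len(gzip.compress(motif + near_gap))
far_extra = len(gzip.compress(motif + far_gap + motif)) - len(gzip.compress(motif + far_gap))

print(near_extra, far_extra)
```

Beyond the window gzip must spell the motif out again as literals, so long-range structure in an image is invisible to it.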
Compression to measure semantic content (Score:3, Interesting)
Re:Thought this would be somewhat obvious... (Score:2, Interesting)
Your DNA is only sufficient to create another state machine with the same rules you had at birth.
It will not re-create your complexity, because our DNA state machines are designed to create brains which are 'genetically memoryless', capable of self-modification, and possessed of incredible data collection and storage capacity.
Think of your DNA as the graphics engine for Quake. It is relatively small (space-wise) compared to the textures and levels. Add different data and you still have a first-person game, but a completely different one.
Re:Makes sense... (Score:5, Interesting)
Pattern Recognition (Score:3, Interesting)
Each algorithm could be fine-tuned for a particular type of pattern.
Is that an elephant or a giraffe?
Does it compress better with the elephant algorithm or the giraffe algorithm?
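This per-class idea can actually be sketched with zlib's preset-dictionary support: prime DEFLATE with a training corpus for each animal and see which "algorithm" shrinks the sample most. The corpora below are made-up toy data, not a real classifier:

```python
import zlib

def compressed_size(data: bytes, training: bytes) -> int:
    """Size of data when DEFLATE is primed with a class-specific dictionary."""
    comp = zlib.compressobj(zdict=training[-32768:])  # preset dictionaries cap at 32 KB
    return len(comp.compress(data) + comp.flush())

# Hypothetical per-class training text standing in for "elephant"/"giraffe" models.
elephant_corpus = b"trunk tusk herd savanna trunk tusk herd " * 200
giraffe_corpus = b"long neck acacia leaves spots long neck " * 200

sample = b"trunk tusk herd savanna trunk tusk"
guess = min(("elephant", elephant_corpus), ("giraffe", giraffe_corpus),
            key=lambda pair: compressed_size(sample, pair[1]))[0]
print(guess)
```

Whichever dictionary lets the sample compress smallest is the best "model" of it - the same minimum-description-length logic as the fossil test, just per class.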
Separate the chaff (Score:2, Interesting)
That having been said, it sounds good in theory that 'organisms are highly patterned and therefore compress better', but then why would you use gzip? Why not take that theory and build something a little more adept at locating particular types of patterns you're interested in, or ruling out the ones you know are going to create false positives?
So, THAT having been said, I'm forced to wonder if somebody forgot that March has 31 days. Lord knows I can never keep track.
hidden markov models (Score:3, Interesting)
Maybe if you could have an image recognition system do the Hard Machine Vision problem of generating a schematic of the picture, and then fed the "leg bone is connected to the hip bone" kind of data into an HMM, you could work out which fossils are ancient Cambrian crustaceans and which ones are Trogdor the Burninator.
viruses? (Score:2, Interesting)
Just a thought.
Re:Slightly Dodgy (Score:3, Interesting)
This was actually published in a (barely) peer-reviewed journal, Vision Research. I didn't say "image processing" above because a lot of these vision scientists seem to be psychologists doing visual psychophysics without having a strong background in math, or optics, or (it seems at times) the fundamentals of science.
The other thing to take into consideration is that gzip is "pseudolinear": it does not take into account the 2-dimensional correlations that exist in image data. Even fax compression takes advantage of them. (And yes, I realize that gzip can account for runs from previous regions regardless of length or location, but I am trying to point out that there is a specific 2-dimensional set of correlations extant in 2-D image data.)
In these cases being cited that use gzip, its major function seems to be as an indicator of the presence or absence of high-frequency components in the signal stream: lots of irregular high frequency -> low compressibility; very little irregular high frequency -> high compressibility.
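That reading is easy to check on synthetic 1-D signals - a slow sine wave (low-frequency content) against per-sample noise (irregular high-frequency content):

```python
import gzip
import math
import random

random.seed(0)
n = 32 * 1024

# Low-frequency content: a slow sine wave quantised to bytes (period 1024 samples).
smooth = bytes(int(127 + 120 * math.sin(2 * math.pi * i / 1024)) for i in range(n))

# Irregular high-frequency content: independent per-sample noise.
rough = bytes(random.randrange(256) for _ in range(n))

print(len(gzip.compress(smooth)), len(gzip.compress(rough)))
```

The smooth signal compresses far better, consistent with gzip acting mostly as a blunt high-frequency detector here.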
Re:Cool (Score:3, Interesting)
Re:I compress.. (Score:3, Interesting)
Not only are you, but you are uniquely you, Mr Methane, because each individual author has unique and identifying characteristics that can be measured using - guess what - compression algorithms.
Given enough samples, individual authors can be identified, and graphs of language relationships [economist.com] drawn, too.
I think it's interesting because it raises the bar on preserving anonymity if you publish widely.
Add some entropy to your life; write drunk.
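The authorship trick usually goes through the normalized compression distance (NCD), where smaller means more shared structure. A toy sketch with made-up "writing samples" (the styles are assumptions, chosen to be obviously different):

```python
import gzip

def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance: 0 for near-identical data, ~1 for unrelated."""
    ca, cb = len(gzip.compress(a)), len(gzip.compress(b))
    cab = len(gzip.compress(a + b))
    return (cab - min(ca, cb)) / max(ca, cb)

# Hypothetical writing samples: two in one repetitive style, one in another.
author_a1 = b"verily the moon doth rise and verily the tide doth follow " * 40
author_a2 = b"verily the stars doth gleam and verily the night doth fall " * 40
author_b = b"lol gzip goes brrr compression ratios are wild imho " * 40

print(ncd(author_a1, author_a2), ncd(author_a1, author_b))
```

Samples in the same style cluster at a smaller distance because compressing them together lets gzip reuse the shared phrasing.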