Forgot your password?
typodupeerror
Math Science

Text Compressor 1% Away From AI Threshold 442

Posted by kdawson
from the second-hutter-prize dept.
Baldrson writes "Alexander Ratushnyak compressed the first 100,000,000 bytes of Wikipedia to a record-small 16,481,655 bytes (including decompression program), thereby not only winning the second payout of The Hutter Prize for Compression of Human Knowledge, but also bringing text compression within 1% of the threshold for artificial intelligence. Achieving 1.319 bits per character, this makes the next winner of the Hutter Prize likely to reach the threshold of human performance (between 0.6 and 1.3 bits per character) estimated by the founder of information theory, Claude Shannon and confirmed by Cover and King in 1978 using text prediction gambling. When the Hutter Prize started, less than a year ago, the best performance was 1.466 bits per character. Alexander Ratushnyak's open-sourced GPL program is called paq8hp12 [rar file]."
This discussion has been archived. No new comments can be posted.

Text Compressor 1% Away From AI Threshold

Comments Filter:
  • That's cool.. (Score:5, Interesting)

    by Rorian (88503) <james.fysh@gmail ... inus threevowels> on Tuesday July 10, 2007 @02:19AM (#19810101) Homepage Journal
    .. but where can I get this tiny Wiki collection? Will they be using this for their next version of Wikipedia-on-CD? Maybe we can get all of Wiki onto a two-DVD set, at ~1.3bit/character (minus images of course) - that would be quite cool.
  • by seanadams.com (463190) * on Tuesday July 10, 2007 @02:37AM (#19810183) Homepage
    I've worked with some general purpose compression algorithms like zlib, lossy audio compression like mp3, and also lossless audio.

    Each is very different and interesting in its own right. MP3 especially, because the compression model is built on what the ears+brain can perceive.

    This algorithm I guess would be sort of like mp3 in that it contains some human-based element, maybe a language structure or something, but more like FLAC in that it might use predictors to say what word is likely to come next, with an error bitstream to point to progressively less likely words using bit sequences whose is inversely related to the probability of that word. But that's just a guess from an audio guy.

    Can somebody who's looked at this post a synopsis of how it works?
  • Re:That's cool.. (Score:5, Interesting)

    by Cato (8296) on Tuesday July 10, 2007 @02:38AM (#19810191)
    Or more usefully, compress Wikipedia onto a single SD card in my mobile phone (Palm Treo) - with SDHC format cards, it can do 8 GB today.

    Compression format would need to make it possible to randomly access pages, of course, and an efficient search index would be needed as well, so it's not quite that simple.
  • by qbwiz (87077) * <john@bauma[ ]mily.com ['nfa' in gap]> on Tuesday July 10, 2007 @02:53AM (#19810249) Homepage

    Could someone out there please explain how being able to compress text is equivalent to artificial intelligence?

    Is this to suggest that the algorithm is able to learn, adapt and change enough to show evidence of intelligence?


    The (unproven) idea is that if you want to do the best at guessing what comes next (similar to compression), you have to have a great understanding of how the language and human minds work, including spelling, grammar, associated topics (for example, if you're talking about the weather, "sunny" and "rainy" are more likely to come than "airplane"), and so on.

    If you feed in the previous words in a conversation, the perfect compressor/predictor would know what words will come next. Such a machine could easily pass the Turing test by printing out the logical reply to what had just been stated. The idea is that the closer to the perfect compressor you have, the closer to artificial intelligence you are.
  • by phatvw (996438) on Tuesday July 10, 2007 @02:56AM (#19810265)
    I wonder if lossy text compression where prepositions are entirely thrown out would be effective? Based on context, your brain actually ignores a lot of words you read and fills in the blanks so-to-speak. Perhaps you can use simple grammar rules to predict which prepositions go where based on that same context?
  • by seanadams.com (463190) * on Tuesday July 10, 2007 @03:03AM (#19810301) Homepage
    Actually, the size of the program (decompressor) binary is 99,696 bytes, and it is the binary size that is included in the prize calculation.

    Wha wha wha? So why couldn't I just include a 100MB data file with my decompressor and claim an infinite compression ratio with just the following shell script: "cat datafile"
    Maybe I'm misunderstanding the contents of that rar file. Are both of those data files needed? The .exe by itself is 124KB. Where did you get 99,696?
  • by vertigoCiel (1070374) on Tuesday July 10, 2007 @03:54AM (#19810497)
    Great idea, but unfortunately the Hutter Prize uses a static sample of the first 10^8 bits of Wikipedia.
  • by Anonymous Coward on Tuesday July 10, 2007 @04:11AM (#19810543)
    I suppose I have to post this anonymously, or the Hutter Prize thralls will just mod it down; I like my karma.

    I am at a loss as to how this meaningless charade keeps getting posted on Slashdot. Anyone with half a brain who reads TFA (or any of the previous FA Slashdot has posted on this stupid prize) can see this for what it really is: a handful of crazy people who think that this has meaning beyond above-average technogarbage.

    There are all of four people seriously involved in this Hutter Prize: Hutter himself, Bowery (who's made all the /. submissions), Mahoney (who wrote a thesis on this crap), and Ratushnyak, who seems to enjoy wasting his time on this AI-obsessed prize.

    PAQ8HP12 may be able to compress the corpus extremely efficiently, but it has obvious and real drawbacks for any real-world application: it's tuned for this specific corpus ("H[utter]P[rize]" is even in the name of the compressor), it's slow as fuck, and it consumes 2GB of memory. Yes, 2GB of memory for 100MB of input data. This is not a real-world algorithm; this is CS weenies wanking off.

    And what's with the obsession with Wikipedia? It is not the be-all, end-all of human knowledge, and, despite its devotees' claims, never will be; just look at the internal politics, and you'll see that it simply can't scale to that size. Is it a useful resource? Of course. Is it something worthy of adoration and fawning over? No.

    And then, of course, there's the obsession with AI. These people seem to be of the opinion that a text compressor will actually lead to artificial intelligence -- with no other tuning! An absurd claim if I've ever heard one; the predictive capabilities of a good text compressor are something that would no doubt be useful to an AI, but there's one hell of a lot more to general intelligence than just pattern matching and statistical algorithms for compression.

    If one really wanted to sponsor an AI prize, it would probably be much better to focus on creating, say, an effective chatbot -- something that really can predict a desirable response and pass Turing's test.

    Not this compression bullshit.
  • Re:That's cool.. (Score:3, Interesting)

    by Solder Fumes (797270) on Tuesday July 10, 2007 @04:37AM (#19810655)
    This isn't true of all compression techniques, but it's true for many of them, especially advanced techniques, i.e. to compress a short video into MPEG4 can take hours, but most computers don't have a lot of trouble decompressing them in real time.

    Probably not the best example. MPEG4 encoding takes so much time because it's not classical compression, the encoder has to figure out which pieces are less psychorelevant to big picture, and throw them away. That takes a lot more horsepower than picking up the already-sorted pieces and tossing them onto a display.
  • by Anonymous Coward on Tuesday July 10, 2007 @04:51AM (#19810709)
    I've been following the Hutter Prize with interest, having been into compression ever since reverse engineering Powerpacker on my Amiga 500 back in the good old days to understand how it worked (ah, happy memories).

    Now what just about all the compressors do, whether they are based on Neural Nets, Markov Models, Predictive Partial Matching or whatever, is to use patterns in the already seen text to predict the most likely following bit (0/1).

    Now depending on the text itself, prediction based on previously seen text isn't enough ... especially this enwik8 file which is more of a flat file dataset with a lot of unrelated terminology.

    Try to predict the next word, byte or bit, when your previous text has been "Frog, Toilet, Woodwork" ... how the hell can we possibly predict that the next words will be "Slashdot, Cigarette, Coffee". (Three subjects very close to my heart ... also my lungs, arteries, liver etc).

    Therefore some of these compressors are supplemented by a dictionary containing "useful" English words arranged so that the ones used most frequently get assigned a lower "size" of encoded string in the text pre-processor before the actual compression kicks in.

    It seems that all the advances have been made on finding the optimum arrangement for this dictionary based on the text they have to process ... the 100MB enwik8 file. A different file will need a different dictionary.

    Note also, as the enwik8 file is not truly a passage of text, more a collection of data in XML wrapper, there is also a lot to be gained simply be understanding the structure of the file itself, and finding an alternative representation for the XML components ... example all the timestamps are in a very verbose character style like "2007-07-10 00:00:00" ... if we can recognize that, we could find an alternative encoding, changing 19 byte string into 32 bit long (maybe even less if we understand the epoch date he is using) ... again, "wetware" has to identify and decide this encoding right now.

    Now for me, REAL AI would come when the compressor can actively SCAN the file to be compressed himself, recognize the file structure (be it XML, plaintext or whatever), and optimize it into a more compressible format, decide the optimum arrangment for the dictionary, decide the optimum compression technique, context orders to be used etc etc ... AND do all this in less than 9 hours I believe it takes for the latest compressor.

    This high bits/character rate comes at a heavy price in speed and memory, especially when good old WinZIP can get a pretty good result in a couple of minutes.

    At the moment there is just too much "wetware" involvement to say this is truly AI, regardless of the bits/character rate they are achieving.
  • Re:Segfault (Score:3, Interesting)

    by Baldrson (78598) * on Tuesday July 10, 2007 @04:53AM (#19810717) Homepage Journal
    On a 1.73 GHz Pentium with 1.2G RAM running Cygwin after compressing with:

    PAQ8HP12 -7 enwik8.paq8hp12 enwik8

    and moving the enwik8 archive to the parent directory:

    JamesBowery@oldatlantis ~/hutter/paq8hp12
    $ time ./PAQ8HP12 -7 enwik8.paq8hp12
    100000000 enwik8: extracted
    16381959 -> 100000000 (1.3106 bpc) in 22398.08 sec (4.465 KB/sec), 941315 Kb


    real 373m19.379s
    user 0m0.031s
    sys 0m0.030s

    JamesBowery@oldatlantis ~/hutter/paq8hp12
    $ ls -al
    total 114216
    drwxrwxrwx+ 2 JamesBowery None 0 Jul 1 00:43 .
    drwxrwxrwx+ 6 JamesBowery None 0 Jun 30 14:59 ..
    -rwxrwxrwx 1 JamesBowery None 99696 May 14 09:16 PAQ8HP12.EXE
    -rwxrwxrwx 1 JamesBowery None 100000000 Jul 1 06:54 enwik8
    -rwxrwxrwx 1 JamesBowery None 16381959 Jun 30 21:09 enwik8.paq8hp12
    -rwxrwxrwx 1 JamesBowery None 465211 Jul 1 00:43 temp_HKCC_dict1.dic

    JamesBowery@oldatlantis ~/hutter/paq8hp12
    $ diff enwik8 ../enwik8/enwik8

    JamesBowery@oldatlantis ~/hutter/paq8hp12
    $

  • by Anonymous Coward on Tuesday July 10, 2007 @05:48AM (#19810955)
    The fact that a program to solve an essentially mathematical problem requires a knowledge of mathematics to create does not invalidate the theory that Computer Science is not really a branch of maths (or if it is, it only is because mathematicians made it that way).

    Get over yourselves everyone - not every computing problem is a mathematical one, some are, some aren't.
    Besides, some of the branches of Maths relevant to computing are really only "Maths" because that's what someone decided to lump it in with - a name does not define the nature of something.
  • by iapetus (24050) on Tuesday July 10, 2007 @06:04AM (#19811013) Homepage
    Which is a shame, because the weather wasn't good.
  • Very silly goal (Score:3, Interesting)

    by Ancient_Hacker (751168) on Tuesday July 10, 2007 @06:51AM (#19811159)
    This is about the silliest competition imagineable. Think:

    Compression has reached a point of diminishing returns, getting less and less return for more and more work. And at best it's asymptotically approaching the theoretical limit. You could offer a billion dollar prize and get back maybe a few percent of improvement, while making any further improvement more difficult.

    Meanwhile data storage and data transmission technology keeps improving many percent a year, with each improvement compounding on the previous ones.

    In Other Words, IMHO money would be better spent on the second area rather than the first.

  • by arivanov (12034) on Tuesday July 10, 2007 @07:13AM (#19811251) Homepage
    From the notebooks of Lazarus Long: If it can't be expressed in figures, it is not science; it is opinion.
  • by Phat_Tony (661117) on Tuesday July 10, 2007 @11:33AM (#19813843)
    That's my opinion of this. By excluding lossy compression, they're also excluding the likelihood of applicability to AI that is the point of the contest.

    Humans achieve good compression on things like encyclopedia knowledge because we don't remember the words at all. We remember the idea, and we have our own dictionary in our heads, and we re-apply words to the idea to reconstruct the entry, rather than memorizing the data. That's why we get great compression; we throw out most of the data, and just remember the "gist" of it, the argument, the facts, in an internal structure of raw ideas stored independently of the words to explain them.

    By restricting the contest to lossless compression, they eliminate the ability to use any AI-like compression techniques. The machine can not extract the ideas and then re-assign words, because it would have to be able to do so using the exact voice of each of thousands of different Wikipedia contributors. That's hopeless.

    So the entrants are restricted to clever algorithms that do endless mathematical optimizations to compress the data, a method of compression that's entirely alien to the methods of our only known intelligence. We don't remember things by figuring out clever tricks to compress the data in our own memory. We don't say "Oscar Schindler saved Jews In WWII" and then say, OK, that data had 5 spaces in it, and 4 "S's," and if I remember the positions of the spaces and the S's, I could use less memory space to store this in my head, and then just think back through the algorithm I used to take the spaces and "s's" out and put them back in where they go, and I'll have the name again, and then sit there and carefully work out in our heads what the original data must have been after our compression methods. It doesn't work that way at all. To us it apparently "just comes to us." The compression probably comes from things like remembering sounds, and then reconstructing the name's exact spelling based upon known rules of grammer. We store the name Oscar Schindler in relation to various facts regarding Jews and WWII, but we store them as ideas, and then pull the words back out, and each time someone asks us about Schindler, we'd be likely to say something similar in meaning but different in expression. So this contest is restricted to the least interesting kind of compression for intelligence; the kind that can't use it.

    Interesting compressions are things like JPEG and MP3, where they built the compression model on the human perceptual model, first saying "what about this exact data is less relevant to a human observer, that we can therefore throw away?" For JPEG's, it turns out that (among other things) we're much more sensitive to differences in color than to absolute colors, and among differences in color, we're much more perceptive in the color ranges closer to human skin tone. MIDI is actually probably closer to the compression used by human intelligence than any recorded music standard.

    Along these lines, I'd say storing the HTML formatting data exactly borders on ridiculous. It's a hugely inefficient waste of space. For instance, if you just run the HTML through one of the free online utilities that strips irrelevant data, you get the identical presentation of the data, you've only thrown out entirely worthless data. But you've already violated the contest rules. You should be able to strip the HTML entirely, as long as your compression/decompression system ends up with conveniently readable formatting in the end. Reconstructing the actual HTML in a character-identical way is so non-intelligent when you're trying to save space, it seems hard to beleive it's going to lead to intelligence.

    Regarding this contest: I'm curious what level of compression you can get if you just histogram the words and then, in order of frequency for anything with enough occurrences to save memory by using a look-up table, you assign sequential numeric values for the words in order of frequency of occurrence. Then start your data with a look-up
  • by WilliamSChips (793741) <full,infinity&gmail,com> on Tuesday July 10, 2007 @11:54AM (#19814129) Journal
    Lazarus Long is consistently wrong. He claims that peace and freedom are mutually exclusive but if you took a graph of our freedom you'd find that the greatest drops are during wartime.
  • by BillyBlaze (746775) <tomfelker@gmail.com> on Wednesday July 11, 2007 @02:45AM (#19822451)
    While allowing lossy compression might end up with way better AI, there's a logistics problem: how do you give an objective score to the precision of 100M (that's 4.76 uLoC) of paraphrased text?

Q: How many IBM CPU's does it take to execute a job? A: Four; three to hold it down, and one to rip its head off.

Working...