Text Compressor 1% Away From AI Threshold

Baldrson writes "Alexander Ratushnyak compressed the first 100,000,000 bytes of Wikipedia to a record-small 16,481,655 bytes (including decompression program), thereby not only winning the second payout of The Hutter Prize for Compression of Human Knowledge, but also bringing text compression within 1% of the threshold for artificial intelligence. Achieving 1.319 bits per character, this makes the next winner of the Hutter Prize likely to reach the threshold of human performance (between 0.6 and 1.3 bits per character) estimated by the founder of information theory, Claude Shannon, and confirmed by Cover and King in 1978 using text prediction gambling. When the Hutter Prize started less than a year ago, the best performance was 1.466 bits per character. Alexander Ratushnyak's open-sourced GPL program is called paq8hp12 [rar file]."
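The 1.319 bits-per-character figure follows directly from the numbers in the summary; as a quick arithmetic check (the byte and character counts are taken from the summary above):

```python
# Verify the bits-per-character figure from the summary:
# 16,481,655 compressed bytes (including the decompressor) for the
# first 100,000,000 bytes of Wikipedia.
compressed_bytes = 16_481_655
original_chars = 100_000_000

bits_per_char = compressed_bytes * 8 / original_chars
print(f"{bits_per_char:.3f} bits per character")  # prints 1.319
```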
  • by mrbluze ( 1034940 ) on Tuesday July 10, 2007 @02:35AM (#19810173) Journal

    Alexander Ratushnyak compressed the first 100,000,000 bytes of Wikipedia to a record-small 16,481,655 bytes (including decompression program), thereby not only winning the second payout of The Hutter Prize for Compression of Human Knowledge, but also bringing text compression within 1% of the threshold for artificial intelligence.

    Could someone out there please explain how being able to compress text is equivalent to artificial intelligence?

    Is this to suggest that the algorithm is able to learn, adapt and change enough to show evidence of intelligence?

  • by tgv ( 254536 ) on Tuesday July 10, 2007 @02:50AM (#19810237) Journal
    It is not equivalent, so I'm not surprised you didn't get it. As far as I know, the reasoning goes as follows: Shannon estimated that each character contains 1.something bits of information. Shannon was an intelligent human being. So if a program reaches this limit, it is as smart as Shannon.

    And yes, that's absolute bollocks. Shannon's number was just an estimate and only applied to serial transmission of characters, because that's what he was interested in. Since then, a lot of work has been done in statistical natural language processing, and I would be surprised if the number couldn't be lowered.

    Anyway, since the program doesn't learn or think to reach this limit, nor does it give an explanation of how this level of compression is intrinsically linked to the language/knowledge it compresses, it cannot be called AI; e.g., it doesn't know how to skip irrelevant bits of information in the text. That would be intelligence...
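For context on where per-character numbers like Shannon's come from, here is a minimal sketch of estimating entropy from single-character frequencies alone; the toy sentence is invented for illustration, and Shannon's 0.6-1.3 bits/char range relies on far richer context than this zeroth-order model:

```python
# Estimate per-character entropy from letter frequencies alone
# (a "zeroth-order" model; real English, with context taken into
# account, is far more predictable than this suggests).
from collections import Counter
from math import log2

text = "the sky is blue today and the weather is beautiful"
counts = Counter(text)
total = len(text)

# H = -sum over characters c of p(c) * log2(p(c))
entropy = -sum((n / total) * log2(n / total) for n in counts.values())
print(f"zeroth-order estimate: {entropy:.2f} bits per character")
```

Note that a real compressor must also account for the size of its model, which is why the prize counts the decompression program against the total.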
  • by fireboy1919 ( 257783 ) <rustypNO@SPAMfreeshell.org> on Tuesday July 10, 2007 @02:59AM (#19810283) Homepage Journal
    The first poster on this topic had a good observation - it seems like an AI problem - but not an explanation of why.

    Compression is about recognizing patterns. Once you have a pattern, you can substitute that pattern with a smaller pattern and a lookup table. Pattern recognition is a primary branch of AI, and is something that actual intelligences are currently much better at.

    We can generally show this is true by applying the "grad student algorithm" to compression - i.e., lock a grad student in a room for a week and tell him he can't come out until he gets optimum compression on some data (with breaks for pizza and bathroom), and present the resulting compressed data at the end.
    So far this beats out compression produced by a compression program because people are exceedingly clever at finding patterns.

    Of course, while this is somewhat interesting in text, it's a lot more interesting in images, and more interesting still in video. You can do a lot better with those by actually having some concept of objects - with a model of the world, essentially, than you can without. With text you can cheat - exploiting patterns that come up because of the nature of the language rather than because of the semantics of the situation. In other words, your text compressor can be quite "stupid" in the way it finds patterns and still get a result rivaling a human.
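The "smaller pattern plus a lookup table" idea above can be sketched in a few lines. This is a toy substitution scheme, not the method paq8hp12 actually uses, and the function names are invented for the example:

```python
# Toy pattern-substitution compressor: each repeated phrase is
# replaced by a one-byte token, and the token -> phrase mapping is
# kept in a lookup table for decompression.
def compress(text, patterns):
    table = {}
    for i, pat in enumerate(patterns):
        token = chr(0x80 + i)   # stand-in byte that doesn't occur in the text
        table[token] = pat
        text = text.replace(pat, token)
    return text, table

def decompress(text, table):
    for token, pat in table.items():
        text = text.replace(token, pat)
    return text

original = "the cat sat on the mat because the cat liked the mat"
packed, table = compress(original, ["the cat", "the mat"])
assert len(packed) < len(original)
assert decompress(packed, table) == original
```

Finding *which* patterns to put in the table is the hard part - exactly the pattern-recognition task the comment describes, and the part where the hypothetical grad student beats the program.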
  • by mbkennel ( 97636 ) on Tuesday July 10, 2007 @03:16AM (#19810349)


    They argue that predicting which characters are most likely to occur next in a text sequence requires vast real-world knowledge.

    The apparent empirical result is that predicting which characters are most likely to occur next in a text sequence requires either

    1) vast real-world knowledge

    OR

    2) vast real-world derived statistical databases and estimation machinery

    but there can be a difference in their utility. The point of course, is that humans can do enormously more powerful things with that vast real-world knowledge in addition to symbolic estimation.

    The underlying question is whether physical natural intelligence is really just real-world derived statistical databases and estimation machinery. Modern neuroscience says,
    "depends on what the meaning of 'is' is, but it's at least halfway there."

    However, would completing mathematical theorems by searching through Google (statistical pattern matching, which might sort of work for all theorems already on Google) actually work?

    Clearly natural intelligence includes many tasks which can now be solved well with sophisticated data-oriented statistical approaches, perhaps with equal or better performance. Modern algorithms like 'independent components analysis' can now separate individual sources in audio, the "cocktail party effect", a problem some once thought was a clear sign of true 'intelligence'. It turns out that sufficiently clever signal processing and nonlinear objective functions can do it, so maybe that's what neurons do too.

    The still unsolved question is whether there are some tasks which are clearly 'intelligence' where this class of methods will profoundly fail. Maybe like creating really new mathematics?
  • AI (Score:2, Insightful)

    by evilviper ( 135110 ) on Tuesday July 10, 2007 @03:41AM (#19810433) Journal

    bringing text compression within 1% of the threshold for artificial intelligence.

    I see no reason to believe AI and text compression are interchangeable.

    I can think of a few methods that would allow a computer to guess a missing word better than humans (exceeding the AI limit), and such methods would be useless for determining a response to a question, particularly in the real world, where things like punctuation, abbreviation, and capitalization would be highly suspect to begin with.

    So I have to say the basis for this competition is flawed, and what's more, the results coming out of it are specific enough to just succeed in this competition, but be completely and utterly useless for any other (real) tasks.
  • by bytesex ( 112972 ) on Tuesday July 10, 2007 @03:45AM (#19810451) Homepage
    The problem with this approach is that there are many ways to say the same thing, and that this compression/decompression algorithm is tested using strict text-comparison only. A real AI might compress 'The sky is blue today' and decompress to 'Today it's beautiful weather' and not be wrong.
  • Re:That's cool.. (Score:5, Insightful)

    by arun_s ( 877518 ) on Tuesday July 10, 2007 @04:31AM (#19810627) Homepage Journal
    Maybe someone could sell the whole thing in a book-sized rectangular box with a tiny keyboard and 'DON'T PANIC' inscribed in large, comforting letters in the front.
    Now that'd be cool.
  • Re:ai threshold? (Score:4, Insightful)

    by Baldrson ( 78598 ) * on Tuesday July 10, 2007 @04:57AM (#19810741) Homepage Journal
    Connectionist models are models. Any model needs to be interpreted to be understood.
  • by pitu ( 983343 ) on Tuesday July 10, 2007 @05:18AM (#19810831)

    The problem with this approach is that there are many ways [to] say the same thing

    That is, an idea or a concept. Interpreting an idea or a concept in different ways is meaningful
    only within its context.

    ex1: the sky is blue => it's beautiful weather (context: you're taking a walk)
    ex2: the sky is blue => use #0000FF for the sky area (context: graphics work)

    this compression/decompression algorithm is tested using strict text-comparison only. A real AI might compress 'The sky is blue today' and decompress to 'Today it's beautiful weather' and not be wrong.

    If you say "the weather is beautiful" to an artist, he may draw you a yellowish-reddish sunset,
    which is not the correct interpretation of "the sky is blue" you had in mind. So the context is vital.

    I imagine a real AI would evaluate the context and predict which words are most likely to be put forward. If it
    succeeded in translating one concept to another in a meaningful context ("the sky is blue => it's beautiful weather, let's bring the NASA shuttle down")
    it would no longer be an AI but an I :)
  • by HNS-I ( 1119771 ) on Tuesday July 10, 2007 @06:21AM (#19811063)

    The question is: does a mobile handheld device have enough processing power to decompress it in a reasonable time?

    Seriously, this was not invented for mobile handheld devices. At this moment, without compression, you could probably store enough text on a mobile phone to keep you constantly reading for a month.

  • by hummassa ( 157160 ) on Tuesday July 10, 2007 @06:21AM (#19811071) Homepage Journal
    Anything that is science is math.
    Ok, computer programming is not necessarily a lot of maths.
    But this article is about something that is really computer science... as opposed to making a CRUD screen in VB.net, which is akin to programming a VCR.
    Parsing, compiling, linear programming, sorting, searching, indexing, compressing, walking graphs, drawing graphics, designing circuits, optimizing circuits, these are activities that are computer science and that are all maths.

    Edsger Dijkstra once said: "Computers are to computer science what telescopes are to astronomy".
  • by Yvan256 ( 722131 ) on Tuesday July 10, 2007 @08:57AM (#19811951) Homepage Journal

    At this moment without compression you could probably store enough text on a mobile phone to keep you constantly reading for a month.
    That may be, however we're talking about Wikipedia here. It's not about storing so much text that you can't go through it within a month, it's about storing everything so that you can access it as a reference.

    When you look up a word in the dictionary, it takes from 10 to 30 seconds to read the definition. But you still need the whole book/brick to do it.

  • Re:Dangerous (Score:4, Insightful)

    by UbuntuDupe ( 970646 ) * on Tuesday July 10, 2007 @09:38AM (#19812345) Journal
    Well, not always...

    American --> British

    transportation --> transport
    football player --> footballer
    subway --> tube
    burglarize --> burgle
  • Re:Science != Math (Score:3, Insightful)

    by rcw-home ( 122017 ) on Tuesday July 10, 2007 @11:28AM (#19813763)

    If my hypothesis is that the next time I close my eyes I'll smell tulips, there's no math involved in evaluating this.

    "When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind: it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science."
    --Lord Kelvin

  • by hummassa ( 157160 ) on Tuesday July 10, 2007 @12:38PM (#19814751) Homepage Journal
    in mathematical form:

    science = math + measurements

    That's it. Science is:
    1. measure phenomena,
    2. figure out the formulas,
    3. predict new phenomena,
    4. measure new phenomena,
    5. if Ok, back to stage 3; if not, back to stage 2.

    (ok, ok, 6. (...), 7. Profit!!!, just to appease the masses)

    notice stages 1 and 4 are measurements, stages 2 and 3 are maths.
  • Re:Science != Math (Score:3, Insightful)

    by HiThere ( 15173 ) <charleshixsn@@@earthlink...net> on Tuesday July 10, 2007 @02:40PM (#19816413)
    You need to read Michael Faraday, or some of his predecessors.

    Math is a relatively late addition to science. Yes, it's proved very useful. But science happened long before they introduced math.

    Well, thinking again, this depends on what you mean by math. Leonardo used math to figure out perspective. Does this mean that art depends on math? If so, then science depends on math, and so does walking across the room. And I can see a valid argument to be made along those lines, but that's not what people normally mean. If we look at what people normally mean, then science didn't depend on math until around the time of Kepler. Perhaps you want to call everything earlier engineering rather than science, but engineering depends on math just as heavily as science.

    What actually happened was that after algebra was invented, and arabic numerals, it became a lot easier to describe things in math, so people gradually switched away from describing things in ordinary language and to describing them in math. This has had both advantages and disadvantages. Certainly precision has improved. But comprehension by "ordinary folk" has declined, and not entirely because of the arcane subject matter, but also because they needed to learn a new language in order to understand what was being talked about.

    OTOH, can you imagine talking about computer programming without using "jargon"?
