Science Technology

Statistics For Data Entry: The Brave New Step (121 comments)

A reader writes: "First there was Dasher, a novel application of statistical theory that lets free text be written using only a pointing device. Dasher works by predicting the continuations of the text being written, based on what has been written so far; there is a probability associated with each offered continuation, and the presentation is designed to make it easier to choose more probable continuations. A big advantage of statistics-based interfaces is that they automatically enforce correctness, because correct strings are more probable than incorrect ones. Now the same approach has been extended to writing maths. Apropos is a JavaScript application (it supports IE6 & Firefox) for creating mathematical expressions. It represents the math using MathML, the official XML spec for mathematics. It is definitely clunky when compared to Dasher, but better than MS Equation Editor etc. It is interesting to consider whether this approach can be extended to other XML vocabularies (for example, a model for HTML that suggests the markup as you go along - a properly trained one will make it harder to create pages with blinking text, loads of images etc.), or to formal languages other than XML (e.g. programming languages). Stochastic modeling can also be used as a basis for speech recognition, with the recognizer using the model to choose a continuation when the speech signal is ambiguous or indistinct."
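To make the prediction step in the summary concrete, here is a minimal Python sketch. It is not Dasher's code: the names `train` and `continuations`, the two-character context, and the toy corpus are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

def train(corpus, order=2):
    """Count which character follows each `order`-character context."""
    counts = defaultdict(Counter)
    for i in range(len(corpus) - order):
        counts[corpus[i:i + order]][corpus[i + order]] += 1
    return counts

def continuations(counts, written, order=2):
    """Return (char, probability) pairs for the next character, most probable first."""
    seen = counts.get(written[-order:], Counter())
    total = sum(seen.values())
    # An unseen context yields an empty list rather than a division by zero.
    return [(ch, n / total) for ch, n in seen.most_common()]

# Toy training text (hypothetical); a real model would use a large corpus.
corpus = "the theory then thinned the thicket"
model = train(corpus)

# The text written so far ends in "th", so the model ranks what followed "th" in training.
ranked = continuations(model, "a th")
for ch, p in ranked:
    print(f"{ch!r}: {p:.2f}")
```

An interface built on such a model would then give the more probable continuations more room on screen, which is the presentation idea the summary describes.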
This discussion has been archived. No new comments can be posted.

  • Like t9 (Score:4, Interesting)

    by xabi ( 620010 ) on Monday October 25, 2004 @07:57AM (#10619777) Homepage
    It seems to be the same concept as t9 [t9.com].
    • Re:Like t9 (Score:3, Insightful)

      by xabi ( 620010 )
      More info on the Dasher web site here [cam.ac.uk]
    • Re:Like t9 (Score:2, Offtopic)

      by KjetilK ( 186133 )
      Not really. I use T9 daily to write SMSes, and Dasher now and then for the coolness of it. Dasher is in Debian and I would guess in many other distros. Just try it out to feel the difference.

      Dasher is something I would really like to have on a PDA and even a cellphone. T9 is just a simple aid for writing a couple of hundred characters at most, but nothing that would help me write longer texts.

      PDA-makers, hear this: You need to put a lot more effort into text-entry interfaces. Have a serious look at Dash

      • Yes, I also use T9 every day. What I'm trying to get at is the way they work. Both methods are based on statistics. I mean, you begin typing something and they try to finish the word using statistics.
    • Re:Like t9 (Score:3, Interesting)

      by a_hofmann ( 253827 )
      While the concept is the same, the application goes way further than T9. This is where I see such ideas bound to fail.

      T9 is a great technology because the vocabulary used when writing SMS is pretty narrow. After entering the first few characters of a word, the contextual information in the dictionary is good enough (most of the time) to suggest the wanted word very fast. T9 is even able to derive this information without the user specifying the exact characters but rather just one of the 3-4 on any mobile
      • Japanese phones use prediction at the grammar level (at least). You begin typing a word and it then suggests a likely candidate. After this, you have a small menu of likely "next words" or grammar particles, depending on phrases, what you have written in mails/messages before and what kind of word you just wrote. I've written entire mails just by typing one character, and selecting the rest from menu after menu.
        Of course, that's still not a very long mail, but I don't see why it should be difficult to exp
  • Old technology (Score:4, Interesting)

    by Inigo Soto ( 776501 ) on Monday October 25, 2004 @07:58AM (#10619784) Homepage
    That is hardly news. Mobile phone interfaces have been offering this kind of interface for years. True, they are useful, but nothing new here.
    • What mobile phone offered this kind of input?? If you are just talking about predictive completion, you only have half the story. RTFA - there's a lot more to it.
    • Heck, I remember a word processor with predictive completion in shareware catalogs ca. 1985 or so. Don't remember its name, but, *sigh* another user-interface gem delayed for years by monopolist hegemonies. ~~~~
    • "That is hardly news. Mobile phone interfaces have been offering this kind of interface for years. True, they are useful, but nothing new here"

      Wrong. The mobile phone interface is nothing like Dasher. It's not as fluid or as usable as Dasher. Dasher is really something that you should download and actually try before you comment on it.

      And if you're not up to downloading it, at the very least you should look at its demos (available in either animated gifs or mpeg/avi/asf movies) [cam.ac.uk].

  • by Anonymous Coward on Monday October 25, 2004 @07:58AM (#10619788)
    "You appear to be writing a letter, and here's what you're probably going to say..."
  • ...my mind about apropos is the *nix program
    "NAME
    apropos - search the manual page names and descriptions"
  • ...That this guy will GPL this software rather than start up a private company.
    Then maybe I'd get it in the next version of Fedora.

    I'm so sick of *Tex.

    *sigh*
    • by Anonymous Coward
      A quick check revealed that both Dasher & Apropos are open source. Apropos does not carry any license, but the website says that code is free for anyone to use and modify...
    • Dasher is an input method, not a typesetting engine.
    • I'm so sick of *Tex

      First, it's TeX, not Tex. Secondly, TeX goes through email, and most people who care can read it unrendered very easily, so they don't need to install any dopey software just to read two little formulas in my e-mail. Plus, TeX math notation is fast to type, and you only need to learn a page or so from the TeX manual in order to be able to use it for math. So, how is this Dasher thing better?
      • First, it's TeX, not Tex. Secondly, TeX goes through email, and most people who care can read it unrendered very easily, so they don't need to install any dopey software just to read two little formulas in my e-mail. Plus, TeX math notation is fast to type, and you only need to learn a page or so from the TeX manual in order to be able to use it for math. So, how is this Dasher thing better?

        You're comparing apples and aardvarks here. Dasher is an input method that tries to predict what letter you'll input
    • If you're sick of TeX, give lout [sourceforge.net] a look.
  • I'm not hopeful (Score:2, Insightful)

    by Anonymous Coward

    Dasher works because there is a small number of words that are likely to follow on from where you are. The same does not apply to MathML or HTML. The most useful you are likely to get is tab-completion for tag names, attribute names, etc.

    • What we write is only predictable to the extent that it is redundant: ie when i type "tomor" into my mobile phone, if it's obvious to the phone i'm going to write "tomorrow", i could just send a msg saying "C U tomor".

      It doesn't seem to me that there's anything like as much redundancy in mathematical formulae as there is in written language. When the professor writes "X=..." on the board, it's very hard to predict the next symbol unless you know what x is in fact equal to.
      • Re:I'm not hopeful (Score:3, Informative)

        by KevinKnSC ( 744603 ) *
        You're incorrect when you say that what we write is only as predictable as it is redundant.

        There are over 90,000 words in the English language (based on number of entries in the American Heritage Dictionary), but nobody uses all of them. Good predictive data entry is not just a matter of waiting until you've typed "tomor" and concluding that you're going to write "tomorrow" because no other words begin that way, it's a matter of noticing when you get to "tom" that, based on your past word usage, the most

  • I'm hopeful that this will eventually make it into word processors like in the OpenOffice or Microsoft Office suites. Seems like the best standard fare we have is a little paperclip/dog/wizard/other nuisance asking how he can "help" make a cover letter.
    • I'm hopeful that this will eventually make it into word processors like in the OpenOffice or Microsoft Office suites. Seems like the best standard fare we have is a little paperclip/dog/wizard/other nuisance asking how he can "help" make a cover letter.

      That is a horrible idea. Burn it! Hope it never happens. Sounds like something a college dropout geek who often gets pie'd in the face would do.
  • by $RANDOMLUSER ( 804576 ) on Monday October 25, 2004 @08:05AM (#10619834)
    "Twas brillig, and the slithy toves
    Did gyre and gimble in the wabe:
    All mimsy were the borogoves,
    And the mome raths outgrabe."

    It knew he was going to say that.

    More likely, it's going to predict that someone's going to say "Let's circle back and touch base tomorrow".

  • by pkhuong ( 686673 ) on Monday October 25, 2004 @08:06AM (#10619838) Homepage
    Having used both Dasher and T9, it seems to me that T9 only takes into account the keystrokes entered for each word. It then correlates them to a dictionary. Dasher, on the other hand, is based on Markov chains (yes, like those word/text generators), and thus takes into account the last [n] characters. That makes it much more accurate and, interestingly enough, should make it particularly well suited to editing programs in most mainstream languages, since they have a lot of noise words and frequently used sequences.
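The dictionary-lookup half of that comparison can be sketched in a few lines of Python. This is an illustration of the idea only, not T9's actual implementation: the helper names (`to_digits`, `build_index`, `lookup`) and the word frequencies are made up for the example; only the letter-to-key mapping is the standard phone keypad.

```python
# Standard phone keypad letter groups.
KEYS = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
        "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}
LETTER_TO_DIGIT = {ch: d for d, letters in KEYS.items() for ch in letters}

def to_digits(word):
    """Translate a word into the key sequence that would type it."""
    return "".join(LETTER_TO_DIGIT[ch] for ch in word.lower())

def build_index(word_freq):
    """Group dictionary words by key sequence, most frequent first."""
    index = {}
    for word, freq in word_freq.items():
        index.setdefault(to_digits(word), []).append((freq, word))
    for candidates in index.values():
        candidates.sort(reverse=True)
    return index

def lookup(index, digits):
    """Return the words matching a key sequence; nothing outside the dictionary."""
    return [word for _, word in index.get(digits, [])]

# Hypothetical frequencies, purely for illustration.
word_freq = {"home": 50, "good": 80, "gone": 30, "hood": 5}
index = build_index(word_freq)
candidates = lookup(index, "4663")
print(candidates)
```

All four words share the key sequence 4663, so the lookup can only rank them by frequency; nothing typed earlier in the sentence influences the choice, which is the limitation the comment points at.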
  • I read, "a novel application of statistical theory" as "a novel of statistical theory" - and I was still interested!
  • Quick test (Score:4, Insightful)

    by potifar ( 87326 ) on Monday October 25, 2004 @08:13AM (#10619877)
    MathML was never really intended for writing by hand, and even if Apropos makes it easier, I can't see myself switching from (La-)TeX anytime soon. I can enter extremely complex mathematical expressions at least 20-30 times faster by typing them in TeX than I ever could do clicking around an interface like Apropos.

    MathML is a good idea in theory, but until there are good tools for writing and editing MathML, there will be very few people using it (either for publishing or for archival purposes.)

    • by Anonymous Coward
      MathML is a good idea in theory, but until there are good tools for writing and editing MathML, there will be very few people using it (either for publishing or for archival purposes.)


      Wow, this has got to be the first time in the history of the world that a math person has criticized something for being "good in theory" but not "in practice". It's math! There's nothing in it but theory!
      • Proof by contradiction, obviously! The premise that he is a "math person" must necessarily be wrong. Mathematicians are *never* inconsistent... just incomplete occasionally.
  • by hussar ( 87373 ) on Monday October 25, 2004 @08:17AM (#10619902) Homepage
    As other posters have noted, this sounds a lot like T9, which is used in cell phones for predictive text entry. T9 is a great utility, but it has happened that what I am writing is less predictable, or that there is a more often used combination of letters that results from the keys I have hit. If I don't pay attention, I get the wrong word.

    I can't help but think of someone entering a mathematical equation and concentrating more on his idea than what is being written to the screen. Due to this inattention, the equation doesn't work, he figures he's just wrong, and spends hours/days to find the point at which the computer put in its prediction and not what he thought he entered. Worst case, he could abandon what would have been a great idea.

    Or, imagine this applied to writing computer programs. Say for example, you are writing a program to calculate the correct distance the probe should hold above the atmosphere so it doesn't burn up. Your cube mate distracts you briefly, and...
  • orrect strings are more probable than incorrect ones

    apparently they havnt taken my writing into account
  • I hope so, then I might actually use it! :D
  • data integrity starts w/ data entry. when data entry is reduced to "no" vs "yes-for-now-we-can-fix-it-later", the game is lost; GIGO prevails, then.
  • A big advantage of statistics-based interfaces is that they automatically enforce correctness, because correct strings are more probable than incorrect ones.

    They obviously didn't include many PHBs' writings in their calculations...
    I'm frequently amazed at some of the grammatical... umm... experimentations undertaken by the upper two or three levels of management in their memos -- and the speeling, good grief, the SPEELING!! Is [F7] the last great secret of our civilization?!?!

    • They obviously didn't include many PHBs' writings in their calculations...
      ... or take a look at slashdot. It's inhabitants are some of the biggest loosers on teh intarweb when it comes to spelling and grammer.
  • A big advantage of statistics-based interfaces is that they automatically enforce correctness, because correct strings are more probable than incorrect ones.

    Though probably college educated, the writer of the above sentence has probably NOT BEEN a TA in an English class. Truly correct strings are a rare find :-)
  • by palad1 ( 571416 ) on Monday October 25, 2004 @08:26AM (#10619968)

    I did a quick test run of Dasher instead of RTFA, and as far as I understand, it works by presenting the most statistically-probable letter in the middle of the input area.

    So, by dragging a perfectly horizontal line with my mouse cursor, I was able to create the most statistically-probable sentence.

    Here goes, for Science:

    Kennedy insider&xeathGhed a noviceable. Punt.uetGrance beganic or Central believe t, space ship,' Alice, it is deleasantB.Carzone.That's luJbi

    Conspiracy theorists, area51 nuts and cypherpunks are going to be thrilled!

  • before long you'll have to write only half of your program. the other half is predicted by some neat tool.

    Or imagine the possibilities for book writers. You write half and the rest is predicted based on your previous works. Seems as if some authors already use such a technique ;)
  • Ahem (Score:4, Insightful)

    by kahei ( 466208 ) on Monday October 25, 2004 @08:33AM (#10620024) Homepage

    A big advantage of statistics-based interfaces is that they automatically enforce correctness, because correct strings are more probable than incorrect ones.

    In a rigorous, technical environment, being _usually_ correct is not enough and a statistics-based approach to ensuring correctness is not very useful.

    In an informal environment, correctness is not nearly as common as you might hope, so again a statistics-based approach may well not be as good as actually enforcing definite correctness.

  • Has anybody tried to compile (and succeeded) Dasher for my beloved Zaurus?

    Bye egghat.
    • I have just been playing with it on my laptop, and given the CPU usage here it looks like it would have to be configured for considerably fewer calculations to run on a Zaurus (I don't know if this is even possible), and that looks like it would make it little more useful than the built-in word completion.

      Note: I am only guessing at everything beyond the CPU usage on this machine, but then isn't that the way of /.

  • ...we find the unpredictable more interesting.

    And there are no predictable new ideas. Who could've guessed that Einstein would follow the equals sign with "mc^2"?
  • Why? (Score:2, Insightful)

    by Quixote ( 154172 )
    a properly trained one will make it harder to create pages with blinking text, loads of images etc.

    Why should it? What if I want to create such a page? Why should someone (or something) tell me what to say, or how to say it? And who will "train" such a thing? The Government??

    • Re:Why? (Score:2, Informative)

      by r3m0t ( 626466 )
      'Why should it? What if I want to create such a page? Why should someone (or something) tell me what to say, or how to say it? And who will "train" such a thing? The Government??'

      To make the other (more likely) options more easily available, spend a lot of time poking around for tags with smaller targets *or* type it by hand *or* change the settings to lower the effect of prediction *or* replace the training files *or* just use the damn thing since it'll learn, nobody's telling you to do anything, and The
      • Fuck. I wrote my reply, and the grandparent still got another Insightful mod. You're all crazy! Crazy, I say!
        • Fuck. I wrote my reply, and the grandparent still got another Insightful mod.

          Obviously, you're a karma whore who's trying to work both sides of the issue.

          • What? Karma whoring, me? I thought the post you refer to (grandparent now, heh) wouldn't get modded at all. How am I trying to work both sides? Not like I made the first post in this thread.
  • Have been using this approach for decades.
  • Overuse of this technology will result in repetitive and boring prose. Yes, well-written prose does have some redundancy/predictability -- it helps the reader stay on track, reinforces key points, reminds the reader, etc. This technology will help some writers create more consistent text. Yet I fear that too many will rely too much on this crutch.

    The problem is that the best prose contains unexpected novelty such as plot twists, new facets of a character, joke punch lines, etc. In a true "page-turn
  • The reason predictive interfaces work is that most encodings have some degree of redundancy in them. English text is about 50% redundant information, in an information-theoretic sense, and anything based on XML is going to be more so.

    To see this for yourself, pick a nice big hunk of English text and gzip it. You'll get about 50-60% compression. Now, pick a similar-sized hunk of XML and gzip it - you'll probably get 75% compression or more.

    Tools like this make using bloated, redundant encodings more tolerable by automating some of the redundancy away. It's not clear to me that this is a good thing.
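The gzip experiment suggested above is easy to script. The sketch below uses Python's standard `gzip` module; the function name `saved_fraction` and the synthetic XML sample are assumptions for illustration, and the exact figures will depend on the text you feed it.

```python
import gzip

def saved_fraction(data: bytes) -> float:
    """Fraction of the original size removed by gzip (0.0 means no saving)."""
    return 1 - len(gzip.compress(data)) / len(data)

# Synthetic, highly regular XML (hypothetical); a stand-in for a real document.
xml_sample = "".join(
    f"<row><name>item{i}</name><value>{i * i}</value></row>" for i in range(500)
).encode("utf-8")

xml_saving = saved_fraction(xml_sample)
print(f"XML sample: {xml_saving:.0%} smaller after gzip")
```

To repeat the comparison with real material, read a large English text and a similarly sized XML file as bytes (for example with `open(path, "rb").read()`) and pass each to `saved_fraction`.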
    • Using this technology on source code (for instance) would be an extremely bad thing since it would encourage cut-and-paste or copy-and-mutate approaches to coding. The result would be highly regular and poorly factored source. But, I don't think anyone was actually suggesting this for program code... just a thought.
    • I was about to point out in response to "A big advantage of statistics-based interfaces is that they automatically enforce correctness..." that rather than enforce correctness they will more likely introduce common errors.

      When designing a language - be that a simple one which can be encapsulated in an XML schema for example, or even a complex natural language there is a trade off between being efficiently terse and introducing sufficient redundancy as to allow communicants to differentiate signal from noi
    • To a Lisp hacker, XML is S-expressions in drag.
      And that Lisp hacker would be wrong [prescod.net].

      The linked article neglects to mention Unicode compatibility in its list, but a good read nonetheless.
      • Sorry, Slashdot signatures are limited to 120 characters, and are meant to be short and provocative.

        I've seen that article before. It does a fairly good job of missing the point, or seeing the point and getting it backwards. XML does get one thing right - the idea that chunks of information ought to be self-describing, down to the character set level. Even Common Lisp punts on that one - the spec basically says "we require this subset of ASCII, and here's an API to manipulate whatever your implementatio
        • XML has unneeded complexity that does not give it more representational power - consider the brain-damaged distinction between attributes and sub-elements, or the way namespaces and DTD's sort-of kind-of interoperate.

          I usually treat things as:
          metadata = attribute
          related content = sub element

          You are right in that there are no hard and fast rules for what should be an attribute and what should be an element, but then I really haven't found it to be a real problem once I adopted the above.

          DTDs do suc

          • Wow, my first reply that's longer than the standard Slashdot limit. I'm honored ;-).

            You are right in that there are no hard and fast rules for what should be an attribute and what should be an element, but then I really haven't found it to be a real problem once I adopted the above.

            My heuristic for that is attributes are for metadata that has little or no structure, and is very unlikely to change. In practice, this reduces to "never use attributes" for me.

            My personal favorite is RelaxNG which most p

  • Sure it looks interesting, but I really do not see the point of switching from LaTeX: the de facto standard for any math write-up. MathML: written for computers. TeX: written for humans to write.

    That said, I have been feeling that TeX is a bit outdated as a system, but then I discovered TeXmacs [texmacs.org]. This is a fully wysiwyg editor for TeX, where you type in TeX code and see the formatting instead of the code. I have switched to using it, and would definitely recommend it to others

  • Looks like you can train this thing by giving it large amounts of text in the language of your choice.

    I'm going to pop over to OpenOffice.org, and use their source to create a training document.

    Stay tuned for details.

    ~D
    • Re:Training (Score:3, Informative)

      by Dracolytch ( 714699 )
      Ok, OpenOffice.org proved to be too large for me to really use, so I hopped over to the GIMP instead. I grabbed a copy of their source, and created a text file that appended all of the c files I could find in one directory... About 750k.

      I took the "English with lots of punctuation", and copied the .xml file. It turns out that using their little interface for creating a language is a PITA, and just copying an existing file works pretty well. I tweaked it to change the name of the language, and point to t
  • What's the probability that all of the texts written this way will be similar?
  • MacKay's Dasher is very useful since it's a simple tactile input device. Unlike T9, which speeds up entry using a conventional keypad, text entry with Dasher is based on up/down movements, which some handicapped people who could not operate an ordinary keypad are capable of.

    The statistical properties of languages are utilized in most (successful) approaches for natural language processing [stanford.edu], from part-of-speech tagging, information extraction, syntactic parsing, machine translation to question answering; y

  • This is a test of dasher.
    I find it a bitch to get proper punctuation, nevermind capitalization, and the routine stuttery freezes are amazingly annoying. I suppose if I were incapacitated to the point that I could only type by looking around I would appreciate it alot more though.
    So I'll just call it a really cool toy that is in fact worth trying out and hope some games incorporate some of this technology at some point in the future.
    • This is a wicked cool feature. It is a reminder of how cool Linux can be. I can type directly into the browser window using Dasher!
    • Try it for ten minutes, properly. I don't get the freezes, but I have a fast computer. "Never mind" is two words and would have been easier to type. "alot" similarly. "Dasher" with a capital D is easier, too.

      What are you talking about with the capitalisation? After a full stop (question mark, etc) and a space, the yellow (capitals) box is massive.

      Why games?
      • Define improperly for me.

        I've been pointing at letters for well over ten minutes now. I've figured out the capitals box now, nearly got the punctuation sorted out.

        Why games? Because I find the lack of straightforwardness and its adapting to be the kind of feature I'd like to see in a game.

        My box is a 1GHz Athlon, which I never figured as really slow. I'd be noticing those stuttery freezes even if they were three times shorter though, easily. Perhaps my Gentoo compile is to blame.

        Regarding my spelling,
        • As in, not seriously or without reading the beginner's thing.

          Are you turning the speed slider up? I'm near maximum... 7 I think.

          Oh yes, I've tried it on a .5GHz P3 Gentoo and this 2.5GHz AMD64 Windows (shared, dammit, can't switch) and a Tablet PC (you hover your pen above the screen... very cool). The featureset is about the same (well, you can write into other windows, etc) but I can't really say definitely that the Linux code is slower than the Windows one, since the Gentoo was so low-spec. (All others
          • I've got my hands full at speed 2. I was seriously trying, of course reading manuals and stuff might be a culprit (but not over freezing I hope).

            Game wise I mean as somehow worked into the game mechanics. Just tacking it on to some existing game as is for what it does would be kinda silly.
  • Aim such a product at programmers, and you'll learn a few things about programmers.

    Correct spelling is no longer more probable than incorrect spelling. :-) Programmers as a class are notoriously poor spellers.

    Some misspellings are intentional. I knew a guy who frequently wanted to use MODE as a variable name in his COBOL programs. But MODE is a COBOL keyword and the compiler would hiss at him. So he now always spells it MOAD.

    Likewise some misspellings are due to local culture. Paw through some DEC c
    • "I knew a guy who frequently wanted to use MODE as a variable name in his COBOL programs. But MODE is a COBOL keyword and the compiler would hiss at him. So he now always spells it MOAD."

      Similarly for "list" in Lisp or Scheme. I use "lyst" since I learnt the basics off Douglas Hofstadter from "Metamagical Themas".
  • It is definitely clunky when compared to Dasher, but better than MS Equation Editor etc.

    I will be first to cheer anybody who invents a worse way of typing math than MS Equation Editor. Being better than that is not an achievement at all. Can't they simply learn TeX for their math?
  • Statistics has been used for decades in handwriting input, OCR, speech recognition, systems like T9, and other input modalities. Dasher seems pretty cumbersome in comparison to most of those.

    And the fact that it only generates "correct" input can be a real problem: names, foreign words, etc. just don't come out right.
    • No, no, no! It does not only "generate" words! It simply enlarges the areas of where you're likely to go! You can zoom around and input anything you want, really. Then it'll add what you wrote to the training text so that next time that name is easier to input.

      Much better than T9 in that respect. Handwriting recognition isn't so much better there either (with symbols and numbers, too).
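The "enlarges the areas of where you're likely to go" idea can be illustrated with a small Python sketch. It is an assumption-laden toy, not Dasher's real layout code: the function `layout`, the 600-pixel column, and the `floor` parameter (which reserves a little space for every symbol so that nothing becomes unreachable) are all invented for the example.

```python
def layout(probs, height=600.0, floor=0.05):
    """Split a column of `height` pixels into one box per symbol.

    Each box is sized by the symbol's probability, blended with a small
    uniform share (`floor`) so even unlikely symbols keep a visible target.
    Assumes the probabilities in `probs` sum to 1.
    """
    n = len(probs)
    boxes, top = [], 0.0
    for symbol, p in probs.items():
        share = (1 - floor) * p + floor / n
        boxes.append((symbol, top, top + share * height))
        top += share * height
    return boxes

# Hypothetical next-character probabilities after some context.
boxes = layout({"e": 0.6, "i": 0.3, "z": 0.1})
for symbol, top, bottom in boxes:
    print(f"{symbol}: {top:6.1f} to {bottom:6.1f}")
```

Because every symbol keeps a box, a name or foreign word can still be entered; it just takes more careful pointing than a likely continuation would.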
  • "3GL", third generation programming languages, were supposed to do for programming what these stat predictors do for data entry. They were menu interfaces, using syntax and grammar to offer only the valid options for the "next word" in a program. Usually with dropdown/popup menus for mousing in windows, the new computing paradigm back in the 1980s. But human expression turned out to be much less modal, and the UI always got in the way. Wake me when these interfaces have been playtested, and survive the aren
  • For math notation, no matter how good this might be, TeX is better. First, it goes through e-mail, and it's easy to read unrendered. I.e., people I send TeX notation to are guaranteed to be able to read it, without having to install software that doesn't necessarily exist for their favorite OS. Unless they are too lazy to learn the two pages of TeX documentation that list the math notation:-). Secondly, it's fast to type, and you don't need to take your hands off the keyboard. I doubt that there is any
    • Hmm. Pencil is more expressive, sure. But how are you going to e-mail your notes to people? I mean, you could e-mail graphics, but nothing beats plain text. And how are you going to intersperse your math with text? You don't propose actually writing text with the pencil, do you?
  • A big advantage of statistics-based interfaces is that they automatically enforce correctness, because correct strings are more probable than incorrect ones.

    Feed the entire contents of /usr/dict/words into a Markov generator and you get pretty much the same thing: random words which, whilst not having any meaning, are reasonably syntactically correct.

    http://www.fourteenminutes.com/fun/words/ [fourteenminutes.com].
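A generator of the kind described here fits in a few lines of Python. This is a sketch, not the code behind the linked page: the names `train` and `generate`, the two-character context, and the four-word list standing in for /usr/dict/words are assumptions, and the words are assumed not to contain the marker characters `^` and `$`.

```python
import random
from collections import defaultdict

def train(words, order=2):
    """Record, for each `order`-character context, the characters that followed it."""
    table = defaultdict(list)
    for word in words:
        padded = "^" * order + word + "$"   # "^" marks the start, "$" the end
        for i in range(len(padded) - order):
            table[padded[i:i + order]].append(padded[i + order])
    return table

def generate(table, order=2, rng=random, max_len=20):
    """Walk the chain from the start marker until the end marker or max_len."""
    context, out = "^" * order, []
    while len(out) < max_len:
        nxt = rng.choice(table[context])
        if nxt == "$":
            break
        out.append(nxt)
        context = context[1:] + nxt
    return "".join(out)

# Tiny stand-in for a system word list (hypothetical).
words = ["statistic", "static", "station", "plastic"]
table = train(words)
sample = generate(table, rng=random.Random(0))
print(sample)
```

Every three-character window of the output occurred somewhere in the training words, which is why the results look word-like without meaning anything.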

  • Dasher indeed looks interesting. The heuristics remind me of the input methods for Japanese keyboards where hiragana or katakana are entered, and depending upon the context, a short list of matching kanji is presented to choose from. Elegant solutions to a complex problem.

    However, while Dasher can be compared to the JavaScript application that works with MathML, Dasher and MathML cannot be directly compared. Determining correctness would be from a program reading the DTD or schema of MathML. MathML wou
  • Claude Shannon, the father of information theory, used the idea referenced here in his famous 1950 experiment to calculate the entropy of the English language. See "Shannon Game" at, for example, http://www.math.ucsd.edu/~crypto/java/ENTROPY/ There's also an entire field, often referred to as "Natural Language Processing," which uses empirical observations of large amounts of language data (text or speech) to construct statistical models which do speech recognition, language translation, text summarizatio
    • Dasher is very different from T9. T9 is basically a lexicon lookup system, and has to be abandoned for words not in the dictionary. Dasher lets the user write any string in the language, though some are considered more likely than others. There is a good analysis of T9 here [cam.ac.uk].
    • Dasher does not constrain the writer. Dasher calculates the likelihood of symbols in the string based on the patterns it sees in the symbols of a training corpus. This allows the program to be tailored for various appl
