Using The Web For Linguistic Research

Posted by timothy on Sun Jan 23, 2005 03:50 AM
from the that's-rediculous dept.
prostoalex writes "The Economist says linguists are gradually adopting the World Wide Web as a useful corpus for linguistic research. Google is used, among other resources, to research how the written language evolves and how some non-standard examples of usage become more or less acceptable (The Economist quotes the phrase 'He far from succeeded,' where 'far from' is used as an adverb). LanguageLog is a resource linked in the article, where linguists discuss current peculiarities of the English language."

  • by Peter Cooper (660482) on Sunday January 23 2005, @03:56AM (#11446691) Journal
    It's probably a good thing that they steer away from Slashdot as a corpus of English usage. Or, should I say, in SOVIET RUSSIA it's best Slashdot stays away from THEM! Or is it that only old people use the Internet as a corpus of the English language while pouring hot grits down a naked and petrified Natalie Portman's pants?
  • Indeed (Score:4, Funny)

    by Pan T. Hose (707794) on Sunday January 23 2005, @03:57AM (#11446693) Homepage Journal
    Indeed what their sayin is true. U can learn English very well, especially grammer readin /. frist psots. Teh intarweb seems to certainly kick arse for that sorta research. Very 1337 articel. Thx d00dz.
  • I rue the day... (Score:3, Funny)

    by sandstorming (850026) <johnsee@sandst[ ]ing.com ['orm' in gap]> on Sunday January 23 2005, @03:57AM (#11446694)
    When we might actually say words like 'lol' out aloud. Imagine a deal going down between two mining companies and the CEO of one company with a straight face, and deadly serious demeanour saying to the cameras: "Despite many thinking we pwned them in the deal, we believe it came out leet for every1"
    • Re:I rue the day... (Score:3, Interesting)

      When we might actually say words like 'lol' out aloud.

      I've heard it done. I've also heard 'roffle' (an attempt at pronouncing ROTFL I guess). Bizarre, really, since those terms are attempts to turn physical real-life actions into a verbal-only form.
    • Re:I rue the day... (Score:3, Informative)

      You may be unaware that "lol" actually is a correct word in the Dutch language, meaning (having) fun.

      lol (de ~) 1 [inf.] plezier
      (taken from www.vandale.nl, an authoritative Dutch dictionary)
  • Epiphany (Score:2, Funny)

    It came to me that the English language was in deep trouble when people started saying "rotfl" and "lol" in person. There seems to be kind of a backlash brewing though, with improved email composition styles dictated by employers, and such.
  • Google does it again (Score:3, Interesting)

    by vladd_rom (809133) on Sunday January 23 2005, @04:03AM (#11446708) Homepage

    This is not the first time Google (and search engines in general) has changed how we do things.

    Nowadays copyright holders use Google to search for potential violations of their intellectual property. Plagiarism is easy to detect nowadays thanks to Google as well. Instead of using rather expensive [turnitin.com] systems in order to search for duplicate work, teachers are now one search away from distinguishing original work from the rest.

  • *BSD be dyin' (Score:2, Funny)

    by Anonymous Coward
    It be now official. Netcraft gots confirmed, dig dis: *BSD be dyin'

    One mo'e cripplin' bombshell hit da damn already beleaguered *BSD community when IDC confirmed dat *BSD market share gots dropped yet again, now waaay down t'less dan some fracshun uh 1 p

  • Be careful though... (Score:3, Interesting)

    by Anonymous Coward on Sunday January 23 2005, @04:23AM (#11446742)
    There are more non-native speakers on the web than
    native speakers.
    In the European community the native English
    speaking persons are by far a minority. That way
    French expressions are pouring into the language
    in an unstoppable way. Those expressions are then
    used by native speaking politicians and are
    broadcasted by television. That way they enter the
    mainstream of the English language.

    Regards
    • by new500 (128819) on Sunday January 23 2005, @06:40AM (#11447059)
      . . .

      Those expressions are then
      used by native speaking politicians and are
      broadcasted by television.


      Dude, it's worse, the French have already infiltrated as far as the advertising business and are using covert channels to spread some dangerous crack i heard was called La Liberte :

      http://french.about.com/b/a/081281.htm

      Slightly more seriously :

      Apart from pointing out that your use of the word native is rather presumptive of geographic origin in this big wide internet thing, I wonder if this linguistic adoption has been more one way, towards English, since the internet. OK, the French got Le Weekend and tons of anglicised nouns, tried to ban them all and didn't manage. But I read Friday that a British pilot training firm lost a contract to a French one. The reason cited by the Asian airline was that, whilst the training had to be in English, the French trainers spoke better, clearer, more intelligible English than did the English. I can't argue with that. Sadly.
        • You're overdramatizing. This is a process that will take hundreds if not thousands of years, even with technology helping to accelerate it. It's not like we'll wake up 10 years from now with a unified language and forget how to read today's literature!

  • by Anonymous Coward
    I've used the web for corpus linguistics research. My last big project was to look at a lot of web pages with Mexican and Chilean slang Spanish, and see if there was a difference in vocabulary usage. There was a significant difference; I could, 70% of th
  • 'Language' == spoken || written? (Score:3, Insightful)

    by adam31 (817930) <adam31.gmail@com> on Sunday January 23 2005, @04:49AM (#11446804)
    How do you even pronounce 'pwn3d'? Google is not a tool to study speech patterns, and there's nothing to say that speech even resembles written text.

    The article addresses this in a weird way: it first draws attention to the distinction, but once it reaches its crux, where Google is used as a tool, the distinction is ignored entirely; instead it opts to focus on stranger things.

  • Popular usage != wanted usage (Score:3, Informative)

    by KiloByte (825081) on Sunday January 23 2005, @05:33AM (#11446892)
    Yes, we can record the errors made by the uneducated public (and even those done by, uhm, me). The question is: should we do that or not?

    I was pretty taken aback when a council of linguists in Poland suddenly declared some widely-chastised and not even very popular errors to be valid usage. I've been brought up in the circles of people who not only put a lot of stress on the language you use, but also cruelly point out every incorrect word or phrase you use -- and this made me quite intolerant of bad speech.

    Being but a dirty foreigner, I know that my English can sound bad to the ears of native English speakers -- that's why I sometimes ask people to correct me if they spot errors.

    In other words: some people find careless speech repulsive. Thus, we should do whatever we can to promote correct usage as opposed to legalising incorrect uses.
  • Three types of language (Score:4, Interesting)

    by Dracos (107777) on Sunday January 23 2005, @05:34AM (#11446894) Homepage

    I think that for most of the 20th century, English, like most languages in the industrialized world, was largely static, dominated by the written word, which in turn was dominated by proper grammar. Since WWII, popular culture and faster communications have increasingly exposed us to local vernaculars, mostly through radio and television. The written word lagged behind in its cultural evolution.

    Thanks to the internet (initially email, BBSes and IRC, but more widely known on the Web), we now have a hybrid of the spoken and written word: the "typed word". This form of language evolves at the same rate as the spoken word, and injects its own vernacular as a side effect of the medium: acronym and abbreviation "words" (rofl, how r u), along with common misspellings (pwned), and mixtures of letters with numbers or punctuation (133t, n00b). All of these serve at least one purpose, whether as a form of super shorthand, as an insult, or to appear "cool"; some are merely the result of laziness on the part of the author. Most typed-word terms don't transfer well when spoken.

    One of my hobbies is studying (European) languages and how they are related. Sometimes I worry about the damage the typed word is causing to the spoken and written word (and any proper linguist should at least be interested in the phenomenon). Luckily, most typed-word expressions aren't pronounceable, and the ones that are sound absurd, because they are removed from their original context when spoken, and everyone recognizes gibberish when they hear it. How the typed word affects the written word remains to be seen. Yes, both are typed now, but only the written word has a chance of going through an editorial process. I think it will take a very long time for the formal lexicon and rules of grammar to embrace, however reluctantly if ever, the typed vernacular.

  • Reminds me of "Meme Tree"... (Score:4, Informative)

    by Slur (61510) on Sunday January 23 2005, @05:49AM (#11446935) Homepage Journal
    ...which was this little program I wrote around the nascence of the internet. It took any sentence as input and kept a record of which words preceded each unique word, and which words followed it. The idea was to build up a simple map of which words could precede or follow others, completely without context. From this you could follow paths that made sentences, paths that looped forever, paths that made no sense, and some interesting paths that made unintended sense.
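
    A minimal sketch of the word map described above, not the original program. The names `build_word_map` and `walk` are made up for illustration, and splitting on whitespace after lowercasing is an assumed tokenization.

```python
from collections import defaultdict


def build_word_map(sentences):
    """Record, for every unique word, which words precede it and which follow it."""
    before = defaultdict(set)  # word -> words seen directly before it
    after = defaultdict(set)   # word -> words seen directly after it
    for sentence in sentences:
        words = sentence.lower().split()  # assumed tokenization: whitespace only
        for left, right in zip(words, words[1:]):
            after[left].add(right)
            before[right].add(left)
    return before, after


def walk(after, start, max_steps=10):
    """Follow one path through the map, always taking the alphabetically
    first successor. Stops at a dead end or after max_steps, since paths
    through the map can loop forever."""
    path = [start]
    word = start
    for _ in range(max_steps):
        choices = sorted(after.get(word, ()))
        if not choices:
            break
        word = choices[0]
        path.append(word)
    return path
```

    Feeding it "the cat sat" and "the dog sat down" lets a walk from "the" produce "the cat sat down", a sentence that was never in the input. Picking a random successor instead of the first one would give the more surprising paths.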

    Why a tree? Language and genealogy seem to have a common thread. Meaning is like genetics. Language is expressive. Information is a kind of tree whose branches grow as reality elaborates and past events accumulate. New terms need to be invented for the dynamics we perceive in reality, just as new names are given to individuals as they emerge into the world. Patterns, continuity, periodicity. Such things lie at the heart of material existence and provide the hooks for consciousness itself. Information theory is the next great frontier, along with particle physics. Already they have converged and diverged and converged again. And playing with artificial trees turns out to be a lot of fun.

    As for the "Meme Tree" program ... The next iteration built up a more discrete map by scoring the proximity of unique words within sentences and their inclusion in sentences together. Again, the idea was to build a simple statistical map free of any context, simply to get a sense of pure lexical association.
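
    One way such a proximity score could look, as a sketch under stated assumptions rather than the original scoring: every pair of distinct words that share a sentence earns 1/distance, so adjacent words score 1.0 and words further apart score less. The function name `score_associations` and the 1/distance weight are both hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def score_associations(sentences):
    """Score word pairs by co-occurrence in a sentence, weighted by proximity.

    Assumption: the weight is 1 / (distance in words); the original
    program's weighting is not described.
    """
    scores = defaultdict(float)
    for sentence in sentences:
        words = sentence.lower().split()
        for (i, a), (j, b) in combinations(enumerate(words), 2):
            if a == b:
                continue  # a word is not associated with itself
            pair = tuple(sorted((a, b)))  # order-independent key
            scores[pair] += 1.0 / (j - i)
    return dict(scores)
```

    Scores accumulate across sentences, so pairs that keep turning up close together end up with the strongest association.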

    The theory is that the internal consistency of these various lexical maps should roughly reflect many aspects of associative meaning. You could think of the statistical map as a Gödelian bubble whose "truth" - if you will - is imposed by the laws governing the statistical associations. We don't derive the laws of language and meaning from these exercises, but we create an internally-complete map that reflects something about the nature of meaning.

    There is a practical aim as well. If you can derive the strength of equivalence and the various levels and colors of associative meaning you could in theory build a "Truth Machine" capable of answering any question with a high degree of accuracy. The result of any question could be computed as any other information retrieval problem would be.

    I never got around to having my little Meme Tree programs scrape the internet for random sentences. However, this should be a very simple thing to do. Google has had programming contests in the past - programs that use the Google database in interesting ways. Statistical analysis of language is basically what they do. Research projects on their data could provide stunning insights into the nature of information itself, its relation to language and to reality, and likely into our very nature as linguistic beings.

  • Writing in Japanese (Score:4, Insightful)

    by minairia (608427) on Sunday January 23 2005, @08:04AM (#11447234)
    I am American but have to write in Japanese for work. No matter how much you learn in school, when you write in a foreign language you'll hit a point of wondering if what you wrote is how native speakers say something, or is even understandable. Whenever I hit a point like that, I put the sentence in question (or key fragments thereof) into a Google search. If nothing comes up, I know I have to rewrite. If only a few links come up, I know what I wrote might be a little weird, but is at least understandable. If I get pages and pages of links, I'm golden.
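
    The same check can be sketched against a local collection of texts instead of a search engine. This is only an illustration of the three-way verdict described above: `count_hits`, `judge_phrase`, and the `many` threshold are invented for the example, and no real search API is involved.

```python
def count_hits(phrase, documents):
    """Count how many documents contain the phrase (case-insensitive,
    with runs of whitespace collapsed to single spaces)."""
    needle = " ".join(phrase.lower().split())
    return sum(
        1 for doc in documents
        if needle in " ".join(doc.lower().split())
    )


def judge_phrase(phrase, documents, many=3):
    """Turn a hit count into a verdict. The threshold `many` is an
    arbitrary assumption and should be tuned to the size of the corpus."""
    hits = count_hits(phrase, documents)
    if hits == 0:
        return "rewrite"
    if hits < many:
        return "understandable but odd"
    return "natural"
```

    With a large enough corpus of native writing, a phrase that nobody has ever used comes back as "rewrite", while a common collocation clears the threshold easily.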
    • Re:inner city teens (Score:3, Insightful)

      i countinously question my co-workers (social workers) in telling the youth what is propper and not.

      I'm glad they're telling the youth what is proper; you're clearly incompetent to do so.

      using words... is becoming more than just the normal, it is beco
        • Re:inner city teens (Score:3, Interesting)

          >His meaning is perfectly intelligible, but some language snobs (very few of whom are actually linguists and know anything much about language) pretend not to be able to understand certain accent/dialects in order to feel superior.

          Incomprehension often
      • Programmer grammar (Score:3, Insightful)

        Adding or changing characters in a literal string seems like misquoting. Traditionally in handwritten work the comma went almost directly under the quotation mark. When people shifted to typewriters and then computers, an arbitrary choice was made to put