Physicists Discover Evolutionary Laws of Language

Hugh Pickens writes "Christopher Shea writes in the WSJ that physicists studying Google's massive collection of scanned books claim to have identified universal laws governing the birth, life course and death of words, marking an advance in a new field dubbed 'Culturomics': the application of data-crunching to subjects typically considered part of the humanities. Published in Science, their paper gives the best-yet estimate of the true number of words in English: a million, far more than any dictionary has recorded (the 2002 Webster's Third New International Dictionary has 348,000), with more than half of the language considered 'dark matter' that has evaded standard dictionaries (PDF). The paper tracked word usage through time (each year, for instance, 1% of the world's English-speaking population switches from 'sneaked' to 'snuck') and found that English continues to grow at a rate of 8,500 new words a year. However, the growth rate is slowing, partly because the language is already so rich that the 'marginal utility' of new words is declining. Another discovery is that the death rate for words is rising, largely as a matter of homogenization as regional words disappear and spell-checking programs and vigilant copy editors choke off the chaotic variety of words much more quickly, in effect speeding up the natural selection of words. The authors also identified a universal 'tipping point' in the life cycle of new words: roughly 30 to 50 years after their birth, words either enter the long-term lexicon or tumble off a cliff into disuse and go '23 skidoo' as children either accept or reject their parents' coinages."
This discussion has been archived. No new comments can be posted.

  • Scrabble (Score:5, Informative)

    by SJHillman ( 1966756 ) on Monday March 19, 2012 @08:14AM (#39401673)

    Anyone that has played Scrabble (especially against a computer) knows that there are tons of words out there that no one has ever heard of, most of which you can't even find a definition for. What the hell is a Qi? I don't know, but I can get 66 points for it.

    • Re:Scrabble (Score:5, Insightful)

      by vlm ( 69642 ) on Monday March 19, 2012 @08:20AM (#39401699)

      The problem with Qi is it's about as "english language" as Shinjitai.

    • Re:Scrabble (Score:5, Funny)

      by Anonymous Coward on Monday March 19, 2012 @08:23AM (#39401723)

      It's a show on BBC2.

    • Some Advice (Score:5, Interesting)

      by eldavojohn ( 898314 ) * <> on Monday March 19, 2012 @08:28AM (#39401749) Journal

      Anyone that has played Scrabble (especially against a computer) knows that there are tons of words out there that no one has ever heard of, most of which you can't even find a definition for. What the hell is a Qi? I don't know, but I can get 66 points for it.

      Qi is a simple one: it's a two-letter word, and there are roughly a hundred two-letter words accepted by TWL, which are hackable. Qi is also something I've seen reading Chinese philosophy, so that doesn't really upset me. The ones that really get me when I play against computers or people who cheat are actually the longer ones. Recently I have seen outgnawn, aliquot, mahoes, votive; the list goes on when your friends are using websites to look up permutations.

      You can study this stuff and memorize things like I-dumps: ziti, ilia, ixia, inion, etc. But in the end what really got my scores higher was studying the short 2 and 3 letter words and building thick crossword-like packs of words especially over TL tiles.
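
      The permutation-lookup sites mentioned above can be approximated with a short sketch. This is only an illustration: the `WORDS` list and the `playable` function are hypothetical, and a real site would presumably load a full tournament list such as TWL rather than a handful of words.

```python
from collections import Counter

# Hypothetical mini word list, taken from words mentioned in this comment.
# A real lookup tool would load a complete tournament word list instead.
WORDS = ["qi", "ziti", "ilia", "ixia", "inion", "aliquot", "votive", "mahoes"]


def playable(rack, words=WORDS):
    """Return the words that can be spelled using only the tiles in `rack`."""
    tiles = Counter(rack.lower())
    found = []
    for word in words:
        need = Counter(word)
        # A word is playable if the rack holds every letter often enough.
        if all(tiles[ch] >= n for ch, n in need.items()):
            found.append(word)
    # Longest words first, then alphabetical order.
    return sorted(found, key=lambda w: (-len(w), w))


if __name__ == "__main__":
    print(playable("qiztiix"))
```

      Counting letters instead of generating every permutation keeps the check cheap: each candidate word costs one pass over its letters, no matter how many orderings the rack has. Blank tiles and board hooks are ignored here.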

    • What the hell is a Qi?

      It's one of two common transliterations of a Chinese word that roughly translates as life energy. The other is Chi. Neither is a valid word under the rules of Scrabble, which restricts you to English words. Note that this doesn't prevent it from appearing in the official Scrabble dictionary, along with a large number of other words that the rules would disallow. Transliterations of Greek letters (such as pi, mu, tau) are also allowed by the Scrabble word list, but not by any reasonable reading of the ru

      • by bfields ( 66644 )

        I've heard English speakers use "qi" in English sentences, but can't ever recall hearing anyone use "quoi" on its own in an English sentence, so until we get an ascii-32 tile I think Scrabble is safe from "je ne sais quoi".

        Words are imported from other languages all the time, and it's a judgement call when to start calling them English words. For a game like Scrabble where you need a black-and-white decision in each case, that means the only way to have a complete set of rules is to agree on a dictionary.

      • Re:Scrabble (Score:4, Insightful)

        by Ihmhi ( 1206036 ) <> on Monday March 19, 2012 @11:35AM (#39403523)

        Neither is a valid word under the rules of Scrabble, which restricts you to English words.


        You're a bit wrong there. Qi and Chi would both be "loanwords", i.e. words taken wholesale from another language, usually with no change in spelling or pronunciation. Here, try some others using the official Scrabble dictionary. I'll just throw together a short list, and you see how many of these aren't in there because they're technically not English words at all:

        hibachi (Japanese), karaoke (Japanese), cafeteria (Spanish), alpaca (Spanish), gulag (Russian), taiga (Russian), wiener (German), kraut (German), moped (Swedish), brogue (Irish).

        There's ten different words from six different languages. Only one of that list is not in there - and it will be as surprising to you which one is not in the dictionary as it was to me.

        I get what you're saying, the "je ne sais quoi" example is a good one. But there are certain words from other languages we use that have pretty much been adopted into the language, especially for concepts we really don't have or can't explain as concisely. Granted, some you may have never heard - usually only martial artists could describe what a kiai or kata is, for example - but we have loads of loanwords that are in everyday use in our language. (It personally makes me cringe when people say "hibachi" (hee-bah-chee) and "karaoke" (kah-rah-o-kay) and mangle the Japanese pronunciations, but that's accents for you. The Japanese hilariously mispronounce English words sometimes too, and they certainly misuse our words a lot of the time as well - surely some sort of revenge for all of those trendy kanji tattoos that so many of us Westerners like getting on our bodies.)

        Incidentally, "qi" is in the Scrabble dictionary - at least according to the one on the Hasbro website (which I have linked above).

    • What the hell is a Qi?

      It is the alternate spelling of "Chi", a concept in Daoist philosophy that represents the primal energy of the universe.

      As in "tai chi". As in "qi gong". It is also sometimes spelled "ki".

      The ancient Chinese must have played a lot of Scrabble

      The Scrabble word that bothers me is "aa". I mean seriously. Who even wants to play with you any more? It's not fun when you start bringing out the scrabble dictionary. I thought we said no 2-letter words, anyway. And no, I'm not being a bab

      • Same "qi" as "qi gong", but different "chi" to "tai chi". In Hanyu Pinyin, the system that uses the spelling "qi", "tai chi" is spelled "taiji". The two words are quite different in pronunciation.
      • by bfields ( 66644 )

        "The Scrabble word that bothers me is "aa". I mean seriously."

        Why is this bothering you? Are you getting a little tense? Maybe it's time for a vacation somewhere? Hawaii is nice....

        English-speaking people live in enough different places and have sufficiently diverse cultures and interests that there will *always* be words that seem obviously common words to group X and obviously made up to group Y. If you need a hard-and-fast decision for something like a board game, then you have to choose some authori

    • Qi []
      Just sayin'...

    • by nusuth ( 520833 )

      Life essence, alternative spelling of Chi? This is /. people, get your Qi straight:

      Qi is a great Lisp. It has a Turing-complete, extensible type system and pattern matching like modern functional languages. It has a kernel called KI, which has been ported to classical Lisps, Clojure and JavaScript. Any language that KI has been ported to can be used to compile the Qi compiler. The compiler generates code in the host language, and the resulting code is usually fast.

      It is currently in great flux so I don't recommend actually using it.

  • How many words are "created" by young people to replace their parents' generation's word for the same thing? I suspect that many of the "new" words are already covered, but teenagers want to sound cooler than their parents, or hide their true intentions from them.
    • I remember an episode of 'Recess', a Saturday morn cartoon from the late 90's, where the main characters made up a word to replace swearing: whomps. It wasn't long before the school board dog-piled them, saying it wasn't allowed as they considered it a swear now since all the kids were using it to curse. It was a very interesting episode.

      • "Dog-pile" is pretty recent too. Viz magazine made up a new swear word "fitbin" to put on the front cover simply to point out to WHSmiths (who also used to put Private Eye on the top shelf because they thought it was an adult entertainment magazine simply because of the name) that their in-house censorship of titles was pathetic and Daily Mail-y in the extreme.
  • 'Culturomics'? (Score:5, Insightful)

    by camperdave ( 969942 ) on Monday March 19, 2012 @08:24AM (#39401725) Journal
    'Culturomics'? You'd think that people studying words would be able to come up with a better word than that.
  • by Zaldarr ( 2469168 ) on Monday March 19, 2012 @08:27AM (#39401743) Homepage
    Please. No more portmanteaus with -onomics on the end. I automatically think of Reagan.
  • Dictionary size (Score:4, Informative)

    by Ed Avis ( 5917 ) <> on Monday March 19, 2012 @08:27AM (#39401747) Homepage
    The OED has about 600 thousand words, though still this is a lot less than a million. It would be interesting to see the most commonly used word that isn't in the dictionary.
    • Gullible (Score:5, Funny)

      by mdsolar ( 1045926 ) on Monday March 19, 2012 @08:44AM (#39401869) Homepage Journal
      It's not in the dictionary. Look it up.
      • Re:Gullible (Score:4, Interesting)

        by dmatos ( 232892 ) on Monday March 19, 2012 @11:55AM (#39403811)

        True story - I once convinced my coworker that gullible was not in the dictionary. She pulled out a very old dictionary, and proceeded to look it up, only to find that, no, it was not in there as a separate entry.

        After a bit of digging, I did eventually find it as a conjugation of the verb "gull," meaning "to deceive."

    • by zill ( 1690130 )
    • Webster's throws out words when they are unused ... the OED does not: once it's in, it's in forever ...

      But a word needs to be in common printed use before it will be accepted in the OED, and is proved not to be an ephemeral word.... this probably accounts for the other 400,000 words they spotted, they will be ephemeral neologisms, common mis-spellings, and words not normally written down ...

      I suspect the most common word not in the dictionary that is in their list is either "thier" or "teh" ...
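
      One rough way to look for the most common word missing from a dictionary is to count the tokens that do not appear in a reference word list. The sketch below assumes a plain-text corpus and a hypothetical `dictionary` set; the function name `most_common_unlisted` is made up for illustration, and a real corpus of scanned books would need better tokenization and OCR cleanup than this.

```python
import re
from collections import Counter


def most_common_unlisted(text, dictionary, n=3):
    """Return the n most frequent tokens in `text` that are not in `dictionary`."""
    # Crude tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in dictionary)
    return counts.most_common(n)


if __name__ == "__main__":
    # Toy corpus and toy dictionary, both invented for this example.
    sample = "teh cat saw teh dog and thier cat"
    known = {"cat", "saw", "dog", "and", "the", "their"}
    print(most_common_unlisted(sample, known))
```

      On the toy input, the misspellings float to the top, which is the effect guessed at above. With real data the result depends heavily on which dictionary is used as the reference.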

  • by Anonymous Coward on Monday March 19, 2012 @08:35AM (#39401797)

    ...Grand Unification Theory of Cosmology Proven.

  • by Cazekiel ( 1417893 ) on Monday March 19, 2012 @08:37AM (#39401807)

    My husband works for Merriam-Webster as an assistant editor/lexicographer. You wouldn't believe some of the stuff that goes on there. People will call and demand fame for a word. For example, some guy called in and said he'd been the one to come up with the word 'ginormous', and wanted credit for it. They don't seem to understand the process. MW's archives in the basement are a CIA-esque compilation of language; they'll use every collegiate they have for reference, going all the way back to the first one. Husband says it won't be long before internet-meme creations are included.

    • by zill ( 1690130 )

      Husband says it won't be long before internet-meme creations are included.

      It doesn't take an insider source to figure that out. They included "d'oh" last year, and there's no reason to treat internet-memes differently than TV-memes.

      Depending on your definition of "internet-meme" some already made it on there, for example lol [].

  • I cannot find any mention of them studying anything other than English, and if they indeed only studied English, then do the same findings apply to other languages? I actually highly doubt it, especially when it comes to smaller, less-used languages. Though obviously claiming to have found some universal laws regarding all languages makes for better headlines.

    • Had you clicked the link to the PDF provided in the summary, you'd have stumbled onto their paper -- as in "the thing we're discussing here" -- where they mention Spanish and Hebrew were also studied.
  • by cyocum ( 793488 ) on Monday March 19, 2012 @08:52AM (#39401905) Homepage

    I see this all the time (I have a PhD in the humanities and I am a software engineer) where someone from outside the field does something and claims it is a universal law but really, they just worked on English and cannot (or will not) prove that it works for other languages. Usually, these papers also lack any kind of literature review and ignore many of the problems that this would uncover. I saw one paper by a physicist that tried to use bit fields to model language change; it was just massively reductionist and couldn't explain anything at all for all the mathematical rigour.

    I go to my University's language lunch which has lots of this and scare the pants off grad students by saying "this is all very well but does this work for Japanese or Old Irish or any other language?" This usually makes their faces go white because naturally English is the ONLY language that matters and is therefore "universal".

    • RTFA, they worked on English, Spanish, and Hebrew for precisely that reason.
      • by JasterBobaMereel ( 1102861 ) on Monday March 19, 2012 @10:13AM (#39402581)

        English - An Indo-European language closely related to the Romance and Germanic languages
        Spanish - An Indo-European language, one of the Romance languages
        Modern Hebrew - Hard to classify but has many influences from European languages, mainly Indo-European Romance and Germanic languages

        They didn't pick a very diverse range of languages, mostly one family, of heavily related and cross influenced languages ...

        Pick something else like Yorùbá, or Mandarin Chinese ....?

        • by cyocum ( 793488 )
          If you wanted a more diverse but still very well understood set of languages, I would have gone for English, Sanskrit, and Arabic. English and Sanskrit are distantly related but far enough apart that you can make good inferences from them, and Arabic because there is plenty of it out there and it is Semitic (like Modern Hebrew without all of the loan words from Indo-European languages).
        • by zooblethorpe ( 686757 ) on Monday March 19, 2012 @11:16AM (#39403319)

          I agree with your main point, and agree that the modern Hebrew vocabulary is subject to diverse influences, including European languages.

          That said, Hebrew [] (modern or otherwise) is not that hard to classify -- it is firmly in the Semitic language grouping [], itself part of the Afroasiatic language family []. Hebrew is a cousin to Arabic, and a cousin to ancient Egyptian, Touareg, Somali, and Amharic (Ethiopian).


        • by Nemyst ( 1383049 )

          Actually, I'd have been a lot more interested to know about French, since it's one of the few languages out there that's actively curated by a central organization attempting to limit and document the language's morphing.

          I'd be curious to know whether this is actually affecting the language's evolution in any meaningful way. Considering its close ties to English and Spanish, among others, it would be fairly easy to compare them and notice the influence, if influence there is.

  • by Anonymous Coward

    So physicists have reinvented battleship curves. Congratulations! We couldn't have done it a century ago without you!

  • Irregular verbs (Score:4, Informative)

    by Oswald McWeany ( 2428506 ) on Monday March 19, 2012 @09:03AM (#39402009)

    Mathematical studies of how long irregular verbs might survive in the English language have been around for a long time. I remember seeing the first such article a while back.

    Basically, the more a verb is used, the longer it will take us to be liberated from its influence. Some, like the verb "to be", are so ensconced in our language that they may take many, many generations to eliminate.

    Of course- this ignores any political movement to eliminate them- as countries become closer- if English remains the language of democracy- there may be a push to make English more standard. A new English without all the rule contradictions it currently has would be double-plus good.

  • by CrackedButter ( 646746 ) on Monday March 19, 2012 @09:09AM (#39402041) Homepage Journal

    I'm sure Americans will have created 8000 of those new words each year. Not content with the ones we British gave them, they wanted their own.

  • Is that it's pinning my bullshitometer against the max stop.
  • Physicists claimed the evolution of language was based on some characterization of words of vocalization pattern and energy usage, the idea being that languages which afford more efficient energy requirements to the speaker tend to survive by natural selection process, just as animals in any environment evolve physical characteristics that are specifically adapted to efficient energy usage in that environment.
  • My wife is a linguist and much of the summary sounds like stuff she learned in her classes. The only major thing that sounds new is that he has put a large portion of Google's scans through a computational linguistics algorithm to put hard numbers to what they already believe. I know a lot of computational linguists come from other fields outside of traditional linguistics, but if this guy has become a computational linguist I would think it would be more appropriate to label him as one instead of what
  • Google Books is notoriously inaccurate, especially with dates. I don't know if it's enough to throw their data off, but I wonder if the researchers realize this.

  • []

    Don't spend the whole day on it.

  • Published in Science, their paper gives the best-yet estimate of the true number of words in English—a million, far more than any dictionary has recorded (the 2002 Webster's Third New International Dictionary has 348,000) with more than half of the language considered 'dark matter' that has evaded standard dictionaries (PDF).

    Umm, no. The phrase "true number of words in English" is sufficiently ill-defined to make the question meaningless. There are two ways people think about whether something is

  • Where over 90% of vertebrates have probably been discovered and cataloged, only a few percent of insects, worms etc. may have. A combination of statistics and data mining estimates about 7 million total species.
  • Although a good part of the richness of a language exists because, in the past, there were isolated regions with no fluid communication with others speaking the same language, the internet, by pushing a common culture, is now bringing a lot of words and concepts from other languages into every language. If that process could be controlled or directed (e.g. by mass media, main internet sites, etc.), it could be used to push concepts and word meanings useful for improving a culture, like with this example []. Not sure if we could do in
  • by wilgibson ( 933961 ) on Monday March 19, 2012 @10:57AM (#39403091)
    When taking "History of the English Language" last year as part of my graduate work, the professor I studied under was part of the Middle English Dictionary Project. It was interesting to speak with him on the life and death of words after the printing press, and I remember him giving a 30 to 50 year estimation for a word to cement itself or become rare. It doesn't really seem like this is anything new.
  • Tempest in a teapot (Score:5, Informative)

    by pjpII ( 191291 ) on Monday March 19, 2012 @11:25AM (#39403395) Homepage

    Speaking as a linguist (working on my Ph.D.) this is something of a tempest in a tea-pot. The most relevant use would be for glottochronology [] - a field that's largely been abandoned by anyone seriously working on historical linguistics because of the various problems involved with that approach, including what the authors of the paper find, that the rate of word loss is not constant over time. They have a better idea of the rate of word loss, which could help improve glottochronology, but the method has a lot of flaws regardless.

    Also, the question they're asking - how do words change over time, in terms of coining, becoming current, and becoming obsolete - really isn't a question historical linguists are that concerned about. Historical linguists are much more interested in how the forms of words change over time (phonological change), or how their function changes over time (grammaticalization), whereas the coinage and loss of words isn't often so important, especially on the large scale statistical level. Furthermore, this type of model probably handles languages with phenomena like avoidance speech [] poorly, since that would change how and why words are kept or lost.

    Their language sample is at heart a convenience sample - they happened to have access to lots of data in those three languages, and it is largely written data. Spanish and English are both related languages with very similar cultural contexts, while Hebrew is a strange choice in that it has an ancient history, but only quite recent revitalised usage. Whether most spoken interaction (which is what linguists tend to be more interested in) has even a tiny subset of the total number of words they are talking about is an open question and would be better tested against corpora with a large quantity of spoken data such as the British National Corpus or the International Corpus of English.

    It's an interesting study, but if it hadn't been written by physicists I'm not sure if it would have ended up in Diachronica or the Journal of Historical Linguistics, much less Science. Their "statistical rules" are interesting, but really not of any great use to wider linguistic inquiry. I think its import is really just exaggerated by the fact that science editors read Science and NOT most linguistics journals, and therefore they think it's really impressive.

  • by segfault_0 ( 181690 ) on Monday March 19, 2012 @03:41PM (#39406543)

    Poorly worded title, I don't see any laws, theories, or other predictive content.. just some analysis.
