Why the Cloud Cannot Obscure the Scientific Method

Posted by CmdrTaco on Thu Jun 26, 2008 07:43 AM
from the because-of-science-dude dept.
aproposofwhat noted Ars Technica's rebuttal to yesterday's story about "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete." The response is titled "Why the cloud cannot obscure the Scientific Method," and is a good follow-up to the discussion.

Related Stories

[+] Google Begat the End of the Scientific Method? 387 comments
TheSauce writes "In a fairly concise one-pager from Chris Anderson, at Wired, the editor posits that all of our current (or now previous) models for collecting data are dead. The content is compelling. It notes that we've entered the Age of the Petabyte — where one can collect immense amounts of data that are paradigm agnostic. It goes on to add a comment from the head of Google's R&D, that we need an update to George Box's maxim: 'All models are wrong, and increasingly you can succeed without them.' Have we reached a time where all of our tool-sets are now made moot by vast clouds of information and strictly applied maths?"
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • by Bandman (86149) on Thursday June 26 2008, @07:45AM (#23947485) Homepage


    Because a datasource isn't a process?

  • missing link (Score:4, Insightful)

    by lhorn (528432) <(lho) (at) (ffi.no)> on Thursday June 26 2008, @07:51AM (#23947555)
    http://arstechnica.com/news.ars/post/20080625-why-the-cloud-cannot-obscure-the-scientific-method.html [arstechnica.com]
    I like the fact that the web and search/aggregate engines may combine vast amounts of data in ways we now
    cannot imagine - it expands the field for new scientific research enormously. Replace science? No.
  • Crack cocaine makes you stupid.

    Oh, you were talking about the "information cloud" the crackheads at Wired always talk about. Never mind.

  • by Anonymous Coward on Thursday June 26 2008, @07:58AM (#23947621)

    Latest addition to bullshit bingo cards:

    CLOUD

  • by Hoplite3 (671379) on Thursday June 26 2008, @08:01AM (#23947653)

    I'd say that the models are the science. They're how you explain your data. They provide evidence that the experiments make sense, and they guide you by making predictions you can test.

    Moreover, SIMPLIFIED MODELS are good science. Understanding which details can be omitted without impacting the predictive ability of your model shows you know which effects are important and which aren't.

    • I agree, but... (Score:4, Insightful)

      by wfolta (603698) on Thursday June 26 2008, @08:37AM (#23948129)

      What you say is true, Hoplite3. The big issue I see is how people define "model". My guess is that quite a few unfortunately define it as "I got 3 asterisks in the significance test", whether the "model" (say, linear regression) makes sense or not.

      I forget where I read it, but I've been studying linear regression, and there was a fascinating example where, if they had used linear regression techniques on the early "drop the cannonball and time its fall" data, they would have come up with a nice, highly significant linear regression for gravity.
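The cannonball point can be sketched in a few lines. This is a minimal illustration, not the example the poster remembers reading: the data are synthetic (distance = 0.5 * g * t^2, no air resistance), and the 1.0 s to 2.0 s measurement window and the helper name `fit_line` are assumptions made up for the sketch. Inside the narrow window a straight line fits very well, yet extrapolating that line misses the true value badly.

```python
# Synthetic free-fall data: d = 0.5 * g * t**2 (no air resistance).
# The measurement window (1.0 s to 2.0 s) is an assumption for illustration.
G = 9.81  # m/s^2


def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x; returns (a, b, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return intercept, slope, r_squared


times = [1.0 + 0.1 * i for i in range(11)]       # 1.0 s .. 2.0 s
distances = [0.5 * G * t ** 2 for t in times]    # the true law is quadratic

intercept, slope, r2 = fit_line(times, distances)
print(f"fit inside the window: R^2 = {r2:.4f}")

# Extrapolate the straight line to t = 5 s and compare with the true law.
t_far = 5.0
predicted = intercept + slope * t_far
actual = 0.5 * G * t_far ** 2
print(f"t = {t_far} s: line predicts {predicted:.1f} m, true value {actual:.1f} m")
```

The fit statistic alone gives no hint that the model is wrong; only testing it outside the range of the original data (or having a theory that says the relation should be quadratic) exposes the problem.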

      Then there is the whole issue of explanation versus prediction. Something can be predictive while providing no explanation, and perhaps that's where the petabyte idea is going: who cares about explanation if prediction is accurate enough? (Not my philosophy, BTW.)

      • Re:I agree, but... (Score:5, Interesting)

        by Hoplite3 (671379) on Thursday June 26 2008, @08:58AM (#23948415)

        Yes, I think that prediction without explanation is fascinating, but I don't know if it's what I like about science :) Have you ever heard Leonard Smith speak? I saw him at SAMSI, but his MSRI talk is online and is roughly the same. He's a statistician who works in exactly this.

        Some fancy-pants technique he has is better at predicting the future behavior of chaotic systems (like van der Pol circuits or the weather) than physical models. But he also points out that these predictions don't tell you what type of data to collect to make better predictions, and that they don't generalize. One nice "model" he has can predict the weather at Heathrow better than physical weather models (from the same inputs: wind speed, temperature, pressure, etc), but it's useless for predicting the weather in Kinshasa until the model is re-trained.

        I think these types of data analysis tools will be very important in the future, but they won't replace the explanatory power of models. Just like how scientific computing is useful, but never replaced actual experiments.

      • Re:I agree, but... (Score:5, Insightful)

        by aurispector (530273) on Thursday June 26 2008, @09:01AM (#23948449)

        Thank you. Sure, there's a ton of data out there, but how was it collected? What statistical methods were used to analyze the data? How did you select the data set you're analyzing? Nothing I understand about science really applies to data mining a so-called "cloud". Prediction without explanation is just observation. Observation in and of itself is not science. You might have data, but is it the right data?

        I see all this petabyte stuff as interesting and even as a valuable adjunct to real science, but a basic requirement of science is reproducibility and you can't reproduce the data collection.

  • by tist (1086039) on Thursday June 26 2008, @08:14AM (#23947809)
    A large source of data that has a correlation does not somehow imply causation. Even if it works under some conditions (or even all conditions). The science happens when the causation is determined and then applied.
    • Yup. Mathematicians gushing about clouds and implying they have made science obsolete need to have that branded on their butts then be sent back to the mathematics department. They've already done quite enough giving us string theory (look! it's internally consistent! it sounds cool! ergo it's real!)

    • Actually there is a statistical concept "causation" as well.

      So yes, correlation does not imply causation. The reverse is true, though: causation implies correlation. There is only one mathematical relation between "things that correlate" and "causes" that supports this outcome: intersection. All causes correlate.

      So you only need another mathematical property of causation, take the intersection of the concepts and there you'll have a much more precise source for causation.

      You could also simply take the t

      • If correlation occurs with a temporal shift, it is trivially simple to separate cause and effect.

        I have to disagree with that -- it's kinda correct, but I think it oversimplifies and misses some situations. (Note that I'm talking about the general case, not your solar output example in particular.)

        As one example, imagine someone without an understanding of the physics of weather discovered that, at least 10 minutes prior to the arrival of any major thunderstorm, all birds in a particular forest stopp
    • by eli pabst (948845) on Thursday June 26 2008, @09:13AM (#23948661)
      You're exactly right. In fact if anything, science has started moving *away* from the kind of purely computational and statistical correlations that you get through data mining. Granted they are extremely important for generating hypotheses, but journals are much less likely to accept a paper without some kind of experimental validation.

      The large-scale genetic association studies are a great example. There was a day that you could publish a paper solely describing a correlation between a variant in gene X and its association with disease Y. However, because of the way we do statistics in science, sooner or later you'll find a statistically significant correlation simply due to chance alone. In fact the epidemiologist John Ioannidis wrote an article [plosjournals.org] about this (that I believe appeared on Slashdot as well). Now you're often required to show some kind of experimental validation that there is a biological basis that verifies the statistical correlation. The scientific method is not going away anytime soon.
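The "significant by chance alone" point has simple arithmetic behind it. As a rough sketch, assuming the tests are independent and every null hypothesis is actually true (real genetic variants are correlated, so this is an idealization), the chance of at least one false positive among m tests at level alpha is 1 - (1 - alpha)^m. The function name below is made up for the illustration.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests of a
    true null hypothesis comes out 'significant' at level alpha."""
    return 1.0 - (1.0 - alpha) ** num_tests


# How quickly a spurious "hit" becomes near-certain as tests pile up.
for m in (1, 10, 100, 1000):
    print(f"{m:>5} tests: {chance_of_false_positive(m):.3f}")
```

With a single test the risk is the nominal 5%, but with a hundred tests a spurious hit is almost guaranteed, which is one reason journals ask for independent experimental validation.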
      • by damburger (981828) on Thursday June 26 2008, @08:38AM (#23948141)
        Wrong - imply has a very specific meaning to mathematicians and scientists. 'A implies B' means that if A is true, B MUST be true also.
        • Re: (Score:2, Interesting)

          Fine. I'll try to restate my point using more specific language.

          The fact that correlation does not imply causation isn't nearly as troublesome as the volume of "Remember folks correlation!=causation" would have us believe; lacking other evidence, it is a reasonable assumption to start with.

          • Re: (Score:3, Interesting)

            But nobody said that here, so your whole point is a strawman. I think it's safe to assume that nobody on /. thinks correlation!=causation because that would make all science impossible.
              • Whatever. You waded in saying

                Of course correlation implies causation
                and then when I pointed out the flaw in your argument you backtracked. Nobody said correlation!=causation, now saying

                People say it all the time.
                just makes you sound like you won't admit you are wrong.
          • The correlation != causation tag is usually applied because either:

            1. There are obvious confounding factors the article fails to mention
            2. There's a good chance the direction of the arrow of causation is incorrect. e.g. just because firemen tend to be where you see big fires, doesn't mean they cause them. Or perhaps less obviously, aluminum doesn't cause Alzheimer's, it builds up in the brain as a consequence of Alzheimer's. Statistical inferences are only as good as the data available to you, and you need theo
        • ... and correct if you ask some logicians and linguists. For us, imply [stanford.edu] means "something meant although not said but (through different mechanisms) conveyed" and entail [wikipedia.org] means "if A is true, B MUST be true also".
      • In science, the phrase usually used is "correlation does not imply a specific causation." It does, of course, imply some correlation and most of modern science is noticing correlations and testing for causation.

  • by gopla (597381) on Thursday June 26 2008, @08:16AM (#23947817)

    All models are wrong, but some are useful.

    We still need scientific methods to develop useful models and understand and refine the existing models. When Newton defined his mechanics that was the state of the art in his era, and now we have progressed to quantum mechanics which might be refined tomorrow.

    But mere observation of some phenomena is not sufficient to postulate the behaviour in a changed condition. A scientific model and its rigorous application is required for this. Correlations drawn from the cloud cannot substitute it.

    gopla

  • The point of the last story was horribly miscommunicated. There were two main points. The first is that data is expanding in such scope that hierarchical organization systems don't work. The second is that we're approaching a time where the method or analysis of data to show causation will come from correlation, because you can determine all the variances once all the variables have been accounted for. Look at the human genome project or folding at home.

    I don't think this is completely tr

    • by phobos13013 (813040) on Thursday June 26 2008, @08:39AM (#23948161)
      You seem to be missing a fundamental flaw in the argument. No matter how many parameters you account for a) you can never account for ALL parameters of this system we call life (if for no other reason, there may well be some we don't know about yet!), and b) most importantly, even if you DO have all the parameters and the results show a correlation, there is no logical jump one can make that says it is the cause of the observed behavior.

      Truly what yesterday's article was saying is that causation or correlation is meaningless if you have a mimic of the real world in the form of a collection of data. You don't need a model that is accurate or valid or anything. You just need to run the data in the exact replica of reality. This is the simulacrum. The first problem is that data does not just run itself. At the least it needs an algorithm to be processed to a result. That's the model; without it, it's just useless data, which has been mentioned already yesterday in comments. But second, the problem with even ATTEMPTING such an idea is that you lead yourself into a situation where you "predict" the future and then operate to become that future, thus destroying the creative nature of humanity and becoming the self-fulfilling prophecy of machine code!

      Keep in mind I speak mostly of social sciences that try to pattern human behavior. For hard sciences, etc., all you have done is create a simulation of reality, but it tells you nothing about the reality. It merely mimics it. There is no insight in creating a map the size of the United States; at best it is a work of art.
  • by Angostura (703910) on Thursday June 26 2008, @08:20AM (#23947863)

    In general I'm right behind the rebuttal. However John Timmer chooses a very bad real-life example as his rebuttal champion.

    He asks: ...would Anderson be willing to help test a drug that was based on a poorly understood correlation pulled out of a datamine? These days, we like our drugs to have known targets and mechanisms of action and, to get there, we need standard science.

    These days we may like our drugs to have these attributes, but very often they don't. There are still quite a few medicines around that clearly work and are prescribed on that basis, but for which there is only the haziest evidence as to how exactly they work.

    The good thing about the scientific method, however, is it gives us a framework to investigate these drugs' actions - even if the explanation is still currently beyond us.

    • You're right about the medicine example. It's odd that medicine has an incredibly rigorous statistical process before approval, yet many medicines are basically black boxes.

      Look at statins (cholesterol medication), which are one of the most widely-prescribed medicines in the world -- and which I take. There's a legitimate question as to whether their main effect is to reduce cholesterol levels, or whether it's actually a specific kind of anti-inflammatory which happens to reduce cholesterol levels.

      Or how ab

      • He makes statements about treatments, causes, and outcomes as if they were God given truths proven to the world beyond all doubt. In truth medicine seems to this mathematician as a field governed sooley by statistical correlation with next to no concern over (a) what is the actual cause is, (b) testing the hypothesized cause in any meaningful way. I've read study after study that goes through a wonderful presented statistical analysis to conclude that such and such drug works well at treating such and suc
        • Re: (Score:3, Interesting)

          In truth medicine seems to this mathematician as a field governed sooley [sic] by statistical correlation with next to no concern over (a) what is the actual cause is, (b) testing the hypothesized cause in any meaningful way. I've read study after study that goes through a wonderful presented statistical analysis to conclude that such and such drug works well at treating such and such symptom; they then close with a couple of paragraphs as to why (they think) the drug is working often not using an qualifier

  • by phobos13013 (813040) on Thursday June 26 2008, @08:20AM (#23947865)
    Truly, the whole reason someone like Mr. Anderson could claim the end of science because of data is that he is a writer, a thinker, and large part businessman. Businessmen do not think about Science and how to use it to come with a method that produces a conclusion. He uses information to come up with ways to illicit a reaction in people. So to him data is more important than science because he uses it for his purposes. That is marketing, and the "science" of marketing has almost always been that way.

    Mr. Anderson was not prescient in any way, he was just speaking his perspective. The only thing is we must be careful to even consider his proposition as a valid reality worth pursuing. Not for true scientists, but from a social perspective, or it will truly be the end of science. There are some in power as it is already attempting to make this happen.

    That said, I almost consider responding to yesterday's article as falling for the argument. But, since it hit the /. this article is as cogent a rebuttal as one can make.
    • to come up with ways to illicit a reaction in people

      elicit == v. evoke; illicit == adj. illegal

      BTW, it seemed obvious to me that he equated data discovery with scientific discovery, which is a big mistake. Adding to the sum of human knowledge is not the same as adding to the sum of human understanding, and using datamining and other automated tools for correlation determination does not in any way increase understanding.

      Data discovery is about increasing knowledge. Scientific discovery is about increasi

  • by damburger (981828) on Thursday June 26 2008, @08:24AM (#23947919)

    And can back up this rebuttal with a practical example. I am a physicist, I know sod all about blood samples, or proteins, or cancer. I get a pile of mass spec data (about a billion data points or so on some days) and through binning, background subtraction, and a string of other statistical witchcraft I produce a set of peaks labeled according to intensity and significance.

    This does not make me a cancer researcher. This data has to go back to the cancer guys and they have to pick out the biomarkers and thus develop new diagnostic tests, based on principles that I don't understand. I am master of the information but entirely blind as far as the science is concerned. Same goes for Google.

  • Duh! (Score:5, Insightful)

    by es330td (964170) on Thursday June 26 2008, @08:26AM (#23947941)
    When I read the original article my thought was that someone was just trying to write something to get noticed. The Scientific method, IMHO, is all about a person or group of persons using a logical process to determine the validity of an idea. Observing massive amounts of data can reveal relationships that may not have been noticed in other ways, but at the end of the day the process of "I think X, I wonder if it is true", the heart of the scientific method, can no sooner become obsolete than we can stop being human. The questions of What, Why and How are so fundamental to humans as humans that nothing short of total omniscience will ever replace the logical process represented by the scientific method.
  • traditionally, science forms its hypothesis, and performs an experimentum crucis to test the hypothesis; rinse & repeat. it seems to me that 'the cloud' refers to a hitherto statistically huge number of samples of data points from which to extract our knowledge of the world -- a sort of broad collection of facts derived from constantly and systematically varying the experimental conditions -- an exploratory experimentation. goethe outlines a method of Exploratory Experimentation in the essay The experim [rsarchive.org]

  • by starfire-1 (159960) on Thursday June 26 2008, @08:46AM (#23948261)

    I have always viewed this debate in the context of scientist vs. engineer. That is one who views data as "good and true" vs. "good enough". That's not a slam on engineers (I am one), but a reflection of the balance between the two. A scientist who never applies theory sits in an empty room. An engineer who builds things without science sits in a cluttered room surrounded by useless objects.

    I do find interesting though that the advent of "google data" may indicate a flip in order of the two disciplines. Historically (IMHO) science has led engineering. A theoretical breakthrough, provable by the scientific method, may take years to give birth to a practical application. Now, with enormous piles of data and the knowledge that "good enough" is often good enough, we may be creating useful objects that will take science many years to explain and model.

    The biggest issue and omission in both of these pieces is that this "cloud" of data does not represent "truth" (as the scientist may seek), but rather a summation or averaging of the "perception of truth" as seen by the individual authors. The cloud, therefore, is only as useful as humans' ability to divine truth without the scientific method.

    My two cents. :)

    • Re: (Score:3, Insightful)

      I have a theory that some of the best engineers are scientists, and some of the best scientists are engineers.

      Scientists often need to build crazy stuff to figure things out, and engineers often need to figure things out to build crazy stuff. Because they are each result oriented, they don't get hung up on the things that someone in field would.

  • by mlwmohawk (801821) on Thursday June 26 2008, @08:54AM (#23948355)

    I have a problem with the Google generation. Sure, they can parrot facts and find things in an instant, as can any slashdotter I'm sure, but knowing something is not the same thing as understanding something.

    A coworker asked me yesterday "how do you call a C++ class member function from C [or java]?" The question is an example of pure ignorance.

    If they "understood" computer science, as a profession, this would be a trivial question, like how do I or can I declare a C function in C++. The second question is what google can help you with while having to ask the first question means you are screwed and need to ask someone who understands what you do not. Not understanding what you do for a living is a problem.

    How programs get linked, how environments function, virtual machines vs pure binaries, etc. These are important parts of computer science, just as much as algorithms and structures. You have to have a WORKING knowledge of things, i.e. an understanding.

    Google's ease of discovery eliminates a lot of the understanding learned from research. Now we can get the information we want, easily, without actually understanding it. IMHO this is a very dangerous thing.

    • Wow, one of the best postings I have read for months.

      Although I wouldn't call it "very dangerous", you are so right about the difference between, what you call, knowing and understanding. Raw data and number crunching is only one step towards understanding. Interpretation of the data and in the end really grasping the problem and hopefully a solution are something different.

      Theories may have gone wild in some sciences in the sense that theorizing is overvalued compared to data munching, but theories and

  • Petabyte technology suggests new avenues of scientific investigation, but doesn't end science or older alternative ways of doing things. The clever thing is to be first to discover the new possibilities.
  • by GodWasAnAlien (206300) on Thursday June 26 2008, @09:44AM (#23949107)

    Science and openness go together.
    Without openness, we all are reinventing private wheels, which we destroy the plans to when there is no profit.
    If you work in software, consider for a moment how scientific your work is, considering the work of other companies doing similar work.

    This Clouds thing is the "billion monkeys/humans typing on keyboards" model.
    Yes, it really can work (with humans).
    But, as with science, the chaos development model only works with openness.

    Of course, organized science along with a little chaotic development would work even better.

    There are forces in our society that do not like any open model. The Microsofts, the MPAA, the RIAA. These types of organizations thrive on closed models. More copyright controls, more DRM, longer copyright and patent terms.
    These forces would prefer to own, control and close science and clouds of data. They are unaware of the inevitable impact of such actions.

    In a free capitalist society, we are naturally driven by contrary forces.
    A desire to hide discoveries, to maximize profits, even at the expense of innovation.
    A desire to share discoveries, to contribute to society and for credit.

    While it is possible to profit when ideas are shared,
    it is more difficult to contribute to society by hiding information indefinitely.

  • There are coefficients we use in models that we don't fully understand in the physical world. We obtain those coefficients through empirical data. To rely solely on those models for design ignores the fact that those coefficients may change for any reason in the real world, because we don't fully understand what factors influence them.

    In my experience this only applies to certain sciences. Most of my experience with such systems is in the area of fluid mechanics, and thermochemistry. Models can save y
  • Links need thought (Score:3, Interesting)

    by FlyingBishop (1293238) on Thursday June 26 2008, @01:50PM (#23953719)
    I had a nice example of the complete inadequacy of Google's thought-agnostic approach to links while browsing around looking for information on samba and fuse under Linux. Google's ad bars, completely misinterpreting the context, offered links to fuse boxes, as in wiring, and Samba lessons, as in dancing. But then, maybe I'm not giving Google enough credit. It might have actually recognized the pointlessness of trying to market software to a Linux user, and taken the obvious step of throwing in some complete non sequiturs in the hopes of catching something of value.
  • by frogzilla (1229188) on Thursday June 26 2008, @03:10PM (#23955733)
    Wasn't this all demonstrated 100 years ago by Francis Galton [wikipedia.org] and an Ox? What's new is that there are more data points and better techniques to identify interesting correlations. Probably this is what we do internally anyway. All of our sensory input is correlated and the interesting bits are filtered out by specific algorithms trained by evolution. What is fascinating to many are the times when these algorithms are spectacularly wrong.