Google Begat the End of the Scientific Method?
TheSauce writes "In a fairly concise one-pager at Wired, editor Chris Anderson posits that all of our current (or now-previous) models for collecting data are dead. The content is compelling. It notes that we've entered the Age of the Petabyte, where one can collect immense amounts of data that are paradigm-agnostic. It goes on to add a comment from the head of Google's R&D that we need an update to George Box's maxim: 'All models are wrong, and increasingly you can succeed without them.' Have we reached a time where all of our tool-sets are made moot by vast clouds of information and strictly applied maths?"
Ahem (Score:5, Insightful)
WTF?
English, ---, do you speak it?
We had this coming (Score:1, Insightful)
WTF indeed (Score:5, Insightful)
I saw the article yesterday, but it was so WTFey I just moved on...definitely not Slashdot submission material (especially being a Wired article).
Definitions (Score:3, Insightful)
They may lead from one to the other, but they are not all the same thing.
So... (Score:5, Insightful)
Re:Definitions (Score:5, Insightful)
How bout no (Score:5, Insightful)
Um, no. Claims like this demonstrate a lack of understanding of what a model is.
From the perspective of physics, the universe is just a massive amount of data--more data than any single human can comprehend at once. But thanks to the models of Newton we have a set of relatively simple equations that describe, generally, the way bodies in the universe interact. The model is not perfect, but it is useful.
Likewise, Google uses a very explicit model to describe the universe of the web: some pages are more relevant to a given search query than others, and these pages will generally be more 'popular' among other important pages. Again, the model is not perfect, but it is useful.
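The "popularity" model just described can be sketched as a few lines of power iteration. This is a toy illustration of the PageRank idea only, not Google's actual implementation; the four-page "web", the damping value, and all names are made up for illustration.

```python
# Toy sketch of the PageRank idea: a page's score is the chance a random
# surfer lands on it, so pages linked from important pages score higher.
DAMPING = 0.85  # standard illustrative damping factor

def pagerank(links, iters=50):
    """links: {page: [pages it links to]}. Returns {page: score}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # every page keeps a small "teleport" share of rank...
        new = {p: (1 - DAMPING) / n for p in pages}
        for p, outs in links.items():
            share = rank[p] / len(outs)   # ...and splits the rest
            for q in outs:                # among the pages it links to
                new[q] += DAMPING * share
        rank = new
    return rank

web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = pagerank(web)
# C collects links from all three other pages and comes out on top
```

The model is explicit and simple; the scale of the data it is applied to doesn't change that.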
The fallacy is that somehow what Google is doing is a paradigm shift. It's not. It's just applying the same kind of scientific method to a type of data that hadn't existed before.
What I think the article is really trying to say is that Google's data is so massive and complex that we can't ascribe any explanation to the results it gives us. First of all, that is false, because the PageRank algorithm in its simplest form does give us a very explicit explanation (popular pages generally return better results). But even if it were true, Newton faced the same kind of accusations when people called his model of the universe 'Godless' and claimed, for example, that he described how gravity works without actually explaining "why" it works like it does. That kind of accusation has always followed science. There are always more questions raised than answered. This is nothing new.
Don't rule science out yet. (Score:5, Insightful)
The article is utter nonsense. But it's such a rambling mess it's hard to know where to start picking it apart. Perhaps the best place is where he presents, as an example of this new "model-free" approach, a program which includes "simulations of the brain and the nervous system". Uh, hello... a simulation IS a model.
Re:Ahem (Score:5, Insightful)
I used to think that I could translate most dialects of bullshit into English, but this one caught me off guard. The most reasonable explanation is that Chris Anderson is a tool and doesn't know what he is talking about.
For example, data is now "paradigm agnostic". Seriously, wtf? When was data ever not "paradigm agnostic", and when did we develop the need for a term to describe it? Data is data. It is raw and unanalysed, and as such the notion of a paradigm is completely irrelevant.
My Start menu has been Googled (Score:5, Insightful)
For example, for years I would pride myself on my well-tended Windows Start menu. I'd create base categories for my application folders like Hardware, Games, and Internet, and move applications into those folders to keep my Start menu manageable. I blogged about this procedure [demodulated.com] and included a screenshot.
Now that I'm using Vista I have little need to be so organized. I rarely have to navigate manually to an application folder, thanks to the embedded search box on the Start menu. So now my Start menu is a huge clutter, but so what? I now see that exercise as being as futile as dusting the cardboard boxes in the attic.
What question do you ask the data. (Score:5, Insightful)
No. (Score:5, Insightful)
Second, in my experience with large sets of data, you can do all kinds of math to them to bring out interesting relationships but someone with domain expertise is going to have a much better insight into what the data is saying than someone who doesn't. It seems the peak of hubris to think that the techniques taught in every science (social, hard, or otherwise) are worth nothing compared to massive amounts of data. How do you know where to get the data from? How do you apply the data?
I don't think it's quite time to throw out "correlation != causation". In fact, I think now more than ever we need to be able to understand underlying phenomena behind the data precisely because there is so much of it. With so much data, coincidental correlation is going to happen quite often I'm sure.
And, of course, the ultimate reason we need to understand things is for, you know, when the cloud's not there.
Wrong (Score:5, Insightful)
Re:Ahem (Score:2, Insightful)
"For example, data is now "paradigm agnostic". Seriously, wtf?"
Just look at the creation/evolution controversy to see how data is not 'paradigm agnostic'. Each side claims the other's data is unsound based on the paradigm under whose umbrella it falls.
Interesting, ranty, and wrong (Score:5, Insightful)
Re:Definitions (Score:5, Insightful)
Biggest Data Collector LHC relies on Models (Score:5, Insightful)
I thought this was a joke at first. One thing to think about: the biggest data collector of them all, the Large Hadron Collider, which fits the frame given perfectly (delivering terabytes of data in huge data sets), is just the opposite of the described scenario. Models are crucial to picking what data actually gets recorded. In fact, a large part of how good the LHC data will be lies in using models to select which events to capture. The way the data is captured is, of course, also based on long effort and knowledge from previous detectors. This isn't randomly, or even generically selectively, gathering data and then analyzing it. This is targeted data gathering based on complex scientific theories. There have been shouting matches over what to tag for collection based on what people think is important for a given theory - and these will happen again.
As our collection abilities rise exponentially, storage and analysis abilities are not growing exponentially, even though they are increasing at a fast rate! I would argue exactly the opposite of what this article said. We are going to be more and more dependent on our current scientific theories to even be able to appropriately choose the rich data that new sensors and techniques will let us collect. That is, we are more and more dependent on our scientific theories when we get data, not less. Did we even know to collect methylation data when sequencing a genome? How about some other "-ylation"? Without background theory and experience we wouldn't even know some of that stuff was there to collect!
WTF, be serious (Score:3, Insightful)
This is nonsense pure and simple.
One needs to acquire facts. Now these "facts" can come from your own research or, in the age of the internet, someone else's data, but they still need to be collected and verified.
The *only* advantage that Google provides is a more efficient way of sharing and finding facts. Not even all facts: what you'll most likely find are those that are popular and topical.
Historical information, from when newspapers only used dead trees, can be very difficult to find on the internet unless someone else did the research first.
Just to clarify (Score:5, Insightful)
To avoid the same fate as the GP, let me clarify that by WTFey I specifically meant that the article was full of fluff, light on details and generally pointless...which makes me think "WTF." The closest thing to a point I could get from the article was "Nice big blobs of data can be useful, and statistical data based on said blobs could replace the results of scientific research." Mmmkay.
A sensational headline leading to a rather pointless article consisting mostly of fluff: WTF.
Re:Don't rule science out yet. (Score:5, Insightful)
The article does not make a compelling point. It keeps saying that we can give up on models (and science), because now we just have lots of data, and "correlation is enough." What utter BS. Establishing a correlation is not enough. Even if it is predictive for the given trend, it doesn't allow us to generalize to new domains the way a well-established scientific model does. If an engineer is designing a totally new device, that goes above and beyond what any established device has done, what data can he draw upon? If there is no mountain of data, he must rely on the tried-and-true techniques of engineering/science: use our best models, and predict how the new device/system will behave.
The article actually makes this point perfectly clear when it says: Indeed. Merely having tons of data doesn't actually give you insight into what you have measured. You must distill the data, pull out trends, and construct models. I just don't see how having mountains of data about a species, while still being unable to answer simple questions about it, is superior to conventional science (which can answer questions about the things it has discovered).
A deluge of data and data-mining techniques is a boon to science. But I don't see the benefit of giving up on the remarkably successful strategy of constructing models to explain the phenomena we've observed. I somehow doubt that having 20 petabytes of data on electron-electron interactions is more useful than having a concise theory of quantum mechanics.
The Paradigm is the Data Subset (Score:5, Insightful)
For example, to detect stress you might traditionally measure heartbeat, skin conductivity, pupil dilation.
In the "petabyte age" you throw in the number of times the subject uses the letter 's'; how frequently they use the 'reload' button on the browser; what colour of pants they wore last Tuesday; Pepsi vs. Coca-Cola; the number of times they picked their nose in 1997; and any and every other bit of data you have on the subject.
In the "petabyte age", most of the data you sift through will show no correlation, but you have a much better chance of finding the unexpected if indeed, there is some unknown factor out there.
Re:Ahem (Score:5, Insightful)
Re:What question do you ask the data. (Score:4, Insightful)
Exactly. The "deluge of data" is a useful tool, no doubt about it. But Google doesn't make the job of collecting and analyzing data irrelevant any more than the advent of the telescope made the skills and knowledge of astronomers obsolete.
I particularly love this line from TFA:
For instance, Google conquered the advertising world with nothing more than applied mathematics. It didn't pretend to know anything about the culture and conventions of advertising -- it just assumed that better data, with better analytical tools, would win the day. And Google was right.
(Applied) science at its best! "The culture and conventions of advertising" are basically folk wisdom, and folk wisdom is often right but more often wrong. Google took a scientific, unbiased view of how to move bits around and make money with them: start with as few preconceptions as possible, analyze the data, see what happens.
Re:Ahem (Score:2, Insightful)
I saw it as saying "With so much data, you can use that as a base for preliminary research."
You then research those interesting things in traditional ways, but you have started with some sort of insight.
If you have enough images of the sky and stars, you can use the images to look for interesting things first, and then jump on a telescope or satellite when you have something solid to look for.
But to be sure, the author was selling "Google is the Answer" pretty hard. The application of math to problems is never a bad idea, and they are doing it pretty well. And with the evolution of computers, more data and more processing are naturally going to occur.
Re:My Start menu has been Googled (Score:3, Insightful)
Now that I'm using Vista I have little need to be so organized. I rarely have to navigate manually to an application folder thanks to the embedded search box on the Start menu.
If you're going to take your hands off the mouse to run an app, why not just pop open a console and start it from there? I have no use for any sort of start menu; I have a console. It's certainly more flexible than a search bar: you can pass arguments or file names (with wildcards, even) to the application.
Re:Ahem (Score:3, Insightful)
Old way: develop a physical model of how we think things work, test a few cases, refine the model. New way: collect a huge relevant data set, mine the data for interrelationships, make a correlation. Correlation models replace scientific models; no more need for hypothesis testing.
Re:The Paradigm is the Data Subset (Score:5, Insightful)
Don't you run a much higher probability of finding high correlation by chance?
I can expect to find a result that matches my model to 95% certainty about 5% of the time in random data. You can correct for this, but it's against human nature because people like to see the face of Mary in toast.
Learning how to look for correlation in huge uncontrolled data sets will require a new paradigm... or it will ultimately be useless and even perhaps, unsuccessful.
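The parent's 95%/5% point can be checked with a quick simulation: test enough random variables against a random "outcome" at the 95% confidence level and roughly 5% of them "pass" purely by chance. A sketch; the 0.197 cutoff is the approximate critical Pearson r for n=100 at p<0.05, two-tailed.

```python
# Simulate false "discoveries": correlate 1000 random variables against
# one random outcome and count how many clear the 95% significance bar.
import random
import statistics

random.seed(1)

def corr(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (sx * sy)

n, trials = 100, 1000
outcome = [random.gauss(0, 1) for _ in range(n)]
# |r| > 0.197 is roughly the p < .05 threshold for n = 100
hits = sum(
    abs(corr([random.gauss(0, 1) for _ in range(n)], outcome)) > 0.197
    for _ in range(trials)
)
# hits / trials comes out near 0.05: pure noise, yet ~5% "significant"
```

Scale "1000 variables" up to petabyte-sized data sets and the number of spurious hits grows accordingly, which is exactly why blind correlation-mining needs correction.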
Re:The Paradigm is the Data Subset (Score:3, Insightful)
The example Anderson uses in fact shows this. Venter had to have a model of an ecosystem within which he posits the existence of organisms. Through testing (statistical analysis), he finds them. Thus 1) ecosystems house organisms and 2) there are organisms we don't yet know about.
Seems like the scientific method to me.
I've never heard something so ridiculous (Score:3, Insightful)
Google used reams of data to get good at advertising and marketing, so Wired is using this ability to predict the end of SCIENCE?
Do they not realize the difference between these things? Advertising is extremely hand-wavy and vague in the best of circumstances - I would argue that Google's offerings aren't really better than any other method; they're just cheaper for advertisers, and have a much larger base than normal.
I'm honestly astounded at this.
Re:Ahem (Score:5, Insightful)
I'm glad slashdot linked it. I read this the other day and had no idea what to make of it. After the first 20 comments I see I'm not completely retarded.
"Paradigm agnostic" (Score:3, Insightful)
An unknowable paradigm? Interesting.
Re:Interesting, ranty, and wrong (Score:4, Insightful)
A thought-provoking piece written by someone who understands neither the scientific method nor Google. Who doesn't understand the difference between a Theory and a model. Who still doesn't get correlation != causation. Who probably has never had to actually analyze any substantial amount of data before. And who has clearly been raised on a self-important intellectual diet consisting of too much Buckminster Fuller, Kurzweil, Frank Tipler, and Derrida.
And he works at Wired magazine? You don't say.
Predictive power? (Score:2, Insightful)
Take, for example, the orbit of the earth around the sun. Suppose we collected a whole bunch of data on that orbit. Sure, we'd be able to predict what the orbit is going to be, based on past data. But it gives us no other insight. Whereas when we use the theory of gravity (and rotational motion and conservation of angular momentum, etc.), we get far more.
Because we can now turn to, say, Jupiter and the sun. Even if there is no data collected on how Jupiter orbits the sun, we can use the predictive power of our theories, that we have tested on the earth-sun system, to say how Jupiter is going to orbit.
That's a simple example, but you can imagine much more complicated situations. If we simply have correlation, we may be able to say that X is going to do Y based on previous behavior, but if I ask you how something new and unexpected is going to behave, we can get no answer until we take data . . . because we don't know *why* anything happens. And that's why we're never going to replace theories with statistical analysis of data.
There's a place for both. Obviously, just statistics can be very successful (google, for example), but, at least in science, it's not sufficient.
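The Jupiter point above can be made concrete with Kepler's third law, which falls out of Newtonian gravity: calibrate on the Earth-Sun system, then predict Jupiter's period with no Jupiter orbital data at all. A minimal sketch, assuming circular orbits and taking Jupiter's distance as roughly 5.2 AU.

```python
# Predict Jupiter's year from theory calibrated only on the Earth-Sun
# system (1 AU, 1 year), not from any data about Jupiter's orbit.
import math

# GM of the sun in AU^3/yr^2, fixed by Earth's known orbit: T^2 = 4pi^2 a^3 / GM
G_M_SUN = 4 * math.pi ** 2

def orbital_period_years(a_au):
    """Kepler's third law: period from semi-major axis alone."""
    return math.sqrt(4 * math.pi ** 2 * a_au ** 3 / G_M_SUN)

# Jupiter sits at roughly 5.2 AU; the theory predicts ~11.9 years,
# close to the observed 11.86-year orbit
t_jupiter = orbital_period_years(5.2)
```

That is the generalization a pure pile of Earth-orbit correlations can't give you: one calibration, and the formula extends to any body orbiting the sun.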
Re:How bout no (Score:3, Insightful)
But thanks to the models of Newton we have a set of relatively simple equations that describe, generally, the way bodies in the universe interact. The model is not perfect, but it is useful.
You are aware that the Newtonian physics model breaks down when things travel close to the speed of light?
Granted, most of the time we are dealing with things that aren't traveling that fast, but there are many scenarios in physics where we need a different model.
I think what the Googlite is advocating is that very complex systems (like weather systems, financial markets, black holes, the LHC, etc.) which do not fit well with our standard models will need (pause for effect) new models.
Why? Because there is so much data that it's hard to follow the scientific method: chances are you'll never get the same situation again to repeat in a lab (like weather conditions), because there is an infinite amount of data that could be gathered on these complex systems.
Take the LHC Computing Grid [wikipedia.org], for example. The amount of data gathered from that experiment may be astronomical, and it could be quite possible that once you get to that scale on the atomic level you can never have exact conditions each time (of course it may be the opposite, but we won't know until they turn the thing on for a run and see what happens to matter and energy when they do what they plan on doing).
I am not saying that everyone should throw out the scientific method, but I agree with the article that a new model needs to be created for complex systems. After all... we still don't have a 100% accurate model of weather prediction beyond a few days at a time.
Re:Ahem (Score:5, Insightful)
Information doesn't want to be free. But when it isn't, neither are you.
Re:The Paradigm is the Data Subset (Score:3, Insightful)
Yes. The more data you collect, the more likely any two things will be correlated slightly. With millions or billions of data points, you would be shocked to find a variable that does NOT correlate significantly with everything else. That's why "correlation" or "significance" alone becomes less useful, and we need to (a) report effect-size measures to get a better sense of how important the correlation actually is and (b) continue to use our heads (and not always give blind trust to the cloud) to determine which correlations are useful and which ones are fluff.
A correlation that helps place internet ads .0000002% more efficiently might matter to Google but likely doesn't further human understanding or refine our thinking in any practically appreciable way. And because EVERYTHING is correlated at that point, I suppose there are an infinite number of variables we could use to refine our model. I think the only paradigm shift here is that it would take an army of AIs to sift through and bring some meaning to all that noise, and an army of AIs would probably be doing other things with their time. ;p
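The significance-vs-effect-size point above can be demonstrated with synthetic data: bury a 1% effect in a million noisy points, and the correlation stays practically meaningless while becoming many standard errors from zero. A sketch; the 0.01 mixing weight and sample size are arbitrary choices for illustration.

```python
# With huge n, a trivial correlation is still "statistically significant":
# effect size and p-value answer different questions.
import math
import random

random.seed(0)

n = 1_000_000
x = [random.gauss(0, 1) for _ in range(n)]
# y is almost pure noise, with a whisper (1%) of x mixed in
y = [0.01 * xi + random.gauss(0, 1) for xi in x]

mx, my = sum(x) / n, sum(y) / n
sx = math.sqrt(sum((v - mx) ** 2 for v in x))
sy = math.sqrt(sum((v - my) ** 2 for v in y))
r = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

# t-statistic for a correlation coefficient: tiny r, but n is so large
# that r sits far above any conventional significance threshold
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
```

The effect explains roughly r-squared of the variance, a rounding error in practical terms, which is exactly why significance alone can't separate the useful correlations from the fluff.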
Comment removed (Score:2, Insightful)
Re:Just to clarify (Score:3, Insightful)
Re:Don't rule science out yet. (Score:3, Insightful)
Once upon a time cars were pretty simple. The most effective way to fix a car that had broken was to find a mechanic. This was a man trained in the models of how cars work. He would sift through the collection of parts (data) in the car until he noticed an anomaly that he would charge you outrageously for.
Now cars have become so complex that these models are no longer needed. Instead you can just examine the millions of cars that either work or don't work, right there on teh interweb. Once you find a correlation between your car and another car, you can fix the difference without knowing anything about models of "how cars work"!
Err, maybe that analogy was a little too accurate as it has made his argument sound shit?
Re:Just to clarify (Score:4, Insightful)
Quite so, the article was dead wrong.
Having that much data allows for science that wouldn't have happened otherwise, but it doesn't allow us to forget about sound scientific principles. I for one don't want to die because the pharmaceutical company and my doctor thought that a correlation with safety was enough, without doing the experiments to verify it. I could die either way, but correlation just isn't enough in many cases. Statistics don't prove or disprove anything; ultimately, science is about understanding things the way that they are. Statistics can't do that.
If you can collect and store 100 pieces of information about a test subject for 200,000 test subjects at 150 points in time, you can do a huge amount with that. But, the data still needs to be interpreted, verified and placed into a verifiable model.
It doesn't really surprise me that Google would be handling search the way that they do, considering how borderline impossible it is to search for certain things unless you already know what you want. Searching for answers to software bugs ought to be straightforward, but Google seems completely incapable of sanely coping with version numbers without a lot of work.
Comment removed (Score:2, Insightful)
model selection (Score:2, Insightful)
I'd never heard the term "model selection," so thanks for pointing that out. It looks like there really is some good literature to read on the subject.
The process described by the model-selection sites I skimmed still doesn't address what I was getting at, though. "Choosing a model from a set of potential models" is only conceivable when your set of potential models (and set of variables to potentially be modeled) is well bounded.
To put it another way, take the smartest model-choosing algorithm you can find, hand it a pile of data, and say "what do you make of that, smart guy?" I'm willing to bet that the answer is going to be along the lines of "wtf?" unless there is some sort of context or metadata provided along with the data to give the algorithm a hint of what it's looking for. Am I looking for covariance between scalar values among regularly organized groups? Am I looking for white rabbits in the image data from a camera? Is this ASCII or EBCDIC or 8-bit PCM data? You can argue that these questions are trivial, that no algorithm can be *that* general, but that is precisely my point: all known algorithms require significant narrowing down of the problem space by human hands before they can begin to produce useful output.
If you had an algorithm that took *truly* semantics-free data in one end and spit models of regularly occurring features out the other end, you'd be halfway to general AI.
Re:Just to clarify (Score:3, Insightful)