The Importance — and Limits — of Very Large Data Sets
New submitter kodiaktau writes "A recently presented paper discusses how large data sets can improve learning algorithms, but points out that researchers still need to account for bias and incompleteness before drawing conclusions. The paper also goes into the need for responsible business practices to manage these data sets. 'There's been the emergence of a philosophy that big data is all you need. We would suggest that, actually, numbers don't speak for themselves.' The full paper is available through SSRN. Of particular importance is their assertion that even huge data sets can and will be affected by filters or the analyst who is interpreting it. '[Study co-author Kate Crawford] notes that many big data sets — particularly social data — come from companies that have no obligation to support scientific inquiry. Getting access to the data might mean paying for it, or keeping the company happy by not performing certain types of studies.'"
Re: (Score:2, Funny)
How sure are you your data-set is adequate to make that determination?
This is a problem with most data! (Score:4, Insightful)
From the blurb:
Even if you're using data from public institutions, you may still have to pay for it (to cover staff time to procure the data, especially if you're asking for something they don't normally provide, which is quite often). While there won't be any limitations on what you can do with the data once you have it, the provider may not know its own data and databases well, and may simply hand you incomplete or inaccurate data anyway.
So yeah, welcome to the world of using data. Move along, nothing to see here.
Re: (Score:2)
And even if you collect it yourself, if you're at an educational institution, you likely have to comply with IRB (institutional review board) rules if it involves people.
They often don't like you looking for certain types of patterns, or using the data in a way that might harm the people you're studying.
There's medical privacy rules, general privacy rules, etc. And even when not dealing with people, there's lots of moral issues in how you use the data. (and there's moral issues in sharing data -- some gro
At least there IS very large social data sets (Score:3)
At least there IS very large social data sets.
Most sociologists today tend to describe the world using 'deep' interviews of 36 people in the surroundings of the campus, because that way they will get the result they wish to get.
A cynic description, yes, but not too far the truth. So, it is good to see there IS large data sets, somewhere.
Re: (Score:1)
IS a set, ARE sets... Doesn't saying what you wrote out loud trigger any warning sirens? Also, are you trying for "a cynic's description" or "a cynical description"?
Also, it would be nice if you had ended "A cynic description, yes, but not too far the truth" with a rationalization. Like, "based on my experience as a graduate assistant working in the sociology dept" or "based on my own exhaustive research" or even "based on what the voices in my head are telling me." Just how far from the truth is "too far"?
Forget about bias and incompleteness for a moment (Score:2)
This statement
'There's been the emergence of a philosophy that big data is all you need. We would suggest that, actually, numbers don't speak for themselves.'
is not about bias and incompleteness. The person who is looking at the data needs to have the necessary concepts, and it's a bad idea to call that bias. The data won't do the thinking for him (or her). They've just found 3 new exoplanets in old Hubble data. The data hasn't changed, but the people who are looking at it have.
Not a surprise (Score:2)
Those who claim a large dataset is all you need are typically bad scientists who happen to have access to such a dataset. Large datasets reduce one thing, namely noise (random variations). They can be just as biased, incomplete and contaminated with data you do not suspect of being in there as small datasets. In those respects they are not in any way a better approximation of "the truth" than smaller datasets.
But every good scientist knew that anyways.
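For anyone who wants to see the noise-versus-bias point in numbers, here is a quick simulation. It is only a sketch under made-up assumptions: the population (normal, mean 50, sd 10), the cutoff of 45 and all the function names are hypothetical and have nothing to do with the paper itself. The "filter" stands in for any collection process that silently drops part of the population.

```python
import random
import statistics

TRUE_MEAN = 50.0  # hypothetical population mean, chosen for illustration


def draw(rng):
    """One observation from the full population (normal, sd = 10)."""
    return rng.gauss(TRUE_MEAN, 10.0)


def random_sample(n, rng):
    """Unfiltered sample: every member is equally likely to be seen."""
    return [draw(rng) for _ in range(n)]


def filtered_sample(n, rng, cutoff=45.0):
    """Filtered sample: values below the cutoff never reach the data set."""
    out = []
    while len(out) < n:
        x = draw(rng)
        if x >= cutoff:
            out.append(x)
    return out


def mean_error(sample):
    """Absolute distance between the sample mean and the true mean."""
    return abs(statistics.fmean(sample) - TRUE_MEAN)


def run(seed=0):
    """Compare small and large samples, with and without the filter."""
    rng = random.Random(seed)
    results = {}
    for n in (100, 100_000):
        results[("random", n)] = mean_error(random_sample(n, rng))
        results[("filtered", n)] = mean_error(filtered_sample(n, rng))
    return results


if __name__ == "__main__":
    for (kind, n), err in run().items():
        print(f"{kind:>8} n={n:>7}: |sample mean - true mean| = {err:.3f}")
```

With the unfiltered sample the error shrinks as n grows, which is the noise going away. With the filtered sample the error settles at a fixed offset (about 5 with these made-up numbers) no matter how much data you pile on, because the filter is part of the data set, not part of the noise.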