Open Source IT Science

Open Data Needs Open Source Tools 62

Posted by Soulskill
from the stop-trying-to-fork-reality dept.
macslocum writes "Nat Torkington begins sketching out an open data process that borrows liberally from open source tools: 'Open source discourages laziness (because everyone can see the corners you've cut), it can get bugs fixed or at least identified much faster (many eyes), it promotes collaboration, and it's a great training ground for skills development. I see no reason why open data shouldn't bring the same opportunities to data projects. And a lot of data projects need these things. From talking to government folks and scientists, it's become obvious that serious problems exist in some datasets. Sometimes corners were cut in gathering the data, or there's a poor chain of provenance for the data so it's impossible to figure out what's trustworthy and what's not. Sometimes the dataset is delivered as a tarball, then immediately forks as all the users add their new records to their own copy and don't share the additions. Sometimes the dataset is delivered as a tarball but nobody has provided a way for users to collaborate even if they want to. So lately I've been asking myself: What if we applied the best thinking and practices from open source to open data? What if we ran an open data project like an open source project? What would this look like?'"
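The silent-fork problem Torkington describes is easy to make concrete. As a minimal sketch (all names and data here are hypothetical), a manifest of per-record content hashes lets two copies of a dataset detect exactly where they have diverged, the same way version-control tools do for source files:

```python
import hashlib
import json

def manifest(records):
    """Map each record's key to a SHA-256 hash of its canonical JSON
    form, so independently edited copies can be compared."""
    return {
        key: hashlib.sha256(
            json.dumps(record, sort_keys=True).encode("utf-8")
        ).hexdigest()
        for key, record in records.items()
    }

def diverged(a, b):
    """Return keys whose hashes differ, or that exist in only one copy."""
    keys = set(a) | set(b)
    return sorted(k for k in keys if a.get(k) != b.get(k))

# Two copies of a dataset that have silently forked:
upstream = {"r1": {"site": "A", "temp": 21.5},
            "r2": {"site": "B", "temp": 19.0}}
local    = {"r1": {"site": "A", "temp": 21.5},
            "r2": {"site": "B", "temp": 19.2},
            "r3": {"site": "C", "temp": 18.7}}

print(diverged(manifest(upstream), manifest(local)))  # → ['r2', 'r3']
```

Once the divergent keys are known, merging the additions back upstream becomes a tractable review task instead of guesswork.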

Open Data Needs Open Source Tools

Comments Filter:
  • eclipse? (Score:4, Informative)

    by toastar (573882) on Tuesday March 09, 2010 @01:32PM (#31416114)

    Is Eclipse not open source?

  • Re:eclipse? (Score:4, Informative)

    by Monkeedude1212 (1560403) on Tuesday March 09, 2010 @01:38PM (#31416228) Journal

    Who modded him offtopic?
    Eclipse has an open source Data Tools Platform [eclipse.org]

  • Open Street Map (Score:3, Informative)

    by Anonymous Coward on Tuesday March 09, 2010 @01:40PM (#31416260)

    A perfect example of collaboration on a massive dataset:

    http://www.openstreetmap.org/

  • Use Open Standards (Score:5, Informative)

    by The-Pheon (65392) on Tuesday March 09, 2010 @01:46PM (#31416332) Homepage

    People could start by documenting their data in standardized formats, like DDI 3 [ddi-alliance.org].

  • by viralMeme (1461143) on Tuesday March 09, 2010 @01:50PM (#31416412)

    > Wikipedia. With all the inherent problems of self-proclaimed authorities who don't know what they're talking about ..

    Wikipedia isn't an open source project, it's an online collaborative encyclopedia. Mediawiki [mediawiki.org] on the other hand is the software project that powers Wikipedia.

  • Parent not a troll. (Score:4, Informative)

    by aristotle-dude (626586) on Tuesday March 09, 2010 @02:43PM (#31417068)
    Having lots of eyes looking at code is no substitute for running tools like Coverity on your software, along with test-driven development. Humans can easily miss problems with code that a tool or smoke test can uncover.
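    The parent's point is easy to illustrate with a toy sketch (the function and values are invented for the example): windowed code is exactly where off-by-one errors slip past many pairs of eyes, while a two-line smoke test pins the behaviour down immediately:

    ```python
    def moving_average(xs, window):
        """Average over a sliding window -- the kind of index arithmetic
        that reads plausibly even when it is wrong by one."""
        if window <= 0 or window > len(xs):
            raise ValueError("bad window size")
        return [sum(xs[i:i + window]) / window
                for i in range(len(xs) - window + 1)]

    # Smoke tests that a code-reading pass might never think to try:
    assert moving_average([1, 2, 3, 4], 2) == [1.5, 2.5, 3.5]
    assert moving_average([5], 1) == [5.0]
    ```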
  • by GrantRobertson (973370) on Tuesday March 09, 2010 @03:58PM (#31418142) Homepage Journal

    > Perhaps there is a skillset worth defining here - some offshoot of library sciences?

    That offshoot is called "Information Science." Most "Library Science" programs now call themselves "Library and Information Science" programs. There is now even a consortium of universities that call themselves "iSchools." [ischools.org] In my preliminary research while looking for a graduate program in "Information Science" it seems as if the program at Berkeley [berkeley.edu] has gone the farthest in getting away from the legacy "Library Science" and moving toward a pure "Information Science" program.

    I personally think that the field of "Information Science" is really where we are going to find the next major improvements in the ability of computers to actually impact our daily lives. We need entirely new models of how to look at dynamic, "living" data and track changes not only to the data but to the schema and provenance of that data. That is how "data" becomes "information" and then "knowledge." I won't write my doctoral thesis here, but suffice it to say that simply squeezing data into a decades-old model of software version control is not quite going to cut it. In software version control you don't have as much of a trust problem. Yes, you do care if someone inappropriately copies code from a proprietary or differently-licensed source. However, you don't have as much incentive for people to intentionally fudge the code/data one way or another. In addition, data can be legitimately manipulated, transformed, and summarized to harvest that "information" out of the raw numbers. This does not happen with code. Yes, there is refactoring, but with code it is not as necessary to document every minute change and how it was arrived at. With data, the equations and algorithms used for each transformation need to be recorded along with the new dataset, as do the reason for those transformations and the authority of those who performed them.

    Throw into the mix that there will be many different sets of similar data gathered about the same phenomena but with slightly different schemas and different actual data points which will all have different provenances but will need to be manipulated in ways to bring their models into forms that are parallel to all the other data sets associated with those phenomena while still tracking how they are different ... and you will see that we don't just need a different box to think outside of, we need an entirely different warehouse. (You know, the place where we store the boxes, outside of which we will do our thinking.)

    Many of the suggestions posted here are a start, but only a start.
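    The idea above, that every derived dataset should carry a record of the transformation that produced it, can be sketched concretely. This is a toy design, not an existing tool; every name and value in it is hypothetical:

    ```python
    import hashlib
    import json

    def _digest(data):
        """Content hash of a JSON-serializable object."""
        return hashlib.sha256(
            json.dumps(data, sort_keys=True).encode("utf-8")).hexdigest()

    def transform(data, func, description, provenance):
        """Apply func to data and append a provenance record saying what
        was done, to which exact input, producing which exact output."""
        result = func(data)
        provenance.append({
            "operation": description,
            "input_sha256": _digest(data),
            "output_sha256": _digest(result),
        })
        return result

    log = []
    raw = [12.1, 11.8, 12.4, 55.0]  # one obvious sensor fault
    cleaned = transform(raw, lambda xs: [x for x in xs if x < 50],
                        "drop readings >= 50 (sensor fault threshold)", log)
    mean = transform(cleaned, lambda xs: sum(xs) / len(xs),
                     "arithmetic mean of cleaned readings", log)

    print(round(mean, 2))  # → 12.1
    print(len(log))        # → 2
    ```

    Anyone receiving `mean` alongside `log` can see that an outlier was dropped, by what rule, and can re-verify the chain by recomputing the hashes, which is exactly the trust problem the parent says plain version control doesn't address.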

  • by Bazman (4849) on Tuesday March 09, 2010 @06:07PM (#31419936) Journal

    Have you looked at the CKAN software (www.ckan.net)? They run their own knowledge archive, and the software also powers the UK's data.gov.uk site. RESTful API and Python client.
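    For a taste of that REST API, here is a minimal sketch against CKAN's v3 Action API (the site URL is a placeholder, and endpoint details may vary between CKAN versions):

    ```python
    import json
    from urllib.request import urlopen

    def action_url(site, action):
        """Build a CKAN Action API URL, e.g. for 'package_list'."""
        return f"{site.rstrip('/')}/api/3/action/{action}"

    def parse_response(body):
        """CKAN wraps every result as {'success': bool, 'result': ...}."""
        payload = json.loads(body)
        if not payload.get("success"):
            raise RuntimeError("CKAN call failed")
        return payload["result"]

    # Live call (requires network, so commented out here):
    # datasets = parse_response(
    #     urlopen(action_url("http://example-ckan-site.org", "package_list")).read())

    print(action_url("http://example.org/", "package_list"))
    # → http://example.org/api/3/action/package_list
    ```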

  • OpenDAP (Score:3, Informative)

    by story645 (1278106) <story645@gmail.com> on Tuesday March 09, 2010 @06:57PM (#31420580) Journal

    The main point of the openDAP [opendap.org] project is to facilitate remote collaboration on data, and a few organizations already use it to share data. I've used the Python variant for NetCDF files and found it works well, and the web interface is clean. The best part of the OpenDAP project is probably that the data doesn't need to be downloaded or copied to be processed, which is really important for anyone who can't afford the racks of hard drives some of these datasets need.
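    The "no download needed" part works because DAP lets the client append a constraint expression to the dataset URL so the server only sends the requested hyperslab. A sketch of that URL syntax (the server path and variable name are made up, and real servers may differ in detail):

    ```python
    def subset_url(base, var, *slices):
        """Build a DAP2-style constraint expression selecting a hyperslab,
        e.g. temp[0:1:9][5:1:5], so only that slice crosses the wire."""
        dims = "".join(f"[{start}:{stride}:{stop}]"
                       for start, stride, stop in slices)
        return f"{base}?{var}{dims}"

    print(subset_url("http://server/data/sst.nc", "temp", (0, 1, 9), (5, 1, 5)))
    # → http://server/data/sst.nc?temp[0:1:9][5:1:5]
    ```

    Python clients such as pydap build these expressions for you behind ordinary array-slicing syntax, so the subsetting is transparent to the user.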
