Open Source Experiment Management Software?
Alea asks: "I do a lot of empirical computer science, running new algorithms on hundreds of datasets, trying many combinations of parameters, and with several versions of many pieces of software. Keeping track of these experiments is turning into a nightmare and I spend an unreasonable amount of time writing code to smooth the way. Rather than investing this effort over and over again, I have been toying with writing a framework to manage everything, but don't want to reinvent the wheel. I can find commercial solutions (often specific to a particular domain) but does anyone know of an open source effort? Failing that, does anyone have any thoughts on such a beast?"
"The features I would want would be:
- management of all details of an experiment, including parameter sets, datasets, and the resulting data
- ability to "execute" experiments and report their status
- an API for obtaining parameter values and writing out results (available to multiple languages)
- additionally (alternately?) a standard format for transferring data (XDF might be good)
- ability to extract selected results from experimental data
- ability to add notes
- ability to differentiate versions of software
- automatically run experiments over several parameters values
- distribute jobs and data over a cluster
- output to various formats (spreadsheets, Matlab, LaTeX tables, etc.)
- provide a fancy front-end (that can be done separately - I'm thinking mainly in terms of libraries)
- visualize data
- statistical analysis (although some basic stats would be handy)
Experience (Score:5, Insightful)
I have also done lots of comp sci empirical experiments. My experience is that the tools used for the experiments themselves are very ad hoc and not easily scriptable. Most of the time we have to tend the hour-long experiments to see what happened on the output and decide what to do next. And... the decision is often not clear cut; some sort of heuristic is needed. Not to mention the frustrations when errors occur (especially when the tool is buggy, which is very often the case in research settings). So, considering this, what I would do is construct a script and do the experiments in phases. Run it and see the result several days later.
I also noticed that one experiment is sometimes so radically different from the next that I doubt this is easily manageable.
Re:Experience (Score:3, Interesting)
What comes to mind when I think about experiment management software is unit testing software. Correct me if I'm wrong, but when you run empirical software experiments, you are essentially unit testing the software.
Something like Python, Perl, or TCL (probably Python-- powerful, easy to read) should suit you ideally. Other options include Ma
Re:Experience (Score:2)
Re:Experience (Score:4, Interesting)
Sorry, but I must disagree. Most of the time, research experiments != unit testing.
To illustrate: take a data mining project, for example. The first phase is data preparation -- which is easily scriptable. But how to prepare the data is a different story. We must examine the raw data case by case to decide how to treat it. For example: when to discretize and using what method (linear scale, log scale, etc.), when to reduce dimensionality, and so on. This requires human supervision.
Even after we do the data prep, we look at the result. If the cooked data has lost too much information in the prep stage, we have to do it again with different parameters. This is painful.
Then, next in the pipeline: what algorithm to use. This, again, depends on the characteristics of the cooked data. You know, some "experts" (read: grad students) will pick it using some "random" heuristics of their own, backed by some reasonable-sounding explanations.
If the result comes out and is not desirable, we might go back and try a different algorithm or different data prep parameters, and so forth...
Given these settings, I doubt there is a silver bullet for this problem...
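Still, to make the "experiments in phases" idea from the parent concrete, here is a minimal Perl sketch of a phased driver with a human checkpoint between data prep and modeling. The prep_data and run_model programs, their flags, and the file names are made-up placeholders, not real tools.

#!/usr/bin/perl
# Minimal sketch of a phased experiment driver (all program and file
# names are hypothetical placeholders).
use strict;
use warnings;

my %prep  = (discretize => 'log', reduce_dims => 50);   # chosen by a human, case by case
my $phase = shift @ARGV || 'prep';

if ($phase eq 'prep') {
    # Phase 1: data preparation, easily scriptable once the parameters are chosen.
    system("prep_data --discretize=$prep{discretize} "
         . "--reduce-dims=$prep{reduce_dims} raw.dat > cooked.dat") == 0
        or die "data prep failed: $?";
    print "Inspect cooked.dat, then rerun with 'model' to continue.\n";
}
elsif ($phase eq 'model') {
    # Phase 2: run the chosen algorithm on the cooked data.
    system("run_model --algo=kmeans cooked.dat > result.dat") == 0
        or die "model run failed: $?";
}

The point is only that the scriptable parts get scripted, while the "look at the result and decide" step stays with the human between phases.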
SpecTCL (Score:2)
The high energy folks also have a similar set of packages (as other nuclear labs probably do).
Re: Experience (Score:1)
> You left Ruby off the scripting language list.
Oh, the Humanity!
Object Modeling System (Score:5, Informative)
http://oms.ars.usda.gov/
How about this (Score:2)
A provenance server might handle the recording of queries, results etc. Not sure how many good open source ones there are.
Piracy is Your Only Option (Score:5, Funny)
2. It is impractical for you to continue writing your own software.
3. You cannot find open source software.
-------
Conclusion: Steal commercial software! :-)
Re:Piracy is Your Only Option (Score:3, Insightful)
4. Profit!!!!
Sorry, I've been reading slashdot too much and must append such an item to all lists I encounter. :P
And it's not stealing, it's copyright infringement. ;)
Seriously, though, I think using commercial software still won't cover all the bases. Alea said, "I can find commercial solutions (often specific to a particular domain)..." which I would assume means that there don't appear to be any general-purpose experiment packages.
As some others have already posted, 'experiments' c
Sharing is your best option (Score:1)
2.
3.
Conclusion: Share your software, start a new project, and see if other people are willing to help out.
Perl (Score:1, Insightful)
I find it to be an excellent language for maintaining data.
Re:Perl is only useful for maintaining your job (Score:5, Funny)
Re:Perl is only useful for maintaining your job (Score:2)
I have used it extensively for research projects (most of my work involves nonlinear optimization models), gluing together disparate applications and sources of data, and it has worked splendidly.
I also use C when it is appropriate and Java when it is appropriate. Frankly, Perl has time and again proven its worth and has been (for me) more often than not the right choice.
As you say, Perl syntax is looser than more strongly typed languages, bu
Re:Perl is only useful for maintaining your job (Score:1)
Funny, C gives you enough rope to shoot yourself.
Re:Perl is only useful for maintaining your job (Score:1)
Admit it, you don't know what all of those "$"s and "@"s mean, and you are afraid of them... they might sneak up on you in the night and make you think....
Perl is great for quick and dirty hacks. Bugzilla was written in Perl, which says a lot, and I know of corporate projects written (and working!) in Perl.
So you heard someone say "this is an unmaintainable language" and from then on you chant the mantra...
And what
dependencies (but not make) (Score:3, Interesting)
Ideally, I'd type make paper and it would start from the beginning stages of the experiment and go all the way through creating the paper. Moreover, if anything died along the way, I could fix the problem, type make again, and it would more or less pick up where it left off, not re-running things it had already done (unless they were affected by my fix).
But after playing with this for a few days, I became convinced that make wasn't up to snuff for what I wanted. I have this sort of 'attribute-value' dependency constraint: from one raw initial dataset, I create several cross-validation folds, each of which contains one training set and a couple of varieties of test set, with the fold and test-set type encoded in the filenames.
Now suppose that the way I actually run an experiment involves passing a test set and the corresponding training set to the model I'm testing, as a single command. Since, however, I have to run this over several folds (and other variations that I'm glossing over), I'd like to write an 'implicit rule' in the Makefile. This involves pattern-matching the filenames, but make's pattern-matching is very simple: you only get to insert one wildcard stem (the '%'). You might be thinking you could just write one such rule per case,
but then I have to copy the rule several times for each sort of test set, even when the command they run is the same. The underlying problem, I think, is that the pattern-matching in make's implicit rules is too simple. What I would rather have is some kind of attribute-value notation, so I could say something like
a rule where fileid corresponds to 'base.fold0' and whatever other file-identifying information is needed. This notation is loosely based on a natural-language attribute-value grammar.
Anyway, if anyone has any suggestions as to this aspect of the problem, I would be grateful
Re:dependencies (but not make) (Score:2)
Have you tried automake? (autotut [seul.org], autobook [redhat.com])
Re:dependencies (but not make) (Score:4, Interesting)
I just wrote two functions. (I wrote them in the shell, but if I were doing it again, I'd probably do it in Perl.) construct() simply makes a file if it is out of date (see example below). construct() is where all of your rules go: it knows how to transform a target filename into a list of dependencies and a command.
It uses a function called up_to_date() which simply calls construct() for each dependency, then returns false if the target is not up to date with respect to each dependency. If you don't do anything very sophisticated here, up_to_date will only be a few lines of code.
"construct" will basically replace your makefile. For example, if you did it in perl, you could write it something like this:
sub construct {
    local $_ = $_[0];                  # The target filename.
    if (/^base\.model(.)\.fold(\d+)\.test(.)\.run$/) {
        my @dependencies = ("base.fold$2.test$3.test",
                            "base.fold$2.train");
        if (!up_to_date($_,            # Output file.
                        @dependencies, # Input files.
                        "model$1")) {  # Rerun if the program changed, too.
            system("model$1 @dependencies > $_");
        }
    }
    elsif (/^....$/) {
        # ... rules for the other kinds of target go here ...
    }
}
What you've gained from this is a much, much more powerful way of constructing the rule and the dependencies from the target filename. Of course, your file will be a little harder to read than a Makefile--that's what you pay for the extra power. But instead of having many duplicate rules in a makefile, you can use regular expressions or whatever kind of pattern matching capability you want to construct the rules.
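For what it's worth, a usage sketch (the model and fold numbering below is just illustrative): the top-level driver only has to name the final targets, and construct() rebuilds whatever is stale underneath them.

# Hypothetical driver: request the final targets; construct() figures
# out (and rebuilds) whatever is out of date underneath them.
for my $model (1 .. 3) {
    for my $fold (0 .. 9) {
        construct("base.model$model.fold$fold.testA.run");
    }
}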
R? (Score:4, Informative)
Re:R? (Score:1)
Re:R? (Score:1)
Ant, with some tweaking. (Score:5, Interesting)
What I would end up doing is set up an Ant build file for each experiment, under each algorithm.
You can then update property files with a quick shell script (or something along those lines) at the end of each data set, as well as having build/run times that Ant can retrieve for you. Good solution, and you aren't reinventing the wheel.
Requires Java, which depending upon your ideology is either a good thing or a curse.
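As a rough sketch of the property-file idea in Perl rather than shell (the build file location, property names, and the 'run' target are assumptions, not a real layout), the sweep script could be as small as:

# Drive one Ant build file per experiment over a small parameter grid.
use strict;
use warnings;

for my $lr (0.01, 0.1) {
    for my $k (5, 10, 20) {
        # Ant lets you override properties on the command line with -D.
        my $rc = system("ant", "-f", "experiments/algo1/build.xml",
                        "-Dlearning.rate=$lr", "-Dclusters=$k", "run");
        warn "run failed for lr=$lr k=$k\n" if $rc != 0;
    }
}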
Idea (Score:2)
Good Luck
Re:Idea (Score:2)
To easily store arbitrary data structures, try XML serialization (Java).
AppLeS? (Score:2, Informative)
See here [sdsc.edu] for other projects from the GRAIL lab at SDSC and UCSD.
Uh-huh (Score:5, Funny)
Ne3D H3lp WIt M4H H4x0RiN!!!!!
I mean, let's face it, much of what modern hacking of closed-source software consists of is throwing a variety of shit against a variety of programs in a variety of configurations, seeing what breaks, and then following up to make an exploit out of it.
While this probably isn't the case here, it's very hard to read that note and not snicker just a tiny, tiny bit . . .
Re:Uh-huh (Score:2)
Re:Uh-huh (Score:1)
Or at least, I read the story and immediately thought of applications to my own projects (I'm a Research Assistant at my University, and I'm a little tired of writing Perl scripts to batch long jobs with combinatorial arguments).
If such a tool exists, I too would be interested in it. I think it rash to assume that the poster is looking for exploit automation.
Oh that's easy.... (Score:5, Funny)
But what you are looking for, sir, is the cheap labor commonly known as a Graduate Student
In fact, I'm afraid to report that you are a bit behind the times in this department as these "Graduate Student" devices are quite common at universities and research labs.
Re:Oh that's easy.... (Score:3, Funny)
Fun with your new head! (Score:2)
http://catalog.com/hopkins/text/head.html [catalog.com]
Re:Oh that's easy.... (Score:5, Funny)
That does have its advantages, though you should be cautious. In my experience those models often have a large number of bugs in their systems and tend to be a lot more likely to pick up viruses as well.
This shouldn't be a problem for most operations, but occasionally, if you try to interface them with your other components, you may find your other systems becoming infected as well. In extreme cases you may also find that interfacing with these systems causes additional child processes to be created. These child processes are extremely hard to get rid of; early on you may be able to simply kill them, but this command becomes extremely impractical after a few months of operation. These processes are known to take up huge amounts of resources and maintenance, and often take the better part of two decades to subside (they're still present, but resource demands drop considerably). Of course, many of these risks can be alleviated by using a proper wrapper class while working with these "graduate student" systems.
Re:Oh that's easy.... (Score:1)
ROOT (Score:5, Informative)
We experimental high-energy physics folk have been using it (and PAW) for some time. It offers scripting and histogramming and analysis and a bunch of other features. And it's open source. Check it out.
Re:ROOT (Score:1)
suggest jdb for managing individual experiments (Score:4, Informative)
I've been very happy using jdb (see below) to handle individual experiments, and directories and shell scripts to handle sets of experiments.
JDB is a package of commands for manipulating flat-ASCII databases from shell scripts. JDB is useful to process medium amounts of data (with very little data you'd do it by hand, with megabytes you might want a real database). JDB is very good at doing things like:
For more details, see http://www.isi.edu/~johnh/SOFTWARE/JDB/.
A project with similar goals (Score:1, Informative)
http://sourceforge.net/projects/pythonlabtools/
Open-Source-Experiment Management-Software? (Score:1)
FBI's Carnivore.
(Well, that's the way the headline parsed out for me the first time I glanced at it...)
Define a common meta-data set. (Score:2)
OK, you take the management piece into a meta-environment like web e-commerce. Each iteration produces a transaction, essentially a line in a table containing the common meta-elements, and then you perform your management via linked queries on this data set, à la Napster.
If all of your data engines are connected (Intranet), the o
tcltest (Score:3, Informative)
Eclipse as testing platform (Score:1)
It's a good platform for managing a collection of custom Ant build scripts if you decide to go that direction (assuming you're in Java, of course...)
If you'd prefer something more specialized, the plugin architecture isn't bad and could save some time with interface work. Especi
Need Open Source data reduction too... (Score:1)
Re:Need Open Source data reduction too... (Score:1)
Matfud
Re:Need Open Source data reduction too... (Score:1)
Sounds like High Energy Physics (Score:4, Informative)
And the "middleware" you need are the GNU tools gluing together the specialized programs that do the specific things you want.
We have been using unix for a long time, and many of us prefer the small-targeted-tools philosophy to a single monolithic package.
I will repeat, and you can stop reading now if you want. The GNU tools, unix, and specialized scriptable programs are already the "middleware" you seek.
If you are just missing some of the tools in the middle, here are the ones used in HEP. You might find more appropriate ones closer to whatever discipline you work in.
All the basic unix text processing tools and shells.
bash. csh. Perl. grep. sed. and so on.
Filename schemes ranging from appropriate to clever to bizarre.
(See other posts here)
Make it so that all the inputs you want to change can be done on the command line or with an input steering text file.
The same tools, combined with some simple C code, produce formats for spreadsheets or PAW or ROOT or whatever visualization or post-processing thing you need done. That gets you ntuple and histogram support automatically, which might be all you need.
Almost always I choose space delimited text for simple output to push into PAW, ROOT, or spreadsheets. I keep a directory of templates to help me out here.
Some people use full blown databases to manage output. For a long time there have been databases specific to the HEP needs. I recently have started using XML-style data formats to encapsulate such things in text files if the resulting output is more complicated than a single line. You mention XDF, sure, that sounds like the same idea.
CONDOR (U Wisconsin) has worked nicely for me for clustering and batch job submission when I need to tool through 100 data files or 100 different parameter lists on tens of computers. The standard unix "at" is good enough in a pinch if you play on only 5 computers or so.
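In a pinch, the "at" route can be as simple as this sketch (the myanalysis program and the params/results file layout are placeholders): each parameter file gets handed to the at queue so the runs trickle through in the background.

# Hand each parameter file to the standard unix 'at' queue.
use strict;
use warnings;

mkdir "results" unless -d "results";
for my $params (glob "params/run*.txt") {
    (my $out = $params) =~ s{^params/}{results/};
    open(my $at, "|-", "at", "now") or die "cannot start at: $!";
    print $at "myanalysis $params > $out 2>&1\n";
    close $at;
}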
HEP folks use things like PAW and ROOT (find them at CERN) which contain many statistical analysis tools and monstrous computation algorithms. Or at least ntuples, histograms, averages, and standard deviations. You could go commercial or use the GSL here if you prefer such things.
CVS or similar to take care of code versions.
Don't forget to comment your code.
We write our own code and compile from fortran or c or c++ for most everything else.
Output all plots to postscript or eps.
LaTeX is scriptable.
And use shells, grep, perl to glue it all together. Did I mention those already?
I get a good night's sleep more often than not.
And decide what to do next after coffee the following morning.
This is where you put your brain, and if you have done the above well enough, this is where you spend most of your time.
The answer I get each morning (as another post suggests) is always so surprising that I need to start from scratch anyway.
I bet that is what you are doing already. Probably no monolithic software will be as efficient as that in a dynamic research environment.
What did I miss from your question?
Oh, yes. Get a ten-pack of computation notebooks with 11 3/4 x 9 1/4 inch pages (if you print things on standard US letter paper). And lots of pens. And scotch tape to tape plots into that notebook. Laser printer and photocopier. Post-it notes to remind yourself what you wanted to do next (or e-mail memos to yourself). Maybe I should have listed this first.
Good luck.
Re:Sounds like High Energy Physics (Score:1)
I'd emphasize that using a scriptable graphing/postprocessing program (I used to use gnuplot and Octave; there are many interesting options more widely documented now) is really key.
Nothing like starting a script and being able to walk away from it for the afternoon, or the night, or the weekend...
Re:Sounds like High Energy Physics (Score:2)
I agree that a "monolithic" solution isn't going to do much for you. That's why I'm thinking more in terms of middleware, something that will help me bind these tools together. I envision that some scripting would be necessary to bind in new tools and experimental software for each new project,
Re:Sounds like High Energy Physics (Score:1)
I think you're asking for a very powerful, very well-designed IDE with good integration with configuration management, software instrumentation, etc.
Re:Sounds like High Energy Physics (Score:1)
But... Isn't that YOUR Job? (Score:1)
Just Kidding
But if you're determined to let the computer do the work, perhaps some form of genetic algorithm could be applied here. If you can break your domain down into something that can be decomposed well enough and tested against selection criteria, there are lots of tools and research available. If you have an API to
schema (Score:2, Informative)
Table: experiments
----
exprmntID
exprmntWhen
exprmntDescr
outcome
Table: params
----
paramID
exprmntRef
paramName
paramValue
Table: dataSet
----
dataSetID
filePath
datasetDescr
isGenerated
CRC
Table: dataSetUsed
----
exprmntRef
dataSetRef
Table: softwareVers
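As a rough sketch, here are the first two of these tables in something lightweight like SQLite via Perl's DBI (DBD::SQLite is assumed, and the column types are guesses):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=experiments.db", "", "",
                       { RaiseError => 1 });

$dbh->do(q{
    CREATE TABLE IF NOT EXISTS experiments (
        exprmntID    INTEGER PRIMARY KEY,
        exprmntWhen  TEXT,
        exprmntDescr TEXT,
        outcome      TEXT
    )
});
$dbh->do(q{
    CREATE TABLE IF NOT EXISTS params (
        paramID    INTEGER PRIMARY KEY,
        exprmntRef INTEGER REFERENCES experiments(exprmntID),
        paramName  TEXT,
        paramValue TEXT
    )
});

# Record one run and its parameters.
$dbh->do("INSERT INTO experiments (exprmntWhen, exprmntDescr) VALUES (datetime('now'), ?)",
         undef, "k-means, fold 0");
my $id = $dbh->last_insert_id(undef, undef, "experiments", "exprmntID");
$dbh->do("INSERT INTO params (exprmntRef, paramName, paramValue) VALUES (?, ?, ?)",
         undef, $id, "clusters", 10);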
Re:Clear TOS (Score:2)
Re:Clear TOS (Score:1)
I never was much of a fan of schema inheritance. Most examples I have seen were based on bad designs, IMO. And funky datatypes decrease porting and sharing of the data to other DBs.
Plus, I think they wanted something "lite" in the DB department based on one comment, and Postgres has a bit more of a learning curve.
you need... (Score:2, Funny)
configuration management, build scripts, etc... (Score:2, Informative)
management of all details of an experiment, including parameter sets, datasets, and the resulting data
This can be handled by an ad-hoc database, a flat file in most cases. If you were a Windows power user, you'd spend an hour or two putting together something in Access for it.
ability to "execute" experiments and report their status
make with a little scripting, or whatever you use as a build system.
an API for obtaining parameter values and writing out results (a
that's what UNIX is there for (Score:5, Informative)
Distribution of jobs, running things with multiple parameter values, etc., can all be handled smoothly from the shell. This is really the sort of thing that UNIX was designed for, and the entire UNIX environment is your "experiment management software".
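For example, a minimal sketch that just prints one command line per parameter combination, ready to pipe into sh, xargs -P, or a batch queue (the myexperiment program, its flags, and the data/results layout are made up):

# Emit one command line per parameter combination.
use strict;
use warnings;

my @datasets = glob "data/*.dat";
my @alphas   = (0.1, 0.5, 1.0);
my @seeds    = (1 .. 5);

for my $d (@datasets) {
    for my $a (@alphas) {
        for my $s (@seeds) {
            (my $tag = $d) =~ s{.*/|\.dat$}{}g;   # strip directory and suffix
            print "myexperiment --alpha=$a --seed=$s $d > results/$tag.a$a.s$s.out\n";
        }
    }
}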
Re:that's what UNIX is there for (Score:2, Interesting)
What is it intrinsically about Windows that makes it "no good for this sort of thing"? Windows provides all the system services you need to do these tasks, and all the tools you mention are available natively for Windows. Come to think of it, they're avai
Re:that's what UNIX is there for (Score:2)
Yes, but are they built-in?
GNU just works out of the box. Windows, UNIX, and any other commercial solution just seem to miss the whole point of having a computer: processing data. Which is why they forget to include a spreadsheet and a database and a compiler and scripting languages and the various other tools that are a requirement for this data processing we all love to do so much.
Re:that's what UNIX is there for (Score:1)
Yes, it's inconvenient for a Windows user to download and install the extra applications (a better shell, all the command-li
Re:that's what UNIX is there for (Score:2)
Windows standard edition alone cannot do what you're talking about. You have to have their professional edition with SQL, Excel, etc. This is several hundred dollars in additional software on top of an expensive "professional" version just so you have a chance at getting stability. You wouldn't dare attempt to manage important data on a home version of a Microsoft OS. When your standard RedHat 9 download includes EVERYTHING you need and is
Re:that's what UNIX is there for (Score:2)
So, you are saying that you have never actually run large scale computational experiments, and you don't actually have any recommendations for how to run them on Windows, but based on a list of Windows "system services" you think it should be pretty good for that.
Well, I have run large
Re:that's what UNIX is there for (Score:1)
What I meant by "system services" was services offered to user programs by
Re:that's what UNIX is there for (Score:2)
Re:that's what UNIX is there for (Score:1)
Re:that's what UNIX is there for (Score:2)
I played with Xmouse when I was using Windows, and I agree with your assessment. My wife has messed with various virtual-screen switchers in Windows, and all of them seem to have problems with some "too clever" apps, just like Xmouse. I've seen a lot of folks at my university play with various X "servers" (well, mostly Hum
Re:that's what UNIX is there for (Score:1)
As for messing with Windows to make it like UNIX, that depends on what you mean. Trying to make the GUI behave more like X is a fruitless endeavor. But I use zsh and Python every day. Windows' glass-TTY sucks for cut and paste, but it works just fine for everything else.
Re:that's what UNIX is there for (Score:1)
Make is fantastic for organizational purposes. Makefiles are basically a language for describing dependencies. If you have a directory-based structure, it works wonderfully. You just have to make sure your stuff is orthogonal
I've been working on this (Score:1)
John
ossa ools! (Score:2)
It's not strange, for example, for me to use Python to generate the actual program runs, the shell to actually manage the runs and move the input/output data files, and then any of several graphics programs to handle the output (and often output graphs are produced automatically as the programs run).
This gives me a pile of flexibility which is often useful. For instance, when doin
Re:ossa ools! (Score:1)
Those of you that are finding XML useful for this sort of thing, what tools and ideas are you using?
Re:ossa ools! (Score:2)
Then I use xslt (again, ad hoc) to digest this and produce output (often in html these days) that I can look at.
I could do it all in other ways - the advantage of xml is that I can describe the markup somewhere so later on I'll not forget what the data actually was, and that I can use XSLT wit
Experimental tools (Score:2, Informative)
A well-thought-out example of how to set up your code for experimental work is the lemur [cmu.edu] toolkit from CMU. This toolkit has a concept of "parameter" files that is very handy.
SMIRP (Score:2, Informative)
It started out as a very simple system that didn't act as much more than a set of tables with some simple linking structures. On top of that is an alerting system, (so you can track new experiments being done) a full text index, bots for automating certain procedures, and a system for transferring data to Excel.
What's surprising is that for the most part, the underlying structure stayed exactly the same even though we've been running all the oper
there are many projects developing such software! (Score:4, Informative)
Funding agencies in the USA (NSF, NIH) and Europe have recently decided to target the construction of such software, and many competing projects have been given grants, most of which involve the production of open source software.
Relevant keywords are "eScience", "Experimental Data Management", "Experimental Metadata", and to some extent "Grid Computing".
Here is a paper which lays out the program of research [semanticgrid.org].
I work for one such NSF & NIH funded project [fmridc.org] at Dartmouth College. We're developing such a tool [fmridc.org]: Java-based, completely open, available at sourceforge, currently in alpha, to be released for fMRI use in July, but designed from the start to be generalizable for all of experimental science. This is built on top of a pre-existing framework [stanford.edu] for semantic data management and modeling from Stanford.
I'll try to list some of the features relevant to your needs:
Finally, I would like to stress that our project is one of many, and that if it doesn't meet your needs, within a year there will be many competing "eScience" toolkits.
You may contact me for more information by reversing the following string: "ude.htuomtrad@exj".
I Develop This Kind of Software (Score:3, Informative)
The Computer Aided Engineering (CAE) world has much the same problem you do.
They model their products with several different analysis codes, each with its own input and output format. This generates a gob of data, and is currently managed in ad hoc ways, is not easy to integrate with other results and wastes the time of lots of engineers.
The product we've come up with to manage the models, the process for executing the models, and the data generated by running the models is a software framework called CoMeT [cometsolutions.com] (Computational Modeling Toolkit).
We are also capable of managing different versions of the model, parameter studies, and some basic data mining. The whole thing is scriptable with Scheme.
Unfortunately, we are a commercial software company, and the software is still under development, although everything I mentioned above can currently be done. We are mostly working on a front end now, although we still need to make a few improvements to the framework and add support for many analysis codes.
The reason I'm replying to this is that your list of requirements is a perfect subset of ours. We are aiming our product at CAE in the mechanical and electrical domains (Mechatronics).
I know, it's not free, but we feel we've done some very innovative things and it has taken several people many years of low pay to get this far. We really want to make some money off it eventually....
If you want more information check out the web-site or email me here. We're in need of proving this technology in a production environment so maybe we can work something out.
-Craig.
Might be suitable? (Score:2, Informative)
less tools (Score:1)
I think someone told all the computer scientists that there's a theoretical way to write a program that does everything. I'm a computer scientist, and it's clearly impossible to generalize that greatly.
In fact it is so dangerous that the general-purpose OS is the number one cause of our downward spiral in software quality. It took BeOS a short time to write their general-purpose OS. I think it's silly to think it's that monolithic of a project tha
ExpLab (Score:2, Informative)
I'm in precisely the same situation as Alea, so I read the suggestions here with considerable interest.
I'd like to mention ExpLab [sourceforge.net].
Though I haven't used ExpLab yet, these folks have been associated with other very high quality work (CGAL) so I expect good things. Here are three goals they list for the project:
workflow (Score:2)
Ralf
Experiment Management Software (Score:1)
Re:Welcome (Score:1)
The rumor is, it's something called **work**.
From the OP: I have been toying with writing a framework to manage everything, but don't want to reinvent the wheel.
Seems to me that the OP is more than capable of doing the work, but he is smart for trying to find an existing solution. The rumor is, it's something called **working smarter**, not **working harder**. :-)
Re:Welcome (Score:2)
rubbish (Score:2)
Again, if such a solution existed, it would already be in place. This whiner complained about the amount of work that clearly comes with the job. He just wants to go home earlier... don't we all. Nothing smart in that.
Smarter? that's funny (Score:2)
Re:Welcome (Score:2)
Re:Satania is a good choice. (Score:1)
Re:Satania is a good choice. (Score:1)