Science

Open Source Experiment Management Software?

Alea asks: "I do a lot of empirical computer science, running new algorithms on hundreds of datasets, trying many combinations of parameters, and with several versions of many pieces of software. Keeping track of these experiments is turning into a nightmare and I spend an unreasonable amount of time writing code to smooth the way. Rather than investing this effort over and over again, I have been toying with writing a framework to manage everything, but don't want to reinvent the wheel. I can find commercial solutions (often specific to a particular domain) but does anyone know of an open source effort? Failing that, does anyone have any thoughts on such a beast?"

"The features I would want would be:

  • management of all details of an experiment, including parameter sets, datasets, and the resulting data
  • ability to "execute" experiments and report their status
  • an API for obtaining parameter values and writing out results (available to multiple languages; sketched below)
  • additionally (alternately?) a standard format for transferring data (XDF might be good)
  • ability to extract selected results from experimental data
  • ability to add notes
  • ability to differentiate versions of software
In my dreamworld, it would also (via plugin architecture?) provide these:
  • automatically run experiments over several parameter values
  • distribute jobs and data over a cluster
  • output to various formats (spreadsheets, Matlab, LaTeX tables, etc.)
Things I don't think it needs to do:
  • provide a fancy front-end (that can be done separately - I'm thinking mainly in terms of libraries)
  • visualize data
  • statistical analysis (although some basic stats would be handy)
The amount of output data I'm dealing with doesn't necessitate database software (some sort of structured markup is ok for me), but some people would probably like more powerful storage backends. I can see it as experiment management 'middleware'. There's no reason such software should be limited to computer science (nothing I'm contemplating is very domain specific). I can imagine many disciplines that would benefit."
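
To make the API item above concrete, here is a purely hypothetical sketch of the kind of calls being described (none of these names refer to an existing package):

    # Hypothetical experiment-management API -- illustrative only.
    my $exp = Experiment->load("trials/run42.xml");   # parameter set + metadata
    my $k   = $exp->param("num_clusters");            # obtain a parameter value
    # ... run the algorithm with $k ...
    $exp->record(accuracy => 0.93, seconds => 412);   # write out results
    $exp->note("rerun after fixing the dataset loader");
    $exp->save;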
  • by Anonymous Coward on Saturday April 19, 2003 @08:49PM (#5766650)
    I'm also an empirical computer scientist, and another aspect I would look for is handling dependencies. Make is the standard tool for doing this, but it's not up to this task.

    Ideally, I'd type make paper and it would start from the beginning stages of the experiment and go all the way through creating the paper. Moreover, if anything died along the way, I could fix the problem, type make again, and it would more or less pick up where it left off, not re-running things it had already done (unless they were affected by my fix).

    But after playing with this for a few days, I became convinced that make wasn't up to snuff for what I wanted. I have these sorts of `attribute-value' dependency constraints. From one raw initial dataset, I create several cross-validation folds, each of which contains one training set and a couple of varieties of test set. The filenames might look like:

    base.fold0.testA.test
    base.fold0.testB.test
    base.fold0.train
    Now suppose that the way I actually run an experiment involves passing a test set and the corresponding training set to the model I'm testing, a command like:
    modelX base.fold0.testA.test base.fold0.train > base.modelX.fold0.testA.run
    Since, however, I have to run this over several folds (and other variations that I'm glossing over), I'd like to write an 'implicit rule' in the Makefile. This involves pattern-matching the filenames, but make's pattern-matching is very simple: you get to insert one wildcard (.* in regex terms, spelled %) in each string, and every % must stand for the same text. Given that, there's no way I can specify the command I have above.

    You might be thinking you could do:

    %.modelX.testA.run : %.testA.test %.train
    but then I have to copy this rule once for each sort of test set, even if the command each copy runs is the same.

    The underlying problem, I think, is that the pattern-matching in make's implicit rules is too simple. What I would rather have is some kind of attribute-value thing, so I could say something like

    { fileid=$1 model=modelX test=$2 filetype=run } : { fileid=$1 test=$2 filetype=test } { fileid=$1 filetype=train }
    where fileid corresponds to 'base.fold0' and whatever other file identifying information is needed.

    This notation is sort of based on a natural language attribute-value grammar.
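
    The closest I can get is to abandon make's % patterns and do the matching with full regular expressions in a script. A rough Perl sketch (up_to_date() is assumed, not shown):

    # Rough sketch: one regex recovers all the "attributes" at once.
    my $target = "base.modelX.fold0.testA.run";
    if ($target =~ /^(\w+)\.model(\w+)\.(fold\d+)\.(test\w)\.run$/) {
        my $fileid = "$1.$3";                              # "base.fold0"
        my @deps   = ("$fileid.$4.test", "$fileid.train");
        system("model$2 @deps > $target")
            unless up_to_date($target, @deps);
    }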

    Anyway, if anyone has any suggestions as to this aspect of the problem, I would be grateful.

  • by Xerithane ( 13482 ) <xerithane.nerdfarm@org> on Saturday April 19, 2003 @08:50PM (#5766653) Homepage Journal
    We do something that almost parallels this, and we still haven't had the time to complete the Ant setup. The basic gist of it is that Ant has properties files that can contain any number of parameters, along with embedded XSLT functionality. This allows Ant to generate new build.xml files (the Ant build file) and run them on the fly, given a set of user-entered commands, environment variables, or file parameters. The parameter files are easy to modify and update, and combined with CVS you can even do version control on the different experiments.

    What I would end up doing is to set up an Ant build file for each experiment, under each algorithm:

    Algorithm/experiment_dataset1.properties
    Algorithm/experiment_dataset2.properties

    Then you can update the property files with a quick shell script (or something along those lines) at the end of each data set, and Ant can retrieve build/run times for you. It's a good solution, and you aren't reinventing the wheel.
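
    A bare-bones sketch of what I mean -- the property names and the run_algorithm executable are placeholders, not anything real:

    # Algorithm/experiment_dataset1.properties
    dataset=data/set1.csv
    learning.rate=0.05
    iterations=1000

    <!-- build.xml sketch: load the per-experiment properties and run it -->
    <project name="experiment" default="run">
      <property file="Algorithm/experiment_dataset1.properties"/>
      <target name="run">
        <exec executable="./run_algorithm">
          <arg value="${dataset}"/>
          <arg value="${learning.rate}"/>
          <arg value="${iterations}"/>
        </exec>
      </target>
    </project>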

    Requires Java, which depending upon your ideology is either a good thing or a curse. :)
  • Re:Experience (Score:3, Interesting)

    by jkauzlar ( 596349 ) on Saturday April 19, 2003 @08:55PM (#5766685) Homepage
    I agree with the parent post after giving the problem a little thought. There may be tools available, but I think what you need is to set up scripts for your experiments.

    What comes to mind when I think about experiment management software is unit testing software. Correct me if I'm wrong, but when you run empirical software experiments, you are essentially unit testing the software.

    Something like Python, Perl, or Tcl (probably Python -- powerful, easy to read) should suit you ideally. Other options include build tools like make, or Ant (with JUnit it would work great!).

    With any of these you could make use of any existing command-line or scriptable utilities for conversion or producing data files or database data.
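
    For example, a tiny Perl driver along these lines (the program name and parameter values are invented) would sweep a grid of parameters and save each run's output to its own file:

    #!/usr/bin/perl
    # Hypothetical driver: run every parameter combination, one output file each.
    use strict;
    use warnings;

    my @rates = (0.01, 0.05, 0.1);    # invented parameter values
    my @folds = (0 .. 4);
    for my $r (@rates) {
        for my $f (@folds) {
            my $out = "results/rate$r.fold$f.out";
            system("./myalgo --rate $r --fold $f > $out") == 0
                or warn "run failed: rate=$r fold=$f\n";
        }
    }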

    Just my 2 cents. Hope this helps.

  • Re:Experience (Score:4, Interesting)

    by robbyjo ( 315601 ) on Saturday April 19, 2003 @09:52PM (#5766876) Homepage

    Sorry, but I must disagree. Most of the time, a research experiment != unit testing.

    To illustrate, take a data mining project. The first phase is data preparation -- which is easily scriptable. But how to prepare the data is a different story. We must examine the raw data case by case to decide how to treat it: when to discretize and with what method (linear scale, log scale, etc.), when to reduce dimensionality, and so on. This requires human supervision.

    Even after we do the data prep, we look at the result. If the cooked data has lost too much information in the prep stage, we have to do it again with different parameters. This is painful.

    Then, next in the pipeline: what algorithm to use. This, again, depends on the characteristics of the cooked data. You know, some "experts" (read: grad students) will pick one using some "random" heuristics of their own, given some reasonable explanations.

    If the result comes out and is not desirable, we might go back for a different algorithm or different data prep parameters, and so forth...

    Given these constraints, I doubt that there is a silver bullet for this problem...

  • by Anonymous Coward on Sunday April 20, 2003 @12:00AM (#5767351)
    I ran into this problem when I was in graduate school, too. What I eventually did was to abandon make because of the limitations you are running into, and construct a special-purpose experiment running utility that would know about all the predecessors, etc. It turned out not to be too hard, actually. However, if you don't know perl or another language that gives you good pattern matching and substring extraction capability, then this will be very hard to do.

    I just wrote two functions. (I wrote them in the shell, but if I were doing it again, I'd probably do it in perl.) construct() simply makes a file if it is out of date (see example below). construct() is where all of your rules go: it knows how to transform a target filename into a list of dependencies and a command.

    It uses a function called up_to_date() which simply calls construct() for each dependency, then returns false if the target is not up to date with respect to each dependency. If you don't do anything very sophisticated here, up_to_date will only be a few lines of code.
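
    A bare-bones up_to_date() might look something like this in perl (just a sketch, using the -M file-age operator):

    # Sketch of up_to_date(): build each dependency, then report whether
    # the target is at least as new as all of them.
    sub up_to_date {
        my ($target, @deps) = @_;
        construct($_) for @deps;       # bring each dependency up to date first
        return 0 unless -e $target;    # a missing target is never up to date
        for my $dep (@deps) {
            # -M is the age in days since modification; smaller means newer.
            return 0 if !-e $dep || -M $dep < -M $target;
        }
        return 1;
    }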

    "construct" will basically replace your makefile. For example, if you did it in perl, you could write it something like this:

    sub construct {
        local $_ = $_[0];                        # Access the argument.

        if (/^base\.model(.)\.fold(\d+)\.test(.)\.run$/) {
            my @dependencies = ("base.fold$2.test$3.test",
                                "base.fold$2.train");
            if (!up_to_date($_,                  # Output file.
                            @dependencies,       # Input files.
                            "model$1")) {        # Rerun if prog changed, too.
                system("model$1 @dependencies > $_");
            }
        }
        elsif (/^....$/) {                       # ... Check other patterns. ...
        }
    }

    What you've gained from this is a much, much more powerful way of constructing the rule and the dependencies from the target filename. Of course, your file will be a little harder to read than a Makefile--that's what you pay for the extra power. But instead of having many duplicate rules in a makefile, you can use regular expressions or whatever kind of pattern matching capability you want to construct the rules.
  • by ExoticMandibles ( 582264 ) on Sunday April 20, 2003 @10:34AM (#5768574)
    "If you are using Windows, switch to UNIX. Windows may be good for starting up MS Office, but it is no good for this sort of thing. If you absolutely must use Windows for data analysis, stick your data into a relational database or Excel spreadsheets."

    What is it intrinsically about Windows that makes it "no good for this sort of thing"? Windows provides all the system services you need to do these tasks, and all the tools you mention are available natively for Windows. Come to think of it, they're available for OS/2, QNX, Mac OS X, and nearly every other desktop operating system out there. One could erase every mention of UNIX-specificness from your post, and not only would your post still hold true, it would be an improvement. Your knee-jerk UNIX advocacy, nestled in and disguised as helpful advice, is a disservice to the original poster.

    Suggesting that the original poster must be using UNIX in order to get their work done is wrong in several senses of the word: it is not factual, and it is irresponsible. On the contrary -- I am certain that their current choice of operating system is entirely up to the task. He or she should feel no obligation to switch.
