doit: a Python alternative to make

In an earlier post, I demonstrated the wonderfulness that is having a build system and introduced the venerable make utility. You can use it to add plumbing to your analysis pipeline that keeps track of which analysis steps have been run for which subjects, and whether some scripts have changed and should be run again. When properly implemented, it can save a lot of headaches, especially when deadlines are approaching, by making sure that everything is always ‘up to date’.

However, make is not the easiest tool to wrap your head around. The syntax is archaic: we needed to add “phony” targets and use a custom-made, magical-looking function just to iterate over subjects. So let me introduce you to another tool, called doit, which can be easier to use, especially if you are familiar with Python syntax.

Installation

Unlike make, doit is unlikely to be pre-installed on your system. To install it, make sure you have Python and its package manager pip installed. If that is the case, you can run:

$ pip install doit

Preparing the analysis pipeline

In order to exploit doit to its fullest, you should organize your analysis pipeline so that you can easily perform a single step of the pipeline on a single subject. For example, have functions with names like filter_data() that take the name of a subject as an argument. In the example analysis pipeline used in this tutorial, I’ve organized everything into scripts that can be run from the command line and take the subject name as a command line argument. See the make tutorial for the why and how.
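
For reference, such a script could look roughly like this. This is only a sketch of the command-line pattern; the actual scripts in the download below are different:

import sys

# The subject name comes in as the first command line argument,
# e.g. 'C' when the script is called as: python filter.py C
subject = sys.argv[1]

# Read the data for this subject...
with open('%s-data.txt' % subject) as f:
    data = [float(line) for line in f]

# ...this is where the actual analysis step (filtering, say) would happen...
filtered = data

# ...and write the result to a new file for the next step to pick up.
with open('%s-filtered_data.txt' % subject, 'w') as f:
    for value in filtered:
        f.write('%g\n' % value)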

In my example, I have 3 analysis steps that need to be performed on each of 10 subjects (named A through J) and a “grand average” step that combines the data from all subjects and computes a summary statistic. I’ve organized them into the following files:

data.py
filter.py
subject_average.py
grand_average.py

You can download them here. To run, for example, the filtering step of the analysis on subject C, you can do:

$ python filter.py C

Writing a dodo file

Similar to make’s Makefile, you tell the doit tool about your pipeline in a file called dodo.py. Since this is a regular Python code file, it uses Python syntax, so it pays off to read up on that if you’ve never used it before.

The doit tool deals with units it calls ‘tasks’. Tasks are created in the dodo.py file by writing functions whose names start with task_. These functions return one or more dict objects that describe the task(s) to be executed.
In our case, a task would be running a single step of the analysis pipeline on a single subject. We can then put this in a loop to generate a task for each subject.

This all sounds complicated, but it actually isn’t that bad when you take a look at the code. So, without further ado, here is how we can write a task that performs the first step of the analysis (running the data.py script) on a single subject:

def task_data():
    """Example: run step 1 for a single subject"""
    return {
        'file_dep': ['data.py'],   # dependencies
        'targets': ['A-data.txt'], # files produced
        'actions': ['python data.py A'],
    }

As you can see, it is a function named task_data that returns a Python dictionary object containing a description of the task to be performed. The fields of the dictionary are:

  1. file_dep: a list of files that are needed to perform the action. If any of these files change, the analysis step should be re-run.
  2. targets: a list of files that are produced by the analysis step.
  3. actions: a list of actions to perform to actually do the analysis step. In this case, we run the command python data.py A.
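
As an aside, an action does not have to be a shell command string: doit also accepts a plain Python function as an action. This tutorial sticks to shell commands throughout, but a sketch of a callable-based version of the task above could look like this (create_data is a made-up stand-in for the real work):

def task_data():
    """Example: the same task, but with a Python callable as the action"""

    def create_data():
        # Stand-in for the real work: in practice we would import data.py
        # and call a function from it for subject A here.
        with open('A-data.txt', 'w') as f:
            f.write('0.0\n')

    return {
        'file_dep': ['data.py'],
        'targets': ['A-data.txt'],
        'actions': [create_data],  # a plain function is also a valid action
    }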

We can now run:

$ doit

This performs all the tasks we defined in the dodo.py file. In this case, it runs the data.py script. The output will be:

.  data

The dot means that the task called data was executed. If we run doit again, the output changes to:

-- data

The two dashes mean that doit detected that the data.py script hasn’t changed and A-data.txt is up to date, so there is no need to perform the action again. We don’t need to worry about doit redoing lengthy computational steps unnecessarily!
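
For example, if we make even a small edit to data.py (by default, doit compares the contents of the dependency files, not just their timestamps), the task is no longer considered up to date and will be run again, something like:

$ echo "# a trivial change" >> data.py
$ doit
.  data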

Let’s modify the dodo.py file so that the first analysis script is run for all the subjects, using a loop:

# List of all the subjects
subjects = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']


def task_data():
    """Step 1: generate some data for the subject"""
    for subject in subjects:
        yield {
            'name': subject,
            'file_dep': ['data.py'],
            'targets': ['%s-data.txt' % subject],
            'actions': ['python data.py %s' % subject],
        }

The yield keyword, if you are not familiar with it, is like a return that doesn’t exit the function. The loop keeps running and the function keeps ‘throwing up’ dict objects until the loop finishes. Such a function is called a generator (a small standalone example follows below the output). So in our case, the generator function yields a sequence of tasks: one for each subject. Note that I’ve also added a name field to each task. This is so the tasks are called data:A, data:B, data:C and so forth. You can see this when you run doit now:

$ doit
.  data:A
.  data:B
.  data:C
.  data:D
.  data:E
.  data:F
.  data:G
.  data:H
.  data:I
.  data:J
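
If generators are new to you, here is a tiny standalone example, unrelated to doit or the pipeline, of how yield works:

def count_up_to(n):
    """Yield the numbers 1 through n, one at a time."""
    i = 1
    while i <= n:
        yield i
        i += 1

print(list(count_up_to(3)))  # prints [1, 2, 3]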

As the output shows, doit now runs the data task for each subject. To verify that we don’t do unnecessary computations, we can run doit again and check that it skips everything that is up to date:

$ doit
-- data:A
-- data:B
-- data:C
-- data:D
-- data:E
-- data:F
-- data:G
-- data:H
-- data:I

You now have enough knowledge to complete the dodo.py file so that it includes all the steps of the analysis pipeline:

# List of all the subjects
subjects = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']


def task_data():
    """Step 1: generate some data for the subject"""
    for subject in subjects:
        yield {
            'name': subject,
            'file_dep': ['data.py'],
            'targets': ['%s-data.txt' % subject],
            'actions': ['python data.py %s' % subject],
        }


def task_filter():
    """Step 2: low-pass filter the data at 30 Hz to clean it"""
    for subject in subjects:
        yield {
            'name': subject,
            'file_dep': ['filter.py',
                         '%s-data.txt' % subject],
            'targets': ['%s-filtered_data.txt' % subject],
            'actions': ['python filter.py %s' % subject],
        }


def task_subject_mean():
    """Step 3: Compute the mean for each subject"""
    for subject in subjects:
        yield {
            'name': subject,
            'file_dep': ['subject_average.py',
                         '%s-filtered_data.txt' % subject],
            'targets': ['%s-mean.txt' % subject,
                        '%s-std.txt' % subject],
            'actions': ['python subject_average.py %s' % subject],
        }


def task_grand_average():
    """Step 4: Compute grand average mean"""
    return {
        'file_dep': (
            ['grand_average.py'] +
            ['%s-mean.txt' % s for s in subjects] +
            ['%s-std.txt' % s for s in subjects]
        ),
        'targets': ['result.txt'],
        'actions': ['python grand_average.py'],
    }

To create the list of dependencies for the grand average step, I make use of some Python syntax you may not be familiar with. I generate lists of files for all subjects using a feature called list comprehension and I concatenate the resulting lists using the + operator.
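
To make this concrete, here is roughly what those expressions evaluate to, using a shortened subject list just for illustration:

subjects = ['A', 'B', 'C']

['%s-mean.txt' % s for s in subjects]
# -> ['A-mean.txt', 'B-mean.txt', 'C-mean.txt']

['grand_average.py'] + ['%s-mean.txt' % s for s in subjects]
# -> ['grand_average.py', 'A-mean.txt', 'B-mean.txt', 'C-mean.txt']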

Running parts of the pipeline

With the above script, we can run the entire pipeline with the plain doit command. We can also specify which task (or series of tasks) to run from the command line. The doit tool can provide us with a list of available ‘targets’ to run:

$ doit list
data            Step 1: generate some data for the subject
filter          Step 2: low-pass filter the data at 30 Hz to clean it
grand_average   Step 4: Compute grand average mean
subject_mean    Step 3: Compute the mean for each subject

For example, if we only want to perform the filtering step:

$ doit filter
-- data:A
-- data:B
-- data:C
-- data:D
-- data:E
-- data:F
-- data:G
-- data:H
-- data:I
-- data:J
-- filter:A
-- filter:B
-- filter:C
-- filter:D
-- filter:E
-- filter:F
-- filter:G
-- filter:H
-- filter:I
-- filter:J

It now performs all the steps up to and including the filtering step, skipping everything that is up to date. If we really only want to run the filtering step itself, we can add the -s flag to disable the dependency checking:

$ doit -s filter
-- filter:A
-- filter:B
-- filter:C
-- filter:D
-- filter:E
-- filter:F
-- filter:G
-- filter:H
-- filter:I
-- filter:J

We can also select tasks to be executed using a glob-like syntax (as long as we include at least one *). Say we only want to do the analysis for subject C:

$ doit '*:C'
-- data:C
-- filter:C
-- subject_mean:C

Note that we have to use quotes ('*:C'), otherwise the shell will try to expand the wildcard itself instead of passing it as an argument to the doit program.

Finally, here’s how to run specific steps of the analysis for specific subjects:

$ doit '*[data, filter]:[A, B, C]'
-- data:A
-- data:B
-- data:C
-- filter:A
-- filter:B
-- filter:C

Take note that a * must always be present if we are using globbing.

Would you like to know more?

Check out the full documentation of the doit tool at https://pydoit.org.