Tuesday, September 17, 2013

Using Make for reproducible scientific analyses

The two tools that I use most frequently to manage my research are Git and Make. I've been using Git for years, but Make for only the past year or so - strangely enough, I learned about it at a Software Carpentry workshop at which I was one of the instructors. Since then, it's become a key weapon in my computational arsenal. In this post, I'll try to sell you on using Make for your own research projects.

What is Make? Make is an "automated build system." In other words, it's a software tool for automating the process of building one or more files from one or more other files they depend on. A "Makefile" contains a set of "recipes" that specify how all of your desired final products - figures, processed data files, documents, etc. - should be produced as well as their dependencies, or what files are needed to create them.

Many C and C++ programmers know Make as a tool for compiling code. That process can often be messy, and may require other tools to make it work reliably - tools that generate very ugly, difficult-to-understand Makefiles. As a result, a lot of people think Make itself is complicated (before learning more about it, I was one of them). However, Make is conceptually very simple, and "how do I produce a set of files from a set of other files" is a general problem that scientists run into every day.
Learning Make was simple and using it netted me three major benefits:
  1. Automation reduces cognitive load. When I come back to a project that has a Makefile, I don't need to think about which program I need to run with which file and which options to get my desired result - I simply type "make" or "make [filename]" and it handles everything for me. Previously I might've had a script like "do_everything.py" that would, well, do everything - having a Makefile instead means I don't even need to look for such a script. I know in advance how I'm going to repeat the analysis. It also means that other people can check out my code, see that there's a Makefile, and effortlessly repeat my analysis, often without even requiring instructions.
  2. Make automatically handles dependencies. When you have a complicated workflow, changing one file means that other files downstream may also need to be updated. But which ones? Without Make, I need to know which files depend on the file I changed, which files depend on those files, etc. This is another thing that Make will figure out for you, saving you the time and effort of thinking it through yourself.
  3. Makefiles provide a roadmap to your project. The best way to start a project is to map out your workflow: what input files do I start with, how do they need to be processed, and what will the end results be? The more specific you can be with this roadmap, the better. This is precisely what a Makefile gives you - but in addition to simply documenting your project, it's also something that can be run.
After all, the spirit of laziness is the spirit of innovation!

An example. Here's a brief example of a hypothetical scientific workflow, and how it can be improved with Make.

My scenario: I have ten FASTA files containing DNA sequences. I want to align them using muscle, convert the alignment to another format, then produce a figure that visualizes the alignments.

What I have: ten FASTA files (homo_sapiens.fasta, pan_troglodytes.fasta, gorilla_gorilla.fasta...), a Python script (visualize_alignments.py), and muscle. My Python script requires the alignment to be in Phylip format, but muscle outputs in FASTA format.

To manually generate these figures, I would use muscle to align each FASTA file individually; use the Biopython library to convert each alignment from FASTA to Phylip; then call my Python script to generate the figure. That's 3 individual steps per FASTA file, for a total of 30 steps - and as I add more DNA sequences, the number of steps will grow. Instead, I'll show you how to do everything with just a single word: "make."

First, let's learn the anatomy of a Make "recipe." A Makefile is a collection of recipes, each of which specifies how to create a single "target":

    target: dependencies...
        command

The target is the file to be produced; dependencies is a list of files it needs; command is the command that turns those dependencies into the target.

Back to our example. Working backwards, my end result is a figure visualizing the alignment for a given species. The target is the figure. To generate it, I need the alignment file (in Phylip format) and the script that draws the figure. Once I have those, the command is simply "python visualize_alignments.py [alignment file]." So the first recipe looks like this:

    homo_sapiens_alignment.png: homo_sapiens.phy visualize_alignments.py
        python visualize_alignments.py homo_sapiens.phy

Note the target to the left of the colon, the dependencies to the right, and the command below. The command must be indented with a tab - spaces won't work. It might also seem strange to think of my figure "depending" on the script that produces it. What I really mean is that if my visualization script changes - if I change the figure label, or colors, or something - the figure needs to be re-built. More on that later.

Since we're doing this ten (or potentially many) times, we can make this recipe even more general by changing a few parts using some ugly but useful Make syntax:

    %_alignment.png: %.phy visualize_alignments.py
        python visualize_alignments.py $<

You can see two changes here. First, in the target and dependencies, I've replaced "homo_sapiens" with %, the wildcard character. I'm telling Make that to make any file that ends in _alignment.png, I need a file of the same name that ends in .phy. I've also added $< in the command - this is short for "the first dependency," or, in this example, the .phy file. Think of the < as an arrow, pointing to the leftmost dependency.

We've written one recipe. Now I'll add two more: one to convert the alignment from FASTA (.aln) to Phylip (.phy) using Biopython, and another to produce the alignment from the input file using muscle:

    %_alignment.png: %.phy visualize_alignments.py
        python visualize_alignments.py $<

    %.phy: %.aln
        python -c "import Bio.AlignIO as aio; aio.convert('$<','fasta','$@','phylip')"

    %.aln: %.fasta
        muscle -in $< -out $@

One more syntax element has been added here: $@ stands for "the name of the target."

What I have now is sufficient to generate figures individually - I can type "make homo_sapiens_alignment.png" and the entire workflow will be run, producing the figure. Here's where automatic dependency resolution becomes useful. If I run "make homo_sapiens_alignment.png" the figure will be created - if I run the same command a second time, nothing will happen. This is because Make sees that none of the dependencies have changed, and therefore nothing needs to be done. If I make a change to one file - change some of the plotting code in visualize_alignments.py, or manually edit the alignment file, for example - Make will re-run only the necessary portions of the workflow, instead of the entire thing!
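You can watch this dependency tracking in action with a throwaway two-step pipeline (the file names here are invented purely for the demo; any text files behave the same way):

```shell
# Set up a scratch directory with a two-step chain: input.txt -> intermediate.txt -> result.txt
mkdir -p make_demo && cd make_demo
printf 'result.txt: intermediate.txt\n\tcp intermediate.txt result.txt\n\nintermediate.txt: input.txt\n\tcp input.txt intermediate.txt\n' > Makefile
echo "data" > input.txt

make result.txt   # runs both steps
make result.txt   # does nothing: the target is already newer than its dependencies
touch input.txt   # simulate editing the input file
make result.txt   # re-runs both steps, since everything downstream is now out of date
```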

Our workflow is still ten steps, one per figure - better than 30, but not yet ideal. We can make one final improvement that will allow us to simply type "make" and generate all ten figures automatically. We'll create a rule called "all" at the very top:

    all: $(patsubst %.fasta, %_alignment.png, $(wildcard *.fasta))

This line uses two Make functions. $(wildcard *.fasta) expands to a list of every .fasta file in the directory - my input files. The "patsubst" function then transforms that list: for every file called [something].fasta, the result contains [something]_alignment.png (the corresponding figure). By default, running "make" without the name of a target runs the very first recipe in the file (in this case, "all"), and that recipe now depends on all of the figures we want to produce, so they'll all be created automatically if they need to be.
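Putting all the pieces from this post together, the complete Makefile looks like this (remember that each command line must begin with a tab):

```makefile
all: $(patsubst %.fasta, %_alignment.png, $(wildcard *.fasta))

%_alignment.png: %.phy visualize_alignments.py
	python visualize_alignments.py $<

%.phy: %.aln
	python -c "import Bio.AlignIO as aio; aio.convert('$<','fasta','$@','phylip')"

%.aln: %.fasta
	muscle -in $< -out $@
```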

Conclusion.
While there are some ugly bits - the unfamiliar syntax, particularly - Make is, at its core, quite simple. If you're new to Make, I suggest avoiding the advanced syntax for now: simply use plain filenames in your recipes, and add generality later by adopting Make's syntax elements one at a time. Remember: a Makefile is just a list of the steps needed to turn your input files into the output you want. Used this way, it's a powerful tool that can improve reproducibility and save you time, effort, and headaches.

Wednesday, July 31, 2013

ESA 2013

I'm about to head to Minneapolis for a week to attend the annual meeting of the Ecological Society of America, the first I've attended. Here's where I'll be participating:


  • I'm an instructor at a Software Carpentry workshop for ecologists on Sunday, along with Ethan White and assisted by Elita Baldridge. It's a condensed, 1-day workshop (instead of the typical 2) focused on using the shell, regular expressions, and version control. I'll be teaching about Git and GitHub and how they can benefit ecologists.
  • I'm giving a 5-minute "Ignite" talk on Tuesday morning (a 5-minute talk with 20 slides that advance automatically every 15 seconds) about the EcoData Retriever (recently published in PLoS ONE).
  • I'll be staffing the Dryad booth in the exhibit area during some of the exhibit/poster sessions and breaks. Come talk to me about archiving your data and learn about how your journal can integrate Dryad into the article submission and review process.
The other members of Weecology (my former lab group at Utah State) will be doing some interesting things this year as well - check the link for details.

Sunday, June 30, 2013

iEvoBio 2013

I attended iEvoBio in beautiful Snowbird, UT last week. iEvoBio is a satellite conference to the larger annual Evolution meeting that lasts for two days and focuses more explicitly on informatics (although apparently Evolution now has its own informatics session.) I presented a software demo of PhyloCommons, which is now in an alpha release. I got some great feedback on usability and have a clear direction forward. Here are my brief, unorganized impressions of the conference:
  • No one is sure what the "i" stands for. It's like the "i" in iPhone, right? So what the hell does that i stand for??
  • There was representation from a wide array of informatics initiatives and labs. EOL, iDigBio, and BiSciCol seem very promising, and before the conference I hadn't even heard of the latter two. Among many people's favorite talks was one by Xiao Xiao (of Weecology at Utah State) on using big data to test macroecological theory - very different from most of the other talks, but really interesting and very well-suited for this audience (Xiao even won a travel award for having one of the top student talks! Pretty awesome for someone who probably doesn't consider herself an evolutionary biologist.)
  • There were some common themes that came up repeatedly in talks, giving a clear idea on what many people have independently decided are pressing problems in biodiversity informatics. Using and linking identifiers is a big one.
  • The number of attendees was fairly small: Arlin Stoltzfus says he counted 40 on the first day. There was a very unfortunate clash with some Evolution phylogenetics sessions - sessions most iEvoBio attendees are interested in. The second day had more attendees, although I didn't count how many.
  • The afternoon of the second day was set aside for Birds of a Feather sessions, where participants broke into small groups and discussed topics of mutual interest, then reported back. It was great to be able to interact on a more personal level with other attendees. The downside is that people tend to be interested in hard problems, the kind that aren't resolvable in an hour of discussion - so the conclusion coming out of many groups perhaps wasn't as satisfying as people had hoped it would be going in. I suppose that's unavoidable due to the nature of these hard problems. Still, these were very enlightening discussions.
  • This was only my second conference. I was surprised to find that some people gave a talk at Evolution, a talk at iEvoBio, and a software demo at iEvoBio, on the same or similar topics. For some reason, I had assumed that you had to pick one. I'll look to be more actively involved in future conferences I attend.
  • There was a lot of talk about gender diversity in science - highlighting a setting with a roughly equal gender split, and in which the 6-person organizing committee and two keynote speakers consisted entirely of women. An interesting comment that came out of a BOAF group on the topic was to avoid the all-too-common conflation of "women's issues in science" with when to start a family, etc. Three problems with this: (A) not all women are looking to have kids soon (or possibly ever), so these discussions are irrelevant to them (and of course they have plenty of other issues they'd probably like to discuss); (B) these are parents' issues in science, not women's issues - plenty of men take an active role in raising their children; and (C) it only serves to perpetuate the unfortunate stereotype of "women as baby factories" and is therefore counter-productive.
  • Carol and 2-year-old Ruben got to come with me, which was great! They played at Snowbird on day 2 and we hung out in Utah for an extra day before returning home. Pro tip: never, ever, choose the red-eye flight when flying with a toddler. Not worth the $100 savings. To make things even worse, our flight was delayed a full hour.
  • Snowbird is beautiful, and I had a nice chance to return home. Next year it'll be in Raleigh, NC, so I'll certainly be attending.

Sunday, May 26, 2013

A simple Python phylopic.org wrapper

PhyloPic.org is an online repository of species silhouettes. I love the idea - I think it could be very useful for jazzing up figures - and the site has a fantastic search feature and API. This morning I wrote up a simple Python wrapper for the API: enter one or more taxon names, and it downloads a thumbnail image for each one.

In [1]:
import phylopic

phylopic.download_pics('Turdus migratorius', 'Anolis carolinensis', 'Panthera leo', img_format='thumb.png')
Downloading pic for Turdus migratorius... done!
Downloading pic for Anolis carolinensis... done!
Downloading pic for Panthera leo... done!
In [2]:
imshow(imread('Turdus migratorius.thumb.png'))
Out[2]:
<matplotlib.image.AxesImage at 0x37a6c10>
In [3]:
imshow(imread('Anolis carolinensis.thumb.png'))
Out[3]:
<matplotlib.image.AxesImage at 0x36cf250>
In [4]:
imshow(imread('Panthera leo.thumb.png'))
Out[4]:
<matplotlib.image.AxesImage at 0x36fc9d0>

Currently you can download either a low-res thumbnail or an SVG (the default - not ugly and pixelated like the thumbnails, but I couldn't find a good way to embed one in an IPython notebook).

Code available on GitHub

Wednesday, May 15, 2013

Blogging with IPython: distances in higher dimensions

This is my first experiment blogging with an IPython notebook, which allows easily embedding interpreted code and figures. I've found this to be a fun and convenient way to write up blog posts that might contain some technical ideas or data, potentially turning this blog into an open lab notebook. Most of this is just me toying around and experimenting with probability distributions; it may or may not be new to you.

The question

As part of a research idea I'm playing with, I'm thinking about characterizing "average members" of populations where several traits are measured, as well as the average difference between two individuals in a population.

Let's assume for simplicity that we're dealing with normally distributed functional traits. Many traits in nature are approximately normally distributed, and the central limit theorem tells us that any trait that is the sum of many small, independent effects will tend toward normality. If you measure the trait values for many individuals, you'll get a distribution like this:

In [1]:
import scipy.stats as s
# plotting and numeric functions (hist, mean, show, etc.) come from IPython's --pylab mode
data = s.norm(0,1).rvs(100000)
hist(data, 100)
show()

The mean of this distribution is zero. But what is the average distance from the center? It's not zero; that's the minimum distance from the center. The average distance for this data is:

In [2]:
hist(abs(data), 100)
show()
In [3]:
mean(abs(data))
Out[3]:
0.79749462465259524

...about 0.8.
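That ≈0.8 is no accident: the absolute value of a standard normal variable follows a half-normal distribution, whose mean is exactly √(2/π):

```python
import math

# Mean of |X| for X ~ Normal(0, 1): the half-normal mean, sqrt(2/pi)
half_normal_mean = math.sqrt(2 / math.pi)
print(half_normal_mean)  # ~0.7979, matching the simulation above
```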

Now, what happens as I increase the number of dimensions? Let's assume that we're measuring two independent traits in each individual, both normally distributed. This becomes a two-dimensional Gaussian distribution, with its peak in the center:

In [4]:
data = s.norm(0,1).rvs((2, 100000))

hexbin(data[0], data[1])
cbar = colorbar()
cbar.set_label('point density')

So, what is the average distance from the center now?

In [5]:
mean([(x**2 + y**2) ** 0.5 for x, y in zip(data[0], data[1])])
Out[5]:
1.2517969486248142

As the number of dimensions increases, the average distance from the center also increases. It's hard to visualize this as the number of dimensions continues to grow, but it turns out the average distance from the center increases predictably, with the square root of the number of dimensions.
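There's actually a closed form behind this claim: the distance from the origin of a d-dimensional standard normal follows a chi distribution with d degrees of freedom, whose mean is √2 · Γ((d+1)/2) / Γ(d/2), and that quantity approaches √d as d grows:

```python
import math

def mean_chi(d):
    """Exact mean distance from the origin for d independent standard normal
    coordinates (the mean of a chi distribution with d degrees of freedom)."""
    return math.sqrt(2) * math.gamma((d + 1) / 2.0) / math.gamma(d / 2.0)

print(mean_chi(1))                     # ~0.798, the one-dimensional value above
print(mean_chi(2))                     # ~1.253, close to the simulated 1.252
print(mean_chi(100) / math.sqrt(100))  # ~0.9975: the ratio to sqrt(d) approaches 1
```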

In [6]:
MAX_DIMENSIONS = 200

avg_distances = []
for d in range(1, MAX_DIMENSIONS):
    data = s.norm(0,1).rvs((d, 100))
    distances = [sum([x**2 for x in point])**0.5 for point in zip(*data)]
    avg_distances.append(mean(distances))
    
xs = range(1, MAX_DIMENSIONS)
ys = avg_distances

scatter(log(xs), log(ys))

m, b, r, p, se = s.linregress(log(xs), log(ys))

print 'slope=%s, intercept=%s, r^2=%s' % (m, b, r**2)

plot([min(log(xs)), max(log(xs))], [b+min(log(xs))*m, b+max(log(xs))*m])
xlabel('no. dimensions')
ylabel('avg. distance from center')
    
show()
slope=0.516456510273, intercept=-0.0787972583541, r^2=0.998092351213

We've looked at the "average" member of a population. My next question is, what is the average difference between two members of the population?

In one dimension and two dimensions:

In [7]:
data = s.norm(0,1).rvs((2, 100000))

hist(abs(data[0] - data[1]), 100)
show()
In [8]:
mean(abs(data[0] - data[1]))
Out[8]:
1.1254983464319279
In [9]:
d = 2
data = s.norm(0,1).rvs((2, d, 100000))

distances = [sum([x**2 for x in point]) ** 0.5 for point in zip(*(data[0] - data[1]))]
hist(distances, 100)
show()
In [10]:
mean(distances)
Out[10]:
1.7730613234364447

And finally, how does this change as the number of dimensions increases?

In [11]:
avg_distances = []

for d in range(1, MAX_DIMENSIONS):
    data = s.norm(0,1).rvs((2, d, 100))
    
    distances = [sum([x**2 for x in point]) ** 0.5 for point in zip(*(data[0] - data[1]))]
    avg_distances.append(mean(distances))

xs = range(1, MAX_DIMENSIONS)
ys = avg_distances

scatter(log(xs), log(ys))

m, b, r, p, se = s.linregress(log(xs), log(ys))

print 'slope=%s, intercept=%s, r^2=%s' % (m, b, r**2)

plot([min(log(xs)), max(log(xs))], [b+min(log(xs))*m, b+max(log(xs))*m])
xlabel('no. dimensions')
ylabel('avg. distance between two individuals')
    
show()
slope=0.516526929956, intercept=0.268035406832, r^2=0.998039284826

The mean distance between two individuals also seems to increase with the square root of the number of dimensions and seems to be between one and two times the average distance from the center.
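That factor is in fact exactly √2: each coordinate of the difference of two independent standard normals is itself normal with mean 0 and variance 2, so the distribution of pairwise differences is just the single-individual distribution scaled by √2 - and so is its mean distance. Checking against the simulated values:

```python
import math

root2 = math.sqrt(2)
# 1 dimension: sqrt(2) * sqrt(2/pi), vs. the simulated 1.125
pair_1d = root2 * math.sqrt(2 / math.pi)
# 2 dimensions: sqrt(2) * (mean of a chi distribution with 2 df), vs. the simulated 1.773
pair_2d = root2 * math.sqrt(2) * math.gamma(1.5) / math.gamma(1.0)
print(pair_1d)  # ~1.128
print(pair_2d)  # ~1.772
```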

Conclusions

IPython notebooks let me write some fast Python code to explore something I'm curious about in minutes and post the whole thing to a blog instantly. Very cool, and I'm going to keep using this in the future. I wish I hadn't put off trying this out for so long.