Saturday, December 29, 2012

What incentives are there to maintain software in academia?

Just read an article in PLoS Comp. Bio. called "Ten Simple Rules for the Open Development of Scientific Software" by Andreas Prlić which was linked by Karthik Ram. A Twitter discussion followed, in which 140 characters was not enough to be sufficiently expressive. Let me start off by saying that I think this was a fantastic article. I'm 100% in agreement and think that these are some important points to make. I start with this caveat because I'm about to dwell on one suggestion that I had a negative reaction to.

From rule 10, "science counts:"

As scientists, the software we write is primarily a means to advance our research and, ultimately, achieve our scientific goals. Whilst the development of software for the consumption of others aligns well with other processes of scientific advancement, it is the science that ultimately counts. Scientific software development fulfils an immediate need, but maintenance of code that is no longer relevant to your own research is a serious time sink, and will rarely lead to your next paper, or secure your next grant or position.

The author hits on the unfortunate practical reality that time spent on software development that doesn't result in widely-recognized deliverables such as publications or grants is essentially time wasted, and will be inversely correlated with your chances of success as an academic.

The troubling part is that this is an extraordinarily short-sighted view of the value of software. Outside of academia, large communities of developers frequently and happily contribute to open source projects for which they receive no tangible benefit. The rewards developers receive vary from education and experience to networking and recognition to simply having fun. Sometimes extrinsic rewards eventually present themselves, and beyond a certain level of growth money becomes increasingly necessary to keep a large project going (see the "Money" chapter from Producing Open Source Software). Still, popular open source projects such as Linux and Python have value that far outweighs the modest amounts of money that have been funneled into them, and they're still developed largely by unpaid (sometimes anonymous) volunteers.

Scientific software is important, and even very specialized software should be more widely available and used more often. Replication is one of the cornerstones of the scientific method. I envision a future where results and figures from papers are easily replicable upon publication and where people (reviewers especially) are in the habit of checking each other's work. This is already being done on small scales - see Weecology on GitHub for some excellent examples. The problem is this: a scientist who develops code for a single analysis and makes their code publicly available is doing it to benefit the broader scientific community. But code rots over time. Inevitably, when code makes the jump from a single user to many, problems will be discovered. Thus, the benefit provided by open source software is directly related to the effort spent responding to users and maintaining code. And for most projects, this effort has a very low probability of providing the author of the code with an additional grant or publication, so there's little incentive to do it. (There are notable counterexamples - massive projects such as DataONE for which there's already funding for long-term development and maintenance and which tend to result in multiple publications and presentations for those involved.)

So, my question is this: what can be done to provide incentives for the development and maintenance of important scientific code?

Monday, December 17, 2012

Does gun ownership (A) increase violence or (B) deter violence? C: none of the above.

In the wake of a terrible tragedy, talks about gun control are at the forefront of today's political stage. Of course, both advocates and opponents of gun control point to the Connecticut shooting as a validation of their own viewpoint. It's unfortunate that occurrences like this only seem to polarize us further. We can't rely on anecdotes or emotion to solve this problem. So, what does the data say?

Using data from the Guardian on gun ownership by country, I evaluated the hypothesis that higher rates of gun ownership either (A) lead to increased gun violence (as believed by the left) or (B) actually work to deter gun violence (as believed by the right). Note that without an experimental manipulation (which is difficult to do because of the many factors that would need to be controlled for, not to mention very questionable ethics), we can identify correlations but it's difficult to really say anything about causation.

First, I compared the total number of civilian-owned guns in each country to the total number of gun-related homicides. The results, unsurprisingly, show a strong positive correlation, most of which can be explained by population: more populous nations will tend to have more homicides and more guns. (In these figures, the size of each data point indicates population.)

To control for population, I compared the rate of gun ownership (per 100 people) to the rate of gun-related homicides (per 100,000 people) and the results were surprising. There's a weak positive relationship between the rate of firearm ownership and the rate of firearm-related homicide (p=0.25), which doesn't strongly support either side's claims.
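
A minimal sketch of this kind of comparison, assuming each country is reduced to a population, a gun count, and a gun-homicide count. The helper names (`per_capita`, `pearson_r`) and the rows in `toy_countries` are hypothetical: the numbers are invented placeholders, not the Guardian data, and this is not the code from the GitHub repository. It only computes a correlation coefficient; the significance test behind a reported p-value is not reproduced here.

```python
from math import sqrt


def per_capita(count, population, per=100_000):
    """Convert a raw count into a rate per `per` people."""
    return count / population * per


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical placeholder rows: (population, civilian guns, gun homicides).
# These numbers are invented for illustration and are NOT from the Guardian data.
toy_countries = [
    (1_000_000, 100_000, 10),
    (10_000_000, 1_000_000, 100),
    (50_000_000, 10_000_000, 250),
    (200_000_000, 20_000_000, 4_000),
]

# Raw totals: both columns grow with population, so they tend to correlate.
raw_guns = [guns for _, guns, _ in toy_countries]
raw_homicides = [hom for _, _, hom in toy_countries]

# Rates: guns per 100 people and gun homicides per 100,000 people.
guns_per_100 = [per_capita(guns, pop, per=100) for pop, guns, _ in toy_countries]
homicides_per_100k = [per_capita(hom, pop) for pop, _, hom in toy_countries]

r_raw = pearson_r(raw_guns, raw_homicides)
r_rate = pearson_r(guns_per_100, homicides_per_100k)
print(f"raw totals:       r = {r_raw:.2f}")
print(f"per-capita rates: r = {r_rate:.2f}")
```

Dividing by population before correlating is what removes the "bigger countries have more of everything" effect described above; whatever relationship remains between the two rates is the quantity of interest.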

I suspect that there are other cultural, political, and socioeconomic factors that far outweigh gun ownership as predictors of gun violence, and that both sides in this debate potentially have valid points. In some situations, the presence of guns may deter violent crime. In others, it may enable violent crime.

We can all agree on one thing: we want there to be fewer mass shootings in America. When considering what policy changes will move us toward that goal, it is absolutely essential that we rely on evidence instead of either emotions or anecdotes.

Additionally, since "guns don't kill people (people with guns kill people)", maybe gun control is less important to curing the modern epidemic of mass-shootings than improving access to and understanding of mental healthcare.

The data and code I used to produce these figures are available on GitHub, and you're free to use them however you like. Feedback is welcome.

Edit: Someone did some additional analysis, uncovering a couple of interesting correlates of overall homicide rates (including those unrelated to guns): GDP and income inequality. See it on Reddit.

Friday, December 7, 2012

Hour challenge 12/7: zot, a command-line Zotero client

The last hour challenge was fun, but it was also a dismal failure - honestly, I had been envisioning nanote for a long time prior to developing it, and an hour was just not enough time to build in all the functionality I wanted. I've started using nanote in place of nano and have continued to build in additional functionality, and will probably continue to do so for a while. End result: I now understand how to write a program with the curses library, and my notes are much more organized than they were a week ago.

I use Zotero all the time to manage papers and books. Today I'll be developing a command line interface to Zotero. (There's not already one of these? Really?)

I had planned to use the pygnotero library to interface with Zotero, but to avoid the GPL I'm not going to use it. Instead, I'll just interface directly with Zotero's sqlite database. And, for fun, I'll use SQLAlchemy, which I've never used and should really learn more about. I want my client to be able to search (by author, title, citation, tags, etc.), add notes to papers, and output bibliographies (and potentially the text from PDF articles using pdfminer? I'm going to keep thinking about this.) It'll be called zot - short, memorable names for command line tools are always a good thing. I'm going to design it to pipe output to itself, e.g. "zot search ecoinformatics | zot bibliography" to generate a bibliography for all articles on ecoinformatics.
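
A rough sketch of what searching a sqlite database by field could look like. To stay self-contained it uses Python's standard `sqlite3` module rather than SQLAlchemy, and it runs against an in-memory database with a hypothetical, heavily simplified `papers` table; Zotero's real schema is different and more normalized. The function names and the sample rows are placeholders invented for illustration.

```python
import sqlite3

# Columns that may be searched. Whitelisting them keeps the column name
# (which cannot be passed as a bound parameter) out of user control.
SEARCHABLE = {"title", "author"}


def make_demo_library():
    """Build an in-memory database with a hypothetical, simplified schema.

    This is NOT Zotero's actual table layout; it only stands in for it.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE papers ("
        "id INTEGER PRIMARY KEY, title TEXT, author TEXT, year INTEGER)"
    )
    conn.executemany(
        "INSERT INTO papers (title, author, year) VALUES (?, ?, ?)",
        [
            # Placeholder rows, not real references.
            ("Placeholder paper on ecoinformatics", "Author A", 2010),
            ("Placeholder paper on curses interfaces", "Author B", 2011),
            ("Another placeholder title", "Author A", 2012),
        ],
    )
    conn.commit()
    return conn


def search(conn, field, term):
    """Return rows whose `field` contains `term` (substring match)."""
    if field not in SEARCHABLE:
        raise ValueError(f"cannot search by {field!r}")
    query = (
        f"SELECT id, title, author, year FROM papers "
        f"WHERE {field} LIKE ? ORDER BY id"
    )
    # The search term itself is always passed as a bound parameter.
    return conn.execute(query, (f"%{term}%",)).fetchall()


conn = make_demo_library()
for row in search(conn, "title", "ecoinformatics"):
    print(row)
```

Working against the real Zotero database would mean replacing `make_demo_library` with a connection to the actual file and rewriting the query for the real tables, ideally opening the file read-only so the client cannot corrupt the library.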

I'll begin sketching out planned functionality while I eat lunch at 12:00 (eastern time), start coding at 1:00 and hope to be finished no later than 2:00. Code will be available at The final result will be available on the Python Package Index as soon as it's relatively functional.

Update (2:00): my command line client can currently search for articles by title, author, or publication. I'm going to go for another hour to see if I can finish up.

Update (3:00): after two hours, I'm calling this finished. To try it out:

    pip install zot
    zot path /path/to/your/Zotero/directory/
    zot search --author Brown | zot bib
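
The pipe in the last command could work in several ways. The sketch below assumes one simple convention: `search` prints one item ID per line, and `bib` reads IDs from standard input and prints a formatted entry for each. This is an assumption about the design, not a description of how zot actually passes results between invocations. `LIBRARY`, `cmd_search`, `cmd_bib`, and the `zot-sketch` program name are hypothetical, and the entries are placeholders.

```python
import argparse
import io
import sys

# Hypothetical in-memory library standing in for the Zotero database.
# The entries are placeholders, not real references.
LIBRARY = {
    1: {"author": "Brown, A.", "year": 2001, "title": "Placeholder title one"},
    2: {"author": "Green, B.", "year": 2005, "title": "Placeholder title two"},
    3: {"author": "Brown, C.", "year": 2010, "title": "Placeholder title three"},
}


def cmd_search(author, out):
    """Write the ID of each item whose author matches, one per line."""
    for item_id, item in sorted(LIBRARY.items()):
        if author.lower() in item["author"].lower():
            out.write(f"{item_id}\n")


def cmd_bib(inp, out):
    """Read item IDs (one per line) from `inp` and write formatted entries."""
    for line in inp:
        line = line.strip()
        if not line.isdigit():
            continue  # skip blank or malformed lines
        item = LIBRARY.get(int(line))
        if item is not None:
            out.write(f"{item['author']} ({item['year']}). {item['title']}.\n")


def main(argv=None, stdin=None, stdout=None):
    """Dispatch to a subcommand; streams are injectable for testing."""
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    parser = argparse.ArgumentParser(prog="zot-sketch")
    sub = parser.add_subparsers(dest="command", required=True)
    search_parser = sub.add_parser("search")
    search_parser.add_argument("--author", required=True)
    sub.add_parser("bib")
    args = parser.parse_args(argv)
    if args.command == "search":
        cmd_search(args.author, stdout)
    else:
        cmd_bib(stdin, stdout)


# Simulate `search --author Brown | bib` in-process with a StringIO "pipe".
pipe = io.StringIO()
main(["search", "--author", "Brown"], stdout=pipe)
pipe.seek(0)
result = io.StringIO()
main(["bib"], stdin=pipe, stdout=result)
print(result.getvalue(), end="")
```

Passing plain IDs keeps each subcommand small and lets ordinary tools such as `head` or `sort` sit in the middle of the pipeline; the cost is that `bib` has to look every item up again.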