Friday, February 15, 2013

Hour challenge: NCBI taxonomy tree

Today I'm planning a more science-y and useful hour challenge. Over the course of one hour, I'm going to transform the NCBI taxonomy from a tabulated text dump into a Newick tree that can be manipulated by phylogenetic tools. Other people have done this, and Newick strings built from the NCBI taxonomy can be downloaded from external sites, but the taxonomy is constantly updated, so it would be nice to have a reproducible process for regenerating the tree whenever necessary.
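
To give a feel for what such a conversion involves, here is a minimal sketch. It is not the script from this challenge. It assumes the usual NCBI dump layout, where nodes.dmp rows begin with the taxon ID and parent ID, names.dmp rows hold ID, name, unique name, and name class, and fields are separated by a tab, a pipe, and a tab. The function names (parse_dmp_line, load_taxonomy, to_newick) are made up for illustration, and every label is quoted unconditionally to keep the output valid.

```python
from collections import defaultdict


def parse_dmp_line(line):
    """Split one NCBI dump row; fields are separated by '\\t|\\t' and end with '\\t|'."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def load_taxonomy(node_lines, name_lines):
    """Return (children, names) built from nodes.dmp and names.dmp style rows."""
    children = defaultdict(list)
    for line in node_lines:
        tax_id, parent_id = parse_dmp_line(line)[:2]
        if tax_id != parent_id:  # the root row lists itself as its own parent
            children[parent_id].append(tax_id)
    names = {}
    for line in name_lines:
        fields = parse_dmp_line(line)
        if fields[3] == "scientific name":  # skip synonyms and common names
            names[fields[0]] = fields[1]
    return children, names


def quote(label):
    # Quoting every label is the simplest way to stay valid:
    # wrap in single quotes and double any single quote inside.
    return "'" + label.replace("'", "''") + "'"


def to_newick(children, names, root="1"):
    """Serialize the tree below `root` without recursion (explicit stack)."""
    rendered = {}
    stack = [(root, False)]
    while stack:
        tax_id, expanded = stack.pop()
        kids = children.get(tax_id, [])
        if kids and not expanded:
            # Revisit this node after all of its children are rendered.
            stack.append((tax_id, True))
            stack.extend((kid, False) for kid in kids)
            continue
        label = quote(names.get(tax_id, tax_id))
        if kids:
            inner = ",".join(rendered.pop(kid) for kid in kids)
            rendered[tax_id] = "(" + inner + ")" + label
        else:
            rendered[tax_id] = label
    return rendered[root] + ";"


# Tiny hand-written sample in the dump format (not real download output).
sample_nodes = [
    "1\t|\t1\t|\tno rank\t|",
    "2\t|\t1\t|\tsuperkingdom\t|",
    "9606\t|\t1\t|\tspecies\t|",
]
sample_names = [
    "1\t|\troot\t|\t\t|\tscientific name\t|",
    "2\t|\tBacteria\t|\tBacteria <bacteria>\t|\tscientific name\t|",
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|",
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|",
]
children, names = load_taxonomy(sample_nodes, sample_names)
newick = to_newick(children, names)
print(newick)
```

For the real dump, the two line lists would simply be the opened nodes.dmp and names.dmp files. The explicit stack avoids depending on Python's recursion limit.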

I have other things going on and I haven't decided exactly when I'm going to do this today, but it'll happen. I think there's a good chance that this is the first one I actually finish in an hour, too. When I decide on the timing, and when I complete the project, I'll update this post with links. As always, everything will be done on GitHub so you can watch me tackle this live if you have nothing better to do.

While we're on the subject of the NCBI taxonomy...

Update 1: Busy day. The plan is to get started at 7 PM Eastern. So, theoretically, I should be finished by 8.

Update 2: Started at 7:15 and finished promptly at 8:15, so this was a success. This actually required fixing a bug in the BioPython Newick writer: node labels weren't being quoted when they contained characters that aren't allowed in unquoted Newick labels, such as spaces or parentheses. So, in addition to the 989,621-node NCBI Newick tree, I also generated a bug fix for BioPython.
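
The quoting rule behind that bug can be sketched in a few lines. This is an illustration of the Newick convention, not the actual BioPython patch, and quote_newick_label is a hypothetical helper name. It assumes the commonly cited rule that unquoted labels may not contain whitespace, parentheses, square brackets, single quotes, colons, semicolons, or commas, and that a single quote inside a quoted label is written twice.

```python
import re

# Characters that may not appear in an unquoted Newick label
# (assumption: the commonly cited Newick format description).
_NEEDS_QUOTING = re.compile(r"[\s()\[\]':;,]")


def quote_newick_label(label):
    """Return the label as-is if it is safe, otherwise single-quoted."""
    if _NEEDS_QUOTING.search(label):
        # Inside a quoted label, a single quote is escaped by doubling it.
        return "'" + label.replace("'", "''") + "'"
    return label


for example in ["Bacteria", "Homo sapiens", "Influenza A virus (H1N1)", "O'Brien"]:
    print(quote_newick_label(example))
```

Without the quoting step, a name containing parentheses would be read back as extra tree structure, which is why a taxonomy full of names with spaces and parentheses exposes this kind of bug quickly.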

The code is available at:


  1. nice, how long does the python script take to run?

    1. After downloading the files, it took about a minute and 20 seconds.

    2. so are there plans to provide this via an API? It's such a large tree, R no likey as you would guess. Tried to read in python:

      In [1]: import Bio.Phylo as bp

      In [2]: from Bio.Phylo import Newick

      In [3]: tree = bp.read('path/to/ncbi_taxonomy.newick', 'newick')

      but lots of errors....

      Is this how you read in newick trees in Python?

    3. You're doing everything right - the error reading in the tree should go away once BioPython accepts my pull request. Still, the whole tree is really too large to be very useful as is.

      Step 2: right now I'm working on putting up a web server backed by the RDF treestore, and I just converted the Newick string into RDF (almost 2 GB). So, once we launch (hopefully within the next month or two), you'll be able to get a subtree by providing a list of taxa, and it should be very fast.