Biologists are about to have access to all the genetic data we could ever want. Unfortunately, once we have that data, we have to figure out where to put it—and some way to sift out the bits that answer the questions we want to answer.
That’s the first day of the NESCent workshop on next-generation sequencing methods, in a nutshell.
Brian O’Connor, who gave the morning lectures, framed the immediate future of biology as a race between technologies for collecting genetic sequence data and technologies for storing and analyzing that data. Moore’s Law is that computer processor speed (really, the number of transistors packed into a single processor chip) doubles about every two years; Kryder’s Law is that computer storage capacity roughly quadruples in the same amount of time. But in the last few years, and for the foreseeable future, DNA sequence collection capacities are growing on the order of ten times every couple years.
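To see why that matters, it helps to compound those rates side by side. Here’s a quick back-of-the-envelope projection in Python; the starting values are arbitrary (everything begins at 1x), and the growth rates are just the rough figures quoted above.

```python
# Back-of-the-envelope comparison of the growth rates quoted above.
# Starting values are arbitrary (everything begins at 1x); only the
# relative gap between the curves matters.
compute = 1.0     # processing capacity (Moore's Law: ~2x every two years)
storage = 1.0     # storage capacity (Kryder's Law, as cited: ~4x every two years)
sequencing = 1.0  # sequence collection capacity (~10x every two years)

for year in range(0, 12, 2):
    print(f"year {year:2d}: compute {compute:9.1f}x  "
          f"storage {storage:9.1f}x  sequencing {sequencing:9.1f}x")
    compute *= 2
    storage *= 4
    sequencing *= 10
```

After a decade at those rates, sequencing capacity has outrun processing capacity by a factor of more than three thousand, which is the whole problem in one loop.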
In other words, there may very well come a day when the cost of storing and using a genome (or genomes!) belonging to your favorite study organism exceeds the cost of obtaining those data.
O’Connor suggests that one major way to stave off the point where computing capacity limits data collection and analysis will be to use more “cloud” systems: remote servers and storage. Lots of institutions have their own servers and computing clusters. I’m already working with data sets too big to store, much less process, on my laptop; I filter out the subset of sites I want on the server where the data is stored, and download (some of) that smaller data set for local work.
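To make that concrete, here’s roughly what that server-side filtering step might look like in Python; the file names, and the assumption that the first two columns are chromosome and position, are hypothetical rather than a description of my actual pipeline.

```python
# Sketch: keep only sites of interest from a large tab-delimited genotype
# file on the server, writing a much smaller file to download for local work.
# File names and column layout are hypothetical.

wanted_sites = set()
with open("sites_of_interest.txt") as f:       # one "chrom:position" per line
    for line in f:
        wanted_sites.add(line.strip())

with open("all_genotypes.tsv") as big, open("subset.tsv", "w") as small:
    small.write(big.readline())                # copy the header line as-is
    for line in big:
        chrom, pos = line.split("\t", 2)[:2]   # first two columns: chrom, position
        if f"{chrom}:{pos}" in wanted_sites:
            small.write(line)
```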
However, high-capacity computing facilities need a lot of lead time and infrastructure investment to scale up, which isn’t practical for individual projects. In such situations, and for researchers at institutions that don’t have their own high-capacity computing resources, commercial services may become a major alternative.
In the afternoon, we got started with one such alternative, Amazon EC2, short for “Elastic Compute Cloud.” Yes, that’s Amazon as in Amazon.com, the place where you buy used textbooks. It’s possible to rent processing capacity and storage from Amazon, and the services are provided in such a way that when you need more, you can just request it. “Instances” running on Amazon’s computing facilities can run Unix or Windows (you can interact with an instance via a remote desktop-type interface such as NoMachine’s NX system), and will run any program you’d care to have chew its way through your data.
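For a sense of how little ceremony “just request more” involves, here’s a sketch of launching a single instance with Amazon’s boto3 Python library; the AMI ID, instance type, and key pair name are placeholders, and it assumes you’ve already set up AWS credentials.

```python
# Sketch: launch one EC2 instance with boto3 and print its ID.
# The AMI ID, instance type, and key pair name are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-xxxxxxxx",    # placeholder: an AMI with your analysis tools installed
    InstanceType="t2.micro",   # scale this up when the analysis demands it
    KeyName="my-keypair",      # placeholder: your SSH key pair name
    MinCount=1,
    MaxCount=1,
)

instance_id = response["Instances"][0]["InstanceId"]
print(f"Launched instance {instance_id}; terminate it when the job is done.")
```

From there you’d connect over SSH (or an NX-style remote desktop, as above), run your analysis, and shut the instance down so the meter stops running.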
All of this, of course, assumes you have the budget. It’s not clear to me how easy it’d be to estimate computing needs ahead of time for grant-writing purposes; but on the other hand, whatever estimate you come up with will probably go that much further by the time you actually start working a year later, as prices drop and capacity grows. Over beer at the end of the day, Karen Cranston, the Informatics Project Manager for NESCent, told me that Amazon’s pricing is close enough to that of the high-capacity computing facility at Duke University that it’s often worthwhile to use EC2 for short-term, high-volume projects simply because it’s so quick and easy to bring new resources to bear.
As a not-yet faculty member, the cloud means I can plan to do genome-scale work even if I end up at an institution without the on-campus resources to build its own cluster. That’s potentially pretty liberating. ◼