Forecast: Cloudy

Posted: November 30th, 2009 | Author: Shiran Pasternak | Filed under: Big Data

David Dooling, of the Genome Center at Washington University and the blog PolITiGenomics, has written a thoughtful post about the nexus of cloud computing and bioinformatics. Dooling does a fine job summarizing the cloud computing and Big Data conversation that’s happening in bioinformatics circles, so I won’t repeat it here.

The post essentially argues that bioinformaticians are, to mix metaphors, too quick to jump on the cloud computing bandwagon. It often makes more sense (and cents) to analyze genomic data locally (or privately). His cost-benefit analysis is spot on, but it applies only to the current state of the art. The state of bioinformatics is, however, changing very rapidly, and the scale (pun very much intended) is likely to tip.

Cloud and High-Performance Computing

Cloud computing has become a bit of an overused catch-phrase, especially in bioinformatics. In the sense that it is a service, cloud computing is essential for the life sciences. It allows researchers to integrate data sets and infer latent relationships among them. Reference data sets like the human genome are made ubiquitously available on “the cloud,” and scientists are then able to query those data sets seamlessly against newly-procured experimental data.

The term “cloud computing” has been liberally applied to high-performance computing (HPC). Biological data is complex and requires fairly sophisticated algorithms to interrogate it in a reasonable timeframe. Fortunately, many bioinformatics use-cases, such as read-mapping, are embarrassingly parallel and hence scalable. The problem can be rapidly chopped into fragments, each analyzed in parallel, and then combined in the end. Other problems whose solution rests on complex interdependent data structures, such as de novo genome assembly, cannot be as easily decomposed.
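To make the "chop, analyze in parallel, combine" idea concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not a real read-mapper: the reference string, the reads, and the function names (`split`, `map_chunk`, `parallel_map_reads`) are all made up for illustration, and exact string matching stands in for a real aligner. Threads are used only to keep the example short; a real pipeline would spread chunks across processes or cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy reference sequence (an assumption for illustration, not real data).
REFERENCE = "ACGTACGTTAGCCGATAGGCTTAACG"


def map_chunk(reads):
    """Map every read in one chunk independently of all other chunks.

    Exact matching via str.find stands in for a real aligner;
    a position of -1 means the read did not map.
    """
    return [(read, REFERENCE.find(read)) for read in reads]


def split(reads, n_chunks):
    """Chop the read list into roughly equal fragments."""
    size = max(1, -(-len(reads) // n_chunks))  # ceiling division
    return [reads[i:i + size] for i in range(0, len(reads), size)]


def parallel_map_reads(reads, n_workers=4):
    """Split, map each chunk in parallel, then combine in input order."""
    chunks = split(reads, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(map_chunk, chunks)  # preserves chunk order
    return [hit for chunk_hits in results for hit in chunk_hits]


reads = ["ACGT", "TAGC", "GGCT", "TTTT", "AACG"]
hits = parallel_map_reads(reads)
```

Because no chunk needs to know anything about any other chunk, the combined result is identical to mapping the reads serially. That independence is exactly what de novo assembly lacks.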

Cloud computing has also become synonymous with the MapReduce framework and Hadoop, its primary open-source implementation. This is somewhat unfortunate because Hadoop provides other benefits that make it suitable for high-performance distributed computing. CloudBurst and Crossbow, both developed in the same University of Maryland group, are the first published bioinformatics applications to use Hadoop. They smartly and effectively demonstrate the utility of MapReduce and cloud computing in bioinformatics. They leverage important Hadoop features such as task management, configurability, rapid prototyping, and fault-tolerance. But they have certain limitations. After reading the papers, I remain unconvinced that they are more efficient or scale better than an embarrassingly-parallel pipeline. In Crossbow, read-mapping is relegated to the mapping step while SNP-calling is relegated to the reduce step, but I’m not sure that this is the best fit of the MapReduce paradigm to the problem. Also, while the correctness of the pipelines is well-demonstrated for the data sets used, I doubt that their component applications (e.g., Bowtie, the unfortunately-named SOAPsnp) are universally applicable. Making the read-mapping and SNP-calling configurable would be a step in the right direction.
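For readers unfamiliar with the paradigm, the division of labor described above (alignment in the map step, SNP-calling in the reduce step) can be sketched in plain Python. This is not Crossbow's code and uses neither Hadoop, Bowtie, nor SOAPsnp. The reference, the reads, the bin width, the mismatch-counting "aligner", and the majority-vote "SNP caller" are all hypothetical stand-ins chosen to show the shape of the data flow.

```python
from collections import Counter, defaultdict

REFERENCE = "ACGTACGTTAGC"  # toy reference (assumption)
BIN_SIZE = 4                # hypothetical width of a genomic partition


def align(read):
    """Stand-in aligner: start position with the fewest mismatches."""
    best_pos, best_mm = None, len(read) + 1
    for pos in range(len(REFERENCE) - len(read) + 1):
        window = REFERENCE[pos:pos + len(read)]
        mm = sum(a != b for a, b in zip(read, window))
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos


def map_read(read):
    """Map step: align one read, emit (bin, (position, base)) pairs."""
    start = align(read)
    for offset, base in enumerate(read):
        pos = start + offset
        yield pos // BIN_SIZE, (pos, base)


def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_bin(values, min_support=2):
    """Reduce step: naive majority-vote SNP call within one bin."""
    pileup = defaultdict(Counter)
    for pos, base in values:
        pileup[pos][base] += 1
    snps = []
    for pos in sorted(pileup):
        base, count = pileup[pos].most_common(1)[0]
        if base != REFERENCE[pos] and count >= min_support:
            snps.append((pos, REFERENCE[pos], base))
    return snps


def call_snps(reads):
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = (kv for read in reads for kv in map_read(read))
    groups = shuffle(pairs)
    snps = []
    for key in sorted(groups):
        snps.extend(reduce_bin(groups[key]))
    return snps


# Two of the four reads carry a G where the reference has C at position 5.
reads = ["ACGTA", "GTAGGT", "TAGGTT", "GTTAGC"]
snps = call_snps(reads)
```

Note that the map step here is itself embarrassingly parallel, and the reduce step only needs the bases that fall in its own bin. That is why it is fair to ask whether the framework buys anything beyond what a split-and-merge pipeline already provides.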

The point is that CloudBurst and Crossbow are heralded as HPC applications (even though their performance is similar to other existing pipelines) when in fact their primary appeal is to cloud computing (in the strict definition), namely in that they are Software-as-a-Service (SaaS) or Platform-as-a-Service (PaaS). In the HPC sense, Dooling is correct to point out that analyzing data locally makes more economic sense in the near term. In the cloud computing sense, it is hard to measure the benefits, since many of them are intangible (for example, distributed access to biological data and analytics).

IT Resources

According to Dooling's numbers, "unless you are just sequencing a few genomes, you are probably better off purchasing a (possibly single node) cluster." What his numbers don't take into account is the overhead of running a (possibly single node) cluster. While the fixed cost of purchasing computer equipment might be manageable, especially compared to chemical reagents, the operational costs of running a data center are substantial. Computer equipment needs to be continually serviced, be it for software, security, or kernel patches, or for unscheduled maintenance. In addition, energy costs for running a data center are high and expected to increase in the near future.

Data Storage and Localization

Another major consideration for analyzing data, be it locally or on the cloud, is the cost of shuttling and storing it. An Illumina sequencing run can easily generate hundreds of gigabytes of compressed data. That data comprises both sequenced reads and quality information. The Crossbow paper reported that transferring 103 Gb of sequence data to the Amazon EC2 cluster for processing cost $28, which is about a third of the cost to process it ($85). The transfer also took 1 hour and 15 minutes. That effectively increases the wall time for a Crossbow job, especially when compared to a locally-run instance (and that could be an optimistic estimate, since many of the smaller research groups, poised to be beneficiaries of cloud computing services like Crossbow, don't have access to local servers with significant upload bandwidth).
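The arithmetic behind those figures is easy to check. The short Python sketch below uses only the numbers quoted above ($28, $85, 103 Gb, 75 minutes); the function names are made up for illustration. Since "Gb" could be read as either gigabits or gigabytes, the implied upload rate is computed under both readings rather than assuming one.

```python
TRANSFER_COST_USD = 28.0   # quoted transfer cost
COMPUTE_COST_USD = 85.0    # quoted processing cost
DATA_SIZE_G = 103.0        # "103 Gb" as quoted; unit reading is ambiguous
TRANSFER_MINUTES = 75.0    # 1 hour and 15 minutes


def transfer_overhead(transfer_cost, compute_cost):
    """Return transfer cost relative to compute cost and to the total bill."""
    total = transfer_cost + compute_cost
    return transfer_cost / compute_cost, transfer_cost / total


def rate_mbit_per_s(size_g, minutes, unit_is_bytes):
    """Implied sustained upload rate in megabits per second."""
    gigabits = size_g * 8 if unit_is_bytes else size_g
    return gigabits * 1000 / (minutes * 60)


vs_compute, vs_total = transfer_overhead(TRANSFER_COST_USD, COMPUTE_COST_USD)
rate_if_gigabytes = rate_mbit_per_s(DATA_SIZE_G, TRANSFER_MINUTES, True)
rate_if_gigabits = rate_mbit_per_s(DATA_SIZE_G, TRANSFER_MINUTES, False)

print(f"transfer / compute cost: {vs_compute:.0%}")
print(f"transfer share of total: {vs_total:.0%}")
print(f"implied rate if gigabytes: {rate_if_gigabytes:.0f} Mbit/s")
print(f"implied rate if gigabits:  {rate_if_gigabits:.0f} Mbit/s")
```

Transfer comes to roughly a third of the compute cost, or about a quarter of the total bill. Under either reading of the unit, the implied sustained upload rate is in the tens to hundreds of megabits per second, which is the bandwidth a small lab would need to reproduce the 75-minute figure.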

On the other hand, the cost of marshaling data to and from the Amazon EC2 cluster, as well as of long-term storage on the Amazon Simple Storage Service (S3), already reflects the total data center costs that Amazon has to bear. Long-term storage of data in a local data center is expensive, especially when thinking about backup requirements and data availability. Data storage costs should not be considered separately from computer equipment purchases or the aforementioned IT resource considerations.

Sequencing Scale-up

Next-gen (or rather now-gen) sequencing platforms like 454, Illumina, ABI SOLiD, and Helicos are pushing the bioinformatics and biodata envelope. More and more emphasis is placed on data wrangling and management. And these sequencers are becoming more affordable (e.g., improved protocols require fewer reagents) even as the data sets they produce get larger (e.g., longer high-quality reads). But more than improved technology, the niche for these sequencers is growing rapidly. It didn't take long to realize that these sequencers can be used for cheaper genotyping than microarray-based approaches. But now with RNA-seq and ChIP-seq applications, it's becoming apparent that sequencing is not just for genome projects anymore.

In the meantime, the next generation (Gen. 3) of sequencing platforms, like Pacific Biosciences' SMRT technology or the IBM nanopore sequencer, is already on the horizon, promising higher and faster throughput. These platforms are likely to bring yet another disruptive shift in DNA interrogation.

As Dooling aptly puts it, "the more sequencing data people get, the more they want." As sequencing costs get lower and yields get higher, the niche for sequencing in the scientific marketplace will grow. More and more research initiatives will rely on sequencing, at the very least for post-hoc validation. And as the data grows at an exponential rate, small research groups will discover too late that they aren't equipped to handle the deluge. Cloud computing, and especially cloud storage, may in the very near future provide an option for long-term persistent storage of raw biological data. There will be some incurred costs (data upload and storage fees), but those costs will be minuscule compared to the cost of the local IT infrastructure that will be required to manage the data.

As things stand, the role of cloud computing in bioinformatics is fairly limited and undefined. There isn’t a complete suite of analytic utilities available for non-developers to manage and analyze their data on the cloud. As proposed by Dooling, it might make sense for a researcher to process all their data on a cheap commodity computer they can order on NewEgg.com.

But as things stand, change is the only constant. Bioinformatics is very much in flux, as the life sciences are quickly becoming a data science. In the future, small research projects will rely not only on high-throughput sequencing but also on bringing multiple systems-wide data sets to bear. Researchers will not be able to hire or train a developer as an afterthought. IT infrastructure is an absolute necessity at the inception of the project. The question is, where does it make the most sense to manage IT resources, locally or remotely? The answer to that question will depend on an institution’s existing infrastructure. Large IT departments with a mature data center might be able to handle the capacity. But for small groups it makes more sense to outsource the IT infrastructure to a growing, scalable data center known as the cloud.

Applications like Crossbow illustrate that cloud computing can play a very important role in the life sciences. Over time robust cloud-based analytics and visualization tools will be made available not only to programmers but directly to scientists. Ubiquitous access to both data and tools will promote both good science and collaboration. That is a primary promise of cloud computing. A secondary promise is that high-throughput analytics will allow users to interrogate the data more rapidly. MapReduce frameworks obviate the need to develop robust pipelines and let users concentrate on specific queries. As more MapReducible bioinformatics tools such as Crossbow are brought online, the scientific paradigm is likely to shift from hypothesis-driven to hypothesis-generating.


4 Comments on “Forecast: Cloudy”

  1. 1 Deepak Singh said at 8:29 pm on November 30th, 2009:

    [...] a great blog post about cloud computing and bioinformatics, Shiran Pasternak pretty much summarizes many of the points I was making in my talks at [...]

  2. 2 Ryan Schenk said at 11:19 pm on November 30th, 2009:

    Great article, well thought-out and written. In the past months, “cloud” has become such a buzzword that it makes me bristle when I hear it. However, we have reaped the benefits of “cloud computing” at the MBL/WHOI Library.

    In this case, when I say “cloud” I don’t mean paying for EC2 nodes; I mean a dynamically scalable group of virtualized machines that happens to run on our hardware. Similar concept to EC2 nodes, it’s just that we have the hardware in-house. We create and destroy these virtual machines at will, as our needs arise and abate, without having to notify any of our applications.

    In this way, we are forced into architecting our applications in a dynamically scalable, “cloud” friendly way. In doing so, we’ve cut our processing times down tremendously. For instance, I can now process every citation in MEDLINE, a task that formerly took days, if not weeks, in under half an hour. In another instance, I was able to generate MeSH keyword profiles for every species in the Encyclopedia of Life, searching and analyzing PubMed for each of 1.9 million species, in about a week.

    In the future, we are hoping to architect a system in which our servers are application-aware, so resources can automatically be created, destroyed, and diverted from background processing jobs to front-end web serving as various demands rise and fall. We can currently do this manually, but we would rather let our machines organize themselves to best do our bidding.

    Despite the buzzwords, this is cool stuff.

  3. 3 fak3r said at 1:28 pm on December 4th, 2009:

    [...] that was prompted by this blog post by David Dooling (Genome Project) http://www.warelab.org/blog/?p=307 [...]

  4. 4 PolITiGenomics » Blog Archive » Head in the clouds said at 7:16 pm on January 10th, 2010:

    [...] Shiran Pasternak over at Plant Tech Tonics says What his numbers don’t take into account is the overhead of [...]

