<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Plant Tech Tonics</title>
	<atom:link href="http://www.warelab.org/blog/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://www.warelab.org/blog</link>
	<description>The Ware Lab Blog</description>
	<lastBuildDate>Thu, 03 Dec 2009 21:02:07 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Resubmitting failed SGE array tasks</title>
		<link>http://www.warelab.org/blog/?p=328</link>
		<comments>http://www.warelab.org/blog/?p=328#comments</comments>
		<pubDate>Tue, 01 Dec 2009 23:03:09 +0000</pubDate>
		<dc:creator>Shiran Pasternak</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[high-performance computing]]></category>
		<category><![CDATA[sge]]></category>
		<category><![CDATA[shell-scripting]]></category>
		<category><![CDATA[software]]></category>

		<guid isPermaLink="false">http://www.warelab.org/blog/?p=328</guid>
		<description><![CDATA[I can&#x27;t seem to find a straightforward mechanism for resubmitting specific tasks in a Sun Grid Engine (SGE) array job, so I rolled my own. There&#x27;s not much to it, but it&#x27;s ideal for small array jobs where the failed tasks can be specified by hand.

Let me back up a bit. We&#x27;re into grid computing. [...]]]></description>
			<content:encoded><![CDATA[<p>I can&#x27;t seem to find a straightforward mechanism for resubmitting specific tasks in a Sun Grid Engine (SGE) array job, so I rolled my own. There&#x27;s not much to it, but it&#x27;s ideal for small array jobs where the failed tasks can be specified by hand.</p>
<p><span id="more-328"></span></p>
<p>Let me back up a bit. We&#x27;re into grid computing. We have a boat-load of data from various plant genomes that we like to go to town on. On our campus we happen to have a 2,000 core compute farm (known as BlueHelix) and we use <a href="http://gridengine.sunsource.net/" title="Sun Grid Engine">Sun Grid Engine (SGE)</a> for job management. One nice feature of SGE is the ability to submit array jobs. An array job is a convenient SGE semantic for grouping tasks that are very similar. You submit an array job once and SGE spawns individual compute tasks that inherit the environment and run parameters of the job. Tasks are identified by their index number. A simple array job submission looks like this:</p>
<pre>
$ qsub -N job_name -t 1-10 ./script.sh
</pre>
<p>The job is named <tt>job_name</tt> and is composed of 10 tasks. While the job is queued or running, it&#x27;s very easy to track and manipulate that job using its chosen name. For example, to find the status of the array job, run:</p>
<pre>
$ qstat -j job_name
</pre>
<p>SGE also lets you embed command-line parameters on special lines (prefixed with <tt>#$</tt>) within a submission script; <tt>qsub</tt> reads them when the job is submitted. So you can submit an array job by running</p>
<pre>
$ qsub ./script.sh
</pre>
<p>where the first few lines of <tt>./script.sh</tt> might look like this:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;"><span style="color: #666666; font-style: italic;">#!/bin/sh</span>
&nbsp;
<span style="color: #666666; font-style: italic;">#$ -N job_name</span>
<span style="color: #666666; font-style: italic;">#$ -t 1-10</span></pre></div></div>

<p>Each spawned task gets an environment variable, <tt>$SGE_TASK_ID</tt>, whose value identifies its index in the array. It is this variable that distinguishes one task from the next (it can be used programmatically inside <tt>./script.sh</tt>). This presents a slight complication because not all jobs can be naturally indexed by number. How can an array job be used to process a batch of arbitrarily-named files, for example? It is the submitter&#x27;s responsibility to dereference the task ID to a meaningful value that can be processed by a given task.</p>
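<p>As a rough sketch of that dereferencing step (not something taken from the scripts in this post), a task could pick the Nth name out of a sorted file list. The helper <tt>nth_file</tt> and the <tt>data/*.fa</tt> path are made up for illustration; the only thing SGE supplies is <tt>$SGE_TASK_ID</tt>.</p>

```shell
#!/bin/sh
# nth_file is a hypothetical helper: print the Nth (1-based) name from a list.
nth_file() {
    index=$1
    shift                   # drop the index; "$@" is now the list of names
    shift $((index - 1))    # skip ahead to the requested name
    printf '%s\n' "$1"
}

# In a submission script, SGE would supply the index:
#   input=$(nth_file "$SGE_TASK_ID" data/*.fa)

# Stand-alone demonstration with literal names instead of a glob.
nth_file 3 chr1.fa chr2.fa chrUNKNOWN.fa
```

<p>This assumes the directory contents do not change between submission and execution, since every task expands the glob on its own.</p>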
<p>Today I had a specific task: dump all the maize chromosomes, soft-masking repeats, from an <a href="http://www.ensembl.org/" title="Ensembl">Ensembl</a> database. Maize has 10 chromosomes, and an additional 11th chromosome we call UNKNOWN, where a small number of unanchored <a href="http://en.wikipedia.org/wiki/Bacterial_artificial_chromosome">BAC clones</a> are arbitrarily concatenated. I&#x27;ve written a handy script in the past that uses the Ensembl API to dump any specified region, with the option to soft- or hard-mask repeats. Dumping the ~2 Gb maize genome serially takes a good number of hours. With repeat-masking (the genome is more than 80% repetitive), it also requires a lot of memory. Naturally, we can use parallelization to speed up the dump. The simplest approach (not necessarily the fastest, or the best use of a 2,000-core cluster) is to divide the genome by chromosome. The array job would then comprise 11 tasks. Of course, the tasks aren&#x27;t evenly balanced, as the chromosome lengths <a href="http://www.maizesequence.org/Zea_mays/Location/Genome" title="MaizeSequence 4a.53  - Whole genome - Karyotype">vary rather wildly</a> (the largest chromosome, 1, is more than twice the size of the smallest real chromosome, 10).</p>
<p>Within the submission script, we can pass <tt>$SGE_TASK_ID</tt> as is to the dump script to name the chromosome that the task should dump. This is all good for 10 out of 11 chromosomes. We just need to handle the special case of the unanchored chromosome. The submission script looks like this:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;"><span style="color: #666666; font-style: italic;">#!/bin/sh</span>
&nbsp;
<span style="color: #666666; font-style: italic;">#$ -N job_name</span>
<span style="color: #666666; font-style: italic;">#$ -t 1-11</span>
&nbsp;
<span style="color: #007800;">chr</span>=<span style="color: #007800;">$SGE_TASK_ID</span>
<span style="color: #7a0874; font-weight: bold;">&#91;</span> <span style="color: #007800;">$chr</span> <span style="color: #660033;">-eq</span> <span style="color: #000000;">11</span> <span style="color: #7a0874; font-weight: bold;">&#93;</span> <span style="color: #000000; font-weight: bold;">&amp;&amp;</span> <span style="color: #007800;">chr</span>=<span style="color: #ff0000;">&quot;UNKNOWN&quot;</span>
&nbsp;
.<span style="color: #000000; font-weight: bold;">/</span>dump.pl <span style="color: #660033;">--region</span> <span style="color: #007800;">$chr</span></pre></div></div>

<p>This is not very generalizable, but it illustrates how the task ID can be used to dereference meaningful values (a good general approach is to treat task IDs as line numbers in a control file where each line contains the meaningful value).</p>
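<p>The control-file idea could be sketched roughly as follows. The helper name <tt>task_value</tt> and the file name <tt>regions.txt</tt> are hypothetical, and the demonstration builds a throwaway control file so it can run outside of SGE.</p>

```shell
#!/bin/sh
# task_value is a hypothetical helper: print line number $1 of control file $2.
task_value() {
    sed -n "${1}p" "$2"
}

# Inside a real submission script the line number would come from SGE:
#   region=$(task_value "$SGE_TASK_ID" regions.txt)
#   ./dump.pl --region "$region"

# Stand-alone demonstration with a temporary control file.
control=$(mktemp)
printf '%s\n' 1 2 3 UNKNOWN > "$control"
task_value 4 "$control"    # prints the fourth line of the control file
rm -f "$control"
```

<p>With this layout, the <tt>-t</tt> range only needs to match the number of lines in the control file, and the special case for UNKNOWN disappears.</p>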
<p>A problem I ran into is that, using the default submission parameters, 5 out of 11 chromosome tasks failed because they ran out of memory (ironically, the largest chromosome was dumped successfully, probably because it ran on a high-memory node). I wanted to resubmit just those failed tasks without rerunning the 6 completed tasks. But as I said in the beginning, I didn&#x27;t find a straightforward way to resubmit those 5 tasks. The <tt>-t</tt> parameter does not accept a delimited list of tasks, unfortunately (it does let you do other nifty things, such as specifying a single contiguous range, e.g., <tt>3-8</tt>).</p>
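<p>One workaround that stays within what <tt>-t</tt> accepts, sketched here as an alternative rather than the approach this post ends up using, is to collapse the failed task IDs into contiguous ranges and submit one array job per range. The function name <tt>task_ranges</tt> is made up, the IDs are assumed to be given in ascending order, and the loop only prints the <tt>qsub</tt> commands instead of running them.</p>

```shell
#!/bin/sh
# task_ranges is a hypothetical helper: collapse ascending task IDs
# into "start-end" ranges, one per line.
task_ranges() {
    start=""
    prev=""
    for id in "$@"; do
        if [ -z "$start" ]; then
            start=$id
        elif [ "$id" -ne $((prev + 1)) ]; then
            echo "$start-$prev"    # gap found: close the current range
            start=$id
        fi
        prev=$id
    done
    if [ -n "$start" ]; then
        echo "$start-$prev"        # close the final range
    fi
}

# Print (rather than run) one submission command per range.
task_ranges 2 4 6 7 8 | while read -r range; do
    echo "qsub -t $range ./script.sh"
done
```

<p>For the failures described here, that means three separate submissions (<tt>2-2</tt>, <tt>4-4</tt> and <tt>6-8</tt>), so the tasks no longer share a single job, which is part of why a generated script is more appealing.</p>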
<p>The solution I came up with took advantage of another cool SGE feature that allows you to pass in a script through standard input. This lets you write a wrapper script that does some preprocessing before printing (typically via a heredoc) a shell script to standard output. Submitting then looks like this:</p>
<pre>
$ ./submit.sh | qsub
</pre>
<p>For this illustration, I had the following chromosomes fail to dump: 2, 4, 6, 7, and 8. To specify an arbitrary set of tasks to run, I used a bash array variable that held the failed chromosomes, then used the array length to set the array job parameters. The final script looks something like this:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;"><span style="color: #666666; font-style: italic;">#!/bin/bash</span>
&nbsp;
<span style="color: #007800;">chrs</span>=<span style="color: #7a0874; font-weight: bold;">&#40;</span><span style="color: #000000;">2</span> <span style="color: #000000;">4</span> <span style="color: #000000;">6</span> <span style="color: #000000;">7</span> <span style="color: #000000;">8</span><span style="color: #7a0874; font-weight: bold;">&#41;</span>
&nbsp;
<span style="color: #c20cb9; font-weight: bold;">cat</span> <span style="color: #cc0000; font-style: italic;">&lt;&lt;SCRIPT
#!/bin/bash
&nbsp;
#$ -N dump_chromosomes
#$ -t 1-${#chrs[*]}
&nbsp;
chrs=(${chrs[*]})
&nbsp;
chr=\${chrs[\$SGE_TASK_ID-1]}
&nbsp;
./dump.pl --region \$chr
SCRIPT</span></pre></div></div>

<p>This isn&#8217;t that pretty, but it gets the job done, no pun intended. Some sigils (<tt>$</tt>) had to be escaped to avoid interpolation. That&#8217;s because we have to distinguish between variables that are expanded while the wrapper executes and variable names that must remain intact so they can be expanded later, when the task runs.</p>
<p>The first declaration creates an array of chromosomes that need to be re-dumped (this could easily have been the command-line argument list, which would allow you to run the job as <tt>./submit.sh 2 4 6 7 8</tt>). The script then prints another script where the job array parameter is reconstructed using the length of the array (<tt>${#chrs[*]}</tt>). A slight inconvenience is that we also had to reconstruct the same array within the generated script so that the task can use it to dereference its corresponding chromosome.</p>
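<p>A variant that takes the chromosomes from the command line might look roughly like this. It is a sketch under a few assumptions: the function name <tt>make_job</tt> is invented, the <tt>-S /bin/bash</tt> line assumes bash lives at that path on the compute nodes (the generated script needs bash for its array), and the arguments are assumed to contain no whitespace.</p>

```shell
#!/bin/sh
# make_job is a hypothetical helper: print an SGE array-job script
# for the chromosomes given as arguments.
make_job() {
    # Unescaped $# and $* are expanded now, by the wrapper.
    # Escaped \$ sequences survive into the generated script.
    cat <<SCRIPT
#!/bin/bash
#\$ -N dump_chromosomes
#\$ -t 1-$#
#\$ -S /bin/bash

chrs=($*)
chr=\${chrs[\$SGE_TASK_ID-1]}
./dump.pl --region "\$chr"
SCRIPT
}

# Only emit a script when at least one chromosome was given.
if [ "$#" -gt 0 ]; then
    make_job "$@"
fi
```

<p>Saved as <tt>submit.sh</tt>, this would be run as <tt>./submit.sh 2 4 6 7 8 | qsub</tt>, or without the pipe to inspect the generated script first.</p>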
<p>This approach no longer treats <tt>$SGE_TASK_ID</tt> values as incidental chromosome names, but rather dereferences meaningful values out of an array (<tt>$chrs</tt>, in this example). This is one illustration of how job array task IDs can be dereferenced in a general way. Submitting jobs by a generated script also allows you, as a sanity check, to see the script that would be submitted (by not piping it into <tt>qsub</tt>).</p>
<p>So what have we learned today? Besides the fact that the maize genome is huge and mostly repetitive, there are various sleek ways to submit jobs to a compute farm using SGE. The SGE <tt>qsub</tt> command is very flexible and provides powerful ways of submitting &mdash; and resubmitting &mdash; parallel jobs.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.warelab.org/blog/?feed=rss2&amp;p=328</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Forecast: Cloudy</title>
		<link>http://www.warelab.org/blog/?p=307</link>
		<comments>http://www.warelab.org/blog/?p=307#comments</comments>
		<pubDate>Mon, 30 Nov 2009 18:00:59 +0000</pubDate>
		<dc:creator>Shiran Pasternak</dc:creator>
				<category><![CDATA[Big Data]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[high-performance computing]]></category>

		<guid isPermaLink="false">http://www.warelab.org/blog/?p=307</guid>
		<description><![CDATA[<p>David Dooling, of <a href="http://genome.wustl.edu/" title="The Genome Center at Washington University">the Genome Center at Washington University</a> and the blog <a href="http://www.politigenomics.com/" title="PolITiGenomics">PolITiGenomics</a>, has written <a href="http://www.politigenomics.com/2009/11/bioinformatics-and-cloud-computing.html/trackback" title="PolITiGenomics  &#187; Blog Archive   &#187; Bioinformatics and cloud computing">a thoughtful post about the nexus of cloud computing and bioinformatics</a>. Dooling does a fine job summarizing the cloud computing and Big Data conversation that&#8217;s happening in bioinformatics circles, so I won&#8217;t repeat it here.</p>

<p>The post essentially argues that bioinformaticians are &#8212; to mix metaphors &#8212; too quick to jump on the cloud computing bandwagon. It often makes more sense &#8212; and cents &#8212; to analyze genomic data locally (or privately). His cost-benefit analysis is spot on, but it applies only to the current state of the art. The state of bioinformatics is, however, changing very rapidly and the scale (pun very much intended) is likely to tip.</p>
]]></description>
			<content:encoded><![CDATA[<p>David Dooling, of <a href="http://genome.wustl.edu/" title="The Genome Center at Washington University">the Genome Center at Washington University</a> and the blog <a href="http://www.politigenomics.com/" title="PolITiGenomics">PolITiGenomics</a>, has written <a href="http://www.politigenomics.com/2009/11/bioinformatics-and-cloud-computing.html/trackback" title="PolITiGenomics  &raquo; Blog Archive   &raquo; Bioinformatics and cloud computing">a thoughtful post about the nexus of cloud computing and bioinformatics</a>. Dooling does a fine job summarizing the cloud computing and Big Data conversation that&#8217;s happening in bioinformatics circles, so I won&#8217;t repeat it here.</p>
<p>The post essentially argues that bioinformaticians are &mdash; to mix metaphors &mdash; too quick to jump on the cloud computing bandwagon. It often makes more sense &mdash; and cents &mdash; to analyze genomic data locally (or privately). His cost-benefit analysis is spot on, but it applies only to the current state of the art. The state of bioinformatics is, however, changing very rapidly and the scale (pun very much intended) is likely to tip.</p>
<p><span id="more-307"></span></p>
<h2 id="cloud_and_high_performance_computing">Cloud and High-Performance Computing</h2>
<p>Cloud computing has become a bit of an overused catch-phrase, especially in bioinformatics. In the sense that it is a service, cloud computing is essential for the life sciences. It allows researchers to integrate data sets and infer latent relationships among them. Reference data sets like the human genome are made ubiquitously available on &#8220;the cloud,&#8221; and scientists are then able to query those data sets seamlessly against newly-procured experimental data.</p>
<p>The term &#8220;cloud computing&#8221; has been liberally applied to high-performance computing (HPC). Biological data is complex and requires fairly sophisticated algorithms to interrogate it in a reasonable timeframe. Fortunately, many bioinformatics use-cases, such as read-mapping, are <a href="http://en.wikipedia.org/wiki/Embarrassingly_parallel" title="Embarrassingly parallel - Wikipedia, the free encyclopedia">embarrassingly parallel</a> and hence scalable. The problem can be rapidly chopped into fragments, each analyzed in parallel, and then combined in the end. Other problems whose solution rests on complex interdependent data structures, such as de novo genome assembly, cannot be as easily decomposed.</p>
<p>Cloud computing has also become synonymous with the <a href="http://labs.google.com/papers/mapreduce.html">MapReduce</a> framework and <a href="http://hadoop.apache.org/">Hadoop</a>, its primary open-source implementation. This is somewhat unfortunate because Hadoop provides other benefits that make it suitable for high-performance distributed computing. <a href="http://sourceforge.net/apps/mediawiki/cloudburst-bio/index.php?title=CloudBurst">CloudBurst</a> and <a href="http://bowtie-bio.sourceforge.net/crossbow/index.shtml">Crossbow</a>, both developed in the same University of Maryland group, are the first published bioinformatics applications to use Hadoop. They smartly and effectively demonstrate the utility of MapReduce and cloud computing in bioinformatics. They leverage important Hadoop features such as task management, configurability, rapid prototyping, and fault-tolerance. But they have certain limitations. After reading the papers, I remain unconvinced that they are more efficient or scale better than an embarrassingly-parallel pipeline. In Crossbow, read-mapping is assigned to the map step while SNP-calling is assigned to the reduce step, but I&#8217;m not sure that this is the best fit of the MapReduce paradigm to the problem. Also, while the correctness of the pipelines is well-demonstrated for the data sets used, I doubt that their component applications (e.g., <a href="http://bowtie-bio.sourceforge.net/" title="Bowtie: An ultrafast, memory-efficient short read aligner">Bowtie</a>, the unfortunately-named <a href="http://soap.genomics.org.cn/soapsnp.html" title="SOAP :: Short Oligonucleotide Analysis Package">SOAPsnp</a>) are universally applicable. Making the read-mapping and SNP-calling configurable would be a step in the right direction.</p>
<p>The point is that CloudBurst and Crossbow are heralded as HPC applications (even though their performance is similar to other existing pipelines) when in fact their primary appeal is to cloud computing (in the strict definition), namely in that they are Software-as-a-Service (<a href="http://en.wikipedia.org/wiki/Software_as_a_service" title="Software as a service - Wikipedia, the free encyclopedia">SaaS</a>) or Platform-as-a-Service (<a href="http://en.wikipedia.org/wiki/Platform_as_a_service" title="Platform as a service - Wikipedia, the free encyclopedia">PaaS</a>). In the HPC sense, Dooling is correct to point out that analyzing data locally makes more economic sense in the near term. In the cloud computing sense, it is hard to measure the benefits, since many of them are intangible (for example, distributed access to biological data and analytics).</p>
<h2 id="it_resources">IT Resources</h2>
<p>According to Dooling&#x27;s numbers, &quot;unless you are just sequencing a few genomes, you are probably better off purchasing a (possibly single node) cluster.&quot; What his numbers don&#x27;t take into account is the overhead of running a (possibly single node) cluster. While the fixed cost of purchasing computer equipment might be manageable, especially compared to chemical reagents, the operational costs of running a data center are substantial. Computer equipment needs to be continually serviced, be it for software, security, or kernel patches, or for unscheduled maintenance. In addition, energy costs for running a data center are <a href="http://perspectives.mvdirona.com/2008/11/28/CostOfPowerInLargeScaleDataCenters.aspx" title="Perspectives - Cost of Power in Large-Scale Data Centers">high</a> and <a href="http://www.eweek.com/c/a/IT-Infrastructure/Data-Center-Power-Space-Costs-Will-Increase-in-2010-Gartner-295357/" title="Data Center Power, Space Costs Will Increase in 2010: Gartner">expected to increase in the near future</a>.</p>
<h2 id="data_storage">Data Storage and Localization</h2>
<p>Another major consideration for analyzing data, be it locally or on the cloud, is its shuttling and storage costs. An Illumina sequencing run can easily generate hundreds of gigabytes of compressed data. That data comprises both sequenced reads and quality information. The Crossbow paper reported that transferring 103 Gb of sequence data to the <a href="http://aws.amazon.com/ec2/" title="Amazon Elastic Compute Cloud (Amazon EC2)">Amazon EC2 cluster</a> for processing cost $28, which is about a third of the cost to process it ($85). It also took 1 hour and 15 minutes for the transfer. That effectively increases the wall time for a Crossbow job, especially when compared to a locally-run instance (and that could be an optimistic estimate, since many of the smaller research groups &mdash; poised to be beneficiaries of cloud computing services like Crossbow &mdash; don&#x27;t have access to local servers with significant upload bandwidth).</p>
<p>On the other hand, the cost of marshaling data to and from the Amazon EC2 cluster as well as of long-term storage on the <a href="http://aws.amazon.com/s3/" title="Amazon Simple Storage Service (Amazon S3)">Amazon Simple Storage Service (S3)</a> already reflects the total data center costs that Amazon has to bear. Long-term storage of data in a local data center is expensive, especially when thinking about backup requirements and data availability. Data storage costs should not be considered separately from computer equipment purchases or the aforementioned IT resource considerations.</p>
<h2 id="scalability_of_sequencing">Sequencing Scale-up</h2>
<p>Next-gen &mdash; or rather now-gen &mdash; sequencing platforms like 454, Illumina, ABI SOLiD, and Helicos are pushing the bioinformatics and biodata envelope. More and more emphasis is placed on data wrangling and management. And these sequencers are becoming more affordable (e.g., improved protocols require fewer reagents) as the data they produce is getting larger (e.g., longer high-quality reads). But more than improved technology, the niche for these sequencers is growing rapidly. It didn&#x27;t take long to realize that these sequencers can be used for cheaper genotyping over microarray-based approaches. But now with <a href="http://en.wikipedia.org/wiki/RNA-Seq" title="RNA-Seq - Wikipedia, the free encyclopedia">RNA-seq</a> and <a href="http://en.wikipedia.org/wiki/Chip-Sequencing" title="Chip-Sequencing - Wikipedia, the free encyclopedia">ChIP-seq</a> applications, it&#x27;s becoming apparent that sequencing is not just for genome projects anymore.</p>
<p>In the meantime, the next generation (Gen. 3) of sequencing platforms, like Pacific Biosciences&#x27; <a href="http://www.pacificbiosciences.com/index.php?q=smrt-dna-sequencing" title="Pacific Biosciences | SMRT&trade; DNA Sequencing">SMRT technology</a> or the <a href="http://www.genomeweb.com/sequencing/ibm-enter-dna-sequencing-field-nanopore-technology" title="IBM to Enter DNA Sequencing Field with Nanopore Technology | GenomeWeb Daily News | Sequencing | GenomeWeb">IBM nanopore sequencer</a>, is already on the horizon, promising higher and faster throughput. These platforms are likely to bring yet another disruptive shift in DNA interrogation.</p>
<p>As Dooling aptly puts it, &quot;the more sequencing data people get, the more they want.&quot; As sequencing costs get lower and yields get higher, their niche in the scientific marketplace will grow. More and more research initiatives will rely on sequencing, at the very least for post-hoc validation. And as the data grows at an exponential rate, small research groups will discover when it&#x27;s too late that they aren&#x27;t equipped to handle the deluge. Cloud computing, and especially cloud storage, may in the very near future provide an option for long-term persistent storage of raw biological data. There will be some incurred costs (data upload and storage fees), but those costs will be minuscule compared to the local IT infrastructure that will be required to manage the data.</p>
<p><center>&hellip;</center></p>
<p>As things stand, the role of cloud computing in bioinformatics is fairly limited and undefined. There isn&#8217;t a complete suite of analytic utilities available for non-developers to manage and analyze their data on the cloud. As proposed by Dooling, it might make sense for a researcher to process all their data on a cheap commodity computer they can order on NewEgg.com.</p>
<p>But as things stand, change is the only constant. Bioinformatics is very much in flux, as the life sciences are quickly becoming a data science. In the future, small research projects will not only rely on high-throughput sequencing, but on bringing multiple systems-wide data sets to bear. Researchers will not be able to hire or train a developer as an afterthought. IT infrastructure is an absolute necessity at the inception of the project. The question is, where does it make most sense to manage IT resources, locally or remotely? The answer to that question will depend on an institution&#8217;s existing infrastructure. Large IT departments with a mature data center might be able to handle the capacity. But for small groups it makes more sense to outsource the IT infrastructure to a growing, scalable data center known as the cloud.</p>
<p>Applications like Crossbow illustrate that cloud computing can play a very important role in the life sciences. Over time robust cloud-based analytics and visualization tools will be made available not only to programmers but directly to scientists. Ubiquitous access to both data and tools will promote both good science and collaboration. That is a primary promise of cloud computing. A secondary promise is that high-throughput analytics will allow users to interrogate the data more rapidly. MapReduce frameworks obviate the need to develop robust pipelines and let users concentrate on specific queries. As more MapReducible bioinformatics tools such as Crossbow are brought online, the scientific paradigm is likely to shift from hypothesis-driven to hypothesis-generating.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.warelab.org/blog/?feed=rss2&amp;p=307</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Use of the ‘chunk’ coord_system in Ensembl core</title>
		<link>http://www.warelab.org/blog/?p=298</link>
		<comments>http://www.warelab.org/blog/?p=298#comments</comments>
		<pubDate>Thu, 14 May 2009 12:39:35 +0000</pubDate>
		<dc:creator>William Spooner</dc:creator>
				<category><![CDATA[Ensembl]]></category>
		<category><![CDATA[database]]></category>

		<guid isPermaLink="false">http://www.warelab.org/blog/?p=298</guid>
		<description><![CDATA[This is a follow-up post to Loading an Ensembl species database from scratch, in particular the issue of the &#8216;chunk&#8217; coordinate_system that has traditionally been used to coerce long sequences into the Ensembl core database. Steve Searle summarises the subject as follows:
There is a limit in the mysql client and server for the maximum size for [...]]]></description>
			<content:encoded><![CDATA[<p>This is a follow-up post to <a href="http://www.warelab.org/blog/?p=218">Loading an Ensembl species database from scratch</a>, in particular the issue of the &#8216;chunk&#8217; coordinate_system that has traditionally been used to coerce long sequences into the Ensembl core database. Steve Searle summarises the subject as follows:</p>
<blockquote><p>There is a limit in the mysql client and server for the maximum size for network transfer. This is one of the main reasons for chunking. We are moving away from running analyses on the seq level coord system much anymore. Having features on toplevel coords makes access much faster, so generating everything on that coord system makes sense. Doing this makes the underlying chunking fairly irrelevant. The exception would be if you wanted to map to some third coord system through the chunks, but the reason chunks are usually made is because there is no real assembly information (an agp), just the final assembled sequences.</p></blockquote>
<p>The example that I am working on right now is the Arabidopsis lyrata genome, which has a maximum scaffold length of 33 Mbase. So what is the ideal chunk length (if any) in this case?</p>
<p>Having looked at the Ensembl schema (the dna.sequence field is LONGTEXT), the MySQL manual, and the configuration of the public Ensembl database, it seems that the maximum realistic length of a sequence is 16 Mbase. Although I could increase my mysqld&#8217;s max_allowed_packet to encompass my 33 Mbases, I feel this would be unwise (and incompatible with most people&#8217;s clients), so I&#8217;m stuck with chunking for this genome.</p>
<p>One problem with chunking is that it introduces artificial sequence breaks in downstream analyses that operate on the seqlevel coord_system (input_type_id=CONTIG, in ensembl-pipeline parlance). I cannot seem to find a list of which analyses use which input types, but I assume that algorithms which have been designed to run on sequences such as clones will use seqlevel; examples include RepeatMasker (max sequence length 4 Mbase), GenScan (max 5 Mbase) and so on.</p>
<p>I have always blindly used a chunk size of 100 kbase in the past, but I&#8217;m now thinking that 1 Mbase may be more appropriate. We&#8217;ll have to wait and see&#8230;</p>
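<p>To get a feel for what a given chunk size means in practice, the chunk boundaries can be worked out with a few lines of shell. This is only a sketch: the function name <tt>chunk_coords</tt> is made up, coordinates are assumed to be 1-based and inclusive, and 33,000,000 is a round stand-in for the longest scaffold rather than its exact length.</p>

```shell
#!/bin/sh
# chunk_coords is a hypothetical helper: print 1-based, inclusive
# "start end" coordinates for each chunk of a sequence.
#   $1 = sequence length, $2 = chunk size
chunk_coords() {
    length=$1
    size=$2
    start=1
    while [ "$start" -le "$length" ]; do
        end=$((start + size - 1))
        if [ "$end" -gt "$length" ]; then
            end=$length            # the last chunk may be shorter
        fi
        echo "$start $end"
        start=$((end + 1))
    done
}

# Count the chunks for a 33 Mbase scaffold at two candidate chunk sizes.
chunk_coords 33000000 1000000 | wc -l    # 1 Mbase chunks
chunk_coords 33000000 100000 | wc -l     # 100 kbase chunks
```

<p>At 1 Mbase the longest scaffold is split into 33 chunks, against 330 at 100 kbase, so the larger size introduces a tenth as many artificial breaks while staying under the analysis limits quoted above.</p>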
]]></content:encoded>
			<wfw:commentRss>http://www.warelab.org/blog/?feed=rss2&amp;p=298</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Plant Ontology Database #0409 Release</title>
		<link>http://www.warelab.org/blog/?p=292</link>
		<comments>http://www.warelab.org/blog/?p=292#comments</comments>
		<pubDate>Mon, 20 Apr 2009 19:33:27 +0000</pubDate>
		<dc:creator>shuly</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.warelab.org/blog/?p=292</guid>
		<description><![CDATA[Plant Ontology Consortium (POC) is excited to announce the release #0409 (April 2009) of the Plant Ontology Database.
In this release, we bring you 26625 gene annotations from TAIR, Gramene, SGN and MaizeGDB, 8558 QTL annotations from Gramene, 9832 germplasm associations from SGN, MaizeGDB and NASC.
The following number of annotations were added for the first time: [...]]]></description>
			<content:encoded><![CDATA[<p>Plant Ontology Consortium (POC) is excited to announce the release #0409 (April 2009) of the Plant Ontology Database.</p>
<p>In this release, we bring you 26625 gene annotations from TAIR, Gramene, SGN and MaizeGDB, 8558 QTL annotations from Gramene, 9832 germplasm associations from SGN, MaizeGDB and NASC.</p>
<p>The following number of annotations were added for the first time: 16016 gene annotations from TAIR, 3 gene annotations from Gramene, 2 gene annotations from SGN and 3928 germplasm annotations from MaizeGDB.</p>
<p>For genes curated by TAIR and SGN, you may also find links to their Gene Ontology (GO) pages through PO browser.</p>
<p>The new ontology files and the database dump are available for download.</p>
<p>To submit plant ontology term requests, we encourage researchers to use SourceForge PO tracker.</p>
<p>The Plant Ontology Consortium<br />
web: http://www.plantontology.org<br />
e-mail: po-dev at plantontology.org</p>
<p>The project is funded by National Science Foundation, USA, (Grant No. DBI-0703908)</p>
]]></content:encoded>
			<wfw:commentRss>http://www.warelab.org/blog/?feed=rss2&amp;p=292</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A quick coding style suggestion</title>
		<link>http://www.warelab.org/blog/?p=270</link>
		<comments>http://www.warelab.org/blog/?p=270#comments</comments>
		<pubDate>Tue, 14 Apr 2009 20:01:13 +0000</pubDate>
		<dc:creator>Jim Thomason</dc:creator>
				<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://www.warelab.org/blog/?p=270</guid>
		<description><![CDATA[I am going to keep this short and sweet and wax poetic about a certain programming idiom that irks me to no end, and provide my preferred alternative.
Now, most Perl programmers know that it&#8217;s very useful to pass in a hash (or hashref!) of parameters for functions that are very very long. It&#8217;s useful and [...]]]></description>
			<content:encoded><![CDATA[<p>I am going to keep this short and sweet and wax poetic about a certain programming idiom that irks me to no end, and provide my preferred alternative.</p>
<p>Now, most Perl programmers know that it&#8217;s very useful to pass in a hash (or hashref!) of parameters for functions that are very very long. It&#8217;s useful and keeps things tidy.</p>
<p><span id="more-270"></span><br />
Compare, which would you rather read:</p>
<p><code></p>

<div class="wp_syntax"><div class="code"><pre class="perl" style="font-family:monospace;">function<span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">0</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">'true'</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">'apple'</span><span style="color: #339933;">,</span> <span style="color: #000066;">undef</span><span style="color: #339933;">,</span> <span style="color: #000066;">undef</span><span style="color: #339933;">,</span> <span style="color: #000066;">undef</span><span style="color: #339933;">,</span> <span style="color: #000066;">undef</span><span style="color: #339933;">,</span> <span style="color: #000066;">undef</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">'porcupine'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p></code></p>
<p>or</p>
<p><code></p>

<div class="wp_syntax"><div class="code"><pre class="perl" style="font-family:monospace;">function<span style="color: #009900;">&#40;</span>
    <span style="color: #ff0000;">'reptile_count'</span>  <span style="color: #339933;">=&gt;</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">,</span>
    <span style="color: #ff0000;">'in_forest'</span>      <span style="color: #339933;">=&gt;</span> <span style="color: #ff0000;">'true'</span><span style="color: #339933;">,</span>
    <span style="color: #ff0000;">'preferred_fruit'</span><span style="color: #339933;">=&gt;</span> <span style="color: #ff0000;">'apple'</span><span style="color: #339933;">,</span>
    <span style="color: #ff0000;">'alternative'</span>    <span style="color: #339933;">=&gt;</span> <span style="color: #000066;">undef</span><span style="color: #339933;">,</span>
    <span style="color: #ff0000;">'nomenclature'</span>   <span style="color: #339933;">=&gt;</span> <span style="color: #000066;">undef</span><span style="color: #339933;">,</span>
    <span style="color: #ff0000;">'predator'</span>       <span style="color: #339933;">=&gt;</span> <span style="color: #000066;">undef</span><span style="color: #339933;">,</span>
    <span style="color: #ff0000;">'food_source'</span>    <span style="color: #339933;">=&gt;</span> <span style="color: #000066;">undef</span><span style="color: #339933;">,</span>
    <span style="color: #ff0000;">'known_zoos'</span>     <span style="color: #339933;">=&gt;</span> <span style="color: #000066;">undef</span><span style="color: #339933;">,</span>
    <span style="color: #ff0000;">'animal'</span>         <span style="color: #339933;">=&gt;</span> <span style="color: #ff0000;">'porcupine'</span><span style="color: #339933;">,</span>
<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p></code></p>
<p>Even better, all of those undefs are superfluous, since a key that isn&#8217;t passed in already reads as undef. We don&#8217;t need to pass them in. Well, unless of course there&#8217;s a default value that we instead want to explicitly set to undef, but more on that in a second.</p>
<p><code></p>

<div class="wp_syntax"><div class="code"><pre class="perl" style="font-family:monospace;">function<span style="color: #009900;">&#40;</span>
    <span style="color: #ff0000;">'reptile_count'</span>   <span style="color: #339933;">=&gt;</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">,</span>
    <span style="color: #ff0000;">'in_forest'</span>       <span style="color: #339933;">=&gt;</span> <span style="color: #ff0000;">'true'</span><span style="color: #339933;">,</span>
    <span style="color: #ff0000;">'preferred_fruit'</span> <span style="color: #339933;">=&gt;</span> <span style="color: #ff0000;">'apple'</span><span style="color: #339933;">,</span>
    <span style="color: #ff0000;">'animal'</span>          <span style="color: #339933;">=&gt;</span> <span style="color: #ff0000;">'porcupine'</span><span style="color: #339933;">,</span>
<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p></code></p>
<p>And it&#8217;s much tidier. Just by looking at the function, you can probably guess that the animal you&#8217;re dealing with is a porcupine, he&#8217;s in the forest, he likes apples, and there are no reptiles about. Good luck figuring that out from the original version of the function with all the unnamed parameters. Very tidy, very neat, very compact.</p>
<p>But then, almost inevitably, I see functions that have huge blocks of code at the top of them to copy out the hash arguments and set defaults.</p>
<p><code></p>

<div class="wp_syntax"><div class="code"><pre class="perl" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">sub</span> function <span style="color: #009900;">&#123;</span>
    <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">%args</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">@_</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$reptile_count</span>   <span style="color: #339933;">=</span> <span style="color: #0000ff;">$args</span><span style="color: #009900;">&#123;</span><span style="color: #ff0000;">'reptile_count'</span><span style="color: #009900;">&#125;</span>         <span style="color: #339933;">||</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span>
    <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$in_forest</span>       <span style="color: #339933;">=</span> <span style="color: #0000ff;">$args</span><span style="color: #009900;">&#123;</span><span style="color: #ff0000;">'in_forest'</span><span style="color: #009900;">&#125;</span>             <span style="color: #339933;">||</span> <span style="color: #ff0000;">'false'</span><span style="color: #339933;">;</span>
    <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$preferred_fruit</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">$args</span><span style="color: #009900;">&#123;</span><span style="color: #ff0000;">'preferred_fruit'</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">;</span>
    <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$alternative</span>     <span style="color: #339933;">=</span> <span style="color: #0000ff;">$args</span><span style="color: #009900;">&#123;</span><span style="color: #ff0000;">'alternative'</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">;</span>
    <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$nomenclature</span>    <span style="color: #339933;">=</span> <span style="color: #0000ff;">$args</span><span style="color: #009900;">&#123;</span><span style="color: #ff0000;">'nomenclature'</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">;</span>
    <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$predator</span>        <span style="color: #339933;">=</span> <span style="color: #0000ff;">$args</span><span style="color: #009900;">&#123;</span><span style="color: #ff0000;">'predator'</span><span style="color: #009900;">&#125;</span>              <span style="color: #339933;">||</span> <span style="color: #ff0000;">'humans'</span><span style="color: #339933;">;</span>
    <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$food_source</span>     <span style="color: #339933;">=</span> <span style="color: #0000ff;">$args</span><span style="color: #009900;">&#123;</span><span style="color: #ff0000;">'food_source'</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">;</span>
    <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$known_zoos</span>      <span style="color: #339933;">=</span> <span style="color: #0000ff;">$args</span><span style="color: #009900;">&#123;</span><span style="color: #ff0000;">'known_zoos'</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">;</span>
    <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$animal</span>          <span style="color: #339933;">=</span> <span style="color: #0000ff;">$args</span><span style="color: #009900;">&#123;</span><span style="color: #ff0000;">'animal'</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">;</span>
    <span style="color: #339933;">.</span>
    <span style="color: #339933;">.</span>
    <span style="color: #339933;">.</span>
&nbsp;
    do_something<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$animal</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p></code></p>
<p>And it just gets worse as more and more arguments are added. The block grows gigantically and it&#8217;s just a huge messy wart. Line after line after line of useless setup information.</p>
<p>Plus, there&#8217;s a subtle issue which sometimes pops up. Take a look at the example function, for the predator variable. If no value is passed in, then it uses the default of &#8220;humans&#8221;. But what if you actually wanted to set it to undef? Say the porcupine actually has no predators (does it?). This function won&#8217;t even allow it. Admittedly, that may be a good thing. Or it may not. It can be a subtle bug to track down.</p>
<p>Secondly, tell me quickly, at a glance, which variables have defaults and what they are. You have to scan the whole block to see them. Admittedly, proper alignment of your defaults in a column on the right side will help, but still.</p>
<p>My preference is to just leave all of the passed-in arguments in a hash and be done with it. No need to copy out into separate variables. Sure, you have a smidgen more typing later in your function, but you also have the added advantage of knowing exactly when you&#8217;re operating with an argument to your function as opposed to a variable you created yourself.</p>
<p>Compare my version:</p>
<p><code></p>

<div class="wp_syntax"><div class="code"><pre class="perl" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">sub</span> function <span style="color: #009900;">&#123;</span>
    <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">%args</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">@_</span><span style="color: #339933;">;</span>
    <span style="color: #339933;">.</span>
    <span style="color: #339933;">.</span>
    <span style="color: #339933;">.</span>
    do_something<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$args</span><span style="color: #009900;">&#123;</span><span style="color: #ff0000;">'animal'</span><span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p></code></p>
<p>Done! Much simpler. &#8220;But wait!&#8221;, you protest. &#8220;You&#8217;re no longer setting the defaults!&#8221; Easily done. Make them part of your hash definition:</p>
<p><code></p>

<div class="wp_syntax"><div class="code"><pre class="perl" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">sub</span> function <span style="color: #009900;">&#123;</span>
    <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">%args</span> <span style="color: #339933;">=</span> <span style="color: #009900;">&#40;</span>
        <span style="color: #ff0000;">'reptile_count'</span> <span style="color: #339933;">=&gt;</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">,</span>
        <span style="color: #ff0000;">'in_forest'</span>     <span style="color: #339933;">=&gt;</span> <span style="color: #ff0000;">'false'</span><span style="color: #339933;">,</span>
        <span style="color: #ff0000;">'predator'</span>      <span style="color: #339933;">=&gt;</span> <span style="color: #ff0000;">'humans'</span><span style="color: #339933;">,</span>
        <span style="color: #0000ff;">@_</span>
    <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #339933;">.</span>
    <span style="color: #339933;">.</span>
    <span style="color: #339933;">.</span>
    do_something<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$args</span><span style="color: #009900;">&#123;</span><span style="color: #ff0000;">'animal'</span><span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p></code></p>
<p>Ta da! You end up with much cleaner, simpler code. At a glance, you can see at the top of your function just which arguments have defaults set for them.</p>
<p>As an added bonus, you can now explicitly set &#8220;predator&#8221; to undef and that&#8217;s what your function will get. Here, you&#8217;re interpreting a list as a hash, so whichever key definition occurs last wins. Early on, it sets &#8216;predator&#8217; to &#8216;humans&#8217; (the default), but then inside of @_ we may have passed in &#8216;predator&#8217; =&gt; undef, so it happily uses that instead.</p>
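<p>As a minimal sketch of that last-key-wins rule on its own (the sub name describe_animal and its return values are invented for this illustration):</p>

<div class="wp_syntax"><div class="code"><pre class="perl" style="font-family:monospace;">use strict;
use warnings;

# Hypothetical sub, named only for this illustration.
sub describe_animal {
    my %args = (
        'predator' => 'humans',   # default; a later duplicate key overrides it
        @_
    );
    return defined $args{'predator'} ? $args{'predator'} : 'none';
}

# No predator passed in, so the default survives: prints "humans"
print describe_animal('animal' => 'porcupine'), "\n";

# Explicit undef comes later in the list, so it wins: prints "none"
print describe_animal('animal' => 'porcupine', 'predator' => undef), "\n";</pre></div></div>
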
<p>Okay, but maybe you wanted to ignore keys specifically set to undef. We can still easily do this; we&#8217;re just going to define two separate hashes.</p>
<p>Finally:</p>
<p><code></p>

<div class="wp_syntax"><div class="code"><pre class="perl" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">sub</span> function <span style="color: #009900;">&#123;</span>
    <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">%defaults</span> <span style="color: #339933;">=</span> <span style="color: #009900;">&#40;</span>
        <span style="color: #ff0000;">'reptile_count'</span> <span style="color: #339933;">=&gt;</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">,</span>
        <span style="color: #ff0000;">'in_forest'</span>     <span style="color: #339933;">=&gt;</span> <span style="color: #ff0000;">'false'</span><span style="color: #339933;">,</span>
        <span style="color: #ff0000;">'predator'</span>      <span style="color: #339933;">=&gt;</span> <span style="color: #ff0000;">'humans'</span><span style="color: #339933;">,</span>
    <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">%args</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">@_</span><span style="color: #339933;">;</span>
&nbsp;
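    <span style="color: #666666; font-style: italic;">#use the default for any key that is missing or undef</span>
    <span style="color: #0000ff;">%args</span> <span style="color: #339933;">=</span> <span style="color: #000066;">map</span> <span style="color: #009900;">&#123;</span><span style="color: #0000ff;">$_</span><span style="color: #339933;">,</span> <span style="color: #000066;">defined</span> <span style="color: #0000ff;">$args</span><span style="color: #009900;">&#123;</span><span style="color: #0000ff;">$_</span><span style="color: #009900;">&#125;</span> <span style="color: #339933;">?</span> <span style="color: #0000ff;">$args</span><span style="color: #009900;">&#123;</span><span style="color: #0000ff;">$_</span><span style="color: #009900;">&#125;</span> <span style="color: #339933;">:</span> <span style="color: #0000ff;">$defaults</span><span style="color: #009900;">&#123;</span><span style="color: #0000ff;">$_</span><span style="color: #009900;">&#125;</span><span style="color: #009900;">&#125;</span> <span style="color: #009900;">&#40;</span><span style="color: #000066;">keys</span> <span style="color: #0000ff;">%defaults</span><span style="color: #339933;">,</span> <span style="color: #000066;">keys</span> <span style="color: #0000ff;">%args</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>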
    <span style="color: #339933;">.</span>
    <span style="color: #339933;">.</span>
    <span style="color: #339933;">.</span>
    do_something<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$args</span><span style="color: #009900;">&#123;</span><span style="color: #ff0000;">'animal'</span><span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p></code></p>
<p>In my opinion, the resulting code is substantially tidier. You have a small block of defaults at the top, and the rest of the function specifically identifies which variables were arguments to begin with. No big, messy, superfluous wart to wade through.</p>
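<p>If several functions want the same undef-ignoring merge, it could be pulled out into a helper. This is only a sketch: merge_defaults and predator_for are made-up names, not something from a module.</p>

<div class="wp_syntax"><div class="code"><pre class="perl" style="font-family:monospace;">use strict;
use warnings;

# Hypothetical helper: start from the defaults, then copy over every
# passed-in value that is actually defined.
sub merge_defaults {
    my ($defaults, %args) = @_;
    my %merged = %$defaults;
    for my $key (keys %args) {
        $merged{$key} = $args{$key} if defined $args{$key};
    }
    return %merged;
}

# Hypothetical caller, standing in for the function in the post.
sub predator_for {
    my %args = merge_defaults(
        { 'reptile_count' => 0, 'in_forest' => 'false', 'predator' => 'humans' },
        @_
    );
    return $args{'predator'};
}

print predator_for('animal' => 'porcupine', 'predator' => undef), "\n";   # undef is ignored, default is used
print predator_for('animal' => 'porcupine', 'predator' => 'owls'), "\n";  # a real value replaces the default</pre></div></div>

<p>The defaults still sit in one small block at the top of the caller, and everything after that reads from %args as before.</p>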
]]></content:encoded>
			<wfw:commentRss>http://www.warelab.org/blog/?feed=rss2&amp;p=270</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Gramene goings on, etc.</title>
		<link>http://www.warelab.org/blog/?p=252</link>
		<comments>http://www.warelab.org/blog/?p=252#comments</comments>
		<pubDate>Thu, 09 Apr 2009 22:03:16 +0000</pubDate>
		<dc:creator>Ken Youens-Clark</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.warelab.org/blog/?p=252</guid>
		<description><![CDATA[In February Gramene released our 29th build.  I won&#8217;t go into details here because you can read our release notes.  Shortly afterwards I got a chance to present a poster on said release at CSHL&#8217;s &#8220;Plant Genomes&#8221; meeting.  It was nice to meet some of our users and listen to the many [...]]]></description>
			<content:encoded><![CDATA[<p>In February Gramene released <a href="http://www.gramene.org/db/help?state=current_release_notes">our 29th build</a>.  I won&#8217;t go into details here because you can read our release notes.  Shortly afterwards I got a chance to present a poster on said release at CSHL&#8217;s &#8220;Plant Genomes&#8221; meeting.  It was nice to meet some of our users and listen to the many interesting talks.  </p>
<p>
Since then, I&#8217;ve been doing lots of things.  In no particular order:<span id="more-252"></span></p>
<ul>
<li>Working fairly fruitlessly on bringing our new <a href="http://blast.gramene.org">blast.gramene.org</a> machine online
<li>Answering user email, fixing CMap and Mart bugs
<li>Hacking a bit on <a href="http://sqlfairy.sourceforge.net">SQL::Translator</a> to help create <a href="http://svn.warelab.org/gramene/trunk/schemas/">Gramene schema ER diagrams</a>
<li>Thinking a lot about how we could improve Gramene&#8217;s look and feel.  This was the subject of a presentation to the group, one I&#8217;ll revise for the upcoming Gramene retreat.
<li>Reading <a href="http://www.amazon.com/Managing-Humans-Humorous-Software-Engineering/dp/159059844X/ref=pd_bbs_sr_1?ie=UTF8&#038;s=books&#038;qid=1239305749&#038;sr=8-1">a book on project management</a> and considering how I could improve.  Apparently I should be having regular one-on-one meetings with people &#8212; revolutionary!
<li>Writing the 2008-9 Gramene grant summary.  This has been a nice exercise for me as I go over <a href="http://gwiki.gramene.org/Gramene_Experimental_Plan_2007">the Gramene experimental plan</a> and consider what we promised to do, what we&#8217;ve done, what we&#8217;ve got left.
<li>Polishing a brief paper on <a href="http://gmod.org/cmap">CMap</a> for the journal <a href="http://bioinformatics.oxfordjournals.org/">Bioinformatics</a>
<li>I created a <a href="http://www.youtube.com/watch?v=rO3vRVDZuKk">test video</a> of a Gramene tutorial I drafted and solicited opinions.  It&#8217;s decent, but needs much improvement.  I&#8217;ll tackle this again soon, breaking the 10+ minute intro into many smaller, 2-3 minute intros on each section.  I will increase the font size and decrease the screen resolution to try to make the text more readable and will also use <a href="http://www.omnigroup.com/applications/omnidazzle/">Omnidazzle, a free mouse highlighter</a>, to make it easier to see my pointer.  Sheldon asked me what software I used to record.  I purchased (for $99) <a href="http://www.telestream.net/screen-flow/overview.htm">Screenflow</a>, which I found relatively easy to use.  I should also point out that, though I don&#8217;t really like the sound of my own voice (few do, I know), I believe I sound relatively decent because I used a <a href="http://pro-audio.musiciansfriend.com/product/AudioTechnica-PRO-31-Cardioid-Dynamic-Microphone?sku=270609">decent microphone</a> run through an <a href="http://pro-audio.musiciansfriend.com/product/ART-Tube-MP-Professional-Mic-PreampProcessor?sku=484020&#038;src=3WFRWXX&#038;ZYXSEM=0&#038;CAWELAID=26040298">inexpensive pre-amp</a>
</ul>
<p>For several months I&#8217;ve really been thinking about ways to radically update CMap.  My recent work on the CMap paper and answering some user email have made me want to do even more, but then this week I really got it in my head to try to incorporate <a href="http://mkweb.bcgsc.ca/circos/">Circos</a> output into CMap.  Looking into the Circos code, I think it will take some refactoring since all the logic is in a single Perl script.  This will need to be moved out into a module to make it easier to call it from other code.  Since Shiran also wants to create Circos views from Ensembl, perhaps we could work together to create an API for Circos.  We&#8217;ve each contacted the author, <a href="http://mkweb.bcgsc.ca/">Martin Krzywinski</a> at the BC Cancer Agency (Sheldon&#8217;s old employer), and we&#8217;re waiting to hear how we can move forward on contributing to that.  It would be nice to see perhaps a <a href="http://code.google.com/">Google Code</a> project set up with SVN, Wiki, etc.  I could be really excited to contribute to such a project.  I will also add that I was very happy to see that Martin apparently found the <a href="http://sqlfairy.sourceforge.net">SQLFairy</a> on his own and thought that Circos would be another cool way to visualize schemas, too, so he created <a href="http://mkweb.bcgsc.bc.ca/schemaball">Schemaball</a>.  It says it&#8217;s limited to MySQL, so I wouldn&#8217;t mind turning that around and creating a Circos producer module for SQL::Translator so that this would be available for any database syntax.  First, though, I need Circos to have an API, and that will require refactoring the code.</p>
<p>Anyway, here&#8217;s just one idea how Circos could radically improve CMap views.  Take a <a href="http://dev.gramene.org/db/cmap/matrix?show_matrix=1;map_set_acc=maize-bins;link_map_set_acc=grjp2008a;">typical, boring table from the correspondence matrix</a> showing how two map sets compare to each other and create this view:</p>
<p><img src="http://www.warelab.org/blog/wp-content/uploads/2009/04/circos.png" alt="Circos tabular view" /></p>
<p>Pretty neat, I think.  I could see making Circos an output option for the matrix and for map comparisons.  At first I got excited thinking I might find a way to use Circos for all CMap&#8217;s graphical output since it is cool and allows multi-way comparisons as opposed to CMap&#8217;s limitation that maps can only be compared to those on their immediate right or left, but then I realized that the standard CMap view does have value such as when you want to align contigs to a reference map for order and orientation.  Also, Shiran seems to think that performance with Circos might be a tad slow for dynamic queries, so it would be extra eye-candy that users would wait on.  Lastly, there is the issue of making the image clickable with image maps.  Circos doesn&#8217;t create anything like that, and, in fact, it seems like a very tricky thing to do given that the regions are very curvy.</p>
<p>Along with creating new, exciting visual output, I&#8217;d love to completely rewrite the user interface for CMap.  It&#8217;s time for it to have real Web 2.0 goodness.  I&#8217;ve wondered if I should write it in Java as an applet or a separate download, but my colleague on the Gramene project, Terry, feels that this has been a hindrance to people accepting his <a href="http://www.gramene.org/diversity/gramene_gdpc.html">GDPC tool</a> on Gramene.  While researching other ideas, I happened across the <a href="http://www.chromeexperiments.com/">Chrome Experiments</a> website and was thoroughly intrigued with what you can do nowadays with Javascript + HTML5 + Canvas + SVG (Jim pointed out all the technologies at work there).  I&#8217;m thinking that&#8217;s gotta be the way to go.  Whatever direction I choose, what I&#8217;d like is a much more dynamic comparative map viewer, from the form to the image itself.  I would love to be able to update just a part of the image, e.g., if the user wanted to hide all the ESTs on a map, I could just redraw that one map rather than the whole image, but this might also entail redrawing the correspondence lines to neighboring maps, so I&#8217;m wondering if it would be possible to create the image in layers such as background, map, correspondences, legend, etc.</p>
<p>Another reason I would like to overhaul CMap is the performance issue.  I&#8217;ve learned a lot since I originally designed the schema, and I&#8217;d like a do-over on several parts where I think I could make things much faster (particularly in how I store the feature aliases).  Performance is not terrible anymore on the <a href="http://www.gramene.org/db/cmap/viewer">Gramene site</a> since we considerably lowered the number of features we put on maps, but I would really love to see it be able to handle at least 1 million features per map &#8212; at present we try to keep it to under 50-100K.  </p>
<p>At the heart of CMap is the idea that two things have a correspondence, they are related in some manner be it a sequence similarity or the same reagent being used in two experiments or whatnot.  Modeling this efficiently has never been easy.  Here is the core of the schema that shows how it&#8217;s currently modeled:</p>
<p><img src="http://www.warelab.org/blog/wp-content/uploads/2009/04/foo.png" alt="CMap schema" /></p>
<p>You can see in the feature/feature_correspondence/correspondence_evidence/correspondence_lookup area that things get a little circular, and that&#8217;s usually a bad thing.  My first pass at this (oh, so many, many years ago, hacking on my little white iBook while riding the LIRR &#8212; my, the things I remember) had just the correspondence table, but then I realized that I had to have a table that showed the relationship going in both directions, hence the lookup table that says both &#8220;A->B&#8221; and &#8220;B->A&#8221; such that there are two lookup records for each correspondence.  Ben came along and decided that, if the data was going to have to be denormalized to handle web queries, it might as well have mapping information in it in order to decrease the number of joins required, so he added the start/stop/type/map fields.  Each correspondence can have one or more evidences.</p>
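<p>A toy in-memory sketch of that two-lookup-records-per-correspondence idea, in Perl. The hash layout and the names add_correspondence and partners_of are invented for illustration; they are not the actual CMap tables, which also carry the start/stop/type/map fields and the evidence.</p>

<div class="wp_syntax"><div class="code"><pre class="perl" style="font-family:monospace;">use strict;
use warnings;

# Toy stand-in for the lookup table:
# feature id -> { other feature id -> correspondence id }
my %lookup;

# Hypothetical loader: store one record per direction, so that either
# feature can be used as the starting point of a query.
sub add_correspondence {
    my ($corr_id, $feature_a, $feature_b) = @_;
    $lookup{$feature_a}{$feature_b} = $corr_id;   # "A->B"
    $lookup{$feature_b}{$feature_a} = $corr_id;   # "B->A"
    return;
}

# Hypothetical query: every feature that corresponds to the given one.
sub partners_of {
    my ($feature) = @_;
    my @partners = keys %{ $lookup{$feature} || {} };
    @partners = sort @partners;
    return @partners;
}

add_correspondence(1, 'A', 'B');
add_correspondence(2, 'A', 'C');

print join(',', partners_of('A')), "\n";   # prints "B,C"
print join(',', partners_of('B')), "\n";   # prints "A"</pre></div></div>

<p>The cost is the same one described above: every correspondence is stored twice, in exchange for never having to check both columns when looking something up.</p>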
<p>To use these tables, I have some set of features in a region of a given reference map, and I want to find all the other maps that have features with correspondences optionally including or excluding some set of evidence criteria.  I want the species and name of the map sets, the map names, and the number of correspondences.  In the many times I&#8217;ve thought about how I could simplify or denormalize this schema further, I&#8217;ve never really come up with a better solution.  I can&#8217;t see a way to pre-calculate anything.  Every time a user crops a map or excludes some feature types, I have to go back and requery, joining millions of rows across several tables to figure out the choices I can present in the comparative maps menu.  If nothing else, I&#8217;ve thought about binning the features such that I could perhaps at least get rid of the start/stop range queries, which can be a killer for performance.  Another option I&#8217;ve considered is getting rid of the evidence table.  I know this removes some flexibility (e.g., only show me correspondences based on BLAST hits with a score greater than &#8220;X&#8221;), but maybe it&#8217;s not worth it &#8212; or maybe I move it directly into the correspondence record itself, and each evidence gives you a new correspondence record.  No, that would make it hard to count unique correspondences.  Anyway, you see how complicated this can get.</p>
<p>I&#8217;ve done some looking about on <a href="http://scholar.google.com/scholar?q=modeling+pairwise+relationships&#038;hl=en&#038;lr=&#038;btnG=Search">modeling pairwise relationships</a> but haven&#8217;t gotten far.  I&#8217;d love to see how <a href="http://www.facebook.com">Facebook</a> models &#8220;friends&#8221; &#8212; I know they have several million users, each of which can have hundreds of friends, so how do they efficiently store and query this?  If anyone has any ideas for a radically improved version of how to show that a thing of one type in a particular location is related to something else, I&#8217;d love to hear about it.  Something that is as simple as possible, but no simpler would be perfect.</p>
<p>That is all.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.warelab.org/blog/?feed=rss2&amp;p=252</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>I have a suggestion for you</title>
		<link>http://www.warelab.org/blog/?p=243</link>
		<comments>http://www.warelab.org/blog/?p=243#comments</comments>
		<pubDate>Wed, 21 Jan 2009 20:10:02 +0000</pubDate>
		<dc:creator>Jim Thomason</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.warelab.org/blog/?p=243</guid>
		<description><![CDATA[I&#8217;m a little bit behind the times. I was supposed to post something up here last week, but was so wrapped up in actually working on what I was going to write about that I completely forgot.
Gramene is going to be rolling out an autocomplete feature to offer suggestions to users sometime in the near [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m a little bit behind the times. I was supposed to post something up here last week, but was so wrapped up in actually working on what I was going to write about that I completely forgot.</p>
<p><a href = "http://www.gramene.org/">Gramene</a> is going to be rolling out an autocomplete feature to offer suggestions to users sometime in the near future. You can sample the wonderfully suggestive goodness <a href = "http://dev.gramene.org/db/searches/quick_search">here</a> until we go live.</p>
<p><span id="more-243"></span><br />
The state of javascript in the world today continuously amazes me. I started writing javascript in the fairly early days. I go back to Netscape 2 with it and got seriously cranking developing things around 1997. I largely abandoned it in early 1999 when I got rather irate at the current implementations as I tried to write a clone of Pac-Man in it. I reached the point where Pac-Man would run around the maze and eat dots, but as I started to implement the ghosts I ran into a stumbling block &#8211; when the timers that controlled the ghosts were global objects, it ran fine. When they were attributes of the ghost object (so the ghost could move itself), then performance ground to a crawl. I got so annoyed at the arbitrariness of it all that I shelved javascript and didn&#8217;t look at it for a few years.</p>
<p>When I finally picked it up again, I was amazed to see that the current state of the world was actually all of the magical things that I&#8217;d been promised many years before. A consistent, easy to understand <a href = "http://en.wikipedia.org/wiki/Document_object_model">Document Object Model</a>. The ability to <a href = "http://en.wikipedia.org/wiki/AJAX">independently update parts of the page</a>. <a href = "http://en.wikipedia.org/wiki/DHTML">Dynamic</a> pages that actually work. I was hooked again.</p>
<p>So. To Gramene. I had this bug kicking around for months but never took the time to delve into it. Conceptually it&#8217;s very easy &#8211; monitor the text boxes the user is typing in. Whenever they let go of the keyboard key, fire off an event to a handler. That&#8217;ll look at what&#8217;s typed in, and bundle it off to ship off to the server. The server will take that information and look it up&#8230;somehow&#8230;on the server side to get a list of suggestions. Those suggestions go back to the client and pop up in a little box below the text. Simple!</p>
<p>Mostly. At first I wanted to use Ken&#8217;s quick search routines and database, but that seemed like it might be slow for our purposes. And it didn&#8217;t have quite the information I wanted to see either. So he and I brainstormed on it and came up with a vocabulary list concept. We set up lists of vocabulary words for individual parts of the site &#8211; after all, if you&#8217;re typing in a text box on a Gene search page, you probably don&#8217;t want to see QTL information popping up. Ken was then nice enough to go populate these vocabulary lists for me. For those keeping score at home, we currently have 3,649,306 distinct vocabulary terms spread across 54 different vocabulary lists.</p>
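<p>As a rough illustration of the vocabulary list idea, here is a minimal sketch of a prefix lookup that is scoped to one list. It is not the Gramene implementation: the <code>vocabularies</code> object, the sample terms, and the <code>suggestFrom</code> function are all hypothetical names made up for this example.</p>

```javascript
// Hypothetical vocabulary lists keyed by site section. The sample terms are
// invented for illustration; the real lists live in a database on the server.
const vocabularies = {
  genes: ["waxy", "wx1", "wall-associated kinase", "sd1"],
  qtl: ["water stress", "plant height", "wax content"],
};

// Return up to `limit` terms from one vocabulary list that start with the typed text.
function suggestFrom(listName, typed, limit = 10) {
  const terms = vocabularies[listName] || []; // unknown list: no suggestions
  const prefix = typed.trim().toLowerCase();
  if (prefix === "") return [];
  return terms
    .filter((term) => term.toLowerCase().startsWith(prefix))
    .slice(0, limit);
}
```

<p>With these sample lists, typing &#8220;wa&#8221; on a gene search page would only ever offer gene terms, and the same text on a QTL page would only offer QTL terms.</p>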
<p>A few wrapper scripts and that part was done. The AJAX to hand off to the server side was straightforward. I&#8217;d written some AJAX routines for myself a few years back, and we integrated those into Gramene as part of the Help system. I needed to tweak a few options to make it operate properly in this case (for example &#8211; we only want to fire off one AJAX request at a time so as not to get suggestions back from the server out of sync), but it was basically done. The pop up box with the suggestions was just a DIV tag that was absolutely positioned slightly below the text box in question. We had a javascript function for that too. As a side note, finding the absolute position of a relative element on the page is still quite the nuisance, but I guess I can&#8217;t have everything I want yet.</p>
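<p>The &#8220;one AJAX request at a time&#8221; rule can be sketched roughly as follows. This is an assumption about one way to do it, not the routine we actually use: <code>makeSuggester</code>, <code>send</code> and <code>show</code> are hypothetical names, with <code>send</code> standing in for whatever makes the AJAX call and returns a Promise of suggestions.</p>

```javascript
// Sketch: allow one lookup in flight and remember only the newest text typed
// in the meantime, so replies can never arrive out of order.
// `send(text)` is a hypothetical function returning a Promise of suggestions;
// `show(suggestions)` is a hypothetical function that fills the pop up box.
function makeSuggester(send, show) {
  let busy = false;
  let waiting = null; // newest text typed while a request was in flight

  async function request(text) {
    if (busy) {
      waiting = text; // overwrite: older pending text is no longer interesting
      return;
    }
    busy = true;
    try {
      show(await send(text));
    } finally {
      busy = false;
    }
    if (waiting !== null) {
      const next = waiting;
      waiting = null;
      await request(next); // catch up with whatever the user typed last
    }
  }

  return request;
}
```

<p>If the user types &#8220;a&#8221;, &#8220;ab&#8221; and &#8220;abc&#8221; in quick succession, this sketch sends &#8220;a&#8221;, skips &#8220;ab&#8221; entirely, and sends &#8220;abc&#8221; once the first reply is back.</p>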
<p>I added in options for the user to disable suggestions, and then we were done! Time to blog it! This is when Ken fouled up the whole works by saying, &#8220;I&#8217;m having trouble autocompleting with more than one box in a form.&#8221; My response? &#8220;Crap.&#8221;</p>
<p>I had ever so carefully engineered and designed the thing and never really taken that into account. Because I wanted to keep the javascript simple to use, I set it up so that additional options for the script were set as hidden form elements. Which vocabulary list to use, for example. This worked fine when there was only one autocomplete box in the form. But once you had more than one, it blew up. After all, you might (as Ken did) want to define two separate search boxes which each use their own distinct vocabulary list. How would the server side script know which vocabulary list to use?</p>
<p>Further, how would it know which input box to look at?</p>
<p>I came up with 3 ways to solve the problem &#8211; the bad way, the good enough way (with some caveats), and the flexible way.</p>
<p>I skipped over the bad way, since it was, well, <i>bad</i>. I could tie the client side javascript and forms to the server side implementation, so that the functions the programmer calls in the browser do some particular parsing and analysis to submit the proper information to the server for the script to function. This is fairly easy to do, but it introduces a <a href = "http://en.wikipedia.org/wiki/Coupling_(computer_science)">tight coupling</a> into the code.</p>
<p>Up until this point, the javascript and the server side were completely independent. The javascript didn&#8217;t really care what the script was on the server side; it just handed off its parameters and got something back, which it stuck in a box. It could use the default implementation, or any other one. I felt that this flexibility was quite valuable &#8211; it allows us to easily deploy the code to our other projects and would allow us to easily open source it for anyone to use, regardless of integration with Gramene. Maybe you have your own fast word lookup on your server. You shouldn&#8217;t need to change your code much (other than formatting the output) to use the results.</p>
<p>The good enough way was to extend the javascript to accept a lot of additional parameters. This way, Ken could hardwire in a bunch of values to allow the client side code to remain independent of the server. I set up a bunch of default values in the javascript that handle the usual case. You can even still pass in your values with hidden form elements (and are encouraged to!). But you can also send in all the options you need to your javascript. Voila! We have multiple forms!</p>
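<p>A minimal sketch of how such layered options might be resolved is below. The option names (<code>vocabulary</code>, <code>maxSuggestions</code>, <code>script</code>), their default values, and the precedence order (defaults, then hidden form elements, then values passed to the javascript call) are assumptions for illustration, not a description of the actual Gramene code.</p>

```javascript
// Hypothetical defaults that cover the usual single-box case.
const DEFAULTS = { vocabulary: "all", maxSuggestions: 10, script: "/suggest" };

// Merge the three sources of options. Later sources win, so a value passed
// directly to the javascript call overrides a hidden form element, which in
// turn overrides the built-in default.
function resolveOptions(hiddenFields, passedIn) {
  return Object.assign({}, DEFAULTS, hiddenFields, passedIn);
}

// Two boxes in one form: the form-wide hidden element says "genes",
// but the second box is hardwired to the QTL list.
const geneBox = resolveOptions({ vocabulary: "genes" }, {});
const qtlBox = resolveOptions({ vocabulary: "genes" }, { vocabulary: "qtl" });
```

<p>Here <code>geneBox</code> and <code>qtlBox</code> end up with different vocabulary lists while still sharing every other default, which is the situation Ken ran into.</p>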
<p>I only consider that approach to be &#8220;good enough&#8221; because it has an irritating nuisance &#8211; what if the additional parameters we want to pass in are user configurable? Right now, it&#8217;s hardwired that one box searches the genes vocabulary list, and the other searches the QTL vocabulary list, for example. But it&#8217;s conceivable that we may want the user to choose which part of the site they&#8217;re searching, and then use that to populate the list. So no hardwiring values.</p>
<p>At that point, we need to analyze particular form elements and update arguments and hand around values. Or build new forms on the fly with carefully crafted definitions of what should be in them. Or any amount of other nonsense like that. Or maybe something even more clever that I didn&#8217;t think of. Regardless, I opted to defer that for another day until we actually needed the functionality.</p>
<p>As a result, after all that rambling about coupling and options and parsing, for most cases, it&#8217;s terribly simple to use.</p>

<div class="wp_syntax"><div class="code"><pre class="html4strict" style="font-family:monospace;"><span style="color: #009900;">&lt;<span style="color: #000000; font-weight: bold;">input</span></span>
<span style="color: #009900;">   <span style="color: #000066;">type</span> <span style="color: #66cc66;">=</span> <span style="color: #ff0000;">&quot;text&quot;</span></span>
<span style="color: #009900;">   <span style="color: #000066;">name</span> <span style="color: #66cc66;">=</span> <span style="color: #ff0000;">&quot;search_for&quot;</span></span>
<span style="color: #009900;">   <span style="color: #000066;">id</span> <span style="color: #66cc66;">=</span> <span style="color: #ff0000;">&quot;search_for&quot;</span></span>
<span style="color: #009900;">   <span style="color: #000066;">onkeyup</span> <span style="color: #66cc66;">=</span> <span style="color: #ff0000;">'suggest(this.id)'</span></span>
<span style="color: #009900;">   <span style="color: #000066;">onblur</span> <span style="color: #66cc66;">=</span> <span style="color: #ff0000;">'hide_suggestions()'</span></span>
<span style="color: #009900;">   <span style="color: #000066;">onfocus</span> <span style="color: #66cc66;">=</span> <span style="color: #ff0000;">'show_suggestions(this.id)'</span> <span style="color: #66cc66;">/</span>&gt;</span></pre></div></div>

<p>And you&#8217;re off to the races. I always like it when software can be powerfully complicated on the backend, and completely hidden from the end developer so that most of the time, it&#8217;s trivially simple to use.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.warelab.org/blog/?feed=rss2&amp;p=243</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Loading an Ensembl species database from scratch</title>
		<link>http://www.warelab.org/blog/?p=218</link>
		<comments>http://www.warelab.org/blog/?p=218#comments</comments>
		<pubDate>Fri, 09 Jan 2009 11:58:39 +0000</pubDate>
		<dc:creator>William Spooner</dc:creator>
				<category><![CDATA[Ensembl]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[assembly]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[genes]]></category>

		<guid isPermaLink="false">http://www.warelab.org/blog/?p=218</guid>
		<description><![CDATA[A question came up on the ensembl-dev mailing list to the effect &#8220;how do I create an Ensembl species core database from scratch&#8221;. As this is something that we have to do once in a while, I thought it may be a good topic for a blog post.
I will concentrate on getting a &#8216;minimal&#8217; database [...]]]></description>
			<content:encoded><![CDATA[<p>A question came up on the ensembl-dev mailing list to the effect &#8220;how do I create an Ensembl species core database from scratch&#8221;. As this is something that we have to do once in a while, I thought it may be a good topic for a blog post.</p>
<p>I will concentrate on getting a &#8216;minimal&#8217; database populated, i.e. the sequences, genome assembly and genes (without additional annotation).</p>
<p>In my experience, the exact approach for each species differs, mainly due to the different formats and semantics of the source data. The following description will probably need tweaking for a particular species.</p>
<p>Basic input data:</p>
<ul>
<li>The sequences that are used in the assembly, preferably the unassembled contigs, but the assembled chromosome or supercontig sequences can be used if needed.</li>
<li>A description of the assembly; how the contigs are mapped to form larger regions. This can span multiple levels, e.g. contigs-to-clone-to-supercontig-to-chromosome. If you are working from large, assembled sequences you may need to &#8216;fake&#8217; an assembly to chop the sequences into pieces that can be efficiently stored in the database.</li>
<li>A file that describes the gene structures; the locations of exons and CDSes, and how these are grouped into transcripts and genes. It is also useful to have cDNA and peptide sequences for the genes for validation.</li>
</ul>
<p><strong>Preparation</strong></p>
<p>You need to create an empty ensembl-schema database. The SQL to generate the schema is provided by a file in the ensembl project repository: <a href="http://cvs.sanger.ac.uk/cgi-bin/viewcvs.cgi/ensembl/sql/table.sql?view=markup">table.sql</a></p>
<p><strong>Loading the contig sequences</strong></p>
<p>Only use this method if you have short sequences and an assembly file. If you have long sequences (e.g. whole chromosomes), see the <em>Loading chromosome sequences and faking the assembly</em> section.</p>
<p>Ensembl provides a script for loading sequences from a FASTA file into the database:<br />
<a href="http://cvs.sanger.ac.uk/cgi-bin/viewcvs.cgi/ensembl-pipeline/scripts/load_seq_region.pl?view=markup">load_seq_region.pl</a>. A typical invocation of this script is:</p>
<blockquote><pre>$ perl load_seq_region.pl -dbhost host -dbuser user -dbname my_db -dbpass **** \
  -coord_system_name contig -rank 4 -sequence_level -fasta_file sequence.fa</pre>
</blockquote>
<p><strong>Loading the assembly</strong></p>
<p>Once you have the sequences loaded, you need to load the identifiers of other regions that appear in the assembly. Again using the load_seq_region.pl script, these identifiers can be loaded either from a FASTA file (the sequences will be ignored):</p>
<blockquote><pre>$ perl load_seq_region.pl -dbhost host -dbuser user -dbname my_db -dbpass **** \
  -coord_system_name clone -rank 3 -fasta_file clone.fa</pre>
</blockquote>
<p>Or from an agp file (assembled regions):</p>
<blockquote><pre>$ perl load_seq_region.pl -dbhost host -dbuser user -dbname my_db -dbpass **** \
-coord_system_name chromosome -rank 1 -agp_file genome.agp</pre>
</blockquote>
<p>Once you have all the pieces of the assembly in the database (typically by running load_seq_region.pl a number of times), you need to load the assembly itself, i.e. the order/orientation/overlap between the assembly pieces. If you have agp file(s), Ensembl provides a script to load them: <a href="http://cvs.sanger.ac.uk/cgi-bin/viewcvs.cgi/ensembl-pipeline/scripts/load_agp.pl?view=markup">load_agp.pl</a>. A typical invocation would be:</p>
<blockquote><pre>$ perl load_agp.pl -dbhost host -dbuser user -dbname my_db -dbpass **** \
  -assembled_name chromosome -assembled_version NCBI34 \
  -component_name contig -agp_file genome.agp</pre>
</blockquote>
<p>You will need to run this script for each level of assembly that you have.</p>
<p><strong>Loading chromosome sequences and faking the assembly</strong></p>
<p>The fastest way to get an assembly into Ensembl is to take the fully assembled chromosome sequences (or other large assembled components) for a species, and load these straight into Ensembl. As it is inefficient to work with huge sequences in single MySQL records, the approach is to chunk them into smaller pieces and fake an assembly. A script that does this is available from Gramene: <a href="http://svn.warelab.org/gramene/trunk/scripts/ensembl/load-scripts/load_assembly_from_fasta.pl">load_assembly_from_fasta.pl</a>. Typical invocation would be:</p>
<blockquote><pre>$ perl load_assembly_from_fasta.pl \
  --ensembl_registry file=/path/to/ensembl.registry --species=Homo_sapiens \
  --coord_system=chromosome -assembly_version=v1 --chunk_size=100000</pre>
</blockquote>
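<p>To make the idea of a faked assembly concrete, here is a small sketch of the coordinate arithmetic involved, written in JavaScript purely for illustration. It is not taken from load_assembly_from_fasta.pl, and the function and field names (<code>chunkAssembly</code>, <code>asmStart</code>, <code>cmpStart</code> and so on) are hypothetical. It assumes 1-based, inclusive coordinates and non-overlapping chunks.</p>

```javascript
// Sketch: split a sequence of `length` bases into chunks of at most `chunkSize`
// bases and describe how each chunk maps back onto the assembled sequence.
// asmStart/asmEnd are positions on the assembled sequence (e.g. the chromosome);
// cmpStart/cmpEnd are positions within the chunk itself.
function chunkAssembly(length, chunkSize) {
  const count = Math.ceil(length / chunkSize);
  return Array.from({ length: count }, (_, i) => {
    const asmStart = i * chunkSize + 1; // 1-based, inclusive
    const asmEnd = Math.min(length, (i + 1) * chunkSize); // last chunk may be short
    return {
      chunk: i + 1,
      asmStart,
      asmEnd,
      cmpStart: 1,
      cmpEnd: asmEnd - asmStart + 1,
    };
  });
}

// A 250,000 base sequence with --chunk_size=100000 gives two full chunks
// and one 50,000 base remainder.
const pieces = chunkAssembly(250000, 100000);
```

<p>Each entry in <code>pieces</code> corresponds to one chunk stored as its own sequence, plus the mapping needed to place it back on the larger region.</p>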
<p><strong>Loading the genes</strong></p>
<p>Gene structures are usually contained in a gff-derived file format. The semantics of gff itself are fairly loose, so it is better to use a more specific format, such as gtf or gff3. For a starting point, I direct you to a recent Gramene script used to load grape genes into the database: <a href="http://svn.warelab.org/gramene/trunk/scripts/ensembl/load-scripts/load_genes_from_grape_gff.pl">load_genes_from_grape_gff.pl</a>. Typical invocation:</p>
<blockquote><pre>$ perl load_genes_from_grape_gff.pl \
  --ensembl_registry file=/path/to/ensembl.registry --species=grape \
  --logic_name=genes_irgsp /path/to/grape_genes.gff</pre>
</blockquote>
<p><strong>Footnote</strong></p>
<p>This is a very basic treatment on the subject of loading data into Ensembl. There are many optimisations and display issues that could have been addressed. As time allows I would like to extend these notes into a longer article on loading into Ensembl. If you have any suggestions, please post!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.warelab.org/blog/?feed=rss2&amp;p=218</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>All Gramene, all the time</title>
		<link>http://www.warelab.org/blog/?p=215</link>
		<comments>http://www.warelab.org/blog/?p=215#comments</comments>
		<pubDate>Fri, 19 Dec 2008 16:01:19 +0000</pubDate>
		<dc:creator>Ken Youens-Clark</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.warelab.org/blog/?p=215</guid>
		<description><![CDATA[As the Gramene project manager, pretty much everything I do is directly related to that.  I got a chance in November to visit the lab to learn about our new Ensembl Mart building from Richard Holland, one of Will&#8217;s cohorts and the hired gun who whipped our Marts into shape.  It&#8217;s a complicated, [...]]]></description>
			<content:encoded><![CDATA[<p>As the Gramene project manager, pretty much everything I do is directly related to that.<span id="more-215"></span>  I got a chance in November to visit the lab to learn about our new Ensembl Mart building from Richard Holland, one of Will&#8217;s cohorts and the hired gun who whipped our Marts into shape.  It&#8217;s a complicated, eye-crossing, tedious process to flatten our normalized relational databases through the many Mart tools into a structure specifically designed for data integration and mining.  In the end, however, we get a system that can synthesize custom data sets not available anywhere else on our site.  What&#8217;s been interesting for me personally is that, having taken on the Mart mantle, I&#8217;ve been trying to answer the user queries for Mart myself.  When I get stuck, Will helps me along; when the Marts won&#8217;t support the query, I rebuild them.  I was really proud of myself when I was able to answer how to get all the rice QTLs for abiotic stress mapped to chromosome 1.  Moving forward, I&#8217;d like to add more data to our genes Mart (like ontology terms) and probably create Marts for our pathways database and the new association data from our diversity databases.</p>
<p>Even though Gramene is releasing only twice a year, we&#8217;re allowing interim updates of the site and important upgrades when they come along.  Recently we recommissioned our old web server, filetta, as a public-access (and read-only) MySQL server with all of our Ensembl dbs, the markers db, and a few other dbs.  I was able to use that to help a grad student at UT answer some questions about our genes db &#8212; how easy to tell him to connect directly to a running copy that he could query.  (And, of course, I pointed him to the &#8220;mysqldump&#8221; so he could set it up locally for himself.)  </p>
<p>Another release I made just this week was to reinstate the OMAP stacked maps in CMap.  This was a challenge for me because Ben and Bonnie always did this in the past, so I had to follow their docs, which ended up being pretty outdated since so many things have changed since they both left.  I hope that it will be easier this next release.</p>
<p>Since users often ask about our schemas, I finally finished up a &#8220;Build&#8221; action that I&#8217;d started long ago.  If, from the Gramene root directory, you &#8220;./Build schemas&#8221; and &#8220;./Build schema_diagrams,&#8221; the Build.PL script will do a &#8220;mysqldump&#8221; of all our defined &#8220;&lt;modules&gt;&#8221; (in &#8220;gramene.conf&#8221;) into a &#8220;schemas&#8221; directory and then use &#8220;sqlt-graph&#8221; (from my <a href="http://search.cpan.org/dist/SQL-Translator">SQL::Translator</a> project) to create a schema diagram.  It&#8217;s quite useful documentation to see how all the tables in the schemas are related.</p>
<p>One other nice improvement we&#8217;ve made recently to Gramene is the use of an actual blog for our news items rather than updating our index page all the time.  I installed WordPress on bivouac and then run a cron job on the top of the hour to grab the latest three posts from the RSS feed, format them into HTML (taking care to respect Unicode characters after the problems with <a href="http://bivouac.cshl.edu/gramene-news/?p=22">Pankaj&#8217;s post in Hindi</a>), and put that into a file that gets included via SSI on the home page.  It works very nicely, and I&#8217;m happy to see a much smoother system for publishing and archiving news items.</p>
<p>As we approach the data freeze in Gramene, I&#8217;m more determined than in previous builds to make it mean something.  It seems data keeps flying well after the freeze, but we&#8217;re going to have to make it stick this time if we are to build CMap and the Marts in a reasonable time.  We&#8217;re only 13% through 156 items in our &#8220;data&#8221; build, so I&#8217;m wondering how much gets jettisoned or done by then.  I&#8217;m slowly getting more comfortable with my status as a &#8220;decider&#8221; and enforcer, so I&#8217;ll just have to bring down the hammer when the time comes.  Watch out!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.warelab.org/blog/?feed=rss2&amp;p=215</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The learning curve</title>
		<link>http://www.warelab.org/blog/?p=210</link>
		<comments>http://www.warelab.org/blog/?p=210#comments</comments>
		<pubDate>Thu, 18 Dec 2008 13:32:21 +0000</pubDate>
		<dc:creator>Andrea Eveland</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.warelab.org/blog/?p=210</guid>
		<description><![CDATA[It is interesting to look at where in my current state as a programming newbie I fall on this curve.  My first experience with Perl (or any programming language for that matter) was during the CSHL Programming for Biology course in mid October.  I came away with a very large three-ring binder and an array [...]]]></description>
			<content:encoded><![CDATA[<p>It is interesting to look at where in my current state as a programming newbie I fall on this curve.  My first experience with Perl (or any programming language for that matter) was during the CSHL Programming for Biology course in mid October.  I came away with a very large three-ring binder and an array of books with different animals on them&#8230;essentially the tools needed to tackle any data analysis situation that could be simplified or made manageable using perl.  I am very fortunate to have since been working alongside a group of very helpful friends in the Ware lab.  Reaching my spot this far along the learning curve would have been very difficult without them.  Even so, in these last 2 months I have experienced a series of peaks and valleys corresponding to momentary jumps of joy and periods of frustration where I feel seemingly unproductive.  As I move along the learning curve, although I continue to experience valleys, they are becoming increasingly more complex and my intermittent peaks are actually beginning to produce useful information for my research.  For example, earlier in the week I spent an entire evening trying to figure out whether the data structure that I had constructed in my code was an array of hashes or a hash of hashes or a hash of arrays, etc.  After systematically trying to isolate elements of the code line-by-line and commenting on what each gave back, I felt I had made some progress and went to bed.  As usual, I curled up with my camel book and a glass of wine.  What was not usual was my erratic sleep and the visions of arrays and hashes looping through my head.  I am not even kidding a little bit&#8230;I must have woken up about 10 times, each after hitting an error message.  The next day I felt tired and stressed, but when I sat back down at the computer, I realized almost immediately that I had a hash of arrays in which one element was a hash reference!  
Ok, a little weird and probably not uncommon among programmers since I think if you stare at anything long enough it tends to come back to haunt you in your sleep.  Perhaps complete immersion is a little unhealthy since I dreamt of hashes again last night.  But I am happy about my progress along the learning curve.  </p>
<p>So in a nutshell, aside from the technical aspects of things, what have I learned thus far?  First of all, programming is really fun!  As a biologist working with deep sequencing data, it is also really essential&#8230;at least a basic knowledge anyway.  Even knowing only what I know now would have made my research as a graduate student that much easier.  There are only so many lines you can populate in an excel spreadsheet before the computer crashes.  Especially with some of these Solexa datasets&#8230;such analyses would be virtually impossible without the proper codes.  Also, very importantly, I learned that programmers love Starbucks.  Nuff said&#8230;I&#8217;m in good company :)</p>
]]></content:encoded>
			<wfw:commentRss>http://www.warelab.org/blog/?feed=rss2&amp;p=210</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
