Resubmitting failed SGE array tasks

Posted: December 1st, 2009 | Author: Shiran Pasternak | Filed under: Programming | Tags: , , , | No Comments »

I can't seem to find a straighforward mechanism of resubmitting specific tasks in a Sun Grid Engine (SGE) array job, so I rolled my own. There's not much to it, but it's ideal for small array jobs where the failed tasks can be specified by hand.

Read the rest of this entry »


Forecast: Cloudy

Posted: November 30th, 2009 | Author: Shiran Pasternak | Filed under: Big Data | Tags: , , | 4 Comments »

David Dooling, of the Genome Center at Washington University and the blog PolITiGenomics, has written a thoughtful post about the nexus of cloud computing and bioinformatics. Dooling does a fine job summarizing the cloud computing and Big Data conversation that’s happening in bioinformatics circles, so I won’t repeat it here.

The post essentially argues that bioinformaticians are — to mix metaphors — too quick to jump on the cloud computing band wagon. It often makes more sense — and cents — to analyze genomic data locally (or privately). His cost-benefit analysis is spot on, but applies to the state of the art. The state of bioinformatics is, however, changing very rapidly and the scale (pun very much intended) is likely to tip.

Read the rest of this entry »


Use of the ‘chunk’ coord_system in Ensembl core

Posted: May 14th, 2009 | Author: William Spooner | Filed under: Ensembl | Tags: , | 1 Comment »

This is a follow-up post to Loading an Ensembl species database from scratch, in particular the issue of ‘chunk’ coordinate_system that has traditionally been used to coerce long sequences into the Ensembl core database. Steve Searle summarises the subject as follows;

There is a limit in the mysql client and server for the maximum size for network transfer. This is one of the main reasons for chunking. We are moving away from running analyses on the seq level coord system much anymore. Having features on toplevel coords makes access much faster, so generating everything on that coord system makes sense. Doing this makes the underlying chunking fairly irrelevant. The exception would be if you wanted to map to some third coord system through the chunks, but the reason chunks are usually made is because there is no real assembly information (an agp), just the final assembled sequences.

The example that I am working on right now is the Arabidopsis lyrata genome, which has a maximum scaffold length of 33 Mbase. So what is the ideal chunk length (if any) in this case?

Having looked at the Ensembl schema (the dna.sequence field is LONGTEXT), the MySQL manual, and the configuration of the public Ensembl database, it seems that the maximum realistic length of a sequence is 16 Mbase. Although I could increase my mysqld’s max_allowed_packet to encompass my 33 Mbases I feel this would be unwise (and incompatible with most peoples clients), so I’m stuch with chunking for this genome.

One problem with chunking is that it introduces artificial sequence breaks in downstream analyses that operate on the seqlevel coord_system (input_type_id=CONTIG, in ensembl-pipeline parlance). I cannot seem to find a list of which analyses use which input types, but I assume that algorithms which have been designed to run on sequences such as clones will use seqlevel; examples include RepeatMasker (max sequence length 4 Mbase), GenScan (max 5 Mbase) and so on.

I have always blindly used a chunk size of 100 kbase in the past, but I’m now thinking that 1 Mbase may be more appropriate. We’ll have to wait and see…


Plant Ontology Database #0409 Release

Posted: April 20th, 2009 | Author: shuly | Filed under: Uncategorized | No Comments »

Plant Ontology Consortium (POC) is excited to announce the release #0409 (April 2009) of the Plant Ontology Database.

In this release, we bring you 26625 gene annotations from TAIR, Gramene, SGN and MaizeGDB, 8558 QTL annotations from Gramene, 9832 germplasm associations from SGN, MaizeGDB and NASC.

The following number of annotations were added for the first time: 16016 gene annotations from TAIR, 3 gene annotations from Gramene, 2 gene annotations from SGN and 3928 germplasm annotations from MaizeGDB.

For genes curated by TAIR and SGN, you may also find links to their Gene Ontology (GO) pages through PO browser.

The new ontology files and the database dump are available for download.

To submit plant ontology term requests, we encourage researchers to use SourceForge PO tracker.

The Plant Ontology Consortium
web: http:www.plantontology.org
e-mail: po-dev at plantontology.org

The project is funded by National Science Foundation, USA, (Grant No. DBI-0703908)


A quick coding style suggestion

Posted: April 14th, 2009 | Author: Jim Thomason | Filed under: Programming | 2 Comments »

I am going to keep this short and sweet and wax poetic about a certain programming idiom that irks me to no end, and provide my preferred alternative.

Now, most Perl programmers know that it’s very useful to pass in a hash (or hashref!) of parameters for functions that are very very long. It’s useful and keeps things tidy.

Read the rest of this entry »


Gramene goings on, etc.

Posted: April 9th, 2009 | Author: Ken Youens-Clark | Filed under: Uncategorized | 1 Comment »

In February Gramene released our 29th build. I won’t go into details here because you can read our release notes. Shortly afterwards I got a chance to present a poster on said release at CSHL’s “Plant Genomes” meeting. It was nice to meet some of our users and listen to the many interesting talks.

Since then, I’ve been doing lots of things. In no particular order: Read the rest of this entry »


I have a suggestion for you

Posted: January 21st, 2009 | Author: Jim Thomason | Filed under: Uncategorized | No Comments »

I’m a little bit behind the times. I was supposed to post something up here last week, but was so wrapped up in actually working on what I was going to write about that I completely forgot.

Gramene is going to be rolling out an autocomplete feature to offer suggestions to users sometime in the near future. You can sample the wonderfully suggestive goodness here until we go live.

Read the rest of this entry »


Loading an Ensembl species database from scratch

Posted: January 9th, 2009 | Author: William Spooner | Filed under: Ensembl, Programming | Tags: , , , | 4 Comments »

A question came up on the ensembl-dev mailing list to the effect “how do I create an Ensembl species core database from scratch”. As this is something that we have to do once in a while, I thought it may be a good topic for a blog post.

I will concentrate on getting a ‘minimal’ database populated, i.e. the sequences, genome assembly and genes (without additional annotation).

In my experience, the exact approach for each species differs, mainly due to the different formats and semantics of the source data. The following description will probably need tweaking for a particular species.

Basic input data;

  • The sequences that are used in the assembly, preferably the unassembled contigs, but the assembled chromosome or supercontig sequences can be used if needed.
  • A description of the assembly; how the contigs are mapped to form larger regions. This can be multiple levels, e.g. contigs-to-clone-to-supercontig-to-chromosome for instance. If you are working from large, assembled sequences you may need to ‘fake’ an assembly to chop the sequences into pieces that can be efficiently stored in the database.
  • A file that describes the gene structures; the locations of exons and CDSes, and how these are grouped into transcripts and genes. It is also useful to have cDNA and peptide sequences for the genes for validation.

Preparation

You need to create an empty ensembl-schema database. The SQL to generate the schema is provide by a file in the ensembl project repository: table.sql

Loading the contig sequences

Only use this method if you have short sequences and an assembly file. If you have long sequences (e.g. whole chromosomes), see the Loading chromosome sequences and faking the assembly section.

Ensembl provides a script for loading sequences from a FASTA file into the database:
load_seq_region.pl. A typical invocation of this script is:

$ perl load_seq_region.pl -dbhost host -dbuser user -dbname my_db -dbpass **** \
  -coord_system_name contig -rank 4 -sequence_level -fasta_file sequence.fa

Loading the assembly

Once you have the sequences loaded, you need to load the identifiers of other regions that appear in the assembly. Again using the load_seq_region.pl script, these identifiers can be loaded either from a FASTA file (the sequences will be ignored);

$ perl load_seq_region.pl -dbhost host -dbuser user -dbname my_db -dbpass **** \
  -coord_system_name clone -rank 3 -fasta_file clone.fa

Or from an agp file (assembled regions);

$ perl load_seq_region.pl -dbhost host -dbuser user -dbname my_db -dbpass **** \
-coord_system_name chromosome -rank 1 -agp_file genome.agp

Once you have all the pieces of the assembly in the database (typically by running load_seq_region.pl a number of times), you need to load the assembly itself, i.e. the order/orientation/overlap between the assembly pieces. If you have agp file(s), ensembl provides a script to loat it: load_agp.pl. A typical invocation would be;

$ perl load_agp.pl -dbhost host -dbuser user -dbname my_db -dbpass **** \
  -assembled_name chromosome -assembled_version NCBI34 \
  -component_name contig -agp_file genome.agp

You will need to run this script for each level of assembly that you have.

Loading chromosome sequences and faking the assembly

The fastest way to get an assembly into Ensembl is to take the fully assembled chromosome sequences (or other large assembled component) for a species, an load this straight into Ensembl. As it is inefficient to work with huge sequences in single MySQL records, the approach is to chunk into smaller pieces and fake an assembly. A script that does this is available from Gramene: load_assembly_from_fasta.pl. Typical invocation would be:

$ perl load_assembly_from_fasta.pl \
  --ensembl_registry file=/path/to/ensembl.registry --species=Homo_sapiens \
  --coord_system=chromosome -assembly_version=v1 --chunk_size=100000

Loading the genes

Gene structures are usually contained in a gff-derived file format. The semantics of gff itself are a fairly lose, so it is better to use a more specific format, such as gtf or gff3. For a starting point, I direct you to a recent Gramene script used to load grape genes into the database; load_genes_from_grape_gff.pl. Typical invocation:

$ perl load_genes_from_grape_gff.pl \
  --ensembl_registry file=/path/to/ensembl.registry --species=grape \
  --logic_name=genes_irgsp /path/to/grape_genes.gff

Footnote

This is a very basic treatment on the subject of loading data into Ensembl. There are many optimisations and display issues that could have been addressed. As time allows I would like to extend these notes into a longer article on loading into Ensembl. If you have any suggestions, please post!


All Gramene, all the time

Posted: December 19th, 2008 | Author: Ken Youens-Clark | Filed under: Uncategorized | No Comments »

As the Gramene project manager, pretty much everything I do is directly related to that. Read the rest of this entry »


The learning curve

Posted: December 18th, 2008 | Author: Andrea Eveland | Filed under: Uncategorized | 2 Comments »

It is interesting to look at where in my current state as a programming newbie I fall on this curve.  My first experience with Perl (or any programming language for that matter) was during the CSHL Programming for Biology course in mid October.  I came away with a very large three-ring binder and an array of books with different animals on them…essentially the tools needed to tackle any data analysis situation that could be simplified or made manageable using perl.  I am very fortunate to have since been working alongside a group of very helpful friends in the Ware lab.  Reaching my spot this far along the learning curve would have been very difficult without them.  Even so, in these last 2 months I have experienced a series of peaks and valleys corresponding to momentary jumps of joy and periods of frustration where I feel seemingly unproductive.  As I move along the learning curve, although I continue to experience valleys, they are becoming increasingly more complex and my intermittent peaks are actually beginning to produce useful information for my research.  For example, earlier in the week I spent an entire evening trying to figure out whether the data structure that I had constructed in my code was an array of hashes or a hash of hashes or a hash of arrays, etc.  After systematically trying to isolate elements of the code line-by-line and commenting on what each gave back, I felt I had made some progress and went to bed.  As usual, I curled up with my camel book and a glass of wine.  What was not usual was my erratic sleep and the visions of arrays and hashes looping through my head.  I am not even kidding a little bit…I must have woken up about 10 times, each after hitting an error message.  The next day I felt tired and stressed, but when I sat back down at the computer, I realized almost immediately that I had a hash of arrays in which one element was a hash reference!  Ok, a little weird and probably not uncommon among programmers since I think if you stare at anything long enough it tends to come back to haunt you in your sleep.  Perhaps complete immersion is a little unhealthy since I dreamt of hashes again last night.  But I am happy about my progress along the learning curve.  

So in a nutshell, aside from the technical aspects of things, what have I learned thus far?  First of all, programming is really fun!  As a biologist working with deep sequencing data, it is also really essential…at least a basic knowledge anyway.  Even knowing only what I know now would have made my research as a graduate student that much easier.  There are only so many lines you can populate in an excel spreadsheet before the computer crashes.  Especially with some of these Solexa datasets…such analyses would be virtually impossible without the proper codes.  Also, very importantly, I learned that programmers love Starbucks.  Nuff said…I’m in good company :)