Resubmitting failed SGE array tasks

Posted: December 1st, 2009 | Author: Shiran Pasternak | Filed under: Programming | Tags: , , , | No Comments »

I can't seem to find a straighforward mechanism of resubmitting specific tasks in a Sun Grid Engine (SGE) array job, so I rolled my own. There's not much to it, but it's ideal for small array jobs where the failed tasks can be specified by hand.

Read the rest of this entry »


A quick coding style suggestion

Posted: April 14th, 2009 | Author: Jim Thomason | Filed under: Programming | 2 Comments »

I am going to keep this short and sweet and wax poetic about a certain programming idiom that irks me to no end, and provide my preferred alternative.

Now, most Perl programmers know that it’s very useful to pass in a hash (or hashref!) of parameters for functions that are very very long. It’s useful and keeps things tidy.

Read the rest of this entry »


Loading an Ensembl species database from scratch

Posted: January 9th, 2009 | Author: William Spooner | Filed under: Ensembl, Programming | Tags: , , , | 4 Comments »

A question came up on the ensembl-dev mailing list to the effect “how do I create an Ensembl species core database from scratch”. As this is something that we have to do once in a while, I thought it may be a good topic for a blog post.

I will concentrate on getting a ‘minimal’ database populated, i.e. the sequences, genome assembly and genes (without additional annotation).

In my experience, the exact approach for each species differs, mainly due to the different formats and semantics of the source data. The following description will probably need tweaking for a particular species.

Basic input data;

  • The sequences that are used in the assembly, preferably the unassembled contigs, but the assembled chromosome or supercontig sequences can be used if needed.
  • A description of the assembly; how the contigs are mapped to form larger regions. This can be multiple levels, e.g. contigs-to-clone-to-supercontig-to-chromosome for instance. If you are working from large, assembled sequences you may need to ‘fake’ an assembly to chop the sequences into pieces that can be efficiently stored in the database.
  • A file that describes the gene structures; the locations of exons and CDSes, and how these are grouped into transcripts and genes. It is also useful to have cDNA and peptide sequences for the genes for validation.

Preparation

You need to create an empty ensembl-schema database. The SQL to generate the schema is provide by a file in the ensembl project repository: table.sql

Loading the contig sequences

Only use this method if you have short sequences and an assembly file. If you have long sequences (e.g. whole chromosomes), see the Loading chromosome sequences and faking the assembly section.

Ensembl provides a script for loading sequences from a FASTA file into the database:
load_seq_region.pl. A typical invocation of this script is:

$ perl load_seq_region.pl -dbhost host -dbuser user -dbname my_db -dbpass **** \
  -coord_system_name contig -rank 4 -sequence_level -fasta_file sequence.fa

Loading the assembly

Once you have the sequences loaded, you need to load the identifiers of other regions that appear in the assembly. Again using the load_seq_region.pl script, these identifiers can be loaded either from a FASTA file (the sequences will be ignored);

$ perl load_seq_region.pl -dbhost host -dbuser user -dbname my_db -dbpass **** \
  -coord_system_name clone -rank 3 -fasta_file clone.fa

Or from an agp file (assembled regions);

$ perl load_seq_region.pl -dbhost host -dbuser user -dbname my_db -dbpass **** \
-coord_system_name chromosome -rank 1 -agp_file genome.agp

Once you have all the pieces of the assembly in the database (typically by running load_seq_region.pl a number of times), you need to load the assembly itself, i.e. the order/orientation/overlap between the assembly pieces. If you have agp file(s), ensembl provides a script to loat it: load_agp.pl. A typical invocation would be;

$ perl load_agp.pl -dbhost host -dbuser user -dbname my_db -dbpass **** \
  -assembled_name chromosome -assembled_version NCBI34 \
  -component_name contig -agp_file genome.agp

You will need to run this script for each level of assembly that you have.

Loading chromosome sequences and faking the assembly

The fastest way to get an assembly into Ensembl is to take the fully assembled chromosome sequences (or other large assembled component) for a species, an load this straight into Ensembl. As it is inefficient to work with huge sequences in single MySQL records, the approach is to chunk into smaller pieces and fake an assembly. A script that does this is available from Gramene: load_assembly_from_fasta.pl. Typical invocation would be:

$ perl load_assembly_from_fasta.pl \
  --ensembl_registry file=/path/to/ensembl.registry --species=Homo_sapiens \
  --coord_system=chromosome -assembly_version=v1 --chunk_size=100000

Loading the genes

Gene structures are usually contained in a gff-derived file format. The semantics of gff itself are a fairly lose, so it is better to use a more specific format, such as gtf or gff3. For a starting point, I direct you to a recent Gramene script used to load grape genes into the database; load_genes_from_grape_gff.pl. Typical invocation:

$ perl load_genes_from_grape_gff.pl \
  --ensembl_registry file=/path/to/ensembl.registry --species=grape \
  --logic_name=genes_irgsp /path/to/grape_genes.gff

Footnote

This is a very basic treatment on the subject of loading data into Ensembl. There are many optimisations and display issues that could have been addressed. As time allows I would like to extend these notes into a longer article on loading into Ensembl. If you have any suggestions, please post!


I love refactoring

Posted: November 23rd, 2008 | Author: Jim Thomason | Filed under: Programming | 3 Comments »

I love refactoring code.

One of the tenets of Extreme Programming is You ain’t gonna need it. Or, more verbosely stated: Always implement things when you actually need them, never when you just foresee that you need them.

And the rationale for this is that humans, as a species, are notoriously bad at predicting the future. Psychics are wrong more often than not. The end of the world hasn’t come nearly as often as it’s been predicted. I have no clue what I’m going to have for lunch today. So in our software, as in life, we should focus on the present. That’s what we know and what we should make our decisions off of.

Build with a nod towards the future, but that nod should be keeping in mind the fact that software needs to breathe and evolve and change, so we shouldn’t stifle it too much in its development. It’ll need to change in the future. But until then, build for right now.

Read the rest of this entry »


an elusive bug

Posted: July 31st, 2008 | Author: Sharon | Filed under: Programming | 3 Comments »

The following code would cause the program to die if there is no value defined in my config file for mdb_map_name_prefix, right?

my $mdb_map_name_prefix = $conf_hash{mdb_map_name_prefix}
|| die "need to define mdb_map_name_prefix"

what about the following one? Instead of dying, the program will print out an error message and proceed, which is the behavior I want now.

my $mdb_map_name_prefix = $conf_hash{mdb_map_name_prefix}
|| print STDERR "need to define mdb_map_name_prefix";

But one easily missed side effect of this change is that the value of $mdb_map_name_prefix is no longer ‘undef’ when you do not set value for mdb_map_name_prefix in your config file. Surprisingly, you would get 1 as the value for $mdb_map_name_prefix, which produce erroneous output from your program. This is because the assignment operation is after the “||” operation, which will always be true if the print statement succeeds when the configured value for mdb_map_name_prefix is null.

The correct way of coding this logic is

my $mdb_map_name_prefix = $conf_hash{mdb_map_name_prefix};
print "No markers db map_name_prefix defined\n"
unless $mdb_map_name_prefix;

Config file:

mdb_map_name_prefix


My day at YAPC, Part II – the afternoon

Posted: June 18th, 2008 | Author: Jim Thomason | Filed under: Programming | No Comments »

After lunch (we went to Quizno’s, which was sold to me as being much closer to campus than was actually the case), we dove back into things. It’s simply incredible how much information leaves along with your staff if they switch jobs. And I’m not even talking about simple knowledge of your systems or processes. The history of your workplace gets lost as well, as people who know the origin of things leave and new folks don’t have opportunity to learn how something came to be. Fortunately, I like telling stories and regaled them with many before we had to head back for the afternoon sessions.

Read the rest of this entry »


My day at YAPC

Posted: June 18th, 2008 | Author: Jim Thomason | Filed under: Programming | 1 Comment »

YAPC (the Yet Another Perl Conference – http://www.yapc.org) was back in Chicago this year so I got to go head down there for a day. Unfortunately, I was sick on Monday and our babysitter canceled for Wednesday so I was only able to attend one day of the event. But what a day it was.
Read the rest of this entry »


Ken’s week ending 5/23/08

Posted: May 23rd, 2008 | Author: Ken Youens-Clark | Filed under: Programming | No Comments »

We’re off on the new Gramene build, and I definitely feel in the weeds already. Read the rest of this entry »


Ken’s week ending 5/16/08

Posted: May 16th, 2008 | Author: Ken Youens-Clark | Filed under: Programming | 2 Comments »

I’ve missed some regular posting what with all the traveling and releasing and conferencing and such. Read the rest of this entry »


iPlant, Gramene release in sight (?)

Posted: April 18th, 2008 | Author: Ken Youens-Clark | Filed under: Programming | No Comments »

I missed posting last week because I was caught up in the iPlant meeting. Read the rest of this entry »