Use of the ‘chunk’ coord_system in Ensembl core

Posted: May 14th, 2009 | Author: William Spooner | Filed under: Ensembl

This is a follow-up to the post “Loading an Ensembl species database from scratch”, and in particular to the issue of the ‘chunk’ coord_system that has traditionally been used to coerce long sequences into the Ensembl core database. Steve Searle summarises the subject as follows:

There is a limit in the mysql client and server for the maximum size for network transfer. This is one of the main reasons for chunking. We are moving away from running analyses on the seq level coord system much anymore. Having features on toplevel coords makes access much faster, so generating everything on that coord system makes sense. Doing this makes the underlying chunking fairly irrelevant. The exception would be if you wanted to map to some third coord system through the chunks, but the reason chunks are usually made is because there is no real assembly information (an agp), just the final assembled sequences.

The example that I am working on right now is the Arabidopsis lyrata genome, which has a maximum scaffold length of 33 Mbase. So what is the ideal chunk length (if any) in this case?

Having looked at the Ensembl schema (the dna.sequence field is LONGTEXT), the MySQL manual, and the configuration of the public Ensembl database, it seems that the maximum realistic length of a sequence is 16 Mbase. Although I could increase my mysqld’s max_allowed_packet to encompass my 33 Mbase, I feel this would be unwise (and incompatible with most people’s clients), so I’m stuck with chunking for this genome.
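The arithmetic behind that conclusion can be sketched in a few lines of Python. This is only an illustration. It assumes the 16 Mbase figure corresponds to a max_allowed_packet of 16 MiB and that each base occupies one byte. The function name and the `overhead` allowance for the rest of the SQL statement are made up for the example.

```python
# Assumption: the limit is a max_allowed_packet of 16 MiB.
# Check your own server and client settings before relying on this number.
MAX_ALLOWED_PACKET = 16 * 1024 * 1024


def fits_in_packet(seq_length, max_packet=MAX_ALLOWED_PACKET, overhead=1024):
    """Return True if seq_length bases (one byte each) fit in a single packet.

    `overhead` is a hypothetical allowance for the non-sequence part of
    the statement; it is not a value taken from MySQL or Ensembl.
    """
    return seq_length + overhead <= max_packet


longest_scaffold = 33_000_000  # longest A. lyrata scaffold, roughly 33 Mbase
print(fits_in_packet(longest_scaffold))  # the whole scaffold in one row
print(fits_in_packet(1_000_000))         # a single 1 Mbase chunk
```

Under these assumptions the 33 Mbase scaffold is about twice the packet limit, while a 1 Mbase chunk fits with plenty of room to spare.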

One problem with chunking is that it introduces artificial sequence breaks in downstream analyses that operate on the seqlevel coord_system (input_type_id=CONTIG, in ensembl-pipeline parlance). I cannot seem to find a list of which analyses use which input types, but I assume that algorithms which have been designed to run on sequences such as clones will use seqlevel; examples include RepeatMasker (max sequence length 4 Mbase), GenScan (max 5 Mbase) and so on.

I have always blindly used a chunk size of 100 kbase in the past, but I’m now thinking that 1 Mbase may be more appropriate. We’ll have to wait and see…
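To get a feel for what the two chunk sizes mean for a 33 Mbase scaffold, here is a minimal Python sketch of fixed-size chunking. It is not the Ensembl loading script; `chunk_coords` is a hypothetical helper that simply yields 1-based, inclusive start/end pairs of the kind that would end up as rows mapping each chunk onto its scaffold.

```python
def chunk_coords(seq_length, chunk_size):
    """Yield (start, end) pairs covering 1..seq_length in fixed-size chunks.

    Coordinates are 1-based and inclusive; the last chunk may be shorter.
    """
    start = 1
    while start <= seq_length:
        end = min(start + chunk_size - 1, seq_length)
        yield start, end
        start = end + 1


scaffold_length = 33_000_000  # roughly the longest A. lyrata scaffold
for chunk_size in (100_000, 1_000_000):
    chunks = list(chunk_coords(scaffold_length, chunk_size))
    print(chunk_size, len(chunks), chunks[0], chunks[-1])
```

With 1 Mbase chunks the longest scaffold has ten times fewer chunks than with 100 kbase chunks, and correspondingly fewer artificial breaks for any seqlevel analysis to trip over.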


One Comment on “Use of the ‘chunk’ coord_system in Ensembl core”

  1. Will said at 10:54 am on August 20th, 2009:

    Something else to consider: sequence gaps. In an ideal world the ‘chunking’ script should break the sequence either side of runs of ‘N’s and represent these as gaps in the assembly rather than literal ‘N’s. This would have several advantages: it would reduce the amount of sequence in the database (slightly), cause gaps to display as white in the clone tracks, and hide the gaps from any algorithms intent on annotating them in the pipeline (dust insists on annotating them as repeats).
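    A gap-aware split of this kind might look something like the Python sketch below. It is an illustration only, not an existing Ensembl script. The function name and the `min_gap` threshold (the shortest run of ‘N’s treated as a gap) are assumptions.

```python
import re


def split_at_gaps(seq, min_gap=10):
    """Yield (start, end, subsequence) for the stretches between N runs.

    Coordinates are 1-based and inclusive, relative to the input sequence.
    Runs of N shorter than min_gap (a hypothetical threshold) stay in place.
    """
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    pos = 0
    for match in gap.finditer(seq):
        if match.start() > pos:
            yield pos + 1, match.start(), seq[pos:match.start()]
        pos = match.end()
    if pos < len(seq):
        yield pos + 1, len(seq), seq[pos:]


scaffold = "ACGT" + "N" * 12 + "GGCCNNTT"
for start, end, piece in split_at_gaps(scaffold):
    print(start, end, piece)
```

    The regions between the yielded pieces are the gaps, which would simply have no chunk mapped onto them. Any piece still longer than the chosen chunk size would need a further fixed-size split.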

