Posted: May 14th, 2009 | Author: William Spooner | Filed under: Ensembl | Tags: database, Ensembl | 1 Comment »
This is a follow-up post to Loading an Ensembl species database from scratch, in particular the issue of ‘chunk’ coordinate_system that has traditionally been used to coerce long sequences into the Ensembl core database. Steve Searle summarises the subject as follows;
There is a limit in the mysql client and server for the maximum size for network transfer. This is one of the main reasons for chunking. We are moving away from running analyses on the seq level coord system much anymore. Having features on toplevel coords makes access much faster, so generating everything on that coord system makes sense. Doing this makes the underlying chunking fairly irrelevant. The exception would be if you wanted to map to some third coord system through the chunks, but the reason chunks are usually made is because there is no real assembly information (an agp), just the final assembled sequences.
The example that I am working on right now is the Arabidopsis lyrata genome, which has a maximum scaffold length of 33 Mbase. So what is the ideal chunk length (if any) in this case?
Having looked at the Ensembl schema (the dna.sequence field is LONGTEXT), the MySQL manual, and the configuration of the public Ensembl database, it seems that the maximum realistic length of a sequence is 16 Mbase. Although I could increase my mysqld’s max_allowed_packet to encompass my 33 Mbases I feel this would be unwise (and incompatible with most peoples clients), so I’m stuch with chunking for this genome.
One problem with chunking is that it introduces artificial sequence breaks in downstream analyses that operate on the seqlevel coord_system (input_type_id=CONTIG, in ensembl-pipeline parlance). I cannot seem to find a list of which analyses use which input types, but I assume that algorithms which have been designed to run on sequences such as clones will use seqlevel; examples include RepeatMasker (max sequence length 4 Mbase), GenScan (max 5 Mbase) and so on.
I have always blindly used a chunk size of 100 kbase in the past, but I’m now thinking that 1 Mbase may be more appropriate. We’ll have to wait and see…
Posted: January 9th, 2009 | Author: William Spooner | Filed under: Ensembl, Programming | Tags: assembly, database, Ensembl, genes | 4 Comments »
A question came up on the ensembl-dev mailing list to the effect “how do I create an Ensembl species core database from scratch”. As this is something that we have to do once in a while, I thought it may be a good topic for a blog post.
I will concentrate on getting a ‘minimal’ database populated, i.e. the sequences, genome assembly and genes (without additional annotation).
In my experience, the exact approach for each species differs, mainly due to the different formats and semantics of the source data. The following description will probably need tweaking for a particular species.
Basic input data;
- The sequences that are used in the assembly, preferably the unassembled contigs, but the assembled chromosome or supercontig sequences can be used if needed.
- A description of the assembly; how the contigs are mapped to form larger regions. This can be multiple levels, e.g. contigs-to-clone-to-supercontig-to-chromosome for instance. If you are working from large, assembled sequences you may need to ‘fake’ an assembly to chop the sequences into pieces that can be efficiently stored in the database.
- A file that describes the gene structures; the locations of exons and CDSes, and how these are grouped into transcripts and genes. It is also useful to have cDNA and peptide sequences for the genes for validation.
Preparation
You need to create an empty ensembl-schema database. The SQL to generate the schema is provide by a file in the ensembl project repository: table.sql
Loading the contig sequences
Only use this method if you have short sequences and an assembly file. If you have long sequences (e.g. whole chromosomes), see the Loading chromosome sequences and faking the assembly section.
Ensembl provides a script for loading sequences from a FASTA file into the database:
load_seq_region.pl. A typical invocation of this script is:
$ perl load_seq_region.pl -dbhost host -dbuser user -dbname my_db -dbpass **** \
-coord_system_name contig -rank 4 -sequence_level -fasta_file sequence.fa
Loading the assembly
Once you have the sequences loaded, you need to load the identifiers of other regions that appear in the assembly. Again using the load_seq_region.pl script, these identifiers can be loaded either from a FASTA file (the sequences will be ignored);
$ perl load_seq_region.pl -dbhost host -dbuser user -dbname my_db -dbpass **** \
-coord_system_name clone -rank 3 -fasta_file clone.fa
Or from an agp file (assembled regions);
$ perl load_seq_region.pl -dbhost host -dbuser user -dbname my_db -dbpass **** \
-coord_system_name chromosome -rank 1 -agp_file genome.agp
Once you have all the pieces of the assembly in the database (typically by running load_seq_region.pl a number of times), you need to load the assembly itself, i.e. the order/orientation/overlap between the assembly pieces. If you have agp file(s), ensembl provides a script to loat it: load_agp.pl. A typical invocation would be;
$ perl load_agp.pl -dbhost host -dbuser user -dbname my_db -dbpass **** \
-assembled_name chromosome -assembled_version NCBI34 \
-component_name contig -agp_file genome.agp
You will need to run this script for each level of assembly that you have.
Loading chromosome sequences and faking the assembly
The fastest way to get an assembly into Ensembl is to take the fully assembled chromosome sequences (or other large assembled component) for a species, an load this straight into Ensembl. As it is inefficient to work with huge sequences in single MySQL records, the approach is to chunk into smaller pieces and fake an assembly. A script that does this is available from Gramene: load_assembly_from_fasta.pl. Typical invocation would be:
$ perl load_assembly_from_fasta.pl \
--ensembl_registry file=/path/to/ensembl.registry --species=Homo_sapiens \
--coord_system=chromosome -assembly_version=v1 --chunk_size=100000
Loading the genes
Gene structures are usually contained in a gff-derived file format. The semantics of gff itself are a fairly lose, so it is better to use a more specific format, such as gtf or gff3. For a starting point, I direct you to a recent Gramene script used to load grape genes into the database; load_genes_from_grape_gff.pl. Typical invocation:
$ perl load_genes_from_grape_gff.pl \
--ensembl_registry file=/path/to/ensembl.registry --species=grape \
--logic_name=genes_irgsp /path/to/grape_genes.gff
Footnote
This is a very basic treatment on the subject of loading data into Ensembl. There are many optimisations and display issues that could have been addressed. As time allows I would like to extend these notes into a longer article on loading into Ensembl. If you have any suggestions, please post!
Posted: October 17th, 2008 | Author: William Spooner | Filed under: Uncategorized | No Comments »
It’s late on stormy Friday night in mid October, probably a good time for my inaugural post to this blog. In particular as I have done very little in the way of plant informatics work recently.
So; what has been on my plate? I’ve contributed a chunk of web code to the (long delayed) Ensembl 51 release. This is a page for displaying gene trees. I’ve been promising various Ensembl-ites that I would work this up to for many years, so this is my special gift for their half-century. I’m not going to try to explain what it looks like, or provide a link (it’s not released yet), sorry. I am, however, quite pleased by the end result!
But most of my time recently has been taken up with writing the ‘WormMart’ paper in the last few weeks that I have working on that project (WormBase is off to Canada without me). WormMart, for those not intimately familiar with resources for the study of the biology of nematodes, is my (BioMart-based) data warehouse for WormBase and is, as such, similar to GrameneMart. So I’ve been buried in bioinformatics database papers in an effort to absorb sufficient of their nuances for me produce a decent forgery.
Recent Comments