Loading an Ensembl species database from scratch
Posted: January 9th, 2009 | Author: William Spooner | Filed under: Ensembl, Programming | Tags: assembly, database, Ensembl, genes
A question came up on the ensembl-dev mailing list to the effect of “how do I create an Ensembl species core database from scratch”. As this is something that we have to do once in a while, I thought it might be a good topic for a blog post.
I will concentrate on getting a ‘minimal’ database populated, i.e. the sequences, genome assembly and genes (without additional annotation).
In my experience, the exact approach for each species differs, mainly due to the different formats and semantics of the source data. The following description will probably need tweaking for a particular species.
Basic input data:
- The sequences that are used in the assembly, preferably the unassembled contigs, but the assembled chromosome or supercontig sequences can be used if needed.
- A description of the assembly, i.e. how the contigs are mapped to form larger regions. This can span multiple levels, e.g. contig-to-clone-to-supercontig-to-chromosome. If you are working from large, assembled sequences you may need to ‘fake’ an assembly to chop the sequences into pieces that can be stored efficiently in the database.
- A file that describes the gene structures; the locations of exons and CDSes, and how these are grouped into transcripts and genes. It is also useful to have cDNA and peptide sequences for the genes for validation.
Preparation
You need to create an empty database with the Ensembl schema. The SQL to generate the schema is provided by a file in the ensembl project repository: table.sql
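As a minimal sketch, creating the empty database might look like the following. It assumes a MySQL server, the mysql command-line client, and a local checkout of the ensembl repository (where the schema lives at sql/table.sql). The create_core_db function name, the ENSEMBL_ROOT variable and the DRY_RUN switch are made up for illustration; the host, user and database names are the same placeholders used in the commands in this post.

```shell
#!/bin/sh
# Sketch: create an empty database and load the Ensembl core schema into it.
# Assumptions: mysql client installed; ensembl checkout at $ENSEMBL_ROOT.
# Set DRY_RUN=1 to print the commands instead of running them.

ENSEMBL_ROOT="${ENSEMBL_ROOT:-./ensembl}"   # hypothetical checkout location

create_core_db() {
    host="$1"; user="$2"; pass="$3"; db="$4"
    schema="$ENSEMBL_ROOT/sql/table.sql"

    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "mysql -h $host -u $user -p**** -e 'CREATE DATABASE $db'"
        echo "mysql -h $host -u $user -p**** $db < $schema"
        return 0
    fi

    # Create the empty database, then load the table definitions into it.
    mysql -h "$host" -u "$user" -p"$pass" -e "CREATE DATABASE $db" || return 1
    mysql -h "$host" -u "$user" -p"$pass" "$db" < "$schema"
}

# Usage (placeholders as in the commands in this post):
#   create_core_db host user secret my_db
```

Running it once with DRY_RUN=1 is a cheap way to confirm that the path to table.sql is what you expect before touching the server.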
Loading the contig sequences
Only use this method if you have short sequences and an assembly file. If you have long sequences (e.g. whole chromosomes), see the Loading chromosome sequences and faking the assembly section.
Ensembl provides a script for loading sequences from a FASTA file into the database: load_seq_region.pl. A typical invocation of this script is:
$ perl load_seq_region.pl -dbhost host -dbuser user -dbname my_db -dbpass **** \
    -coord_system_name contig -rank 4 -sequence_level -fasta_file sequence.fa
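To decide whether your sequences are short enough for this method, it helps to look at their lengths first. The following is a small sketch using awk; the fasta_lengths function name is made up for illustration, and sequence.fa is the placeholder file from the command above.

```shell
#!/bin/sh
# Sketch: print "<sequence name><TAB><length>" for each record in a FASTA file.

fasta_lengths() {
    awk '
        /^>/ {
            if (name != "") print name "\t" len
            name = substr($1, 2)   # identifier = first word after ">"
            len = 0
            next
        }
        { gsub(/[ \t\r]/, ""); len += length($0) }   # ignore whitespace in sequence lines
        END { if (name != "") print name "\t" len }
    ' "$1"
}

# Usage: list the five longest sequences
#   fasta_lengths sequence.fa | sort -k2,2nr | head -5
```

If the longest entries turn out to be whole chromosomes, the chunking approach described later in this post is probably the better fit.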
Loading the assembly
Once you have the sequences loaded, you need to load the identifiers of other regions that appear in the assembly. Again using the load_seq_region.pl script, these identifiers can be loaded either from a FASTA file (the sequences will be ignored):
$ perl load_seq_region.pl -dbhost host -dbuser user -dbname my_db -dbpass **** \
    -coord_system_name clone -rank 3 -fasta_file clone.fa
Or from an AGP file (assembled regions):
$ perl load_seq_region.pl -dbhost host -dbuser user -dbname my_db -dbpass **** \
    -coord_system_name chromosome -rank 1 -agp_file genome.agp
Once you have all the pieces of the assembly in the database (typically by running load_seq_region.pl a number of times), you need to load the assembly itself, i.e. the order, orientation and overlap of the assembly pieces. If you have AGP file(s), Ensembl provides a script to load them: load_agp.pl. A typical invocation would be:
$ perl load_agp.pl -dbhost host -dbuser user -dbname my_db -dbpass **** \
    -assembled_name chromosome -assembled_version NCBI34 \
    -component_name contig -agp_file genome.agp
You will need to run this script for each level of assembly that you have.
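Before loading an AGP file it can be worth sanity-checking its coordinates. The sketch below relies only on the standard tab-separated AGP column layout (object, start, end, part number, component type, then either component id/start/end/orientation or gap length and gap details). It checks that each component spans the same number of bases on both sides, and that each gap line's length matches its coordinates. The check_agp function name is made up for illustration, and this is not something load_agp.pl is known to do for you.

```shell
#!/bin/sh
# Sketch: basic coordinate checks on an AGP file.
# Exits non-zero and reports the offending lines if any check fails.

check_agp() {
    awk -F'\t' '
        /^#/ || NF == 0 { next }             # skip comments and blank lines
        $5 == "N" || $5 == "U" {             # gap line: column 6 is the gap length
            if ($3 - $2 + 1 != $6) { print "line " NR ": gap length mismatch"; bad++ }
            next
        }
        ($3 - $2) != ($8 - $7) {             # component line: spans must match
            print "line " NR ": object and component spans differ"; bad++
        }
        END { exit (bad > 0) }
    ' "$1"
}

# Usage:
#   check_agp genome.agp && echo "AGP coordinates look consistent"
```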
Loading chromosome sequences and faking the assembly
The fastest way to get an assembly into Ensembl is to take the fully assembled chromosome sequences (or other large assembled components) for a species, and load these straight into Ensembl. As it is inefficient to work with huge sequences in single MySQL records, the approach is to chunk the sequences into smaller pieces and fake an assembly. A script that does this is available from Gramene: load_assembly_from_fasta.pl. A typical invocation would be:
$ perl load_assembly_from_fasta.pl \
    --ensembl_registry file=/path/to/ensembl.registry --species=Homo_sapiens \
    --coord_system=chromosome -assembly_version=v1 --chunk_size=100000
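To make the chunking idea concrete, here is a sketch of how a fake assembly could be described: a chromosome of known length is cut into fixed-size pieces, and each piece is written as an AGP-style line mapping it back to its place on the chromosome. This illustrates the idea only and is not a copy of the logic in load_assembly_from_fasta.pl; the fake_agp function name and the <chromosome>_chunk<N> naming scheme are assumptions.

```shell
#!/bin/sh
# Sketch: emit AGP-style lines that cut one sequence into fixed-size chunks.
# Usage: fake_agp <name> <length> <chunk_size>

fake_agp() {
    awk -v name="$1" -v len="$2" -v size="$3" 'BEGIN {
        if (size < 1) exit 1                 # guard against an endless loop
        part = 0
        for (start = 1; start <= len; start += size) {
            end = start + size - 1
            if (end > len) end = len         # the last chunk may be shorter
            part++
            # object, obj_start, obj_end, part, type, component, cmp_start, cmp_end, strand
            printf "%s\t%d\t%d\t%d\tW\t%s_chunk%d\t1\t%d\t+\n",
                   name, start, end, part, name, part, end - start + 1
        }
    }'
}

# Usage: a 250 kb chromosome in 100 kb chunks gives three lines,
# the last one covering the remaining 50 kb.
#   fake_agp chr1 250000 100000
```

Every chunk is placed on the forward strand with no gaps or overlaps, which is all a faked assembly needs.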
Loading the genes
Gene structures are usually contained in a GFF-derived file format. The semantics of GFF itself are fairly loose, so it is better to use a more specific format, such as GTF or GFF3. As a starting point, I direct you to a recent Gramene script used to load grape genes into the database: load_genes_from_grape_gff.pl. A typical invocation:
$ perl load_genes_from_grape_gff.pl \
    --ensembl_registry file=/path/to/ensembl.registry --species=grape \
    --logic_name=genes_irgsp /path/to/grape_genes.gff
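Because GFF-style files vary so much between sources, it is worth checking what a file actually contains before adapting a loader script to it. The sketch below counts features by type (column 3 of the nine tab-separated GFF columns), which quickly shows whether genes, transcripts, exons and CDSes are all present and what they are called. The gff_feature_counts function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: count the features of each type in a GFF/GTF/GFF3 file.

gff_feature_counts() {
    awk -F'\t' '
        /^#/ || NF < 9 { next }      # skip comments, directives and malformed lines
        { count[$3]++ }              # column 3 holds the feature type
        END { for (type in count) print type "\t" count[type] }
    ' "$1" | LC_ALL=C sort
}

# Usage:
#   gff_feature_counts /path/to/grape_genes.gff
```

A file that reports exons but no gene or transcript features, for example, will need the grouping reconstructed from the attributes column before it can be loaded.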
Footnote
This is a very basic treatment of the subject of loading data into Ensembl. There are many optimisations and display issues that I have not addressed. As time allows I would like to extend these notes into a longer article on loading into Ensembl. If you have any suggestions, please post!

You can find other examples of scripts used to load various data formats into Ensembl in the ensembl-pipeline CVS: http://cvs.sanger.ac.uk/cgi-bin/viewcvs.cgi/ensembl-pipeline/scripts/DataConversion/
Thanks for this. I got hit by a slew of other things but I am working through it now.
I just finished importing using load_assembly_from_fasta.pl. One point about that: this script tries to fetch taxonomy info on the organism. If it is a proprietary organism then it fails to find this info and quits.
So I had to disable part of the code to get it to work. That’s always a little bit scary with Perl because it can be difficult to determine what you can change and what you can’t. The script seemed to run OK, though.
Great, thanks! How do I configure the web site to view it?
V.nono suggests the following document as an alternative description for loading assemblies into Ensembl:
http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl-doc/loading_sequence_into_ensembl.txt?revision=1.5&root=ensembl&view=markup
It’s not been updated for the past 4 years, but that may not be a bad thing…