understand Gerp conservation score
August 15th, 2008 by Sharon WeiRecently I have been working on setting up and testing the pecan multiple whole genome alignment pipeline. One useful feature this analysis provides is the conservation scores for the alignmened positions/columns. The software used to caclulate such a score is GERP - Genomic Evolutionary Rate Profiling, a statistically rigorous and biologically transparent framework for identifying constrained element. One important concept GERP employs is “Rejected Substitution (RS)”, which is used to assess the strength of the constraint element candidates. The rational behind is that “due to their deleterious nature, mutations that arise within functional elements are more likely to be subjected to purifying selection. They are therefore less likely to be fixed in the population and result in evolutionary changes. The more deleterious a mutation is, the greater the tendency for purifying selection to eliminate it. Thus, the direct consequence of purifying selection is a deficit of substitution events within functional elements as compared with neutral DNA, which the magnitude of the deficit relating to the strength of the constraints that have acted upon it”.
So pratically how the RS is calculated? Simply speaking, RS is the sum of the difference between the observed substitution rate and expected substitution rate at each column of the compressed alignmnent for the candidate element/region. Mathematical form is R=
(Exp - Obs). Then how is expected rate and observed rate computed? The expected substitution rate for each column is “determined by summing the branches of the neutral tree that remain after removing species with a gap character”. The maximum likelihood estimatior is used in computing the observed substitution rate. “In this procedure, the likelihood is maximized with respect to all branch lengths in the topology relating the analyzed species, and the maximum likelihood estimate of the column’s expected substitution count is treated as the observed value”
In the pecan databases, the table “conservation_score” is used to store these substitution rates. The fields, observed_score, expected_score, diff_score holds values for observed substituion/site rate, expected substitietion rate and the deviation of observed rate from the expected rate, the sum of which in a specified block would constitute RS for this candidate constraint element. However, when I selected the values from this table, I saw enigmatic symbols like ‘ ,.. ” ;’ for these scores, impossible to interpret. It made a little sense later when I found out that their data types are blob, which is binary, usually used to store packed data. However, using bin() function won’t help much, what I saw would be machine language like strings of 0 and 1.
We were kind of stuck here. Luckily, together with Doreen, we looked at the display of conservation score at ensembl contigview. So I am thinking there must be a way that the ensembl web server makes sense out of these blobs. Consequently, I went to read the ensembl API for conservation_score and finally understand them. In our cases, these blobs stores packed single precision floats that represent a group of consecutive conservation scores.
The following is the explanation from the API
ConservationScore
Object for storing conservation scores. The scores are averaged over different
window sizes to speed up drawing over large regions. The scores are packed as
floats and stored in a string. The scores can be stored and retrieved in
either a packed or unpacked format. The unpacked format is as a space delimited
string eg (”0.123 0.456 0.789″). The packed format is a single precision float
(4 bytes). It is recommended to use the unpacked format.
ConservationScoreAdaptor
This module is used to access data in the conservation_score table.
Each score is represented by a Bio::EnsEMBL::Compara::ConservationScore. The position and an observed, expected score and a difference score (expected-observed) is stored for each column in a multiple alignment. Not all bases in an alignment have a score (for example, if there is insufficient coverage) and termed here as ‘uncalled’.
In order to speed up processing of the scores over large regions, the scores are stored in the database averaged over window_sizes of 1 (no averaging), 10, 100 and 500. When retrieving the scores, the most appropriate window_size is estimated from the length of the alignment or slice and the number of scores requested, given by the display_size. There is no need to specify the window_size directly. If the number of scores requested (display_size) is smaller than the alignment length or slice length, the scores will be either averaged if display_type = “AVERAGE” or the maximum value taken if display_type = “MAX”. Scores in uncalled regions are not returned. If a score for each column in an alignment is required, the display_size should be set to be the same size as the alignment length or slice length.
Then Javier, the developer of Pecan pipeline, gave us an even more clear explanation, the following is the excerpt from his email.
Yes, we decided to store the scores in a binary format in order to save some space. Unfortunately, MySQL does not know how to unpack these scores and we depend on the Perl API for this. There are several layers of complexity here: - each entry in the MySQL table represents more than one score. Cons. scores are grouped, packed and stored together in the table. - Some columns/positions in the alignment do not have any score. If there is just one or a few positions with no scores, these "undef" can be part of one of the previous set of scores. If there are many "undef" scores, the previous set is closed and a new one start when relevant. Let use an example: Say we want to store the following scores: 1.2 :: 1.4 :: -1.2 :: -0.4 :: 0.1 :: 3.1 :: 2.7 :: UNDEF :: -0.1 :: 1.2 :: 1.4 :: -0.4 :: 1.1 :: 1.1 :: 2.7 :: UNDEF :: UNDEF :: UNDEF :: UNDEF :: UNDEF :: UNDEF :: UNDEF :: UNDEF :: UNDEF :: UNDEF :: UNDEF :: UNDEF :: UNDEF :: UNDEF :: UNDEF :: UNDEF :: UNDEF :: UNDEF :: UNDEF :: UNDEF :: UNDEF :: -0.1 :: 0.1 :: 3.1 :: 2.7 :: 0.1 :: 3.1 :: 2.7 we could have in the table the following entries: pos 1 -> [1.2 :: 1.4 :: -1.2 :: -0.4 :: 0.1 :: 3.1] pos 7 -> [2.7 :: UNDEF :: -0.1 :: 1.2 :: 1.4 :: -0.4] pos 13 -> [1.1 :: 1.1 :: 2.7] (no more scores are stored in this case as the remaining ones are undef) pos 37 -> [-0.1 :: 0.1 :: 3.1 :: 2.7 :: 0.1 :: 3.1] pos 43 -> [2.7] - Last, we store the scores at several resolution levels (widow size). Let's say you want to display the scores on the interface for a 1 MB window. It doesn't make sense to try to represent 1 million values as you will probably have around 1000 pixels only. The web interface tell the ConservationScoreAdaptor how many values (pixels) it wants and the adaptor returns the scores at the best resolution.