understand Gerp conservation score

August 15th, 2008 by Sharon Wei

Recently I have been working on setting up and testing the pecan multiple whole genome alignment pipeline. One useful feature this analysis provides is the conservation scores for the alignmened positions/columns. The software used to caclulate such a score is GERP - Genomic Evolutionary Rate Profiling, a statistically rigorous and biologically transparent framework for identifying constrained element. One important concept GERP employs is “Rejected Substitution (RS)”, which is used to assess the strength of the constraint element candidates. The rational behind is that “due to their deleterious nature, mutations that arise within functional elements are more likely to be subjected to purifying selection. They are therefore less likely to be fixed in the population and result in evolutionary changes. The more deleterious a mutation is, the greater the tendency for purifying selection to eliminate it. Thus, the direct consequence of purifying selection is a deficit of substitution events within functional elements as compared with neutral DNA, which the magnitude of the deficit relating to the strength of the constraints that have acted upon it”. 

So pratically how the RS is calculated? Simply speaking, RS is the sum of the difference between the observed substitution rate and expected substitution rate at each column of the compressed alignmnent for the candidate element/region. Mathematical form is R= \sum(Exp - Obs). Then how is expected rate and observed rate computed? The expected substitution rate for each column is “determined by summing the branches of the neutral tree that remain after removing species with a gap character”. The maximum likelihood estimatior is used in computing the observed substitution rate. “In this procedure, the likelihood is maximized with respect to all branch lengths in the topology relating the analyzed species, and the maximum likelihood estimate of the column’s expected substitution count is treated as the observed value”

In the pecan databases, the table “conservation_score” is used to store these substitution rates.  The fields, observed_score, expected_score, diff_score holds values for observed substituion/site rate, expected substitietion rate and the deviation of observed rate from the expected rate, the sum of which in a specified block would constitute RS for this candidate constraint element. However, when I selected the values from this table, I saw enigmatic symbols like ‘ ,.. ” ;’ for these scores, impossible to interpret. It made a little sense later when I found out that their data types are blob, which is binary, usually used to store packed data. However, using bin() function won’t help much, what I saw would be machine language like strings of 0 and 1. 

We were kind of stuck here. Luckily, together with Doreen, we looked at the display of conservation score at ensembl contigview. So I am thinking there must be a way that the ensembl web server makes sense out of these blobs. Consequently, I went to read the ensembl API for conservation_score and finally understand them.  In our cases, these blobs stores packed single precision floats that represent a group of consecutive conservation scores. 

The following is the explanation from the API

ConservationScore

Object for storing conservation scores. The scores are averaged over different
window sizes to speed up drawing over large regions. The scores are packed as
floats and stored in a string. The scores can be stored and retrieved in
either a packed or unpacked format. The unpacked format is as a space delimited
string eg (”0.123 0.456 0.789″). The packed format is a single precision float
(4 bytes). It is recommended to use the unpacked format.

 

ConservationScoreAdaptor

This module is used to access data in the conservation_score table.
Each score is represented by a Bio::EnsEMBL::Compara::ConservationScore. The position and an observed, expected score and a difference score (expected-observed) is stored for each column in a multiple alignment. Not all bases in an alignment have a score (for example, if there is insufficient coverage) and termed here as ‘uncalled’.
In order to speed up processing of the scores over large regions, the scores are stored in the database averaged over window_sizes of 1 (no averaging), 10, 100 and 500. When retrieving the scores, the most appropriate window_size is estimated from the length of the alignment or slice and the number of scores requested, given by the display_size. There is no need to specify the window_size directly. If the number of scores requested (display_size) is smaller than the alignment length or slice length, the scores will be either averaged if display_type = “AVERAGE” or the maximum value taken if display_type = “MAX”. Scores in uncalled regions are not returned. If a score for each column in an alignment is required, the display_size should be set to be the same size as the alignment length or slice length.

 Then Javier, the developer of Pecan pipeline, gave us an even more clear explanation, the following is the excerpt from his email.

Yes, we decided to store the scores in a binary format in order to save some
space. Unfortunately, MySQL does not know how to unpack these scores and we
depend on the Perl API for this. There are several layers of complexity here:

- each entry in the MySQL table represents more than one score. Cons. scores
are grouped, packed and stored together in the table.

- Some columns/positions in the alignment do not have any score. If there is
just one or a few positions with no scores, these "undef" can be part of one
of the previous set of scores. If there are many "undef" scores, the previous
set is closed and a new one start when relevant. Let use an example:

Say we want to store the following scores:
1.2 :: 1.4 :: -1.2 :: -0.4 :: 0.1 :: 3.1 :: 2.7 :: UNDEF :: -0.1 :: 1.2 ::
1.4 :: -0.4 :: 1.1 :: 1.1 :: 2.7 :: UNDEF :: UNDEF :: UNDEF :: UNDEF ::
UNDEF :: UNDEF :: UNDEF :: UNDEF :: UNDEF :: UNDEF :: UNDEF :: UNDEF ::
UNDEF :: UNDEF :: UNDEF :: UNDEF :: UNDEF :: UNDEF :: UNDEF :: UNDEF ::
UNDEF :: -0.1 :: 0.1 :: 3.1 :: 2.7 :: 0.1 :: 3.1 :: 2.7

we could have in the table the following entries:
pos 1 -> [1.2 :: 1.4 :: -1.2 :: -0.4 :: 0.1 :: 3.1]
pos 7 -> [2.7 :: UNDEF :: -0.1 :: 1.2 :: 1.4 :: -0.4]
pos 13 -> [1.1 :: 1.1 :: 2.7] (no more scores are stored in this case as the
remaining ones are undef)
pos 37 -> [-0.1 :: 0.1 :: 3.1 :: 2.7 :: 0.1 :: 3.1]
pos 43 -> [2.7]

- Last, we store the scores at several resolution levels (widow size). Let's
say you want to display the scores on the interface for a 1 MB window. It
doesn't make sense to try to represent 1 million values as you will probably
have around 1000 pixels only. The web interface tell the
ConservationScoreAdaptor how many values (pixels) it wants and the adaptor
returns the scores at the best resolution.

Ken’s week ending 8/8/08

August 8th, 2008 by Ken Youens-Clark

The focus this week for me was cleaning up Gramene and building CMap. Read the rest of this entry »

Anyone interested in grape?

August 7th, 2008 by Doreen Ware

I recently had a wonderful visit with Anne-Francoise Adam from INRA. Read the rest of this entry »

an elusive bug

July 31st, 2008 by Sharon Wei

The following code would cause the program to die if there is no value defined in my config file for mdb_map_name_prefix, right?

my $mdb_map_name_prefix = $conf_hash{mdb_map_name_prefix}
|| die "need to define mdb_map_name_prefix"

what about the following one? Instead of dying, the program will print out an error message and proceed, which is the behavior I want now.

my $mdb_map_name_prefix = $conf_hash{mdb_map_name_prefix}
|| print STDERR "need to define mdb_map_name_prefix";

But one easily missed side effect of this change is that the value of $mdb_map_name_prefix is no longer ‘undef’ when you do not set value for mdb_map_name_prefix in your config file. Surprisingly, you would get 1 as the value for $mdb_map_name_prefix, which produce erroneous output from your program. This is because the assignment operation is after the “||” operation, which will always be true if the print statement succeeds when the configured value for mdb_map_name_prefix is null.

The correct way of coding this logic is

my $mdb_map_name_prefix = $conf_hash{mdb_map_name_prefix};
print "No markers db map_name_prefix defined\n"
unless $mdb_map_name_prefix;

Config file:

mdb_map_name_prefix

Yojimbo

July 31st, 2008 by Bonnie Hurwitz

Being on the brink of going back to graduate school and going through all of my notes and papers, I have realized that my computer has become a massive pile of electronic paperwork.  I have folders I didn’t know exist and keep finding little gems of information I completely forgot about.   Also, we just moved to Arizona and moved 30+ file boxes of my husband’s class notes and “important” papers from his recently completed PhD .  As, I stared at the boxes piled up on a fold-out table in our garage, I thought ‘there must be a better way’.   So, I started searching for software to save me from getting totally lost in the electronic jungle during my graduate career.  I found a really neat application called Yojimbo that has a slick way to store documents that allows you to tag the documents and add keywords.  Additionally, it has an encription feature so you can even store personal information about passwords or you details on your secret plot to take over the world :) It is similar to an email software I used called mail tags that has transformed my inbox and makes it so I can actually find things.  I have been using it for the last two weeks and now store all of my notes and random paperwork there.  I have been really impressed by how integrated it is with the entire OS.  With a simple copy and click on the F8 key it imports all of my info into Yojimbo.  Or, you can drag in docs to the side bar or print PDFs to Yojimbo from where ever you are (email, word, the web).  I also really like the “smart collection” feature in the software, where you set a few keywords and all of the docs matching those keys words pop up.  Little by little, I am adding all of my random electronic paperwork there and even just added a great recipe from Ken on how to make a Texas sheet cake (now I might actually get a chance to make it!).  The bad news is that it is not free…the cost is $29 for an academic license.  But, I think it is totally worth it and is a must have for all mac users :)

-B side

Ken’s Catchup

July 25th, 2008 by Ken Youens-Clark

It’s been a while since I wrote. Read the rest of this entry »

My day at YAPC, Part II - the afternoon

June 18th, 2008 by Jim Thomason

After lunch (we went to Quizno’s, which was sold to me as being much closer to campus than was actually the case), we dove back into things. It’s simply incredible how much information leaves along with your staff if they switch jobs. And I’m not even talking about simple knowledge of your systems or processes. The history of your workplace gets lost as well, as people who know the origin of things leave and new folks don’t have opportunity to learn how something came to be. Fortunately, I like telling stories and regaled them with many before we had to head back for the afternoon sessions.

Read the rest of this entry »

My day at YAPC

June 18th, 2008 by Jim Thomason

YAPC (the Yet Another Perl Conference - http://www.yapc.org) was back in Chicago this year so I got to go head down there for a day. Unfortunately, I was sick on Monday and our babysitter canceled for Wednesday so I was only able to attend one day of the event. But what a day it was.
Read the rest of this entry »

Ken’s weeks ending 6/6/08

June 6th, 2008 by Ken Youens-Clark

I missed posting last week. The short week really caught me off-guard when Friday rolled around, so this is to sum up the last couple of weeks. Read the rest of this entry »

Ken’s week ending 5/23/08

May 23rd, 2008 by Ken Youens-Clark

We’re off on the new Gramene build, and I definitely feel in the weeds already. Read the rest of this entry »