Gramene goings on, etc.

Posted: April 9th, 2009 | Author: Ken Youens-Clark | Filed under: Uncategorized

In February Gramene released our 29th build. I won’t go into details here because you can read our release notes. Shortly afterwards I got a chance to present a poster on said release at CSHL’s “Plant Genomes” meeting. It was nice to meet some of our users and listen to the many interesting talks.

Since then, I’ve been doing lots of things. In no particular order:

  • Working fairly fruitlessly on bringing our new blast.gramene.org machine online
  • Answering user email, fixing CMap and Mart bugs
  • Hacking a bit on SQL::Translator to help create Gramene schema ER diagrams
  • Thinking a lot about how we could improve Gramene’s look and feel. This was the subject of a presentation to the group, one I’ll revise for the upcoming Gramene retreat.
  • Reading a book on project management and considering how I could improve. Apparently I should be having regular one-on-one meetings with people. Revolutionary!
  • Writing the 2008-9 Gramene grant summary. This has been a nice exercise for me as I go over the Gramene experimental plan and consider what we promised to do, what we’ve done, what we’ve got left.
  • Polishing a brief paper on CMap for the journal Bioinformatics
  • Creating a test video of a Gramene tutorial I drafted and soliciting opinions. It’s decent, but needs much improvement. I’ll tackle this again soon, breaking the 10+ minute intro into many smaller, 2-3 minute intros on each section. I will increase the font size and decrease the screen resolution to try to make the text more readable and will also use Omnidazzle, a free mouse highlighter, to make it easier to see my pointer. Sheldon asked me what software I used to record. I purchased (for $99) Screenflow, which I found relatively easy to use. I should also point out that, though I don’t really like the sound of my own voice (few do, I know), I believe I sound reasonably good because I used a decent microphone run through an inexpensive pre-amp.

For several months I’ve been thinking about ways to radically update CMap. My recent work on the CMap paper and answering some user email has made me want to do something even more, but then this week I really got it in my head to try to incorporate Circos output into CMap. Looking into the Circos code, I think it will take some refactoring since all the logic is in a single Perl script. This will need to be moved out into a module to make it easier to call from other code. Since Shiran also wants to create Circos views from Ensembl, perhaps we could work together to create an API for Circos. We’ve each contacted the author, Martin Krzywinski at the BC Cancer Agency (Sheldon’s old employer), and we’re waiting to hear how we can move forward on contributing to that. It would be nice to see perhaps a Google Code project set up with SVN, Wiki, etc. I would be really excited to contribute to such a project. I will also add that I was very happy to see that Martin apparently found the SQLFairy on his own and thought that Circos would be another cool way to visualize schemas, too, so he created Schemaball. It says it’s limited to MySQL, so I wouldn’t mind turning that around and creating a Circos producer module for SQL::Translator so that this would be available for any database syntax. First, though, I need Circos to have an API, and that will require refactoring the code.

Anyway, here’s just one idea how Circos could radically improve CMap views. Take a typical, boring table from the correspondence matrix showing how two map sets compare to each other and create this view:

Circos tabular view

Pretty neat, I think. I could see making Circos an output option for the matrix and for map comparisons. At first I got excited thinking I might find a way to use Circos for all of CMap’s graphical output, since it is cool and allows multi-way comparisons, as opposed to CMap’s limitation that maps can only be compared to those on their immediate right or left. Then I realized that the standard CMap view does have value, such as when you want to align contigs to a reference map for order and orientation. Also, Shiran seems to think that performance with Circos might be a tad slow for dynamic queries, so it would be extra eye-candy that users would wait on. Lastly, there is the issue of making the image clickable with image maps. Circos doesn’t create anything like that, and, in fact, it seems like a very tricky thing to do given that the regions are very curvy.

Along with creating new, exciting visual output, I’d love to completely rewrite the user interface for CMap. It’s time for it to have real Web 2.0 goodness. I’ve wondered if I should write it in Java as an applet or a separate download, but my colleague on the Gramene project, Terry, feels that this has been a hindrance to people accepting his GDPC tool on Gramene. While researching other ideas, I happened across the Chrome Experiments website and was thoroughly intrigued with what you can do nowadays with Javascript + HTML5 + Canvas + SVG (Jim pointed out all the technologies at work there). I’m thinking that’s gotta be the way to go. Whatever direction I choose, what I’d like is a much more dynamic comparative map viewer, from the form to the image itself. I would love to be able to update just a part of the image, e.g., if the user wanted to hide all the ESTs on a map, I could just redraw that one map rather than the whole image, but this might also entail redrawing the correspondence lines to neighboring maps, so I’m wondering if it would be possible to create the image in layers such as background, map, correspondences, legend, etc.
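To make the layering idea a bit more concrete, here is a rough sketch in Python (a real viewer would presumably do this in JavaScript on a canvas). Everything in it is hypothetical: the `LayeredView` class, the layer names, and the rule that hiding a feature type on one map marks that map's layer and the shared correspondence layer as needing a redraw. It only tracks which layers would be redrawn; it does not draw anything.

```python
class LayeredView:
    """Toy model of a comparative map image split into separately redrawn layers."""

    def __init__(self, map_names):
        # Draw order, bottom to top: background, one layer per map,
        # then the correspondence lines and the legend.
        self.layers = (
            ["background"]
            + [f"map:{name}" for name in map_names]
            + ["correspondences", "legend"]
        )
        # Feature types the user has hidden, per map.
        self.hidden = {name: set() for name in map_names}
        # Every layer needs drawing the first time around.
        self.dirty = set(self.layers)

    def hide_feature_type(self, map_name, feature_type):
        """Hide a feature type on one map and mark the affected layers dirty."""
        self.hidden[map_name].add(feature_type)
        self.dirty.add(f"map:{map_name}")
        # Lines to neighboring maps may start or end on the hidden features.
        self.dirty.add("correspondences")

    def render(self):
        """Return the layers that would be redrawn, in draw order."""
        redrawn = [layer for layer in self.layers if layer in self.dirty]
        self.dirty.clear()
        return redrawn
```

With two maps, the first `render()` call returns all five layers; after `hide_feature_type("rice", "EST")` the next call returns only `map:rice` and `correspondences`, which is the partial update described above.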

Another reason I would like to overhaul CMap is the performance issue. I’ve learned a lot since I originally designed the schema, and I’d like a do-over on several parts where I think I could make things much faster (particularly in how I store the feature aliases). Performance is not terrible anymore on the Gramene site since we considerably lowered the number of features we put on maps, but I would really love to see it be able to handle at least 1 million features per map — at present we try to keep it to under 50-100K.

At the heart of CMap is the idea that two things have a correspondence: they are related in some manner, be it a sequence similarity or the same reagent being used in two experiments or whatnot. Modeling this efficiently has never been easy. Here is the core of the schema that shows how it’s currently modeled:

CMap schema

You can see in the feature/feature_correspondence/correspondence_evidence/correspondence_lookup area that things get a little circular, and that’s usually a bad thing. My first pass at this (oh, so many, many years ago, hacking on my little white iBook while riding the LIRR — my, the things I remember) had just the correspondence table, but then I realized that I had to have a table that showed the relationship going in both directions, hence the lookup table that says both “A->B” and “B->A” such that there are two lookup records for each correspondence. Ben came along and decided that, if the data was going to have to be denormalized to handle web queries, it might as well have mapping information in it in order to decrease the number of joins required, so he added the start/stop/type/map fields. Each correspondence can have one or more evidences.
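As a minimal sketch of the two-rows-per-correspondence idea, here is a toy version using Python's built-in `sqlite3` module. The table and column names are simplified stand-ins chosen for illustration, not the actual CMap schema, and the extra start/stop/type/map columns and the evidence table are left out.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with simplified, hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE feature (
            feature_id INTEGER PRIMARY KEY,
            map_name   TEXT NOT NULL,
            name       TEXT NOT NULL
        );
        -- One row per correspondence, stored in one direction only.
        CREATE TABLE feature_correspondence (
            correspondence_id INTEGER PRIMARY KEY,
            feature_id1 INTEGER NOT NULL,
            feature_id2 INTEGER NOT NULL
        );
        -- Denormalized lookup: two rows per correspondence (A->B and B->A).
        CREATE TABLE correspondence_lookup (
            correspondence_id INTEGER NOT NULL,
            feature_id1 INTEGER NOT NULL,
            feature_id2 INTEGER NOT NULL
        );
        CREATE INDEX lookup_f1 ON correspondence_lookup (feature_id1);
        """
    )
    return conn


def add_correspondence(conn, feature_a, feature_b):
    """Insert one correspondence and its two directional lookup rows."""
    cur = conn.execute(
        "INSERT INTO feature_correspondence (feature_id1, feature_id2) VALUES (?, ?)",
        (feature_a, feature_b),
    )
    corr_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO correspondence_lookup VALUES (?, ?, ?)",
        [(corr_id, feature_a, feature_b), (corr_id, feature_b, feature_a)],
    )
    return corr_id


def partners(conn, feature_id):
    """Find corresponding features by querying a single column."""
    rows = conn.execute(
        "SELECT feature_id2 FROM correspondence_lookup "
        "WHERE feature_id1 = ? ORDER BY feature_id2",
        (feature_id,),
    )
    return [row[0] for row in rows]
```

The point of the duplication is that `partners` only ever filters on `feature_id1`, so one index covers the query no matter which side of the correspondence the feature was originally loaded on. The cost is twice the rows and the need to keep both copies in sync.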

To use these tables, I have some set of features in a region of a given reference map, and I want to find all the other maps that have features with correspondences optionally including or excluding some set of evidence criteria. I want the species and name of the map sets, the map names, and the number of correspondences. In the many times I’ve thought about how I could simplify or denormalize this schema further, I’ve never really come up with a better solution. I can’t see a way to pre-calculate anything. Every time a user crops a map or excludes some feature types, I have to go back and requery, joining millions of rows across several tables to figure out the choices I can present in the comparative maps menu. If nothing else, I’ve thought about binning the features such that I could perhaps at least get rid of the start/stop range queries, which can be a killer for performance. Another option I’ve considered is getting rid of the evidence table. I know this removes some flexibility (e.g., only show me correspondences based on BLAST hits with a score greater than “X”), but maybe it’s not worth it — or maybe I move it directly into the correspondence record itself, and each evidence gives you a new correspondence record. No, that would make it hard to count unique correspondences. Anyway, you see how complicated this can get.
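The binning idea can be sketched in a few lines of plain Python. This is only an illustration under assumptions of my own: a fixed bin width (`BIN_SIZE`), features stored in every bin they touch, and a hypothetical `BinnedMap` class standing in for what would really be an indexed bin column in the database. The lookup fetches candidates by bin number and then applies the exact overlap test only to those candidates.

```python
from collections import defaultdict

BIN_SIZE = 100_000  # hypothetical bin width, in whatever units the map uses


def bins_for(start, stop, bin_size=BIN_SIZE):
    """Return every bin index that the interval [start, stop] touches."""
    return range(int(start // bin_size), int(stop // bin_size) + 1)


class BinnedMap:
    """Toy in-memory index of features on one map, keyed by bin number."""

    def __init__(self, bin_size=BIN_SIZE):
        self.bin_size = bin_size
        self.bins = defaultdict(list)

    def add(self, name, start, stop):
        # A feature spanning several bins is stored once per bin.
        for b in bins_for(start, stop, self.bin_size):
            self.bins[b].append((name, start, stop))

    def overlapping(self, start, stop):
        """Return names of features overlapping [start, stop], sorted."""
        found = set()
        for b in bins_for(start, stop, self.bin_size):
            for name, f_start, f_stop in self.bins.get(b, ()):
                # Exact check, applied only to the candidates in this bin.
                if f_start <= stop and f_stop >= start:
                    found.add(name)
        return sorted(found)
```

This does not remove the range comparison entirely; it narrows it to the handful of features sharing a bin with the region. Whether that helps in practice would depend on the bin width and how unevenly features are spread along the maps.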

I’ve done some looking about on modeling pairwise relationships but haven’t gotten far. I’d love to see how Facebook models “friends”. I know they have several million users, each of which can have hundreds of friends, so how do they efficiently store and query this? If anyone has any ideas for a radically improved version of how to show that a thing of one type in a particular location is related to something else, I’d love to hear about it. Something that is as simple as possible, but no simpler, would be perfect.

That is all.


One Comment on “Gramene goings on, etc.”

  1. Chris Duran said at 9:25 pm on November 15th, 2009:

    A number of months behind here with this comment, but I only just now found this blog! Congratulations on the CMap paper by the way – it is good to see the application with a publication attached to it.

    With regard to modelling pairwise relationships, I think a problem with a situation such as the correspondence relationships in CMap is that this type of relationship is poorly represented in a relational storage model. As you mentioned, this is where you need to construct a solution such as a lookup-style table, a pretty common database design compromise for quick queries in web services.

    I’m not sure exactly how they model it, but I am pretty sure that Facebook use the Cassandra project (http://incubator.apache.org/cassandra/) for their data storage. For those not familiar, it is a key-value type of database store, with keys mapping to multiple values stored as a family of columns.

    A really interesting datastore that I think has great potential is Neo4j (http://neo4j.org/) which stores data as a graph, with nodes representing data objects and edges representing the relationship between the objects.

