Massively Parallel Sequencing Data Storage Requirements

Posted: December 5th, 2008 | Author: Jer-Ming Chia | Filed under: Uncategorized | No Comments »

Peter recently asked for estimates of our disk space requirements over the next couple of years. I came across this table (Next-Generation Sequencing Informatics Statistics) and thought it would be useful.

Don’t you find the phrase “Next-Gen Sequencing” so… “Web 2.0”? I prefer “Massively, Embarrassingly, Shamelessly Parallel Sequencing”, but it is rather clunky. Suggestions?


Move to the new Lab

Posted: December 2nd, 2008 | Author: Lifang Zhang | Filed under: Uncategorized | No Comments »

We have been officially kicked out of Room 110 in the Delbruck building. Since November, we have been slowly moving, bit by bit, to our new room, 101, which used to be occupied by the Jacek lab. Read the rest of this entry »


These last couple weeks I…

Posted: December 2nd, 2008 | Author: shuly | Filed under: Uncategorized | 2 Comments »

…became an aunt to two beautiful twins, a girl and a boy, of my brother and his beautiful wife! Then on the next day I heard that my other brother’s girlfriend is pregnant too, and we’re all very excited – so I guess I can finally, somehow, in a way… be part of the “gramene babies” family…

Now back to business… These last couple weeks I’ve been working on several things. I’ll briefly discuss a few.

So, right after we came out with the Plant Ontology data release on November 12 (yea!!!), I started setting up the Plant Ontology wiki. We decided to convert all the documentation pages that are currently hosted on the Plant Ontology website in HTML format to the wiki, in order to simplify the task of editing documents and updating the website’s repository. To convert the HTML to wiki format I used the html2wiki converter, which helped quite a bit with the task but still required a few manual fixes to the pages.

The PO wiki has been set up as read-only for everyone, and may be edited by registered users only. However, we also need an internal section that only registered users may read. Well, this is a bit tricky, since MediaWiki does not support per-page access restriction. According to the MediaWiki documentation, there are two basic possibilities:

1. Set up separate wikis with a shared user database, configure one as viewable and one as unviewable, and make interwiki links between them.
2. Install a third-party hack or extension. You will have to reapply it every time you upgrade the software, and it may not be updated immediately when new security fixes or upgrades of MediaWiki are released. Almost all hacks or patches promising to add access restrictions will likely have flaws somewhere, which could lead to exposure of confidential data.

A list of various extensions that restrict user access to specific pages or namespaces, and problems they may exhibit or, on the contrary, deal with, is found here.

In order to test the extensions for security problems, one may consult this page.

Going over the above lists, I have chosen to test the Extension:Lockdown, which should allow us to use a custom namespace for our internal usage.

Using this extension, I’ve created a custom namespace, which only registered users may access. So far it seems to be working well. However, pages in this namespace do appear in search results, as well as on the “recent changes” page, even though the page content itself is not accessible to anonymous users and requires a login to view. I intend to do some further testing to make sure no sensitive data is exposed, yet I feel that we should eventually use two separate wikis with interwiki links between them (despite the fact that Pankaj is reluctant to maintain two wikis).

Another task I accomplished was to copy MySQL databases (all ensembl and markers dbs) from our live database server onto a new server, ‘filetta’, designated for web services, to be used by external users. These databases were created as compressed, read-only tables, using the ‘myisampack’ utility. This resulted in significant savings in space and, hopefully, better performance (which we’re still testing). To make the compression task simpler, and so as not to forget any step along the way (such as locking the tables before packing, running myisamchk to check the tables for errors, rebuilding the indexes after packing, and then flushing the tables), I wrote a simple shell script, which I can provide upon request (I intended to post it here, but encountered serious indentation problems).
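
The steps listed above can be sketched roughly as follows. This is not the shell script mentioned in the post, only a minimal Python illustration that builds the ordered list of steps for one table without running anything. The function name `pack_plan`, the data directory, and the database and table names are all hypothetical, and the exact step order (in particular checking before packing) is my reading of the list above.

```python
from pathlib import PurePosixPath


def pack_plan(data_dir, database, table):
    """Return ordered (step, command) pairs for packing one MyISAM table.

    Nothing is executed here; the commands are only assembled as strings.
    """
    # MyISAM keeps each table's index in <table>.MYI inside the database directory.
    index_file = PurePosixPath(data_dir) / database / f"{table}.MYI"
    return [
        # SQL statement; the lock only lasts while the issuing session stays open.
        ("lock", f"LOCK TABLES `{database}`.`{table}` READ"),
        # Check the table for errors before touching it.
        ("check", f"myisamchk --check {index_file}"),
        # Compress the table; it becomes read-only.
        ("pack", f"myisampack {index_file}"),
        # Rebuild the indexes, which myisampack does not do itself.
        ("rebuild indexes", f"myisamchk -rq {index_file}"),
        # Make the server reopen the table files.
        ("flush", "mysqladmin flush-tables"),
    ]


if __name__ == "__main__":
    # Hypothetical paths and names, for illustration only.
    for step, command in pack_plan("/var/lib/mysql", "ensembl_core", "gene"):
        print(f"{step:16s} {command}")
```

Keeping the steps in one ordered list is the point of the exercise: it is harder to forget the index rebuild or the final flush when every table goes through the same plan.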

Another piece of MySQL-related work was the “house cleaning” of “cabot”, our development database server, and changing its backup strategy. Previously we backed up all databases on a daily basis, which takes a huge amount of space and a long processing time (more than 12 hours); now we selectively back up some databases daily and others weekly, as needed. The next thing will be to keep up with the performance tuning work I discussed in my previous blog entry.
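
To illustrate the daily/weekly split, here is a minimal Python sketch of how one might decide which databases are due on a given day. The database names, the choice of Sunday for the weekly run, and the function name `databases_to_back_up` are all assumptions made for the example; the post does not say which databases fall in which group.

```python
import datetime

# Hypothetical groupings; the real lists on "cabot" are not given in the post.
DAILY = {"ontology_dev", "markers_dev"}
WEEKLY = {"ensembl_core_dev", "compara_dev"}

# Assumption: weekly backups run on Sunday (datetime's weekday() == 6).
WEEKLY_BACKUP_WEEKDAY = 6


def databases_to_back_up(day):
    """Return the sorted list of databases due for backup on the given date."""
    due = set(DAILY)
    if day.weekday() == WEEKLY_BACKUP_WEEKDAY:
        due |= WEEKLY
    return sorted(due)


if __name__ == "__main__":
    print(databases_to_back_up(datetime.date.today()))
```

A scheduled job could call this once a day and dump only the returned databases, so the weekly group adds to the backup time on one day only.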

One other thing I’ve been working on is semantic web services for the Plant Ontology using the SSWAP infrastructure. I’ve started off by composing a simple OWL-DL ontology, using the ontology editor Protégé, to describe PO annotations. This first draft of the ontology is based on the PO database structure and the type of data we are aiming to provide, and follows the guidelines of similar ontologies hosted on the SSWAP ontologies page. The generated poAnnotation.owl ontology file is best viewed by opening it in Protégé.

Well, that’s it for now…


Gramene Species tree and others

Posted: December 1st, 2008 | Author: Sharon | Filed under: Science | No Comments »

It is not often that I have newly generated original data to report at the end of the week. I am glad to say that I do this time. If you want to use a species tree with evolutionary rate as branch length for all the Gramene species, I can now provide you with one, or more than one. For example: Read the rest of this entry »


I love refactoring

Posted: November 23rd, 2008 | Author: Jim Thomason | Filed under: Programming | 3 Comments »

I love refactoring code.

One of the tenets of Extreme Programming is You ain’t gonna need it. Or, more verbosely stated: Always implement things when you actually need them, never when you just foresee that you need them.

And the rationale for this is that humans, as a species, are notoriously bad at predicting the future. Psychics are wrong more often than not. The end of the world hasn’t come nearly as often as it’s been predicted. I have no clue what I’m going to have for lunch today. So in our software, as in life, we should focus on the present. That’s what we know and what we should make our decisions off of.

Build with a nod towards the future, but that nod should be keeping in mind the fact that software needs to breathe and evolve and change, so we shouldn’t stifle it too much in its development. It’ll need to change in the future. But until then, build for right now.

Read the rest of this entry »


Liya’s trip to CSHL

Posted: November 14th, 2008 | Author: Liya | Filed under: Uncategorized | No Comments »

I visited the lab last Friday (November 7th). Since it was Friday, I didn’t meet many colleagues. The lab office was being remodeled. I’ll go to the lab next week (possibly November 19), and some telecommuters will be there too!
Doreen spent the whole day with me. We discussed my objectives for the next several months. First we discussed the protein annotation and xrefs. Will joined us on the phone. We set up a document on the protein annotation and xref pipeline so that different groups run the pipeline consistently. We looked into the reasons for the inconsistent GO annotation results on rice, maize and sorghum. One reason is inconsistent annotation: rice has better annotation than maize and sorghum, with more Uniprot protein annotations. It is also possible that the annotation pipeline used different versions of the Interpro database and the xref source databases. Another reason is biological, e.g. the corresponding region becomes a partial gene or an intron.
We also discussed some compara analysis with Josh by phone. Josh wrote a very informative Word document on the research objectives. What I need to do first is find the orthologue dataset. The summer interns have a perl script that gets the orthologues from the compara database using the ensembl API. I have used this script to get the orthologues, and I’d like to update it to make it more generic. I’d also like to take a look at the mart schema to see the underlying structure. The next item is to look for the syntenic regions between rice and sorghum. Jack Chen’s lab has developed OrthoCluster (Ismael Vergara) to find syntenic regions. Josh also pointed out DiagHunter and SyMap.
The third item on the to-do list is microRNA targets, pathway analysis, and enrichment for GO categories: read Chris’s paper ‘Identifying microRNAs in plant genomes’, get the protein targets for the microRNAs from Lifang, and identify possible pathways and GO categories.
The trip was very objective-oriented. I’ll visit the lab more often from now on.


HDF5 and next-gen data

Posted: November 10th, 2008 | Author: Jer-Ming Chia | Filed under: Big Data | 3 Comments »

I’ve realized that over the last few months I have become de-sensitized to large numbers. It’s either because of the time spent working on the next-generation sequencing data or the never-ending 0s being added to the bailout package.

Read the rest of this entry »


Moving to Subversion

Posted: October 31st, 2008 | Author: Shiran Pasternak | Filed under: Uncategorized | No Comments »

Recently, we decided to migrate our entire codebase from CVS to Subversion (SVN). Ken did most of the groundwork. I am really excited about our new version control setup, and have been dreaming about this for a long time. But to butcher a biblical metaphor, I only played Aaron to Ken’s Moses.

Read the rest of this entry »


Ken’s scary Halloween update

Posted: October 31st, 2008 | Author: Ken Youens-Clark | Filed under: Uncategorized | No Comments »

It’s Halloween and hot in Texas, and I’m blogging. Read the rest of this entry »


GO Slim

Posted: October 21st, 2008 | Author: Liya | Filed under: Uncategorized | 2 Comments »

Gramene uses GO terms to annotate proteins and genes. I am familiar with GO terms since I am responsible for developing and maintaining the ontology database and loading all kinds of ontology terms and annotations, but I knew little about GO slim terms. This summer Doreen asked me to generate the GO and GO slim annotations for rice and arabidopsis. Only then did I begin to work with GO slim terms.

GO slim terms are a subset of GO terms. The slim is still an ontology and a DAG. Initially I thought GO slim terms were a bunch of ancestor terms with no relationships among them, but they are not. Because the GO ontology is a DAG and the GO slim is also a DAG, one term used in an annotation may map to several slim terms via different paths. When I looked at the slim example graph, nodes 9 and 10 turned out to be quite interesting. Node 9 seems to map to slim nodes 3 and 4, but node 3 is dropped because it is a parent term of slim node 4; node 4 is the more pertinent term for node 9. Node 10 can reach slim node 2 and slim node 3 via two paths (10->5->2 or 10->8->6->3). Since slim nodes 2 and 3 have no hierarchical relationship, node 10 maps to both.
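
The mapping rule described above can be sketched in a few lines of Python. This is not map2slim’s actual code, only a minimal illustration of the idea: collect every slim term reachable from a node, then drop any slim term that is an ancestor of another slim term already in the result. The toy graph uses the paths mentioned above (10->5->2, 10->8->6->3, and 4 as a child of 3); the direct edge from node 9 to node 4 is an assumption, since the path is not spelled out here, and the function names are made up.

```python
def ancestors(term, parents):
    """Return every ancestor of `term` in the DAG (the term itself excluded)."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def map_to_slim(term, parents, slim_terms):
    """Map `term` to its most specific slim terms."""
    # All slim terms reachable from the term (a slim term maps to itself).
    reachable = (ancestors(term, parents) | {term}) & slim_terms
    # A slim term is redundant if a more specific reachable slim term lies below it.
    redundant = set()
    for slim in reachable:
        redundant |= ancestors(slim, parents) & reachable
    return reachable - redundant


# Toy graph as child -> parents. The edge 9 -> 4 is assumed for illustration.
PARENTS = {
    10: [5, 8],
    5: [2],
    8: [6],
    6: [3],
    9: [4],
    4: [3],
}
SLIM = {2, 3, 4}

if __name__ == "__main__":
    print(sorted(map_to_slim(9, PARENTS, SLIM)))
    print(sorted(map_to_slim(10, PARENTS, SLIM)))
```

With this toy graph, node 9 reaches both slim nodes 3 and 4 but keeps only node 4, while node 10 keeps both slim nodes 2 and 3 because neither is an ancestor of the other.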

The Gramene ontology database stores the GO terms and the relationships among them. It does not store the slim terms and their relationships, so I can’t use it to do the GO slim mapping. I found that Chris Mungall already has a script, map2slim, to do the work. It is easy now since Shuly has already installed the go-perl package for amigo on our machine, and I have the ontology cvs, which contains the GO ontology and GO slim ontology terms. map2slim needs its input file in a specific GO annotation format, so what I have to do is write a generic script to get the ensembl gene annotations from the ensembl databases and generate the result in the GO annotation format. My work is done! Later I’d like to look into the map2slim code to see what it does when the relationships in the slim ontology are not consistent with the relationships in the GO ontology, e.g. when a slim term becomes obsolete in the GO ontology. For now the script just fails.
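
As a rough idea of what the output side of such a script might look like, here is a minimal Python sketch that formats one annotation as a tab-delimited line. It assumes the 15-column gene association layout used for GO annotation files; the function name `gaf_line`, the gene ID, and the other values in the example are placeholders, and fetching the annotations from the ensembl databases is left out entirely.

```python
def gaf_line(db, object_id, symbol, go_id, reference, evidence, aspect,
             object_type, taxon, date, assigned_by,
             qualifier="", with_from="", name="", synonym=""):
    """Format one annotation as a 15-column, tab-delimited line.

    Column order is assumed to follow the gene association file layout:
    DB, object ID, symbol, qualifier, GO ID, reference, evidence code,
    with/from, aspect, name, synonym, object type, taxon, date, assigned by.
    """
    fields = [db, object_id, symbol, qualifier, go_id, reference, evidence,
              with_from, aspect, name, synonym, object_type, taxon, date,
              assigned_by]
    return "\t".join(fields)


if __name__ == "__main__":
    # Placeholder values for illustration; not real Gramene data.
    print(gaf_line(
        db="ENSEMBL", object_id="GENE0001", symbol="gene0001",
        go_id="GO:0008150", reference="REF:placeholder", evidence="IEA",
        aspect="P", object_type="gene", taxon="taxon:4530",
        date="20081021", assigned_by="Gramene",
    ))
```

Optional columns default to empty strings so that every line still has the same number of tab-separated fields, which is what a downstream parser would typically expect.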