Resubmitting failed SGE array tasks

Posted: December 1st, 2009 | Author: Shiran Pasternak | Filed under: Programming

I can't seem to find a straightforward mechanism for resubmitting specific tasks in a Sun Grid Engine (SGE) array job, so I rolled my own. There's not much to it, but it works well for small array jobs where the failed tasks can be specified by hand.

Let me back up a bit. We're into grid computing. We have a boatload of data from various plant genomes that we like to go to town on. On our campus we happen to have a 2,000-core compute farm (known as BlueHelix), and we use Sun Grid Engine (SGE) for job management. One nice feature of SGE is the ability to submit array jobs. An array job is a convenient SGE construct for grouping tasks that are very similar. You submit an array job once, and SGE spawns individual compute tasks that inherit the environment and run parameters of the job. Tasks are identified by their index number. A simple array job submission looks like this:

$ qsub -N job_name -t 1-10 ./script.sh

The job is named job_name and is composed of 10 tasks. While the job is queued or running, it's very easy to track and manipulate that job using its chosen name. For example, to find the status of the array job, run:

$ qstat -j job_name

SGE also lets you embed command-line parameters on special lines (prefixed with #$) within a submission script; qsub reads them when the script is submitted. So you can submit an array job by running

$ qsub ./script.sh

where the first few lines of ./script.sh might look like this:

#!/bin/sh
 
#$ -N job_name
#$ -t 1-10

Each spawned task gets an environment variable, $SGE_TASK_ID, set to its index in the array. It is this variable that distinguishes one task from the next (it can be used programmatically inside ./script.sh). This presents a slight complication because not all jobs can be naturally indexed by number. How can an array job be used to process a batch of arbitrarily named files, for example? It is the submitter's responsibility to dereference the task ID to a meaningful value that can be processed by a given task.

Today I had a specific task: dump all the maize chromosomes, soft-masking repeats, from an Ensembl database. Maize has 10 chromosomes, and an additional 11th chromosome we call UNKNOWN, where a small number of unanchored BAC clones are arbitrarily concatenated. In the past I wrote a handy script that uses the Ensembl API to dump any specified region, with the option to soft- or hard-mask repeats. Dumping the ~2 Gb maize genome serially takes a good number of hours. With repeat-masking (the genome is more than 80% repetitive), it also requires a lot of memory. Naturally, we can use parallelization to speed up the dump. The simplest approach (not necessarily the fastest, or the best use of a 2,000-core cluster) is to divide the genome by chromosome, so the array job submits 11 tasks. Of course, the tasks aren't evenly sized, as the chromosome lengths vary rather wildly (the largest chromosome, 1, is more than twice the size of the smallest real chromosome, 10).

Within the submission script, we can use $SGE_TASK_ID as is to tell the dump script which chromosome the task should dump. This works for 10 out of 11 chromosomes. We just need to handle the special case of the unanchored chromosome. The submission script looks like this:

#!/bin/sh
 
#$ -N job_name
#$ -t 1-11
 
chr=$SGE_TASK_ID
[ "$chr" -eq 11 ] && chr="UNKNOWN"
 
./dump.pl --region $chr

This is not very generalizable, but it illustrates how the task ID can be used to dereference meaningful values (a good general approach is to have task IDs be line numbers in a control file where each line contains the meaningful value).
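The control-file idea could look roughly like the sketch below. The file name regions.txt, the job name dump_regions, and the helper region_for_task are made up for illustration; the sketch assumes the file holds one region per line (11 lines here), and that the job is started in the directory containing it (hence -cwd).

```shell
#!/bin/sh

#$ -N dump_regions
#$ -t 1-11
#$ -cwd

# Hypothetical helper: print line number $1 of control file $2.
region_for_task() {
    sed -n "${1}p" "$2"
}

# Only dump when running as an array task. SGE sets SGE_TASK_ID to
# "undefined" for non-array jobs; outside SGE it is simply unset.
if [ -n "${SGE_TASK_ID:-}" ] && [ "$SGE_TASK_ID" != "undefined" ]; then
    chr=$(region_for_task "$SGE_TASK_ID" regions.txt)
    ./dump.pl --region "$chr"
fi
```

With this layout, the UNKNOWN special case becomes just another line in regions.txt, and the same script can serve any list of arbitrarily named regions or files.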

A problem I ran into is that, using the default submission parameters, 5 out of 11 chromosome tasks failed because they ran out of memory (ironically, the largest chromosome was dumped successfully, probably because it ran on a high-memory node). I wanted to resubmit just those failed tasks without rerunning the 6 completed tasks. But as I said in the beginning, I didn't find a straightforward way to resubmit those 5 tasks. The -t parameter does not accept a delimited list of tasks, unfortunately (it does let you do other nifty things, such as specifying one range, e.g., 3-8, optionally with a step size, e.g., 1-11:2).

The solution I came up with took advantage of another cool SGE feature that allows you to pass in a script through standard input. This lets you write a wrapper script that does some preprocessing before printing (typically via a heredoc) a shell script to standard output. Submitting then looks like this:

$ ./submit.sh | qsub

For this illustration, the following chromosomes failed to dump: 2, 4, 6, 7, and 8. To specify which tasks to run, I used a bash array variable holding the failed chromosomes, then used the array's length to set the array job's task range. The final script looks something like this:

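#!/bin/bash
# Wrapper: prints the job script to standard output (bash is needed for arrays).

chrs=(2 4 6 7 8)

cat <<SCRIPT
#!/bin/bash

#$ -N dump_chromosomes
#$ -t 1-${#chrs[*]}
# Ask SGE to run the job under bash, since the job script uses arrays.
#$ -S /bin/bash

# Rebuild the array inside the generated script.
chrs=(${chrs[*]})

# Task IDs start at 1; bash array indices start at 0.
chr=\${chrs[\$SGE_TASK_ID-1]}

./dump.pl --region \$chr
SCRIPT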

This isn’t that pretty, but it gets the job done, no pun intended. Some sigils ($) had to be escaped to avoid interpolation. That’s because we have to distinguish between variables that are expanded when the wrapper is executed and variable names that must remain intact so they can be expanded later, when the task runs.
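To make the escaping rule concrete, here is a tiny standalone sketch, unrelated to SGE itself. The function show_escaping and the variable when are made up for the demonstration.

```shell
#!/bin/sh

# Illustration only: "when" and "show_escaping" are made-up names.
show_escaping() {
    when="generation time"
    cat <<EOF
expanded by the wrapper: $when
left for the task: \$when
EOF
}

show_escaping
```

This prints "expanded by the wrapper: generation time" followed by "left for the task: $when". Quoting the delimiter (<<'EOF') would turn off expansion for the entire heredoc, which doesn't help here because the wrapper needs some values filled in at generation time.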

The first declaration creates an array of chromosomes that need to be re-dumped (this could just as easily have been the command-line argument list, which would allow you to run the job as ./submit.sh 2 4 6 7 8). The script then prints another script where the job array parameter is reconstructed using the length of the array (${#chrs[*]}). A slight inconvenience is that we also had to reconstruct the same array within the generated script so that the task can use it to dereference its corresponding chromosome.
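The command-line variant might look something like the following sketch. The function name generate_job is made up, and the -S /bin/bash line is an assumption that the cluster would not otherwise start the job under bash.

```shell
#!/bin/sh

# Hypothetical helper: print an SGE array job script for the
# chromosomes given as arguments.
generate_job() {
    cat <<SCRIPT
#!/bin/bash

#$ -N dump_chromosomes
#$ -S /bin/bash
#$ -t 1-$#

chrs=($*)

chr=\${chrs[\$SGE_TASK_ID-1]}

./dump.pl --region "\$chr"
SCRIPT
}

# Only print a script when chromosomes were actually given.
if [ "$#" -gt 0 ]; then
    generate_job "$@"
fi
```

Running ./submit.sh 2 4 6 7 8 prints the job script for inspection, and ./submit.sh 2 4 6 7 8 | qsub submits it. Here $# and $* are filled in by the wrapper, while the escaped \$SGE_TASK_ID and \$chr survive into the generated script.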

This approach no longer treats $SGE_TASK_ID as the chromosome name itself; instead, it uses the task ID to look up a meaningful value in an array (chrs, in this example). This is one illustration of how array job task IDs can be dereferenced in a general way. Generating the job script also lets you, as a sanity check, inspect the script that would be submitted (by not piping it into qsub).

So what have we learned today? Besides the fact that the maize genome is huge and mostly repetitive, there are various sleek ways to submit jobs to a compute farm using SGE. The SGE qsub command is very flexible and provides powerful ways of submitting — and resubmitting — parallel jobs.


