Installing and setting up Sun Grid Engine on a single multi-core PC

Introduction

I was on the phone the other day with a friend of mine. He told me that he wanted to execute his C program for several sets of input arguments. He was planning to use his quad-core workstation at work. He wanted to ensure that all four cores would remain busy until the runs of the program had completed for the various sets of input arguments, and that each core would work on exactly one set of arguments at a time.

His initial thought had been to make use of the “make” command. This solution would entail a makefile, which would be invoked by running “make -j4” in the directory containing the makefile. The “-j4” option specifies that up to four jobs may run simultaneously, thus allocating a single job at a time to each of the four cores.
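
Although neither of us ever wrote that makefile, a minimal sketch could look as follows; the argument sets (a, b, c, d) are placeholders and the actual program is replaced by an “echo” so that the sketch runs anywhere:

```shell
# Generate a throwaway makefile with one pattern-rule target per argument set.
# .RECIPEPREFIX (GNU make >= 3.82) avoids the need for literal tab characters.
cat > Makefile.runs <<'EOF'
.RECIPEPREFIX = >
ARGSETS := a b c d
all: $(ARGSETS:%=run_%)
run_%:
> echo "run with argument set $*" > out_$*.txt
EOF

# "make -j4" executes up to four targets concurrently, one per core.
make -f Makefile.runs -j4
```

Each target invokes the program with one set of arguments, so “-j4” keeps at most four runs alive at any point in time.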

I wasn’t aware of this use of the “make” command. I found it elegant, although neither of us dedicated the time to find out how the makefile should be written. My initial thought was to write a bash script to regulate the submission of jobs. However, the term “job scheduling” then came to my mind. Employing a job scheduler would probably save more time in the long run, be neater and offer better job supervision.

The Load Sharing Facility (LSF) platform and Condor were the two job schedulers I knew from my work experience. My friend suggested the Sun Grid Engine (SGE). The obvious challenge was installing and setting up the job scheduler. There are numerous manuals, user guides, FAQs and forums explaining how to install SGE on a cluster. We, on the other hand, were inexperienced starters intending to use SGE to cover our individual computing needs on a single multi-core machine instead of a cluster. It turned out to be a cumbersome task for us to find coherent documentation for dummies to resolve the installation task at hand.

Before going any further with SGE, it is worth mentioning GNU Parallel, which is a shell tool for executing jobs in parallel on one or more computers. Parallel’s scope is somewhat different from SGE’s, since the former is not a fully fledged job scheduler. Nevertheless, Parallel is a lightweight, user-friendly tool; it handles job allocation to multiple cores of the same machine and it allows submission of multiple job arrays. In fact, my friend fulfilled his initial program execution request with Parallel. So, Parallel is an alternative functional tool for parallel job execution suiting the needs of an individual user.

In what follows, I will try to provide an overview of how to install and set up SGE on a single multi-core PC. As usually happens, this post may not solve all the problems you might encounter during installation, although the intention has been to make it self-contained. It is the account of the exciting voyage of my friend and me, summoning our naive first-hand experience of tailoring the SGE installation to the modern reality of an individual user with a single multi-core machine.

The installation and setup instructions are based on the Linux Mint Debian Edition (LMDE) distribution and have been tested on Kubuntu 11.10. This guide is structured in three parts: the first part focuses on the installation, the second on setting up SGE and the third on testing SGE by means of elementary examples.


Part 1: Installation of SGE

To embark on the installation of the gridengine packages, run the following command on your terminal:

sudo apt-get install \
  gridengine-master gridengine-exec gridengine-common gridengine-qmon gridengine-client

Alternatively, you can run the shorter, though perhaps more error-prone, command

sudo apt-get install gridengine-*

A pop-up window will appear within the terminal during installation, with title “Configuring gridengine-common”. A series of questions show up sequentially in this window:

  1. Question: “Configure SGE automatically?” Answer: highlight “<Yes>” and press “Enter”.
  2. Question: “SGE cell name:” Answer: type “default”, then press “Tab” to highlight “<Ok>” and press “Enter”.
    Note here that you are free to choose any name you want for your SGE cell instead of “default”, such as “sge_cell” for example. If you alter the SGE cell name, you will have to subsequently set the SGE_CELL variable in your ~/.bashrc file accordingly (assuming that bash is your default shell). For instance, if you set the SGE cell name to be sge_cell, you will add the following line in your ~/.bashrc:

    export SGE_CELL="sge_cell"
    

    Furthermore, you will need to add the above line of code in your /root/.bashrc file so that the SGE cell is also known to the root. It is advised that you leave the SGE cell name as it is, holding the “default” value.

  3. Question: “SGE master hostname:” Answer: type “localhost”, then press “Tab” to highlight “<Ok>” and press “Enter”.
    Instead of “localhost”, you can choose the hostname of your computer, which can be found by running the “hostname” command from the terminal:

    hostname
    

After answering these three questions, the pop-up window closes and the installation continues on the terminal. If for any reason you need to reconfigure the gridengine-master package, you can do so by invoking the following command:

sudo dpkg-reconfigure gridengine-master

The installation of gridengine is now complete, yet this does not mean that you are necessarily ready to use SGE. First of all, check whether sge_qmaster and sge_execd are running by using the command

ps aux | grep "sge"

The output I got verified that sge_qmaster and sge_execd are running:

sgeadmin  1310  0.0  0.1 135968  5376 ?        Sl   13:41   0:00 /usr/lib/gridengine/sge_qmaster
sgeadmin  1336  0.0  0.0  54760  1544 ?        Sl   13:41   0:00 /usr/lib/gridengine/sge_execd
1000      3171  0.0  0.0   7780   860 pts/0    S+   13:54   0:00 grep --colour=auto sge

If this is not the case for you, then start up sge_qmaster and sge_execd by executing the following three commands:

sudo su
sge_qmaster
sge_execd

Once you ensure that sge_qmaster and sge_execd are running, try to start qmon, the graphical user interface (GUI) for the administration of SGE:

sudo qmon

It is likely that the qmon window will not load, but instead you will get an error message. This is what I got:

Warning: Cannot convert string "-adobe-courier-medium-r-*--14-*-*-*-m-*-*-*" to type FontStruct
Warning: Cannot convert string "-adobe-courier-bold-r-*--14-*-*-*-m-*-*-*" to type FontStruct
Warning: Cannot convert string "-adobe-courier-medium-r-*--12-*-*-*-m-*-*-*" to type FontStruct
X Error of failed request:  BadName (named color or font does not exist)
  Major opcode of failed request:  45 (X_OpenFont)
  Serial number of failed request:  643
  Current serial number in output stream:  654

The error message indicates that some fonts are missing. The package which contains the necessary fonts is called xfonts-75dpi. In my case, xfonts-75dpi was installed automatically alongside the installation of the gridengine packages. Nevertheless, I got the error message because the fonts were not loaded after their installation. So, I merely restarted my computer. After rebooting, the “sudo qmon” command loaded the qmon window. If xfonts-75dpi is not installed on your system, then install it using the following command and then reboot:

sudo apt-get install xfonts-75dpi

After having resolved any possible font-related issues, “sudo qmon” should load the SGE admin window. If you let the window remain idle or if you try to press any of its buttons, such as “Job Control”, the most likely outcome is a pop-up window with the message “cannot reach qmaster”. Click on the “Abort” button of the pop-up window to terminate qmon. Try also the “qstat” command, which in my case gave the following error message:

error: commlib error: access denied (client IP resolved to host name "localhost". This is not identical to clients host name "russell")
error: unable to contact qmaster using port 6444 on host "russell"

It is useful to delve into the error message in conjunction with the /etc/hosts file of my system:

127.0.0.1	localhost
127.0.1.1	russell
# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

The hostname of my computer is “russell”. According to the error message, SGE set the client hostname to “russell”, which /etc/hosts maps to the IP address 127.0.1.1, while the client IP resolved to 127.0.0.1, the address designated to the hostname “localhost”. To resolve this ambiguity, I changed the first two lines of my /etc/hosts so that both hostnames “localhost” and “russell” share the same IP (as a word of warning, make a backup of your /etc/hosts file before making any changes to it). To be more specific, I deleted the second line and appended the “russell” hostname to the end of the first line. My /etc/hosts file thus became:

127.0.0.1	localhost russell
# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

Moreover, it is possible that your /etc/hosts file contains by default the string “localhost.localdomain” in the first line, for example as in

127.0.0.1	localhost localhost.localdomain russell

If that’s the case, make sure you remove “localhost.localdomain” so that only “localhost” and your machine’s hostname (“russell” in my case) are tied to the IP 127.0.0.1:

127.0.0.1	localhost russell

You could restart sge_qmaster and sge_execd manually, but since reconfiguring the association between IPs and hostnames in /etc/hosts is a fundamental change to your system’s state, you are advised to restart your computer instead before proceeding any further. After rebooting, “qstat” and “sudo qmon” should run without returning any error messages.
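
After the reboot, you can also verify the name resolution directly from the terminal. Both commands below should report a loopback address (127.0.0.1, or ::1 for IPv6); replace “russell” with your own hostname:

```shell
# Query the system resolver for both hostnames; with the /etc/hosts fix
# in place, each should map to a loopback address.
getent hosts localhost
getent hosts "$(hostname)"
```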


Part 2: Setting up SGE

SGE is now (assumed to be) installed on your system and needs to be set up before use. To this end, it will be demonstrated how to use qmon in order to initialize and thereafter administrate SGE.

  1. Firstly, run “sudo qmon” and click on the “User Configuration” button of the loaded qmon window. On the text box of the “User” tab, insert your username, let’s say “theodore” (without the double quotes), and then click on the “Add” button. As soon as you add yourself as an SGE user, the “User” tab of the user configuration window will look similar to the following snapshot:
    User tab of user configuration panel of qmon

    Click on the “Userset” tab and either create a new set of users or highlight an existing one. To follow my example setup, highlight the existing set of users called “arusers” and press “Modify”. A new window pops up. In the text box titled “User/Group”, type the username of the SGE user that you recently created (theodore) and press “Ok”. The snapshot below shows how the “Userset” tab of the user configuration window will look:

    Userset tab of user configuration panel of qmon

    Press “Done” to leave the user configuration window.

  2. Click on the “Host Configuration” button of the qmon window. On the text box of the “Submit Host” tab, insert the hostname, for example “localhost” (without the double quotes), and click on the “Add” button. This is a snapshot of the submit host tab of the host configuration panel:
    Submit host tab of host configuration panel of qmon

    Select the “Execution Host” tab of the host configuration panel. Highlight one of the hosts, say the localhost, and click on the “Modify” button. A new window pops up. Select the “User Access” tab, then highlight one of the access list entries (for example arusers) and, after making sure that the “Allow Access” radio button is chosen, click on the red right arrow as demonstrated in the snapshot below:

    Giving access to a set of users for a selected execution host

    Click “Ok” to return to the host configuration panel. This is how the localhost entry in the execution host tab of the host configuration panel will look:

    Execution host tab of host configuration panel of qmon

    Notice that the “Access” entry of the “Access Attributes” list has the value “arusers”, not “NONE”.

Finally, click on the “Queue Control” button of the qmon window. Press “Add” under the “Cluster Queues” tab. A new window loads. Type the name of your choice for the new queue in the text box titled “Queue Name” (“mainqueue” was chosen for the current example). Type also the hostname (for example localhost) in the text box titled “New Host/Hostgroup” and then click on the red left arrow so that the hostname is added to the hostlist of the queue. On the “User Access” tab of the current window, highlight one of the access list entries (for example arusers) and, after making sure that the “Allow Access” radio button is chosen, click on the red right arrow. The following is a snapshot of the “User Access” tab:

Giving access to a set of users for a selected queue

Select the “General Configuration” tab of the current window. Change the following entries:

  • The value of the “Processors” entry is by default set to “UNDEFINED”. Set it to be equal to the number of processors of your PC. I set it to 2, as I have a dual-core computer.
  • Specify the “Shell”. By default, it is set to /bin/csh. I set it to /bin/bash, as I prefer to work on the bash shell.
  • Specify the number of “Slots”, which is by default set to 1. Should you wish, you can define more slots than the number of cores and cap the maximum number of concurrently running jobs later, at submission time, using the “-tc” option for job arrays. I chose to set the number of slots equal to the number of cores of my computer, that is 2.

This is how the general configuration tab of the queue control panel will look after the above changes:

General configuration tab of queue control panel of qmon

Press “Ok” to leave the current window of the newly created queue, then click on “Done” to leave the queue control panel and finally click on the “Exit” button of the qmon window to exit qmon. SGE has been set up and is ready to be tested.
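
As a side note, much of the above can in principle be scripted with SGE’s “qconf” utility instead of clicking through qmon. The following untested sketch shows rough command-line counterparts of some of the steps, reusing the names from the example; the queue additionally needs a configuration file prepared beforehand, here hypothetically called mainqueue.conf:

```shell
# Rough qconf counterparts of some of the qmon steps (run with root privileges):
sudo qconf -au theodore arusers   # add user "theodore" to the "arusers" access list
sudo qconf -as localhost          # register localhost as a submit host
sudo qconf -Aq mainqueue.conf     # create a cluster queue from a configuration file
```

These commands only cover part of the setup; consult “man qconf” for the full set of options.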


Part 3: Elementary Test Runs on SGE

As a matter of personal practice, I prefer to organize my submitted jobs with the help of wrapper scripts. Generally speaking, if the script containing the actual code is called scriptname.sh, I write another wrapper script called scriptname_qsub.sh which invokes qsub to submit scriptname.sh on SGE.

“Hello world” Test

To start with, submit a typical “hello world” bash script in order to check whether SGE works. This toy script simply echoes “Hello world” to a text file. Create a directory called hello_world. In the current example, its full path is

/home/theodore/tmp/hello_world

Select your own full path and modify subsequent paths accordingly. The hello_world folder contains two bash scripts, namely hello_world.sh and hello_world_qsub.sh. The former script holds the “echo” command

#!/bin/bash

echo "Hello world" > /home/theodore/tmp/hello_world/hello_world_output.txt

while the latter invokes the following “qsub” command

#!/bin/bash

qsub \
  -e /home/theodore/tmp/hello_world/hello_world_qsub.error \
  -o /home/theodore/tmp/hello_world/hello_world_qsub.log \
  ./hello_world.sh

The options “-e” and “-o” set respectively the paths used for the standard error and standard output stream of the job. While in the hello_world directory, submit the job by executing the hello_world_qsub.sh script:

./hello_world_qsub.sh

You will get some output similar to

Your job 1 ("hello_world.sh") has been submitted

You can use the “qstat” command to check the progress of the submitted job. As soon as execution completes successfully, three new files will be present in the hello_world folder:

  • the file hello_world_output.txt will contain a single line with the string “Hello world”,
  • the file hello_world_qsub.error will be empty and
  • the file hello_world_qsub.log will be empty.

Job Array Test

To confirm that submission of job arrays operates without complications, a relevant example is provided. The job array consists of five tasks, each of which echoes its task ID, held by the environment variable $SGE_TASK_ID. Create a directory called “job_array”, whose full path is

/home/theodore/tmp/job_array

Amend the path to fit your own setup. Create two scripts in job_array, namely job_array.sh and its wrapper counterpart job_array_qsub.sh. The contents of the former are

#!/bin/bash

echo "SGE_TASK_ID=$SGE_TASK_ID" > \
  /home/theodore/tmp/job_array/job_array_$SGE_TASK_ID.output.txt

sleep 3

while the contents of the latter are

#!/bin/bash

qsub \
  -t 1-5 \
  -e /home/theodore/tmp/job_array \
  -o /home/theodore/tmp/job_array \
  ./job_array.sh

From the job_array directory, execute the wrapper script:

./job_array_qsub.sh

Note the double appearance of the $SGE_TASK_ID variable in job_array.sh, once as part of the name of the output files and once as a variable in the “echo” command. The resulting output will be five files job_array_i.output.txt, for i=1, 2, 3, 4, 5, each containing a single line with the corresponding string “SGE_TASK_ID=i”.

As for the job_array_qsub.sh script, notice that the argument to the “-e” and “-o” options is the full path of the job_array directory. Therefore, the standard error and output streams of the tasks of the job array will be saved in this directory. The default file names will be used, which follow the forms job_name.ejob_id.task_id for the error stream and job_name.ojob_id.task_id for the output stream. In my run of the example, five empty standard error files and five standard output (log) files were generated in job_array, with names job_array.sh.e23.[1-5] and job_array.sh.o23.[1-5] respectively, where 23 happened to be the job ID at the time of execution.

A few closing remarks on the job array test run:

  • The order of arguments to qsub can be critical. For instance, I made the mistake of entering the “-t 1-5” option after the script invocation “./job_array.sh”, which threw an error; the qsub options must precede the script to be submitted.
  • If you want to throttle the number of concurrently running jobs, you can use the “-tc” option. For instance, if I had defined four slots on my dual-core machine with qmon and wanted to avoid having more than one running job per core, I would have used the command
    qsub -t 1-5 -tc 2 ./job_array.sh
    
  • It is possible to override the default file names for the standard error and standard output files of the job array, although the default naming convention is admittedly handy on several occasions. To facilitate the choice of custom file names, SGE offers the pseudo-environment variables $JOB_ID, $TASK_ID, $JOB_NAME, $HOME, $USER and $HOSTNAME. I tried to use the first four of them in the wrapper script job_array_qsub.sh, yet unfortunately they returned empty strings. I haven’t found a solution for this issue so far, so I rely on the default file names as a temporary solution.
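
A plausible explanation for the empty strings, assuming the pseudo-environment variables were typed unquoted or inside double quotes in the wrapper script: bash expands $JOB_ID and friends before qsub ever sees them, and since they are not set in the submitting shell they expand to nothing. Single quotes would hand the literal strings over to qsub, which performs the substitution itself. I have not verified this on SGE, but the shell side of the argument is easy to demonstrate:

```shell
# JOB_ID only exists inside SGE, so it is unset in the submitting shell:
unset JOB_ID
echo "job_array.$JOB_ID.error"    # double quotes: the shell expands this to job_array..error
echo 'job_array.$JOB_ID.error'    # single quotes: the literal string job_array.$JOB_ID.error survives
```

With single quotes, the wrapper would pass, for example, -o '/home/theodore/tmp/job_array/job_array.$JOB_ID.$TASK_ID.log' for qsub to fill in.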

“Hello world” Job for Testing Email Notification

As a final example, the “Hello world” job is rerun with a couple more qsub options so as to send an email notification upon job completion. Mail delivery is achieved in this example by using the exim4 message transfer agent. Follow the instructions of this post in order to set up exim4 to route emails through the Gmail SMTP servers. You can confirm that exim4 has been successfully configured with the help of a test “mail” command:

echo "Mail body: an exim4 test" | mail -s "Mail subject: exim4 test" "recipient@somewhere.com"

Once exim4 is set up, create a directory called “mail_test” and populate it with the mail_test.sh script

#!/bin/bash

echo "Hello world" > /home/theodore/tmp/mail_test/mail_test_output.txt

and with the mail_test_qsub.sh wrapper script

#!/bin/bash

qsub \
  -e /home/theodore/tmp/mail_test/mail_test_qsub.error \
  -o /home/theodore/tmp/mail_test/mail_test_qsub.log \
  -m e \
  -M "recipient@somewhere.com" \
  ./mail_test.sh

The “e” argument of the “-m” qsub option instructs SGE to send the email at the end of the job.

From the mail_test directory, execute the wrapper script:

./mail_test_qsub.sh

If everything works well, you will receive an email notifying you that job execution is complete. I received an email with subject “Job 8 (mail_test.sh) Complete” containing the following lines:

Job 8 (mail_test.sh) Complete
 User             = theodore
 Queue            = mainqueue@localhost
 Host             = localhost
 Start Time       = 02/06/2012 21:02:43
 End Time         = 02/06/2012 21:02:43
 User Time        = 00:00:00
 System Time      = 00:00:00
 Wallclock Time   = 00:00:00
 CPU              = 00:00:00
 Max vmem         = NA
 Exit Status      = 0

As a word of warning, use the mail related qsub options with care. An erroneous submission of a single job or of a job array can potentially clutter your mailbox.

29 thoughts on “Installing and setting up Sun Grid Engine on a single multi-core PC”

  1. I’ve adapted your great post into a script that can be used to install SGE on Travis-CI automatically without any user interaction. It’s very handy for running unit tests for code that uses SGE.

  2. Just a small comment about the beginning of this article…

    SGE stands for both Sun Grid Engine and Son of Grid Engine. What the Ubuntu packages above provide is Sun Grid Engine, the older one when Oracle stopped development (around version 6.2). Son of Grid Engine is one of the two open source forks and that is now up to version 8.x .

    Son of Grid Engine is harder to set up since it isn’t part of Ubuntu officially, but they do provide .deb packages for download. And it is (at the moment) actively maintained.

    Thanks for the instructions!

  3. Throwing ” error writing object “mainq/abhnav” to spooling database. ” Please help!

    1. Hi Mittal, I haven’t installed SGE on my laptop for a while, plus I am now using a different Linux distro, so I am not sure I can follow your error at the moment.

  4. Great post! I have a question that may be beyond the scope of your expertise. Or it may be you’re too busy. I’m trying to set up the SGE on a 12 core machine with 32 gb of ram via ubantu on virtual box (If you know how to do it on another distro/system – I’m happy to switch) and it sets up beautifully. But when it runs, it only uses one core. I’m about to try over VM ware, but as I said if you have suggestions (or a way you know that works) please let me know.

    Thanks again! – jeff roizen

    1. Hi Jeff, thanks for your comment. I have had the same issue with seeing all the cores from an Ubuntu installation on a virtual box. I am afraid I don’t know the answer to your question.

  5. Hi, I’m using Ubuntu 12.04 on a 12 core machine. Your instructions worked great – thank you so much! I am now trying to install a CPAN module Schedule::DRMAAc for a pipeline that I’m trying to set up (developed elsewhere). The module needs the drmaa.h file that is supposed to be included in the SGE but I can’t seem to find it anywhere. Also the SGE_ROOT/lib/ and SGE_ROOT/util/arch directories…..Do you have any advice/help? (Any suggestions/advice is very very much appreciated as I’m a newbie with this stuff).

    1. Hi Tarah, have you tried to install gridengine-drmaa-dev or gridengine-drmaa1.0? I am not sure this will resolve your issue, but it may be worth trying it.

      1. Hi Theodore – thanks for the quick response. Those sound like it would be the right thing to install but both of those give error: “unable to locate package”. Only gridengine-client/common/exec/master/qmon are available….any thoughts?

      2. gridengine-client – Utilities for Grid Engine queue management
        gridengine-common – Distributed resource management – common files
        gridengine-exec – Distributed resource management – Execution Server
        gridengine-master – Distributed resource management – Master Server
        gridengine-qmon – Graphical utilities for Grid Engine queue management
        python-drmaa – interface to DRMAA-compliant distributed resource management systems

        Oh! So is that python-drmaa the one I need?

      3. I am not entirely sure if python-drmaa is the one you need. My guess is that this is only a Python interface to DRMAA, which may not be relevant. For some reason, the packages I had suggested above do not appear to be in your OS. This may be due to not having added the required repositories.

  6. Hi!
    Thanks for your great post.
    But I’m now installing SGE on different PCs.(Both with Ubuntu12.04)
    I install it on both PCs, but when I check with “qstat -f” on master node,
    It can not see the information of its client.
    The results is as following:
    queuename qtype resv/used/tot. load_avg arch states
    ———————————————————————————
    main.q@52-0960646-01 BIP 0/0/1 -NA- -NA- au
    ———————————————————————————
    main.q@kk-Precision-T5600 BIP 0/0/3 0.01 lx26-amd64

    (Master hostname:kk-Precision-T5600 client hostname:52-0960646-01)
    I have been stuck on this problem for several days, so I’m looking forward to your help.
    Thanks very much!———Jamie


  7. Thanks for this excellent guideline, but I got the following socket communication error during install. Could any one point out how to solve it? Thanks.
    error: commlib error: can’t bind socket (no additional information available)
    ……………………..
    critical error: abort qmaster registration due to communication errors
    daemonize error: child exited before sending daemonize state

  8. I could install SGE , but there is some problem with running job

    I am trying to run a job on SGE … but its failing showing error redirection unexpected !!
    My command line looks like this, where I am mainly running script orderedpainting.sh for finding recombination in genomes and the following are its arguments.

    qsub -cwd -o example1.log -e example1.log <<< "/bin/bash ./orderedPainting.sh -g input/aln_mod.hap -l input/strainName_N11.txt -n 40"

    The authors of the script informed me that its failing bcoz my computational cluster doen't support the "<<<" operation.
    I am not clear on how to do this

    can anyone pls help me with resolving this error on SGE ??

  9. Can we install multiple executors on the same machine?
    I have a setup where a complete SGE is installed on a machine. But our machine is pretty powerful so we want to try out installing more than one executor on this machine. Can this be done? how?

  10. Hi, Good afternoon!

    I am beginner in cluster configuration and I would like know if it’s possible limit the CPU load. For example: When a user submit a job, if the CPU is above 50% the job stay waiting…

    Regards,
    Juliano

  11. Everything worked great! Thanks so much. One question, when I run qmon, it doesn’t have any of the icons defined such as the arrow icons. Has anyone seen this before?
