Table of Contents
This is a brief tutorial on running programs on the bioinfo cluster. The cluster consists of thirty two, dual processor nodes with 1GB of RAM. The cluster is on a private network, only accessible by logging onto bioinfo.ucr.edu though ssh. Programs can be run on the cluster by the use a queueing system, torque.
2005 May 19.
Added multiple input example
2005 May 03.
Added limitations
Added interative job
2005 April 27.
Changing /data layout
Clarified Torque scripts
2005 January 10.
Users can now view the entire queue
2004 December 08.
Some minor corrections
Added Script Reuse Example
2004 December 03.
Added ChangeLog
Added SMP Clarification
The cluster has access to all user-writable areas on bioinfo in addition to several standard databases. While the cluster has a fast network connection, it is quickly saturated when multiple nodes request access to large files. For instance, if 8 nodes want to access a 2GB files, the server must send 16GB of data, one copy per node. because of this certain considerations need to be made when dealing with files larger than 1GB minus the RAM required by the programs running on the node. For instance, if the program uses 128MB of RAM, and the file is 128MB both will fit into the RAM of the node. The file will be saved in the memory of the node and everything will move quickly.
However, if the process needs 512MB of RAM and the file is 1GB, then only half the file will fit in memory. Initially, the first half will be copied and accessed, then this will be discarded and the second half will be copied. If the first half is needed again, the second half will be discarded and the first will be copied. Quickly, the program will spend more time waiting for files to be transfered than doing any work. To get around this, each node has a small mount of local storage of about 20GB. This is user writable under /scratch/. Large files should be copied here on each node prior to use. AND Deleted after they are no longer needed. This data will be periodically deleted from the nodes. Specifically, files which haven't been accessed for 7 days will be deleted.
The nodes have remote access, through NFS, to the following locations on bioinfo, mapped to the same location, ie /home/laurichj on bioinfo is the same file as /home/laurichj on each node.
While the cluster consists of 32 dual processor computers, the majority of programs will not be able to use them all for a number of reasons. First, not all of them can use more than one processor. Many programs (clustalw, for instance), are not "multi-threaded" and will only use one processor. Telling them to use two processors will result in a processor being idle. Many more programs are not programmed using any multi-computer systems. For instance, while you can tell blast to use multiple processors, all of those processors must be in one computer. However, programs like hmmer can use a library to use multiple processors on multiple machines.
A final limitation is simply how well programs will scale. Some programs, such as hmmer will easily use all 64 processors while others, such as charmm will only be able to really use a few. This is because hmmer needs to perform little communication between the computers (it is loosely coupled), while charmm needs to constantly send messages back and forth (it is tightly coupled). After a certain number of computers, this causes charmm to spend more time waiting for data than actually running.
Torque is a derivative of a queueing system named OpenPBS. A similar queuing system, called the Sun Grid Engine or SGE is also a PBS derivative. Torque allows you to run jobs on the cluster without dealing with details such as where the job is running, which nodes are available and so on. To run a job, you first write a script which contains details on running the program in the way in which you wish it to be ran.
A script file is simply a file, typically with a .sh extension, which is used to tell Torque how to run a job; this can be as simple as a couple lines. The file starts with the line "#!/bin/sh" telling Linux to use /bin/sh to run the following list of Linux commands which will setup and run the application. For a simple single processor job, the following job will run blast on a single node, using ONE processor. If you would have ran:
laurichj@bioinfo:~/$ blastall -p blastp -d /data/NCBI/blast/nr/nr -i ~/blast/proteins.fa -e 1e-19 |
on the command line, you can transform that into a Torque script as:
#!/bin/sh blastall -p blastp -d /data/NCBI/blast/nr/nr -i ~/blast/proteins.fa -e 1e-19 |
This script might be written into a file named blast.sh. However, a more flexible version would be:
#!/bin/sh |
This script will run a blast on a free node. All output from the program is saved into files. You no longer have to find a free node, or wait for nodes to be free or even log into them. All you need to do is submit that file to torque, using qsub as follows:
laurichj@bioinfo:~/$ qsub blast.sh
123.bioinfo.ucr.edu |
The job is now queued, and will be ran on the next free node. The second line is the output from qsub. The number, 123, is the Job ID Number which is used to track the progress of the job with qstat. qstat will list the current running and queued jobs.
laurichj@bioinfo:~/others/blast$ qstat
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
123.bioinfo blast.sh laurichj 0 R batch |
Once the job is no longer listed, it has finished. The files blast.sh.o123 and blast.sh.e123 will now be available. The blast.sh will be whatever the name of your job script is, and the number at the end will be the job id number. The o file is the normal output, and in the case of blast will be the file we are interested in. The e file is the error output. Both of these will always exist, but the error output will usually be empty.
If the BLAST database is larger than 2GB, then the file needs to be moved from bioinfo to the local node. For a single processor job, using cp over NFS is sufficient. By copying the file prior to running blast and removing it after, the job will run faster.
Moving the files in will only increase performance if the file(s) is/are too large to fit into memory and the program scans through the file(s) more than once. If we were to be blasting a single sequence, blast would only scan the database once therefore, copying the database would be wasted effort.
Sometimes a program needs user input to run, or you are just running a single blast and don't want to bother with writing a script file. In this case, in place of a script file you can tell Torque to run an interactive job by giving it the -I (capital i) switch. An interactive job behaves in much the same way as a normal batched job in that it will mark the node(s) as used, wait for enough nodes to become free and so forth. The difference is that qsub will no longer immediatly return. In stead it will wait for your job to execute and then provide you a shell in which you can run commands. For example, to run a simple blast you could:
laurichj@bioinfo:~/$ qsub -I
qsub: waiting for job 124059.bioinfo.ucr.edu to start
qsub: job 124059.bioinfo.ucr.edu ready
laurichj@node01:~$ cd blast/
laurichj@node01:~$ blastall -p blastp -d /data/TIGR/arabidopsis/blast/ATH1.pep -i seq.fa |
Each node in our cluster has two processors, however the previous example is only allowed to use a single processor. If that script were to be used to run a blast with two threads (with -a 2), then it would conflict with someone else's job. For instance, two such blasts on an otherwise empty cluster will both run on node01. If they both use two processors, they will run very slowly. If a process is multi-threaded or otherwise able to use multiple processors on a single node special care needs to be given.
If the previous blast were changed to use multiple processors, you now need to specify the number of nodes and processors per node to use. In the case of blast, this will be 1 node and 2 processors per node.
#!/bin/sh DIR=~/blast INPUT=proteins.fa PROGRAM=blastp DATABASE=/data/NCBI/blast/nr/nr OPTIONS="-e 1e-10" cd $DIR; blastall -a 2 -p $PROGRAM -d $DATABASE -i $INPUT $OPTIONS |
This script file changes to tell blast to use two processors, then the qsub needs to change as well to specify the node and processor count to torque.
laurichj@bioinfo:~/$ qsub -l nodes=1:ppn=2 blast.sh
123.bioinfo.ucr.edu |
Which will tell Torque to use one node, but both processors on that node
While many times it is useful to just use one node, a cluster is most useful using many nodes. However, the programs need to know which nodes to use and how to communicate between nodes. One common way is using PVM. Using PVM is relatively simple, for instance to execute a HMMER job using 4 nodes, the following would work.
#!/bin/sh DIR=~/hmmer |
The final change, is the requirement to tell torque how many nodes to use. This is done with the -l flag, as follows:
laurichj@bioinfo:~/$ qsub -l nodes=4:ppn=2 hmmer.sh |
That will allocate four nodes for this job, and run it when finished. As with the BLAST example, this will save the output to hmmer.sh.oXXX. Since many programs will use the same number of nodes for every run, an easy way to specify the number of nodes is in the script file, for hummer the script will become:
#!/bin/sh #PBS -l nodes=4:ppn=2 DIR=~/hmmer INPUT=proteins.fa DATABASE=/data/local/PFAM/Pfam_fs OPTIONS="-E 1e-10" cd $DIR pvmboot hmmpfam --pvm $OPTIONS $DATABASE $INPUT pvmhalt |
Use the #PBS directive in a script file, is just like passing an argument to qsub. In this case, it will tell torque to use four nodes.
The Message Passing Interface is the other common way to run a program on multiple nodes. For Linux there are two available implementations of MPI: MPICH and LAM. Locally, we are using LAM because it offers greater flexibility and slightly higher performance. To run an MPI job on the cluster, the following script would work:
#!/bin/sh DIR=~/namd INPUT=apoa1.namd cd $DIR lamboot $PBS_NODEFILE |
The worst, but simplest, way to reuse shell scripts has already be hinted at: use multiple directories, one for each job, each with a slightly modified script. Using the previous BLAST script, it is easy to illustrate this idea. Imagine you wished to run a set of proteins against multiple databases. In this case we will use Arabidopsis thaliana and Oryza sativa. To do this you can simply make two directories: ~/blast/ath and ~/blast/osa. Then place a copy of the blast script in each directory, named blast.sh with the INPUT and DATABASE variables modified. However, for more than a few combinations of input and databases, this becomes increasingly tedious.
Rather, a simple evolution of this will allow you to use the same file for multiple jobs. By leaving the options unspecified you can make a script act as a generic schema for a job. For instance, blast is usually ran relatively similar each time. The following file would make a generic schema for blast:
#!/bin/sh #PBS -v DB,IN PROGRAM=blastp OPTIONS="-e 1e-10" cd $PBS_O_WORKDIR blastall -p $PROGRAM -d $DB -i $IN $OPTIONS |
This variant makes a couple changes. First, the script sets the PROGRAM and OPTIONS variables since they tend to be the same for many jobs. Secondly, the script will change directories not to a directory specified in the script, but rather to $PBS_O_WORKDIR which is the directory qsub was run in. Finally, blastall is run. As a general schema, this should be saved in an easy location such as ~/torque/blastp.sh where future schemes can go.
However, DB and IN were not specified in the script. Where do they come from? The answer lies in the second line. That is a directive telling qsub to send the DB and IN environmental variables to the remote host. These are set on the command line to qsub. However, they are environmental variable so they go before the command:
laurichj@bioinfo:~/blast/foo$ DB=/data/TIGR/arabidopsis/ATH1.pep IN=input.fa qsub ~/torque/blastp.sh |
Which will run input.fa against the ATH database.
Many programs, such as blast run relativly quickly, but usually have a large number of input chunks. In the case of blast a chunk would be a single sequence. In general, a chunk is the smallest amount of input that can be ran independantly. In order to speed up the processing of a large number of blast, a little bit of shell scripting can be used. The idea is to split the data into a few smaller groups of chunks, then run these chunks as independant jobs. For example, a blast job of 1000 sequences can be speed up by running 5 jobs of 200 sequences.
The first step is to create the smaller input files. This is very dependant on the program, however almost any program that takes sequences as input can make use of seqsplit. This utility will take a sequence file and convert it into many smaller files. It can do this in one of two ways, first it can create files with a given number of sequences (with the last file possilby being smaller) or it can create a number of approximatly even files. To do this, you specify the input file, the number of sequences per file or the number of files, a prefix for the output files and a file format (if not fasta).
laurichj@bioinfo:~/blast/$ seqsplit -i proteins.fa -l 100 -p proteins- |
This will create files of 100 sequences named proteins-01, proteins-02 and so on. In order to create 100 files, simply change the -l to -L, like so:
laurichj@bioinfo:~/blast/$ seqsplit -i proteins.fa -L 100 -p proteins- |
That will create the files: proteins-01, proteins-02 to proteins-100. To make the file names uniform you can extend the length of the numeric suffix:
laurichj@bioinfo:~/blast/$ seqsplit -i proteins.fa -L 100 -s 3 -p proteins- |
The additional -s 3 will make the files proteins-001, proteins-002 and so forth.
If each segment of the input has a constant number of lines or bytes, then split can be used to split the input. If the data is small enough, it can always be done by hand.
Now, we need to write a script file capable of taking any input we want. To do this, we are just going to use the blast.sh from the Reuse Examples, modified to only need the input files specified, and to operate on multiple files:
#!/bin/sh #PBS -v IN DB=/data/NCBI/blast/nr/nr PROGRAM=blastp OPTIONS="-e 1e-10" cd $PBS_O_WORKDIR for input in $IN; do blastall -p $PROGRAM -d $DB -i $input $OPTIONS done |
If only a few jobs need to be submitted, it will be simplist to do this by hand:
laurichj@bioinfo:~/blast/$ IN=proteins-01 qsub blast.sh |
laurichj@bioinfo:~/blast/$ IN=proteins-02 qsub blast.sh |
laurichj@bioinfo:~/blast/$ IN=proteins-03 qsub blast.sh |
laurichj@bioinfo:~/blast/$ IN=proteins-04 qsub blast.sh |
However, if there are dozens or hundreds of files, a rather complicated command line needs to be run. While this can be ran straight from the command line (write it all on one line), it will likely be easier to save it to a file, and make that file executable. First, we use find to create a list of filenames to run. Then we pipe that output into xargs to print a certain number of them per line, we finally send that into a small while loop in order to submit our jobs. The full command line will look like (formated for some readability):
find -name "proteins-*" | xargs -n 10 echo | while read IN; do export IN; qsub blast.sh; done |
That will list all files that start with "proteins-" and then print those out 10 to a line. For each line it will "export" the IN variable so that qsub can see it. qsub will pass that variable to the script. The result will be that a single job will be submited for every 10 input files.
SSH/DSH Prompts for Password. If you get lines asking you to type in your password when you login, your Kerberos ticket has most likely expired and you need a new one. Use kinit to get a new ticket:
laurichj@bioinfo:~$ kinit |
That will prompt you for your password. Try to login again, if that fails check the permissions on your ~/.k5login file and make sure its world readable:
laurichj@bioinfo:~$ ls -l ~/.k5login
-rw-r--r-- 1 laurichj Users 56 Mar 20 2003 /home/laurichj/.k5login |
If the first section is not -rw-r--r-- then execute:
laurichj@bioinfo:~$ chmod 644 ~/.k5login |
If that still doesn't work, make sure the ~/.k5login file exists and contains:
YOUR_USER_NAME@BIOINFO.UCR.EDU |
If that still doesn't work, e-mail laurichj@bioinfo.ucr.edu