Batch Queues on Galileo

  • Restrictions:
    Galileo's nodes are divided into two groups. Most of the cluster is made up of "batch" nodes, but a few nodes are reserved as "login" nodes. When you ssh into galileo, you'll automatically be connected to one of the login nodes. Interactive logins are not allowed on the batch nodes. To see which nodes are login nodes, use the command:
         roles -t login
    

    Long-running jobs are not allowed on the login nodes. (The limit is currently two hours of CPU time.) As a job approaches that limit, you'll be sent a warning by e-mail. Once the limit has been exceeded, the job will be killed, and you'll be notified by e-mail. On all nodes, shorter jobs have a slightly higher priority: after one hour of CPU time, a job's priority is lowered.

  • Condor:
    Long-running jobs should be run on the cluster's batch nodes. There are several ways to do this, but it is best to use the Condor batch queues. Condor distributes jobs around the cluster, attempting to maximize throughput (i.e., the number of jobs completed per unit time). In its simplest form, Condor will choose one of the least-loaded CPUs in the cluster and start running your job on that CPU. More sophisticated users can configure their jobs so that Condor can checkpoint the job (i.e., save its current state) and move the job to a different CPU in response to changes in the load on the cluster. See the Condor manual at http://www.cs.wisc.edu/condor/manual/v6.8/ for complete information.

  • The "run" Command:
    On Galileo, the simplest way to submit a job to the Condor queue is by using the "run" command. For example, say you have a program that you can run interactively by typing "myprogram -a -b -c file.dat". To use the batch queues to run the same command, you could just enclose it in quotes and prepend the "run" command, like this:
         run 'myprogram -a -b -c file.dat'
    
    This would create a batch job, submit it and send you an e-mail notification when it's done. Log, output and error files would be created in the current directory.

    Here's what a user would typically see when running a program ("myprogram", in this case):

     $ run 'myprogram -a -b -c file.dat'
     Log will be written into condor_myprogram-070924145301-1-node1.94.log.
    
    Note that the "run" command is designed so that anything a user would normally type at the command line can just be enclosed in quotes and prefixed with "run". It preserves the current working directory and the user's current environment variables. Shell metacharacters like ">" should work fine. For example, the following is a perfectly valid invocation of "run":
    	run "ls -al | grep condor > junk.condor.files"
    
  • Advanced Condor Usage:
    If you use the "run" command you'll find that, while the program is running, there are five job-related files in the current working directory. For the example above, these would be called:
    	condor_myprogram-070924145301-1.sh
    	condor_myprogram-070924145301-1.cmd
    	condor_myprogram-070924145301-1-node1.94.0.out
    	condor_myprogram-070924145301-1-node1.94.0.err
    	condor_myprogram-070924145301-1-node1.94.log
    
    Two of these files (*.sh and *.cmd) are temporary files used by "run", and will automatically be deleted when the job completes unless you ask "run" to preserve them by adding the "-k" (for "keep") switch. (For example, you could type: run -k 'myprogram -a -b -c file.dat'.) If you want more control over your Condor batch jobs, you might start by looking at these files and using them as templates for creating more sophisticated condor jobs. If you do this, you'll want to begin submitting your jobs directly with the condor_submit command, rather than using "run". You may also want to re-compile your program with condor_compile to take advantage of Condor's checkpointing and migration capabilities.

  • Submitting Many Jobs, Each with Different Parameters:
    You may find that you need to submit many similar jobs, each only slightly different from the others. For example, you may want to analyze data from many different files, running one job for each file. Or you may want to run the same simulation many times, using different input parameters each time. There are at least two ways to do this on Galileo, described below.

    Before you start submitting many jobs, it's a good idea to start out with some simple tests. First, write a program that takes some parameters and generates some output. Try running it by hand, like this:

    myprogram 1 2 3 4
    
    (where your program is "myprogram" and "1 2 3 4" are some command-line parameters.)

    To run this program under Condor, you'd just prepend the word "run":

    run myprogram 1 2 3 4
    
    This would create a Condor job, submit it to the batch queues, write its output into some files in the current directory, and send you an e-mail when it's done.

    Once you're satisfied that this is working correctly, try using "run" to submit the program several times with different parameters. Submit, say, five or ten jobs. Take a look at the output and make sure it looks the way it should.

    Then you'll want to automate the process of submitting jobs with different parameters. There are a couple of handy ways of doing this.

    One way is to write a script that just executes "run" commands like the one above, each with the appropriate parameters. This is pretty straightforward, but there's some extra overhead in doing it this way. If your script submits lots of jobs, you may find that things bog down for a while as the Condor system deals with all of these separate queue submissions. If you decide to do it this way, be sure to put a delay of at least a few seconds between the "run" commands.
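
    As a minimal sketch of such a script (not part of Galileo's own tooling): the function name submit_all and the variables SUBMIT and DELAY are hypothetical, and "myprogram" stands in for your own program. Setting SUBMIT=echo prints the commands instead of submitting them, which is a handy way to check the parameter list first.

```shell
#!/bin/sh
# Hypothetical helper: submit one batch job per parameter set,
# pausing between submissions so the queue is not flooded.

# SUBMIT is the submission command; it defaults to Galileo's "run".
# Set SUBMIT=echo to print the commands instead of submitting them.
: "${SUBMIT:=run}"
# DELAY is the pause between submissions, in seconds (assumed default).
: "${DELAY:=5}"

submit_all() {
    # Each argument is one complete parameter string, e.g. "1 2 3 4".
    for params in "$@"; do
        "$SUBMIT" "myprogram $params"
        sleep "$DELAY"
    done
}
```

    After defining the function, a call such as submit_all "1 2 3 4" "5 6 7 8" would issue one "run" command per parameter set, pausing DELAY seconds after each.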

    Another way is to work with Condor directly rather than through the "run" command. First, see the description of the "-k" switch under "Advanced Condor Usage" above. This switch preserves the Condor job file after you've submitted a job. The file has a name ending in ".cmd". You can copy this file and modify it to do things more sophisticated than "run" is capable of.

    For example, you can modify the "queue" line in the cmd file. Normally the line says "queue 1", meaning "submit a single job". You can submit multiple identical copies of your job by just replacing "1" with the desired number of copies.
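
    The edit can also be scripted. The following is only a sketch: the function name set_queue_count is hypothetical, and it assumes the preserved file contains a line reading exactly "queue 1". When Condor queues several copies from one "queue" line, each copy gets a different $(Process) number, so file names built from $(Process) stay distinct.

```shell
#!/bin/sh
# Hypothetical helper: copy a Condor submit file from stdin to stdout,
# rewriting the line "queue 1" to "queue N", where N is the first argument.
# Assumes the line reads exactly "queue 1".
set_queue_count() {
    sed "s/^queue 1\$/queue $1/"
}
```

    For instance, set_queue_count 10 < original.cmd > tencopies.cmd would write a modified copy and leave the original file untouched (both file names are placeholders).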

    You can also have multiple "queue 1" statements, each of which submits a copy of your job with slightly different parameters. For example, you might have something like:

    executable = myprogram
    log = myprogram.$(Cluster).$(Process).log
    output = myprogram.$(Cluster).$(Process).out
    error = myprogram.$(Cluster).$(Process).err
    getenv = TRUE
    notification = Always
    notify_user = bkw1a@virginia.edu
    universe = vanilla
    
    input = file1.in
    arguments = 100.0 200.0
    queue 1
    
    input = file2.in
    arguments = 200.0 300.0
    queue 1
    
    input = file3.in
    arguments = 400.0 500.0
    queue 1
    
    ...etc.
    
    (You could generate this file programmatically, if you need to run a lot of jobs.) You can then tell condor to submit your jobs by typing a command like:
    condor_submit myprogram.cmd
    
    where myprogram.cmd is the file above.
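
    One way to generate such a file programmatically is sketched below. The function name gen_cmd and the "inputfile:arguments" convention for describing each job are hypothetical. The notification lines from the example above are left out here; add your own if you want e-mail.

```shell
#!/bin/sh
# Hypothetical generator: print a Condor submit file with one
# "queue 1" stanza per job. Each argument has the assumed form
# "inputfile:arguments", e.g. "file1.in:100.0 200.0".
gen_cmd() {
    # Common settings, copied from the example submit file.
    # The quoted 'EOF' keeps the shell from expanding $(Cluster) etc.
    cat <<'EOF'
executable = myprogram
log = myprogram.$(Cluster).$(Process).log
output = myprogram.$(Cluster).$(Process).out
error = myprogram.$(Cluster).$(Process).err
getenv = TRUE
universe = vanilla
EOF
    for job in "$@"; do
        input=${job%%:*}   # text before the first colon
        args=${job#*:}     # text after the first colon
        printf '\ninput = %s\narguments = %s\nqueue 1\n' "$input" "$args"
    done
}
```

    For example, gen_cmd "file1.in:100.0 200.0" "file2.in:200.0 300.0" > myprogram.cmd would write a two-job file that can then be passed to condor_submit as shown above.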

    In any case, start out slowly, with a few jobs at a time, until you're sure things are working.

  • 64-bit Jobs:
    Galileo has a small number of nodes running a 64-bit operating system. These nodes are available for running batch jobs. In order to submit a job to run on the 64-bit nodes, add the switch "-t batch-64bit" to the "run" command. For example:
         run -t batch-64bit 'myprogram -a -b -c file.dat'
    

    Because mixing 64-bit and 32-bit versions of your programs may cause confusion, the "run" command requires that you submit 64-bit jobs from within a directory whose name contains the string "64bit". For example, you might have a directory called "64bit" underneath your home directory, where you store all of your 64-bit programming projects.
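
    If you submit 64-bit jobs from your own scripts, a quick check like the following can catch a wrong directory before submission. This is only a sketch: the function name in_64bit_dir is hypothetical, and it assumes that "64bit" may appear anywhere in the directory's full path, which may not match exactly how "run" performs its own check.

```shell
#!/bin/sh
# Hypothetical check: succeed if the given directory path (default:
# the current working directory) contains the string "64bit".
# Assumption: "64bit" may appear anywhere in the full path.
in_64bit_dir() {
    case "${1:-$PWD}" in
        *64bit*) return 0 ;;
        *)       return 1 ;;
    esac
}
```

    A script could then guard its submissions with something like: in_64bit_dir && run -t batch-64bit 'make'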

    Note that the 64-bit nodes only have a small subset of the software that's installed on the other Galileo nodes. If you find that libraries you need aren't available on the 64-bit nodes, please let us know and we'll consider adding them.

    The best way to compile your code for use on the 64-bit nodes is to submit a job to these nodes that actually does the compilation and linking there. For example, in the case of a small, self-contained C program, you might type a command like this:

         run -t batch-64bit 'gcc -o myprogram myprogram.c'
    
    If you have a more complicated project with a Makefile, you could do this:
         run -t batch-64bit 'make'
    

    In order to make compiling and debugging easier in this environment, the "run" command can be instructed to show you the output from a batch job as the job is running. To do this, use the "-w" (for "watch") switch. For example, the command:

         run -w -t batch-64bit 'gcc -o myprogram myprogram.c'
    
    will show you the output from gcc immediately. When you're finished looking at the output, you'll need to type Ctrl-C to exit. If the output is long and you don't want to wait for it to finish, you can type Ctrl-C at any time; your job will continue to run normally in the batch queue until it finishes, and you'll be notified by e-mail as usual.