Batch Queues on Galileo

  • Restrictions:
    Galileo's nodes are divided into two groups. Most of the cluster is made up of "batch" nodes, but a few nodes are reserved as "login" nodes. When you ssh into Galileo, you are automatically connected to one of the login nodes. Interactive logins are not allowed on the batch nodes. To see which nodes are login nodes, use the command:
        roles -t login
    

    Long-running jobs are not allowed on the login nodes. (The limit is currently two hours of CPU time.) As a job approaches that limit, you'll be sent a warning by e-mail. Once the limit has been exceeded, the job will be killed, and you'll be notified by e-mail. On all nodes, shorter jobs have a slightly higher priority: after one hour of CPU time, a job's priority is lowered.
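
    To see how close your own login-node processes are to these limits, something like the following sketch may help. It assumes a Linux procps-style ps that prints CPU time as [DD-]HH:MM:SS. The function names cpu_seconds and report_long_jobs are made up for this example and are not part of Galileo's software.

```shell
#!/bin/sh
# Convert a ps-style CPU time ([DD-]HH:MM:SS) into seconds.
cpu_seconds() {
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4) { s = (($1 * 24 + $2) * 60 + $3) * 60 + $4 }
        else         { s = ($1 * 60 + $2) * 60 + $3 }
        print s
    }'
}

# List your own processes that have used at least $1 seconds of CPU time.
# The default of 3600 is the one-hour mark at which priority is lowered.
report_long_jobs() {
    threshold=${1:-3600}
    ps -u "$(id -un)" -o pid= -o cputime= -o comm= |
    while read -r pid cputime comm; do
        secs=$(cpu_seconds "$cputime")
        if [ "$secs" -ge "$threshold" ]; then
            echo "PID $pid ($comm): $secs seconds of CPU time"
        fi
    done
}
```

    Calling report_long_jobs with no argument would list processes past the one-hour mark; a job that shows up there is probably a candidate for the batch queues.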

  • Condor:
    Long-running jobs should be run on the cluster's batch nodes. There are several ways to do this, but it is best to use the Condor batch queues. Condor distributes jobs around the cluster, attempting to maximize throughput (i.e., the number of jobs completed per unit time). In its simplest form, Condor chooses one of the least-loaded CPUs in the cluster and starts running your job on that CPU. More sophisticated users can configure their jobs so that Condor can checkpoint the job (i.e., save its current state) and move it to a different CPU in response to changes in the load on the cluster. See the Condor manual at http://www.cs.wisc.edu/condor/manual/v6.8/ for complete information.

  • The "run" Command:
    On Galileo, the simplest way to submit a job to the Condor queue is by using the "run" command. For example, say you have a program that you can run interactively by typing "myprogram -a -b -c file.dat". To use the batch queues to run the same command, you could just enclose it in quotes and prepend the "run" command, like this:
        run 'myprogram -a -b -c file.dat'
    
    This would create a batch job, submit it and send you an e-mail notification when it's done. Log, output and error files would be created in the current directory.

    Here's what a user would typically see when running a program ("myprogram", in this case):

     $ run 'myprogram -a -b -c file.dat'
     Log will be written into condor_myprogram-070924145301-1-node1.94.log.
    
    Note that the "run" command is designed so that anything a user would normally type at the command line can just be enclosed in quotes and prefixed with "run". It preserves the current working directory and the user's current environment variables. Shell metacharacters such as ">" and "|" should work fine. For example, the following is a perfectly valid invocation of "run":
    	run "ls -al | grep condor > junk.condor.files"
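
    While a job submitted this way is running, condor_q lists the jobs still in the queue. To check from a script whether a job has finished, one option is to look in its log file. The sketch below assumes that, in Condor's user log format, a line beginning with event code 005 marks "Job terminated"; verify that against one of your own log files before relying on it. The function name job_finished is hypothetical.

```shell
#!/bin/sh
# Succeed if the Condor user log named in $1 records a
# "Job terminated" event (assumed to be event code 005).
job_finished() {
    grep -q '^005 .*Job terminated' "$1"
}

# Hypothetical usage, with the log name from the example above:
#   job_finished condor_myprogram-070924145301-1-node1.94.log && echo "done"
```

    Because "run" already sends an e-mail when the job completes, a check like this is mainly useful when one script needs to wait for another job's output.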
    
  • Advanced Condor Usage:
    If you use the "run" command you'll find that, while the program is running, there are five job-related files in the current working directory. For the example above, these would be called:
    	condor_myprogram-070924145301-1.sh
    	condor_myprogram-070924145301-1.cmd
    	condor_myprogram-070924145301-1-node1.94.0.out
    	condor_myprogram-070924145301-1-node1.94.0.err
    	condor_myprogram-070924145301-1-node1.94.log
    
    Two of these files (*.sh and *.cmd) are temporary files used by "run". They are deleted automatically when the job completes unless you ask "run" to preserve them by adding the "-k" (for "keep") switch. (For example, you could type: run -k 'myprogram -a -b -c file.dat'.) If you want more control over your Condor batch jobs, you might start by looking at these files and using them as templates for creating more sophisticated Condor jobs. If you do this, you'll want to begin submitting your jobs directly with the condor_submit command, rather than using "run". You may also want to re-compile your program with condor_compile to take advantage of Condor's checkpointing and migration capabilities.
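
    As a rough illustration of direct submission, the sketch below writes a minimal submit description file and passes it to condor_submit. It is not a copy of what "run" generates: the file names (myprogram.cmd, myprogram.out and so on) are invented for this example, and it assumes that the executable myprogram sits in the current directory.

```shell
#!/bin/sh
# Write a minimal Condor submit description file.
# All file names here are hypothetical placeholders.
cat > myprogram.cmd <<'EOF'
# Vanilla universe: no checkpointing, works with any executable.
universe     = vanilla
executable   = myprogram
arguments    = -a -b -c file.dat
output       = myprogram.out
error        = myprogram.err
log          = myprogram.log
# Copy the submitting shell's environment into the job.
getenv       = True
# Send e-mail when the job completes.
notification = Complete
queue
EOF

# Submit only if Condor's tools are on the PATH.
if command -v condor_submit >/dev/null 2>&1; then
    condor_submit myprogram.cmd
else
    echo "condor_submit not found; wrote myprogram.cmd only" >&2
fi
```

    To use checkpointing, the program would first be relinked with something like "condor_compile gcc -o myprogram myprogram.c" and the universe line changed to "standard"; see the Condor manual for the restrictions that apply to standard-universe jobs.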

