
- Restrictions:
Galileo's nodes are divided into two groups. Most of the cluster is made
up of "batch" nodes, but a few nodes are reserved as "login" nodes. When
you ssh into galileo, you'll automatically be connected to one of the
login nodes. Interactive logins are not allowed on the batch nodes.
To see which nodes are login nodes, use the command:
roles -t login
Long-running jobs are not allowed on the login nodes. (The limit
is currently two hours of CPU time.) As a job approaches that limit,
you'll be sent a warning by e-mail. After that limit has been exceeded,
the job will be killed, and you'll be notified by e-mail.
On all nodes, shorter jobs have a slightly higher priority. After
one hour of CPU time, a job's priority is lowered.
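To see how close your own processes on a login node are to the two-hour limit, a sketch like the following may help. It assumes a Linux-style ps that accepts "-o cputime" and prints the value as [DD-]HH:MM:SS. The function names and the 75% warning threshold are made up for illustration and are not part of Galileo's tooling.

```shell
# Convert a ps-style CPU time ([DD-]HH:MM:SS) into seconds.
cputime_to_seconds() {
    awk -v t="$1" 'BEGIN {
        days = 0
        if (split(t, d, "-") == 2) { days = d[1]; t = d[2] }
        n = split(t, p, ":")
        secs = 0
        for (i = 1; i <= n; i++) secs = secs * 60 + p[i]
        print days * 86400 + secs
    }'
}

# Print your processes whose CPU time is at or above a threshold in seconds
# (default 5400 = 75% of the two-hour limit; the 75% is an arbitrary choice).
list_heavy_processes() {
    threshold=${1:-5400}
    ps -u "$(id -un)" -o pid= -o cputime= -o comm= |
    while read -r pid cpu cmd; do
        secs=$(cputime_to_seconds "$cpu")
        if [ "$secs" -ge "$threshold" ]; then
            echo "$pid $cmd: ${secs}s of CPU time"
        fi
    done
}
```

Running list_heavy_processes with no arguments lists anything past 90 minutes of CPU time. Pass a different number of seconds to change the threshold.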
- Condor:
Long-running jobs should be run on the cluster's batch nodes. There
are several ways to do this, but it is best to use the Condor
batch queues. Condor distributes jobs around the cluster, attempting
to maximize throughput (i.e., the number of jobs completed per unit
time). In its simplest form, Condor will choose one of the least-loaded
CPUs in the cluster, and start running your job on that CPU. More
sophisticated users can configure their jobs so that Condor can
checkpoint the job (i.e., save its current state) and move the job
to a different CPU in response to changes in the load on the cluster.
See the Condor manual at http://www.cs.wisc.edu/condor/manual/v6.8/ for complete information.
- The "run" Command:
On Galileo, the simplest way to submit a job to the Condor queue is
by using the "run" command. For example, say you have a program that
you can run interactively by typing "myprogram -a -b -c file.dat".
To use the batch queues to run the same command, you could just enclose
it in quotes and prepend the "run" command, like this:
run 'myprogram -a -b -c file.dat'
This would create a batch job, submit it and send you an e-mail
notification when it's done. Log, output and error files would be
created in the current directory.
Here's what a user would typically see when running a
program ("myprogram", in this case):
$ run 'myprogram -a -b -c file.dat'
Log will be written into condor_myprogram-070924145301-1-node1.94.log.
Note that the "run" command is designed so that anything a user
would normally type at the command line can just be enclosed
in quotes and prefixed with "run". It preserves the current
working directory and the user's current environment variables.
Shell metacharacters like ">" should work fine. For example,
the following is a perfectly valid invocation of "run":
run "ls -al | grep condor > junk.condor.files"
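Because "run" takes an ordinary command line as a single quoted string, it can be called from a shell loop to submit one job per input file. The following is a minimal sketch, not a documented feature of "run". The function name submit_all is hypothetical, "myprogram" and its switches are the placeholders from the example above, and file names are assumed to contain no spaces or quote characters.

```shell
# Submit one batch job per input file using Galileo's "run" wrapper.
# Assumes file names contain no whitespace or quote characters.
submit_all() {
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "skipping $f: not a regular file" >&2
            continue
        fi
        # Each call creates a separate Condor job with its own
        # log, output and error files in the current directory.
        run "myprogram -a -b -c $f"
    done
}
```

For example, "submit_all *.dat" would queue one job for every .dat file in the current directory.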
- Advanced Condor Usage:
If you use the "run" command you'll find that, while the program is running,
there are five job-related files in the current working directory. For the
example above, these would be called:
condor_myprogram-070924145301-1.sh
condor_myprogram-070924145301-1.cmd
condor_myprogram-070924145301-1-node1.94.0.out
condor_myprogram-070924145301-1-node1.94.0.err
condor_myprogram-070924145301-1-node1.94.log
Two of these files (*.sh and *.cmd) are temporary files used by "run",
and will automatically be deleted when the job completes unless you
ask "run" to preserve them by adding the "-k" (for "keep") switch. (For
example, you could type: run -k 'myprogram -a -b -c file.dat'.) If you
want more control over your Condor batch jobs, you might start by
looking at these files and using them as templates for creating more
sophisticated Condor jobs. If you do this, you'll want to begin
submitting your jobs directly with the condor_submit
command, rather than using "run". You may also want to re-compile your
program with condor_compile to take advantage of Condor's checkpointing and
migration capabilities.
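As a starting point for direct submission, here is a sketch of a shell function that writes a small Condor submit description file. This is not the content of the *.cmd file that "run" generates. It is a generic vanilla-universe example using standard submit commands, and it assumes that "myprogram" is in the current directory and that the batch nodes see the same filesystem. The function name and the output file names are hypothetical.

```shell
# Write a minimal Condor submit description file to the path given in $1.
# The quoted 'EOF' keeps the shell from expanding $(Cluster) and $(Process),
# which are Condor macros filled in at submit time.
write_submit_file() {
    cat > "$1" <<'EOF'
universe     = vanilla
executable   = myprogram
arguments    = -a -b -c file.dat
output       = myprogram.$(Cluster).$(Process).out
error        = myprogram.$(Cluster).$(Process).err
log          = myprogram.$(Cluster).log
getenv       = True
notification = Complete
queue
EOF
}

# Typical use (requires Condor to be installed):
#   write_submit_file myprogram.submit
#   condor_submit myprogram.submit
```

Here "getenv = True" passes your current environment to the job and "notification = Complete" requests an e-mail when the job finishes. To use checkpointing, the program would normally be relinked with condor_compile (for example, "condor_compile gcc -o myprogram myprogram.c") and the universe changed to "standard". Check the Condor manual for the restrictions that apply to standard-universe jobs.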