
- Restrictions:
Galileo's nodes are divided into two groups. Most of the cluster is made
up of "batch" nodes, but a few nodes are reserved as "login" nodes. When
you ssh into galileo, you'll automatically be connected to one of the
login nodes. Interactive logins are not allowed on the batch nodes.
To see which nodes are login nodes, use the command:
roles -t login
Long-running jobs are not allowed on the login nodes. (The limit
is currently two hours of CPU time.) As a job approaches that limit,
you'll be sent a warning by e-mail. Once the limit has been exceeded,
the job will be killed, and you'll be notified by e-mail.
On all nodes, shorter jobs have a slightly higher priority. After
one hour of CPU time, a job's priority is lowered.
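To see how close your own processes on a login node are to these
thresholds, something like the following sketch may help. It assumes a
"ps" that accepts the POSIX "-o" option and prints CPU time as
[DD-]HH:MM:SS. The function names cpu_seconds and long_running are made
up for this example and are not part of Galileo's tools, and the way the
cluster itself accounts for CPU time may differ.

```shell
# Convert a ps-style CPU time ([DD-]HH:MM:SS or MM:SS) to seconds.
# Hypothetical helper, not a Galileo command.
cpu_seconds() {
    printf '%s\n' "$1" | awk -F'[-:]' '{
        n = NF
        s = $n + 60 * $(n - 1)              # seconds and minutes
        if (n >= 3) s += 3600 * $(n - 2)    # hours
        if (n >= 4) s += 86400 * $(n - 3)   # days
        print s
    }'
}

# List your own processes that have used at least LIMIT seconds of CPU
# time (default 3600, the one-hour point at which priority is lowered).
long_running() {
    limit=${1:-3600}
    ps -u "$(id -un)" -o pid= -o time= -o comm= |
    while read -r pid cputime comm; do
        secs=$(cpu_seconds "$cputime")
        if [ "$secs" -ge "$limit" ]; then
            echo "$pid $cputime $comm"
        fi
    done
}

# Example: show processes past the one-hour mark.
# long_running 3600
```

Anything this lists is a candidate for moving to the batch queues
before it reaches the two-hour limit.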
- Condor:
Long-running jobs should be run on the cluster's batch nodes. There
are several ways to do this, but it is best to use the Condor
batch queues. Condor distributes jobs around the cluster, attempting
to maximize throughput (i.e., the number of jobs completed per unit
time). In its simplest form, Condor will choose one of the least-loaded
CPUs in the cluster, and start running your job on that CPU. More
sophisticated users can configure their jobs so that Condor can
checkpoint the job (i.e., save its current state) and move the job
to a different CPU in response to changes in the load on the cluster.
See the Condor manual at http://www.cs.wisc.edu/condor/manual/v6.8/ for
complete information.
- The "run" Command:
On Galileo, the simplest way to submit a job to the Condor queue is
by using the "run" command. For example, say you have a program that
you can run interactively by typing "myprogram -a -b -c file.dat".
To use the batch queues to run the same command, you could just enclose
it in quotes and prepend the "run" command, like this:
run 'myprogram -a -b -c file.dat'
This would create a batch job, submit it and send you an e-mail
notification when it's done. Log, output and error files would be
created in the current directory.
Here's what a user would typically see when running a
program ("myprogram", in this case):
$ run 'myprogram -a -b -c file.dat'
Log will be written into condor_myprogram-070924145301-1-node1.94.log.
Note that the "run" command is designed so that anything a user
would normally type at the command line can just be enclosed
in quotes and prefixed with "run". It preserves the current
working directory and the user's current environment variables.
Shell metacharacters like "|" and ">" should work fine. For example,
the following is a perfectly valid invocation of "run":
run "ls -al | grep condor > junk.condor.files"
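Because "run" takes the whole command as a single quoted string, it
can also be called from an ordinary shell loop to submit one job per
input file. The sketch below assumes the same "myprogram" and switches
as the example above. The function name submit_each is made up for
illustration, and file names containing spaces or quotes would need
extra care.

```shell
# Submit one batch job per input file, using the site's "run" command.
# submit_each is a hypothetical helper, not part of "run" itself.
submit_each() {
    for f in "$@"; do
        run "myprogram -a -b -c $f"
    done
}

# Example: one job for every .dat file in the current directory.
# submit_each *.dat
```

Each call to "run" creates its own batch job, so each file should get
its own log, output and error files.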
- Advanced Condor Usage:
If you use the "run" command you'll find that, while the program is running,
there are five job-related files in the current working directory. For the
example above, these would be called:
condor_myprogram-070924145301-1.sh
condor_myprogram-070924145301-1.cmd
condor_myprogram-070924145301-1-node1.94.0.out
condor_myprogram-070924145301-1-node1.94.0.err
condor_myprogram-070924145301-1-node1.94.log
Two of these files (*.sh and *.cmd) are temporary files used by "run",
and will automatically be deleted when the job completes unless you
ask "run" to preserve them by adding the "-k" (for "keep") switch. (For
example, you could type: run -k 'myprogram -a -b -c file.dat'.) If you
want more control over your Condor batch jobs, you might start by
looking at these files and using them as templates for creating more
sophisticated Condor jobs. If you do this, you'll want to begin
submitting your jobs directly with the condor_submit
command, rather than using "run". You may also want to re-compile your
program with condor_compile to take advantage of Condor's checkpointing and
migration capabilities.
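As a starting point, a hand-written submit description file might look
something like the sketch below. This is not the file that "run"
generates (keep those with "-k" and inspect them to see the real
thing). The submit commands shown are standard Condor ones, while the
output file names and the helper name write_submit_file are assumptions
made for this example.

```shell
# Write a minimal Condor submit description file to the given path.
# write_submit_file is a hypothetical helper for illustration only.
write_submit_file() {
    cat > "$1" <<'EOF'
universe     = vanilla
executable   = myprogram
arguments    = -a -b -c file.dat
getenv       = True
output       = myprogram.out
error        = myprogram.err
log          = myprogram.log
notification = Complete
queue
EOF
}

# Example: create the file, submit it, then look at the queue.
# write_submit_file myprogram.cmd
# condor_submit myprogram.cmd
# condor_q
```

Here "getenv = True" asks Condor to copy your current environment into
the job, and "notification = Complete" requests e-mail when the job
finishes. A program re-linked with condor_compile would use
"universe = standard" instead of "vanilla".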
- 64-bit Jobs:
Galileo has a small number of nodes running a 64-bit operating
system. These nodes are available for running batch jobs.
In order to submit a job to run on the 64-bit nodes, add the
switch "-t batch-64bit" to the "run" command. For example:
run -t batch-64bit 'myprogram -a -b -c file.dat'
Because mixing 64-bit and 32-bit versions of your programs
may cause confusion, the "run" command requires that you
submit 64-bit jobs from within a directory whose name
contains the string "64bit". For example, you might have
a directory called "64bit" underneath your home directory,
where you store all of your 64-bit programming projects.
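If you submit 64-bit jobs often, a small wrapper can check the
directory name before calling "run". The sketch below simply looks for
the string "64bit" in a path. The function names is_64bit_path and
run64 are made up for this example, and the exact check that "run"
itself performs may differ.

```shell
# Succeed if the given path (default: current directory) contains
# the string "64bit". Hypothetical helper, not a Galileo command.
is_64bit_path() {
    case ${1:-$PWD} in
        *64bit*) return 0 ;;
        *)       return 1 ;;
    esac
}

# Submit a command to the 64-bit nodes only from a "64bit" directory.
run64() {
    if is_64bit_path "$PWD"; then
        run -t batch-64bit "$1"
    else
        echo "run64: $PWD does not contain \"64bit\"; not submitting" >&2
        return 1
    fi
}

# Example:
# cd ~/64bit/myproject && run64 'myprogram -a -b -c file.dat'
```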
Note that the 64-bit nodes only have a small subset of
the software that's installed on the other Galileo nodes.
If you find that libraries you need aren't available on
the 64-bit nodes, please let us know and we'll consider
adding them.
The best way to compile your code for use on the 64-bit nodes
is to submit a job to these nodes that actually does the
compilation and linking there. For example, in the
case of a small, self-contained C program, you might
type a command like this:
run -t batch-64bit 'gcc -o myprogram myprogram.c'
If you have a more complicated project with a Makefile,
you could do this:
run -t batch-64bit 'make'
In order to make compiling and debugging easier in
this environment, the "run" command can be instructed
to show you the output from a batch job as the job
is running. To do this, use the "-w" (for "watch")
switch. For example, the command:
run -w -t batch-64bit 'gcc -o myprogram myprogram.c'
will show you the output from gcc immediately. When
you're finished looking at the output, you'll need to
type Ctrl-C to exit. If the output is long and you don't
want to wait for it to finish, you can type Ctrl-C at
any time; your job will continue to run normally
in the batch queue until it finishes, and you'll be notified
by e-mail as usual.