Galileo's nodes are divided into two groups. Most of the cluster is made
up of "batch" nodes, but a few nodes are reserved as "login" nodes. When
you ssh into galileo, you'll automatically be connected to one of the
login nodes. Interactive logins are not allowed on the batch nodes.
To see which nodes are login nodes, use the command:
roles -t login
Long-running jobs are not allowed on the login nodes. (The limit
is currently two hours of CPU time.) As a job approaches that limit,
you'll be sent a warning by e-mail. After that limit has been exceeded,
the job will be killed, and you'll be notified by e-mail.
On all nodes, shorter jobs have a slightly higher priority. After
one hour of CPU time, a job's priority is lowered.
Long-running jobs should be run on the cluster's batch nodes. There
are several ways to do this, but it is best to use the Condor
batch queues. Condor distributes jobs around the cluster, attempting
to maximize throughput (i.e., the number of jobs completed per unit
time). In its simplest form, Condor will choose one of the least-loaded
CPUs in the cluster, and start running your job on that CPU. More
sophisticated users can configure their jobs so that Condor can
checkpoint the job (i.e., save its current state) and move the job
to a different CPU in response to changes in the load on the cluster.
See the condor manual at http://www.cs.wisc.edu/condor/manual/v6.8/ for complete information.
- The "run" Command:
On Galileo, the simplest way to submit a job to the Condor queue is
by using the "run" command. For example, say you have a program that
you can run interactively by typing "myprogram -a -b -c file.dat".
To use the batch queues to run the same command, you could just enclose
it in quotes and prepend the "run" command, like this:
run 'myprogram -a -b -c file.dat'
This would create a batch job, submit it and send you an e-mail
notification when it's done. Log, output and error files would be
created in the current directory.
Here's what a user would typically see when running a
program ("myprogram", in this case):
$ run 'myprogram -a -b -c file.dat'
Log will be written into condor_myprogram-070924145301-1-node1.94.log.
Note that the "run" command is designed so that anything a user
would normally type at the command line can just be enclosed
in quotes and prefixed with "run". It preserves the current
working directory and the user's current environment variables.
Shell metacharacters like ">" should work fine. For example,
the following is a perfectly valid invocation of "run":
run "ls -al | grep condor > junk.condor.files"
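The quotes matter because the login shell would otherwise act on "|" and ">" itself, before "run" ever sees them. The sketch below illustrates this with a hypothetical stand-in function, count_args (not part of Galileo or Condor), which only reports how many arguments it was given. It does not submit anything.

```shell
#!/bin/sh
# count_args is a hypothetical stand-in for "run": it just reports
# how many arguments the shell handed to it.
count_args() {
    echo "$# argument(s): $*"
}

# Quoted: the whole pipeline arrives as a single argument, so a command
# like "run" can pass it on to the batch job untouched.
count_args "ls -al | grep condor > junk.condor.files"

# Unquoted: the shell splits the words before the command sees them.
# (The "|" and ">" are left out here because, unquoted, they would be
# acted on by the login shell itself.)
count_args ls -al
```

Either single or double quotes keep the command together as one argument. Inside double quotes the login shell still expands things like $HOME before submission, while single quotes pass the text through literally.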
- Advanced Condor Usage:
If you use the "run" command you'll find that, while the program is running,
there are five job-related files in the current working directory. For the
example above, these would be called:
Two of these files (*.sh and *.cmd) are temporary files used by "run",
and will automatically be deleted when the job completes unless you
ask "run" to preserve them by adding the "-k" (for "keep") switch. (For
example, you could type: run -k 'myprogram -a -b -c file.dat'.) If you
want more control over your Condor batch jobs, you might start by
looking at these files and using them as templates for creating more
sophisticated condor jobs. If you do this, you'll want to begin
submitting your jobs directly with the condor_submit
command, rather than using "run". You may also want to re-compile your
program with condor_compile to take advantage of Condor's checkpointing and
migration features.
- Submitting Many Jobs, Each with Different Parameters:
You may find that you need to submit many similar jobs, each only slightly
different from the others. For example, you may want to analyze data from
many different files, running one job for each file. Or you may want to
run the same simulation many times, using different input parameters each
time. There are at least two ways to do this on Galileo, described below.
Before you start submitting many jobs, it's a good idea to start
out with some simple tests. First, write a program that takes some
parameters and generates some output. Try running it by hand, like this:
myprogram 1 2 3 4
(where your program is "myprogram" and "1 2 3 4" are some
parameters). To run this program under condor, you'd just
prepend the word "run":
run myprogram 1 2 3 4
This would create a condor job, submit it to the batch
queues, write its output into some files in the current
directory, and send you an e-mail when it's done.
Once you're satisfied that this is working correctly,
try using "run" to submit the program several times with
different parameters. Submit, say, five or ten jobs. Take
a look at the output and make sure it looks the way it should.
Then you'll want to automate the process of submitting
jobs with different parameters. There are a couple of handy
ways of doing this.
One way is to write a script that just executes
"run" commands like the one above, each with the appropriate
parameters. This is pretty straightforward, but there's
some extra overhead in doing it this way. If your script
submits lots of jobs, you may find that things bog down
for a while as the condor system deals with all of these
separate queue submissions. If you decide to do it this
way, be sure to at least put a delay between the "run"
commands (a few seconds).
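A minimal sketch of such a script is shown here, assuming one job per data file. The DRY_RUN and DELAY variables and the file names are illustrative choices, not features of "run". By default the script only prints the commands it would execute, so you can check them before submitting anything.

```shell
#!/bin/sh
# Sketch: submit one "run" job per input file, pausing between submissions.
# DRY_RUN and DELAY are illustrative variables, not features of "run".
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands; 0 = really submit
DELAY=${DELAY:-5}       # seconds to wait between real submissions

submit_one() {
    if [ "$DRY_RUN" = 1 ]; then
        # Show the command that would be executed.
        printf "run '%s'\n" "$1"
    else
        run "$1"
        sleep "$DELAY"
    fi
}

submit_all() {
    for datafile in "$@"; do
        submit_one "myprogram -a -b -c $datafile"
    done
}

submit_all file1.dat file2.dat file3.dat
```

Once the printed commands look right, setting DRY_RUN=0 in the environment makes the same script submit the jobs for real, waiting DELAY seconds after each one.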
Another way to do it is by fiddling with condor
directly (not through the "run" command). First, see
the Galileo batch queue documentation, where it talks about
the "-k" switch on "run". This will let you preserve
the condor job file after you've submitted a job. This
file will have a name ending in ".cmd". You can copy this
file and modify it to do things more sophisticated than
"run" is capable of.
For example, you can modify the "queue" line
in the cmd file. Normally the line says "queue 1",
meaning "submit a single job". You can submit multiple
identical copies of your job by just replacing "1" with
the desired number of copies.
You can also have multiple "queue 1" statements,
each of which submits a copy of your job with slightly
different parameters. For example, you might have something
like this:
executable = myprogram
log = myprogram.$(Cluster).$(Process).log
output = myprogram.$(Cluster).$(Process).out
error = myprogram.$(Cluster).$(Process).err
getenv = TRUE
notification = Always
notify_user = email@example.com
universe = vanilla
input = file1.in
arguments = 100.0 200.0
queue 1
input = file2.in
arguments = 200.0 300.0
queue 1
input = file3.in
arguments = 400.0 500.0
queue 1
(You could generate this file programmatically, if you need
to run a lot of jobs.) You can then tell condor to submit your
jobs by typing a command like:
condor_submit myprogram.cmd
where myprogram.cmd is the file above.
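As a rough sketch of generating such a file programmatically, the script below writes the common settings once and then one stanza per job. The function name make_cmd_file and the three-column parameter list are assumptions made for illustration. The settings themselves are copied from the example above.

```shell
#!/bin/sh
# Sketch: write a Condor submit file with one "queue 1" stanza per job.
# make_cmd_file is an illustrative helper, not a Galileo or Condor tool.
make_cmd_file() {
    # Settings shared by every job (quoted 'EOF' keeps $(...) literal).
    cat <<'EOF'
executable = myprogram
log = myprogram.$(Cluster).$(Process).log
output = myprogram.$(Cluster).$(Process).out
error = myprogram.$(Cluster).$(Process).err
getenv = TRUE
notification = Always
notify_user = email@example.com
universe = vanilla
EOF
    # Each line of standard input holds: input-file argument1 argument2
    while read -r infile arg1 arg2; do
        printf '\ninput = %s\narguments = %s %s\nqueue 1\n' \
            "$infile" "$arg1" "$arg2"
    done
}

# Illustrative parameter list; replace with your own files and values.
make_cmd_file > myprogram.cmd <<'EOF'
file1.in 100.0 200.0
file2.in 200.0 300.0
file3.in 400.0 500.0
EOF
```

The resulting myprogram.cmd contains three "queue 1" stanzas and can be handed to condor_submit. It is worth reading the generated file before submitting it.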
In any case, start out slowly, with a few jobs at a time,
until you're sure things are working.
- 64-bit Jobs:
Galileo has a small number of nodes running a 64-bit operating
system. These nodes are available for running batch jobs.
In order to submit a job to run on the 64-bit nodes, add the
switch "-t batch-64bit" to the "run" command. For example:
run -t batch-64bit 'myprogram -a -b -c file.dat'
Because mixing 64-bit and 32-bit versions of your programs
may cause confusion, the "run" command requires that you
submit 64-bit jobs from within a directory whose name
contains the string "64bit". For example, you might have
a directory called "64bit" underneath your home directory,
where you store all of your 64-bit programming projects.
Note that the 64-bit nodes only have a small subset of
the software that's installed on the other Galileo nodes.
If you find that libraries you need aren't available on
the 64-bit nodes, please let us know and we'll consider
installing them.
The best way to compile your code for use on the 64-bit nodes
is to submit a job to these nodes that actually does the
compilation and linking there. For example, in the
case of a small, self-contained C program, you might
type a command like this:
run -t batch-64bit 'gcc -o myprogram myprogram.c'
If you have a more complicated project with a Makefile,
you could do this:
run -t batch-64bit 'make'
In order to make compiling and debugging easier in
this environment, the "run" command can be instructed
to show you the output from a batch job as the job
is running. To do this, use the "-w" (for "watch")
switch. For example, the command:
run -w -t batch-64bit 'gcc -o myprogram myprogram.c'
will show you the output from gcc immediately. When
you're finished looking at the output, you'll need to
type Ctrl-C to exit. If the output is long, and you don't
want to wait for it to finish, you can type Ctrl-C at
any time and your job will continue to run normally
in the batch queue until it finishes, and you are notified
by e-mail as usual.