An Overview of Galileo
Design Goals
Even though Galileo is
cheap as supercomputers go, it still represents
a large monetary investment for our department. Because of this,
we've designed Galileo with the intent that almost everyone in the department
will benefit from it in some way. Most supercomputing clusters
are useful only to a few talented programmers who know how to write
parallel code that takes full advantage of the cluster. These users
are only a small fraction of our user base. In designing Galileo,
we've also kept in mind the average grad student or undergrad
(or faculty member) who doesn't want or need to spend time parallelizing code,
but needs more computing power than that provided by our previous
"compute server", an IBM PowerServer 370 RS6000. Our intent is that
everyone using the RS6000, in whatever capacity, will realize an
immediate benefit by migrating to Galileo.
Fast Serial Performance
To satisfy the
needs of these users, we've built Galileo from
fast nodes and implemented a number of load-balancing schemes.
Each node of Galileo is a 300 MHz Pentium II with 128 MB of RAM. Various
benchmarks show that a single node is from 1.3 to 2 times as fast
as our RS6000. Thus, even users who use only a single node of the
cluster will see improved performance.
Load Balancing
Performance is further
improved by spreading the user load around
the cluster. Galileo's nodes communicate through an internal 100 Mbps
ethernet network. One of the nodes has a second ethernet card, through
which the cluster communicates with the outside world. This node
acts as a firewall, mediating traffic into and out of the cluster.
Incoming connections to selected services (currently telnet, ftp,
http, ssh, rlogin, rsh and xdm) are automatically forwarded to
the currently least-loaded node. For example, with twelve nodes in the
cluster, each of the first twelve users who telnet into Galileo might
find that they have an entire node all to themselves.
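As a quick illustration: a user who logs in and then types
hostname
will see which node the forwarder chose. Two users logging in back
to back may well find themselves on two different nodes.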
Once users have logged
on to a cluster node, they are free to
use other nodes as well. Security has been set up so that users can
use other nodes transparently, without a password. For example,
a user might start the same application on two nodes, sending the
output to two separate files, by typing:
ssh node1 "myprogram 1 2 3 > outfile1 &"
ssh node2 "myprogram 4 5 6 > outfile2 &"
To help with load balancing, we've written an application called "run",
which will execute a command on the currently least-loaded node. For
example, instead of invoking a program by typing:
myprogram 1 2 3
a user could type:
run myprogram 1 2 3
"Run" preserves the current working directory (all user directories are
available across the cluster) and the user's current environment variables.
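For example (a hedged sketch, with placeholder directory and program
names), a user could launch a job from within a project directory, and
"run" would execute it in that same directory on whichever node it selects:
cd ~/projects/data
run myprogram 1 2 3 > outfile &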
Finally, the Mosix system provides a third layer of load balancing,
acting on each process on each node without
user intervention. Mosix allows processes to move to other nodes of the
cluster automatically. When the Mosix system determines that performance
could be improved by migrating a process to another node, it does so.
As far as the user is concerned, the process still looks like it's
executing locally. The process may migrate around the cluster, running
on several different nodes before it finishes. Mosix is installed on
all of the Galileo nodes, and runs automatically, without requiring any
special commands from the user. Users who want to manually control the
action of Mosix should look at the man page for the mosrun command.
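As a minimal sketch (the exact options accepted by mosrun vary between
Mosix versions, so consult the man page), a job can be started under
Mosix's explicit control with:
mosrun myprogram 1 2 3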
Fast Parallel Performance
The features described above
satisfy the needs of many of our users, but some users
really do have large problems that require the full power of the
cluster. To make this possible, we've built Galileo
with fast network connections between nodes, and taken
care that each node is well designed for fast communication
over that network. The computers that compose
Galileo are connected in a star topology, centered on
a 16-port 100 megabit per second ethernet switch. Since
networking speed can be limited by memory bandwidth, each
computer is built with SDRAM memory instead of the slower
FPM or EDO memory.
We've also installed several software
packages which make the task of writing parallel programs
easier. These include PVM ("Parallel Virtual Machine") and
MPI ("Message Passing Interface"), two programming
environments for parallel computing. A High Performance
Fortran compiler (pghpf) is also available. High Performance
Fortran is a dialect of Fortran with specialized features for
use in parallel applications.
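As a rough sketch of a typical session (assuming MPICH-style mpicc and
mpirun wrapper scripts; the file names are placeholders), a user might
compile and run a parallel program like this:
mpicc -o myprogram myprogram.c
mpirun -np 4 myprogram
pghpf -o myhpfprog myhpfprog.f
Here mpicc compiles a C program that calls the MPI library, mpirun starts
it as four cooperating processes, and pghpf compiles a High Performance
Fortran source file.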
For more information about Galileo, contact Bryan Wright.