Using the Rivanna Cluster
About the Rivanna Cluster:
In 2014, the College of Arts and Sciences, the Engineering
School, the UVa Library and the Data Science Institute purchased a
powerful new computing cluster called "rivanna". If you do
computationally intensive work, you might benefit from using it.
You can find a slideshow presentation about using the cluster here:
IntroductionToRivanna.pdf
The new cluster has 7,000 cores, 1.4 PB of temporary storage,
and fast InfiniBand interconnects between nodes. It runs a minimal
version of the CentOS operating system. The cluster is managed by
the UVa Advanced Research Computing Services (ARCS) group.
Usage of rivanna is metered in terms of "service
units" (currently, one service unit corresponds to one CPU-hour of use), with
each research group being allocated some number of service units.
Anyone can get a free one-time trial allocation of 5,000 service units.
Research groups can get a free "standard allocation" of
100,000 service units by submitting a request that includes
a description of their research. Standard allocations can optionally
be renewed. Researchers can also purchase more service units. (The
current price is $0.015 per service unit.)
For the most up-to-date information about rivanna, see
the ARCS web page: http://arcs.virginia.edu.
The following information is intended to help Physics users get
up and running quickly on the cluster.
Getting Started:
You'll need to do two things before you can begin using
rivanna:
- Create a "MyGroups" group to identify the members of your
research group. The people in this group will be allowed to
use part of your research group's allocation of service units
on rivanna. If you have an existing MyGroups group, you can use it.
To create a new MyGroups group, visit:
http://mygroups.virginia.edu
- Request an allocation of service units:
https://www.rc.virginia.edu/userinfo/rivanna/allocations/
The form for requesting or renewing a 100,000-unit "standard allocation" is here:
https://www.rc.virginia.edu/form/allocation-standard/
Using the Cluster:
- Logging in:
Once you've been granted an allocation, you can ssh into the cluster at:
rivanna.hpc.virginia.edu
Use your Eservices password to log in.
You can try things out interactively there, and it won't count against
your allocation.
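For example (replace mst3k with your own computing ID; the -Y flag turns
on X forwarding, which you'll need later for graphical programs like sview):
ssh -Y mst3k@rivanna.hpc.virginia.edu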
- Storage Space:
At least two chunks of storage space will be available to you on
the cluster: a home directory and a larger temporary "scratch" area.
See the ARCS web page above for the current paths and quotas.
- Available Software:
- Root:
I've installed root 6 (the default) and root 5.34.36.
To use these, you'll need to create or modify your
.bashrc and .bash_profile files and then log in again. You can download
an example .bashrc file here and an example
.bash_profile here. The .bashrc
must contain the following lines, and .bash_profile must source
.bashrc (a sketch of such a .bash_profile appears at the end of this subsection).
# Load modules:
module purge
module use /share/apps/modulefiles
module load cmake/3.12.3
module load gcc/7.1.0
module load anaconda/5.2.0-py3.6
module load physics/root
If you'd rather use root version 5.34.36, substitute this for the
final command:
module load physics/root/5.34.36
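For reference, here is a minimal sketch of a .bash_profile that simply
sources .bashrc, along with a quick check you can run after logging back
in to confirm the modules took effect (the exact version string you see
will depend on which root module you loaded):
# Minimal .bash_profile: just pull in .bashrc
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi
# After logging in again, verify that root is on your PATH:
which root
root-config --version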
- Geant4:
I've compiled geant4 versions 10.04 (the default), 10.02, and 10.02.p03, and associated packages.
To use these, you'll need to create or modify your
.bashrc and .bash_profile files and then log in again. You can download
an example .bashrc file here and an example
.bash_profile here. The .bashrc
must contain the following lines, and .bash_profile must source
.bashrc (see the quick check at the end of this subsection).
# Load modules:
module purge
module use /share/apps/modulefiles
module load cmake/3.5.2
module load physics/xerces-c
module load physics/root
module load physics/hepmc/2.06.09
module load physics/gccxml
module load physics/openscientist
module load physics/cernlib
module load physics/pythia8
module load physics/pythia6
module load physics/tbb
module load physics/geant4
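As with root, you can do a quick check after logging back in to confirm
that the geant4 environment is set up (this assumes the module puts the
geant4-config utility on your PATH):
geant4-config --version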
- Other loadable modules:
If you need other software besides what's available by default,
type "module avail" to see the optional pre-installed software. If you
see something you need in the resulting list, you can make it
available by typing "module load <module-name>". The command
"module list" will show you the modules you've loaded (see the
example after the list below).
Here are some of the modules currently available:
- Matlab
- Mathematica
- Ansys
- IDL
- Gaussian
- Rstudio
- Stata
- Totalview
...and many others.
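As an illustration, a typical session might look like the following
(the module names and versions shown are assumptions; check "module avail"
to see what's actually installed):
module avail              # list the optional pre-installed software
module load matlab        # load one of the optional packages
module list               # confirm which modules are now loaded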
If you need software that isn't available, please contact
physics-comp@virginia.edu and we'll work with ARCS to try to
get it installed for you.
Submitting Batch Jobs:
Once you have a program you'd like to submit as a batch job, you'll
need to write a "slurm" script. (Slurm is the batch queue management
system used on rivanna.)
The required slurm script will be different depending on whether you
want to submit a single job or a group of related jobs. Here is
an example of each.
- First, here's an example of a slurm script for a single task:
#!/bin/sh
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=12:00:00
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=mst3k@virginia.edu
#SBATCH --partition=standard
pawX11 -b testing.kumac
The #SBATCH lines tell the queue system about the job. Initially
you'll want to at least set the --output and --error files, which
define where stdout and stderr from your job will go. The --mail-user option doesn't
currently do anything, but it might be good to put your email
address there, just in case this feature is enabled later. The
last line is just the command you want to run. The current working
directory will automatically be set to the directory you're in
when you submit the job.
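As a more physics-flavored sketch, here's what a single-task script might
look like for running a root macro in batch mode. It reuses the module
lines from the Root section above; the macro name myanalysis.C is just a
placeholder for your own code:
#!/bin/sh
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=01:00:00
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=standard
# Load the same modules you use interactively (see the Root section above):
module purge
module use /share/apps/modulefiles
module load cmake/3.12.3
module load gcc/7.1.0
module load anaconda/5.2.0-py3.6
module load physics/root
# Run the macro non-interactively; -b suppresses graphics, -q quits when done:
root -l -b -q myanalysis.C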
- Now, here's an example showing a slurm script appropriate for
submitting a group of related tasks (for example,
to analyze a group of related data sets, or to run a simulation
several times with different parameters):
#!/bin/sh
#SBATCH --time=12:00:00
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=mst3k@virginia.edu
#SBATCH --partition=standard
# Set the number of tasks in the following line:
#SBATCH --ntasks=10
for ((i=0;i<$SLURM_NTASKS;i++))
do
# ---------------------------------------------------------------------
# Customize this section to meet your needs.
#
# In this example, we submit tasks to analyze many data files.
# Each data file has a name like "run*.rz". We want to analyze
# runs 5001 through 5010 (10 runs) so we've set "ntasks" to 10,
# above.
#
firstrun="5001";
((runnumber=$firstrun+$i))
# Define the variable "command" so that it does what you want:
command="analyze run$runnumber.rz"
# ---------------------------------------------------------------------
echo "Submitting task $i: $command"
srun --cpus-per-task=1 --cpu_bind=cores --exclusive --nodes=1 --ntasks=1 $command 1> slurm-$i.out 2> slurm-$i.err &
done
wait
Assuming that your slurm script is called "testing.slurm", you could
submit it to the batch queues by typing:
sbatch testing.slurm
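When the job is accepted, sbatch responds with its job ID, for example:
Submitted batch job 123456
(the number shown here is just an illustration). You can use this ID to
look the job up later with squeue or sacct.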
The slurm system divides the cluster's resources into several
sections called "partitions". (Many other batch systems would
use the term "queues" for these.) Most users will want to use
either the "standard" partition, which
includes most of the
cluster's computing resources, or the "dev" partition.
The "dev" partition is for development and testing. It can be used
at no charge, but is limited to 2 hours. You can choose which partition your jobs will
use by modifying the "--partition" line in your slurm files.
(See the example above.)
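For example, to run a short test in the free "dev" partition, you might
change the relevant lines of your slurm script to:
#SBATCH --partition=dev
#SBATCH --time=02:00:00
(keep --time at or below two hours, since that's the dev partition's limit).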
Note that there's also a "parallel" partition, intended for
parallel jobs. It's important to remember that "parallel" is just
a name: your jobs don't need to be parallelized (e.g., by using MPI)
in order to use this partition, and using this partition won't do
any magic parallelization of your jobs.
For more information about each of these partitions, and a few
others, see:
http://arcs.virginia.edu/rivanna
You can watch the progress of your batch jobs with:
squeue
The command "sacct" will also give you a summary of your running
and completed jobs, along with their exit status.
To get an overview of the queues, type "sview". (This is an X
program, so you'll need to be sure you ssh in with X forwarding
turned on.) Remember that, in slurm's terminology, a batch queue is called a
"partition".
ARCS provides more information about using slurm on rivanna here:
http://arcs.virginia.edu/user-info/simple-linux-utility-resource-management-slurm
Complete documentation about slurm can be found here:
http://slurm.schedmd.com.
Running Graphical Programs:
If you need to use graphical tools while developing your programs, the
cluster supports the "FastX" protocol. This requires that you install
a proprietary client, available here:
https://arcs.virginia.edu/fastx
Viewing Your Remaining Allocation:
The command "allocations" will show you how many CPU hours remain
in your allocation.