How to copy data from NERSC

by Fred Ross
Last updated July 28, 2004

Most of our data is on the HPSS mass storage system at NERSC. They encourage local users to access HPSS with their screwed up little program hsi. Don't listen to them. HPSS is also accessible by FTP as archive.nersc.gov, including remotely.

Access archive.nersc.gov Remotely

You need to set up a .netrc file in your home directory on the machine from which you wish to access HPSS. Traditionally the .netrc file contains three-line entries of the form:

machine some.machine.somewhere
login username
password whatever

FTP clients are supposed to read this at startup (unless specifically told not to). If you FTP to a server mentioned in the file, it should use the username and password pair defined there to log in.

For security reasons, NERSC doesn't allow this plaintext form. The HPSS folks provide full instructions for setting up your .netrc. Everything you need to know is in there; just read it carefully.
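
Whatever form the entry takes, the .netrc file should be readable only by you. As far as I know, Net::Netrc (which Net::FTP uses to read the file) skips a .netrc that group or others can access. Here is a minimal sketch of a permissions check, assuming the file lives at $HOME/.netrc. The helper name netrc_is_private is made up for this example and is not part of any module.

```perl
use strict;
use warnings;

# Return 1 if $path exists and grants no access to group or others,
# 0 otherwise. (Hypothetical helper, not part of Net::FTP or Net::Netrc.)
sub netrc_is_private {
    my ($path) = @_;
    my @st = stat($path) or return 0;   # stat returns an empty list on failure
    return (($st[2] & 077) == 0) ? 1 : 0;
}

# Assumption: .netrc sits in $HOME (falls back to the current directory).
my $home  = defined $ENV{HOME} ? $ENV{HOME} : ".";
my $netrc = "$home/.netrc";
if (-e $netrc and not netrc_is_private($netrc)) {
    warn "$netrc is accessible by others; run: chmod 600 $netrc\n";
}
```

If the warning appears, chmod 600 on the file is the usual fix.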

Layout of Data on HPSS

All of HyperCP's data is in the subdirectory /nersc/projects/e871.

bin                  - raw datafiles after initial farming
cas                  - cascades split with new_ctrack
histo          
k3pi                 - k3pi split with new_ctrack
k4decays
lis
old_ctrack           - contains subdirectories of data split with old_ctrack
  |-- 97             - data from 1997 runs
      |-- neg 
      |-- pos
  |-- 99             - data from 1999 runs
      |-- pol        - data with a polarized beam (Melin's thesis)
      |-- unpol      - data with an unpolarized beam
            |-- neg 
            |-- pos
omega                - omegas split with new_ctrack
rare                 - rares split with new_ctrack
stat
weff

Each of the neg and pos directories contains run directories. For example,

4225
  |--  cas
         |-- UKEM50.UI5021.108.old.cas.gz
         |-- UKEM51.UI5021.083.old.cas.gz
         ...
  |--  k3pi
         |-- UKEM50.UI5021.108.old.k3pi.gz
         ...
  |--  om
  |--  piotr
  |--  rare

Note that all the data files in old_ctrack are gzipped. You will need to uncompress them before you use them.
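
Running gunzip from the shell works fine for this. If you would rather stay in Perl, here is a minimal sketch using IO::Uncompress::Gunzip, which ships with recent Perls. The helper name gunzip_file is made up for this example, and it assumes each file name ends in .gz.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Uncompress FILE.gz to FILE in the same directory and return the new
# name. The original .gz file is left in place.
# (gunzip_file is a hypothetical helper name.)
sub gunzip_file {
    my ($gz) = @_;
    (my $out = $gz) =~ s/\.gz$//
        or die "$gz does not end in .gz\n";
    gunzip($gz => $out)
        or die "gunzip of $gz failed: $GunzipError\n";
    return $out;
}

# Usage: perl gunzip_all.pl *.gz
print gunzip_file($_), "\n" for @ARGV;
```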

Automated Copying of Data from HPSS

The easiest way to copy data is to hack out a little bit of Perl. I have also begun experimenting with Scheme (in particular scsh) with great initial success.

Perl has a package which provides FTP access called Net::FTP. If it's not there, bug the admin until he installs it. It should be there. The Net bundle of packages is vital for getting work done.

At the head of your Perl script you need to load the module with use Net::FTP;. Then I use the following code to log into HPSS:

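my $hpss = Net::FTP->new("archive.nersc.gov",
                         Debug => 1,
                         Timeout => 1000000)
    or die "Could not connect to archive.nersc.gov: $@";

# With no arguments, login() takes the credentials from .netrc
$hpss->login()
    or die "Could not log in to archive.nersc.gov: " . $hpss->message;
$hpss->binary();
$hpss->cwd("/nersc/projects/e871/old_ctrack/")
    or die "Could not change directory: " . $hpss->message;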

You will need to change the directory in the last line to what you need. The cwd method stands for change working directory; it is the equivalent of cd in command-line clients. The login method would normally take a username and password, but with no arguments it reads them from our .netrc. We must use binary mode for the transfer, since these are binary files rather than text, and ASCII mode can corrupt them.

$hpss->ls() returns a list of strings containing the results of an ls command in the working directory. $hpss->get(REMOTE_FILE [, LOCAL_FILE]) fetches REMOTE_FILE to LOCAL_FILE. LOCAL_FILE (and the associated comma) is optional; if it is missing, the file keeps its remote name and is placed in the current local directory. get returns undef if the transfer fails. To log out of the FTP session, use $hpss->quit();.
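
Putting ls and get together, here is a minimal sketch of fetching only the files whose names match a pattern. The helper name fetch_matching is made up for this example. It assumes $ftp is a logged-in Net::FTP connection already in the right remote directory, though any object with ls and get methods will do.

```perl
use strict;
use warnings;

# Fetch every file in the current remote directory whose name matches
# $pattern into the current local directory. Returns the names that were
# fetched successfully. (fetch_matching is a hypothetical helper, not
# part of Net::FTP.)
sub fetch_matching {
    my ($ftp, $pattern) = @_;
    my @fetched;
    foreach my $name ($ftp->ls()) {
        next unless $name =~ $pattern;
        if (defined $ftp->get($name)) {   # get returns undef on failure
            push @fetched, $name;
        } else {
            warn "Could not fetch $name\n";
        }
    }
    return @fetched;
}
```

With the connection from above, my @got = fetch_matching($hpss, qr/\.cas\.gz$/); would copy just the cascade files.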

You can find in-depth information about the Net::FTP module (or any Perl module on a proper installation) by typing perldoc Net::FTP at the command line (replace Net::FTP with the module name if you're looking for something else).

If you have only a run number, a useful piece of code to get to the directory you need without knowing the details of 97 vs. 99, pos vs. neg, etc. is the following. The paths are relative to old_ctrack, so change to that directory first:

$hpss->cwd("97/neg/$run/rare")
    or $hpss->cwd("97/pos/$run/rare")
    or $hpss->cwd("99/pol/$run/rare")
    or $hpss->cwd("99/unpol/pos/$run/rare")
    or $hpss->cwd("99/unpol/neg/$run/rare")
    or die "Could not find rare directory for $run";
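
The same search can be written as a loop over a list of candidate directories, which also tells you which one matched. This is a minimal sketch: the names @RUN_PARENTS and cwd_to_run are made up for this example, and it assumes $ftp is already sitting in old_ctrack.

```perl
use strict;
use warnings;

# Subdirectories of old_ctrack that hold run directories, in search order.
my @RUN_PARENTS = ("97/neg", "97/pos", "99/pol",
                   "99/unpol/pos", "99/unpol/neg");

# Change into the $kind subdirectory (cas, k3pi, rare, ...) of run $run.
# Returns the relative path that worked, or nothing if the run was not
# found under any parent. (cwd_to_run is a hypothetical helper.)
sub cwd_to_run {
    my ($ftp, $run, $kind) = @_;
    foreach my $parent (@RUN_PARENTS) {
        my $path = "$parent/$run/$kind";
        return $path if $ftp->cwd($path);   # cwd is true on success
    }
    return;
}
```

A call such as my $where = cwd_to_run($hpss, $run, "rare") or die "Could not find rare directory for $run"; then behaves like the chain above.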

An Example

Here is a script that takes pos or neg as its first command-line argument, followed by a series of run numbers, and copies the cas files for those runs from the corresponding 99/unpol directory. It doesn't use the directory changer above, because I hacked it out in a couple of minutes and the requirements were sufficiently strict that I didn't need it.

#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;

use sigtrap qw/die untrapped normal-signals/;

use Net::FTP;
use IO::Handle;

my $pol = shift @ARGV;
die "Usage: $0 pos|neg run [run ...]\n"
    unless defined $pol and $pol =~ /^(?:pos|neg)$/;

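my $hpss = Net::FTP->new("archive.nersc.gov",
                         Debug => 1,
                         Timeout => 1000000)
    or die "Could not connect to archive.nersc.gov: $@";

$hpss->login()
    or die "Could not log in to archive.nersc.gov: " . $hpss->message;
$hpss->binary();
$hpss->cwd("/nersc/projects/e871/old_ctrack/99/unpol/$pol")
    or die "Could not change to 99/unpol/$pol: " . $hpss->message;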

foreach my $run (@ARGV) {
    die "Invalid run number $run" 
        unless $run =~ /^[0-9][0-9][0-9][0-9]$/;
    print "Run $run ... ";
    $hpss->cwd($run . "/cas")
        or die "Could not find cas directory for run $run";
    my @entries = $hpss->ls();
    foreach my $filename (@entries) {
        if (defined($hpss->get($filename))) {
            print "$filename ";
        } else {
            print "FAILED ";
            $hpss->quit();
            die "Failed to get file $filename: $!";
        }
        STDOUT->flush();
    }
    print "\n"; 
    STDOUT->flush();
    $hpss->cwd("../..");
}

$hpss->quit();

If you're feeling masochistic, here is a much more involved script that automatically looks for a place to stash the data and logs what it copied:

#!/usr/bin/perl -w
use strict;
use diagnostics;

use sigtrap qw/die untrapped normal-signals/;

use Net::FTP;
use IO::Handle;

# Parse command line options
my $run = shift @ARGV;
die "Invalid run number $run" unless $run =~ /^[0-9][0-9][0-9][0-9]$/;

# Open log files
my $logdir = "/home/fred/new_rare_copy/log";
open(COPIED, ">", "$logdir/$run.succ")
    or die "Could not open $logdir/$run.succ for writing: $!";
COPIED->autoflush(1);
open(FAILED, ">", "$logdir/$run.fail")
    or die "Could not open $logdir/$run.fail for writing: $!";
FAILED->autoflush(1);
open(LOG, ">", "$logdir/$run.log")
    or die "Could not open $logdir/$run.log for writing: $!";
LOG->autoflush(1);

# Log into the FTP server
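my $hpss = Net::FTP->new("archive.nersc.gov", Debug => 0,
                         Timeout => 10000)
    or die "Could not connect to archive.nersc.gov: $@";
$hpss->login()
    or die "Could not log in to archive.nersc.gov: " . $hpss->message;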
$hpss->binary();
$hpss->cwd("/nersc/projects/e871/old_ctrack");
$hpss->cwd("97/neg/$run/rare")
    or $hpss->cwd("97/pos/$run/rare")
    or $hpss->cwd("99/pol/$run/rare")
    or $hpss->cwd("99/unpol/pos/$run/rare")
    or $hpss->cwd("99/unpol/neg/$run/rare")
    or die "Could not find rare directory for $run";
my $pol = ((my $path = $hpss->pwd()) =~ /pos/) ?
    "pos" : "neg";
print LOG "Selected $pol\n";

my @files = $hpss->ls();
my $size = 0;
foreach my $file (@files) {
    $size += $hpss->size($file); # in bytes
}
$size /= 1024; # change to kb
print LOG "Run is $size kb\n";

# Select destination drive
my @lines = `df -k | grep duser`;
my %drives;
foreach my $line (@lines) {
    chomp $line;
    # $avail is the free space in kb (distinct from the run's $size above)
    my (undef,undef,undef,$avail,undef,$mount) =
        split /\s+/, $line, 6;
    $drives{$mount} = $avail;
}
my @results = sort {$drives{$b} <=> $drives{$a}} keys %drives;
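# Skip the drives we must not use, without running off the end of the list
shift @results while (@results
                      and ($results[0] eq "/duser7"
                           or $results[0] eq "/duser13"));
die "No usable drive found" unless @results;
my $drive = $results[0];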
if ($drives{$drive} < $size) {
    die "Insufficient space to copy files on $drive";
}
print LOG "Selected $drive with ". $drives{$drive} ." kb\n";

# Prepare local directory for copying
my $destdir = "$drive/userdata/fred/$pol/$run";
mklongdir($destdir) unless -e $destdir;
die "Failed to create $destdir" unless -e $destdir;
chdir($destdir) or die "Could not change to $destdir: $!";

# Fetch files
foreach my $file (@files) {
    my $i = 0;
    print LOG "$file fetched: ";
    while ($i < 3) {
        if ($hpss->get($file, "$destdir/$file")
            and -e "$destdir/$file") {
            last;
        } else {
            $i++;
        }
    }
    if (-e "$destdir/$file") {
        print LOG "yes\n";
        print COPIED "$file\n";
    } else {
        print LOG "no\n";
        print FAILED "$file\n";
    }
}

exit(0);

sub mklongdir {
    chomp(my $olddir = `pwd`); # strip the trailing newline so chdir works
    chdir "/";
    my $path = shift;
    my @dirs = split /\//, $path;
    my $junk = shift @dirs; # Remove initial blank
    foreach my $dir (@dirs) {
        mkdir $dir unless -e $dir;
        chdir $dir or die "Failed to change to $dir";
    }
    chdir($olddir);
}

END {
    $hpss->quit() if defined $hpss; # the script may die before connecting
}
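
The drive selection step is the most fragile part of the script, since it depends on the column layout of df. Here is a minimal sketch of the same idea as a standalone function, assuming the POSIX output format of df -kP (one line per filesystem, free space in the fourth column, mount point in the sixth). The helper name pick_drive is made up for this example. Unlike the grep in the script, it matches on the mount point only.

```perl
use strict;
use warnings;

# Pick the mount point with the most free space from df -kP output.
# $df_lines is an array ref of output lines, $pattern selects candidate
# mount points, and $exclude is a hash ref of mount points to skip.
# Returns (mount, free_kb), or an empty list if nothing qualifies.
# (pick_drive is a hypothetical helper.)
sub pick_drive {
    my ($df_lines, $pattern, $exclude) = @_;
    my ($best, $best_free);
    foreach my $line (@$df_lines) {
        chomp(my $l = $line);
        my (undef, undef, undef, $free, undef, $mount) = split ' ', $l, 6;
        next unless defined $mount and $mount =~ $pattern;
        next if $exclude->{$mount};
        next unless $free =~ /^\d+$/;       # skips the header line too
        ($best, $best_free) = ($mount, $free)
            if !defined $best_free or $free > $best_free;
    }
    return defined $best ? ($best, $best_free) : ();
}

my @df = `df -kP`;
my ($drive, $free) = pick_drive(\@df, qr{^/duser},
                                { "/duser7" => 1, "/duser13" => 1 });
print defined $drive ? "$drive has $free kb free\n" : "No usable drive\n";
```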

As always, you should go read the Camel book if you intend to write Perl. If you're interested in Scheme, once you have the appropriate SUNet module loaded, the following code follows the same pattern as the first script above: it connects, visits each run directory in turn, and prints the first entry it finds there.


(define runlist '("4463" "4464" "4465")) ; strings, since ftp-cd is given a directory name
(define hpss (ftp-connect "archive.nersc.gov" #f #f #t))
(ftp-set-type! hpss (ftp-type binary))
(ftp-cd hpss "/nersc/projects/e871/bin")

(define (fetch-first dir)
  (let ((olddir (ftp-pwd hpss)))
    (and (ftp-cd hpss dir)
         (display (car (ftp-ls hpss)))
         (ftp-cd hpss olddir))))

(map fetch-first runlist)