
Scone user manual

26/01/09



1 What is Scone

Scone is a cluster of Linux servers designed to fulfill the High Performance Computing needs of the Department of Mathematics. It consists of 23 64-bit machines, all of which run Linux 2.6.28.

There are five machines for general use on the Scone system; seven machines for the statistics group which run under the condor job-submission system (plus two more, jake and elwood, which do not); seven machines for the applied group, which do not use job-submission software; and one machine each for the fluids and pure groups. A summary of the situation is shown below.

Machine    Group       Condor  CPU                   RAM
node1      General     No      8 x 2.6 GHz Opteron   32 GB
node2      General     No      8 x 2.6 GHz Opteron   32 GB
node3      General     No      8 x 2.6 GHz Opteron   32 GB
node4      General     No      8 x 2.6 GHz Opteron   32 GB
node5      General     No      8 x 2.6 GHz Opteron   64 GB
zeppo      Statistics  Yes     4 x 2.2 GHz Opteron    8 GB
chico      Statistics  Yes     4 x 2.2 GHz Opteron    8 GB
harpo      Statistics  Yes     4 x 2.2 GHz Opteron    8 GB
groucho    Statistics  Yes     4 x 2.2 GHz Opteron    8 GB
barker     Statistics  Yes     4 x 2.2 GHz Opteron    8 GB
morecambe  Statistics  Yes     2 x 2.6 GHz Opteron    8 GB
wise       Statistics  Yes     2 x 2.6 GHz Opteron    8 GB
jake       Statistics  No      8 x 2.3 GHz Opteron   16 GB
elwood     Statistics  No      8 x 2.3 GHz Opteron   16 GB
kelvin     Applied     No      4 x 2.6 GHz Opteron    8 GB
reynolds   Applied     No      4 x 2.6 GHz Opteron    8 GB
riemann    Applied     No      4 x 2.6 GHz Opteron   16 GB
darcy      Applied     No      4 x 3.0 GHz Opteron   12 GB
rayleigh   Applied     No      4 x 3.0 GHz Opteron   12 GB
hardy      Applied     No      4 x 2.6 GHz Opteron   16 GB
bernoulli  Applied     No      4 x 3.0 GHz Opteron   16 GB
Taylor     Fluid       No      8 x 2.6 GHz Opteron   32 GB
Heilbronn  Pure        No      4 x 2.6 GHz Opteron   10 GB

1.1 Software available

The software below is available on Scone. Requests for additional software can be made via support-maths@bristol.ac.uk.

  • GNU C, C++ and Fortran compilers. These may be invoked by the commands gcc, g++ and gfortran. For more details type man gcc, man g++ or man gfortran.
  • Matlab, currently R2009b (see the batch-mode sketch after this list).
  • Maple 12
  • The R statistical package
  • Python, with numpy and scipy
  • Gnuplot
  • gsview
  • Java (sun-jdk-1.6.0.15)
  • Mathematica 7.0 (only on node1)
  • NAG libraries (32-bit version in /usr/local/nag-21/fll3a21dfl/lib/)
  • g95
  • ghc
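
Matlab in particular is often run non-interactively on the compute nodes. A minimal sketch, assuming a hypothetical script file myscript.m in the current directory:

$ nohup matlab -nodisplay -nosplash < myscript.m > myscript.log 2>&1 &

Here -nodisplay and -nosplash suppress the graphical interface, and nohup lets the job keep running after you log out.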

1.2 Things to note

  • Do not run any jobs on the head machine (Scone). You may compile and run very small test programs; anything else will be killed without notice.
  • Any user may submit a job through condor; however, members of the statistics group have a much higher priority.
  • Only members of the appropriate group may log on to their machines directly.

2 Logging in

Access to Scone is via ssh. Authentication is based on the UOB username and password provided by the University. You must first log in to the head machine scone.maths.bris.ac.uk before accessing other machines. Remember to enable X-forwarding on your ssh client if you want to use the graphical user interface (GUI) provided by programs such as Matlab.
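
From a Linux machine, for example, X-forwarding is enabled with ssh's -X flag:

$ ssh -X username@scone.maths.bris.ac.uk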

2.1 Logging in to Scone from the University Network

2.1.1 Linux Machines

From Linux machines, assuming you are logged in to your machine using your UOB account, you can simply do the following:

$ ssh scone 

If you are logged in using a local account, try:

$ ssh username@scone 

The first time you log in to scone you will see the following message:

The authenticity of host 'scone (137.222.80.37)' can't be established. 
RSA key fingerprint is de:8b:77:8d:d7:af:07:d2:8f:6e:64:4c:6f:ec:2b:cd. 
Are you sure you want to continue connecting (yes/no)? 

Please verify that the fingerprint matches the one above and proceed by answering 'yes'.

2.1.2 Windows machines

From Windows, open your ssh client...

2.2 Logging in to the Scone nodes

Once you are logged in to Scone you will see the Linux command prompt. You can now log in to any of the nodes available to your group via ssh. To avoid entering your password every time, you can create an ssh keypair. To do so, execute the following commands:

user@scone ~ $ ssh-keygen 

When prompted for the file location in which to save the key, just hit enter. You will also be prompted for a passphrase; you can leave this empty for convenience, but keep in mind that this means anyone with access to your private key can log in as you.
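
For reference, the interaction typically looks something like the following (the exact wording depends on the ssh version installed, and the paths will contain your own username):

Generating public/private rsa key pair. 
Enter file in which to save the key (/home/local/UOB/user-name/.ssh/id_rsa): 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/local/UOB/user-name/.ssh/id_rsa. 
Your public key has been saved in /home/local/UOB/user-name/.ssh/id_rsa.pub.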

Then copy your public key into the authorized_keys file. If you don't already have such a file in your .ssh directory, the easiest way is to just copy it:

user@scone ~ $ cp .ssh/id_rsa.pub .ssh/authorized_keys 
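
If you already have an authorized_keys file, append the new key to it instead of overwriting it, and make sure the file is readable and writable only by you, since ssh may refuse to use a key file with looser permissions:

user@scone ~ $ cat .ssh/id_rsa.pub >> .ssh/authorized_keys 
user@scone ~ $ chmod 600 .ssh/authorized_keys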

After this you should be able to log in to the node machines without typing your password (or by entering the passphrase, if you chose one).
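
For example, a member of the general group could test this against one of the general-use nodes from the table in section 1:

user@scone ~ $ ssh node1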

3 Getting your files onto Scone

Every user is assigned a home directory on Scone; the default location is /home/local/UOB/user-name. This space is intended to hold user files relevant to HPC work.

3.1 Via scp/sftp

From Linux you can copy files to Scone using the commands scp or sftp. Please see the manual pages for these commands for details of their usage.
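
For example, to copy a hypothetical archive results.tar.gz into your Scone home directory and to fetch it back again:

$ scp results.tar.gz username@scone.maths.bris.ac.uk: 
$ scp username@scone.maths.bris.ac.uk:results.tar.gz .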

Windows clients can make use of the file transfer utility included with the ssh client...

3.2 Via samba

Home directories on Scone are shared on the network via Samba. Linux clients in the maths department are configured to mount Scone automatically inside the user's home directory, so Scone files should be available in the scone directory within your home. Alternatively, the share can be mounted manually with a command like the one below (here /mnt/scone is a hypothetical mount point, which must be an existing directory):

$ mount -t cifs -o user=user-name //scone.maths.bris.ac.uk/homes /mnt/scone 

On Windows machines, the location \\scone.maths.bris.ac.uk\homes should direct users to their home directory.

4 Using the condor queue

Remember that if you are not in the statistics group, you currently have very low priority on the condor queue.

4.1 Submitting a job

When submitting jobs via condor it is not necessary to actually login to any of the compute servers. There are two stages to submitting a job:

  1. Write a file that describes the job to be submitted.
  2. Submit the job via condor_submit.

4.1.1 Writing the condor file

Please note there is a run-time restriction on condor jobs: chico, harpo, groucho and barker are short-job machines (with a two-week computation limit), whereas morecambe, wise and zeppo are long-job machines, without this hard limit.

If you think your job will last for more than two weeks, you must specify in your condor job-submission file that it should run on a long-job machine, using:

Requirements = \ 
Machine == "morecambe.private2.maths.bris.ac.uk" \ 
|| Machine == "wise.private2.maths.bris.ac.uk" \ 
|| Machine == "zeppo.private2.maths.bris.ac.uk"

IF YOU DO NOT SPECIFY THIS, AND YOUR JOB LASTS FOR MORE THAN TWO WEEKS, IT WILL BE KILLED.

Here is an example of a condor file. This runs a command from the current directory:

#################### 
## 
## Test Condor command file 
## 
#################### 
executable = ustone6 
Universe = vanilla 
error = ustone6.err 
output = ustone6.out 
log = ustone6.log 
Queue 

This file tells condor that the executable "ustone6" is to be run. Standard output from this executable goes into the file "ustone6.out" and standard error into "ustone6.err". The file "ustone6.log" contains messages from the condor system (job status, errors and so on).
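
If your program takes command-line arguments, or you want several copies of it queued at once, the same file can be extended. A sketch using condor's $(Process) macro, which expands to 0, 1, 2, ... for each queued copy (ustone6 is the same example executable as above):

executable = ustone6 
Universe = vanilla 
arguments = $(Process) 
error = ustone6.$(Process).err 
output = ustone6.$(Process).out 
log = ustone6.log 
Queue 10 

This queues ten copies of ustone6, each receiving its copy number as an argument and writing to its own output and error files.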

4.1.2 Submitting and managing condor jobs

Once your condor file is ready, run the condor_submit command:

user@scone:~> condor_submit ustone6.cmd 
Submitting job(s). Logging submit event(s). 
1 job(s) submitted to cluster 28. 
user@scone:~>

The condor_status command lists the status of the condor cluster, as below.

marfc@scone:~> condor_status 
Name           OpSys   Arch     State      Activity  LoadAv  Mem   ActvtyTime 
vm1@barker.pr  LINUX   x86_64   Unclaimed  Idle      0.000   2031  0+00:45:17 
vm2@barker.pr  LINUX   x86_64   Unclaimed  Idle      0.000   2031  0+00:45:14 
vm3@barker.pr  LINUX   x86_64   Unclaimed  Idle      0.000   2031  0+00:45:11 
vm4@barker.pr  LINUX   x86_64   Unclaimed  Idle      0.000   2031  0+00:45:08 
vm1@chico.pri  LINUX   x86_64   Unclaimed  Idle      0.000   2031  0+00:45:17 
vm2@chico.pri  LINUX   x86_64   Unclaimed  Idle      0.000   2031  0+00:45:14 
vm3@zeppo.pri  LINUX   x86_64   Unclaimed  Idle      0.000   2031  0+00:28:04 
vm4@zeppo.pri  LINUX   x86_64   Unclaimed  Idle      0.000   2031  0+00:28:01

The fields have the following meanings.

  • [Name] - Lists the name of the processor/machine combination. So vm1@groucho.private2.maths.bris.ac.uk is the first processor on the machine groucho.
  • [OpSys] - The operating system.
  • [Arch] - The CPU architecture. Currently all nodes run AMD Opterons (x86_64). In the future this may change as more machines are added. In a multi-architecture array, jobs may be submitted requesting a specific architecture.
  • [State] - Lists the current state of the machine from the viewpoint of condor scheduling. This may be one of the following:

    • Owner - The machine is being used by the owner of the machine (for example a member of the appropriate research group), and/or is not available to run condor jobs. When the machine first starts up, it begins in this state.
    • Matched - The machine is available to run jobs and has been matched to a specific job, but condor has not yet claimed it. In this state, the machine is unavailable for further matches.
    • Claimed - The machine has been claimed by condor. No further jobs will be allocated by condor to this machine until the current job has ended.
    • Preempting - The machine was claimed, but is now preempting that claim. This is most likely because someone has logged on to the machine and is running jobs directly.
  • [Activity] - Lists what the machine is actually doing. The details depend upon the condor State, but in general they can be summarised as below.

    • Idle - The machine is not doing anything that was initiated by condor.
    • Busy - The machine is running a job that was initiated by condor.
    • Suspended - The current job has been suspended. This is most likely because a user has logged on to the machine and is running jobs directly.
    • Killing - The job is being killed.
  • [LoadAv] - Lists the load average on the machine.
  • [Mem] - Lists the memory per CPU on the machine.
  • [ActvtyTime] - Lists the node activity time.
So in the (truncated) example above, all of the listed processors are unclaimed and idle, and that part of the cluster is quiet. The state of the condor queue can also be examined by the command condor_q (local to the current machine) or condor_q -global (across all machines), as below.

user@scone:~/benchmarks> condor_q 
- Submitter: scone.private2.maths.bris.ac.uk : <172.16.80.65:33772> 
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 
31.0 marfc 1/13 10:04 0+00:00:20 R 0 0.0 ustone6 
32.0 marfc 1/13 10:04 0+00:00:01 R 0 0.0 ustone6 
2 jobs; 0 idle, 2 running, 0 held 

This shows that two jobs, numbers 31 and 32, owned by user marfc, are currently running on scone.

To delete a job, use the condor_rm command on the machine from which you submitted the job. The full sequence for submitting, listing and removing a job is shown below.

user@scone:~/benchmarks> condor_submit ustone6.cmd 
Submitting job(s). Logging submit event(s). 
1 job(s) submitted to cluster 33. 
user@scone:~/benchmarks> condor_q 
- Submitter: scone.private2.maths.bris.ac.uk : <172.16.80.65:33772> 
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 
33.0 marfc 1/13 10:07 0+00:00:03 R 0 0.0 ustone6 
1 jobs; 0 idle, 1 running, 0 held 
user@scone:~/benchmarks> condor_rm 33.0 
Job 33.0 marked for removal
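
To remove several jobs at once, condor_rm also accepts a list of job numbers or a username; for example, the command below (using the example user above) would mark every job owned by marfc for removal:

user@scone:~/benchmarks> condor_rm marfc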