Getting Started on Dogwood

Table of Contents
Introduction
System Information
Getting an Account
Logging In
Directory Spaces
Mass Storage
Development and Application Environment: Modules
Job Submission

Additional Help

Introduction

  • The Dogwood cluster is a Linux-based computing system available to researchers across campus. It is intended to run large-way, distributed-memory, multi-node parallel jobs, and it has a fast switching fabric (InfiniBand EDR interconnect) for this purpose. By large-way jobs we mean jobs that span more than one node (here, more than 44 cores).

System Information

Getting an Account

You can visit the Onyen Services page, then click on the Subscribe to Services button and select Dogwood Cluster.

Logging In

Linux:

Linux users can use ssh from within their Terminal application to connect to Dogwood.

If you wish to enable X11 forwarding, use the “-X” ssh option. Be sure to use your UNC ONYEN and password for the login:

ssh -X <onyen>@dogwood.unc.edu

Windows:

Windows users should download MobaXterm (Home Edition). Then use the Session icon to create a Dogwood SSH session using dogwood.unc.edu for “Remote host” and your ONYEN for the “username” (Port should be left at 22).

Mac:

Mac users can use ssh from within their Terminal application to connect to Dogwood. Be sure to use your UNC ONYEN and password for the login:

ssh -X <onyen>@dogwood.unc.edu

To enable X11 forwarding, Mac users will need to download, install, and run XQuartz on their local machine in addition to using the “-X” ssh option. Furthermore, in many instances X11 forwarding only works properly if Mac users run the terminal application that comes with XQuartz instead of the default Mac Terminal application.

A successful login takes you to “login node” resources that have been set aside for user access. The login node is where you will edit your code, execute basic UNIX commands, and submit your jobs to the SLURM job scheduler.

DO NOT RUN YOUR CODE OR RESEARCH APPLICATIONS DIRECTLY ON THE LOGIN NODE. THESE MUST BE SUBMITTED TO SLURM!

Directory Spaces

NAS home space

Your home directory will be in /nas/longleaf/home/. Your home directory has a 50 GB soft limit and a 75 GB hard limit. Note that the Dogwood and Longleaf clusters share the same home file space. Thus, if you use both clusters, we strongly recommend creating a longleaf and/or dogwood subdirectory under your home directory to keep the files separated as needed.
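
For example, to set up the suggested subdirectories (the directory names here are only illustrative; use whatever layout suits your work):

mkdir -p ~/dogwood ~/longleaf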

Work/Scratch Space

Your scratch directory will be in

/21dayscratch/scr/o/n/onyen

(the “o/n/” are the first two letters of your onyen). This is the scratch space for working with large files. The following apply to the scratch space:

  • Scratch space uses standard UNIX permissions to control access to files and directories. By default other users in your group (graduate students, faculty, employees) have read access to your scratch directory. You can easily remove this read permission with the “chmod” command, as shown in the example after this list.
  • A file cleanup policy is enforced on scratch space: any file not used or modified in the last 21 days will be deleted.
  • Scratch space is a shared, temporary work space. Please note that scratch space is not backed up and is, therefore, not intended for permanent data storage. See the “Mass Storage” section below for how to store permanent data.
  • Note it is a violation of research computing policy to use artificial means, such as the “touch” command, to maintain unused files in the scratch directory beyond their natural lifetime. Violators will be warned, and repeat violators are subject to loss of privileges and access. This is a shared resource; please be courteous to other users.
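
For example, one way to remove group and other users’ access to your scratch directory is to clear the read and execute bits on the top-level directory (the path below uses the same onyen placeholder as above):

chmod go-rx /21dayscratch/scr/o/n/onyen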

What follows are suggested “best practices” to keep in mind when using scratch space on the Dogwood cluster:

  • Try to avoid using “ls -l” and use “ls” with no options instead.
  • Never have a large number of files (>1000) in a single directory; a quick way to check a directory’s file count is shown after this list.
  • Avoid submitting jobs in a way that will access the same file(s) at the same point(s) in time.
  • Limit the number of processes performing parallel I/O work or other highly intensive I/O jobs.
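
As a quick check on the file-count guideline above, you can count the entries in a directory (the path below is just a placeholder):

ls /21dayscratch/scr/o/n/onyen/some_directory | wc -l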

Mass Storage

The Mass Storage system (also known as StorNext and mounted as /ms) is intended for archiving files and storing very large files. Files located in mass storage are not accessible to jobs running in Slurm. Mass storage is not to be used as a work directory or as a backup location for local disk drives, operating systems, or software. In general, files that change often or directories with more than a thousand files in them will cause performance problems and consume tape resources. The PC backup software provided by UNC might be an alternative solution rather than having to copy your PC files to mass storage.

Mass Storage is similar to an ordinary disk file system in that it keeps an inode (for recording data location, etc.) and data blocks for each file. Files can be moved in and out of mass storage by using simple UNIX commands such as “cp” and “mv” or by using sftp/scp. As the Mass Storage system is optimized for archiving data, your programs should not directly read or write from the Mass Storage system. Instead, copy your data from “~/ms” to scratch space (for example, your /21dayscratch directory described above).
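
For example, to stage an input file from mass storage into your scratch directory before running a job (the file and directory names here are illustrative):

cp ~/ms/inputs/mydata.dat /21dayscratch/scr/o/n/onyen/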

If you are routinely storing large numbers of small files (more than several hundred files at a time) in Mass Storage, you should “tar” or “zip” those smaller files into one tarball or zip file outside of mass storage and then move that tarball or zip file to mass storage. You are not required to compress the tarball or zip file since the mass storage tape drive hardware will compress your data. Reducing the number of individual small files will help the overall performance of the StorNext Mass Storage system. See the more detailed list of things to avoid.
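
For example, to bundle a directory of small files into a single tarball in scratch space and then archive it (the names below are illustrative; the tarball is left uncompressed since the tape hardware compresses data):

cd /21dayscratch/scr/o/n/onyen
tar -cf results.tar results/
mv results.tar ~/ms/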

To access Mass Storage from Dogwood, type:

cd ~/ms

Any files in the scratch space that you wish to save can be moved to mass storage, preferably in tar or zip format.


Development and Application Environment: Modules

The environment on Dogwood is managed with modules. The basic module commands are

module [ add | avail | help | list | load | unload | show ]

When you first log in, you should run

module list

and the response should be

1) null

To add a module for this session only, use “module add [application]” where “[application]” is the name given on the output of the “module avail” command.

To add a module for every time you log in, use “module save”. This does not change your current session, only later logins.
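
Putting these together, a typical first session might look like the following, where “[application]” stands for a name taken from the “module avail” output:

module avail
module add [application]
module list
module save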

Please refer to the Help document on modules for further information.

This page describes the various MPI modules available on Dogwood.

Job Submission

Once you have decided what software you need to use, added those packages to your environment using modules, and successfully compiled your serial or parallel code, you can submit your jobs to run on Dogwood. We use the Slurm workload manager software to schedule and manage jobs that are submitted to run on Dogwood.

To submit a job to run, you will need to use the SLURM sbatch command as shown below. SLURM submits jobs to particular job partitions you specify.

A short description of the partitions available to users in the Dogwood
cluster can be found here.
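
A minimal batch script might look like the sketch below. The partition name, node and task counts, time limit, and executable name are placeholders to adapt to your own work and to the partitions described above; since Dogwood nodes have 44 cores, a two-node MPI job could request 44 tasks per node.

#!/bin/bash
#SBATCH --job-name=my_mpi_job
#SBATCH --partition=<partition>
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=44
#SBATCH --time=02:00:00
#SBATCH --output=my_mpi_job-%j.out

# Launch the MPI executable (a.out is a placeholder) on the allocated tasks
mpirun ./a.out

If this script were saved as myjob.sbatch, you would submit it with:

sbatch myjob.sbatch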

Monitoring and Controlling Jobs:

You can check the status of your submitted SLURM jobs with the command “squeue -u <onyen>” (note that squeue with no options shows jobs from all users; provide your onyen to show just your jobs). The output of that command will include a Job ID, the state of your job (e.g. pending or running), the partition to which you submitted the job, the job name, and other information. See “man squeue” for more information on using this command.
If you need to kill/end a running job, use the “scancel” command:

scancel [JobID]

where JobID is the SLURM job ID displayed by the “squeue” command.

Finally, you can provide an output file to SLURM (“-o filename” in the sbatch command). For regular jobs, if you don’t provide a name the default file name is “slurm-%j.out”, where the “%j” is replaced by the SLURM job ID.
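
For example (the script and file names below are illustrative):

sbatch -o myjob-%j.out myjob.sbatch
squeue -u <onyen>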

Note
Jobs running outside the SLURM partitions will be killed. The logon privileges of users who repeatedly run jobs outside of the SLURM partitions will be suspended.

Additional Help

Be sure to check the Research Computing home page for information about other resources available to you.

We encourage you to attend a training session on “Using Dogwood” and other related topics. Please refer to the Research Computing Training site for further information.

If you have any questions, please feel free to call 962-HELP, email research@unc.edu, or submit an Online Web Ticket.