LSF - Monitoring and Controlling Jobs

LSF provides a number of commands for monitoring and controlling a serial or parallel job after you submit it. The most commonly used commands are described below.

bjobs

The “bjobs” command displays the current status of one or more jobs. If used without any options, it displays all of the pending, running or suspended jobs that you own. If you want to check the status of a specific job, use “bjobs JobID”. For example:

% bjobs 125532
JOBID    USER    STAT  QUEUE  FROM_HOST  EXEC_HOST  JOB_NAME  SUBMIT_TIME
125532  joeuser  DONE  int    bc01-n13   bc03-n01   uname -a  Mar 28 13:37

This was a trivial job consisting of a single command, so it ran very quickly. Its status (STAT) is DONE, which means it completed successfully. If a job returns anything other than a normal completion code, its status will be EXIT.

To check the jobs of all users on a particular machine (here bc03-n01), use:

% bjobs -u all -m bc03-n01
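
When checking a job from a script, the STAT column can be pulled out of the default listing with awk. The following is a minimal sketch, not part of LSF: job_stat is a hypothetical helper name, and it assumes the default column layout shown in the “bjobs 125532” example, with JOBID in the first column and STAT in the third.

```shell
# job_stat: hypothetical helper, not an LSF command.
# Reads default "bjobs" output on standard input and prints the STAT
# value of the job whose ID is given as the first argument.
# Assumes the default layout: JOBID in column 1, STAT in column 3.
job_stat() {
    awk -v id="$1" '$1 == id { print $3 }'
}

# Usage on a system with LSF installed:
#   bjobs 125532 | job_stat 125532
```

For the sample listing of job 125532, the helper prints DONE. It prints nothing if the job ID does not appear in the listing.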

To find out as much as possible about a job, use the long listing (the “-l” option):

emerald% bjobs -l 783947
Job <783947>, User <mason>, Project <noproj>, Status <RUN>, Queue <week>, Job P
                     riority <50>, Command <sleep 60>
Tue Jul 21 10:42:50: Submitted from host <bc09-n13>, CWD </nas/uncch/home/m/a/m
                     ason/conifers>;
Tue Jul 21 10:42:54: Started on <bc01-n04>, Execution Home </afs/isis.unc.edu/h
                     ome/m/a/mason>, Execution CWD </nas/uncch/home/m/a/mason/c
                     onifers>;
Tue Jul 21 10:43:26: Resource usage collected.
                     MEM: 5 Mbytes;  SWAP: 202 Mbytes;  NTHREAD: 5
                     PGID: 19169;  PIDs: 19169 19170 19172 19175 
 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -    8.2    -     -       -     -    -     -     -      -      -  
 loadStop    -    9.4    -     -       -     -    -     -     -      -      -  
          adapter_windows    css0    csss gm_ports nrt_windows ntbl_windows 
 loadSched             -       -       -        -           -            -  
 loadStop              -       -       -        -           -            -  
              poe 
 loadSched     -  
 loadStop      -  

Note: The section labeled “Resource usage collected.” shows the memory used (MEM: 5 Mbytes), the swap space used (SWAP: 202 Mbytes), and the number of threads (NTHREAD: 5).
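
The resource usage values can also be extracted from the long listing in a script. This is a rough sketch under stated assumptions: job_usage is a hypothetical helper name, and it assumes MEM, SWAP and NTHREAD appear together on one line in the form shown in the listing for job 783947. The numbers are printed without units, so they are in whatever unit bjobs reported (Mbytes in the sample).

```shell
# job_usage: hypothetical helper, not an LSF command.
# Reads "bjobs -l" output on standard input and prints the MEM, SWAP
# and NTHREAD values from the resource usage line.
# Assumes a line of the form:
#   MEM: 5 Mbytes;  SWAP: 202 Mbytes;  NTHREAD: 5
job_usage() {
    awk '
        /MEM:/ && /SWAP:/ {
            # Each label is followed by its value in the next field.
            for (i = 1; i < NF; i++) {
                if ($i == "MEM:")     mem = $(i + 1)
                if ($i == "SWAP:")    swap = $(i + 1)
                if ($i == "NTHREAD:") nthread = $(i + 1)
            }
        }
        END {
            if (mem != "") print "mem=" mem, "swap=" swap, "nthread=" nthread
        }
    '
}

# Usage on a system with LSF installed:
#   bjobs -l 783947 | job_usage
```

Nothing is printed for a job that has not yet reported any resource usage.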

bhist

The “bhist” command shows the amount of time a job was pending, running and suspended. If you specify the “-l” option, “bhist” also shows a chronological summary of each change in the status of the job.
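
The pending and running times can be read from the bhist summary in a script. The sketch below is an illustration only: bhist_times is a hypothetical helper name, and it assumes that the default bhist summary line for a job starts with the job ID and ends with the seven columns PEND, PSUSP, RUN, USUSP, SSUSP, UNKWN and TOTAL, in seconds. Check the output on your own cluster before relying on it.

```shell
# bhist_times: hypothetical helper, not an LSF command.
# Reads default "bhist" output on standard input and prints the pending,
# running and total times for the job ID given as the first argument.
# Assumes the line ends with: PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL.
# Counting from the end of the line keeps this working when the job
# name contains spaces.
bhist_times() {
    awk -v id="$1" '$1 == id && NF >= 10 {
        printf "pend=%s run=%s total=%s\n", $(NF - 6), $(NF - 4), $NF
    }'
}

# Usage on a system with LSF installed:
#   bhist 783947 | bhist_times 783947
```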

bhosts

The “bhosts” command shows the availability of machines in the LSF cluster. If the status of a host is “ok”, it has one or more job slots available to accept submitted jobs.
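
To list only the hosts that can currently accept jobs, the bhosts output can be filtered on its status column. This is a minimal sketch: ok_hosts is a hypothetical helper name, and it assumes the default bhosts layout with the host name in the first column and the status in the second.

```shell
# ok_hosts: hypothetical helper, not an LSF command.
# Reads default "bhosts" output on standard input and prints the names
# of the hosts whose status is "ok".
# Assumes the default layout: HOST_NAME in column 1, STATUS in column 2.
ok_hosts() {
    awk '$2 == "ok" { print $1 }'
}

# Usage on a system with LSF installed:
#   bhosts | ok_hosts
```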

bpeek

The “bpeek” command displays the stdout and stderr of a job while it is running. Usually this is only the most recent 10 lines of output.

If you use the “-f” option, “bpeek” continues to show additional lines as they are produced. It uses the “tail -f” command to do this, so you can stop the display of the output at any time by pressing <Ctrl-C>.

bkill

The “bkill” command is used to kill a running, pending or suspended job. You can kill only your own jobs.

More precisely, “bkill” causes LSF to send the SIGINT and SIGTERM signals to a job to give it a chance to clean up, and then LSF sends the SIGKILL signal to kill the job.

Note: If you use the “bjobs” command immediately after the “bkill” command on a running job, it will often show the job as still being in the RUN state. This is normal. There is no need to issue another “bkill” command. Doing so will not kill the job any faster. It sometimes takes several minutes for a “bkill” command to end a large parallel job.
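
In a script, the advice above amounts to issuing “bkill” once and then polling “bjobs” until the job has left the RUN state. The following is a sketch only: kill_and_wait is a hypothetical helper name, the 10-second default polling interval is an arbitrary choice, and the default bjobs column layout (JOBID first, STAT third) is assumed.

```shell
# kill_and_wait: hypothetical helper, not an LSF command.
# Sends a single bkill for the job, then polls bjobs until the job is
# reported as EXIT or DONE, or is no longer listed at all.
#   $1 = job ID
#   $2 = seconds between polls (optional; the default of 10 is arbitrary)
kill_and_wait() {
    kw_id=$1
    kw_interval=${2:-10}

    # Issue bkill exactly once; repeating it does not speed anything up.
    bkill "$kw_id" || return 1

    while :; do
        # Assumes default bjobs layout: JOBID in column 1, STAT in column 3.
        kw_stat=$(bjobs "$kw_id" 2>/dev/null |
                  awk -v id="$kw_id" '$1 == id { print $3 }')
        case $kw_stat in
            EXIT|DONE|"") return 0 ;;
        esac
        sleep "$kw_interval"
    done
}

# Usage on a system with LSF installed:
#   kill_and_wait 783947
```

For a large parallel job the loop may run for several minutes, which matches the behavior described in the note.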

bstop

The “bstop” command suspends a job by sending it the SIGSTOP signal. After you use the “bstop” command on a running job, the status will be USUSP.

If you use the “bstop” command on a pending job, its status will change to PSUSP. Most users will rarely need to use the “bstop” command. Of course, you can stop only your own jobs.

bresume

The “bresume” command resumes a job that was suspended by the “bstop” command. It does this by sending the SIGCONT signal to the job.
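
The status codes mentioned on this page can be summarized in a small lookup function. This is an illustrative sketch: describe_stat is a hypothetical helper name, and the list covers only the common codes (PEND and SSUSP are standard LSF statuses that are not discussed above).

```shell
# describe_stat: hypothetical helper, not an LSF command.
# Prints a short description of a bjobs STAT code given as $1.
describe_stat() {
    case $1 in
        PEND)  echo "pending" ;;
        RUN)   echo "running" ;;
        DONE)  echo "finished with a normal completion code" ;;
        EXIT)  echo "finished abnormally or was killed" ;;
        PSUSP) echo "suspended while pending, for example by bstop" ;;
        USUSP) echo "suspended while running, for example by bstop" ;;
        SSUSP) echo "suspended by LSF" ;;
        *)     echo "unknown status: $1" ;;
    esac
}

# Example:
#   describe_stat USUSP
```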



University of North Carolina - Chapel Hill