Jobs Monitoring

To monitor the progress and status of a job, a user can use the following commands.

user> squeue -o "%.5i %.10P %.10j %.4u %.8N %.5C %.10m %.8M"
JOBID  PARTITION     NAME  USER NODELIST  CPUS MIN_MEMORY        TIME
20216  minibatch   sina-7 user1  node-20    12     48294M 56-23:02:00
20217  minibatch   sina-8 user2  node-21    12     48294M 56-23:01:37
20230  minibatch  sina-11 user3  node-24    12     48294M 55-23:20:20
21072      batch   sina-3 user4   node-2    20    128723M 29-00:50:53
21611      batch   sina-5 user5   node-4    20    128723M 16-12:38:47

This squeue format string gives a detailed view of the current job queue. For an even more detailed view, the following format can be used.

user> squeue -o "%.10i %.10P %.20j %.15u %.15T %.15R %.20E %.2D %.15N %.5C %.15m %.15M"
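
The TIME column printed by squeue above uses the form days-hours:minutes:seconds, which is awkward to sort or compare. The following is a minimal Python sketch for converting such values to seconds. The helper name slurm_time_to_seconds is hypothetical and not part of Slurm, and the sketch assumes only the layouts MM:SS, HH:MM:SS and D-HH:MM:SS.

```python
def slurm_time_to_seconds(value):
    """Convert a Slurm elapsed-time string such as '56-23:02:00' to seconds.

    Assumed layouts: 'MM:SS', 'HH:MM:SS' and 'D-HH:MM:SS'.
    """
    # Split off the optional day count that precedes the dash.
    days, _, clock = value.strip().rpartition("-")
    parts = [int(p) for p in clock.split(":")]
    # Pad missing leading fields (e.g. no hours in 'MM:SS') with zeros.
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    total_hours = (int(days) if days else 0) * 24 + hours
    return total_hours * 3600 + minutes * 60 + seconds


if __name__ == "__main__":
    # Run time of job 20216 from the listing above.
    print(slurm_time_to_seconds("56-23:02:00"))  # 4921320
```

With the values in seconds, jobs can be sorted by run time or compared against a threshold of one's own choosing.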

The following command shows the CPU load of each compute node in the minibatch partition (for other partitions, change the grep pattern accordingly). The node load is sometimes useful for inferring whether a job is consuming more resources than those requested in its sbatch file.

user> sinfo -O all | cut -d\| -f10,11,23,33,34 | grep minibatch
48294|node-25|0/0/12/12 |N/A |minibatch 
48294|node-20|12/0/0/12 |12.19 |minibatch 
48294|node-21|12/0/0/12 |12.19 |minibatch 
48294|node-24|12/0/0/12 |10.92 |minibatch 
48294|node-22|0/12/0/12 |0.00 |minibatch 
48294|node-23|0/12/0/12 |1.00 |minibatch

The third column, CPUS(A/I/O/T), shows the number of allocated (A), idle (I), other (O) and total (T) cores. The fourth column is the CPU load of the node.
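
Comparing the load with the allocated core count by eye becomes tedious on larger partitions. As a rough sketch, the Python function below flags nodes whose load exceeds the number of allocated cores. It assumes exactly the five pipe-separated columns shown above (memory, node, CPUS(A/I/O/T), load, partition). The function name overloaded_nodes and the tolerance of 0.5 are illustrative choices, not Slurm settings.

```python
# Sample text copied from the sinfo output above.
SAMPLE = """\
48294|node-25|0/0/12/12 |N/A |minibatch
48294|node-20|12/0/0/12 |12.19 |minibatch
48294|node-21|12/0/0/12 |12.19 |minibatch
48294|node-24|12/0/0/12 |10.92 |minibatch
48294|node-22|0/12/0/12 |0.00 |minibatch
48294|node-23|0/12/0/12 |1.00 |minibatch
"""


def overloaded_nodes(sinfo_text, tolerance=0.5):
    """Return nodes whose CPU load exceeds allocated cores + tolerance."""
    flagged = []
    for line in sinfo_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 5:
            continue  # not one of the five-column lines assumed here
        _memory, node, cpus, load, _partition = fields
        try:
            allocated = int(cpus.split("/")[0])
            load_value = float(load)
        except ValueError:
            continue  # header line, or a load reported as N/A
        if load_value > allocated + tolerance:
            flagged.append(node)
    return flagged


if __name__ == "__main__":
    print(overloaded_nodes(SAMPLE))  # ['node-23']
```

In the sample, node-23 is flagged because it reports a load of 1.00 with no cores allocated. A flagged node is only a hint to investigate further, since the load also includes system activity.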

The scontrol command can be used to reveal extra information about a running job. For completed jobs, scontrol keeps this information cached only for a short period of time. The example below demonstrates the use of scontrol for the job with id 21611.

user> scontrol show jobs 21611
JobId=21611 JobName=sina-5
   UserId=user5 GroupId=users(1000) MCS_label=N/A
   Priority=4294899077 Nice=0 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=16-12:49:19 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2019-05-05T22:24:44 EligibleTime=2019-05-05T22:24:44
   StartTime=2019-05-05T22:24:45 EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=batch AllocNode:Sid=zorba:18656
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node-4
   BatchHost=node-4
   NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=20,mem=128723M,node=1
   Socks/Node=* NtasksPerN:B:S:C=20:0:*:* CoreSpec=*
   MinCPUsNode=20 MinMemoryNode=128723M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home100/user5/batch5.sh
   WorkDir=/home100/user5
   StdErr=/home100/user5/sina_5_output
   StdIn=/dev/null
   StdOut=/home100/user5/sina_5_output
   Power=
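
Because the scontrol output above is a series of Key=Value tokens separated by whitespace, it can be turned into a dictionary with little code. The following Python sketch assumes that no value contains spaces, which holds for the job shown here but may not hold for every job (a job name or command containing spaces would break it). The function name parse_scontrol is hypothetical.

```python
# A few lines copied from the scontrol output above.
SAMPLE = """\
JobId=21611 JobName=sina-5
   JobState=RUNNING Reason=None Dependency=(null)
   TRES=cpu=20,mem=128723M,node=1
   Power=
"""


def parse_scontrol(text):
    """Parse 'scontrol show job' text into a dict of strings.

    Assumption: values contain no whitespace.
    """
    record = {}
    for token in text.split():
        # Split at the first '=' only, so that values such as
        # 'cpu=20,mem=128723M,node=1' are kept intact.
        key, sep, value = token.partition("=")
        if sep:
            record[key] = value
    return record


if __name__ == "__main__":
    job = parse_scontrol(SAMPLE)
    print(job["JobState"], job["TRES"])
```

Individual fields such as JobState or NodeList can then be looked up by name instead of being searched for in the raw text.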

The sacct command gives accounting and status information for running or completed jobs.

user> sacct -o reqmem,maxrss,ncpus,elapsed,state -j 1350
    ReqMem     MaxRSS      NCPUS    Elapsed      State 
---------- ---------- ---------- ---------- ---------- 
      40Gc                     1   01:53:19  COMPLETED 
      40Gc      7208K          1   01:53:19  COMPLETED

As shown above, the completed job 1350 used only 7208 KB of memory although 40 GB were requested, so most of the reserved memory was wasted. The following example shows the disk I/O volume and peak memory of a running job.

user> sstat -p --format=MaxDiskRead,MaxDiskWrite,MaxRSS,JobID  -j 21611.batch
MaxDiskRead|MaxDiskWrite|MaxRSS|JobID|
4753612.74M|1182762.50M|36274380K|21611.batch|
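
The comparison of requested and used memory made for job 1350 can be expressed as a ratio. The Python sketch below converts the size strings printed by sacct and sstat to bytes and computes that ratio. It assumes 1024-based units and that a trailing c or n on ReqMem means per CPU or per node, as printed by the Slurm version used here; both helper names are hypothetical.

```python
import re

# Assumption: Slurm size suffixes are powers of 1024.
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(size):
    """Convert strings such as '7208K', '40Gc' or '4753612.74M' to bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMGT])[cn]?", size.strip())
    if match is None:
        raise ValueError(f"unrecognised size: {size!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


def memory_utilisation(req_mem, max_rss, ncpus=1):
    """Return MaxRSS as a fraction of the requested memory."""
    requested = to_bytes(req_mem)
    if req_mem.strip().endswith("c"):
        requested *= ncpus  # a per-CPU request scales with the core count
    return to_bytes(max_rss) / requested


if __name__ == "__main__":
    # Values of job 1350 from the sacct output above.
    print(f"{memory_utilisation('40Gc', '7208K', ncpus=1):.6%}")
```

For job 1350 this gives a utilisation of roughly 0.017 %, which suggests that a much smaller memory request would have been sufficient.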

For more details, consult the manual pages of sinfo, squeue, scontrol, sacct and sstat.