Jobs Monitoring

To monitor the progress and status of a job, a user can use the following commands.

user> squeue -o "%.5i %.10P %.10j %.4u %.8N %.5C %.10m %.8M"
JOBID  PARTITION    NAME  USER NODELIST CPUS MIN_MEMORY        TIME
20216  minibatch  sina-7 user1  node-20   12     48294M 56-23:02:00
20217  minibatch  sina-8 user2  node-21   12     48294M 56-23:01:37
20230  minibatch sina-11 user3  node-24   12     48294M 55-23:20:20
21072      batch  sina-3 user4   node-2   20    128723M 29-00:50:53
21611      batch  sina-5 user5   node-4   20    128723M 16-12:38:47

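This squeue format string gives a detailed view of the current job queue. For an even more detailed view, one can use the following format, which adds the job state (%T), the pending reason or node list (%R), the remaining dependencies (%E) and the number of nodes (%D).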

user> squeue -o "%.10i %.10P %.20j %.15u %.15T %.15R %.20E %.2D %.15N %.5C %.15m %.15M"
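Output in the format of the first squeue example above can also be aggregated with standard tools. The following is a minimal sketch, not part of Slurm: the function name cpus_per_partition is hypothetical, and it assumes that the header line contains the PARTITION and CPUS columns and that no field (for example a job name) contains spaces.

```shell
# Hypothetical helper: sum the CPUS column per partition.
# Reads squeue output (header line included) on stdin.
cpus_per_partition() {
    awk '
    NR == 1 {
        # Locate the columns by header name so the format string may change.
        for (i = 1; i <= NF; i++) {
            if ($i == "PARTITION") p = i
            if ($i == "CPUS") c = i
        }
        next
    }
    p && c { total[$p] += $c }
    END { for (name in total) print name, total[name] }
    ' | sort
}
```

It could be used as: squeue -o "%.5i %.10P %.10j %.4u %.8N %.5C %.10m %.8M" | cpus_per_partition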

The following command shows the CPU load of each compute node in the minibatch partition (for other partitions, change the grep pattern accordingly). The node load is sometimes useful for inferring whether a job is consuming more resources than those requested in its sbatch file.

user> sinfo -O all | cut -d\| -f10,11,23,33,34 | grep minibatch
48294|node-25|0/0/12/12 |N/A |minibatch 
48294|node-20|12/0/0/12 |12.19 |minibatch 
48294|node-21|12/0/0/12 |12.19 |minibatch 
48294|node-24|12/0/0/12 |10.92 |minibatch 
48294|node-22|0/12/0/12 |0.00 |minibatch 
48294|node-23|0/12/0/12 |1.00 |minibatch

The third field, CPUS(A/I/O/T), shows the number of allocated (A), idle (I), other (O) and total (T) cores. The fourth field is the CPU load of the node.
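The comparison between load and allocated cores can be automated. The sketch below is only an illustration: the function name overloaded_nodes is hypothetical, and it assumes exactly the five pipe-separated fields produced by the cut command above. It prints every node whose CPU load is higher than its number of allocated cores and skips nodes that report N/A.

```shell
# Hypothetical helper: reads lines such as
#   48294|node-20|12/0/0/12 |12.19 |minibatch
# on stdin and prints the nodes whose load exceeds the allocated cores.
overloaded_nodes() {
    awk -F'|' '
    {
        gsub(/ /, "", $3)          # strip the padding added by sinfo
        gsub(/ /, "", $4)
        if ($4 == "N/A") next      # no load reported for this node
        split($3, cpus, "/")       # cpus[1] holds the allocated cores
        if ($4 + 0 > cpus[1] + 0)
            printf "%s load=%s allocated=%s\n", $2, $4, cpus[1]
    }'
}
```

It could be used as: sinfo -O all | cut -d\| -f10,11,23,33,34 | grep minibatch | overloaded_nodes. A load slightly above the allocation, as for node-20 above, is usually harmless; a load far above it suggests that a job runs more processes or threads than it requested.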

The scontrol command can be used to reveal extra information about a running job. For completed jobs, scontrol keeps this information cached only for a short period of time. The example below demonstrates the use of scontrol for the job with id 21611.

user> scontrol show jobs 21611
JobId=21611 JobName=sina-5
   UserId=user5 GroupId=users(1000) MCS_label=N/A
   Priority=4294899077 Nice=0 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=16-12:49:19 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2019-05-05T22:24:44 EligibleTime=2019-05-05T22:24:44
   StartTime=2019-05-05T22:24:45 EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=batch AllocNode:Sid=zorba:18656
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node-4
   BatchHost=node-4
   NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=20,mem=128723M,node=1
   Socks/Node=* NtasksPerN:B:S:C=20:0:*:* CoreSpec=*
   MinCPUsNode=20 MinMemoryNode=128723M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home100/user5/batch5.sh
   WorkDir=/home100/user5
   StdErr=/home100/user5/sina_5_output
   StdIn=/dev/null
   StdOut=/home100/user5/sina_5_output
   Power=
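Since the scontrol output consists of Key=Value pairs, a single value can be picked out with a small filter. The following is a rough sketch under the assumption that values contain no spaces (a Command with arguments, for instance, would be cut short); the function name job_field is hypothetical.

```shell
# Hypothetical helper: print the value of one key from the
# Key=Value pairs that "scontrol show job" writes.
# $1 is the key to look up, e.g. JobState or RunTime.
job_field() {
    tr ' ' '\n' |
    awk -F'=' -v key="$1" '
    $1 == key {
        sub(/^[^=]*=/, "")   # drop "Key=" and keep the rest, even if it contains "="
        print
        exit
    }'
}
```

It could be used as: scontrol show jobs 21611 | job_field JobState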

The sacct command gives various status information about completed or running jobs.

user> sacct -o reqmem,maxrss,ncpus,elapsed,state -j 1350
    ReqMem     MaxRSS      NCPUS    Elapsed      State 
---------- ---------- ---------- ---------- ---------- 
      40Gc                     1   01:53:19  COMPLETED 
      40Gc      7208K          1   01:53:19  COMPLETED
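To quantify how much of the requested memory a job actually used, the ReqMem and MaxRSS values from output like the above can be converted to a common unit and divided. The sketch below is an illustration only: the function names to_kb and mem_efficiency are hypothetical, the units are assumed to be multiples of 1024, a trailing c on ReqMem is taken to mean "per CPU" and a trailing n "per node", and the job is assumed to run on a single node.

```shell
# Hypothetical helper: convert a size such as 7208K, 128723M or 40G to kilobytes.
to_kb() {
    awk -v v="$1" 'BEGIN {
        unit = substr(v, length(v))
        num = v + 0                      # numeric prefix of the value
        if (unit == "K") f = 1
        else if (unit == "M") f = 1024
        else if (unit == "G") f = 1024 * 1024
        else if (unit == "T") f = 1024 * 1024 * 1024
        else { print "unknown unit: " v > "/dev/stderr"; exit 1 }
        printf "%.0f\n", num * f
    }'
}

# Hypothetical helper: mem_efficiency REQMEM MAXRSS NCPUS
# prints MaxRSS as a percentage of the total requested memory.
mem_efficiency() {
    req=$1
    used=$2
    ncpus=$3
    case $req in
        *c) scale=$ncpus; req=${req%c} ;;   # request is per CPU
        *n) scale=1;      req=${req%n} ;;   # request is per node
        *)  scale=1 ;;
    esac
    req_kb=$(to_kb "$req") || return 1
    used_kb=$(to_kb "$used") || return 1
    awk -v u="$used_kb" -v r="$req_kb" -v s="$scale" \
        'BEGIN { printf "%.2f%%\n", 100 * u / (r * s) }'
}
```

For the values of job 1350, mem_efficiency 40Gc 7208K 1 prints 0.02%.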

As shown above, the completed job 1350 used only 7208 KB of memory although 40 GB per CPU had been requested (the c suffix of ReqMem), which is a waste of memory. The following example presents the I/O data volume of a running job.

user> sstat -p --format=MaxDiskRead,MaxDiskWrite,MaxRSS,JobID  -j 21611.batch
MaxDiskRead|MaxDiskWrite|MaxRSS|JobID|
4753612.74M|1182762.50M|36274380K|21611.batch|
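The -p option produces pipe-separated output with a header line, which is convenient for scripts. As a small sketch (the function name parsable_column is hypothetical), one column can be selected by its header name instead of its position:

```shell
# Hypothetical helper: print one column, selected by header name,
# from the pipe-separated output of "sstat -p" or "sacct -p".
parsable_column() {
    awk -F'|' -v col="$1" '
    NR == 1 {
        for (i = 1; i <= NF; i++)
            if ($i == col) idx = i
        next
    }
    idx { print $idx }'
}
```

It could be used as: sstat -p --format=MaxDiskRead,MaxDiskWrite,MaxRSS,JobID -j 21611.batch | parsable_column MaxRSS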

For more details, consult the manual pages of sinfo, squeue, scontrol, sacct and sstat.

Jobs Management