To monitor the progress and status of a job, a user can use the following commands.
user> squeue -o "%.5i %.10P %.10j %.4u %.8N %.5C %.10m %.8M"
JOBID PARTITION NAME USER NODELIST CPUS MIN_MEMORY TIME
20216 minibatch sina-7 user1 node-20 12 48294M 56-23:02:00
20217 minibatch sina-8 user2 node-21 12 48294M 56-23:01:37
20230 minibatch sina-11 user3 node-24 12 48294M 55-23:20:20
21072 batch sina-3 user4 node-2 20 128723M 29-00:50:53
21611 batch sina-5 user5 node-4 20 128723M 16-12:38:47
This squeue output format gives a detailed view of the current job queue. For an even more detailed view, which also shows the job state, the pending reason, dependencies and the node count, one can use the following.
user> squeue -o "%.10i %.10P %.20j %.15u %.15T %.15R %.20E %.2D %.15N %.5C %.15m %.15M"
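The TIME column (%M) in the listings above uses the form days-hours:minutes:seconds, which is awkward to compare or sort. The following is a minimal sketch of a helper that converts such a value to seconds. The function name elapsed_seconds is hypothetical, and the sketch assumes that shorter jobs are shown as HH:MM:SS or MM:SS.

```shell
# Convert a squeue elapsed-time string to seconds.
# Accepts D-HH:MM:SS, HH:MM:SS or MM:SS (assumed forms of the %M field).
elapsed_seconds() {
    printf '%s\n' "$1" | awk '{
        days = 0; t = $1
        # Split off the optional "days-" prefix.
        if (index(t, "-")) { split(t, p, "-"); days = p[1]; t = p[2] }
        n = split(t, f, ":")
        if (n == 3)      secs = f[1] * 3600 + f[2] * 60 + f[3]
        else if (n == 2) secs = f[1] * 60 + f[2]
        else             secs = f[1]
        printf "%.0f\n", days * 86400 + secs
    }'
}

# Possible usage on a cluster (not executed here):
#   squeue -h -o "%i %M" | while read -r id t; do
#       echo "$id $(elapsed_seconds "$t")"
#   done
```

For example, the value 56-23:02:00 of job 20216 corresponds to 4921320 seconds.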
The following command shows the CPU load of each computing node (here for the minibatch partition; for other partitions, change the grep pattern accordingly). The node load is sometimes useful for inferring whether a job is consuming more resources than those specified in the sbatch file.
user> sinfo -O all | cut -d\| -f10,11,23,33,34 | grep minibatch
48294|node-25|0/0/12/12 |N/A |minibatch
48294|node-20|12/0/0/12 |12.19 |minibatch
48294|node-21|12/0/0/12 |12.19 |minibatch
48294|node-24|12/0/0/12 |10.92 |minibatch
48294|node-22|0/12/0/12 |0.00 |minibatch
48294|node-23|0/12/0/12 |1.00 |minibatch
The third field, CPUS(A/I/O/T), shows the number of allocated (A), idle (I), other (O) and total (T) cores. The fourth field is the CPU load of the node.
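The comparison between load and allocated cores can be automated. The sketch below reads lines in the five-field layout shown above and prints the nodes whose reported load is higher than their number of allocated cores. The function name overloaded_nodes is hypothetical. The field positions are taken from the example output and may differ between Slurm versions.

```shell
# Print nodes whose CPU load exceeds their allocated core count.
# Expected input, one node per line (layout of the example above):
#   <field1>|<node>|<A/I/O/T>|<load>|<partition>
overloaded_nodes() {
    awk -F'|' '{
        gsub(/ /, "", $3); gsub(/ /, "", $4)   # strip column padding
        if ($4 == "N/A") next                   # node reports no load
        split($3, cpus, "/")                    # cpus[1] = allocated cores
        if ($4 + 0 > cpus[1] + 0)
            printf "%s load=%s allocated=%s\n", $2, $4, cpus[1]
    }'
}

# Possible usage on a cluster (not executed here):
#   sinfo -O all | cut -d\| -f10,11,23,33,34 | grep minibatch | overloaded_nodes
```

With the sample output above, node-20 and node-21 would be reported (load 12.19 on 12 allocated cores), and so would node-23, which shows a load of 1.00 with no allocated cores. A load slightly above the allocation is not necessarily a problem, so the result is only a hint about where to look.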
The scontrol command can be used to reveal additional information about a running job. For completed jobs, scontrol keeps this information in its cache only for a short period of time. The example below demonstrates the use of scontrol for the job with ID 21611.
user> scontrol show jobs 21611
JobId=21611 JobName=sina-5
UserId=user5 GroupId=users(1000) MCS_label=N/A
Priority=4294899077 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=16-12:49:19 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2019-05-05T22:24:44 EligibleTime=2019-05-05T22:24:44
StartTime=2019-05-05T22:24:45 EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=batch AllocNode:Sid=zorba:18656
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node-4
BatchHost=node-4
NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=20,mem=128723M,node=1
Socks/Node=* NtasksPerN:B:S:C=20:0:*:* CoreSpec=*
MinCPUsNode=20 MinMemoryNode=128723M MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home100/user5/batch5.sh
WorkDir=/home100/user5
StdErr=/home100/user5/sina_5_output
StdIn=/dev/null
StdOut=/home100/user5/sina_5_output
Power=
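Since the scontrol output consists of Key=Value pairs, a single value can be picked out in a script. The following is a minimal sketch. The function name job_field is hypothetical, and the sketch assumes that the requested value contains no spaces, which holds for fields such as RunTime, NodeList or TRES in the example above.

```shell
# Print the value of one Key=Value pair read from standard input.
# $1 is the key, e.g. RunTime or NodeList.
job_field() {
    # Put every Key=Value pair on its own line, then match the key exactly.
    tr -s ' \t' '\n' | awk -F'=' -v key="$1" '
        $1 == key && !found {
            # Print everything after "key=", so values that themselves
            # contain "=" (such as TRES) are kept whole.
            print substr($0, length(key) + 2)
            found = 1
        }'
}

# Possible usage on a cluster (not executed here):
#   scontrol show jobs 21611 | job_field RunTime
```

Applied to the output above, job_field RunTime would print 16-12:49:19 and job_field TRES would print cpu=20,mem=128723M,node=1.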
The sacct command gives status information about completed or running jobs.
user> sacct -o reqmem,maxrss,ncpus,elapsed,state -j 1350
ReqMem MaxRSS NCPUS Elapsed State
---------- ---------- ---------- ---------- ----------
40Gc 1 01:53:19 COMPLETED
40Gc 7208K 1 01:53:19 COMPLETED
As shown above, the completed job 1350 used at most 7208 KB of memory (MaxRSS), while 40 GB per core had been requested (ReqMem 40Gc, on 1 core), which is a waste of memory. The following example presents the I/O data sizes for a running job.
user> sstat -p --format=MaxDiskRead,MaxDiskWrite,MaxRSS,JobID -j 21611.batch
MaxDiskRead|MaxDiskWrite|MaxRSS|JobID|
4753612.74M|1182762.50M|36274380K|21611.batch|
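The sizes reported by sacct and sstat carry unit suffixes, so comparing them requires a conversion. The sketch below converts such values to KB and computes how much of the requested memory a job actually used, as in the sacct example for job 1350. The function names to_kib and mem_used_percent are hypothetical. The sketch assumes 1024-based units and that a trailing c in ReqMem means "per core" and a trailing n "per node".

```shell
# Convert a size such as 7208K, 40G or 4753612.74M to KB (1024-based).
# A trailing c or n (per-core / per-node marker in ReqMem) is dropped.
to_kib() {
    printf '%s\n' "$1" | awk '{
        s = $1
        sub(/[cn]$/, "", s)
        unit = substr(s, length(s), 1)
        num  = substr(s, 1, length(s) - 1) + 0
        if      (unit == "K") f = 1
        else if (unit == "M") f = 1024
        else if (unit == "G") f = 1024 * 1024
        else if (unit == "T") f = 1024 * 1024 * 1024
        else exit 1                      # unknown or missing unit suffix
        printf "%.0f\n", num * f
    }'
}

# Percentage of the requested memory that was actually used.
# $1 = MaxRSS, $2 = ReqMem, $3 = NCPUS (only used when ReqMem ends in c).
mem_used_percent() {
    used=$(to_kib "$1")
    req=$(to_kib "$2")
    case $2 in
        *c) req=$((req * ${3:-1})) ;;    # per-core request: scale by cores
    esac
    awk -v u="$used" -v r="$req" 'BEGIN { printf "%.3f\n", 100 * u / r }'
}
```

For job 1350, mem_used_percent 7208K 40Gc 1 yields 0.017, i.e. the job used roughly 0.017 % of the memory it requested.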
For more details, consult the manual pages of sinfo, squeue, scontrol, sacct and sstat.