[ptp-dev] Resource Management Design Issue: Status Codes
PTPers,
I sent out the previous resource management design update without
first checking the current state of affairs within PTP. I looked at
what LSF and SLURM provide for node and job status conditions, but
not at what PTP provides. Here I'd like to contrast the three status
systems of PTP, LSF, and SLURM.
NODE STATUS SYSTEMS:
PTP:
DOWN
UNALLOCATED
ALLOCATED TO YOU EXCLUSIVELY, BUT IDLE
ALLOCATED TO YOU SHARED, BUT IDLE
ALLOCATED TO SOMEONE ELSE EXCLUSIVELY
ALLOCATED TO SOMEONE ELSE SHARED
JOB RUNNING
JOB STOPPED
ERROR
UNKNOWN/UNDEFINED
LSF: (from bhosts command man page)
only when a host is in ok status, can batch jobs be
dispatched to it. The possible values for host status
are as follows:
ok
The host is available to accept batch jobs.
unavail
The host is down, or the Load Information Manager
(LIM) and the slave batch daemon (sbatchd) on the
host are unreachable.
unreach
The LIM on the host is running but the slave batch
daemon (sbatchd) is unreachable.
closed
The host is not allowed to accept any remote batch
job. There are several reasons causing the host
to be closed. The long format shown by the -l
option gives the possible reasons:
closed_Adm
The host is closed by the LSF administrator
or root (see badmin(8)). No job can be
dispatched to it but jobs that are executing
on it will not be affected.
closed_Lock
The host is locked by the LSF administrator
or root (see lsadmin(8)). All batch jobs on
the host are suspended by LSF.
closed_Wind
The host is closed by its dispatch windows,
which are defined in the configuration file
lsb.hosts(5). All batch jobs on the host are
suspended by the LSF system.
closed_Full
The configured maximum number of batch job
slots on the host has been reached (see MAX
field below).
closed_Excl
The host is currently running an exclusive
job.
closed_Busy
The host is overloaded because some load
indices go beyond the configured thresholds
(see lsb.hosts(5)). The displayed thresholds
that cause the host to be busy are preceded
by a `*'.
closed_LIM
The LIM on the host is unreachable, but the
sbatchd is ok.
SLURM: (from SLURM documentation on SINFO command)
ALLOC[ATED] means that this node (or set, or
partition) has already been assigned
to one or more jobs.
COMP[LETING] means that job(s) assigned to this node
are already terminating. COMPLETING
disappears when all of the job's
processes as well as the SLURM epilog
program (if any) have terminated. See
the slurm.conf MAN page for details.
DOWN means that this node is unavailable for
jobs. SLURM automatically declares nodes
DOWN if some failure occurs. Also,
system administrators may declare a node
DOWN. If a node resumes normal
operation, SLURM can automatically
return it to service. See
ReturnToService and SlurmdTimeout
descriptions in the slurm.conf MAN page
for more details.
DRAIN[ED] means that this node has been declared
unavailable by a system administrator
using SCONTROL's UPDATE command.
DRAINING[DRNG] means that this node is currently
running a job, but it will not be
allocated to additional jobs. The node
state changes to DRAINED when the last
job on it completes. System
administrators put nodes in this state
by using SCONTROL's UPDATE command.
IDLE means that this node is not currently
assigned to any jobs and is available
for use.
UNK[NOWN] means that the SLURM controller has just
started and hence this node's real
status has not yet been determined.
Because the LSF and SLURM node status systems overlap so little, I
chose just three states for the node status: UP, DOWN, and
UNAVAILABLE. Perhaps I should have added UNKNOWN. I'm also thinking
about including a state for a machine that is up but fully allocated
to others. Do you have a good name for that one? Maybe that just
falls under UNAVAILABLE.
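To make the proposed node abstraction concrete, here is a minimal
sketch of how the native LSF and SLURM node states listed above could
collapse into UP, DOWN, UNAVAILABLE, and UNKNOWN. The names NodeState
and abstract_node_state are hypothetical, and every individual mapping
(for example, treating SLURM's ALLOCATED as UNAVAILABLE) is an
assumption open to discussion, not an agreed design.

```python
from enum import Enum


class NodeState(Enum):
    """Proposed abstract node states (UNKNOWN is the tentative fourth)."""
    UP = "UP"
    DOWN = "DOWN"
    UNAVAILABLE = "UNAVAILABLE"
    UNKNOWN = "UNKNOWN"


# Assumed mapping from the LSF bhosts status values quoted above.
LSF_NODE_STATES = {
    "ok": NodeState.UP,
    "unavail": NodeState.DOWN,
    "unreach": NodeState.UNAVAILABLE,
    "closed": NodeState.UNAVAILABLE,
}

# Assumed mapping from the SLURM sinfo node states quoted above.
SLURM_NODE_STATES = {
    "IDLE": NodeState.UP,
    "ALLOCATED": NodeState.UNAVAILABLE,   # up, but assigned to jobs
    "COMPLETING": NodeState.UNAVAILABLE,
    "DOWN": NodeState.DOWN,
    "DRAINED": NodeState.UNAVAILABLE,
    "DRAINING": NodeState.UNAVAILABLE,
    "UNKNOWN": NodeState.UNKNOWN,
}


def abstract_node_state(system, native):
    """Translate a native node state into the abstract one.

    Anything unrecognized falls back to NodeState.UNKNOWN.
    """
    if system == "LSF":
        # closed_Adm, closed_Full, ... all collapse to plain "closed".
        key = native.split("_", 1)[0]
        return LSF_NODE_STATES.get(key, NodeState.UNKNOWN)
    if system == "SLURM":
        return SLURM_NODE_STATES.get(native.upper(), NodeState.UNKNOWN)
    return NodeState.UNKNOWN
```

With this sketch, abstract_node_state("LSF", "closed_Excl") and
abstract_node_state("SLURM", "DRAINING") both come back as
NodeState.UNAVAILABLE, which shows how much detail the three-state
scheme throws away.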
JOB STATUS SYSTEMS:
There were fewer variations in Job status systems.
PTP's current process states do not have the concept of PENDING. The
Pending state shows up in both LSF and SLURM, and I'm sure is fairly
universal in resource management systems.
PTP's STARTING state is found in neither SLURM nor LSF.
PTP, LSF, and SLURM each define distinct states for running, normal
exit, and abnormal exit. PTP distinguishes EXITED WITH SIGNAL from
ERROR. SLURM distinguishes FAILED (non-zero exit code) from NODE_FAIL
(one or more nodes failed during the run) and TIMEOUT (the job reached
its time limit). LSF does not distinguish a job that exited with a
signal from other abnormal job exits.
LSF has more job states than either SLURM or PTP. LSF has three
versions of the status for a suspended job, a state of UNKNOWN, and
a ZOMBI state.
I think that these job states are best abstracted to PENDING, RUNNING,
SUSPENDED, DONE, EXIT (abnormal), and UNKNOWN. These do not mesh with
the current PTP process status codes. Should I try to use the existing
PTP process status codes or should I use this other abstraction?
(A note: The PTP process status codes were intended for the monitoring
of a "process." The job status codes discussed here are intended for
the monitoring of a "job." Should there be a difference?)
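The job abstraction proposed above can be sketched the same way. The
names JobState and abstract_job_state are hypothetical. The native
state names are the ones I understand LSF's bjobs and SLURM's squeue
to report, and the choices of where each one lands (ZOMBI under
UNKNOWN, CANCELLED under EXIT) are assumptions for discussion only.

```python
from enum import Enum


class JobState(Enum):
    """Proposed abstract job states."""
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUSPENDED = "SUSPENDED"
    DONE = "DONE"
    EXIT = "EXIT"        # abnormal exit
    UNKNOWN = "UNKNOWN"


# Assumed mapping from LSF job states.
LSF_JOB_STATES = {
    "PEND": JobState.PENDING,
    "RUN": JobState.RUNNING,
    "PSUSP": JobState.SUSPENDED,   # the three suspended variants
    "USUSP": JobState.SUSPENDED,
    "SSUSP": JobState.SUSPENDED,
    "DONE": JobState.DONE,
    "EXIT": JobState.EXIT,
    "UNKWN": JobState.UNKNOWN,
    "ZOMBI": JobState.UNKNOWN,     # assumption: no better abstract fit
}

# Assumed mapping from SLURM job states.
SLURM_JOB_STATES = {
    "PENDING": JobState.PENDING,
    "RUNNING": JobState.RUNNING,
    "SUSPENDED": JobState.SUSPENDED,
    "COMPLETED": JobState.DONE,
    "CANCELLED": JobState.EXIT,    # assumption: treated as abnormal
    "FAILED": JobState.EXIT,
    "NODE_FAIL": JobState.EXIT,
    "TIMEOUT": JobState.EXIT,
}

JOB_STATE_TABLES = {"LSF": LSF_JOB_STATES, "SLURM": SLURM_JOB_STATES}


def abstract_job_state(system, native):
    """Translate a native job state into the abstract one.

    Unknown systems or states fall back to JobState.UNKNOWN.
    """
    table = JOB_STATE_TABLES.get(system, {})
    return table.get(native.upper(), JobState.UNKNOWN)
```

Note that SLURM's FAILED, NODE_FAIL, and TIMEOUT all land on EXIT
here, so a caller that cares about the reason for an abnormal exit
would still need the native state alongside the abstract one.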
Thank you,
Randy