[ptp-dev] Resource Management Design Issue: Status Codes

From: "Randy M. Roberts" <rsqrd@xxxxxxxx>
Date: Wed, 26 Apr 2006 10:18:57 -0600

PTPers,

I sent out the previous resource management design update without
first checking the current state of affairs within PTP.  I looked at
what LSF and SLURM provide for node and job status conditions instead
of what PTP provides.  Here I'd like to contrast the three status
systems: PTP, LSF, and SLURM.

NODE STATUS SYSTEMS:

	PTP:
		DOWN
		UNALLOCATED
		ALLOCATED TO YOU EXCLUSIVELY, BUT IDLE
		ALLOCATED TO YOU SHARED, BUT IDLE
		ALLOCATED TO SOMEONE ELSE EXCLUSIVELY
		ALLOCATED TO SOMEONE ELSE SHARED
		JOB RUNNING
		JOB STOPPED
		ERROR
		UNKNOWN/UNDEFINED

	LSF:    (from bhosts command man page)

	       	only when a host is in ok status, can batch jobs be
	       	dispatched to it. The possible values for host status
	       	are as follows:

	       ok
                    The host is available to accept batch jobs.

	       unavail
		    The host is down, or the Load Information Manager
		    (LIM) and the slave batch daemon (sbatchd) on the
		    host are unreachable.

	       unreach
		    The LIM on the host is running but the slave batch
		    daemon (sbatchd) is unreachable.

	       closed
		    The host is not allowed to accept any remote batch
		    job.  There are several reasons causing the host
		    to be closed. The long format shown by the -l
		    option gives the possible reasons:

		    closed_Adm
                         The host is closed by the LSF administrator
                         or root (see badmin(8)).  No job can be
                         dispatched to it but jobs that are executing
                         on it will not be affected.

		    closed_Lock
			 The host is locked by the LSF administrator
			 or root (see lsadmin(8)).  All batch jobs on
			 the host are suspended by LSF.

		    closed_Wind
			 The host is closed by its dispatch windows,
			 which are defined in the configuration file
			 lsb.hosts(5). All batch jobs on the host are
			 suspended by the LSF system.

		    closed_Full
			 The configured maximum number of batch job
			 slots on the host has been reached (see MAX
			 field below).

		    closed_Excl
			 The host is currently running an exclusive
			 job.

		    closed_Busy
			 The host is overloaded because some load
			 indices go beyond the configured thresholds
			 (see lsb.hosts(5)).  The displayed thresholds
			 that cause the host to be busy are preceded
			 by a `*'.

		    closed_LIM
			 The LIM on the host is unreachable, but the
			 sbatchd is ok.

	SLURM:  (from SLURM documentation on SINFO command)

		ALLOC[ATED]   means that this node (or set, or
		              partition) has already been assigned
			      to one or more jobs.

		COMP[LETING]  means that job(s) assigned to this node
                              are already terminating. COMPLETING
                              disappears when all of the job's
                              processes as well as the SLURM epilog
                              program (if any) have terminated. See
                              the slurm.conf MAN page for details.

		DOWN          means that this node is unavailable for
		              jobs. SLURM automatically declares nodes
		              DOWN if some failure occurs. Also,
		              system administrators may declare a node
		              DOWN. If a node resumes normal
		              operation, SLURM can automatically
		              return it to service. See
		              ReturnToService and SlurmdTimeout
		              descriptions in the slurm.conf MAN page
		              for more details.

		DRAIN[ED]     means that this node has been declared
                              unavailable by a system administrator
                              using SCONTROL's UPDATE command.

                DRAINING[DRNG] means that this node is currently
                               running a job, but it will not be
                               allocated to additional jobs. The node
                               state changes to DRAINED when the last
                               job on it completes. System
                               administrators put nodes in this state
                               by using SCONTROL's UPDATE command.

		IDLE          means that this node is not currently
                              assigned to any jobs and is available
                              for use.

                UNK[NOWN]     means that the SLURM controller has just
			      started and hence this node's real
			      status has not yet been determined.


Because the LSF and SLURM node status systems overlap so little, I
chose just three states for the node status: UP, DOWN, and
UNAVAILABLE.  Perhaps I should have added UNKNOWN.  I'm thinking about
including a state that represents the machine being up, but fully
allocated to others.  Do you have a good name for that one?  Maybe
that just falls under UNAVAILABLE.
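
As a rough sketch of how the UP / DOWN / UNAVAILABLE (plus UNKNOWN)
abstraction might sit on top of the two systems, here is a hypothetical
Java enum.  The type name NodeStatus, the methods fromLsf and
fromSlurm, and every mapping choice are assumptions made for
illustration; none of this is existing PTP code.  In particular,
folding SLURM's ALLOCATED and LSF's closed_* states into UNAVAILABLE is
only one possible answer to the naming question above.

```java
import java.util.Locale;

// Hypothetical abstraction of node status; not part of PTP.
enum NodeStatus {
    UP, DOWN, UNAVAILABLE, UNKNOWN;

    /** Maps an LSF host status as printed by bhosts (e.g. "ok", "closed_Full"). */
    static NodeStatus fromLsf(String status) {
        if (status == null) {
            return UNKNOWN;
        }
        String s = status.trim().toLowerCase(Locale.ROOT);
        if (s.equals("ok")) {
            return UP;
        }
        if (s.equals("unavail")) {
            return DOWN;
        }
        // Assumption: a host that is running but cannot take new jobs
        // (unreach, closed, closed_Adm, closed_Full, ...) is UNAVAILABLE.
        if (s.equals("unreach") || s.startsWith("closed")) {
            return UNAVAILABLE;
        }
        return UNKNOWN;
    }

    /** Maps a SLURM node state as printed by sinfo, long or short form. */
    static NodeStatus fromSlurm(String state) {
        if (state == null) {
            return UNKNOWN;
        }
        switch (state.trim().toUpperCase(Locale.ROOT)) {
            case "IDLE":
                return UP;
            case "DOWN":
                return DOWN;
            // Assumption: allocated, completing, and draining nodes are
            // all "up but not available to a new job".
            case "ALLOC": case "ALLOCATED":
            case "COMP": case "COMPLETING":
            case "DRAIN": case "DRAINED":
            case "DRNG": case "DRAINING":
                return UNAVAILABLE;
            default:
                // Covers UNK/UNKNOWN and anything unrecognized.
                return UNKNOWN;
        }
    }
}
```

With this sketch, NodeStatus.fromLsf("closed_Excl") and
NodeStatus.fromSlurm("ALLOC") both yield UNAVAILABLE, which shows how
much detail the coarse abstraction discards.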




JOB STATUS SYSTEMS:

There were fewer variations in Job status systems.

PTP's current process states do not have the concept of PENDING.  The
Pending state shows up in both LSF and SLURM, and I'm sure is fairly
universal in resource management systems.

PTP's STARTING state is found in neither SLURM nor LSF.

PTP, LSF, and SLURM each define distinct states for running, normal
exit, and abnormal exit.  PTP distinguishes EXITED WITH SIGNAL from
ERROR.  SLURM distinguishes FAILED (exit with a non-zero code) from
NODE_FAIL (one or more nodes failed during the run) and TIMEOUT (the
job reached its time limit).  LSF does not distinguish a job that
exited with a signal from other abnormal job exits.

LSF has more job states than either SLURM or PTP.  LSF has three
versions of the status for a suspended job, a state of UNKNOWN, and
a ZOMBI state.

I think that these job states are best abstracted to PENDING, RUNNING,
SUSPENDED, DONE, EXIT (abnormal), and UNKNOWN.  These do not mesh with
the current PTP process status codes.  Should I try to use the existing
PTP process status codes or should I use this other abstraction?
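
To make the proposed job abstraction concrete, here is a hypothetical
Java enum with the six states listed above.  The type name JobStatus
and the methods fromLsf and fromSlurm are assumptions, not existing PTP
code.  The LSF abbreviations (PEND, PSUSP, USUSP, SSUSP, UNKWN) are
given as recalled from the bjobs man page and should be checked against
the installed version; treating ZOMBI as UNKNOWN and CANCELLED as EXIT
are illustrative choices only.

```java
import java.util.Locale;

// Hypothetical abstraction of job status; not part of PTP.
enum JobStatus {
    PENDING, RUNNING, SUSPENDED, DONE, EXIT, UNKNOWN;

    /** Maps an LSF job state abbreviation (assumed bjobs STAT values). */
    static JobStatus fromLsf(String stat) {
        if (stat == null) {
            return UNKNOWN;
        }
        switch (stat.trim().toUpperCase(Locale.ROOT)) {
            case "PEND":
                return PENDING;
            case "RUN":
                return RUNNING;
            // LSF's three flavors of suspension collapse into one state.
            case "PSUSP": case "USUSP": case "SSUSP":
                return SUSPENDED;
            case "DONE":
                return DONE;
            case "EXIT":
                return EXIT;
            default:
                // Covers UNKWN, ZOMBI (an assumption), and anything else.
                return UNKNOWN;
        }
    }

    /** Maps a SLURM job state name. */
    static JobStatus fromSlurm(String state) {
        if (state == null) {
            return UNKNOWN;
        }
        switch (state.trim().toUpperCase(Locale.ROOT)) {
            case "PENDING":
                return PENDING;
            case "RUNNING":
                return RUNNING;
            case "SUSPENDED":
                return SUSPENDED;
            case "COMPLETED":
                return DONE;
            // SLURM's separate abnormal endings collapse into EXIT.
            case "FAILED": case "NODE_FAIL": case "TIMEOUT": case "CANCELLED":
                return EXIT;
            default:
                return UNKNOWN;
        }
    }
}
```

Note that this mapping loses the FAILED / NODE_FAIL / TIMEOUT
distinction; if callers need it, the abstraction would have to carry
the native state alongside the abstract one.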

(A note:  The PTP process status codes were intended for the monitoring
of a "process."  The job status codes discussed here are intended for
the monitoring of a "job."  Should there be a difference?)

Thank you,
Randy
