Re: [ptp-dev] Resource Managment Design Issue: Status Codes


On Apr 26, 2006, at 10:18 AM, Randy M. Roberts wrote:


			      status has not yet been determined.


Because of the lack of overlap in the LSF and SLURM node status
systems, I chose just three states for the node status: UP, DOWN, and
UNAVAILABLE.  Perhaps I should have added UNKNOWN.  I'm thinking about
including a state that represents the machine being up, but fully
allocated to others.  Do you have a good name for that one?  Maybe
that just falls under UNAVAILABLE.

To me, UNAVAILABLE means that the machine is down. I'd suggest something like IN_USE or ALLOCATED_OTHER.
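
To make the node-status options concrete, here is a minimal Java sketch of what the enumeration might look like with both UNKNOWN and ALLOCATED_OTHER added. The type name NodeStatus and both helper methods are hypothetical, not part of PTP, and whether UNAVAILABLE should stay separate from DOWN is exactly the open question above.

```java
// Hypothetical sketch: NodeStatus and its methods are illustrative names,
// not taken from the PTP code base.
enum NodeStatus {
    UP,               // node is up and can accept work
    ALLOCATED_OTHER,  // node is up, but fully allocated to others
    UNAVAILABLE,      // node cannot be used right now
    DOWN,             // node is down
    UNKNOWN;          // status has not yet been determined

    /** True only when the node could take a new job from this user. */
    boolean canAcceptJobs() {
        return this == UP;
    }

    /** True when the machine itself is known to be running. */
    boolean isMachineUp() {
        return this == UP || this == ALLOCATED_OTHER;
    }
}
```

With a split like this, a UI could show an ALLOCATED_OTHER node as running but busy, rather than lumping it in with machines that are down.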





JOB STATUS SYSTEMS:

There were fewer variations in Job status systems.

PTP's current process states do not have the concept of PENDING.  The
Pending state shows up in both LSF and SLURM, and I'm sure is fairly
universal in resource management systems.

PTP's STARTING state is found in neither SLURM nor LSF.

STARTING is a PTP-internal state. It is set when the job is created (internally) and changes only when the runtime signals that the job has started running.


PTP, LSF, and SLURM define unique states for running, normal exit, and
abnormal exit.  PTP distinguishes EXITED WITH SIGNAL from ERROR.
SLURM distinguishes FAILED (non-zero exit code) from NODE_FAIL (one or
more nodes failed during the run) and TIMEOUT (the job reached its time
limit).  LSF does not distinguish a job that exited with a signal from
other abnormal job exits.

LSF has more job states than either SLURM or PTP.  LSF has three
versions of the status for a suspended job, a state of UNKNOWN, and
a ZOMBI state.

I think that these job states are best abstracted to PENDING, RUNNING,
SUSPENDED, DONE, EXIT (abnormal), and UNKNOWN.  These do not mesh with
the current PTP process status codes. Should I try to use the existing
PTP process status codes or should I use this other abstraction?
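
As a rough illustration of the proposed abstraction, the sketch below folds LSF and SLURM job states into the six abstract states. JobStatus, JobStatusMapper, fromLsf, and fromSlurm are hypothetical names. The state strings are an assumption based on how the LSF and SLURM tools commonly print them and should be checked against the actual RM APIs. Sending ZOMBI and any unrecognized state to UNKNOWN is only one possible choice.

```java
import java.util.Locale;

// Hypothetical abstraction of the job states discussed above.
enum JobStatus { PENDING, RUNNING, SUSPENDED, DONE, EXIT, UNKNOWN }

final class JobStatusMapper {
    private JobStatusMapper() {}

    /** Maps an LSF job state name (assumed spelling) to the abstract state. */
    static JobStatus fromLsf(String state) {
        switch (state.toUpperCase(Locale.ROOT)) {
            case "PEND":  return JobStatus.PENDING;
            case "RUN":   return JobStatus.RUNNING;
            case "PSUSP": // the three LSF suspended variants collapse to one
            case "USUSP":
            case "SSUSP": return JobStatus.SUSPENDED;
            case "DONE":  return JobStatus.DONE;
            case "EXIT":  return JobStatus.EXIT;
            default:      return JobStatus.UNKNOWN; // UNKWN, ZOMBI, anything else
        }
    }

    /** Maps a SLURM job state name (assumed spelling) to the abstract state. */
    static JobStatus fromSlurm(String state) {
        switch (state.toUpperCase(Locale.ROOT)) {
            case "PENDING":   return JobStatus.PENDING;
            case "RUNNING":   return JobStatus.RUNNING;
            case "SUSPENDED": return JobStatus.SUSPENDED;
            case "COMPLETED": return JobStatus.DONE;
            case "FAILED":    // non-zero exit code
            case "NODE_FAIL": // one or more nodes failed during the run
            case "TIMEOUT":   return JobStatus.EXIT; // time limit reached
            default:          return JobStatus.UNKNOWN;
        }
    }
}
```

A mapping like this loses detail (for example, SLURM's TIMEOUT versus NODE_FAIL), so the original RM state string would probably need to be kept alongside the abstract one.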

(A note: The PTP process status codes were intended for the monitoring
of a "process."  The job status codes discussed here are intended for
the monitoring of a "job."  Should there be a difference?)

Probably, since process status is going to be provided by the control/monitoring system while job status will be provided by the RM system. However, it would make sense for them to use similar terminology where appropriate.

I assume the mapping is:

DONE -> EXITED (zero exit status)
EXIT -> EXITED WITH SIGNAL or EXITED (non-zero exit status)

I would like to preserve the distinction between signal and non-zero exit status in some way (at least for processes). We have different icons for these states. Otherwise I'm happy changing names.
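
One way to keep the signal versus non-zero distinction while still mapping onto DONE and EXIT is sketched below. The names ProcessExit, classify, and jobStatus are hypothetical and do not come from the PTP sources. The sketch assumes the runtime can report both an exit code and whether the process was killed by a signal.

```java
// Hypothetical sketch: three process-level exit states that collapse
// onto the two job-level states DONE and EXIT.
enum ProcessExit {
    EXITED_ZERO("DONE"),         // EXITED, zero exit status
    EXITED_NONZERO("EXIT"),      // EXITED, non-zero exit status
    EXITED_WITH_SIGNAL("EXIT");  // terminated by a signal

    private final String jobStatus;

    ProcessExit(String jobStatus) {
        this.jobStatus = jobStatus;
    }

    /** The abstract job status this process state maps to. */
    String jobStatus() {
        return jobStatus;
    }

    /** Classifies a finished process; a signal takes precedence over the code. */
    static ProcessExit classify(int exitCode, boolean killedBySignal) {
        if (killedBySignal) {
            return EXITED_WITH_SIGNAL;
        }
        return exitCode == 0 ? EXITED_ZERO : EXITED_NONZERO;
    }
}
```

The process view can then pick an icon per ProcessExit value, while code that only cares about the job-level outcome reads jobStatus().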

Greg


