Re: [ptp-dev] Re: Some questions about runtime model,sdm
On Sep 5, 2006, at 7:51 PM, yang ke wrote:
What we currently have issues a run command and blocks on the return
of an associated job ID. It then blocks again waiting for information
about the processes associated with that job: what are their PIDs,
what nodes did they start on, etc.? In the version I'm using right now
I changed *BOTH* of these blocks to be async. Instead, you issue a run
and sometime later a "new job event" comes back with the job ID. Then
you can ask for info about the processes, and that info will come back
async over time. It will likely arrive quickly, but it's designed so
that if it's staggered it will be OK. Now, the problem is that since
the run command no longer blocks until it can return a job ID, a few
other systems are acting up (like the debugger) for reasons I won't go
into. I'm toying with the idea of making the run block on a job ID
again, I think it's reasonable . . . any thoughts? It would fix this
problem with these other systems, which were expecting something like:
jobid = controlSystem.run(my_job);
/* do something with jobid */
-- Nathan
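For illustration, the two styles Nathan describes could coexist if the
blocking call is layered on top of the async one. This is only a
sketch: SketchControlSystem, runAsync and the counter-based job IDs
are hypothetical names invented here, not the PTP API, and the "new
job event" is modeled as a CompletableFuture.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the control system; not the PTP API.
class SketchControlSystem {
    private final AtomicInteger nextId = new AtomicInteger(1);

    // Async form: returns at once. The future completes when the
    // "new job event" carrying the job ID arrives (simulated here).
    CompletableFuture<Integer> runAsync(String jobSpec) {
        return CompletableFuture.supplyAsync(nextId::getAndIncrement);
    }

    // Blocking form: waits only for the job ID. Process info (PIDs,
    // nodes) would still be delivered asynchronously afterwards.
    int run(String jobSpec) {
        return runAsync(jobSpec).join();
    }
}

class AsyncRunSketch {
    public static void main(String[] args) {
        SketchControlSystem controlSystem = new SketchControlSystem();

        // Callers such as the debugger keep the old blocking pattern.
        int jobid = controlSystem.run("my_job");
        System.out.println("job id: " + jobid);

        // Callers that prefer events react when the ID shows up.
        controlSystem.runAsync("other_job")
                .thenAccept(id -> System.out.println("new job event: " + id))
                .join();
    }
}
```

With this shape only the job ID is waited on; the slower per-process
information stays event-driven.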
Correspondence
I agree with you! Every parallel runtime would return a job ID as
soon as it allocates nodes for the new application. But for a
1000-process job, the PID of each process may be returned several
seconds (even half a minute) later. Before we get the job ID, we
already know the number of processes belonging to the job, so we
can create a blank job object with only the jobid and num_procs
fields filled in. From then on, whenever a PID event or a PROC_OUT
event occurs, we only need to add a process member to the job object
or update an existing one. It all depends on whether the parallel
runtime supplies a process-state notification mechanism (that is,
whether the runtime reports the status of each process, such as
initializing, running, proc_out, error, etc.).
So I think we should block until the job ID is returned; otherwise we
may not be able to tell the job IDs apart when we launch several
application instances simultaneously.
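A rough sketch of the "blank job object" idea above: the job starts
with only its ID and process count, and process entries are created or
updated as PID and PROC_OUT events trickle in. The class and method
names (JobRecord, ProcRecord, onPidEvent, onProcOutEvent) and the use
of a rank to identify a process are assumptions made for this example,
not existing PTP code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-process record; fields are filled in lazily.
class ProcRecord {
    final int rank;
    int pid = -1; // unknown until a PID event arrives
    final StringBuilder output = new StringBuilder();

    ProcRecord(int rank) {
        this.rank = rank;
    }
}

// Hypothetical job record created as soon as the job ID is known.
class JobRecord {
    final int jobId;
    final int numProcs;
    private final Map<Integer, ProcRecord> procs = new HashMap<>();

    JobRecord(int jobId, int numProcs) {
        this.jobId = jobId;
        this.numProcs = numProcs;
    }

    // Add the process member on first sight, otherwise reuse it.
    private ProcRecord proc(int rank) {
        if (rank < 0 || rank >= numProcs) {
            throw new IllegalArgumentException("bad rank " + rank);
        }
        return procs.computeIfAbsent(rank, ProcRecord::new);
    }

    void onPidEvent(int rank, int pid) {
        proc(rank).pid = pid;
    }

    void onProcOutEvent(int rank, String text) {
        proc(rank).output.append(text);
    }

    int knownProcs() {
        return procs.size();
    }

    ProcRecord get(int rank) {
        return procs.get(rank);
    }
}
```

Events may arrive in any order and at any pace; a process whose output
shows up before its PID simply keeps pid at -1 until the PID event
lands.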
Actually, the problem is that we should be referring to a 'job' which
is an object in the model, rather than a 'jobid' which is an
attribute allocated by the control system. The sequence should be:
job = modelManager.newJob();
controlSystem.run(job, my_job);
This way a 'job' can still be passed around, but its status will
change depending on progress in the external runtime. Tasks that only
operate on a job when it is in the correct state could register a
listener, etc.
Greg
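One way to picture the model-object approach with listeners is
sketched below. Everything here is hypothetical: ModelJob, JobState
and its values, addListener and update are names chosen for the
example, and the real model classes may look quite different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical job states; the real runtime may report others.
enum JobState { PENDING, STARTED, RUNNING, TERMINATED }

// Hypothetical model object: exists before the runtime assigns an ID.
class ModelJob {
    private JobState state = JobState.PENDING;
    private int jobId = -1; // attribute filled in by the control system
    private final List<Consumer<ModelJob>> listeners = new ArrayList<>();

    void addListener(Consumer<ModelJob> listener) {
        listeners.add(listener);
    }

    JobState getState() {
        return state;
    }

    int getJobId() {
        return jobId;
    }

    // Called by the control system as the external runtime progresses.
    void update(JobState newState, int id) {
        state = newState;
        jobId = id;
        for (Consumer<ModelJob> listener : listeners) {
            listener.accept(this);
        }
    }
}

class ModelJobSketch {
    public static void main(String[] args) {
        ModelJob job = new ModelJob(); // usable immediately, no blocking

        // A task that needs a started job (e.g. a debugger) registers
        // a listener instead of waiting for run() to return an ID.
        job.addListener(j -> {
            if (j.getState() == JobState.STARTED) {
                System.out.println("attach to job " + j.getJobId());
            }
        });

        // Later, the control system reports progress.
        job.update(JobState.STARTED, 7);
    }
}
```

The caller never has to block: it holds the job from the start and is
told when the job reaches a state it cares about.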