Re: [ptp-dev] Re: Some questions about runtime model,sdm


On Sep 5, 2006, at 7:51 PM, yang ke wrote:

What we currently have issues a run command and blocks until the associated job ID is returned. It then blocks again waiting for information about the processes associated with that job: what are their PIDs, what nodes did they start on, etc. In the version I'm using right now I changed *BOTH* of these blocks to be async. Instead, you issue a run and sometime later a "new job event" comes back with the job ID. Then, you can ask for info
about the processes and that info will come back async over time -
likely quickly but it's designed so that if it's staggered it will be
OK. Now, the problem is since the run command no longer blocks and
returns a JobID a few other systems are acting up (like the debugger)
for reasons I won't go into. I'm toying with the idea of making the run
block on a jobID again, I think it's reasonable . . . any thoughts? It
would fix this problem with these other systems which were expecting
something like:

jobid = controlSystem.run(my_job);
/* do something with jobid */
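
One way to keep both behaviours is to leave the run asynchronous internally and layer a blocking call on top that waits only for the job ID. The following is a minimal sketch under that assumption; SketchControlSystem, runAsync and the simulated "new job event" are hypothetical names for illustration and are not the actual PTP API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the control system; not the real PTP classes.
class SketchControlSystem {
    private final AtomicInteger lastJobId = new AtomicInteger(0);

    // Async form: returns at once. The future completes when the
    // "new job event" carrying the job ID arrives from the runtime.
    CompletableFuture<Integer> runAsync(String program) {
        final int id = lastJobId.incrementAndGet();
        // Assumption: a real implementation would complete the future from
        // its event handler. Here a background task simulates that event.
        return CompletableFuture.supplyAsync(() -> id);
    }

    // Blocking form: waits only for the job ID. Process information
    // (PIDs, nodes) can still arrive asynchronously afterwards.
    int run(String program) {
        return runAsync(program).join();
    }
}
```

Callers such as the debugger could keep using the blocking form, while callers that do not need the ID right away could use the async form.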

-- Nathan
I agree with you! Every parallel runtime would return a job ID as soon as it allocates nodes for the new application. But for a 1000-process job, the PID of each process may be returned several seconds (even half a minute) later. Before we get the job ID, we already know the number of processes belonging to the job, so we can create a blank job object with only the jobid and num_procs fields filled in. From then on, when a PID event or a PROC_OUT event occurs, we either add a process member to the job object or update an existing one. It all depends on whether the parallel runtime supplies a process-state notification mechanism (that is, whether the runtime reports the status of each process: initializing, running, proc_out, error, etc.).
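
The blank job object described above could look roughly like this. It is a sketch only: SketchJob, SketchProcess and the event method names are assumptions made for illustration, not the PTP model classes.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical process record; fields stay at defaults until events arrive.
class SketchProcess {
    final int rank;
    long pid = -1;             // unknown until a PID event arrives
    String state = "UNKNOWN";  // updated by state events such as PROC_OUT

    SketchProcess(int rank) { this.rank = rank; }
}

// Hypothetical job object: created "blank" with only jobId and numProcs.
class SketchJob {
    final int jobId;
    final int numProcs;
    private final Map<Integer, SketchProcess> procs = new TreeMap<>();

    SketchJob(int jobId, int numProcs) {
        this.jobId = jobId;
        this.numProcs = numProcs;
    }

    // Add the process the first time it is seen, otherwise update it.
    synchronized void onPidEvent(int rank, long pid) {
        procs.computeIfAbsent(rank, SketchProcess::new).pid = pid;
    }

    synchronized void onStateEvent(int rank, String state) {
        procs.computeIfAbsent(rank, SketchProcess::new).state = state;
    }

    synchronized SketchProcess get(int rank) { return procs.get(rank); }

    synchronized int knownProcs() { return procs.size(); }

    // True once every expected process has reported at least one event.
    synchronized boolean allProcsKnown() { return procs.size() == numProcs; }
}
```

Because events may arrive in any order and be staggered, each handler creates the process entry on demand instead of assuming a PID event comes first.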

So I think we should block until the job ID is returned; otherwise we may not be able to tell the job IDs apart when we launch several application instances simultaneously.


Actually, the problem is that we should be referring to a 'job' which is an object in the model, rather than a 'jobid' which is an attribute allocated by the control system. The sequence should be:

	job = modelManager.newJob();
	controlSystem.run(job, my_job);

This way a 'job' can still be passed around, but its status will change depending on progress in the external runtime. Tasks that only operate on a job when it is in the correct state could register a listener, etc.
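
A rough sketch of that idea follows. The state names, ModelJob and SketchJobListener are hypothetical and only illustrate the pattern; the real model and control system interfaces may differ.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical job states; the real model may define others.
enum SketchJobState { PENDING, STARTED, TERMINATED }

// Hypothetical listener interface for job state changes.
interface SketchJobListener {
    void stateChanged(ModelJob job, SketchJobState newState);
}

// The job exists in the model before the runtime has assigned a job ID.
class ModelJob {
    private volatile SketchJobState state = SketchJobState.PENDING;
    private volatile int jobId = -1; // attribute filled in by the control system
    private final List<SketchJobListener> listeners = new CopyOnWriteArrayList<>();

    void addListener(SketchJobListener l) { listeners.add(l); }

    SketchJobState getState() { return state; }

    int getJobId() { return jobId; }

    // Called by the control system when the runtime reports the new job.
    void started(int assignedJobId) {
        this.jobId = assignedJobId;
        setState(SketchJobState.STARTED);
    }

    // Called by the control system when the runtime reports termination.
    void terminated() { setState(SketchJobState.TERMINATED); }

    private void setState(SketchJobState s) {
        state = s;
        for (SketchJobListener l : listeners) {
            l.stateChanged(this, s);
        }
    }
}
```

A component such as the debugger could register a listener and act only when the job reaches STARTED, so the run call itself would not need to block.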

Greg

