Re: [ptp-dev] Some questions about runtime model,sdm
yang ke wrote:
Hi all,
I've just managed to launch a parallel job across multiple nodes with
a small change in orte_server.c. :-)
I know that v1.1 is coming in the next few weeks and that it has changed
a great deal from v1.0. Still, I have some questions that I hope will be
helpful.
1. About Runtime Model
Nathan, I think asynchronous updating would be better for the runtime
model, especially for job launching. Currently we use a <RUN ...> command
followed by a <GETPATTR ...> command to launch a job in a BLOCKING way and
then construct its job structure, which often leaves the user staring at
a progress bar for a long time before they can continue. The runtime
can quickly return a job ID, but other attributes, like the real process
PID, would not be returned soon (maybe only after all tasks have
launched). We can make better use of this knowledge for finer-grained
monitoring of job status, down to per-process status.
First construct the job structure in the higher-layer model (Java), then
LISTEN for attribute events returned by the lower layer (such as SLURM or
ORTE). The lower layer should have a good sense of process attributes
and status changes, and report them promptly. Whenever an error occurs, we
can find out which process failed instead of reporting a whole-job error.
I have tested that on SLURM, but I don't know if ORTE supports such
process status reports. Will we get such changes?
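A minimal Java sketch of the listen-for-events model described above. Every name here (ProcStatus, ProcessEvent, JobModel, onProcessEvent) is hypothetical and not part of the PTP, SLURM, or ORTE APIs; the only assumption is that the lower layer can deliver one event per process status change.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical names throughout; this is not the PTP, SLURM, or ORTE API.
enum ProcStatus { STARTING, RUNNING, EXITED, ERROR }

// One status event reported by the lower layer for a single process.
class ProcessEvent {
    final String jobId;
    final int rank;
    final ProcStatus status;

    ProcessEvent(String jobId, int rank, ProcStatus status) {
        this.jobId = jobId;
        this.rank = rank;
        this.status = status;
    }
}

// Job structure built up front on the Java side, then updated as events arrive.
class JobModel {
    private final String jobId;
    private final Map<Integer, ProcStatus> statusByRank = new ConcurrentHashMap<>();

    JobModel(String jobId, int numProcs) {
        this.jobId = jobId;
        for (int rank = 0; rank < numProcs; rank++) {
            statusByRank.put(rank, ProcStatus.STARTING);
        }
    }

    // Called by the thread that listens to the runtime; events for other jobs are ignored.
    void onProcessEvent(ProcessEvent ev) {
        if (ev.jobId.equals(jobId)) {
            statusByRank.put(ev.rank, ev.status);
        }
    }

    // Ranks currently in ERROR, so a failure can be pinned on specific processes.
    List<Integer> failedRanks() {
        List<Integer> failed = new ArrayList<>();
        for (Map.Entry<Integer, ProcStatus> e : statusByRank.entrySet()) {
            if (e.getValue() == ProcStatus.ERROR) {
                failed.add(e.getKey());
            }
        }
        Collections.sort(failed);
        return failed;
    }
}
```

With this shape, the UI can show the job as soon as the model exists and mark individual ranks as they report in, rather than waiting for the whole launch to finish.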
I have in my workspace a working version of this - an asynchronous
version of the job system. There is a slight problem with how it
interacts with the debugger so I'm trying to work that out, which is
marginally complex since I have had no participation in the debugger
parts of this project.
What we currently have issues a run command and blocks on the return of
an associated job ID. It then blocks again waiting for information about
the processes associated with that job - what are their PIDs, what nodes
did they start on, etc? In the version I'm using right now I changed *BOTH*
of these blocks to be async. Instead, you issue a run and sometime later
a "new job event" comes back with the job ID. Then, you can ask for info
about the processes and that info will come back async over time -
likely quickly but it's designed so that if it's staggered it will be
OK. Now, the problem is that since the run command no longer blocks and
returns a job ID, a few other systems are acting up (like the debugger)
for reasons I won't go into. I'm toying with the idea of making the run
block on a job ID again; I think it's reasonable . . . any thoughts? It
would fix this problem with these other systems which were expecting
something like:
jobid = controlSystem.run(my_job);
/* do something with jobid */
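For illustration only, the compromise could look roughly like the Java sketch below: run() blocks just long enough to hand back a job ID, while process information arrives later through a callback. The ControlSystem class here is a stand-in with hypothetical method names (requestProcessInfo, deliverPid), not the actual PTP class, and the runtime is simulated.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BiConsumer;

// Hypothetical sketch; this ControlSystem is not the real PTP class.
class ControlSystem {
    private final AtomicInteger nextJobId = new AtomicInteger(1);
    private final Map<Integer, BiConsumer<Integer, Long>> pidListeners =
            new ConcurrentHashMap<>();

    // Returns once a job ID has been assigned (simulated here with a counter).
    int run(String command) {
        return nextJobId.getAndIncrement();
    }

    // Registers a callback invoked once per process as its PID becomes known.
    void requestProcessInfo(int jobId, BiConsumer<Integer, Long> onPid) {
        pidListeners.put(jobId, onPid);
    }

    // Stand-in for the runtime reporting a PID; reports may arrive staggered.
    void deliverPid(int jobId, int rank, long pid) {
        BiConsumer<Integer, Long> listener = pidListeners.get(jobId);
        if (listener != null) {
            listener.accept(rank, pid);
        }
    }
}
```

Callers that expect a job ID straight back from run() keep working, and only the slower per-process details are delivered asynchronously.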
-- Nathan
Correspondence
---------------------------------------------------------------------
Nathan DeBardeleben, Ph.D.
Los Alamos National Laboratory
Parallel Tools Team
High Performance Computing Environments
phone: 505-667-3428
email: ndebard@xxxxxxxx
---------------------------------------------------------------------
2. About Scalable Debugger Manager
Recently I studied a paper:
"Extending a traditional debugger to debug massively parallel
applications", Susanne M. Balle, Journal of Parallel and Distributed
Computing 64 (2004) 617-628
It suggests some good practices for improving debugging performance.
PS: Greg's "architecture of a parallel relative debugger" is listed in
the References. :-)
I looked into the sdm source, and I am afraid the sdm client may be a
bottleneck when issuing debug instructions to servers and receiving
debug responses from them, because it has to handle the set of processes
one by one. If we are to debug 10K processes, the sdm client will be busy
dispatching each debug instruction and will be buried in a flood of
responses. Fortunately, the MPI-based implementation of sdm gives me
hope. We can dispatch debug instructions in a way similar to
MPI_BCAST() and aggregate debug responses with MPI_GATHER().
Alternatively, we can adopt some ideas from Susanne's paper; for
example, we could use a tree-like network to dispatch and aggregate.
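A rough Java sketch of the tree-like dispatch and aggregation idea, simulated in one process. The DebugTree class, its heap-style rank numbering, and the reply text are all assumptions made for illustration; this is neither sdm code nor the scheme from the paper.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical k-ary tree overlay over ranks 0..n-1; rank 0 plays the sdm client.
class DebugTree {
    final int n;       // total number of nodes in the tree
    final int fanout;  // maximum children per node

    DebugTree(int n, int fanout) {
        this.n = n;
        this.fanout = fanout;
    }

    // Children of a rank in the implicit k-ary tree (heap-style numbering).
    List<Integer> children(int rank) {
        List<Integer> kids = new ArrayList<>();
        for (int i = 1; i <= fanout; i++) {
            long child = (long) rank * fanout + i;
            if (child < n) {
                kids.add((int) child);
            }
        }
        return kids;
    }

    // Sends a command down the tree and gathers replies on the way back up.
    // Each node exchanges messages with at most `fanout` children.
    List<String> dispatchAndGather(int rank, String command) {
        List<String> replies = new ArrayList<>();
        replies.add("rank " + rank + ": ok " + command); // simulated local reply
        for (int child : children(rank)) {
            replies.addAll(dispatchAndGather(child, command));
        }
        return replies;
    }
}
```

With 10K processes and a fanout of, say, 32, the root would talk to 32 children instead of 10K servers, and intermediate nodes could merge replies before forwarding them.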
Again, from that paper, I found that there are 3 types of debugging messages:
Type 1: Identical outputs from each of the debugger/aggregators
Type 2: Identical outputs apart from containing different numbers
Type 3: Widely differing outputs
sdm already aggregates Type 1 messages using a hash table (so cool!), but
Type 2 still needs to be aggregated, so I hope future sdm work finds a way
to aggregate Type 2 messages.
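One possible way to fold Type 2 messages into the same hash-table approach is to key on a template in which every run of digits is masked. The Java sketch below assumes that idea; ResponseAggregator and its methods are hypothetical names, and this is not how sdm or the paper does it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch; not sdm code and not the algorithm from the paper.
class ResponseAggregator {
    // Masked template -> sorted set of ranks whose message matched it.
    private final Map<String, TreeSet<Integer>> ranksByTemplate = new LinkedHashMap<>();

    // Replaces each run of decimal digits with '#', so messages that differ
    // only in their numbers (Type 2) share a key. Identical messages (Type 1)
    // share a key as well.
    static String template(String message) {
        return message.replaceAll("\\d+", "#");
    }

    void add(int rank, String message) {
        ranksByTemplate.computeIfAbsent(template(message), k -> new TreeSet<>()).add(rank);
    }

    // One line per template, prefixed with the ranks that produced it.
    List<String> summary() {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, TreeSet<Integer>> e : ranksByTemplate.entrySet()) {
            lines.add(e.getValue() + " " + e.getKey());
        }
        return lines;
    }
}
```

This sketch throws the masked numbers away; a fuller version would keep them per rank or as a min/max range, and hex addresses would need a wider pattern than `\d+`.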
3. About OpenMP debugging support
PTP now supports MPI debugging well. Will future PTP versions support OpenMP
debugging? If so, how?
Can we still use gdb?
I hope to see PTP v1.1 soon, and good luck to you.
_______________________________________________
ptp-dev mailing list
ptp-dev@xxxxxxxxxxx
https://dev.eclipse.org/mailman/listinfo/ptp-dev