Re: [ptp-dev] Some questions about runtime model,sdm


From: Nathan DeBardeleben <ndebard@xxxxxxxx>
Date: Tue, 05 Sep 2006 08:56:22 -0600

yang ke wrote:

Hi, all
I've just managed to launch a parallel job across multiple nodes with a small change in orte_server.c. :-) I know that our v1.1 is coming in the next few weeks and that it has changed a great deal from v1.0, but I still have some questions, and I hope they will be helpful.
1. About Runtime Model
Nathan, I think asynchronous updating would be better for the runtime model, especially for job launching. Currently we use a <RUN ...> command followed by a <GETPATTR ...> command to launch a job in a BLOCKING way and then construct its job structure, which often leaves the user staring at a progress bar for a tediously long time before being able to continue. The runtime can return a job ID quickly, but other attributes, like the process PID (the real PID), would not be returned as soon (maybe only after all tasks have launched). We can make better use of this knowledge for finer monitoring of job status, namely process status. First construct the job structure in the higher-layer model (Java), then LISTEN for the lower layer (such as SLURM or ORTE) to return attribute events. The lower layer should have a good sense of process attributes and status changes, and should report them in a timely manner. Whenever an error occurs, we can find out which process failed instead of getting a whole-job error. I have tested that on SLURM, but I don't know whether ORTE supports such process status reports. Will we get such changes?
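To make the listener idea above concrete, here is a minimal sketch in Java. Every name in it (ProcessEventListener, JobModel, the "status" and "pid" attribute keys, the "STARTING" and "ERROR" values) is hypothetical and chosen only for illustration; none of it is taken from the PTP source. The job structure is built up front, and attribute events from the lower layer fill it in one process at a time, so a failure can be pinned on individual ranks.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical callback the lower layer (e.g. a SLURM or ORTE proxy) would invoke.
interface ProcessEventListener {
    void attributeChanged(int jobId, int rank, String key, String value);
}

// Hypothetical higher-layer job model: created first, filled in by events later.
class JobModel implements ProcessEventListener {
    private final int jobId;
    // rank -> (attribute name -> latest value)
    private final Map<Integer, Map<String, String>> processes = new ConcurrentHashMap<>();

    JobModel(int jobId, int numProcs) {
        this.jobId = jobId;
        for (int rank = 0; rank < numProcs; rank++) {
            Map<String, String> attrs = new ConcurrentHashMap<>();
            attrs.put("status", "STARTING"); // placeholder until the runtime reports
            processes.put(rank, attrs);
        }
    }

    @Override
    public void attributeChanged(int jobId, int rank, String key, String value) {
        if (jobId != this.jobId) {
            return; // event belongs to another job
        }
        Map<String, String> attrs = processes.get(rank);
        if (attrs != null) {
            attrs.put(key, value);
        }
    }

    String get(int rank, String key) {
        return processes.get(rank).get(key);
    }

    // Ranks whose last reported status is ERROR, in ascending order.
    List<Integer> failedRanks() {
        List<Integer> failed = new ArrayList<>();
        for (Map.Entry<Integer, Map<String, String>> e : processes.entrySet()) {
            if ("ERROR".equals(e.getValue().get("status"))) {
                failed.add(e.getKey());
            }
        }
        Collections.sort(failed);
        return failed;
    }
}
```

With a model like this, the UI could show the job as soon as the job ID is known and update individual processes as events arrive, rather than waiting for every PID.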

I have in my workspace a working version of this - an asynchronous version of the job system. There is a slight problem with how it interacts with the debugger so I'm trying to work that out, which is marginally complex since I have had no participation in the debugger parts of this project.

What we currently have issues a run command and blocks on the return of an associated job ID. It then blocks again waiting for information about the processes associated with that job: what are their PIDs, what nodes did they start on, etc.? In the version I'm using right now I changed *BOTH* of these blocks to be async. Instead, you issue a run and sometime later a "new job event" comes back with the job ID. Then you can ask for info about the processes, and that info will come back async over time, likely quickly, but it's designed so that if it's staggered it will be OK. Now, the problem is that since the run command no longer blocks and returns a job ID, a few other systems (like the debugger) are acting up for reasons I won't go into. I'm toying with the idea of making the run block on a job ID again; I think it's reasonable . . . any thoughts? It would fix this problem with these other systems, which were expecting something like:


jobid = controlSystem.run(my_job);
/* do something with jobid */
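One way the hybrid could look, as a rough sketch only: run blocks just long enough to get the job ID, while process info still arrives asynchronously through a callback. All names below (ProcessInfo, RuntimeBackend, ControlSystem, watchProcesses, FakeBackend) are hypothetical and are not the actual PTP control system API; FakeBackend is a toy stand-in for the runtime so the example is self-contained.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

// Hypothetical per-process record reported by the runtime.
final class ProcessInfo {
    final int rank;
    final int pid;
    final String node;

    ProcessInfo(int rank, int pid, String node) {
        this.rank = rank;
        this.pid = pid;
        this.node = node;
    }
}

// Hypothetical view of the lower layer; both calls return immediately.
interface RuntimeBackend {
    CompletableFuture<Integer> submit(String job);

    // Completes once all process info for the job has been delivered.
    CompletableFuture<Void> requestProcessInfo(int jobId, Consumer<ProcessInfo> onInfo);
}

final class ControlSystem {
    private final RuntimeBackend backend;

    ControlSystem(RuntimeBackend backend) {
        this.backend = backend;
    }

    // Blocks only until the "new job event" carrying the job ID arrives.
    int run(String job) {
        return backend.submit(job).join();
    }

    // Does not block: process info is pushed to the callback as it comes in.
    CompletableFuture<Void> watchProcesses(int jobId, Consumer<ProcessInfo> onInfo) {
        return backend.requestProcessInfo(jobId, onInfo);
    }
}

// Toy backend for illustration; a real one would talk to the runtime.
final class FakeBackend implements RuntimeBackend {
    @Override
    public CompletableFuture<Integer> submit(String job) {
        return CompletableFuture.supplyAsync(() -> 42);
    }

    @Override
    public CompletableFuture<Void> requestProcessInfo(int jobId, Consumer<ProcessInfo> onInfo) {
        return CompletableFuture.runAsync(() -> {
            for (int rank = 0; rank < 2; rank++) {
                onInfo.accept(new ProcessInfo(rank, 1000 + rank, "node" + rank));
            }
        });
    }
}
```

Callers that expect a job ID back from run keep working unchanged, and only the slower per-process information stays event-driven.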

-- Nathan
Correspondence
---------------------------------------------------------------------
Nathan DeBardeleben, Ph.D.
Los Alamos National Laboratory
Parallel Tools Team
High Performance Computing Environments
phone: 505-667-3428
email: ndebard@xxxxxxxx
---------------------------------------------------------------------

2. About Scalable Debugger Manager
Recently I studied a paper:
"Extending a traditional debugger to debug massively parallel applications", Susanne M. Balle, Journal of Parallel and Distributed Computing 64 (2004) 617-628
It suggests some good practices for improving debugging performance.
PS: Greg's "architecture of a parallel relative debugger" is listed in the References. :-)
I looked into the source of sdm, and I am afraid the sdm client may be a bottleneck when issuing debug instructions to servers and receiving debug responses from servers, because it has to handle a set of processes one by one. If we are to debug 10K processes, the sdm client will be busy dispatching each debug instruction and will be buried in a flood of responses. Fortunately, the fact that sdm is implemented on top of MPI gives me courage. We can dispatch debug instructions in a way similar to MPI_BCAST() and aggregate debug responses with MPI_GATHER(). Alternatively, we can adopt some ideas from Susanne's paper; for example, we could use a tree-like network to dispatch and aggregate.
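The tree idea can be illustrated with a small single-process simulation in Java. This is only a sketch of the topology, not sdm code: the class TreeDispatch, the k-ary layout over ranks with rank 0 as the root, and the execute callback are all assumptions made for the example. In a real deployment each node would forward to its children in parallel over the network; here the recursion just shows that no node ever talks to more than fanout children.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

// Hypothetical illustration of tree-shaped dispatch and aggregation.
final class TreeDispatch {

    // Children of a node in a k-ary tree laid out over ranks 0..n-1 (root is rank 0).
    static List<Integer> children(int node, int fanout, int n) {
        List<Integer> kids = new ArrayList<>();
        for (int j = 1; j <= fanout; j++) {
            long child = (long) node * fanout + j; // long avoids overflow for large ranks
            if (child < n) {
                kids.add((int) child);
            }
        }
        return kids;
    }

    // Runs the command on this node, forwards it to the children, and returns
    // the responses of the whole subtree merged into one list.
    static List<String> dispatch(int node, int fanout, int n, String command,
                                 BiFunction<Integer, String, String> execute) {
        List<String> merged = new ArrayList<>();
        merged.add(execute.apply(node, command));
        for (int child : children(node, fanout, n)) {
            merged.addAll(dispatch(child, fanout, n, command, execute));
        }
        return merged;
    }
}
```

For example, TreeDispatch.dispatch(0, 2, 7, "step", (rank, cmd) -> cmd + "@" + rank) visits all seven ranks while the root itself contacts only ranks 1 and 2. With 10K processes and a fanout of 2, the tree is about 14 levels deep, so the client handles a couple of merged responses instead of 10K individual ones.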
Again, from that paper, I learned that there are 3 types of debugging messages:
Type 1 ------ Identical outputs from each of the debuggers/aggregators
Type 2 ------ Identical outputs apart from containing different numbers
Type 3 ------ Widely differing outputs
sdm already aggregates Type 1 messages using a hash table (so cool!), but Type 2 still needs to be aggregated, so I hope future sdm versions will work out some way to aggregate Type 2 messages.
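One possible approach to Type 2 aggregation, sketched here in Java under my own assumptions (it is neither what sdm does nor necessarily what the paper proposes): replace every run of digits with a placeholder and use the resulting template as the hash key, so outputs that differ only in their numbers land in the same bucket. The class name Type2Aggregator and the "#" placeholder are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical sketch: group outputs that are identical apart from numbers.
final class Type2Aggregator {
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    // Maps each number-masked template to the ranks that produced it.
    static Map<String, List<Integer>> aggregate(Map<Integer, String> outputByRank) {
        // Sort by rank so each group's rank list comes out in ascending order.
        Map<Integer, String> sorted = new TreeMap<>(outputByRank);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (Map.Entry<Integer, String> e : sorted.entrySet()) {
            String template = NUMBER.matcher(e.getValue()).replaceAll("#");
            groups.computeIfAbsent(template, t -> new ArrayList<>()).add(e.getKey());
        }
        return groups;
    }
}
```

Given "rank 0 stopped at line 42", "rank 1 stopped at line 42" and "rank 2 exited", this yields two groups: "rank # stopped at line #" for ranks 0 and 1, and "rank # exited" for rank 2. A fuller version would also keep the masked numbers per rank so the original messages can be reconstructed.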
3. About OpenMP debugging support
PTP now supports MPI debugging well. Will a future PTP support OpenMP debugging? If so, how?
Can we still use gdb?
I hope to see PTP v1.1 soon, and good luck to you.
