Re: [ptp-dev] Some questions about runtime model,sdm


From: Greg Watson <g.watson@xxxxxxxxxxxx>
Date: Wed, 6 Sep 2006 08:40:29 -0600

Great questions.

On Sep 3, 2006, at 10:03 PM, yang ke wrote:

Hi all,
I've just managed to launch a parallel job across multiple nodes with a small change in orte_server.c. :-) I know that v1.1 is coming in the next few weeks and that it has changed greatly from v1.0, but I still have some questions, which I hope will be helpful.
1. About Runtime Model
Nathan, I think asynchronous updating would be better for the runtime model, especially for job launching. Currently we use a <RUN ...> command followed by a <GETPATTR ...> command to launch a job in a blocking fashion and then construct its job structure, which often leaves the user staring at a progress bar for a boringly long time before they can continue working. The runtime can quickly return a job ID, but other attributes, such as the process PID (the real PID), would not be returned soon (maybe only after all tasks have launched). We can make better use of this knowledge for finer monitoring of job status, namely process status. First construct the job structure in the higher-layer model (Java), then listen for attribute events returned by the lower layer (such as SLURM or ORTE). The lower layer should have a good sense of process attributes and status changes, and report them in a timely manner. Whenever an error occurs, we can find out which process failed instead of getting a whole-job error. I have tested this on SLURM, but I don't know whether ORTE supports such process status reporting. Will we get such changes?


See Nathan's/my reply in a separate email.
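
A minimal sketch of the asynchronous model described in the question, assuming a hypothetical `Job` class and event callback that are not part of PTP. The job structure is built as soon as the job ID is known, and per-process attributes and status are filled in later as events arrive from the lower layer:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Process:
    rank: int
    status: str = "STARTING"          # updated later by status events
    attrs: Dict[str, str] = field(default_factory=dict)


class Job:
    """Hypothetical job model: created immediately, populated by events."""

    def __init__(self, job_id: str, nprocs: int) -> None:
        self.job_id = job_id
        self.procs: List[Process] = [Process(r) for r in range(nprocs)]
        self.listeners: List[Callable] = []

    def add_listener(self, fn: Callable) -> None:
        self.listeners.append(fn)

    def on_attribute_event(self, rank: int, key: str, value: str) -> None:
        # Called whenever the lower layer reports an attribute or status change.
        proc = self.procs[rank]
        if key == "status":
            proc.status = value
        else:
            proc.attrs[key] = value
        for fn in self.listeners:
            fn(self, proc, key, value)

    def failed_ranks(self) -> List[int]:
        # Per-process error reporting instead of a whole-job error.
        return [p.rank for p in self.procs if p.status == "ERROR"]
```

With this shape, the UI can show the job right after launch and update individual processes as `on_attribute_event` is called, for example with a `pid` attribute or an `ERROR` status for a single rank.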

2. About Scalable Debugger Manager
I recently studied a paper:
"Extending a traditional debugger to debug massively parallel applications", Susanne M. Balle, Journal of Parallel and Distributed Computing 64 (2004) 617-628
It suggests some good practices for improving debugging performance.
PS: Greg's "architecture of a parallel relative debugger" is listed in the References. :-)
I looked into the source of sdm, and I am afraid the sdm client may be a bottleneck when issuing debug instructions to servers and receiving debug responses from servers, because it has to work through a set of processes one by one. If we are to debug 10K processes, the sdm client will be busy dispatching a debug instruction and will be buried in a flood of responses. Fortunately, the MPI-based implementation of sdm gives me courage. We can dispatch debug instructions in a way similar to MPI_BCAST() and aggregate debug responses with MPI_GATHER(). Alternatively, we can adopt some ideas from Susanne's paper; for example, we could use a tree-like network to dispatch and aggregate.
Again, from that paper, I found that there are 3 types of debugging messages:
Type1------Identical outputs from each of the debugger/aggregators
Type2------Identical outputs apart from containing different numbers
Type3------Widely differing outputs
sdm already aggregates Type 1 messages using a hash table (so cool!), but Type 2 messages still need to be aggregated, so I hope future sdm work finds some way to aggregate Type 2 messages.
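
One possible way to group Type 2 messages, shown only as an illustrative sketch: replace every run of digits with a placeholder and group responses by the resulting template, keeping the per-task numbers on the side. This is neither the paper's method nor what sdm does; the function name `aggregate_type2` and the `<n>` placeholder are assumptions made for the example.

```python
import re
from collections import defaultdict

_NUMBER = re.compile(r"\d+")


def aggregate_type2(responses):
    """Group outputs that are identical apart from decimal numbers.

    responses: dict mapping task id -> output string.
    Returns a dict mapping template -> list of (task id, numbers found).
    """
    groups = defaultdict(list)
    for task_id, text in sorted(responses.items()):
        numbers = _NUMBER.findall(text)        # the parts that differ
        template = _NUMBER.sub("<n>", text)    # the part that is shared
        groups[template].append((task_id, numbers))
    return dict(groups)


example = {
    0: "rank 0 stopped at line 12",
    1: "rank 1 stopped at line 12",
    2: "Segmentation fault",
}
grouped = aggregate_type2(example)
# grouped has two keys: "rank <n> stopped at line <n>" and "Segmentation fault"
```

A plain digit pattern is deliberately simple; hexadecimal addresses or floating-point values would need a richer pattern.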

I recently checked in a change to the SDM that does pretty much what you describe. I allocate task IDs based on the optimal broadcast strategy, then each MPI process sends to a small set of processes rather than one controller process sending to everyone. So for example, if there are 10 processes (task IDs 0-9), 0 would send to (1, 2, 4, 8), 1 would send to (3, 5, 9), 2 to 6, and 3 to 7.
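
The send pattern in this example matches a binomial broadcast tree. The sketch below reproduces it under that assumption; the function names are hypothetical and the actual code in the SDM may compute the tree differently.

```python
def children(task_id, nprocs):
    """Task IDs that task_id forwards to in a binomial broadcast tree."""
    step = 1
    while step <= task_id:          # skip to the first power of two above task_id
        step <<= 1
    kids = []
    while task_id + step < nprocs:  # add successive powers of two while in range
        kids.append(task_id + step)
        step <<= 1
    return kids


def parent(task_id):
    """Task ID that forwards to task_id (clear its highest set bit)."""
    if task_id == 0:
        return None                 # the root has no parent
    return task_id & ~(1 << (task_id.bit_length() - 1))


tree = {t: children(t, 10) for t in range(10)}
# tree[0] == [1, 2, 4, 8], tree[1] == [3, 5, 9], tree[2] == [6], tree[3] == [7]
```

With 10 processes the root sends only 4 messages and the broadcast completes in 4 rounds, rather than the root sending 9 messages itself.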

For the return journey, I aggregate messages at each of the intermediate processes as they propagate back to the root process. This substantially reduces the amount of traffic in the network. Of course, it is always possible for every message to be different, in which case this approach is no better than simply passing each message back. However, in practice this seems to occur rarely.
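
The return path can be sketched as a merge at each intermediate process. This is a single-process simulation, not the SDM code: the `TREE` table is copied from the 10-process example above, and `gather` and the message strings are hypothetical.

```python
# Forwarding table from the 10-process example: sender -> receivers.
TREE = {0: [1, 2, 4, 8], 1: [3, 5, 9], 2: [6], 3: [7]}


def gather(task_id, responses):
    """Return {message: set of task IDs} for the subtree rooted at task_id.

    Each process starts with its own response and merges in whatever its
    children pass back, so identical messages travel upward only once.
    """
    merged = {responses[task_id]: {task_id}}
    for child in TREE.get(task_id, []):
        for message, tasks in gather(child, responses).items():
            merged.setdefault(message, set()).update(tasks)
    return merged


responses = {t: "stopped at main()" for t in range(10)}
responses[7] = "received SIGSEGV"
at_root = gather(0, responses)
# at_root holds 2 distinct messages instead of 10 separate responses
```

In the worst case, where every response differs, the merged table at the root has one entry per task, which is the situation described above where aggregation gives no benefit.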

Take a look at src/client/client_svr.c in the SDM project, which is where most of this is implemented. Comments appreciated.

3. About OpenMP debugging support
PTP now supports MPI debugging well. Will future PTP support OpenMP debugging? If so, how?
Can we still use gdb?

By OpenMP debugging, I assume that you really mean a threaded model? The SDM is really designed for MPI debugging, so for threaded models and other more complex architectures (for example the Cell), I think the SDM will need to be replaced with different backends. I don't see any reason why any debugger with thread support (such as dbx) couldn't be used to support OpenMP debugging. The Eclipse UI portion of the debugger should (hopefully) not need to be changed much, though we might want to extend the parallel debug view to support threads as well as processes. The debugger API provides a bitmask in each command to specify which processes the command applies to, so it should work equally well for threads.


If gdb thread support is good enough then it should work too.
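
To illustrate the bitmask idea mentioned above, here is a small sketch that uses a Python integer as the mask, with bit i standing for process (or thread) i. The helper names are hypothetical and the real debugger API's bitmask representation is not shown in this thread.

```python
def make_mask(ids):
    """Build a bitmask with one bit set per selected process or thread ID."""
    mask = 0
    for i in ids:
        mask |= 1 << i
    return mask


def ids_in(mask):
    """List the IDs whose bits are set in the mask, in ascending order."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


stopped = make_mask([0, 1, 2, 5])
selected = make_mask([1, 5, 7])
targets = stopped & selected   # apply a command only to selected, stopped IDs
```

Because the mask only carries IDs, the same command format could address threads instead of processes, provided the backend assigns each thread an ID.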

Hope to see ptp v1.1 soon, and good luck to you.
