
Re: [ptp-dev] Questions about PTP SDM debugger

Greg,
Some additional questions:
1) It looks like I don't pass the name of the application executable as a 
parameter on the top level SDM instance since the top level instance isn't 
directly invoking the SDM instances required for individual tasks.
2) What are the invocation parameters of the individual SDM? I'm sort of 
guessing I need the hostname and port of the top SDM, the pathname of the 
application and any parameters the application requires. I'm guessing then 
the individual SDM starts, starts a debugger instance and the debugger 
instance starts the application instance.
3) Is the routing file on a node a list of all tasks in the application or 
only the tasks running on that node? 
4) How does the routing file get loaded onto each individual node?
5) How does each individual SDM know how to connect back to the top SDM if 
the top SDM host/port is not a parameter?
6) If the individual SDM is passed the host/port it uses to connect to the 
top SDM, how do I find out what that top level SDM port is?

I think I understand how this is supposed to work, and it seems reasonable 
for the case where the user specifies a host list file. In the case where 
we use LoadLeveler to allocate nodes, I'm not sure how this will work 
since we have no way of knowing what nodes are allocated until the poe job 
(the SDMs) starts.
Dave



Greg Watson <g.watson@xxxxxxxxxxxx> 
Sent by: ptp-dev-bounces@xxxxxxxxxxx
08/25/2008 08:55 AM
Please respond to
Parallel Tools Platform general developers <ptp-dev@xxxxxxxxxxx>


To
Parallel Tools Platform general developers <ptp-dev@xxxxxxxxxxx>
cc

Subject
Re: [ptp-dev] Questions about PTP SDM debugger






Dave,

On Aug 22, 2008, at 10:44 PM, Dave Wootton wrote:


I have a first attempt at changes to my PE proxy to allow a PE application 
to be debugged using the SDM debugger, and have some questions:
1) Around line 188 of SDMDebugger.java, I see code that sets up the 
--numnodes parameter to sdm with number of processes + 1. Right now, my 
code isn't setting up the right job attributes to satisfy 
JobAttributes.getNumberOfProcessesAttributeDefinition so I was getting a 
null pointer exception. I temporarily hard coded --numnodes as 2 to get 
around that. 
Is there an assumption that there is only one application task per node 
when debugging or is this really number of application tasks + 1 and 
number of tasks per node doesn't matter? 
For PE, the user can run as many tasks per node as he likes as long as 
system resources are available. If the user specifies a hostlist, I could 
probably figure out the number of nodes used by the application by looking 
at the hostlist before starting the debugger. If the user is using 
LoadLeveler to allocate nodes, then I have no idea how many nodes, or even 
what nodes the application will run on since LoadLeveler doesn't get 
control to handle node allocation until sometime after the job is 
submitted. 

You're right, this parameter should be --numprocs rather than --numnodes. 
I've changed it now. I've also changed it so that you specify the number 
of processes being debugged rather than +1 since I think this makes more 
sense.


2) I think I have the argument list to sdm set up properly, where argv[0] 
is the sdm executable name (sdm), the next elements of argv are whatever 
are passed as debugArgs, then the pathname of the application executable 
and finally a NULL (to satisfy execve) 
When I try to invoke the debugger, I see the sdm process show up for a few 
seconds then it exits. If I'm quick enough, I can attach to sdm with gdb 
then let it run to completion. It looks like sdm is just running for a few 
seconds then exits with an exit(1) call somewhere in main. 
Is there any way I can turn on some debug output to see what is going 
wrong with SDM? 

--debug will enable debug output. --debug=level will enable selective 
debug output. See config.h for the levels.
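
A minimal sketch of the argument list described in the question, with the debug flag added, might look like the following. The helper name build_sdm_argv is hypothetical, the position of --debug among the arguments is an assumption, and whether sdm expects the application pathname in this position is not confirmed in this thread. Python lists need no trailing NULL; that is only required when calling execve from C.

```python
def build_sdm_argv(debug_args, app_path, sdm="sdm", debug=None):
    """Build an argv list for launching sdm (hypothetical helper).

    Layout follows the question above: argv[0] is the sdm executable name,
    then the debugger arguments, then the application pathname.
    debug=None adds no flag, debug=True adds --debug, and any other value
    adds --debug=<value> (valid levels are defined in config.h).
    """
    argv = [sdm]
    if debug is True:
        argv.append("--debug")
    elif debug is not None and debug is not False:
        argv.append(f"--debug={debug}")
    argv.extend(debug_args)   # whatever is passed as debugArgs
    argv.append(app_path)     # pathname of the application executable
    return argv
```

The resulting list can be handed to subprocess.Popen with stdout and stderr set to subprocess.PIPE, which also captures any debug output sdm writes.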

Note that the new debugger must be started in two steps. The first step is 
to run a master sdm on the head node. The second step is to start the 
server sdms on the nodes using mpirun (or poe). All the sdms will wait 
until they find a routing file formatted as:

numprocs
index address port
...

where numprocs is the number of processes being debugged, index is the 
rank of the process, address is the host address (node name) where the 
process is running, and port is a random port number (I'm not sure that 
this is used).

It's likely that the routing file format will change in the future.
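
As a rough illustration of the format above, the following sketch writes and reads a routing file. It assumes the fields on each line are separated by whitespace and that there is one line per process; the function names are hypothetical and are not part of sdm. Since the format may change, treat this only as a reading of the description given here.

```python
import io


def write_routing_file(stream, entries):
    """Write (index, address, port) tuples in the format described above."""
    stream.write(f"{len(entries)}\n")          # first line: numprocs
    for index, address, port in entries:
        stream.write(f"{index} {address} {port}\n")


def parse_routing_file(text):
    """Parse routing file text into (numprocs, [(index, address, port), ...])."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("routing file is empty")
    numprocs = int(lines[0])
    entries = []
    for ln in lines[1:]:
        index, address, port = ln.split()      # assumes whitespace-separated fields
        entries.append((int(index), address, int(port)))
    if len(entries) != numprocs:
        raise ValueError(
            f"expected {numprocs} entries, found {len(entries)}"
        )
    return numprocs, entries


if __name__ == "__main__":
    buf = io.StringIO()
    # Two tasks on hypothetical hosts node01 and node02.
    write_routing_file(buf, [(0, "node01", 50000), (1, "node02", 50001)])
    print(parse_routing_file(buf.getvalue()))
```

Checking the entry count against numprocs catches a partially written file, which matters if the sdms poll for the file while it is still being generated.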


3) I am intercepting stdout and stderr for the sdm process so that they 
can either be sent to a console or redirected to a file. In either case, 
I'm not seeing anything from SDM. If I try to redirect stdio for sdm, will 
that cause problems? 

The only thing on stdout/stderr should be debugging output, I think.


Dave
_______________________________________________
ptp-dev mailing list
ptp-dev@xxxxxxxxxxx
https://dev.eclipse.org/mailman/listinfo/ptp-dev



