Re: [ptp-dev] Questions about PTP SDM debugger
Greg
Some additional questions:
1) It looks like I don't pass the name of the application executable as a
parameter on the top level SDM instance since the top level instance isn't
directly invoking the SDM instances required for individual tasks.
2) What are the invocation parameters of the individual SDM? I'm sort of
guessing I need the hostname and port of the top SDM, the pathname of the
application and any parameters the application requires. I'm guessing then
the individual SDM starts, starts a debugger instance and the debugger
instance starts the application instance.
3) Is the routing file on a node a list of all tasks in the application or
only the tasks running on that node?
4) How does the routing file get loaded onto each individual node?
5) How does each individual SDM know how to connect back to the top SDM if
the top SDM host/port is not a parameter?
6) If the individual SDM is passed the host/port it uses to connect to the
top SDM, how do I find out what that top-level SDM port is?
I think I understand how this is supposed to work, and it seems reasonable
for the case where the user specifies a host list file. In the case where
we use LoadLeveler to allocate nodes, I'm not sure how this will work
since we have no way of knowing what nodes are allocated until the poe job
(the SDMs) starts.
Dave
Greg Watson <g.watson@xxxxxxxxxxxx>
Sent by: ptp-dev-bounces@xxxxxxxxxxx
08/25/2008 08:55 AM
Please respond to
Parallel Tools Platform general developers <ptp-dev@xxxxxxxxxxx>
To
Parallel Tools Platform general developers <ptp-dev@xxxxxxxxxxx>
cc
Subject
Re: [ptp-dev] Questions about PTP SDM debugger
Dave,
On Aug 22, 2008, at 10:44 PM, Dave Wootton wrote:
I have a first attempt at changes to my PE proxy to allow a PE application
to be debugged using the SDM debugger, and have some questions:
1) Around line 188 of SDMDebugger.java, I see code that sets up the
--numnodes parameter to sdm with number of processes + 1. Right now, my
code isn't setting up the right job attributes to satisfy
JobAttributes.getNumberOfProcessesAttributeDefinition so I was getting a
null pointer exception. I temporarily hard coded --numnodes as 2 to get
around that.
Is there an assumption that there is only one application task per node
when debugging or is this really number of application tasks + 1 and
number of tasks per node doesn't matter?
For PE, the user can run as many tasks per node as he likes as long as
system resources are available. If the user specifies a hostlist, I could
probably figure out the number of nodes used by the application by looking
at the hostlist before starting the debugger. If the user is using
LoadLeveler to allocate nodes, then I have no idea how many nodes, or even
what nodes the application will run on since LoadLeveler doesn't get
control to handle node allocation until sometime after the job is
submitted.
You're right, this parameter should be --numprocs rather than --numnodes.
I've changed it now. I've also changed it so that you specify the number
of processes being debugged rather than +1 since I think this makes more
sense.
2) I think I have the argument list to sdm set up properly, where argv[0]
is the sdm executable name (sdm), the next elements of argv are whatever
are passed as debugArgs, then the pathname of the application executable
and finally a NULL (to satisfy execve)
When I try to invoke the debugger, I see the sdm process show up for a few
seconds then it exits. If I'm quick enough, I can attach to sdm with gdb
then let it run to completion. It looks like sdm is just running for a few
seconds then exits with an exit(1) call somewhere in main.
Is there any way I can turn on some debug output to see what is going
wrong with SDM?
--debug will enable debug output. --debug=level will enable selective
debug output. See config.h for the levels.
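
As a rough illustration of the argument list discussed above, here is a minimal Python sketch. The flag names --numprocs, --debug and --debug=level come from this thread; the `--numprocs=N` spelling, the helper names `sdm_debug_args` and `build_sdm_argv`, and the example application path are assumptions made for illustration, not taken from the SDM source.

```python
from typing import List, Optional


def sdm_debug_args(numprocs: int, debug: bool = False,
                   debug_level: Optional[str] = None) -> List[str]:
    """Build the debugArgs portion of the sdm command line (hypothetical helper)."""
    # Assumption: the flag is written as --numprocs=N; the thread only names the flag.
    args = [f"--numprocs={numprocs}"]
    if debug_level is not None:
        # Selective debug output; valid levels are defined in config.h per the thread.
        args.append(f"--debug={debug_level}")
    elif debug:
        args.append("--debug")
    return args


def build_sdm_argv(app_path: str, debug_args: List[str],
                   sdm_exe: str = "sdm") -> List[str]:
    """argv[0] is the sdm executable name, then debugArgs, then the application path."""
    # No explicit NULL terminator here: a C caller of execve must add one,
    # whereas Python's os.execv adds it on the caller's behalf.
    return [sdm_exe, *debug_args, app_path]


if __name__ == "__main__":
    # Hypothetical application path, for illustration only.
    print(build_sdm_argv("/path/to/app", sdm_debug_args(4, debug=True)))
```

Printing the list before handing it to exec is a cheap way to check the ordering when sdm exits early.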
Note that the new debugger must be started in two steps. The first step is
to run a master sdm on the head node. The second step is to start the
server sdm's on the nodes using mpirun (or poe). All the sdm's will wait
until they find a routing file formatted as:
numprocs
index address port
...
where numprocs is the number of processes being debugged, index is the
rank of the process, address is the host address (node name) where the
process is running, and port is a random port number (I'm not sure that
this is used).
It's likely that the routing file format will change in the future.
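
To make the routing file layout described above concrete, here is a minimal Python sketch that writes and reads it. It assumes whitespace-separated fields and one entry per line, which the thread implies but does not state; the `Route` type, the function names, and the node names and ports in the usage example are hypothetical. Since the format is expected to change, treat this as a sketch of the layout as described here only.

```python
from typing import List, NamedTuple


class Route(NamedTuple):
    index: int    # rank of the process being debugged
    address: str  # host address (node name) where the process is running
    port: int     # port number (the thread is unsure whether it is used)


def format_routing_file(routes: List[Route]) -> str:
    """Render routes as a 'numprocs' line followed by 'index address port' lines."""
    lines = [str(len(routes))]
    lines += [f"{r.index} {r.address} {r.port}" for r in routes]
    return "\n".join(lines) + "\n"


def parse_routing_file(text: str) -> List[Route]:
    """Parse the layout above and check the entry count against the first line."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("routing file is empty")
    numprocs = int(lines[0])
    routes = []
    for ln in lines[1:]:
        index, address, port = ln.split()
        routes.append(Route(int(index), address, int(port)))
    if len(routes) != numprocs:
        raise ValueError(f"expected {numprocs} entries, found {len(routes)}")
    return routes


if __name__ == "__main__":
    # Hypothetical hosts and ports; two tasks share node1 in this example.
    routes = [Route(0, "node1", 5000), Route(1, "node1", 5001), Route(2, "node2", 5002)]
    print(format_routing_file(routes), end="")
```

Because each entry carries its own address, this layout can list several tasks on the same node, so numprocs counts processes rather than nodes.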
3) I am intercepting stdout and stderr for the sdm process so that they
can either be sent to a console or redirected to a file. In either case,
I'm not seeing anything from SDM. If I try to redirect stdio for sdm, will
that cause problems?
I think the only thing on stdout/stderr should be debugging output.
Dave
_______________________________________________
ptp-dev mailing list
ptp-dev@xxxxxxxxxxx
https://dev.eclipse.org/mailman/listinfo/ptp-dev