Re: [ptp-dev] Questions about PTP SDM debugger

Dave,

I didn't write this code, but it sounds like 10 seconds is probably too short for the timeout, as you say. It would probably be better to have the master sdm wait forever, since it can be killed if the debug launch needs to be aborted.

I believe the race condition should be dealt with already. The first line of the file contains the number of entries, so the SDM will not consider the file complete until it contains this many routing entries. I think it just re-reads the file after some delay until the count is correct.

Greg

On Aug 25, 2008, at 10:46 PM, Dave Wootton wrote:


Greg
I got far enough with my experimentation that I can now get a top level SDM started and not exit. I may have had other problems, but once I turned on sdm debug I found that there's code in the sdm_tcpip_init function that loops for 10 seconds trying to find the routing file (which it looks like is named 'routing_file' in the current directory). If the file isn't found within 10 seconds, sdm issues a timeout message and exits. I changed the timeout to 1000 seconds and the sdm does not exit. So I think I have a starting point to continue working on this.

From what I understand of the flow you explained, I don't think 10 seconds will be long enough even once I get my proxy to generate the routing_file. As I understand it, I need to create the top SDM, then start the individual task SDMs by 'poe sdm ...', wait for poe to generate the attach.cfg file that gives me the mapping from application task rank to node and pid for each task, then create the routing_file using the attach.cfg file as input, and then the debugger will take off. I think this approach would work for the LoadLeveler case as well, since the attach.cfg file still gets generated. However, on a slow system, or for an application with a large number of tasks, it could take several minutes for this processing to complete.

With a large enough number of tasks, the creation of the routing_file may not complete before the individual SDMs detect it and try to process it. What happens then? Is there logic in the SDM to retry reading the routing_file until it gets a complete copy?

Dave


From: Greg Watson <g.watson@xxxxxxxxxxxx>
Sent by: ptp-dev-bounces@xxxxxxxxxxx
Date: 08/25/2008 04:18 PM
Please respond to: Parallel Tools Platform general developers <ptp-dev@xxxxxxxxxxx>
To: Parallel Tools Platform general developers <ptp-dev@xxxxxxxxxxx>
Subject: Re: [ptp-dev] Questions about PTP SDM debugger






On Aug 25, 2008, at 11:00 AM, Dave Wootton wrote:

> Greg
> Some additional questions
> 1) It looks like I don't pass the name of the application executable as a
> parameter on the top level SDM instance since the top level instance isn't
> directly invoking the SDM instances required for individual tasks.

No, this isn't necessary. The debugger protocol supplies the executable
name and the application arguments.

>
> 2) What are the invocation parameters of the individual SDM? I'm sort of
> guessing I need the hostname and port of the top SDM, the pathname of the
> application and any parameters the application requires. I'm guessing then
> the individual SDM starts, starts a debugger instance and the debugger
> instance starts the application instance.

The master sdm should be invoked as 'sdm --host=address --port=port
--debugger=gdb-mi --numprocs=n', where address is the address of the
machine running Eclipse and port is a port number assigned by PTP. The
servers will be started with something like 'mpirun sdm
--debugger=gdb-mi --numprocs=n'.


>
> 3) Is the routing file on a node a list of all tasks in the application or
> only the tasks running on that node?

A list of all tasks.

>
> 4) How does the routing file get loaded onto each individual node?

At the moment it is assumed there is a shared filesystem. This
requirement will be removed in a later version, and the sdms
themselves will be used to propagate the routing file.
>
> 5) How does each individual SDM know how to connect back to the top SDM if
> the top SDM host/port is not a parameter?

Connections propagate up the tree (starting from the master). Each sdm  
knows the index of its children (computed as a binomial tree) so it  
just attempts to connect to its children using the address/port  
obtained from the routing file.
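
As an illustration of how child indices can be computed in a binomial tree, here is a small Python sketch. It uses one common numbering convention (root at index 0, children obtained by setting a bit above the node's highest set bit); this convention and the function names are assumptions for illustration, and the SDM's actual layout may differ.

```python
def binomial_children(rank, size):
    """Children of `rank` in a binomial tree of `size` nodes rooted at 0."""
    children = []
    mask = 1
    # Move past every bit position up to the highest set bit of rank.
    while mask <= rank:
        mask <<= 1
    # Each higher bit that keeps the index in range names one child.
    while rank + mask < size:
        children.append(rank + mask)
        mask <<= 1
    return children


def binomial_parent(rank):
    """Parent of `rank` (clear its highest set bit); the root has none."""
    if rank == 0:
        return None
    return rank & ~(1 << (rank.bit_length() - 1))
```

Under this convention, with 8 nodes the root's children are 1, 2 and 4, node 1's children are 3 and 5, and each sdm would look up the address/port of its children's indices in the routing file.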

>
> 6) If the individual SDM is passed the host/port that it connects to the
> top SDM, how do I find out what that top level SDM port is?

There is no easy way to do this at the moment, since it is generated  
internally and passed to the submitJob command as an argument. The  
easiest way would be to print out the arguments to the submitJob  
command either in the Java side of the RM or in your proxy.

>
>
> I think I understand how this is supposed to work, and it seems reasonable
> for the case where the user specifies a host list file. In the case where
> we use LoadLeveler to allocate nodes, I'm not sure how this will work
> since we have no way of knowing what nodes are allocated until the poe job
> (the SDMs) starts.

The SDMs do nothing until they get the routing file. Would it be  
possible to launch the SDMs, get the node information from LL, then  
create the routing file? This is how the new OMPI RM works.

Greg
_______________________________________________
ptp-dev mailing list
ptp-dev@xxxxxxxxxxx
https://dev.eclipse.org/mailman/listinfo/ptp-dev
