Re: [ptp-dev] Questions about PTP SDM debugger

Greg and Dave,

I think Greg's suggestion for launching the SDM is reasonable. But are we considering race conditions? I am afraid this approach might show several failure patterns depending on how long each SDM takes to start.

For example: the servers and the master are started at nearly the same time. All servers bind to a port as you described. The master receives the routing file and starts connecting to its children, which in turn connect to the grandchildren, and so on. What happens if a child is slow to start for some reason? Its parent will try to connect (but the child will not be listening yet), then move on to the next ports, and will never retry the port the child actually ends up listening on. I have seen this happen, and it is the reason the launcher currently starts the master after the servers instead of the reverse order described in the specification. I think other race conditions are possible as well.
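One way a parent could tolerate a slow child is to sweep the candidate port range repeatedly until a deadline, rather than walking it once. The following is only a minimal sketch of that idea in Python, not the SDM code; the function name, the `span` of candidate ports, and the timing parameters are all assumptions made for illustration.

```python
import socket
import time


def connect_with_sweeps(host, base_port, span, deadline_s, pause_s=0.05):
    """Sweep base_port .. base_port+span-1 repeatedly until a connect succeeds.

    Hypothetical helper: names and defaults are illustrative, not from SDM.
    """
    deadline = time.monotonic() + deadline_s
    while time.monotonic() < deadline:
        for port in range(base_port, base_port + span):
            try:
                sock = socket.create_connection((host, port), timeout=1.0)
                return sock, port
            except OSError:
                continue  # nobody listening on this port (yet); try the next
        # The whole range failed; the child may simply not be up yet.
        time.sleep(pause_s)
    raise TimeoutError(f"no listener found on {host} within {deadline_s}s")
```

A sweep like this still cannot tell which listener it reached, so it only addresses the startup timing problem, not the question of connecting to the right server.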

There is another issue with the strategy for launching the SDM master. After starting the SDM master, the launcher starts listening on a socket that the SDM master is expected to connect to. The port number is passed as a parameter to the SDM master. However, the SDM master may start before the launcher has created the socket. The SDM master will then try to connect and, on failure, try the next ports. That does not make sense in this situation, since the port number passed to the SDM master is guaranteed to be the port where the launcher will listen. Therefore, the SDM master should not move on to the next ports but should retry the same port.
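The retry-on-the-same-port behaviour could look roughly like the sketch below. It is written in Python for brevity and is an assumption about one possible shape, not the launcher's or SDM's actual code; the attempt count, the delay, and the injectable `connect` argument (there only to make the function easy to exercise) are hypothetical.

```python
import socket
import time


def connect_same_port(host, port, attempts=20, delay_s=0.1,
                      connect=socket.create_connection):
    """Retry one known port instead of walking to port+1, port+2, ...

    Hypothetical helper: the launcher is assumed to listen on exactly `port`,
    so a refused connection only means it is not listening yet.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return connect((host, port))
        except OSError as exc:
            last_error = exc
            time.sleep(delay_s)  # give the launcher time to create its socket
    raise ConnectionError(
        f"no listener on {host}:{port} after {attempts} attempts"
    ) from last_error
```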

Another concern: does the handshake take the job ID into account? There could be a scenario where two users start debug sessions on the same machine at the same time. One of them might then connect to the other's SDM server by accident if they are listening on the same port range.
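To illustrate what a job-ID check in the handshake might involve, here is a minimal Python sketch. The `SDM-HELLO` message format is invented for this example and is not the SDM wire protocol; it only guards against accidental cross-connection between sessions, not against a malicious peer.

```python
HELLO_TAG = "SDM-HELLO"  # hypothetical tag, not part of the real protocol


def build_hello(job_id: str) -> bytes:
    """Client side: announce which job this connection belongs to."""
    return f"{HELLO_TAG} {job_id}\n".encode("ascii")


def accept_hello(message: bytes, expected_job_id: str) -> bool:
    """Server side: accept only peers that announce our own job ID."""
    try:
        tag, job_id = message.decode("ascii").strip().split(" ", 1)
    except (UnicodeDecodeError, ValueError):
        return False  # malformed greeting
    return tag == HELLO_TAG and job_id == expected_job_id
```

A server that receives a greeting for a different job would close the connection and keep listening, so the peer can continue searching the port range.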

I agree that using a base port number is better than using a random number for each process. I think it is enough for the base port number to be pseudo-random. I would avoid a fixed port number because it could cause port collisions between two simultaneous debugging sessions. I understand that the SDM servers will know how to handle such a collision, but their startup will take longer. By choosing the base port randomly, we reduce the probability of collisions.
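Choosing a pseudo-random base port could be sketched as follows. The default range 49152 to 65535 is the IANA dynamic/private port range; using it here is an assumption of this example rather than something the discussion settled on, and the function and parameter names are hypothetical.

```python
import random


def pick_base_port(span, low=49152, high=65535, rng=None):
    """Pick a base port so that base .. base+span-1 stays within [low, high].

    Hypothetical helper: `span` is how many consecutive ports a session may
    end up probing above the base port.
    """
    if rng is None:
        rng = random.SystemRandom()
    if span < 1 or low + span - 1 > high:
        raise ValueError("span does not fit in the port range")
    return rng.randint(low, high - span + 1)
```

With independent random bases, two simultaneous sessions only interfere when their ranges happen to overlap, whereas with a fixed base they always start from the same port.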

My comments about who should write the hostfile: I see Dave's concerns. I really did not consider that the amount of data to be transmitted would become that large. Couldn't we establish a standard file format to be used for all debuggers? Then the file could be written by the proxy, regardless of which debugger is being used. I don't have a really good idea for this issue yet.

Best regards,
Daniel Felix Ferber


Greg Watson wrote:
Good, I'm glad we're in agreement :-). Daniel, do you have any comments on this?

Regarding the port numbers, this is not how I had intended the debugger startup to work, so I want to change this at some point. My approach is as follows, but any other suggestions would be welcome.

1. The SDM servers are given a "base" port number. At startup, they attempt to bind to this port. If that fails, they try to bind to base_port+1 after waiting a short random period (this is to avoid servers started on the same node from chasing each other up the port numbers). An alternative to this would be to bind to ((base_port +rank)%65536)+1024. A third alternative would be to use a pseudo random number generator seeded by the rank.

2. When the SDM master receives the routing file, it can then determine the location of its children, so it attempts to connect to each in turn using the same port generation mechanism as in #1.

3. Once the connection is established to the server, a handshake is used to swap credentials, etc., then the routing file is sent. The routing file could be successively pruned as it propagates up the tree to reduce bandwidth.

4. Once the server receives the routing file, it does the same as #2.

5. This continues until all connections have been established, or there was a timeout or some other error.
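Step 1 above can be sketched as follows, in Python for brevity. This is an illustration under assumptions, not the SDM implementation: the retry limit, the maximum wait, and the function names are hypothetical. The second helper is a variant of the rank-based alternative, adjusted so that the result always stays within 1024..65535 (the formula as written in step 1 can yield values above 65535).

```python
import random
import socket
import time


def bind_from_base(base_port, max_tries=16, max_wait_s=0.05, host="127.0.0.1"):
    """Bind to base_port, or to the next free port after a short random wait."""
    for port in range(base_port, min(base_port + max_tries, 65536)):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            sock.listen()
            return sock, port
        except OSError:
            sock.close()
            # Random pause so servers on one node do not chase each other
            # up the port numbers in lockstep.
            time.sleep(random.uniform(0, max_wait_s))
    raise OSError(f"no free port found starting at {base_port}")


def rank_port(base_port, rank):
    """Rank-derived port, kept inside the non-privileged range 1024..65535."""
    return 1024 + (base_port + rank) % (65536 - 1024)
```

A connecting parent would have to walk the same sequence of ports as the server it is looking for, which is why steps 2 and 4 reuse the mechanism from step 1.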

Greg

On Aug 28, 2008, at 8:46 AM, Dave Wootton wrote:

Greg
I think the proxy should be responsible for building the routing file, in order to keep the traffic on the connection between the GUI and the proxy down. With the current approach, you are sending node information across the connection twice: once to populate the PTP runtime model, then a second time to create the routing file on the nodes where the SDMs are running. I'm not sure what the message length for the messages from the proxy to the GUI is, but for the remote_file you have strlen(task_index) + strlen(hostname) + strlen(port_number) + 3 bytes per node. In my case that's close to 20 bytes per task, minimum. With large numbers of tasks, this could be a lot of data, and since all of these interactions between the GUI, the proxy, and the SDMs are a serial process, they slow down debugger startup.
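The per-node size formula above is easy to evaluate for a given job. The sketch below is only an estimate under assumptions: the hostnames and port numbers are made up, ASCII names are assumed so that string length equals byte count, and the 3 extra bytes are taken straight from the formula (presumably separators and a line terminator, which the message does not spell out).

```python
def routing_entry_bytes(task_index, hostname, port):
    """strlen(task_index) + strlen(hostname) + strlen(port_number) + 3."""
    return len(str(task_index)) + len(hostname) + len(str(port)) + 3


def routing_file_bytes(entries):
    """Total size for an iterable of (task_index, hostname, port) tuples."""
    return sum(routing_entry_bytes(*entry) for entry in entries)


# Hypothetical job: 10,000 tasks on hosts named node0000 .. node9999.
entries = [(i, f"node{i:04d}", 50000 + i % 16) for i in range(10_000)]
total = routing_file_bytes(entries)
```

For a single entry such as task 1000 on `node0123` with port 50000, the formula gives 4 + 8 + 5 + 3 = 20 bytes, in line with the "close to 20 bytes per task" figure.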

The downside to this is the need for each proxy to implement support for each unique debugger startup sequence it is willing to support, so you could end up with some proxies not supporting a debugger. If you implement all of the code on the GUI resource manager side, though, I'm not sure you don't have the same problem, where the RM needs to be aware of the details of both the debugger startup sequence and a particular runtime environment/proxy.

The other question I have after seeing the contents of the routing file you generate is the generation of random port numbers. If you end up actually using these port numbers, do you run the risk of accidentally using a port number reserved for some other application, unless you block out a range of port numbers and only use that range? Even if port numbers are up for grabs with no expectation of reserved port numbers, what happens if something else is using your port number?
Dave



